# The importance of data preparation

### Summary

This segment emphasizes the critical role of data preprocessing in NLP, highlighting that the quality of input data directly impacts the accuracy of machine learning outcomes. It outlines the key steps involved in cleaning and formatting text data for effective analysis.

### Highlights

- 🧹 Data quality is paramount for accurate NLP results.
- 🗑️ "Garbage in, garbage out" principle applies to NLP data.
- 📝 Preprocessing involves cleaning, noise removal, and formatting.
- 📂 General cleaning organizes and tidies text data.
- 🔇 Noise removal eliminates irrelevant data, reducing memory usage.
- 🛠️ Formatting prepares data for specific machine learning algorithms.
- 📈 Preprocessing transforms raw text into a clean, analyzable format.

### Code Examples

```python
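# The helpers called below (remove_special_chars, correct_spelling, and so on)
# are placeholder names for illustration; they are not defined here or in a library.
# Example: General Cleaning (Conceptual)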
def clean_text(text):
    # Remove unwanted characters, correct formatting
    cleaned_text = remove_special_chars(text)
    cleaned_text = correct_spelling(cleaned_text)
    return cleaned_text

# Example: Noise Removal (Conceptual)
def remove_noise(text):
    # Remove stop words, punctuation, etc.
    no_noise_text = remove_stopwords(text)
    no_noise_text = remove_punctuation(no_noise_text)
    return no_noise_text

# Example: Formatting (Conceptual)
def format_data(text):
    # Convert text to a suitable format for the model
    formatted_text = tokenize_text(text)
    formatted_text = vectorize_text(formatted_text)
    return formatted_text
```
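
The conceptual stages above can be made concrete with the standard library alone. The following is a minimal sketch, not a production pipeline: the `STOP_WORDS` set and the sample string are invented for illustration, and spelling correction and vectorization are left out because they need extra libraries.

```python
import re

# Hypothetical, tiny stop-word list used only for this sketch.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to"}

def clean_text(text):
    """General cleaning: drop HTML-like tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def remove_noise(text):
    """Noise removal: strip punctuation, then filter stop words."""
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def format_data(text):
    """Formatting: lowercase and split into a list of tokens."""
    return text.lower().split()

raw = "<p>The  service was GREAT, and the food is good!</p>"
tokens = format_data(remove_noise(clean_text(raw)))
print(tokens)  # ['service', 'great', 'food', 'good']
```

Each stage takes and returns plain text until the final step, so the stages can be reordered or swapped for library-backed versions later.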

# Lowercase

### Summary

This segment explains the importance of converting text data to lowercase in NLP for consistency and uniformity. It demonstrates how to use Python's `str.lower()` method for this purpose, highlighting its benefits and potential drawbacks.

### Highlights

- 🔡 Lowercasing ensures consistent word recognition in NLP.
- ⚖️ It prevents models from treating capitalized and lowercase words differently.
- 🧹 Lowercasing simplifies further data cleaning processes.
- ⚠️ It can alter the meaning of certain words or abbreviations (e.g., "US" becomes indistinguishable from "us").
- 🐍 Python's `str.lower()` method efficiently converts strings to lowercase.
- 📝 Lowercasing can be applied to individual strings or lists of strings.
- 🚀 It streamlines text data preparation for analysis and modeling.

### Code Examples

```python
# Example: Lowercasing a single sentence
sentence = "Her Cat's name is Luna."
lowercase_sentence = sentence.lower()
print(lowercase_sentence)

# Example: Lowercasing a list of sentences
sentence_list = ["The Dog is friendly.", "Cats are playful.", "BIRDS can fly."]
lowercase_sentence_list = [x.lower() for x in sentence_list]
print(lowercase_sentence_list)
```
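
The drawback mentioned in the highlights can be seen in a short sketch. The sample sentence is made up for illustration; the point is only that lowercasing merges forms that differ in meaning as well as forms that differ only in capitalization.

```python
# Lowercasing merges surface forms, which is usually the goal...
words = ["Cat", "cat", "CAT"]
unique_forms = {w.lower() for w in words}
print(unique_forms)  # {'cat'}

# ...but it also merges forms that mean different things.
sentence = "The US team asked us to wait."
lowered = sentence.lower().split()
print(lowered.count("us"))  # 2: the country and the pronoun now look identical
```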

# Removing stop words

### Summary

This segment demonstrates how to remove stopwords from text using the NLTK library in Python. Stopwords are common words that don't contribute much to the meaning of a text and their removal can simplify data, improve machine learning accuracy, and speed up processing.

### Highlights

- 🛑 Stopwords are common, low-meaning words (e.g., "the," "and," "a").
- 🧹 Removing stopwords simplifies data and can improve model performance, although words such as "not" may need to be kept when negation matters.
- 📦 The NLTK library is used for stopword removal.
- 📥 NLTK's `stopwords` corpus provides a list of common stopwords.
- 📝 Stopwords can be customized by adding or removing words.
- 🚀 Removing stopwords results in a smaller, cleaner dataset.
- 🐍 Python code efficiently filters out stopwords from text.

### Code Examples

```python
import nltk
from nltk.corpus import stopwords

# Download stopwords (if not already downloaded)
nltk.download('stopwords')

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Example sentence
sentence = "it was too far to go to the shop and he did not want her to walk."

# Remove stopwords
sentence_no_stopwords = " ".join([word for word in sentence.split() if word not in stop_words])
print(sentence_no_stopwords)

# Customize stopwords
stop_words.remove("did")
stop_words.remove("not")
stop_words.add("go")  # stop_words is a set, so use add() rather than append()

# Remove custom stopwords
sentence_no_stopwords_custom = " ".join([word for word in sentence.split() if word not in stop_words])
print(sentence_no_stopwords_custom)
```
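
Two details of the filter above are easy to miss: the membership test is case-sensitive, and removing negations can flip the meaning of a sentence. The sketch below illustrates both with a small hypothetical stop-word set (not NLTK's list) and an invented review, so it runs without any downloads.

```python
# Hypothetical stop-word subset; NLTK's English list is much longer.
stop_words = {"the", "was", "not", "it", "and"}

def remove_stopwords(text, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

review = "The food was not good"

# Comparing on the lowercase form removes "The" as well as "the".
default_result = remove_stopwords(review, stop_words)
print(default_result)  # food good

# Keeping the negation preserves the sentiment of the review.
custom_result = remove_stopwords(review, stop_words - {"not"})
print(custom_result)  # food not good
```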

# Regular expressions

### Summary

This video tutorial introduces regular expressions (regex) in Python, explaining their syntax and usage for pattern matching within strings. It covers essential functions like `re.search` and `re.sub`, demonstrating how to find and replace text, filter reviews based on specific criteria, and remove punctuation.

### Highlights

- 🐍 Importing the `re` package is the first step to using regular expressions in Python.
- 📝 Raw strings, denoted by an `r` prefix, are crucial for treating backslashes literally, avoiding unintended escape sequences.
- 🔍 `re.search` helps identify if a pattern exists within a string, returning a match object or `None`.
- 🔄 `re.sub` enables replacing specific patterns with new text, useful for correcting errors or standardizing text.
- ❓ The question mark `?` makes the preceding character optional in a pattern.
- 🚀 The caret `^` symbol matches the start of a string, and the dollar sign `$` matches the end.
- 🔗 The pipe `|` operator allows matching multiple patterns, like "needed" or "wanted".
- 🧹 Regex is powerful for removing punctuation, using `[^\w\s]` to target non-word and non-whitespace characters.

### Code Examples

- Importing the `re` package:
    
    ```python
    import re
    
    ```
    
- Using raw strings:
    
    ```python
    file_path = r"c:\desktop\notes"
    print(file_path)
    
    ```
    
- Using `re.search`:
    
    ```python
    pattern = "Sarah?"
    string = "Sarah was able to help."
    result = re.search(pattern, string)
    print(result)
    
    ```
    
- Using `re.sub`:
    
    ```python
    string = "Sarah was able to help."
    new_string = re.sub("Sarah", "Sara", string)
    print(new_string)
    
    ```
    
- Removing punctuation:
    
    ```python
    pattern = r"[^\w\s]"
    string = "Hello, world!"
    no_punct = re.sub(pattern, "", string)
    print(no_punct)
    
    ```
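
The anchors `^` and `$` and the pipe `|` from the highlights are not covered by the examples above. The following sketch shows one way they might be used to filter reviews; the review texts are invented for illustration.

```python
import re

# Hypothetical review texts, invented for this sketch.
reviews = [
    "I needed a new phone and this one is great.",
    "Wanted something cheaper.",
    "The battery is what I wanted",
    "Arrived late.",
]

# "|" matches either alternative anywhere in the string (case-sensitive).
mentions = [r for r in reviews if re.search(r"needed|wanted", r)]

# re.IGNORECASE also catches the capitalized "Wanted".
mentions_any_case = [
    r for r in reviews if re.search(r"needed|wanted", r, flags=re.IGNORECASE)
]

# "^" anchors the match to the start of the string.
starts_with_i = [r for r in reviews if re.search(r"^I", r)]

# "$" anchors the match to the end of the string.
ends_with_wanted = [r for r in reviews if re.search(r"wanted$", r)]

print(len(mentions), len(mentions_any_case))  # 2 3
print(starts_with_i)
print(ends_with_wanted)
```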
    

# Tokenization

### Summary

This video explains tokenization in Natural Language Processing (NLP), focusing on word and sentence tokenization using the NLTK library. Tokenization breaks text into smaller units (tokens) for better analysis, and it's a crucial step before further processing like vectorization.

### Highlights

- 🧩 Tokenization is the process of breaking text into smaller units called tokens.
- 📝 Word tokenization splits text into individual words, while sentence tokenization splits it into sentences.
- 📚 NLTK (Natural Language Toolkit) is a Python library used for NLP tasks, including tokenization.
- 📥 NLTK requires downloading additional resources for tokenization.
- ✂️ NLTK's `sent_tokenize` function splits a text into sentences.
- 🔡 NLTK's `word_tokenize` function splits a sentence into words.
- 📉 Case sensitivity can affect token analysis, highlighting the importance of lowercasing text for consistency.

### Code Examples

- Importing NLTK and necessary functions:
    
    ```python
    import nltk
    nltk.download('punkt') # Download required resources (newer NLTK releases may ask for 'punkt_tab' instead)
    from nltk.tokenize import word_tokenize, sent_tokenize
    
    ```
    
- Sentence tokenization:
    
    ```python
    text = "Her cat's name is Luna. Her dog's name is Max."
    sentences = sent_tokenize(text)
    print(sentences)
    
    ```
    
- Word tokenization:
    
    ```python
    sentence = "Her cat's name is Luna."
    words = word_tokenize(sentence)
    print(words)
    
    ```
    
- Word tokenization of a longer sentence:
    
    ```python
    sentence = "Her cat's name is Luna. Her dog's name is Max."
    words = word_tokenize(sentence)
    print(words)
    
    ```
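
The point about case sensitivity can be illustrated by counting tokens before and after lowercasing. This sketch uses a toy regex tokenizer as a stand-in so that it runs without NLTK resources; it is not NLTK's algorithm and splits contractions differently from `word_tokenize`. The sample sentence is adapted for illustration.

```python
import re
from collections import Counter

def simple_word_tokenize(text):
    """Toy tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Her cat's name is Luna and her dog's name is Max."

# With the original casing, "Her" and "her" are counted as different tokens.
counts = Counter(simple_word_tokenize(text))
print(counts["Her"], counts["her"])  # 1 1

# After lowercasing, both occurrences fall under a single token.
lowered_counts = Counter(simple_word_tokenize(text.lower()))
print(lowered_counts["her"])  # 2
```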
    

# Stemming

### Summary

This video explains stemming, a text standardization technique in NLP that reduces words to their base form. It uses the Porter stemmer from the NLTK library to demonstrate how words like "connecting" and "learned" are transformed. While stemming simplifies text and reduces data complexity, it can sometimes produce non-meaningful or improper words.

### Highlights

- 🛠️ Stemming standardizes text by reducing words to their root form.
- ✂️ It removes word suffixes, but can result in non-standard words.
- 📉 Stemming reduces the number of unique words, simplifying data for machine learning.
- 📚 NLTK's Porter stemmer is a common tool for this process.
- 🔄 Words like "connecting" become "connect," and "learning" becomes "learn."
- ⚠️ Some words, like "worse," might be stemmed to non-words ("wors").

### Code Examples

- Importing the Porter stemmer:
    
    ```python
    from nltk.stem import PorterStemmer
    
    ```
    
- Initializing the stemmer:
    
    ```python
    ps = PorterStemmer()
    
    ```
    
- Stemming words:
    
    ```python
    tokens = ["connecting", "connected", "connectivity", "connect", "connects"]
    for token in tokens:
        print(ps.stem(token))
    
    ```
    
- Stemming different words:
    
    ```python
    tokens = ["learned", "learning", "learn", "learns", "learner", "learners"]
    for token in tokens:
        print(ps.stem(token))
    
    ```
    
- Example with "likes," "better," and "worse":
    
    ```python
    tokens = ["likes", "better", "worse"]
    for token in tokens:
        print(ps.stem(token))
    
    ```
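
To show why suffix removal both shrinks the vocabulary and sometimes yields non-words, here is a deliberately crude suffix stripper. It is a toy stand-in, not the Porter algorithm (which applies many more rules and conditions); the suffix list and the three-character minimum are assumptions chosen for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a crude stand-in, not the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["connecting", "connected", "connects", "connect", "running"]
stems = [toy_stem(t) for t in tokens]
print(stems)  # ['connect', 'connect', 'connect', 'connect', 'runn']

# Five distinct tokens collapse into two stems, one of which is not a real word.
print(len(set(tokens)), "->", len(set(stems)))  # 5 -> 2
```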
    

# Lemmatization

### Summary

This video contrasts stemming with lemmatization, focusing on how lemmatization reduces words to their base form while maintaining meaning. It uses WordNet Lemmatizer from the NLTK library to demonstrate how lemmatization preserves word meaning better than stemming, though it results in a larger dataset.

### Highlights

- 🧠 Lemmatization reduces words to their base form by looking them up in a dictionary (here WordNet), optionally taking the part of speech into account.
- 📜 It aims to produce meaningful base words, unlike stemming which can create non-words.
- 📈 Lemmatization usually leaves more distinct word forms than stemming, so the resulting dataset is larger.
- 📚 NLTK's WordNet Lemmatizer is used for this process.
- 🔄 Words are reduced to meaningful base forms, e.g., "learners" to "learner."
- 💬 Lemmatization retains word meaning, e.g., "worse" remains "worse," unlike stemming which produced "wors."

### Code Examples

- Downloading WordNet and importing the lemmatizer:
    
    ```python
    import nltk
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    
    ```
    
- Initializing the lemmatizer:
    
    ```python
    lemmatizer = WordNetLemmatizer()
    
    ```
    
- Lemmatizing "connect" tokens:
    
    ```python
    tokens = ["connecting", "connected", "connectivity", "connect", "connects"]
    for token in tokens:
        print(lemmatizer.lemmatize(token))
    
    ```
    
- Lemmatizing "learn" tokens:
    
    ```python
    tokens = ["learned", "learning", "learn", "learns", "learner", "learners"]
    for token in tokens:
        print(lemmatizer.lemmatize(token))
    
    ```
    
- Lemmatizing "likes" tokens:
    
    ```python
    tokens = ["likes", "better", "worse"]
    for token in tokens:
        print(lemmatizer.lemmatize(token))
    
    ```
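
The dictionary-lookup idea behind lemmatization can be sketched without WordNet. The mini-dictionary below is hypothetical and is not WordNet data. NLTK's `lemmatize` method takes an optional `pos` argument that defaults to noun, and the toy function mirrors that: a word looked up under a part of speech for which it has no entry comes back unchanged.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech); not WordNet data.
LEMMAS = {
    ("learners", "n"): "learner",
    ("likes", "v"): "like",
    ("better", "a"): "good",
    ("worse", "a"): "bad",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("learners"))        # learner: found as a noun
print(toy_lemmatize("worse"))           # worse: no noun entry, so unchanged
print(toy_lemmatize("worse", pos="a"))  # bad: found as an adjective
```

Because every result is either a dictionary entry or the original word, this approach cannot produce non-words the way suffix stripping can.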
    

# N-grams

### Summary

This video explains n-grams, which are sequences of n neighboring words or tokens used to analyze text data. It demonstrates how to compute and visualize unigrams, bigrams, and trigrams using Python libraries like NLTK, pandas, and matplotlib, highlighting their utility in preprocessing analysis and feature creation for machine learning.

### Highlights

- 🔢 N-grams are sequences of n adjacent words or tokens in a text.
- 📊 Unigrams (n=1), bigrams (n=2), and trigrams (n=3) are common types of n-grams.
- 🐍 Libraries like NLTK, pandas, and matplotlib are used for n-gram analysis and visualization.
- 📈 Visualizing n-grams, especially unigrams, helps identify frequent words or phrases.
- 📝 N-gram analysis can reveal interesting patterns and insights in text data.
- 📉 Preprocessing, like stop word removal, can significantly alter n-gram results.
- 🖼️ Matplotlib allows for creating charts, like horizontal bar plots, to represent n-gram frequencies.

### Code Examples

- Importing necessary libraries:
    
    ```python
    import nltk
    import pandas as pd
    import matplotlib.pyplot as plt
    
    ```
    
- Creating tokens:
    
    ```python
    tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "the", "dog", "jumps"]
    
    ```
    
- Computing and displaying unigrams:
    
    ```python
    unigrams = pd.Series(nltk.ngrams(tokens, 1)).value_counts()
    print(unigrams)
    
    ```
    
- Visualizing top 10 unigrams:
    
    ```python
    unigrams[:10].sort_values().plot.barh(color='lightsalmon', width=0.8, figsize=(12, 6), title='Ten Most Frequently Occurring Unigrams')
    plt.show()
    
    ```
    
- Computing and displaying bigrams:
    
    ```python
    bigrams = pd.Series(nltk.ngrams(tokens, 2)).value_counts()
    print(bigrams)
    
    ```
    
- Computing and displaying trigrams:
    
    ```python
    trigrams = pd.Series(nltk.ngrams(tokens, 3)).value_counts()
    print(trigrams)
    
    ```
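
For intuition about what `nltk.ngrams` produces, the same sliding window can be sketched in plain Python. The `ngrams` helper below is written for this illustration and only approximates the library function, which offers more options such as padding; it uses the same token list as above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["the", "quick", "brown", "fox", "jumps", "over",
          "the", "lazy", "dog", "the", "dog", "jumps"]

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(unigram_counts[("the",)])  # 3
print(ngrams(tokens, 2)[:2])     # [('the', 'quick'), ('quick', 'brown')]

# A list of k tokens yields k - n + 1 n-grams.
print(len(ngrams(tokens, 2)), len(ngrams(tokens, 3)))  # 11 10
```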