# Preparing your NLP Data
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Summary

### Keypoints

- The Natural Language Processing (NLP) pipeline consists of several stages: Data Collection, Text Cleaning, Preprocessing, Feature Extraction, Modeling, Evaluation, Deployment, and Maintenance and Monitoring.

- Text cleaning is crucial for improving model performance, efficiency, and accuracy by removing noise, inconsistencies, and irrelevant details from the text data.

- Preprocessing includes tasks like tokenization (splitting text into words or phrases), stemming/lemmatization (reducing words to their base or root form), removing stop words (common words that don't carry much meaning), and case folding (converting text to lowercase).

- Feature extraction transforms preprocessed text data into numerical formats using techniques like Bag-of-Words (BoW), which represents text by the frequency of words, and TF-IDF (Term Frequency-Inverse Document Frequency), which adjusts the frequency of words by their rarity across documents.

- The choice of preprocessing techniques can significantly impact the performance of machine learning models, especially in the context of deep learning. While traditional techniques like stemming and lemmatization are useful, they may not always improve performance in deep learning models.

- Managing vocabulary size is important to control model complexity and generalization. Strategies include limiting the vocabulary to the most frequent words and using subword tokenization to handle out-of-vocabulary words.

### Takeaways

- Understanding each step in the NLP pipeline is crucial for effective text data processing and analysis.

- Text cleaning is essential for removing dirty data and improving model performance. It involves handling noise, inconsistencies, and irrelevant information in the text.

- Preprocessing techniques like tokenization, stemming/lemmatization, stop word removal, and case folding prepare the text for feature extraction and modeling. They help in reducing dimensionality and focusing on meaningful information.

- Feature extraction methods like BoW and TF-IDF convert preprocessed text into numerical representations that machine learning models can understand and learn from.

- While traditional NLP techniques such as stemming and lemmatization are useful, they may not always improve performance, especially in deep learning models. It's important to experiment and evaluate their impact on specific tasks and datasets.

- Managing vocabulary size is crucial for controlling model complexity and generalization. Techniques like limiting the vocabulary to frequent words and using subword tokenization help in handling large and evolving vocabularies.

- The distributional hypothesis, which states that words appearing in similar contexts tend to have similar meanings, is a fundamental concept in NLP that enables techniques like word embeddings to capture semantic relationships between words.

- Experimenting with different preprocessing techniques, feature extraction methods, and vocabulary management strategies is essential to optimize the performance of NLP models for specific tasks and datasets.

# Understanding The Natural Language Processing (NLP) Pipeline

The NLP pipeline is a series of structured steps that help transform raw text into a format that machines can understand and use to make decisions or predictions. Here we explore each step for a complete understanding.

## 1. Data Collection

Data collection is the first and crucial step in the pipeline, where we gather raw text data from various sources. The quality and quantity of this data can significantly impact the effectiveness of your NLP model.

## 2. Text Cleaning

In this stage, we clean the collected data by removing noise such as HTML tags, emojis, punctuation marks, etc., which do not contribute to understanding the actual content. This cleaned-up data will improve the model's performance and save computational resources.

## 3. Preprocessing

Preprocessing involves transformation to ready the data for feature extraction, including tasks like tokenization (splitting text into words or phrases), stemming/lemmatization (reducing words to their base/root form), and removing stop words (common words like 'is', 'an', 'the' that don't carry much meaning).

## 4. Feature Extraction

This stage involves converting preprocessed data into a format that can be understood by machine learning algorithms. Techniques like Bag-of-Words or TF-IDF (Term Frequency-Inverse Document Frequency) are employed here to create numerical representations of the text.

`We'll cover up to this point in this notebook. The remaining steps will be covered in the next notebook.`

---

## 5. Modeling

Once the features are extracted and in the proper format, we use them to build and train our NLP model. Depending on the end goal, different models can be used, such as Naive Bayes for classification or LSTM (Long Short-Term Memory) for sequence prediction.

## 6. Evaluation

After the model has been trained, it must be evaluated to ascertain its performance. Metrics like precision, recall, accuracy, and F1-score are typically considered. Also, the model might be tested with new data to validate its performance.

## 7. Deployment

Once satisfied with the model's performance, the next step is to deploy it for practical use. This can range from integrating within an existing system or application, to deploying on a server for production use.

## 8. Maintenance and Monitoring

After deployment, continuous monitoring is essential to ensure the model's performance doesn't degrade over time as data patterns change. Periodic retraining and tuning may be necessary to keep the model up to date.

# Step 1 - Skipping data collection: our dataset
For the purpose of this class, we won't be going through the data collection process. Instead, a carefully curated dataset is provided, tailored specifically for our needs. Here are some key details about it:

## About The Dataset

- Custom tailored for this class from the [n2c2 NLP Research Data Sets](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/)
- Around 1200 clinical notes from 1990-2006.
- Subset from [Recognizing Obesity and Comorbidities in Sparse Data](https://academic.oup.com/jamia/article/16/4/561/766997)
- I've used OpenAI API to translate it to Portuguese.
- Data was anonymized by the original authors.
- I've kept all artifacts, problems and errors from the original data set.
> The drawn records were de-identified semi-automatically. An automatic pass, followed by two parallel manual passes were made over each record. These were followed by a third manual pass that resolved the disagreements between the two parallel manual passes. In order to make the data HIPAA compliant, patient names, health proxy and patient family member names, doctor names, hospital names, ID numbers, phone and pager numbers, dates, locations, ages, mentions of companies related to patient's occupations, some nationalities, and other potential identifiers were replaced with surrogates. The surrogate replacement process inserted random names from the US Census Bureau Database for each of the patient, health proxy, family member, and doctor names in the data. We made no effort to keep co-reference, though any name could have been drawn from the US Census Bureau Database more than once. For hospital names, ID numbers, phone and pager numbers, dates, locations, and ages, we generated surrogates randomly.

- The data is in the `data` folder.
- Despite the fact that the data is anonymized, I ask you to not share it outside this class.
- If you want access to more data in Portuguese, I suggest the following datasets (they have strict requirements about using and sharing the data, so please read the instructions carefully):
  - [SemClinBR - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks](https://rdcu.be/cNgqV)
  - [BRATECA - Brazilian Tertiary Care Dataset: a Clinical Information Dataset for the Portuguese Language](https://physionet.org/content/brateca/1.1/)


# Step 2 - Introduction to NLP and data cleaning: why should you clean your text?

You cannot go straight from raw text to fitting a machine learning or deep learning model. Dirty or unorganized data can severely hinder your NLP projects, leading to inaccuracies and lower quality results. Hence, the importance of preprocessing and cleaning steps in NLP cannot be overstated.

## Importance of Cleaning

1. **Improving Model Performance**: Clean data improves algorithm performance by reducing noise and irrelevant details in the dataset.
2. **Efficiency**: It saves computing resources as the algorithms won't have to process irrelevant features or instances.
3. **Accuracy**: Removing errors or discrepancies in your data reduces bias and improves the validity of your model's predictions.

## Common Problems in Text Data

### 1. Noise
Noise includes any unwanted components or errors within the text, such as HTML tags, punctuation marks, digits, etc., which do not contribute to the actual meaning of content.

### 2. Inconsistencies
Text data is often full of inconsistencies like spelling mistakes, acronyms, abbreviations, slang, and different languages.

### 3. High Dimensionality
Text data can become high-dimensional if not preprocessed correctly, especially when one-hot encoding techniques are applied. This can increase the computational complexity.

## Potential Consequences of Dirty Text Data

- **Decreased Performance**: Noise and irrelevant data can dilute important trends and patterns, degrading algorithm performance.
- **Misinterpretation**: Inaccurate results due to poor data quality can lead to incorrect decisions or misinterpretations.
- **Increased Complexity**: Unnecessary features or instances can increase the dimensionality of the dataset, resulting in higher computational costs.

Next, we'll discuss some basic techniques for NLP data cleaning.

In [1]:
# Loading our data
import pandas as pd

df_train = pd.read_parquet('data/healthcare/train.parquet')
df_valid = pd.read_parquet('data/healthcare/valid.parquet')
df_test = pd.read_parquet('data/healthcare/test.parquet')



Now that we have loaded our dataset, there is one crucial aspect to keep in mind before we start transforming it.

**Consistent Transformations:** Every transformation you apply to your training data must also be applied, in the same way, to your validation and test sets. Otherwise the model is evaluated on data that looks different from what it was trained on, and its measured performance becomes unreliable.

> For instance, if you perform a specific adjustment to one of the features in your training set (like normalizing or standardizing), you must ensure to apply the same transformation to that feature in your validation and test sets too.

As a best practice after experimentation, it's recommended to create a custom class or module that will help streamline this process of applying consistent transformations across your datasets. Organizing transformations in a modular fashion would not only result in cleaner code but also save time during the preprocessing phase.

> **Exercise:** As an exercise for you, consider designing such a custom class or module to automate the process of applying the necessary transformations to your datasets.

Remember, no matter how complex your modeling technique may be, flawed or inconsistent data can significantly hamper your model's performance!
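As a possible starting point for the exercise above, here is a minimal sketch of a reusable cleaner. The class name `TextCleaner`, its methods, and the two toy steps (`str.strip`, `str.lower`) are assumptions made for illustration; they are not part of the course material.

```python
from typing import Callable, Iterable, List


class TextCleaner:
    """Apply the same ordered list of cleaning functions to any data split."""

    def __init__(self, steps: Iterable[Callable[[str], str]]):
        # Each step is a plain function that takes a string and returns a string
        self.steps = list(steps)

    def clean(self, text: str) -> str:
        # Run every step in order on a single text
        for step in self.steps:
            text = step(text)
        return text

    def clean_all(self, texts: Iterable[str]) -> List[str]:
        # Apply the full sequence of steps to every text in a split
        return [self.clean(text) for text in texts]


# Hypothetical toy steps and toy data, for illustration only
cleaner = TextCleaner([str.strip, str.lower])

train = ["  Hello World  "]
valid = ["  GOOD Morning "]

# The same object cleans every split, so the transformations cannot drift apart
train_clean = cleaner.clean_all(train)
valid_clean = cleaner.clean_all(valid)
print(train_clean, valid_clean)
```

Keeping the steps in one object means that adding or reordering a step changes the training, validation, and test splits at the same time.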

In [None]:
df_train.head()

In [None]:
txts_train = df_train['text_pt'].copy()
len(txts_train)

In [None]:
txts_train

In [None]:
# Let's check some of our texts to get a feeling for what they look like

for txt in txts_train.sample(10, random_state=271828):
    print(txt, end='\n\n\n\n\n')


After actually getting a hold of your text data, the first step in cleaning up text data is to have a strong idea about what you're trying to achieve, and in that context review your text to see what exactly might help. Take a moment to look at the text. What do you notice? Here are some noteworthy points:


1. **Meta Information:** The first line of each text contains some metadata about the Clinical Note. This could include valuable details such as date, time, author name etc.

2. **Artificial Line Wraps:** The lines seem artificially wrapped with new lines. This may distort the continuity of sentences or paragraphs.

3. **Translation Issues:** There appear to be some translation problems within the text, which might result in misinterpretation or loss of context.

4. **Absence of Delimiters:** At first glance, no obvious delimiters can be identified that could help us break down the text into columns or tables.

5. **Punctuation Marks:** Certain punctuation marks like '-------' or '**********' don't seem to convey any meaning for our analysis and hence could be treated as noise.

6. **Inconsistent Case Use:** The usage of letter case (upper case and lower case) within the text is inconsistent, which might affect precision in certain text processing tasks.

7. **Irrelevant Numbers:** There are numbers present in the text which might not be relevant to our analysis.

8. **Potential Useful Features:** Despite the above issues, there are some clear useful features visible such as names of people, drugs and diseases.

### Sidenote: Regular Expressions

Regular expressions, often shortened as `regex`, are sequences of characters used primarily for searching and replacing patterns within a string. They are widely supported in most programming languages.

Regular expressions utilize two types of characters:
1. **Metacharacters**: As the term implies, these characters hold special meanings. Examples are `.`, `*`, and `+`.
2. **Literals**: These include regular alphabets and numbers like a, b, 1, 2...

#### Commonly Used Operators in Regex

Regex can specify patterns, not just fixed characters. Here, we have listed the most frequently used operators, which help in creating an expression to represent required characters in a string.

| Operators | Description |
|:----------:|:-------------------------------------------------------------------------------------------------------------------------------------------|
| . | Matches any single character except newline (`\n`). |
| ? | Matches zero or one occurrence of the pattern found to its left |
| + | Matches one or more occurrences of the pattern located to its left |
| * | Matches zero or more occurrences of the preceding pattern |
| \w | Matches any alphanumeric (word) character, including the underscore |
| \W | Matches any non-word character |
| \d | Matches any digit [0-9] |
| \D | Matches any non-digit character |
| \s | Matches any whitespace character (space, newline, carriage return, tab, form feed) |
| \S | Matches any non-whitespace character |
| \b | Matches word boundary |
| \B | Matches anywhere but a word boundary |
| [...] | Matches any single character in brackets |
| [^…] | Matches any single character not in brackets |
| \ | It's used before characters that have a special meaning like `\.` for period or `\+` for plus sign |
| ^ and $ | `^` and `$` match the start and end of a string respectively |
| {n,m} | Matches at least `n` and at most `m` occurrences of the preceding expression. `{,m}` matches from zero up to `m` occurrences; `{n,}` matches `n` or more |
| a \| b | Matches either a or b |
| () | Groups part of a pattern and captures the text it matches |
| \t, \n, \r | Matches tab, newline, carriage return respectively |


In Python, there exists a module `re` which aids us with regular expressions. You first need to import the `re` library to be able to use regex in Python.

The `re` package provides various methods to perform queries on an input string. Here are the most commonly used ones:

- re.match()
- re.search()
- re.findall()
- re.split()
- re.sub()
- re.compile()

**Further resources:**

- To test your regex, you can visit this site: [Regex101](https://regex101.com/)
- For the documentation of the `re` library, you can refer to: [Python RE documentation](https://docs.python.org/3/library/re.html)
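The functions listed above can be compared side by side in a short sketch. The sample sentence and the `<NUM>` placeholder are made up for illustration; only standard `re` functions are used.

```python
import re

# Hypothetical sample sentence, for illustration only
text = "Patient took 20 mg of drug A and 5 mg of drug B."

# re.match only succeeds if the pattern matches at the very start of the string
starts_with_patient = re.match(r"Patient", text) is not None
starts_with_drug = re.match(r"drug", text) is not None

# re.search scans the whole string and returns the first match
first = re.search(r"\d+ mg", text)

# re.findall returns every non-overlapping match (here, the captured group)
doses = re.findall(r"(\d+) mg", text)

# re.split cuts the string wherever the pattern matches
parts = re.split(r"\s+and\s+", text)

# re.sub replaces every match with the given replacement
masked = re.sub(r"\d+", "<NUM>", text)

# re.compile builds a reusable pattern object with the same methods
dose_pattern = re.compile(r"(\d+) mg")
doses_again = dose_pattern.findall(text)

print(starts_with_patient, starts_with_drug)
print(first.group())
print(doses)
print(parts)
print(masked)
```

Compiling a pattern is mostly a convenience when the same expression is applied to many texts, as it will be when cleaning a whole dataset.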

### End of Sidenote: Regular Expressions

In [None]:
import re

def remove_excessive_whitespace(text: str) -> str:
    """
    Remove excessive whitespace from a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with excessive whitespace removed.
    """
    return re.sub(r'\s+', ' ', text).strip() 

def remove_repeated_non_word_characters(text: str) -> str:
    """
    Remove repeated non-word characters from a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with repeated non-word characters removed.
    """
    return re.sub(r'(\W)\1+', r'\1', text).strip()  # \W matches any non-word character, including whitespace (for ASCII text: [^a-zA-Z0-9_])

def remove_first_line_of_text(text: str) -> str:
    """
    Remove the first line of a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with the first line removed.
    """
    return re.sub(r'^.*\n', '', text).strip()

def remove_last_line_of_text(text: str) -> str:
    """
    Remove the last line of a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with the last line removed.
    """
    return re.sub(r'\n.*$', '', text).strip()

def correct_isolated_commas(text: str) -> str:
    """
    Correct isolated commas in a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with isolated commas corrected.
    """
    # Remove the space that precedes a punctuation mark
    text = re.sub(r' ([.,:;!?])', r'\1', text)
    return text.strip()

example = '      Hello,     World!!!!!!,,,,     \n\n\n\n It is a beautiful , beautiful day!   \n garbage '
print(example)

In [None]:
print(correct_isolated_commas(example))

In [None]:
print(remove_last_line_of_text(example))

In [None]:
print(remove_excessive_whitespace(example))

In [None]:
print(remove_repeated_non_word_characters(example))

In [None]:
# Import the necessary classes from sklearn
from sklearn.pipeline import Pipeline  # Pipeline applies a list of transforms sequentially. You can also add an estimator at the end.
from sklearn.preprocessing import FunctionTransformer  # FunctionTransformer allows applying an arbitrary function to the data, useful for custom transformations.

# Define a pipeline to clean text data by applying a series of transformations
pipeline_clean_text = Pipeline([
    # Step 1: Remove the first line of text
    # FunctionTransformer wraps the custom function remove_first_line_of_text to be used in the pipeline
    ('remove_first_line_of_text', FunctionTransformer(remove_first_line_of_text)),
    
    # Step 2: Remove the last line of text
    # FunctionTransformer wraps the custom function remove_last_line_of_text to be used in the pipeline
    ('remove_last_line_of_text', FunctionTransformer(remove_last_line_of_text)),
    
    # Step 3: Remove excessive whitespace
    # FunctionTransformer wraps the custom function remove_excessive_whitespace to be used in the pipeline
    ('remove_excessive_whitespace', FunctionTransformer(remove_excessive_whitespace)),
    
    # Step 4: Remove repeated non-word characters (e.g., punctuation)
    # FunctionTransformer wraps the custom function remove_repeated_non_word_characters to be used in the pipeline
    ('remove_repeated_non_word_characters', FunctionTransformer(remove_repeated_non_word_characters)),
    
    # Step 5: Correct isolated commas
    # FunctionTransformer wraps the custom function correct_isolated_commas to be used in the pipeline
    ('correct_isolated_commas', FunctionTransformer(correct_isolated_commas)),
])

# Apply the pipeline to the example data
# The transform method applies all the transformations defined in the pipeline to the input data
cleaned_text = pipeline_clean_text.transform(example)

# Display the cleaned text
cleaned_text

In [12]:
txts_train_cleaned = txts_train.apply(pipeline_clean_text.transform)

In [None]:
print(txts_train_cleaned[1], txts_train[1], sep='\n\n\n---------\n')

In [None]:
print(txts_train_cleaned[3], txts_train[3], sep='\n\n\n---------\n')

## A Broader Perspective on Text Cleaning

While we've covered the fundamentals, text cleaning is a vast field with numerous complexities. Real-world projects often necessitate a more complete approach to text cleaning, as the data encountered is rarely as clean as what we've worked with so far. Here are some advanced considerations:

### Handling Large-scale Documents

When dealing with large documents or numerous text files that exceed memory capacities, effective memory management techniques become essential. Strategies such as chunking, streaming, or distributed processing may need to be employed to efficiently handle the data.
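A minimal sketch of the chunking/streaming idea, assuming one document per line. The function name `iter_lines_in_batches` is hypothetical, and an in-memory `io.StringIO` object stands in for a file that would be too large to load at once.

```python
import io
from typing import Iterator, List, TextIO


def iter_lines_in_batches(handle: TextIO, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` lines without loading the whole file."""
    batch: List[str] = []
    for line in handle:  # file objects are read lazily, one line at a time
        batch.append(line.rstrip("\n"))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # do not lose the final, possibly smaller, batch
        yield batch


# Stand-in for a large file on disk (illustration only)
fake_file = io.StringIO("note 1\nnote 2\nnote 3\nnote 4\nnote 5\n")

batches = list(iter_lines_in_batches(fake_file, batch_size=2))
print(batches)
```

In real use, each batch would be cleaned and written out before the next one is read, so memory use stays roughly constant regardless of file size.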

### Extracting Text from Structured Formats

In some cases, you may need to extract text from structured document formats like HTML or PDFs. This presents unique challenges and requires specific parsing techniques to accurately extract the desired text while discarding the markup or formatting.
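For HTML, one simple option is the standard library's `html.parser`. The sketch below is only an illustration under simplifying assumptions: the class name `TextExtractor`, the helper `html_to_text`, and the sample markup are all made up, and real pages usually need a more robust parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample markup, for illustration only
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Clinical note</h1><p>Patient is <b>stable</b> today.</p></body></html>"
)
extracted = html_to_text(sample_html)
print(extracted)
```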

### Character Transliteration

When working with languages that use accented characters or non-Latin scripts, transliterating the text into plain Latin (ASCII) characters can ease text processing. This involves converting characters from one script or alphabet to another while preserving the pronunciation as closely as possible.

### Unicode Normalization

Dealing with internationalized data often requires normalizing Unicode characters into a standardized form, such as NFC or NFKC. (UTF-8, by contrast, is an encoding, not a normalization form.) Normalization ensures that visually identical strings share the same underlying code points and helps avoid subtle matching and compatibility issues.
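A short sketch with the standard library's `unicodedata` module shows why this matters. The example strings are made up; the last step (dropping combining marks) is a crude accent-stripping approximation of transliteration, not a full transliteration scheme.

```python
import unicodedata

composed = "a\u00e7\u00e3o"       # "ação" written with precomposed ç and ã
decomposed = "ac\u0327a\u0303o"   # same visible text, using combining marks

# The two strings look identical on screen but compare as different
same_before = composed == decomposed

# NFC recombines base letters and combining marks into precomposed characters
nfc = unicodedata.normalize("NFC", decomposed)
same_after = nfc == composed

# NFKC additionally replaces compatibility characters, such as the "fi" ligature
ligature_fixed = unicodedata.normalize("NFKC", "\ufb01rst")

# Crude ASCII approximation: decompose, then drop the combining marks
stripped = "".join(
    ch for ch in unicodedata.normalize("NFD", composed)
    if not unicodedata.combining(ch)
)

print(same_before, same_after, ligature_fixed, stripped)
```

Whether stripping accents is acceptable depends on the language and the task, since accents can distinguish otherwise identical words.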

### Domain-Specific Terminology

Text from specific domains may contain unique words, phrases, or acronyms that add complexity to the cleaning process. Familiarity with the domain and its terminology is crucial for effectively handling and interpreting such text.

### Handling Numeric Data

Numbers, such as dates and amounts, require special attention. Depending on their relevance to the analysis, they may need to be processed in a specific manner or removed altogether. Consistent formatting and representation of numeric data are important considerations.
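One option among several is to replace every number with a placeholder token, so the model still sees that a number was present without having to learn each distinct value. The function name `replace_numbers` and the `<NUM>` token are assumptions for illustration.

```python
import re


def replace_numbers(text: str, placeholder: str = "<NUM>") -> str:
    """Replace integers and decimals (e.g. 126, 2.5, 1,000) with a placeholder."""
    # \d+ matches the leading digits; (?:[.,]\d+)* allows decimal or
    # thousands separators, but only when they are followed by more digits
    return re.sub(r"\d+(?:[.,]\d+)*", placeholder, text)


# Hypothetical sample sentence, for illustration only
masked_numbers = replace_numbers("Glucose 126 mg/dL on 12/03, dose 2.5 mg")
print(masked_numbers)
```

If dates or dosages matter for the task, a more specific pattern per number type (or keeping the numbers as they are) would be the better choice.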

### Error Correction

Identifying and correcting common typos and misspellings can significantly improve the quality of your data and lead to more accurate results. Techniques such as spell checking, dictionary-based correction, or even machine learning models can be employed for error correction.
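A minimal sketch of dictionary-based correction using the standard library's `difflib`. The tiny vocabulary, the function name `suggest_correction`, and the `0.8` similarity cutoff are assumptions chosen for illustration; a real project would need a proper domain dictionary and careful tuning.

```python
import difflib

# Hypothetical reference vocabulary of known-good terms
vocabulary = ["diabetes", "hipertensão", "obesidade", "asma"]


def suggest_correction(word: str, vocab: list, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary entry, or the word itself if none is close."""
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(suggest_correction("diabets", vocabulary))    # close to a known term
print(suggest_correction("obesidde", vocabulary))   # close to a known term
print(suggest_correction("xyz", vocabulary))        # nothing similar: unchanged
```

A high cutoff keeps the corrector conservative, which matters in clinical text where a wrong "correction" can change the meaning of a note.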

> **Mojibake**: Mojibake refers to the garbled text that results from encoding and decoding issues, commonly encountered when dealing with text data from multiple sources. The [`ftfy` library](https://ftfy.readthedocs.io/en/latest/index.html) is a Python tool specifically designed to fix mojibake and other Unicode-related issues.

### The Elusive Nature of 'Clean Text'

Achieving truly clean text is an elusive goal, as the definition of 'clean' varies depending on the specific requirements of your project. The aim is to do the best we can given the available time, resources, and knowledge.

As you progress through the text cleaning process, it's crucial to **continually review your tokens after each transformation**. Saving a new file after each transform allows for a thorough examination of your data in its altered state. Taking the time to review your data often leads to unexpected insights and a deeper understanding of its characteristics.

*Remember, text cleaning is an iterative and ongoing process. There's always room for refinement, and each step brings you closer to cleaner, more analyzable text.*

`For now, this text seems good enough`. Let's check the overall statistics of the cleaned dataset.

In [None]:
txts_train_cleaned.str.split().apply(len).describe()

In [None]:
import plotly.graph_objects as go

# Calculate word counts for each text in the cleaned training dataset
word_counts = txts_train_cleaned.str.split().apply(len)

# Create a histogram plot using Plotly
fig = go.Figure(data=[go.Histogram(x=word_counts)])

# Customize the layout of the plot
fig.update_layout(
    title_text='Word Count Distribution',  # Set the title of the plot
    title_x=0.5,  # Center the title horizontally
    xaxis_title="Number of Words",  # Label for x-axis
    yaxis_title="Frequency",  # Label for y-axis
    bargap=0.1  # Add some gap between bars for better readability
)

# Display the plot
fig.show()


In [None]:
from collections import Counter
import plotly.graph_objects as go

def plot_histogram_word(text_list, n_most_common=30):
    # Create a list of all words from all texts
    words = [word for txt in text_list for word in txt.split()]

    # Count the frequency of each word
    word_counts = Counter(words)

    # Select the top n most frequent words
    top_words = dict(word_counts.most_common(n_most_common))

    # Create a bar chart using Plotly
    fig = go.Figure([go.Bar(
        x=list(top_words.keys()),
        y=list(top_words.values()),
        text=list(top_words.values()),  # Display count on each bar
        textposition='auto'
    )])

    # Customize the layout
    fig.update_layout(
        title_text=f'Top {n_most_common} most frequent words in the text',
        title_x=0.5,  # Center the title
        xaxis_title="Words",
        yaxis_title="Frequency",
        xaxis_tickangle=-45  # Rotate x-axis labels for better readability
    )

    # Display the plot
    fig.show()

# Call the function with the cleaned training data
plot_histogram_word(txts_train_cleaned, 30)


Not very useful yet. What do you think we can do with this? We'll come back to it in the `preprocessing` step.


# Step 3 - Preprocessing

Data preprocessing is a crucial step for any data analysis task. Specifically, in Natural Language Processing (NLP), this phase involves several transformations to prepare the raw data for feature extraction. These tasks include tokenization, stemming/lemmatization, and stop words removal.

## Tokenization

Tokenization is the initial preprocessing step in NLP. It is the process that divides text into smaller pieces called 'tokens.' Here are the two prevalent types of tokenization:

### Word Tokenization
Word tokenization breaks a sentence or a paragraph into individual words.

> **Note:** The simplest form of this procedure uses spaces as delimiters to separate words. But it's worth noting that this method might not always ensure accurate results. For instance, punctuation attached to words can lead to incorrect tokenization. To address this, we use Regular Expression Tokenization.
>
>In addition, some languages like Chinese and Japanese don't have spaces between words, making it even more challenging to tokenize them using this method. Some common approaches to address this include using a dictionary-based approach or a statistical approach.

In [None]:
# Split the sentence into words based on spaces.
sentence = "There is no dark side of the moon, really. Matter of fact, it's all dark!"
tokens = sentence.split()

# Print the tokens.
print(tokens)

As you can see, the tokens `moon,`, `really.`, `fact,`, and `dark!` are not split correctly. This is because the punctuation after these words is treated as part of the word itself. To overcome this problem, we can use a more advanced approach to word tokenization called **Regular Expression Tokenization**.

#### Regular Expression Tokenization

Regular Expression Tokenization is a more advanced approach to word tokenization that allows us to split a sentence into words based on a regular expression. A regular expression is a sequence of characters that define a search pattern. We can use regular expressions to search for and split words based on a particular pattern.

For example, we can define a regular expression that matches runs of word characters and apostrophes, and collect every match as a token, so that any other character (spaces, punctuation) acts as a separator:

In [None]:
import re
tokens = re.findall(r"[\w']+", sentence)
print(tokens)

As you can see, `moon`, `really`, `fact`, and `dark` are now separated from the punctuation that followed them. However, because the pattern includes the apostrophe, `it's` is kept as one single token, when arguably it should be split into two: `it` and `'s`. For now, let's forge ahead with our imperfect tokenizer. We'll deal with punctuation and other challenges later.

## Vocabulary

The concept of **vocabulary** plays a significant role in text data processing and machine learning. In this context, vocabulary refers to the set of unique words that exist in a corpus, or collection of documents.

### The Importance of Vocabulary Size

1. **Vocabulary size** can be regarded as a hyperparameter of the model.
2. An important point to note here is that a larger vocabulary would allow the model to have more features implying more available information about the input data.
3. Conversely, a larger vocabulary implies that the vector representation of the text will be larger too, slowing down the training process of the model and making it prone to overfitting.

### Visualizing the Extremes

Let's visualize how the extremes look:

##### Using Letters as Tokens

* The first extreme approach to defining your vocabulary would be using individual letters as tokens.
* This results in a very small vocabulary: each feature vector only needs as many entries as there are unique letters (26 for unaccented English, for instance).
* If every letter is one-hot encoded, however, a text becomes a very long sequence of sparse vectors, each with a single non-zero entry.
* Training a model on such long sequences can slow down the training process considerably.
* Alternatively, each document could be represented by a single vector of length 26, with each entry holding the frequency of the corresponding letter in the document. That, however, would be a very poor representation, as we'd lose almost all signal about the actual words in the document.
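The loss of signal in a letter-frequency representation can be seen in a small sketch. The helper name `letter_count_vector` and the example texts are made up for illustration, and only the 26 unaccented lowercase letters are counted.

```python
from collections import Counter
import string


def letter_count_vector(text: str) -> list:
    """Represent a text as 26 counts, one per lowercase ASCII letter."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [counts[ch] for ch in string.ascii_lowercase]


v_listen = letter_count_vector("listen")
v_silent = letter_count_vector("silent")

# Two sentences with opposite meanings but exactly the same letters
v_dog = letter_count_vector("the dog bit the man")
v_man = letter_count_vector("the man bit the dog")

print(len(v_listen))          # vector length equals the alphabet size
print(v_listen == v_silent)   # anagrams become indistinguishable
print(v_dog == v_man)         # word order and word identity are lost
```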


##### Using Words as Unrestricted Tokens

* Moving to the other extreme, if we use every unique word as a token without any restriction on the vocabulary size, we'll encounter different issues.
* As can be anticipated, the vocabulary can grow very large, and it keeps growing as the corpus grows, producing a huge, sparse feature vector for each document.
* The downsides are similar to those of the previous extreme: a slower training process and a higher likelihood of overfitting.
* In addition, the resulting vocabulary would be effectively unbounded, as new words are constantly being added to the language.


### Striking the Balance: Fixed Vocabulary Size

What we ideally seek is an equilibrium. A spot between these two extremes that best serves our purpose. A key strategy to achieve this balance is fixing the vocabulary size:

* By fixing the vocabulary size, we'd only consider the top 'N' frequently occurring words in the entire corpus.
* This ensures the vocabulary size remains 'N', with each word or document being represented by a vector of length 'N'.
* Such a setup offers a more compact representation and significantly speeds up model training compared to an unrestricted vocabulary.
* This however involves trade-offs — less frequent, potentially significant words might get left out.

> Manipulating the size of the vocabulary allows us to control the complexity of our model and its ability to generalize from the training data to unseen data. Therefore, it's vital to choose your strategy wisely based on your specific requirements and constraints.
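The fixed-size strategy can be sketched with a `Counter`. The function names `build_vocabulary` and `encode`, the `<UNK>` token for out-of-vocabulary words, and the toy corpus are assumptions for illustration. Note that reserving `<UNK>` makes the final vocabulary hold N + 1 entries.

```python
from collections import Counter


def build_vocabulary(texts, max_size):
    """Keep the `max_size` most frequent tokens; index 0 is reserved for <UNK>."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_size)]
    return {token: idx for idx, token in enumerate(["<UNK>"] + most_common)}


def encode(text, vocab):
    """Map each token to its index, falling back to <UNK> for unseen tokens."""
    return [vocab.get(token, vocab["<UNK>"]) for token in text.lower().split()]


# Hypothetical toy corpus, for illustration only
corpus = [
    "the patient has asthma",
    "the patient has diabetes",
    "the patient is stable",
]

vocab = build_vocabulary(corpus, max_size=3)
encoded = encode("the patient has obesity", vocab)

print(vocab)
print(encoded)  # "obesity" was never seen, so it maps to the <UNK> index
```

This makes the trade-off visible: rare words such as `asthma` and `diabetes` fall outside the top three and collapse into `<UNK>`, even though they may be the most informative tokens for a clinical task.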

---


### Curiosity: Zipf's Law

Zipf's law is a well-known empirical rule from linguistics, information theory, and statistics. It states that the frequency of any word in a corpus (a large, structured collection of texts) is approximately inversely proportional to its rank in the frequency table.

The principle was popularized by George Kingsley Zipf, an American linguist, in 1935 as part of his work in quantitative linguistics.

What does this actually mean? Simply put, within a text or collection of texts, the most common word appears approximately twice as often as the second most frequent word, three times as often as the third, and so on. If you list all the words from a text in order of how often they occur, you should see this pattern emerge, at least approximately.
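
A compact way to write the relationship described above, as a sketch rather than an exact law (the symbols $f$, $r$ and the exponent $s$ are notation introduced here for illustration):

$$
f(r) \approx \frac{f(1)}{r^{s}}, \qquad s \approx 1
$$

Here $f(r)$ is the frequency of the word at rank $r$. Taking logarithms gives $\log f(r) \approx \log f(1) - s \log r$, so on a log-log plot of frequency against rank the points should fall roughly on a straight line with slope $-s$.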

Skewed frequency distributions are not limited to words. Letter frequencies in English are also far from uniform ("e" is the most common letter, followed by "t" and "a"), although they follow the strict 1/rank pattern much less closely than words do.

Interestingly, Zipf's law also applies beyond linguistic contexts. It can describe a range of phenomena with similar frequency-versus-rank patterns, such as the population ranks of cities (the largest city is roughly twice as populous as the second-largest), viewing rates for TV channels, incoming calls at call centers, or the number of links pointing to each page on the World Wide Web.

In essence, Zipf's law presents us with a fascinating insight into patterns of repetition and prevalence found amongst elements within a given system - be it words, cities, calls or webpages.

> Let's check if that applies to our dataset. We'll use the `Counter` class from the `collections` module to count the frequency of each word in our dataset.

In [None]:
from collections import Counter
import numpy as np
import plotly.graph_objects as go

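# Convert all text to lowercase and keep only letters and whitespace
# The character class includes Portuguese accented letters so that words like 'não' stay intact
all_text_lowercase = txts_train_cleaned.str.lower()
all_text_lowercase = all_text_lowercase.str.replace(r'[^a-záàâãéêíóôõúüç\s]', '', regex=True)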

# Create a Counter object to count word frequencies
word_counter = Counter()

# Iterate through each text, split into words, and update the counter
for txt in all_text_lowercase:
    word_counter.update(txt.split())

# Get the top 3000 most frequent words
top_words = dict(word_counter.most_common(3000))

# Frequency of the most common word: Zipf's law predicts f(rank) ≈ f(1) / rank
top_frequency = max(top_words.values())

# Prepare data for plotting
word_rank = np.log(np.arange(1, len(top_words)+1))  # Log of word ranks
word_freq = np.log(list(top_words.values()))  # Log of word frequencies

# Create the scatter plot
fig = go.Figure([go.Scatter(x=word_rank, y=word_freq, mode='lines+markers', name='Top 3000 words')])

# Customize the layout
fig.update_layout(title_text='Word frequency as a function of its frequency rank', title_x=0.5)
fig.update_xaxes(title_text='Log Rank')
fig.update_yaxes(title_text='Log Frequency')

# Add a trace for the theoretical Zipf's law, anchored at the most frequent word
zipf_x = np.linspace(1, len(top_words), 100)
zipf_y = top_frequency / zipf_x
fig.add_trace(go.Scatter(x=np.log(zipf_x), y=np.log(zipf_y), mode='lines', name="Zipf's law", line=dict(color='red', dash='dash')))

# Display the plot
fig.show()

### N-Grams

Simply put, an n-gram is a contiguous sequence of exactly 'n' elements extracted from a larger sequence. These elements can be characters, syllables, words, or even symbols such as “A,” “T,” “G,” and “C” in DNA sequencing. For our discussion, we'll focus on words as the primary elements.

In the context of language sequences and linguistic patterns, these 'n' elements usually refer to words. They could be single words (unigrams), two-word sequences (bigrams), three-word sequences (trigrams), and so on. The term extends to identify longer sequences by the numeric value of 'n', such as four-gram, five-gram, and beyond!

#### Applications of N-Grams

The application of n-grams is extensive, spanning from statistical natural language processing to genetic sequence analysis. However, it's important to note that n-grams don't inherently carry any special significance: they are merely sequences of 'n' consecutive words that appear within a text.

To draw an illustration, consider this sentence: "Eu gosto muito de estudar PLN." Here, we can identify these bigrams: "Eu gosto", "gosto muito", "muito de", "de estudar", "estudar PLN". Additionally, the trigrams within the given sentence would be: "Eu gosto muito", "gosto muito de", "muito de estudar", "de estudar PLN".

#### Importance of N-Grams

So, why should we regard n-grams as important?

When we break down a sequence of tokens into individual words, as in a bag-of-words vector, we lose much of the meaning carried by their order and combination. By recognizing multiword tokens (n-grams) within our pipeline, we allow our NLP model to retain some of the semantic structure built into word order.

For instance, take the word “não”. Without considering n-grams, this word could end up floating around freely, attaching its negative connotation to the entire sentence or document, rather than specifically to its neighboring words. If we recognize "não gostou" as a 2-gram, we retain more of the original intent behind the phrase than we would by treating "não" and "gostou" as isolated 1-grams in a bag-of-words vector. This approach keeps part of a word's context intact by associating it with its neighbours within our pipeline.
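
To make the point about word order concrete, here is a small sketch in plain Python. The helper `word_ngrams` and the two example sentences are made up for illustration. Both sentences contain exactly the same words, so their unigram counts are identical, yet their bigram counts differ.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all contiguous word n-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence_a = "ele não gostou mas ela gostou"
sentence_b = "ele gostou mas ela não gostou"

unigrams_a = Counter(word_ngrams(sentence_a, 1))
unigrams_b = Counter(word_ngrams(sentence_b, 1))
bigrams_a = Counter(word_ngrams(sentence_a, 2))
bigrams_b = Counter(word_ngrams(sentence_b, 2))

print(unigrams_a == unigrams_b)  # True: a bag of single words cannot tell the sentences apart
print(bigrams_a == bigrams_b)    # False: bigrams keep some of the word order
print(sorted(set(bigrams_a) - set(bigrams_b)))  # bigrams found only in the first sentence
```

With bigrams, "ele não" and "ela não" become distinct features, so a model can tell who did not like something.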

---

In [None]:
import nltk

sentence = 'Eu gosto muito de estudar PLN'
bigrams = nltk.bigrams(sentence.split()) # or nltk.ngrams(sentence.split(), 2)
print(list(bigrams))

trigrams = nltk.ngrams(sentence.split(), 3) # or nltk.trigrams(sentence.split())
print(list(trigrams))


### Case folding
Case folding, also known as case normalization, is the process wherein we consolidate different "spellings" of a word that differ only in their capitalization. Words often become case "denormalized" when they are capitalized at the beginning of a sentence or emphasized through upper-case spelling. The purpose of this process is to standardize words that essentially mean the same thing but appear differently due to varied capitalization.

#### Importance of Case Folding

Case folding, while seemingly simple, plays a crucial role in Natural Language Processing (NLP). It helps to reduce your vocabulary size and streamline your NLP pipeline by consolidating words with the same meaning and spelling under a single token.

For instance, 'doctor' and 'Doctor' may be treated as different tokens because of the difference in capitalization. While the lower-case variant usually indicates a medical professional, the capitalized form might be a title or part of an individual's name. Being able to distinguish between the two can be important if recognizing proper nouns is crucial to your NLP task.

However, if tokens aren't normalized for case, your vocabulary might grow to roughly twice the size, consuming about twice the memory and processing time. This might also increase the amount of training data you need to label for your machine learning pipeline to converge onto an accurate solution.
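
The effect on vocabulary size can be seen on a tiny example. This is a rough sketch under simplifying assumptions: the sentences are invented, tokenization is a plain whitespace split, and the helper `vocabulary` is hypothetical.

```python
toy_sentences = [
    "The doctor arrived late.",
    "Doctor Silva signed the report.",
    "the report was late",
]

def vocabulary(sentences, fold_case):
    """Collect the set of distinct tokens, optionally lowercasing the text first."""
    vocab = set()
    for sentence in sentences:
        if fold_case:
            sentence = sentence.lower()
        # Strip trailing punctuation so 'late.' and 'late' count as the same token
        vocab.update(token.strip(".,") for token in sentence.split())
    return vocab

cased = vocabulary(toy_sentences, fold_case=False)
folded = vocabulary(toy_sentences, fold_case=True)

print(len(cased), len(folded))
print(sorted(cased - folded))  # tokens that only exist because of capitalization
```

Folding merges 'The'/'the' and 'Doctor'/'doctor', but it also turns the surname 'Silva' into 'silva', which is exactly the kind of information loss discussed in the next section.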

#### Balancing Efficiency and Information Content

Though case folding can streamline your NLP pipeline, it's essential to strike a balance between computational efficiency and information content.

In many situations, the efficiency gained through reducing vocabulary size by half might be outweighed by the loss of valuable information specific to proper nouns. Nevertheless, some information may inadvertently get lost even without case normalization. For example, if the word "The" at the beginning of a sentence isn't identified as a stop word, it could impair certain applications.

Sophisticated pipelines usually perform selective case normalization, identifying proper nouns first before normalizing case for other words at the start of sentences.

#### Choosing the Right Approach

The approach to case normalization largely depends on the specifics of your application and the nature of your corpus. If distinguishing terms like the surname “Smith” from the common noun “smith” (as in “wordsmiths”) isn't crucial to your analysis, you could simply convert all words to lowercase.

However, the most effective strategy typically involves experimenting with multiple approaches and choosing the one which optimizes the performance of your NLP project. There's no one-size-fits-all solution – the best method would be the one that aligns perfectly with your project objectives.

---

In [None]:
# Sometimes, casing can be important. See the example below, where uppercase text is part of the document structure, indicating the beginning of a section.
txts_train_cleaned.iloc[0][:2000]

See the example below from a legal document:


> AGRAVO REGIMENTAL NO RECURSO EXTRAORDINÁRIO COM AGRAVO. ADMINISTRATIVO. ESTABELECIMENTO DE ENSINO. INGRESSO DE ALUNO PORTANDO ARMA BRANCA. AGRESSÃO. OMISSÃO DO PODER PÚBLICO. RESPONSABILIDADE OBJETIVA. ELEMENTOS DA RESPONSABILIDADE CIVIL ESTATAL DEMONSTRADOS NA ORIGEM. REEXAME DE FATOS E PROVAS. IMPOSSIBILIDADE. PRECEDENTES.
> 1. A jurisprudência da Corte firmou-se no sentido de que as pessoas jurídicas de direito público respondem objetivamente pelos danos que causarem a terceiros, com fundamento no art. 37, § 6º, da Constituição Federal, tanto por atos comissivos quanto por omissivos, desde que demonstrado o nexo causal entre o dano e a omissão do Poder Público.
> 2. O Tribunal de origem concluiu, com base nos fatos e nas provas dos autos, que restaram devidamente demonstrados os pressupostos necessários à configuração da responsabilidade extracontratual do Estado.
> 3. Inadmissível, em recurso extraordinário, o reexame de fatos e provas dos autos. Incidência da Súmula nº 279/STF.
> 4. Agravo regimental não provido.
> (STF. ARE nº 697326. Primeira Turma. Rel. Min. Dias Tóffoli. Julgado em 05/03/2013. Publicado em 26/04/2013)


There are several important elements in the text above that need to have their casing preserved. The first part is called an "ementa", and it is a summary of the case. It's usually written in uppercase and is a very important section of this document. Had we lowercased the text, we would have lost this information.

### Another approach: Subword tokenization

Subword tokenization is an advanced technique in Natural Language Processing (NLP) that offers a powerful approach to handling challenges such as Out-Of-Vocabulary (OOV) words, reducing vocabulary size, and capturing finer-grained morphological and semantic information.

#### Advantageous Aspects

Some major advantages of subword tokenization include:

1. **Handling OOV Words**: Subword tokenization allows your model to handle words it has not previously encountered. Even if the model hasn't seen a particular word during training, it can correctly process it if it recognizes the subwords or word-parts that make up the word.

2. **Reducing Vocabulary Size**: With subword tokenization, we break down words into smaller units, leading to a reduction in the overall size of the vocabulary. This can lead to lower memory usage and faster processing times.

3. **Capturing Morphological Information**: Subword tokenization has the potential to capture meaningful linguistic fragments and provide insights into the morphological structure of words.
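
The advantages above can be illustrated with a toy tokenizer. The sketch below uses greedy longest-match splitting against a small, hand-written vocabulary. Everything here is an assumption made for illustration: real subword tokenizers (such as BPE or WordPiece) learn their vocabulary from data, and the `##` continuation marker simply mimics the WordPiece convention.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest pieces found in `vocab`, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hand-written toy vocabulary (hypothetical, not learned from any corpus)
toy_vocab = {"organiz", "re", "##organiz", "##ar", "##ando", "##amos", "##a"}

print(greedy_subword_tokenize("organizando", toy_vocab))
print(greedy_subword_tokenize("reorganizamos", toy_vocab))
print(greedy_subword_tokenize("xyz", toy_vocab))
```

Neither "organizando" nor "reorganizamos" is in the vocabulary as a whole word, yet both can be represented with seven entries that are shared across many other word forms.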

However, it's important to recognize that subword tokenization also introduces its own set of challenges.

#### Challenges with Subword Tokenization

One critical challenge of using subword tokenization is that it adds a layer of complexity to the tokenization process. Additionally, the interpretability and explainability of the model's output might be compromised. The reason for this trade-off lies in the nature of subword units themselves: they aren't actual 'words' in the strictest sense and hence, the model could struggle to establish context or derive meaning from these tokens.

> **Note:**
>
> Subword tokenization is a crucial component of state-of-the-art models like transformers. We'll see how and why the usage of subword tokenization enhances the performance of these models in an upcoming lecture.

---

## Stop words

Stop words hold a significant position in the world of Natural Language Processing (NLP) and Information Retrieval (IR). But what exactly are stop words?

As per the [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html):

> *Stop words are extremely common words that appear to have little value in helping select documents matching a user's need. As a result, they are excluded from the vocabulary entirely.*

Stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Interestingly, the use of stop words has seen a remarkable evolution over time in the field of Information Retrieval (IR). In the early systems, it was a common practice to maintain large stop lists, typically including around 200-300 terms. The main purpose was to filter out these frequently occurring words, with the idea that their exclusion would help streamline and improve the search process.

However, as IR systems have advanced, there has been a noticeable shift toward the usage of significantly smaller stop lists, containing just about 7-12 terms. This change mainly stemmed from the understanding that several common words, earlier perceived as insignificant, might carry valuable context or sentiment information which could enhance search accuracy and result relevance.

In fact, the latest trend, especially in web search engines, is to use no stop list at all. These platforms keep the so-called 'insignificant' words in their vocabulary, since any word, regardless of how common, can provide information that helps match the user's needs.

The theory and practice surrounding stop words is a prime example of how our understanding and handling of text data evolves as we develop more sophisticated tools and techniques. It highlights the importance of continual learning and adaptation in the ever-changing landscape of Information Retrieval and Natural Language Processing.

### NLTK

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the stop word lists (only needed once per environment)
nltk.download('stopwords')
stopwords_nltk = stopwords.words('portuguese')

stopwords_nltk[:10]

### What is spaCy?

spaCy is a widely used NLP library designed for production use. It is opinionated, meaning that it typically offers one well-optimized way to achieve a task. This contrasts with libraries such as NLTK, which offer several methods for each task but are generally less optimized for speed. This focus on speed and efficiency makes spaCy a popular choice for professional NLP applications.

#### Downloading Language Models in spaCy

To make use of spaCy's capabilities, you first need to install the spaCy library and then download the specific language model that suits your project needs. Let's take the example of downloading the Portuguese model.

#### Steps to download the Portuguese model:

After installing spaCy, the Portuguese model can be downloaded using the command below:

```bash
python -m spacy download pt_core_news_sm
```

This command asks Python to run the spaCy module (`spacy`) and perform the `download` operation, requesting the small (`sm`) general-purpose pipeline (`core`) trained on news text (`news`) for Portuguese (`pt`).

With spaCy installed and the necessary language model(s) downloaded, you are all set to use its features in your NLP projects: tokenization, part-of-speech tagging, named entity recognition, and other language processing tasks. For now, we'll just use spaCy's stop words.

In [None]:
!python -m spacy download pt_core_news_sm

In [None]:
# Import the spaCy library, which is used for advanced natural language processing tasks
import spacy

# Load the small Portuguese pipeline from spaCy
# It includes a tokenizer, part-of-speech tagger, lemmatizer, and named entity recognizer
# (the small model does not ship with static word vectors)
nlp = spacy.load('pt_core_news_sm')

# Access the set of default stop words for the Portuguese language from the loaded model
# Stop words are common words (e.g., 'e', 'o', 'de') that are usually filtered out in text processing
stopwords_spacy = nlp.Defaults.stop_words

# Convert the set of stop words to a list and display the first 10 stop words
# This gives an idea of what words are considered stop words in the Portuguese language model
list(stopwords_spacy)[:10]

In [None]:
len(list(stopwords_spacy)), len(list(stopwords_nltk))


In [None]:
both_stopwords = set(stopwords_nltk) | set(stopwords_spacy)
len(both_stopwords)

In [30]:
def remove_stopwords(text):
    # Split the input text into individual tokens (words)
    tokens = text.split()
    
    # Keep only tokens that are not in the both_stopwords set
    # The comparison is case-insensitive because the stop word lists are lowercase
    tokens = [token for token in tokens if token.lower() not in both_stopwords]
    
    # Join the filtered tokens back into a single string with spaces in between
    return ' '.join(tokens)

In [None]:
txts_train_cleaned_no_stopwords = txts_train_cleaned.apply(remove_stopwords)

plot_histogram_word(txts_train_cleaned_no_stopwords, 30)

This is much better. What else can we do to improve?

### Stemming and Lemmatization

#### Are the following words the same?

Consider these sets of words:

- *organizar, organiza, and organizam*
- *belo, belos, bela, and belas*

For those examples, the words have the same root but appear in different forms. To determine if they are the same, we can apply **stemming** and **lemmatization** techniques.

**Stemming** reduces words to their stem or root form by removing affixes such as plural markers, tense markers, or derivational suffixes. For example:

- organizar, organiza, and organizam → organiz

**Lemmatization** takes into account the morphological analysis of the word and returns its base or dictionary form (lemma). For example:

- belo, belos, bela, and belas → belo

By applying these techniques, we can conclude that the words in each set share the same base form and differ only in their grammatical form.

In [None]:
# Import the spaCy library for natural language processing
import spacy

# Load the small Portuguese language model from spaCy
# This model includes vocabulary, syntax, and entities for Portuguese
nlp = spacy.load('pt_core_news_sm')

# Define a function to lemmatize text
# Lemmatization reduces words to their base or dictionary form
def lemmatize_text(text):
    # Process the input text using the spaCy model
    doc = nlp(text)
    # Extract the lemma for each token in the processed text
    # token.lemma_ gives the lemma of the token
    tokens = [token.lemma_ for token in doc]
    # Join the lemmas back into a single string with spaces in between
    return ' '.join(tokens)

# Test the lemmatize_text function with different forms of the word 'organizar'
# 'organiza' is the third person singular present form
print(lemmatize_text('organiza'))  # Expected output: 'organizar'
# 'organizam' is the third person plural present form
print(lemmatize_text('organizam'))  # Expected output: 'organizar'
# 'organizamos' is the first person plural present form
print(lemmatize_text('organizamos'))  # Expected output: 'organizar'

# Test the lemmatize_text function with different forms of the adjective 'belo'
# 'belo' is the masculine singular form
print(lemmatize_text('belo'))  # Expected output: 'belo'
# 'bela' is the feminine singular form
print(lemmatize_text('bela'))  # Expected output: 'belo'
# 'belos' is the masculine plural form
print(lemmatize_text('belos'))  # Expected output: 'belo'
# 'belas' is the feminine plural form
print(lemmatize_text('belas'))  # Expected output: 'belo'

### Analyzing Word Forms: Stemming and Lemmatization

In linguistics, analyzing different word forms to dig into their base form or root is often crucial. The goal is to understand the patterns of a language, its semantics, morphology, and syntax. Let's consider this using some examples:

#### Set of Words for Analysis

Consider the following sets of words:

1. *organizar, organiza, and organizamos*
2. *belos, belas, and bela*

Observe that these words share the same root but appear in various forms due to changes in tense, number, mood, gender, etc. So, how do we determine if they are essentially the 'same'?

To answer that, we will check two techniques in Natural Language Processing (NLP): **Stemming** and **Lemmatization**.

#### Understanding Stemming

**Stemming** is a technique used in NLP to break down a word into its base or stem form. It does so by chopping off affixes such as plural markers, tense markers, or derivational suffixes from a word.

Take, for example, the set of words *organizar, organiza, and organizamos*. After applying stemming, we can reduce these forms to their stem → 'organiz'.

However, the stemming approach may not always lead to actual words present in the dictionary.
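
To show the mechanics of "chopping off" suffixes, here is a deliberately naive stemmer in plain Python. It is a toy written for this example: the suffix list and the `min_stem_len` rule are arbitrary assumptions, and real Portuguese stemmers (for instance the RSLP and Snowball stemmers available in NLTK) apply far more elaborate rules.

```python
# Hypothetical suffix list, ordered from longest to shortest so the longest match wins
SUFFIXES = ("amos", "ando", "am", "ar", "as", "os", "a", "o")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as the remaining stem is not too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["organizar", "organiza", "organizamos", "belos", "belas", "bela"]:
    print(word, "->", toy_stem(word))
```

All three verb forms collapse to 'organiz' and all three adjective forms to 'bel'. Neither stem is a Portuguese word, which illustrates the limitation mentioned above.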

#### Understanding Lemmatization

On the other hand, **Lemmatization** relies on a morphological analysis of the word before reducing it to its base or dictionary form (lemma). It can also take the word's context (such as its part of speech) into account, which helps produce meaningful results.

Consider the set: *organizar, organiza, and organizamos*. After lemmatization, these words would all be reduced to → 'organizar'.

> While both stemming and lemmatization aim to reduce words to a base form, lemmatization tends to be more accurate at the cost of speed, whereas stemming is quicker but may produce non-word tokens. Unlike stemming, lemmatization returns a form that actually belongs to the language, drawing on vocabulary and context. For this reason, spaCy doesn't include a stemmer, but it does provide a lemmatizer.


In [33]:
def spacy_lemmatizer(text):
    # Process the input text using the spaCy model
    # This creates a spaCy document object that contains linguistic annotations
    doc = nlp(text)
    
    # Extract the lemma for each token in the document
    # token.lemma_ gives the lemma (base form) of the token
    txt = [token.lemma_ for token in doc]
    
    # Join the lemmas back into a single string with spaces in between
    return " ".join(txt)

In [None]:
spacy_lemmatizer(txts_train_cleaned[1])

In [None]:
txts_train_cleaned[1]

In [None]:
txts_train_cleaned_no_stopwords_lemmatized = txts_train_cleaned_no_stopwords.apply(spacy_lemmatizer)

plot_histogram_word(txts_train_cleaned_no_stopwords_lemmatized, 30)

In [None]:
import re

# Define a function to preprocess and lemmatize text using spaCy
def spacy_lemmatizer_v2(text):
    # Convert the text to lowercase to ensure uniformity
    text = text.lower()
    
    # Remove punctuation from the text using a regular expression
    # r'[^\w\s]' matches any character that is not a word character or whitespace
    # Doing this before stop word removal lets tokens such as 'que,' match the stop list
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove stopwords from the text
    # This step helps in focusing on the meaningful words
    text = remove_stopwords(text)
    
    # Create a spaCy document object from the cleaned text
    # This object contains linguistic annotations
    doc = nlp(text)
    
    # Lemmatize the text by extracting the lemma for each token in the document
    # token.lemma_ gives the lemma (base form) of the token
    txt = [token.lemma_ for token in doc]
    
    # Remove words with less than 3 characters
    # This step helps in filtering out short, less meaningful words
    txt = [word for word in txt if len(word) > 2]
    
    # Join the processed words back into a single string with spaces in between
    return " ".join(txt)

# Apply the spacy_lemmatizer_v2 function to each document in the training dataset
txts_train_cleaned_no_stopwords_lemmatized = txts_train_cleaned.apply(spacy_lemmatizer_v2)

# Plot a histogram of word counts in the preprocessed training dataset
# The second argument (30) specifies the number of bins in the histogram
plot_histogram_word(txts_train_cleaned_no_stopwords_lemmatized, 30)

Even better.

In [None]:
# Import necessary libraries
import nltk
nltk.download('punkt')  # Download the 'punkt' tokenizer models from NLTK
from collections import Counter  # Import Counter to count occurrences of n-grams
from nltk.util import ngrams  # Import ngrams to generate n-grams from text
from nltk import word_tokenize  # Import word_tokenize to split text into words
import plotly.graph_objs as go  # Import Plotly for creating visualizations

# Define a function to generate n-grams from text
def generate_ngrams(text, n, lowercase=False):
    # Convert text to lowercase if specified
    if lowercase:
        text = text.lower()
    
    # Tokenize the text into words, using the Portuguese sentence splitter instead of the English default
    tokens = word_tokenize(text, language='portuguese')
    
    # Generate n-grams from the tokenized words
    n_grams = ngrams(tokens, n)
    
    # Join the n-grams into strings and return as a list
    return [' '.join(grams) for grams in n_grams]

# Initialize a Counter to count occurrences of n-grams
n_grams_counter = Counter()

# Define the number of most common n-grams to display
n_most_common = 30

# Loop through each text in the preprocessed training dataset
for text in txts_train_cleaned_no_stopwords_lemmatized.values:
    # Generate bigrams (n=2) from the text and update the counter
    n_grams_counter.update(generate_ngrams(text, n=2, lowercase=True))

# Select the top 30 most frequent bigrams
n_grams_counter = dict(n_grams_counter.most_common(n_most_common))

# Create a bar chart to visualize the most frequent bigrams
fig = go.Figure([go.Bar(x=list(n_grams_counter.keys()), y=list(n_grams_counter.values()))])

# Update the layout of the figure with a title
fig.update_layout(title_text=f'Top {n_most_common} most frequent bigrams in the text')

# Display the figure
fig.show()

Even better.

## Preprocessing Techniques in Deep Learning: A Pragmatic Approach

<img src="images/skomoroch.png" alt="" style="width: 65%"/>

When it comes to text preprocessing in the context of Deep Learning, traditional techniques such as stemming, lemmatization, and stop-word removal may not always yield the best results. While these methods have been considered standard practices in natural language processing, it's important to understand that Deep Learning models possess the ability to capture complex features and nuances from the original data.

### Potential Drawbacks of Excessive Preprocessing

Applying stemming, lemmatization, or stop-word removal can sometimes lead to a **decrease** in model performance. This is because these techniques inherently involve discarding certain information from the text, which may result in the loss of crucial context and subtle meanings that are essential for semantic understanding.

Consider the following example:
- After stemming, the words 'medicamento', 'medicinal', and 'medicina' would all be reduced to 'medic'.
- While this simplifies the dataset for processing, a Deep Learning model may lose valuable information about the context in which these words were used.

The nuanced differences between these words could potentially hold significant meaning that contributes to the overall understanding of the text.

### Experiment and Compare: Finding the Optimal Approach

When deciding whether to employ preprocessing techniques like stemming or lemmatization, it's crucial to recognize that there is no universal solution that fits all scenarios. The effectiveness of these techniques depends on several factors, including:
- The specific task you are trying to accomplish
- The characteristics and domain of the text data you are working with
- The complexity and architecture of the Deep Learning model being used

To determine the best approach, it is recommended to **experiment with different preprocessing strategies and compare the results**. By evaluating the model's performance with and without applying techniques like stemming or lemmatization, you can gain insights into which approach yields better outcomes for your specific use case.

> Personal Experience Note: In my experience, I have rarely encountered situations where stemming or lemmatization significantly enhanced the performance of a Deep Learning model. While there have been cases where these techniques negatively impacted the results, I haven't come across a scenario where they distinctly improved the model's performance.


# Step 4 - Feature Extraction: A Transformation Process

Feature extraction is a critical stage in the machine learning pipeline that entails transforming preprocessed data into a format digestible by machine learning algorithms. The purpose of this process is to create numerical representations of text data, which are more compatible with these algorithms.


## Analogy to Tabular Data

To understand this concept better, we can draw parallels with traditional machine learning processes involving tabular data. In such scenarios, the data is already numerical and well suited for direct delivery into the model.

However, when dealing with text data, it isn't inherently numerical. As such, we need to implement an additional step - feature extraction - to morph this data into a numerical form before it can be processed further.

---

## Feature Extraction Techniques

The most commonly applied techniques during feature extraction include Bag-of-Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency). These methods convert textual information, such as words or phrases, into numerical values that machine learning models can work with.
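
As a rough illustration of how TF-IDF weighs words, the sketch below computes one common textbook variant by hand: term frequency (the share of a document's tokens that a word accounts for) multiplied by `log(N / document frequency)`. The toy corpus and the names `doc_freq` and `tf_idf` are assumptions made for this example. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers would differ.

```python
import math
from collections import Counter

toy_corpus = ["o céu é azul", "o mar é azul", "a bandeira é anil"]
tokenized = [doc.split() for doc in toy_corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(token for doc in tokenized for token in set(doc))

def tf_idf(doc_tokens):
    """Weight each word by its relative frequency in the document times log(N / document frequency)."""
    term_counts = Counter(doc_tokens)
    return {
        token: (count / len(doc_tokens)) * math.log(n_docs / doc_freq[token])
        for token, count in term_counts.items()
    }

for doc, tokens in zip(toy_corpus, tokenized):
    print(doc, {token: round(weight, 3) for token, weight in tf_idf(tokens).items()})
```

The word "é" appears in every document, so its weight is zero, while a word found in a single document, such as "anil", receives the highest weight in its sentence.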

---

> In essence, feature extraction serves as the bridge between raw textual data and machine learning algorithms that require numeric input. This transformation equips these algorithms to uncover patterns and extract insights from otherwise unstructured and harder-to-interpret text data.


## Recap from the Previous Lesson - Feature Extraction

### Word Representations in Natural Language Processing (NLP)

In the field of Natural Language Processing (NLP), one of the key problems to solve is how to represent words in a way that computers can understand and process. There are two main approaches for this: the **localist approach** and the **distributed approach**.

#### Localist Approach: Words as Discrete Symbols

In the localist approach, which is used traditionally in NLP, each word in a language is considered as a distinct entity or a discrete symbol. One common representation technique here involves the use of 'one-hot vectors'.

Imagine you have a vocabulary of 10,000 unique words. Each word would be represented by a vector of length 10,000. This vector is filled with zeros except for one position - the unique index representing the specific word - which is set to one. Hence, the term 'one-hot'.
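
The same idea on a much smaller scale, as a minimal sketch with a made-up four-word vocabulary (the names `tiny_vocab` and `one_hot` are introduced here for illustration):

```python
tiny_vocab = ["azul", "céu", "mar", "sol"]
word_to_index = {word: index for index, word in enumerate(tiny_vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(tiny_vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("mar"))  # [0, 0, 1, 0]

# The dot product between any two different one-hot vectors is always zero
dot = sum(a * b for a, b in zip(one_hot("céu"), one_hot("mar")))
print(dot)  # 0
```

Because every pair of distinct words has a dot product of zero, one-hot vectors say nothing about how similar two words are.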

Although straightforward, the localist method has limitations, particularly when it comes to capturing semantic relationships between words, i.e., how closely related the meanings of different words are. This limitation leads us to the second approach.

#### Distributed Approach: Continuous Vectors

In contrast to seeing words as individual symbols, the distributed approach treats words as continuous vectors in high-dimensional space. Under this model, each word is portrayed as a point within this space. The intriguing part is that words sharing similar semantic attributes tend to cluster close to each other within this space.

This technique makes use of **word embeddings**, where every word is mapped to a dense vector that represents its context in the high-dimensional space. Consequently, even if we come across an unfamiliar word, we can infer some properties about it based on its proximity to known words.

The major strength of the distributed approach lies in its ability to generalize from known to unknown, based on semantic similarity, providing richer and more powerful representations of words.
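
To make the idea of "closeness" concrete, here is a small sketch with hand-picked 3-dimensional vectors. The numbers are invented for illustration and do not come from any trained model; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Hypothetical embeddings, chosen by hand so that the two colour words are close
embeddings = {
    "azul": [0.9, 0.1, 0.2],
    "anil": [0.8, 0.2, 0.3],
    "sol": [0.1, 0.9, 0.4],
}


def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_azul_anil = cosine(embeddings["azul"], embeddings["anil"])
sim_azul_sol = cosine(embeddings["azul"], embeddings["sol"])
print(f"azul vs anil: {sim_azul_anil:.2f}")
print(f"azul vs sol:  {sim_azul_sol:.2f}")
```

With these toy values, "azul" and "anil" score much higher than "azul" and "sol", which is the kind of structure a trained embedding space is expected to show.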


`Let's start with the Localist Approach`


#### Counting words


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus of Portuguese sentences
corpus = [
    "o céu é azul",
    'o mar é azul',
    'a bandeira é anil',
    'o sol é quente e bem que poderia ser azul',
    "o oceano é imenso e azul, assim como o mar, que também é azul",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer() 

# Fit the vectorizer to the corpus
# This step learns the vocabulary from the corpus
vectorizer.fit(corpus)

# Transform the corpus into a matrix of token counts
# Each row represents a sentence, each column represents a word
matrix = vectorizer.transform(corpus).toarray()

# Create a DataFrame from the matrix
# Columns are the words (features) learned by the vectorizer
# Rows are the original sentences (indexed by the corpus)
df = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out(), index=corpus)

# Display the resulting DataFrame
# This shows the frequency of each word in each sentence
df

In [None]:
import plotly.express as px
import numpy as np

# Create a heatmap visualization of the word-count (Bag-of-Words) matrix
fig = px.imshow(df, color_continuous_scale='blues')

# Customize the layout of the figure
fig.update_layout(
    title='Bag-of-Words Count Representations',
    title_x=0.5,  # Center the title
    autosize=False,
    width=1000,  # Set fixed width
    height=600,  # Set fixed height
    margin=dict(
        l=50,  # Left margin
        r=50,  # Right margin
        b=100,  # Bottom margin
        t=100,  # Top margin
        pad=4   # Padding between plot area and axis labels
    )
)

# Rotate x-axis labels for better readability
fig.update_xaxes(tickangle=90)

# Display the interactive plot
fig.show()

In [None]:
# Our vocabulary size
len(vectorizer.vocabulary_)

#### A Better Way of Counting: Relative Frequency

Using simple word counts has some disadvantages. Frequently occurring words dominate the vectors even when they carry little information, as we saw above with the word "azul" (blue). Moreover, rare words, though often highly relevant, receive little weight. To address these issues, we use TF-IDF vectorization.

The first part is TF, known as term frequency. It simply refers to the number of times the word occurs in the document divided by the total number of words in the document.

The second part is IDF, which stands for 'inverse document frequency'. It measures how rare the term is across the documents of the corpus: the fewer documents contain the term, the higher its IDF.


**$tf(t,d)$** is the frequency of the term **$t$** in the document **$d$**: the number of times $t$ appears in $d$, denoted $n_{t,d}$, divided by the total number of term occurrences in $d$, that is, $\sum_{k \in d} n_{k,d}$.


**$$tf(t,d) = {n_{t,d} \over \sum_{{k \in d}} n_{k,d}} $$**
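
The term-frequency formula can be sketched in a few lines of plain Python. The tokenizer here is an assumption: the regular expression `\w+` keeps one-letter words such as "o" and "é", whereas the default `CountVectorizer` pattern drops tokens shorter than two characters. The function name `term_frequencies` is introduced only for this example.

```python
import re
from collections import Counter


def term_frequencies(document):
    """Relative frequency of each token: its count divided by the total token count."""
    tokens = re.findall(r"\w+", document.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


tf = term_frequencies("o oceano é imenso e azul, assim como o mar, que também é azul")
print(tf["azul"])    # 2 occurrences out of 14 tokens
print(tf["oceano"])  # 1 occurrence out of 14 tokens
```

Because every count is divided by the same total, the term frequencies of a document always add up to 1.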


**$idf(t,D)$** is the inverse document frequency of the term **$t$** in the document set **$D$**. It is the logarithm of the total number of documents, $N$, divided by the number of documents that contain the term $t$, denoted $n_t$. Here $d$ is a document that belongs to $D$, and $t$ is a term that belongs to $d$.

**$$idf(t,D) = \log {N \over n_t}$$**

**$$d \in D, t \in d $$**


where **$N$** is the total number of documents and **$n_t$** is the number of documents containing the term $t$

Thus, the **TF-IDF** vector is calculated as follows:

**$$tfidf(t,d,D) = tf(t,d) * idf(t,D)$$**
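
The three formulas can be checked by hand on the sample corpus. This is a sketch of the textbook (unsmoothed) definition, not of what any particular library computes; the simple `\w+` tokenizer and the function names `tf`, `idf` and `tfidf` are assumptions made for the example.

```python
import math
import re
from collections import Counter

corpus = [
    "o céu é azul",
    "o mar é azul",
    "a bandeira é anil",
    "o sol é quente e bem que poderia ser azul",
    "o oceano é imenso e azul, assim como o mar, que também é azul",
]
tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]


def tf(term, tokens):
    """Occurrences of the term divided by the number of tokens in the document."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, documents):
    """log(N / n_t); unsmoothed, so it fails for a term that appears nowhere."""
    n_t = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / n_t)


def tfidf(term, tokens, documents):
    return tf(term, tokens) * idf(term, documents)


for term in ["é", "azul", "anil"]:
    print(term, round(idf(term, tokenized), 3))

# Weight of "anil" in the third sentence
print(round(tfidf("anil", tokenized[2], tokenized), 3))
```

The word "é" occurs in every sentence, so its IDF is $\log 1 = 0$ and it contributes nothing, while "anil", which occurs in a single sentence, gets the largest IDF of the three.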


##### Smoothing

The IDF of each word is constant per *corpus*. A smoothed version, adding 1 to the denominator, is applied to prevent division by 0 when the word is not present in the *corpus*. The idea of the IDF is to reduce the weight of frequent terms and increase the weight of rare terms, assuming that more frequent terms are not always more important. Therefore, the smoothed IDF equation is:

**$$idf(t,D) = \log {N \over 1 + n_t}$$**
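
A short sketch of why the extra 1 matters, using the formula above with $N = 5$ as in the sample corpus. The function names are introduced only for this example.

```python
import math


def idf_unsmoothed(n_docs, n_t):
    return math.log(n_docs / n_t)


def idf_smoothed(n_docs, n_t):
    # Adding 1 to the denominator, as in the smoothed formula
    return math.log(n_docs / (1 + n_t))


N = 5  # number of documents in the corpus

# A term that appears in no document breaks the unsmoothed version
try:
    idf_unsmoothed(N, 0)
except ZeroDivisionError:
    print("unsmoothed idf is undefined when n_t = 0")

print(idf_smoothed(N, 0))  # defined: log(5 / 1)
print(idf_smoothed(N, 5))  # negative: log(5 / 6)
```

One side effect worth noticing is that this form becomes slightly negative for a term that appears in every document, which is one reason libraries often use a variant of it.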


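The `scikit-learn` vectorizer applies smoothing automatically, although with a slightly different formula: by default (`smooth_idf=True`), `TfidfVectorizer` computes $idf(t) = \ln {1 + N \over 1 + n_t} + 1$, uses the raw term count as $tf$, and then L2-normalizes each document vector. Repeating what we did above with the `CountVectorizer`, but now with the `TfidfVectorizer`: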

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Sample corpus of Portuguese sentences
corpus = [
    "o céu é azul",
    'o mar é azul',
    'a bandeira é anil',
    'o sol é quente e bem que poderia ser azul',
    "o oceano é imenso e azul, assim como o mar, que também é azul",
]

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the corpus
# This step learns the vocabulary and idf weights
vectorizer.fit(corpus)

# Transform the corpus into a TF-IDF matrix and convert to a dense array
matrix = vectorizer.transform(corpus).toarray()

# Create a DataFrame from the TF-IDF matrix
# Columns are the unique words (features) from the corpus
# Rows are the original sentences (documents)
df = pd.DataFrame(
    matrix, 
    columns=vectorizer.get_feature_names_out(),  # Get feature names (words)
    index=corpus  # Use original sentences as index
)

# Display the resulting DataFrame
df
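
As a hedged check on how `scikit-learn` weights terms, the sketch below recomputes the IDF values by hand and compares them with the fitted `idf_` attribute. It assumes the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm="l2"`), under which the library documents the IDF as $\ln {1 + N \over 1 + n_t} + 1$. The names `check_vectorizer`, `doc_freq` and `manual_idf` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "o céu é azul",
    "o mar é azul",
    "a bandeira é anil",
    "o sol é quente e bem que poderia ser azul",
    "o oceano é imenso e azul, assim como o mar, que também é azul",
]

check_vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
check_matrix = check_vectorizer.fit_transform(corpus).toarray()

# Number of documents in which each vocabulary term appears
n_docs = len(corpus)
doc_freq = (check_matrix > 0).sum(axis=0)

# Smoothed IDF as documented by scikit-learn
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, check_vectorizer.idf_))

# Each row is L2-normalized, so its Euclidean norm is 1
row_norms = np.linalg.norm(check_matrix, axis=1)
print(np.allclose(row_norms, 1.0))
```

Because of the added constant and the row normalization, the values in the table differ from what the textbook formulas give, even though the ranking idea (rare terms weigh more) is the same.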

In [None]:
# Our vocabulary size
len(vectorizer.vocabulary_)

In [None]:
import plotly.express as px
import numpy as np

# Create a heatmap visualization of the TF-IDF matrix
fig = px.imshow(df, color_continuous_scale='blues')

# Customize the layout of the figure
fig.update_layout(
    title='TF-IDF Word Representations',  # Set the title of the plot
    title_x=0.5,  # Center the title
    autosize=False,  # Disable auto-sizing
    width=1000, height=600,  # Set specific dimensions for the plot
    margin=dict(l=50, r=50, b=100, t=100, pad=4)  # Adjust margins for better visibility
)

# Rotate x-axis labels for better readability
# This is useful when dealing with long word labels
fig.update_xaxes(tickangle=90)

# Display the interactive plot
# Note: This will show the plot in a Jupyter notebook or in a web browser
fig.show()

Both approaches above are **Bag-of-Words** (BoW) models, a foundational concept in natural language processing. The name comes from the way they disregard the order of words in a text, as if the words had been thrown into a bag. Rather than modeling sequence or context, these models treat every word in isolation and tally (or weight) how often each term occurs in the text.

These models are also **localist** models: each word is a discrete symbol with its own dimension in the vector, so the representation of one word says nothing about any other word, and the wider context of the sentence or paragraph in which it appears is ignored.

### Advantages of the Bag-of-Words Model

#### 1. Effective Simplicity
These models may be simple, but don't let that fool you — their performance can be surprisingly impressive. For fundamental tasks, a BoW model serves as a robust starting point likely to yield satisfactory outcomes.

#### 2. Quick Processing
Time-efficiency is another strong suit of this model. Both the training phase and usage of these models are relatively swift, making them a practical choice for various applications.

#### 3. Easy to Grasp
The simplicity of BoW models makes them easy to comprehend and explain—a critical aspect for natural language processing (NLP) models. This is especially true in legal domains where detailed jargon needs to be broken down efficiently.

#### 4. Straightforward Implementation
Developing a BoW model doesn't require extensive programming knowledge or complicated algorithms. They can be implemented easily thanks to their straightforward design.

---

### Disadvantages of the Bag-of-words Model

#### 1. Lack of Sequence Information
A key downside to the BoW model is its inability to account for word sequencing—an essential aspect of human language.
> As an example, consider the Portuguese sentences "Ele foi considerado culpado, não inocente" and "Ele foi considerado inocente, não culpado". These two hold entirely different meanings, but the discrepancy would remain undetected by a BoW model.
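
The example above can be verified with a minimal sketch that represents each sentence as a bag of word counts. The `\w+` tokenizer and the helper name `bag_of_words` are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Word counts for a sentence, ignoring case, punctuation and word order."""
    return Counter(re.findall(r"\w+", sentence.lower()))


guilty = bag_of_words("Ele foi considerado culpado, não inocente")
innocent = bag_of_words("Ele foi considerado inocente, não culpado")

print(guilty == innocent)  # True
```

Both sentences contain exactly the same words the same number of times, so any model that only sees these counts receives identical input for the two opposite verdicts.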

#### 2. No Semantic Understanding
These models are not capable of grasping the complex semantics or meanings behind words, limiting their applicability in more nuanced linguistic tasks.

#### 3. Sparse Representations
The vector representations produced by these models often contain numerous zeros, leading to inefficient resource usage.
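
A rough way to see the sparsity is to count the zero cells in the document-term matrix of the sample corpus. This sketch builds the matrix in plain Python with a simple `\w+` tokenizer (an assumption; it keeps one-letter words that `CountVectorizer` would drop by default).

```python
import re

corpus = [
    "o céu é azul",
    "o mar é azul",
    "a bandeira é anil",
    "o sol é quente e bem que poderia ser azul",
    "o oceano é imenso e azul, assim como o mar, que também é azul",
]
tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Document-term count matrix: one row per sentence, one column per word
count_matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

total_cells = len(count_matrix) * len(vocabulary)
zero_cells = sum(row.count(0) for row in count_matrix)
print(f"{zero_cells} of {total_cells} cells are zero ({zero_cells / total_cells:.0%})")
```

Even with five short sentences most cells are zero, and the proportion grows quickly as the vocabulary gets larger, which is why libraries typically store these matrices in a sparse format.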

#### 4. Large Vector Dimension
In the BoW model, the dimension of the vector equals the size of the vocabulary. Consequently, for large language data sets, vector dimensions can become unwieldy, posing computational challenges.

#### 5. Limited Word Similarity Detection
BoW models use one-hot vectors, which invariably fail to reflect similarities between words. For example, in Portuguese, "azul" (blue) and "anil" (indigo) are semantically related, but as far as a BoW model is concerned, these words are as distinct as "azul" and "sol" (sun).
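
A small sketch of this limitation, using a three-word toy vocabulary chosen for illustration: the cosine similarity between any two different one-hot vectors is zero, whatever the words mean.

```python
import numpy as np

vocabulary = ["anil", "azul", "sol"]
# Row i of the identity matrix is the one-hot vector of word i
one_hot = {word: np.eye(len(vocabulary))[i] for i, word in enumerate(vocabulary)}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine(one_hot["azul"], one_hot["anil"]))  # 0.0
print(cosine(one_hot["azul"], one_hot["sol"]))   # 0.0
```

The two related colours are exactly as dissimilar as a colour and the sun, because every pair of distinct one-hot vectors is orthogonal.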

#### 6. Difficulties with Unknown Words
Representing words not included in the initial vocabulary is challenging. For instance, consider a word like "COVID", which was virtually non-existent before 2019—how might a BoW model trained prior to that year represent this term?
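
The sketch below illustrates the problem with a hypothetical vocabulary fixed before the word existed. The vocabulary, the sentences and the helper name `vectorize` are all invented for the example.

```python
import re

# Hypothetical vocabulary learned at training time, before "covid" existed
vocabulary = ["a", "causa", "doença", "febre", "gripe"]
index = {word: i for i, word in enumerate(vocabulary)}


def vectorize(sentence):
    """Count known words; tokens outside the vocabulary are silently dropped."""
    vector = [0] * len(vocabulary)
    for token in re.findall(r"\w+", sentence.lower()):
        if token in index:
            vector[index[token]] += 1
    return vector


print(vectorize("a gripe causa febre"))  # [1, 1, 0, 1, 1]
print(vectorize("a covid causa febre"))  # [1, 1, 0, 1, 0]
```

The unknown word leaves no trace in the vector: the model sees the second sentence as if the word had never been written.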


> Did you know? The BoW model remains a popular choice for text classification tasks, such as sentiment analysis, spam detection, and topic classification. It works surprisingly well for these tasks because they often do not require a deep understanding of the text's meaning: the frequency of words is frequently enough.
> However, for more complex tasks, such as machine translation, the BoW model is not a viable option. So, what can we do to overcome these limitations? The answer lies in the **Word Embedding** model.

### Contextual Representation of Words and Documents in NLP

In Linguistics and Natural Language Processing (NLP), how can we interpret the meaning of a word or understand an entire document? The key lies within the **context**.

#### The Importance of Context in Word Meaning

The cornerstone of this approach is the _distributional hypothesis_. This hypothesis posits that words appearing in similar contexts tend to carry similar meanings.

> _An example to illustrate the distributional hypothesis:_
>
> Suppose we have a word `Kapfushnefrabeau`, whose meaning you don't initially know. However, based on different sentences where the word appears, you start to comprehend its possible meaning:
>
> 1. _There is a bottle of `Kapfushnefrabeau` on the table._
> 2. _People have been drinking `Kapfushnefrabeau` for centuries._
> 3. _`Kapfushnefrabeau` can be red, white, or rose._
> 4. _`Kapfushnefrabeau` is made from grapes._
> 5. _`Kapfushnefrabeau` is an alcoholic beverage._
>
> From these statements, you may conclude that `Kapfushnefrabeau` likely refers to some kind of wine.
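
The same intuition can be sketched with simple co-occurrence counts. The tiny corpus below is invented for the example: each word is described by the words that share a sentence with it, and two words are compared through the cosine of those count vectors. This is an illustration of the hypothesis, not of how any specific embedding model is trained.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: the unknown word shares its contexts with "wine"
sentences = [
    "people drink wine with dinner",
    "people drink kapfushnefrabeau with dinner",
    "wine is made from grapes",
    "kapfushnefrabeau is made from grapes",
    "people drive a car to work",
    "a car is made from steel",
]


def context_counts(target):
    """Count the words that appear in the same sentence as the target word."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(token for token in tokens if token != target)
    return counts


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


unknown = context_counts("kapfushnefrabeau")
wine = context_counts("wine")
car = context_counts("car")

print(f"unknown vs wine: {cosine(unknown, wine):.2f}")
print(f"unknown vs car:  {cosine(unknown, car):.2f}")
```

Without knowing what the word means, its context vector ends up much closer to that of "wine" than to that of "car".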

#### Transitioning from Word Embeddings to Document Embeddings

Previously, we learned about representing words as dense vectors, commonly known as **word embeddings**. Remember that word embeddings allow us to capture semantic properties of words by considering their context, rather than treating them in isolation.

Today, we extend that concept to entire documents. It is possible to represent whole documents as dense vectors too, known as **document embeddings**. We will explore how to create these using the same methods we applied for individual words.

As we progress through the course, we will also investigate more sophisticated techniques for creating document embeddings, such as Transformers. These advanced methods offer even greater insight into the context and meaning hidden within our data.

<!-- ## Topics to Discuss about Feature Extraction in a Class

1. **Introduction to Feature Extraction**: Definition, importance, and the role of feature extraction in Machine Learning.

2. **Types of Data in Machine Learning**: The difference between numerical and text data, and why different handling is required.

3. **Techniques for Feature Extraction**: Detailing methods like Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), including their working principles and applications.

4. **Practical Examples of Feature Extraction**: Discuss real-world examples where feature extraction is vital, such as in Natural Language Processing.

5. **Challenges in Feature Extraction**: Discuss possible issues like high dimensionality, overfitting, and how to address them.

6. **Feature Selection vs Feature Extraction**: Explore the differences and relationships between these two critical steps in Machine Learning.

7. **Advance Techniques**: Introduction to advanced techniques like Word Embedding (Word2Vec, GloVe) and their advantages over traditional methods.

8. **Hands-on Activity**: Implement a simple feature extraction on a given text dataset using Python libraries.

9. **Evaluation Metrics for Feature Extraction**: Understand how to evaluate and optimize the performance of feature extraction.

10. **Applications of Feature Extraction**: Explore domains like sentiment analysis, document classification, recommender systems, and more. -->

In [45]:
import fasttext
import numpy as np

# Load the pre-trained FastText model
fasttext_model = fasttext.load_model('data/bin/cc.pt.300.bin')

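# Lowercase each document and split it on whitespace
# (re-joining the tokens below also removes newlines, which get_sentence_vector does not accept)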
docs = [doc.lower().split() for doc in txts_train_cleaned]

# Get the document vectors
doc_vectors = np.zeros((len(docs), fasttext_model.get_dimension())) # initialize the matrix with zeros
for i, doc in enumerate(docs):
    doc_vectors[i] = fasttext_model.get_sentence_vector(' '.join(doc)) # get the vector for each document

In [None]:
import matplotlib.pyplot as plt  # For plotting graphs
from sklearn.manifold import TSNE  # Import t-SNE from scikit-learn

# Create the t-SNE model
# n_components=2: Specifies the number of dimensions to reduce the data to (2D in this case)
# random_state=271828: Ensures reproducibility of the results by using a fixed seed for the random number generator
model = TSNE(n_components=2, random_state=271828)

# Apply t-SNE to the document vectors to reduce their dimensionality
# doc_vectors: High-dimensional vectors representing documents
# tsne_features: 2D representation of the document vectors after applying t-SNE
tsne_features = model.fit_transform(doc_vectors)

# Plot the 2D representation of the documents
# tsne_features[:,0]: x-coordinates of the transformed data
# tsne_features[:,1]: y-coordinates of the transformed data
# alpha=0.5: Sets the transparency of the points to make overlapping points more visible
plt.scatter(tsne_features[:,0], tsne_features[:,1], alpha=0.5)
plt.show()  # Display the plot

In [None]:
# Apply k-means clustering to the data

# Load the KMeans class from scikit-learn
from sklearn.cluster import KMeans

# Instantiate a KMeans object with 2 clusters
kmeans = KMeans(n_clusters=2)

# Fit the KMeans object to the data
kmeans.fit(tsne_features)

# Get the cluster labels
labels = kmeans.labels_

# Plot the data points in the new 2D space with the cluster labels
plt.scatter(tsne_features[:, 0], tsne_features[:, 1], marker='o', c=labels, s=25, edgecolor='k')

# Plot the centroids of the clusters
# ('x' is an unfilled marker, so no edgecolor is passed; matplotlib would ignore it with a warning)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', s=100, c='r', label="Centroids")

plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.title('t-SNE transformed data with k-means clustering')
plt.legend()
plt.show()

In [None]:
# Get the positions of all documents assigned to cluster 0
labels_0 = np.where(labels == 0)[0]
print(labels_0)

# Get the positions of all documents assigned to cluster 1
labels_1 = np.where(labels == 1)[0]
print(labels_1)

In [None]:
for el in labels_0[:5]:
    print(txts_train_cleaned.iloc[el])  # el is a position from np.where, so use .iloc
    print('\n\n')

In [None]:
for el in labels_1[:5]:
    print(txts_train_cleaned.iloc[el])  # el is a position from np.where, so use .iloc
    print('\n\n')

### Using cosine similarity to find similar documents
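
Cosine similarity compares the direction of two vectors and ignores their length. For two document vectors, written here as $\mathbf{u}$ and $\mathbf{v}$ (symbols chosen for this note), it is defined as:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$$

The value ranges from $-1$ to $1$: vectors pointing in the same direction score $1$, and orthogonal vectors score $0$.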


In [51]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_vectors(vector, vectors, k=2):
    # Compute the cosine similarity between the input vector and all other vectors
    # The input vector is reshaped to (1, -1) to make it 2D for compatibility with cosine_similarity
    similarities = cosine_similarity(vector.reshape(1, -1), vectors)[0]
    
    # Sort the similarities in descending order and get the indices of the top k+1
    # We use k+1 because the most similar vector will be the input vector itself
    # [::-1] reverses the order to get descending, and [1:k+1] excludes the first (self) match
    top_indices = np.argsort(similarities)[::-1][1:k+1]
    
    # Return the indices of the top k most similar vectors and their similarity scores
    return top_indices, similarities[top_indices]

# Assume doc_vectors is a pre-existing array of document vectors
# Select the second document vector (index 1) as our query vector
v1 = doc_vectors[1]

# Find the 2 most similar vectors to v1 in doc_vectors
# most_similar_v1_idx will contain the indices of the similar vectors
# most_similar_v1_similarity will contain their corresponding similarity scores
most_similar_v1_idx, most_similar_v1_similarity = find_similar_vectors(v1, doc_vectors, k=2)

In [None]:
most_similar_v1_idx, most_similar_v1_similarity

In [None]:
txts_train_cleaned.iloc[1]

In [None]:
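# Most similar document to v1 (index taken from the result above instead of being hard-coded)
txts_train_cleaned.iloc[most_similar_v1_idx[0]]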

In [None]:
# Second most similar document to v1
txts_train_cleaned.iloc[most_similar_v1_idx[1]]

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cross_similarity(vectors):
    # Compute the cosine similarity between all pairs of vectors
    # This creates a square matrix where each cell [i,j] represents the similarity between vector i and j
    similarities = cosine_similarity(vectors)
    
    # Set the diagonal to zero to exclude self-similarity
    # This is because a vector is always 100% similar to itself, which we don't want to consider
    np.fill_diagonal(similarities, 0)
    
    return similarities

# Assuming doc_vectors is a 2D array where each row is a document vector
similarities = cross_similarity(doc_vectors)

# Find the highest similarity between two documents and their indices
max_similarity = np.max(similarities)
# np.where returns a tuple of arrays, one for each dimension
# We use [0][0] and [1][0] to get the first pair of indices where the max similarity occurs
max_similarity_indices = np.where(similarities == max_similarity)

print(f"The highest similarity is {max_similarity:.2f} between documents {max_similarity_indices[0][0]} and {max_similarity_indices[1][0]}")

# Find the 10 most similar document pairs and their indices
# The similarity matrix is symmetric, so (i, j) and (j, i) hold the same value;
# keeping only the upper triangle (i < j) counts each pair once
rows, cols = np.triu_indices_from(similarities, k=1)
pair_similarities = similarities[rows, cols]
# argsort sorts in ascending order, so the last 10 entries are the highest; [::-1] puts the best pair first
most_similar = np.argsort(pair_similarities)[-10:][::-1]
most_similar_indices = (rows[most_similar], cols[most_similar])

print(f"The 10 most similar document pairs are {list(zip(most_similar_indices[0].tolist(), most_similar_indices[1].tolist()))}")

In [None]:
# First document of the most similar pair found above
txts_train_cleaned.iloc[max_similarity_indices[0][0]]

In [None]:
# Second document of the most similar pair found above
txts_train_cleaned.iloc[max_similarity_indices[1][0]]

> The similarity between the two documents is striking. It is quite fascinating that we were able to detect it simply by averaging the vectors of the words in each document.
>
>
> - **How does averaging word vectors represent the document's meaning?**
>   - Averaging word vectors aggregates the semantic information of individual words, providing an integrated representation of the document’s content.
>
> - **Why is this method effective?**
>   - This method leverages the distributional hypothesis, which suggests that words appearing in similar contexts have similar meanings. By averaging these word vectors, we capture the general context and meaning of the document.
>
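
The averaging idea can be sketched with a handful of hand-picked word vectors. The vectors below are invented for illustration and are not taken from the FastText model; the plain mean used here is also a simplification of what a library may do internally.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, chosen by hand
word_vectors = {
    "céu": np.array([0.7, 0.1, 0.3]),
    "mar": np.array([0.6, 0.2, 0.4]),
    "azul": np.array([0.9, 0.1, 0.2]),
    "sol": np.array([0.1, 0.9, 0.4]),
    "quente": np.array([0.0, 0.8, 0.5]),
}


def document_vector(text):
    """Mean of the vectors of the known words in the text."""
    vectors = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    return np.mean(vectors, axis=0)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sky = document_vector("céu azul")
sea = document_vector("mar azul")
sun = document_vector("sol quente")

print(f"'céu azul' vs 'mar azul':   {cosine(sky, sea):.2f}")
print(f"'céu azul' vs 'sol quente': {cosine(sky, sun):.2f}")
```

Documents made of words with similar vectors end up with similar averages, which is why the two "blue" phrases score higher than the unrelated pair. The sketch assumes every document has at least one known word; an empty document would need separate handling.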

# Questions

1. What are the main stages of the Natural Language Processing (NLP) pipeline discussed in the class?

2. Why is text cleaning considered crucial in the NLP pipeline?

3. Can you list and briefly explain three common text preprocessing techniques?

4. What is the purpose of feature extraction in NLP, and which methods were covered in the class?

5. How can the choice of preprocessing techniques affect the performance of machine learning models?

6. Why is it important to manage the vocabulary size in NLP tasks, and what are some strategies mentioned in the class?

7. Explain the concept of the distributional hypothesis and its significance in understanding word meanings.

8. What are stop words, and how does their usage affect NLP tasks?

9. What is the difference between stemming and lemmatization, and why might one be preferred over the other in certain scenarios?

10. Discuss the advantages of using word embeddings over traditional Bag-of-Words models. How do these embeddings enhance the NLP pipeline?

`Answers are commented inside this cell`

<!-- 1. The main stages of the Natural Language Processing (NLP) pipeline discussed in the class are Data Collection, Text Cleaning, Preprocessing, Feature Extraction, Modeling, Evaluation, Deployment, and Maintenance and Monitoring.

2. Text cleaning is crucial in the NLP pipeline because it improves model performance, efficiency, and accuracy by removing noise and irrelevant details from the dataset. This ensures that the algorithms process only meaningful and relevant data.

3. Three common text preprocessing techniques are tokenization, stemming/lemmatization, and removing stop words. Tokenization involves splitting text into words or phrases. Stemming and lemmatization reduce words to their base or root forms. Removing stop words eliminates common words like 'is', 'an', 'the' that do not carry significant meaning.

4. The purpose of feature extraction in NLP is to transform text data into numerical formats that machine learning algorithms can understand. The methods covered in the class include Bag-of-Words (BoW), which represents text by the frequency of words, and TF-IDF (Term Frequency-Inverse Document Frequency), which adjusts the frequency of words by how commonly they appear across documents.

5. The choice of preprocessing techniques can significantly impact the performance of machine learning models. Proper preprocessing can enhance model accuracy and efficiency by ensuring that only relevant and clean data is used. Conversely, poor preprocessing can introduce noise and irrelevant data, degrading model performance.

6. Managing vocabulary size is important because a larger vocabulary can lead to higher dimensionality and computational complexity. Strategies mentioned in the class include limiting the vocabulary to the top 'N' most frequent words and using subword tokenization to handle out-of-vocabulary words and reduce vocabulary size.

7. The distributional hypothesis suggests that words appearing in similar contexts have similar meanings. This hypothesis is significant because it allows us to understand the meaning of words based on their context, which is the foundation for techniques like word embeddings.

8. Stop words are common words that are filtered out before or after processing natural language data. Their usage affects NLP tasks by reducing the dimensionality of the data and focusing on more meaningful words, which can improve the efficiency and performance of the models.

9. Stemming reduces words to a stem by stripping affixes with heuristic rules, often resulting in non-dictionary words (e.g., "studies" to "studi"). Lemmatization reduces words to their base or dictionary form, considering the context (e.g., "studies" to "study"). Lemmatization is often preferred over stemming in scenarios where the meaning and context of words are important, as it provides more accurate and meaningful roots.

10. Word embeddings offer several advantages over traditional Bag-of-Words models. Embeddings capture the semantic relationships between words, allowing similar words to have similar vector representations. Unlike BoW, which results in sparse vectors, embeddings provide dense vectors that are computationally efficient. Embeddings consider the context in which words appear, leading to better performance in tasks like text classification, sentiment analysis, and machine translation. Embeddings can also generalize to new words based on their context, overcoming the limitations of BoW models. -->