<img src="data/images/lecture-notebook-header.png" />

# Sentiment Analysis -- Data Preparation

When working with text data in machine learning, it's often a good idea to treat the transformation of the raw input corpus into training and test sets that are valid input for the neural network as a separate step. This is particularly true if the corpus is huge. One of the datasets we consider consists of 50,000 movie reviews, annotated with positive or negative labels. While this dataset is far from huge, you will notice that it takes some time to preprocess.

Additionally, since preprocessing already requires making certain design choices (e.g., restricting the vocabulary to the most frequent words, applying stemming or lemmatization, or removing stopwords), it is often useful in practice to create several converted versions of the text documents.
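
To make the idea of "several conversions" concrete, here is a minimal sketch that produces two variants of the same text from one configurable function. It is only an illustration: `simple_preprocess`, its flags, and the tiny `STOPWORDS` set are hypothetical stand-ins, and a regular expression replaces the spaCy pipeline used in the rest of this notebook.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and"}

def simple_preprocess(text, lowercase=True, remove_stopwords=False):
    # Stand-in tokenizer: runs of letters, digits, or apostrophes
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens

text = "The plot of the movie is a MESS"
variant_a = simple_preprocess(text)                         # keeps stopwords
variant_b = simple_preprocess(text, remove_stopwords=True)  # drops stopwords
print(variant_a)
print(variant_b)
```

Each variant could then be vectorized and saved under its own file name, so that later notebooks can compare them without repeating the preprocessing.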


## Setting up the Notebook

### Required Imports

In [None]:
import re, glob
from tqdm import tqdm
from collections import Counter, OrderedDict

In [None]:
import torch
import torchtext

from torchtext.vocab import vocab

Lastly, `src/utils.py` provides some utility methods to download and decompress files. Since the datasets used in some of the notebooks are of considerable size -- although far from huge -- they are not part of the repository and need to be downloaded (and optionally decompressed) separately. The 2 methods `download_file` and `decompress_file` accomplish this for convenience.

In [None]:
from src.utils import download_file, decompress_file

In [None]:
import spacy

spacy.prefer_gpu()
# We use spaCy for preprocessing, but we only need the tokenizer and lemmatizer
# (for a large real-world dataset that would help with the performance)
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser'])

We consider 2 datasets of different sizes for sentiment analysis (binary classification), although even the larger dataset is still rather small:
* 10k sentences with a positive or negative sentiment (balanced)
* 50k multi-sentence movie reviews with a positive or negative sentiment (balanced)

---

## Sentence Polarity

The [sentence polarity dataset](https://www.kaggle.com/datasets/nltkdata/sentence-polarity) is a well-known dataset commonly used for sentiment analysis and text classification tasks in NLP. It consists of sentences or short texts labeled with their corresponding sentiment polarity (positive or negative). This dataset is often used to train and evaluate models that aim to classify text into positive or negative sentiment categories. It serves as a benchmark for sentiment analysis tasks and provides a standardized dataset for researchers and practitioners to compare and evaluate the performance of different algorithms and techniques.

There are several versions and variations of the sentence polarity dataset available, created for different purposes and domains. One of the popular versions is the Movie Review Dataset, also known as the Pang and Lee dataset, created by Bo Pang and Lillian Lee. This dataset contains movie reviews from the website IMDb, with each review labeled as positive or negative. The sentence polarity dataset enables researchers and developers to build and test sentiment analysis models that can automatically determine the sentiment expressed in text, allowing applications such as sentiment monitoring, opinion mining, and customer feedback analysis.

For this notebook, we have already prepared the dataset by combining the 2 files containing the positive and negative sentences into a single file. The polarity of each sentence is denoted by a polarity label: `1` for positive and `-1` for negative. This makes handling the data a bit simpler and keeps the notebook a bit cleaner.

#### Auxiliary Method

The method `preprocess()` tokenizes a given text. In this case, we not only tokenize but also lemmatize and lowercase all tokens. The exact list of preprocessing steps depends in practice on the task at hand, but this is what we do here. Notice that we do not, for example, remove stopwords; this is mainly to avoid reducing the vocabulary size too much.

In [None]:
def preprocess(text):
    return [token.lemma_.lower() for token in nlp(text)]

preprocess("This is a test to see if the TOKENIZER does its job.")

#### Read Files & Compute Word Frequencies

The first step is to go through the whole corpus and count the number of occurrences of each token. 10k sentences is basically nothing these days, but the purpose of this notebook is not to focus on large-scale data; the steps would be exactly the same.

In [None]:
token_counter = Counter()

targets_polarity = []

with tqdm(total=10662) as pbar:
    
    # Loop over each sentence (1 sentence per line)
    with open('data/datasets/sentence-polarities/sentence-polarities.csv', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            parts = line.split('\t')
            sentence, label = parts[0], int(parts[1])
            # Update token counts
            for token in preprocess(sentence):
                token_counter[token] += 1            
            # Add label to targets list
            targets_polarity.append(label)
            # Update progress bar
            pbar.update(1)
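
The counting logic can be tried on a few lines without loading the full corpus. The sketch below assumes the same `sentence<TAB>label` layout as the file read above; the two sample lines are made up, and a plain whitespace split stands in for the spaCy-based `preprocess()`.

```python
from collections import Counter

# Hypothetical sample lines in the "sentence<TAB>label" layout
sample_lines = [
    "a gripping , funny film\t1\n",
    "a dull , lifeless film\t-1\n",
]

counter = Counter()
labels = []
for line in sample_lines:
    sentence, label = line.rstrip("\n").split("\t")
    # Whitespace split stands in for preprocess(); update() counts all tokens at once
    counter.update(sentence.split())
    labels.append(int(label))

print(counter["film"], labels)
```

`Counter.update()` with a list of tokens is equivalent to incrementing the count token by token, as done in the loop above.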

#### Create Vocabulary

To create our `vocab` object, we sort the tokens by frequency and keep only the most frequent ones. With fewer than 20k unique tokens, the "full" vocabulary is still rather small; we nevertheless limit the vocabulary here to the 10,000 most frequent tokens.


In [None]:
# Sort by word frequency
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)

print("Number of tokens: {}".format(len(token_counter_sorted)))

In [None]:
TOP_TOKENS = 10000

token_counter_sorted = token_counter_sorted[:TOP_TOKENS]

print("Number of tokens: {}".format(len(token_counter_sorted)))

In [None]:
token_ordered_dict = OrderedDict(token_counter_sorted)

# Define list of "special" tokens
SPECIALS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]

vocabulary = vocab(token_ordered_dict, specials=SPECIALS)

vocabulary.set_default_index(vocabulary["<UNK>"])

print("Number of tokens: {}".format(len(vocabulary)))
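
To see what the vocabulary provides conceptually, the following plain-Python sketch mimics the behavior this notebook relies on: special tokens occupy the first indices, the most frequent tokens follow, and unknown tokens map to the index of `<UNK>`. The helpers `build_vocab` and `lookup_indices` are hypothetical and are not part of `torchtext`; they only illustrate the mapping.

```python
from collections import Counter

def build_vocab(token_counts, top_k, specials):
    # Specials come first, so "<PAD>" gets index 0 and "<UNK>" index 1
    most_frequent = [tok for tok, _ in token_counts.most_common(top_k)]
    itos = list(specials) + most_frequent
    return {token: idx for idx, token in enumerate(itos)}

def lookup_indices(stoi, tokens, unk_token="<UNK>"):
    # Out-of-vocabulary tokens fall back to the index of the unknown token
    unk_idx = stoi[unk_token]
    return [stoi.get(tok, unk_idx) for tok in tokens]

counts = Counter("the film be good the film be bad the end".split())
stoi = build_vocab(counts, top_k=3, specials=["<PAD>", "<UNK>", "<SOS>", "<EOS>"])
indices = lookup_indices(stoi, ["the", "film", "be", "awful"])
print(indices)  # [4, 5, 6, 1]
```

The token `awful` never occurs in the toy corpus, so it is mapped to index 1, the position of `<UNK>`.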

### Save Dataset

Lastly, we save all the data for later use.

#### Vectorize and Save Dataset

In [None]:
output_file = open("data/datasets/sentence-polarities/polarity-dataset-vectors-{}.txt".format(TOP_TOKENS), "w")

with tqdm(total=10662) as pbar:
    
    # Loop over each sentence (1 sentence per line)
    with open('data/datasets/sentence-polarities/sentence-polarities.csv', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            parts = line.split('\t')
            sentence, label = parts[0], int(parts[1])
            # Convert labels from -1/1 to 0/1
            label = int((label + 1) / 2)
            # Convert sentence into sequence of word indices
            vector = vocabulary.lookup_indices(preprocess(sentence))
            # Write converted sequence and label to file
            output_file.write("{}\t{}\n".format(" ".join([str(idx) for idx in vector]), label))
            # Update progress bar
            pbar.update(1)

output_file.flush()
output_file.close()            
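
Each line of the generated file holds the space-separated token indices, a tab, and the 0/1 label. As a minimal sketch of how a later notebook might read this format back, the hypothetical helper `read_vector_lines` below parses such lines; an in-memory string stands in for the real file.

```python
import io

def read_vector_lines(lines):
    # Each line: space-separated token indices, a tab, then the 0/1 label
    sequences, labels = [], []
    for line in lines:
        indices, label = line.rstrip("\n").split("\t")
        sequences.append([int(idx) for idx in indices.split()])
        labels.append(int(label))
    return sequences, labels

# The label conversion from -1/1 to 0/1, written with integer division
converted = [(label + 1) // 2 for label in (-1, 1)]

# In-memory stand-in for a generated file (made-up indices)
fake_file = io.StringIO("4 17 9 1\t1\n8 1 23\t0\n")
sequences, labels = read_vector_lines(fake_file)
print(sequences, labels, converted)
```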

#### Save Metadata

In [None]:
vocabulary_file_name = "data/datasets/sentence-polarities/polarity-corpus-{}.vocab".format(TOP_TOKENS)

torch.save(vocabulary, vocabulary_file_name)

---

## IMDb Movie Reviews

The [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), commonly known as the IMDb dataset or IMDb movie reviews dataset, is a widely used benchmark dataset in natural language processing (NLP) and sentiment analysis. Created by Andrew Maas and a group of researchers at Stanford University, this dataset consists of movie reviews collected from IMDb (Internet Movie Database).

Here are the key characteristics of the Large Movie Review Dataset:

* **Data Size:** It contains a collection of 50,000 movie reviews.

* **Review Split:** The dataset is evenly divided into two sets:
    * 25,000 reviews for training
    * 25,000 reviews for testing

* **Sentiment Labels:** Each review is labeled with sentiment polarity:
    * 50% of reviews are labeled as positive
    * 50% of reviews are labeled as negative

* **Binary Classification Task:** The dataset is commonly used for binary sentiment classification tasks, where the goal is to classify whether a review expresses positive or negative sentiment.

This dataset serves as a standard benchmark for sentiment analysis and text classification algorithms, enabling researchers and developers to evaluate and compare the performance of different machine learning and deep learning models in sentiment classification tasks. The availability of labeled data in large quantities allows for the training and evaluation of models to predict sentiment accurately, making it a valuable resource in the field of natural language processing and sentiment analysis research.

Given its size, the dataset is not included in the GitHub repository. You can either download the dataset yourself using the link above, or you can first run the notebook "Representations (Word2Vec - Data Preparation)", which downloads the dataset for you.

In [None]:
folders_train = [
    'data/datasets/imdb-reviews/aclImdb/train/pos',
    'data/datasets/imdb-reviews/aclImdb/train/neg'  
]

folders_test = [
    'data/datasets/imdb-reviews/aclImdb/test/pos',
    'data/datasets/imdb-reviews/aclImdb/test/neg'  
]

### Auxiliary Method for Data Cleaning & Preprocessing

The method below takes a single review file as input and returns all valid tokens as a list. It removes all punctuation marks, and lemmatizes and lowercases the remaining tokens. Stopwords are kept in the active version; the commented-out line shows a variant that also removes them. Recall from the lecture how preprocessing affects the learning of word embeddings; here we want to keep it simple and limit the vocabulary, i.e., the number of unique tokens.

Since the movie reviews can include HTML tags, we remove those as well using a regular expression. Again, everything here is kept to a bare minimum to keep things short and simple. Feel free to put more thought into potentially better preprocessing steps.

In [None]:
def process_file(file_name):
    text = None
    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read().replace('\n', '')
        
    if text is None:
        return

    ## Remove HTML tags
    p = re.compile(r'<.*?>')
    text = p.sub(' ', text)
    
    ## Let spaCy do its magic
    doc = nlp(text)
    
    ## Return "proper tokens" (lemmatize, lowercase)
    ##return [ t.lemma_.lower() for t in doc if t.pos_ not in ['PUNCT'] and t.dep_ not in ['punct'] and t.lemma_.strip() != '' and t.is_stop == False ]
    return [ t.lemma_.lower() for t in doc if t.pos_ not in ['PUNCT'] and t.dep_ not in ['punct'] and t.lemma_.strip() != '']

process_file("data/datasets/imdb-reviews/aclImdb/train/pos/0_9.txt")
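
The tag removal can be tried in isolation. This small sketch uses the same non-greedy pattern as `process_file()` on a made-up review snippet; the helper name `strip_tags` is hypothetical.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags(text):
    # Replace each tag with a space so neighboring words don't merge
    return TAG_PATTERN.sub(' ', text)

raw = "A great film.<br /><br />I loved it."
cleaned = strip_tags(raw)
print(cleaned.split())
```

Note that the pattern removes anything between a `<` and the next `>`, so a text such as `3 < 5 and 7 > 2` would lose the part between the two symbols. For the purpose of this notebook we accept this limitation.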

### Process Review Files

The code cell below iterates over all text files representing the movie reviews in the specified training folders (see above). For each review, we first extract all the tokens using `process_file()`. This returns the list of relevant tokens for this review, which we use to update the token counts across all reviews.

For each token, we also keep track of its count. We only need this to later create the final vocabulary by only looking at the top-k (e.g., top-20k most frequent) words.

For testing, it's recommended to use a lower value for `num_reviews` (e.g., 1000) to check that this and the other notebooks are working (of course, the results won't be great). Once you think all is good, you can set `num_reviews` to a very large integer, as in the cell below, to work on the whole dataset.

In [None]:
# Limit the number of reviews taken from each folder
num_reviews = 1000000000000

token_counter = Counter()
    
## Loop through all folders and files
for folder in folders_train:

    ## Get all file names and limit as specified
    file_names = sorted(glob.glob('{}/*.txt'.format(folder)))[:num_reviews]
    
    with tqdm(total=len(file_names)) as t:
        ## Loop over each file (1 file = 1 review)
        for file_name in file_names:
            ## Extract tokens from file/review
            tokens = process_file(file_name)
            ## Update token counter
            for token in tokens:
                token_counter[token] += 1
            ## Update progress bar
            t.update(1)

            
print('Size of Vocabulary: {}'.format(len(token_counter)))
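
The limit on the number of reviews relies on Python's slicing rules, which the following sketch illustrates with a made-up list of file names. It shows that a very large integer works because the slice end is clamped to the list length, and that `None` is an alternative way to say "no limit".

```python
file_names = ["0_9.txt", "1_7.txt", "2_9.txt", "3_10.txt"]  # hypothetical names

limited = file_names[:2]           # first two entries only
everything = file_names[:None]     # None as the stop value keeps the whole list
oversized = file_names[:10**12]    # a stop beyond the end is silently clamped

print(len(limited), len(everything), len(oversized))
```

A float such as `float('inf')` is not a valid slice bound and raises a `TypeError`, which is why an integer (or `None`) is needed here.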

#### Create Vocabulary

To create our `vocab` object, we perform exactly the same steps as above. The only difference is that our "full" vocabulary is now larger. We therefore limit the vocabulary here to the 20,000 most frequent tokens.

In [None]:
TOP_TOKENS = 20000

# Sort with respect to frequencies
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)

token_ordered_dict = OrderedDict(token_counter_sorted[:TOP_TOKENS])

In [None]:
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
SOS_TOKEN = "<SOS>"
EOS_TOKEN = "<EOS>"

SPECIALS = [PAD_TOKEN, UNK_TOKEN, SOS_TOKEN, EOS_TOKEN]

vocabulary = vocab(token_ordered_dict, specials=SPECIALS)

vocabulary.set_default_index(vocabulary[UNK_TOKEN])

print("Number of tokens: {}".format(len(vocabulary)))
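
When choosing a value such as `TOP_TOKENS`, it can help to check how many token occurrences the most frequent tokens actually cover. The following is a minimal sketch with made-up counts; the helper `coverage` is hypothetical, and with the real data you would pass the `token_counter` computed above.

```python
from collections import Counter

def coverage(token_counts, top_k):
    # Fraction of all token occurrences that the top_k tokens account for
    total = sum(token_counts.values())
    kept = sum(count for _, count in token_counts.most_common(top_k))
    return kept / total if total else 0.0

counts = Counter({"the": 50, "film": 30, "good": 15, "zany": 4, "quux": 1})
for k in (1, 3, 5):
    print(k, round(coverage(counts, k), 2))
```

All occurrences not covered by the selected tokens end up as `<UNK>` in the vectorized dataset.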

### Save Dataset

Lastly, we again save all the data for later use.

#### Vectorize and Save Dataset

To preserve the split between the original training and test data, we save the data in 2 separate files.

In [None]:
output_file = open("data/datasets/imdb-reviews/imdb-dataset-train-vectors-{}.txt".format(TOP_TOKENS), "w")

## Loop through all folders and files (1 file = 1 review)
for label, folder in enumerate(folders_train):

    ## Get all file names and limit as specified
    file_names = sorted(glob.glob('{}/*.txt'.format(folder)))[:num_reviews]
    
    with tqdm(total=len(file_names)) as t:
        ## Loop over each file (1 file = 1 review)
        for file_name in file_names:
            ## Extract tokens from file/review
            tokens = process_file(file_name)
            vector = vocabulary.lookup_indices(tokens)
            # Write index sequence and label to the output file (use tab as separator)
            output_file.write("{}\t{}\n".format(" ".join([str(idx) for idx in vector]), label))
            ## Update progress bar
            t.update(1)            
        
output_file.flush()
output_file.close()

In [None]:
output_file = open("data/datasets/imdb-reviews/imdb-dataset-test-vectors-{}.txt".format(TOP_TOKENS), "w")

## Loop through all folders and files (1 file = 1 review)
for label, folder in enumerate(folders_test):

    ## Get all file names and limit as specified
    file_names = sorted(glob.glob('{}/*.txt'.format(folder)))[:num_reviews]
    
    with tqdm(total=len(file_names)) as t:
        ## Loop over each file (1 file = 1 review)
        for file_name in file_names:
            ## Extract tokens from file/review
            tokens = process_file(file_name)
            vector = vocabulary.lookup_indices(tokens)
            # Write index sequence and label to the output file (use tab as separator)
            output_file.write("{}\t{}\n".format(" ".join([str(idx) for idx in vector]), label))
            ## Update progress bar
            t.update(1)            
        
output_file.flush()
output_file.close()
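
The two cells above differ only in the folder list and the output file name, so the shared logic could be wrapped in a single function. The sketch below is one possible way to do this, not the notebook's actual implementation: `vectorize_folders` is a hypothetical helper that receives the tokenizer and the index lookup as arguments (here replaced by toy stand-ins) and works on the text of each file rather than the file name. It keeps the convention that the position of a folder in the list is its class label.

```python
import glob
import os
import tempfile

def vectorize_folders(folders, output_path, tokenize, lookup, limit=None):
    # The position of a folder in the list doubles as its class label
    with open(output_path, "w", encoding="utf-8") as out:
        for label, folder in enumerate(folders):
            for file_name in sorted(glob.glob(os.path.join(folder, "*.txt")))[:limit]:
                with open(file_name, encoding="utf-8") as f:
                    tokens = tokenize(f.read())
                indices = lookup(tokens)
                out.write("{}\t{}\n".format(" ".join(map(str, indices)), label))

# Toy demonstration with a stand-in tokenizer and lookup
with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("pos", "good film"), ("neg", "bad film")):
        os.makedirs(os.path.join(tmp, name))
        with open(os.path.join(tmp, name, "0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    toy_vocab = {"<UNK>": 1, "good": 4, "film": 5}
    out_path = os.path.join(tmp, "vectors.txt")
    vectorize_folders(
        [os.path.join(tmp, "pos"), os.path.join(tmp, "neg")],
        out_path,
        tokenize=str.split,
        lookup=lambda toks: [toy_vocab.get(t, 1) for t in toks],
    )
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

Using `with open(...)` also ensures the output file is closed even if an error occurs while processing a review.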

#### Save Metadata

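We only need to save the vocabulary since the class labels are already 0 and 1. Note that the labels follow the order of the folder lists above: 0 denotes a positive review (`pos` folder) and 1 a negative review (`neg` folder).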

In [None]:
vocabulary_file_name = "data/datasets/imdb-reviews/imdb-corpus-{}.vocab".format(TOP_TOKENS)

torch.save(vocabulary, vocabulary_file_name)

---

## Pretrained Word Embeddings

Word embeddings are numerical representations of words in a continuous vector space. Pretrained word embeddings are vector representations of words that are derived from large corpora of text using unsupervised learning techniques. These embeddings capture semantic and syntactic information about words in a dense vector space, where words with similar meanings or contexts are located closer to each other.

Once trained, these word embeddings can be reused in various downstream natural language processing (NLP) tasks, such as text classification, named entity recognition, *sentiment analysis*, and machine translation. By utilizing pretrained word embeddings, models can leverage the learned semantic relationships between words and benefit from transfer learning. Pretrained word embeddings have become popular because they offer several advantages. First, they capture rich semantic information that might be challenging to learn from smaller task-specific datasets. Second, they can help overcome the data sparsity problem, especially when dealing with rare words or out-of-vocabulary (OOV) terms. Lastly, pretrained word embeddings enable faster convergence and improved generalization for downstream NLP tasks.

Examples of popular pretrained word embeddings include *Word2Vec*, GloVe (Global Vectors for Word Representation), and fastText. These embeddings are typically available in prebuilt formats and can be readily loaded into models to enhance their performance on various NLP tasks. In later notebooks, we will actually train Word2Vec embeddings from scratch.

The notebook introducing and implementing an RNN-based model for sentiment analysis includes an optional step to utilize pretrained word embeddings. Such embeddings based on Word2Vec, GloVe, or fastText are available online for download. For example, [http://vectors.nlpl.eu/repository/](http://vectors.nlpl.eu/repository/) is an online repository for pretrained word embeddings. In the code cells below, we download and decompress the ZIP file containing the embeddings used for training the sentiment analysis model. Note that these word embeddings have been trained on a lemmatized text corpus. This matches -- and has to match -- the preprocessing steps of the movie reviews.

In [None]:
print('Download file...')
download_file('http://vectors.nlpl.eu/repository/20/5.zip', 'data/embeddings/')
print('Decompress file...')
decompress_file('data/embeddings/5.zip', 'data/embeddings/')
print('DONE.')
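
Pretrained embeddings are often distributed in the word2vec text format: a header line with the number of words and the vector dimension, followed by one line per word. The sketch below assumes this format and uses an in-memory string with made-up values instead of the downloaded archive; the helper `load_text_embeddings` is hypothetical. The entries in a real file may look different (for example, words may carry additional annotations), so it is worth inspecting the first few lines before relying on a parser like this.

```python
import io

def load_text_embeddings(lines, wanted):
    lines = iter(lines)
    # First line of the word2vec text format: "<number of words> <dimension>"
    header = next(lines).split()
    dim = int(header[1])
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Keep only words from our vocabulary with a complete vector
        if word in wanted and len(values) == dim:
            vectors[word] = [float(v) for v in values]
    return dim, vectors

fake = io.StringIO("3 2\nfilm 0.1 0.2\ngood -0.3 0.4\nzebra 0.5 0.6\n")
dim, vectors = load_text_embeddings(fake, wanted={"film", "good", "awful"})
print(dim, sorted(vectors))
```

Restricting the loaded vectors to the words of the dataset vocabulary keeps memory usage low, since the full embedding file typically covers far more words than the 20,000 tokens used here.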

---

## Summary

While we didn't do anything exciting here, this notebook has a couple of useful take-away messages:

* For large(r) text corpora, it is good practice to treat the preprocessing and preparation of the final dataset (incl. vectorization) as a separate step. This step requires careful consideration and can be very time- and resource-intensive on its own, even without training any neural network models. In the follow-up notebooks, we will utilize the datasets generated in this notebook.

* Even when using the same corpus, different tasks are likely to require different preprocessing steps. For example, we lemmatized the data here for sentiment analysis (which is debatable), whereas the data should arguably not be lemmatized when training language models.

* The preprocessing and vectorization of text corpora generally involve the same steps. Using well-established packages such as `torchtext` is highly recommended, since the provided methods are mostly very flexible. Custom implementations should only be required for very non-standard preprocessing and vectorization steps.
