<img src='data/images/section-notebook-header.png' />

# Data Preparation: Sentiment Analysis + Language Models

Training neural language models from scratch is data intensive. We therefore do not include the raw dataset in the repository; instead, this notebook generates the training data from the raw data. This also allows you (a) to modify the preprocessing steps and (b) to use a completely different corpus. A different corpus would of course require some changes to the code to accommodate its folder and file structure, but the subsequent steps of generating the training dataset should remain more or less the same.

## Setting up the Notebook

### Import Required Packages

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
import random
import re
import os

from collections import Counter, OrderedDict
from tqdm import tqdm

We utilize some utility methods from PyTorch as well as Torchtext, so we need to import the `torch` and `torchtext` packages.

In [3]:
import torch
import torchtext
from torchtext.vocab import vocab

As usual, we rely on spaCy to perform basic text preprocessing and cleaning steps, mainly tokenization and lemmatization.

In [4]:
import spacy

# Tell spaCy to use the GPU (if available)
spacy.prefer_gpu()

nlp = spacy.load("en_core_web_sm")

Lastly, `src/utils.py` provides some utility methods to download and decompress files. Since the datasets used in some of the notebooks are of considerable size -- although far from huge -- they are not part of the repository and need to be downloaded (and optionally decompressed) separately. The 2 methods `download_file` and `decompress_file` accomplish this for convenience.

In [5]:
from src.utils import download_file, decompress_file

**Important:** The code cells below that download the files naturally include the URLs of those files. However, there is always the chance that one of those files might be removed or renamed, in which case the URL will no longer be valid. In this case, it is recommended to search for alternative links using, e.g., Google or Bing, which should cause no problems as all datasets used here are generally widely available.

---

## Download Dataset

The [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), commonly known as the IMDb dataset, is a widely used dataset for sentiment analysis and text classification tasks. It was created by Andrew Maas and his team at Stanford University and is freely available for research purposes. The dataset consists of movie reviews collected from the IMDb website, a popular online movie database. The reviews are labeled with sentiment polarity, indicating whether the review expresses a positive or negative sentiment towards the movie. The dataset is often used to train and evaluate machine learning models for sentiment analysis.

The dataset contains a total of 50,000 labeled movie reviews, divided into 25,000 reviews for training and 25,000 reviews for testing. Each set is further split into an equal number of positive and negative reviews, so the distribution of sentiments is balanced. In addition, the training folder contains 50,000 unlabeled reviews. The reviews in the dataset are stored as individual text files, with each file representing a single review. The directory structure organizes the reviews into separate folders for positive and negative sentiments.

The Large Movie Review Dataset has been widely used in natural language processing (NLP) research and has served as a benchmark for sentiment analysis tasks. It offers a valuable resource for training and evaluating models in the field of sentiment analysis and text classification.

The code cell below should download and decompress the dataset. We recommend using the given `target_path` as this won't require any additional changes in subsequent code cells.

In [6]:
print('Download file...')
download_file('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', target_path='data/corpora/imdb-reviews/')
print('Decompress file...')
decompress_file('data/corpora/imdb-reviews/aclImdb_v1.tar.gz', target_path='data/corpora/imdb-reviews/')
print('DONE.')

Download file...


100%|███████████████████████████████████████████████████████████████████████████████████████████| 84.1M/84.1M [01:49<00:00, 770kiB/s]


Decompress file...
DONE.


---

## Dataset Preparation: Language Model

**Important Comments:**

* Using the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) to train a language model has of course its limitations. Firstly, the dataset is very small for this task. Secondly, and maybe more importantly, using a dataset from a specific domain (i.e., movie reviews) limits the model's applicability to this domain. However, the focus here is to go through some of the basic steps and not to build a state-of-the-art language model.

* The goal in later notebooks is building a simple language model to generate single sentences, not paragraphs or even beyond. Hence each data sample we generate will reflect a single sentence. For training very large language models to generate paragraphs, samples are typically chunks of text containing multiple sentences that might even be arbitrarily cut off.

### Auxiliary Method for Data Preprocessing

The method `process_file_lm` reads a single review file and turns it into a list of tokenized sentences for training a language model. Since the movie reviews can include HTML tags, we remove those using a regular expression. Apart from that, we only lowercase all tokens. Since we want to use the data for training a language model, we do not perform additional steps such as lemmatization or stopword removal.

In [7]:
def process_file_lm(file_name):
    
    text = None
    with open(file_name, 'r') as file:
        text = file.read().replace('\n', '')
        
    # Just a fail-safe if anything is off here
    if text is None:
        return

    ## Remove HTML tags
    p = re.compile(r'<.*?>')
    text = p.sub(' ', text)
    
    ## Split into sentences and return lowercased tokens (no lemmatization)
    doc = nlp(text)
    
    samples = []
    for sent in doc.sents:
        samples.append([ t.text.lower() for t in sent if t.text.strip() != '' ])
            
    return samples

process_file_lm('data/corpora/imdb-reviews/aclImdb/train/pos/1000_8.txt')

[['i', 'liked', 'the', 'film', '.'],
 ['some',
  'of',
  'the',
  'action',
  'scenes',
  'were',
  'very',
  'interesting',
  ',',
  'tense',
  'and',
  'well',
  'done',
  '.'],
 ['i',
  'especially',
  'liked',
  'the',
  'opening',
  'scene',
  'which',
  'had',
  'a',
  'semi',
  'truck',
  'in',
  'it',
  '.'],
 ['a',
  'very',
  'tense',
  'action',
  'scene',
  'that',
  'seemed',
  'well',
  'done',
  '.'],
 ['some',
  'of',
  'the',
  'transitional',
  'scenes',
  'were',
  'filmed',
  'in',
  'interesting',
  'ways',
  'such',
  'as',
  'time',
  'lapse',
  'photography',
  ',',
  'unusual',
  'colors',
  ',',
  'or',
  'interesting',
  'angles',
  '.'],
 ['also', 'the', 'film', 'is', 'funny', 'is', 'several', 'parts', '.'],
 ['i',
  'also',
  'liked',
  'how',
  'the',
  'evil',
  'guy',
  'was',
  'portrayed',
  'too',
  '.'],
 ['i', "'d", 'give', 'the', 'film', 'an', '8', 'out', 'of', '10', '.']]
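
The HTML clean-up step can also be tried in isolation. The following is a minimal sketch that applies the same regular expression to a made-up review snippet (not taken from the dataset); the helper name `strip_html_tags` is hypothetical, and a plain whitespace split stands in for the spaCy tokenizer only to keep the example free of dependencies.

```python
import re

# Same pattern as in process_file_lm: non-greedy match of anything between < and >
TAG_PATTERN = re.compile(r'<.*?>')

def strip_html_tags(text):
    # Hypothetical helper: replace each tag with a space so that
    # neighbouring words are not glued together
    return TAG_PATTERN.sub(' ', text)

# Made-up review snippet for illustration
raw = 'Great film.<br /><br />Would watch again!'

cleaned = strip_html_tags(raw)

# Whitespace split instead of spaCy tokenization (assumption for this sketch)
tokens = cleaned.lower().split()

print(cleaned)
print(tokens)
```

Replacing a tag with a space rather than an empty string matters here: removing `<br />` outright would merge `film.` and `Would` into a single token.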

### Preprocessing Review Files

The method `process_reviews_lm()` iterates over all text files representing the movie reviews in the specified folders. For each review, we first extract all sentences and their tokens using `process_file_lm()`. For each token, we also keep track of its count. We need these counts later to create the final vocabulary from only the top-k (e.g., the 20k most frequent) tokens.

In [8]:
def process_reviews_lm(folders, num_reviews):
    sentences = []             # List of all sentences
    token_counter = Counter()  # Dictionary with all tokens and their frequencies
    review_count = 0           # Running counter of processed reviews
    # Iterate over all reviews
    with tqdm(total=num_reviews) as progress_bar:
        for folder in folders:
            for file_name in os.scandir(folder):
                # Ignore directories (just a fail-safe; not really needed)
                if file_name.is_file() is False:
                    continue
                # Preprocess review
                review_sentences = process_file_lm(file_name.path)
                # Add all extracted sentences to final list
                sentences.extend(review_sentences)
                # Update token counts
                for sentence in review_sentences:
                    for token in sentence:
                        token_counter[token] += 1
                # Update progress bar
                progress_bar.update(1)
                # Check if we need to stop early
                review_count += 1
                if review_count >= num_reviews:
                    return sentences, token_counter
    # Return sentences and token counts
    return sentences, token_counter                
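
As a side remark, the token counting itself can be checked on toy data. The sketch below uses two made-up sentences and `Counter.update()`, which should be equivalent to incrementing the count token by token as done in `process_reviews_lm()`.

```python
from collections import Counter

# Two made-up tokenized sentences standing in for the output of process_file_lm()
toy_sentences = [
    ['i', 'liked', 'the', 'film', '.'],
    ['the', 'film', 'is', 'funny', '.'],
]

toy_counter = Counter()
for sentence in toy_sentences:
    # update() adds 1 for every occurrence of every token in the sentence
    toy_counter.update(sentence)

print(toy_counter)
```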

For training a language model, we only need the reviews themselves but not the sentiment labels. This means we can make use of the complete dataset, including the 50k reviews that do not have a sentiment label associated with them. Thus, in the code cell below, we include all subfolders for consideration.

For testing, we recommend using a lower value for `num_reviews` (e.g., 1000) to check that this and the other notebooks are working (of course, the results won't be great). Once everything runs, you can set the cap in the code cell below to a very large value (such as the default `999999999`) to work on the whole dataset.

**Side note:** If you used a different target path when decompressing the files, you need to set the `corpus_base_path` variable accordingly.

In [9]:
corpus_base_path = 'data/corpora/imdb-reviews/'

folders = [
    corpus_base_path+'aclImdb/test/pos',
    corpus_base_path+'aclImdb/test/neg',    
    corpus_base_path+'aclImdb/train/pos',
    corpus_base_path+'aclImdb/train/neg',
    corpus_base_path+'aclImdb/train/unsup'    
]

num_reviews = 0

for folder in folders:
    num_reviews += sum([len(files) for r, d, files in os.walk(folder)])

num_reviews = min(num_reviews, 999999999)    
    
print("Total number of reviews: {}".format(num_reviews))

Total number of reviews: 100000


So, let's call the method `process_reviews_lm` to process the reviews. Depending on the number of reviews you have specified and the performance of your machine, this might take anywhere from several minutes to more than an hour. On the other hand, this should only be a one-time task.

In [10]:
sentences, token_counter = process_reviews_lm(folders, num_reviews)

print('Total number of sentences: {}'.format(len(sentences)))
print('Number of unique tokens: {}'.format(len(token_counter)))

100%|██████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [1:18:37<00:00, 21.20it/s]

Total number of sentences: 1271285
Number of unique tokens: 174318





### Create & Save Vocabulary

For using the dataset to train a PyTorch model, we need to map each unique word/token to a unique index (i.e., integer identifier). Given a vocabulary size of `V`, these unique indices must be in the range from `0` to `V-1`. This is needed since, in the end, training a model using the data comes down to matrix/tensor operations, and we use the identifiers to index the respective tensors. Also, we often want to do additional steps such as considering only the top-k most frequent tokens. Again, it's not difficult to implement this from scratch; however, the `torchtext` package simplifies this, resulting in cleaner code.

The method `process_reviews_lm` already returns the number of occurrences for each token. So if we want to limit the total number of tokens, we simply need to pick the most frequent tokens using those counts. The code cell below accomplishes this, considering the top 20k tokens by default.

In [11]:
TOP_TOKENS = 20000

# Sort with respect to frequencies
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)

# Consider only the TOP_TOKENS and convert to an OrderedDict (expected by Torchtext; see below)
token_ordered_dict = OrderedDict(token_counter_sorted[:TOP_TOKENS])
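
To get a feeling for what the truncation does, here is a minimal sketch on made-up counts. It shows that `Counter.most_common()` should give the same top-k selection as the explicit sort, and it computes which fraction of all token occurrences is still covered by the truncated vocabulary. The counts and the variable names `toy_counter`, `by_sorting`, `by_most_common`, and `coverage` are purely illustrative.

```python
from collections import Counter, OrderedDict

# Made-up token counts for illustration
toy_counter = Counter({'the': 50, 'film': 30, 'plot': 12, 'gaffer': 2, 'zeitgeist': 1})
top_k = 3

# Variant 1: explicit sort, as in the code cell above
by_sorting = OrderedDict(sorted(toy_counter.items(), key=lambda x: x[1], reverse=True)[:top_k])

# Variant 2: built-in helper of Counter
by_most_common = OrderedDict(toy_counter.most_common(top_k))

# Fraction of all token occurrences covered by the top-k tokens
coverage = sum(by_most_common.values()) / sum(toy_counter.values())

print(list(by_most_common.keys()))
print('Coverage: {:.3f}'.format(coverage))
```

On natural language text, a small number of frequent tokens typically accounts for most occurrences, which is why cutting the vocabulary at, e.g., 20k tokens usually maps only a small share of the corpus to `<UNK>`.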

We can now create a `vocab` object. At its core, it creates the mappings between the tokens and their indices. It also supports some additional useful features:

* For many tasks, we need to include special tokens in our vocabulary. For example, we often need a special token (e.g., `<PAD>`) to represent an "empty" word we can use to pad sequences (see also the other notebooks). Even more common is a special token (e.g., `<UNK>`) to represent tokens that haven't been seen when building the vocabulary. Note that the exact strings for those tokens do not matter. For example, we could have used, say, `[[[padding]]]` and `[[[unseen]]]`. It's only important that those tokens are unique. In the code cell below, we also add `<SOS>` (start of sequence) and `<EOS>` (end of sequence). These are typically required for tasks such as machine translation. While not needed here, there is no harm in having them either.

* By using `set_default_index()` we can specify the default index to be used if a sentence we want to transform contains a word not seen before. Most intuitively, we will use the index representing the special token `<UNK>`.

In [12]:
PAD_TOKEN = '<PAD>'
UNK_TOKEN = '<UNK>'
SOS_TOKEN = '<SOS>'
EOS_TOKEN = '<EOS>'

SPECIALS = [PAD_TOKEN, UNK_TOKEN, SOS_TOKEN, EOS_TOKEN]

vocabulary = vocab(token_ordered_dict, specials=SPECIALS)

vocabulary.set_default_index(vocabulary[UNK_TOKEN])

print('Number of tokens: {}'.format(len(vocabulary)))

Number of tokens: 20004


**Side note:** Listing `<PAD>` first ensures that its index will be `0`. While this is not required, index `0` is often assumed to be the padding index and is commonly the default value for many utility methods of PyTorch. As such, making sure that `<PAD>` gets the index `0` simplifies later code and makes it less prone to errors.
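
To make the behavior of the special tokens and the default index more tangible, the following sketch mimics it with a small plain-Python class. This is only a stand-in for illustration and not the Torchtext implementation; the class name `SimpleVocab` and the toy tokens are assumptions.

```python
class SimpleVocab:
    """Plain-Python stand-in for illustration; NOT the torchtext implementation."""

    def __init__(self, tokens, specials, default_token):
        # Special tokens come first, so the first one (here <PAD>) gets index 0
        self.itos = list(specials) + [t for t in tokens if t not in specials]
        self.stoi = {token: idx for idx, token in enumerate(self.itos)}
        # Index returned for any token that is not part of the vocabulary
        self.default_index = self.stoi[default_token]

    def __len__(self):
        return len(self.itos)

    def lookup_indices(self, tokens):
        return [self.stoi.get(t, self.default_index) for t in tokens]

    def lookup_tokens(self, indices):
        return [self.itos[i] for i in indices]


toy_specials = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']
toy_vocab = SimpleVocab(['the', 'film', 'i', 'liked'], toy_specials, default_token='<UNK>')

# 'soundtrack' is not in the toy vocabulary and falls back to the index of <UNK>
toy_indices = toy_vocab.lookup_indices(['i', 'liked', 'the', 'soundtrack'])

print(toy_indices)
print(toy_vocab.lookup_tokens(toy_indices))
```

Note that the round trip is lossy for unseen tokens: once `soundtrack` has been mapped to the index of `<UNK>`, the original word cannot be recovered.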

We also need to save the vocabulary to save the mappings between the tokens and their indices. Without it, we would only have a dataset of integer sequences without knowing which words those integers represent. We can still train a model -- after all, this is why we vectorize the dataset to begin with -- however, then we could not decode predicted sequences of indices back into proper words/tokens.

In [13]:
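vocabulary_file_name = corpus_base_path+'vectorized-rnn-lm/imdb-rnn-lm-{}.vocab'.format(TOP_TOKENS)

# Make sure the output folder exists (no effect if it is already there)
os.makedirs(os.path.dirname(vocabulary_file_name), exist_ok=True)
torch.save(vocabulary, vocabulary_file_name)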

### Vectorizing & Saving Dataset

In practice, we often deal with very large datasets. This means that creating the vocabulary and vectorizing the corpus can take a significant amount of time -- note this also includes any preprocessing. It is therefore common to consider this as an individual step and save the vectorized dataset to be used for training later. Each sample is a sentence represented by the indices of the contained words, and the length of the sentence. We use this information during training to better handle sequences/sentences of different lengths.

As an additional step, we only consider sentences of a certain length (between 5 and 50). This is a somewhat arbitrary choice, and the sole purpose is to limit the dataset to "normal looking" sentences.

**Side note:** File names such as the one of the vocabulary file above reflect the number of tokens in the vocabulary (excluding the special tokens). Such naming schemes can be useful when the same raw input data gets converted into different datasets using different preprocessing steps or vocabulary settings.

In [14]:
min_len, max_len = 5, 50
 
# Define output file
output_file = open(corpus_base_path+'vectorized-rnn-lm/imdb-rnn-lm-sentences.txt', 'w')

for sentence in sentences:
    # Get length of sentence
    sent_len = len(sentence)
    
    # If the sentence is too short or too long, ignore
    if sent_len < min_len or sent_len > max_len:
        continue

    # Convert tokens to their respective indices
    sentence_vectorized = vocabulary.lookup_indices(sentence)
    
    # Save vectorized sentence and length information to the output file
    output_file.write('{},{}\n'.format(' '.join([ str(idx) for idx in sentence_vectorized ]), sent_len))
                      
output_file.flush()
output_file.close()

Now we have a ready dataset to train a simple language model.
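
To illustrate how the saved file might be consumed later, the sketch below parses lines of the form `idx idx ... idx,length` and right-pads a small batch to equal length. The two lines are made up and read from an in-memory buffer instead of the real file; the helper names `parse_lm_line` and `pad_batch` are hypothetical, and the padding index `0` assumes that `<PAD>` was listed first among the special tokens.

```python
import io

PAD_INDEX = 0  # Assumption: <PAD> was listed first among the special tokens

def parse_lm_line(line):
    # Each line has the form "idx idx idx ...,length"
    indices_part, length_part = line.strip().rsplit(',', 1)
    indices = [int(idx) for idx in indices_part.split()]
    return indices, int(length_part)

def pad_batch(sequences, pad_index=PAD_INDEX):
    # Right-pad every sequence to the length of the longest one in the batch
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

# Two made-up lines standing in for the contents of the output file
fake_file = io.StringIO('6 7 4 5 8,5\n4 5 9 10 11 12 8,7\n')

parsed = [parse_lm_line(line) for line in fake_file]
sequences = [indices for indices, _ in parsed]
lengths = [length for _, length in parsed]
padded = pad_batch(sequences)

print(padded)
print(lengths)
```

Keeping the original lengths next to the padded sequences is what later allows a model to ignore the padded positions.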

---

## Data Preparation: Sentiment Analysis

### Auxiliary Methods for Data Cleaning & Preprocessing

As the dataset is organized using different subfolders, which in turn indicate the sentiment label, and each review is represented by its own file containing noise such as HTML tags, we first define a couple of auxiliary methods. Firstly, `get_sentiment_label` takes a complete file name as input and extracts the sentiment label from the path. We have only 2 classes, and here we assign positive reviews the label `1` and negative reviews the label `0`.

In [15]:
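def get_sentiment_label(file_name):
    # The name of the parent folder ('pos' or 'neg') encodes the label;
    # os.path handles the path separator of the current platform
    label = os.path.basename(os.path.dirname(file_name))
    if label.lower() == 'pos':
        return 1
    else:
        return 0

# Let's quickly check the method using an example file name
get_sentiment_label('data/corpora/imdb-reviews/aclImdb/train/pos/1000_8.txt')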

1

Secondly, we again need a method to clean and preprocess a review file. The method `process_file_sa()` below is very similar to the method  `process_file_lm()`; the main difference is in the actual preprocessing steps applied. By default, each token is lemmatized and all punctuation marks are removed. Of course, this does not mean that these selected steps guarantee the best results. So feel free to change them. Note that no stopword removal is performed to preserve words such as *"not"*, *"n't"*, *"never"*, etc. which are intuitively relevant for sentiment analysis.

In [16]:
def process_file_sa(file_name):
    
    text = None
    with open(file_name, 'r') as file:
        text = file.read().replace('\n', '')
        
    # Just a fail-safe if anything is off here
    if text is None:
        return

    ## Remove HTML tags
    p = re.compile(r'<.*?>')
    text = p.sub(' ', text)
    
    ## Return "proper tokens" (lemma, lowercase)
    doc = nlp(text)
    
    label = get_sentiment_label(file_name)
    sample = [ t.lemma_.lower() for t in doc if t.pos_ not in ['PUNCT'] and t.dep_ not in ['punct'] and t.lemma_.strip() != '']

    return sample, label

process_file_sa('data/corpora/imdb-reviews/aclImdb/train/pos/1136_8.txt')

(['i',
  'be',
  'not',
  'sure',
  'that',
  'this',
  'comment',
  'contain',
  'an',
  'actual',
  'spoiler',
  'but',
  'i',
  'be',
  'play',
  'it',
  'safe',
  'so',
  'do',
  'not',
  'read',
  'this',
  'if',
  'you',
  'have',
  'not',
  'see',
  'the',
  'movie',
  'i',
  'adore',
  'this',
  'movie',
  'and',
  'so',
  'do',
  'everyone',
  'i',
  'work',
  'with',
  'and',
  'that',
  'be',
  'the',
  'point',
  'i',
  'spend',
  'a',
  'large',
  'part',
  'of',
  'my',
  'work',
  'life',
  'in',
  'cinema',
  'without',
  'be',
  'an',
  'actor',
  'such',
  'people',
  'be',
  'the',
  '_',
  'sung',
  'hero',
  'of',
  'this',
  'movie',
  'the',
  'gaffer',
  'the',
  'puller',
  'the',
  'on',
  'air',
  'director',
  'the',
  'lighter',
  'and',
  'writer',
  'the',
  'costume',
  'people',
  'etc',
  'etc',
  'and',
  'the',
  'whole',
  'thing',
  'be',
  'tell',
  'from',
  'their',
  'point',
  'of',
  'view',
  'at',
  'least',
  'to',
  'a',
  'great',
  'ext

### Preprocessing Review Files

Similar to above, below we define a method `process_reviews_sa()` that iterates over all files in a folder and preprocesses them using the method `process_file_sa()`. Again, we keep track of the counts for all tokens to later limit the size of the final vocabulary.

In [17]:
def process_reviews_sa(folders, num_reviews):
    samples = []               # List of all samples (one token list per review)
    labels =  []               # List of all labels
    token_counter = Counter()  # Dictionary with all tokens and their frequencies
    # Iterate over all reviews
    for folder in folders:
        print(folder)
        with tqdm(total=num_reviews) as progress_bar:
            review_count = 0 # Running counter of processed reviews (per folder)
            for file_name in os.scandir(folder):
                # Ignore directories (just a fail-safe; not really needed)
                if file_name.is_file() is False:
                    continue
                # Preprocess review
                sample, label = process_file_sa(file_name.path)
                # Add sample and label to the final lists
                samples.append(sample)
                labels.append(label)
                # Update token counts
                for token in sample:
                    token_counter[token] += 1
                # Update progress bar
                progress_bar.update(1)
                # Check if we need to stop early
                review_count += 1
                if review_count >= num_reviews:
                    break
    # Return samples, labels, and token counts
    return samples, labels, token_counter       

Since sentiment analysis is a classification task, we can only use the reviews that actually have a sentiment label. As the original dataset is already organized into a training and a test set, we keep this split by processing the respective subfolders separately.

Again, for testing the code below, you can first manually limit the value of `num_reviews`. However, note that here this limit applies to each folder separately. After all, we need to ensure that any subset of the dataset remains balanced; in other words, we have to avoid a subset that contains only, say, positive reviews. The method `process_reviews_sa()` handles this correctly. For example, if `num_reviews=1000`, both the training and the test data will contain 2,000 reviews (1,000 positive and 1,000 negative reviews).
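
The difference between a per-folder limit and a single global limit can be sketched with made-up file lists. The helper names `take_per_folder` and `take_global` are hypothetical and only illustrate the two strategies; they are not part of the notebook's code.

```python
def take_per_folder(files_by_folder, num_reviews):
    # Keep at most num_reviews files from each folder separately
    return {folder: files[:num_reviews] for folder, files in files_by_folder.items()}

def take_global(files_by_folder, num_reviews):
    # Keep at most num_reviews files in total, filling up folder by folder
    taken = {}
    remaining = num_reviews
    for folder, files in files_by_folder.items():
        taken[folder] = files[:remaining]
        remaining -= len(taken[folder])
    return taken

# Made-up file names standing in for the real folder contents
files_by_folder = {
    'train/pos': ['{}_pos.txt'.format(i) for i in range(5)],
    'train/neg': ['{}_neg.txt'.format(i) for i in range(5)],
}

sizes_per_folder = {f: len(v) for f, v in take_per_folder(files_by_folder, 2).items()}
sizes_global = {f: len(v) for f, v in take_global(files_by_folder, 4).items()}

print(sizes_per_folder)
print(sizes_global)
```

With a global limit, the first folder can use up the whole budget, leaving a subset with only one class; the per-folder limit keeps both classes equally represented.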

In [18]:
corpus_base_path = 'data/corpora/imdb-reviews/'

folders_train = [
    corpus_base_path+'aclImdb/train/pos',
    corpus_base_path+'aclImdb/train/neg'
]

folders_test = [
    corpus_base_path+'aclImdb/test/pos',
    corpus_base_path+'aclImdb/test/neg'
]

num_reviews = sum([len(files) for r, d, files in os.walk(folders_test[0])])
num_reviews = min(num_reviews, 999999999)        
    
print("Total number of reviews: {}".format(num_reviews))

Total number of reviews: 12500


Again, it's time to call the method `process_reviews_sa` to process the reviews, which might take some time.

In [19]:
samples_train, labels_train, token_counter = process_reviews_sa(folders_train, num_reviews)
samples_test , labels_test,  _             = process_reviews_sa(folders_test , num_reviews)

print('Total number of training samples: {}'.format(len(samples_train)))
print('Total number of test samples: {}'.format(len(samples_test)))
print('Number of unique tokens: {}'.format(len(token_counter)))

data/corpora/imdb-reviews/aclImdb/train/pos


100%|██████████████████████████████████████████████████████████████████████████████████████████| 12500/12500 [09:46<00:00, 21.31it/s]


data/corpora/imdb-reviews/aclImdb/train/neg


100%|██████████████████████████████████████████████████████████████████████████████████████████| 12500/12500 [09:40<00:00, 21.52it/s]


data/corpora/imdb-reviews/aclImdb/test/pos


100%|██████████████████████████████████████████████████████████████████████████████████████████| 12500/12500 [09:34<00:00, 21.76it/s]


data/corpora/imdb-reviews/aclImdb/test/neg


100%|██████████████████████████████████████████████████████████████████████████████████████████| 12500/12500 [09:42<00:00, 21.45it/s]

Total number of training samples: 25000
Total number of test samples: 25000
Number of unique tokens: 72888





### Create & Save Vocabulary

With the reviews processed, we can again create the vocabulary, which requires the same steps as above using essentially the same code. So for simplicity, we combine all the code for this into a single code cell; see below.

In [20]:
TOP_TOKENS = 20000

# Sort with respect to frequencies
token_counter_sorted = sorted(token_counter.items(), key=lambda x: x[1], reverse=True)

# Consider only the TOP_TOKENS and convert to an OrderedDict (expected by Torchtext; see above)
token_ordered_dict = OrderedDict(token_counter_sorted[:TOP_TOKENS])

PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"
SOS_TOKEN = "<SOS>"
EOS_TOKEN = "<EOS>"

SPECIALS = [PAD_TOKEN, UNK_TOKEN, SOS_TOKEN, EOS_TOKEN]

vocabulary = vocab(token_ordered_dict, specials=SPECIALS)

vocabulary.set_default_index(vocabulary[UNK_TOKEN])

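vocabulary_file_name = corpus_base_path+'vectorized-rnn-sa/imdb-rnn-sa-{}.vocab'.format(TOP_TOKENS)

# Make sure the output folder exists (no effect if it is already there)
os.makedirs(os.path.dirname(vocabulary_file_name), exist_ok=True)
torch.save(vocabulary, vocabulary_file_name)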

### Vectorizing & Saving Dataset

The last step is again to vectorize the samples to generate the final dataset. Compared to above, there are 2 differences:

* Here, each sample is a complete review, and we do not filter the samples based on their length

* Since we have a training and test set, we also save them into 2 separate files

In [21]:
output_file = open(corpus_base_path+'vectorized-rnn-sa/imdb-rnn-sa-reviews-{}-train.txt'.format(TOP_TOKENS), "w")

for sample, label in zip(samples_train, labels_train):

    # Convert tokens to their respective indices
    sample_vectorized = vocabulary.lookup_indices(sample)

    # Write vectorized sample and label to file
    output_file.write('{},{}\n'.format(' '.join([ str(idx) for idx in sample_vectorized ]), label))
        
output_file.flush()
output_file.close()

In [22]:
output_file = open(corpus_base_path+'vectorized-rnn-sa/imdb-rnn-sa-reviews-{}-test.txt'.format(TOP_TOKENS), "w")

for sample, label in zip(samples_test, labels_test):

    # Convert tokens to their respective indices
    sample_vectorized = vocabulary.lookup_indices(sample)

    # Write vectorized sample and label to file
    output_file.write('{},{}\n'.format(' '.join([ str(idx) for idx in sample_vectorized ]), label))
        
output_file.flush()
output_file.close()
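
Since the reviews are not filtered by length here, it can be useful to look at the label balance and the spread of review lengths before training. The sketch below does this for three made-up lines in the saved `indices,label` format, read from an in-memory buffer instead of the real files; the helper name `parse_sa_line` is hypothetical.

```python
import io
from collections import Counter

def parse_sa_line(line):
    # Each line has the form "idx idx idx ...,label"
    indices_part, label_part = line.strip().rsplit(',', 1)
    return [int(idx) for idx in indices_part.split()], int(label_part)

# Three made-up lines standing in for the contents of a saved file
fake_file = io.StringIO('6 7 1 5,1\n4 9 9 12 8 3,0\n10 11,1\n')

parsed = [parse_sa_line(line) for line in fake_file]

# Number of samples per label and the shortest/longest review
label_counts = Counter(label for _, label in parsed)
review_lengths = [len(indices) for indices, _ in parsed]

print(label_counts)
print('Shortest: {}, longest: {}'.format(min(review_lengths), max(review_lengths)))
```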

---

## Pretrained Word Embeddings

Word embeddings are numerical representations of words in a continuous vector space. Pretrained word embeddings are vector representations of words that are derived from large corpora of text using unsupervised learning techniques. These embeddings capture semantic and syntactic information about words in a dense vector space, where words with similar meanings or contexts are located closer to each other.

Once trained, these word embeddings can be reused in various downstream natural language processing (NLP) tasks, such as text classification, named entity recognition, *sentiment analysis*, and machine translation. By utilizing pretrained word embeddings, models can leverage the learned semantic relationships between words and benefit from transfer learning. Pretrained word embeddings have become popular because they offer several advantages. First, they capture rich semantic information that might be challenging to learn from smaller task-specific datasets. Second, they can help overcome the data sparsity problem, especially when dealing with rare words or out-of-vocabulary (OOV) terms. Lastly, pretrained word embeddings enable faster convergence and improved generalization for downstream NLP tasks.

Examples of popular pretrained word embeddings include *Word2Vec*, GloVe (Global Vectors for Word Representation), and fastText. These embeddings are typically available in prebuilt formats and can be readily loaded into models to enhance their performance on various NLP tasks. In later notebooks, we will actually train Word2Vec embeddings from scratch.

The notebook introducing and implementing an RNN-based model for sentiment analysis includes an optional step to utilize pretrained word embeddings. Such embeddings based on Word2Vec, GloVe, or fastText are available online for download. For example, [http://vectors.nlpl.eu/repository/](http://vectors.nlpl.eu/repository/) is an online repository for pretrained word embeddings. In the code cell below, we download and decompress the ZIP file containing the embeddings used for training the sentiment analysis model. Note that these word embeddings have been trained on a lemmatized text corpus. This matches -- and has to match -- the preprocessing steps of the movie reviews.

In [23]:
print('Download file...')
download_file('http://vectors.nlpl.eu/repository/20/5.zip', 'data/embeddings/')
print('Decompress file...')
decompress_file('data/embeddings/5.zip', 'data/embeddings/')
print('DONE.')

Download file...


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 575M/575M [17:23<00:00, 551kiB/s]


Decompress file...
DONE.
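
To give an idea of how such embeddings can be lined up with a vocabulary, here is a minimal sketch. It assumes the common word2vec text format (a header line with the number of vectors and the dimension, followed by one token and its values per line); whether the downloaded archive uses exactly this layout and token format should be checked. The data is made up and read from an in-memory buffer, and the helper names `load_text_embeddings` and `build_embedding_matrix` are hypothetical.

```python
import io

def load_text_embeddings(handle):
    # Assumed header line of the word2vec text format: "<number of vectors> <dimension>"
    num_vectors, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

def build_embedding_matrix(itos, vectors, dim):
    # One row per vocabulary index; tokens without a pretrained vector stay all-zero
    return [vectors.get(token, [0.0] * dim) for token in itos]

# Made-up 2-dimensional embeddings for illustration
fake_file = io.StringIO('3 2\nfilm 0.1 0.2\nlike 0.3 0.4\nbe 0.5 0.6\n')
vectors, dim = load_text_embeddings(fake_file)

# Toy index-to-token list; in the notebook this would come from the vocabulary
toy_itos = ['<PAD>', '<UNK>', 'be', 'film', 'gaffer']
matrix = build_embedding_matrix(toy_itos, vectors, dim)

for token, row in zip(toy_itos, matrix):
    print(token, row)
```

The lookup only finds a vector if the token strings match exactly, which is one reason why the preprocessing of the reviews (lemmatized, lowercase) has to match that of the embedding corpus.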


---

## Summary

While we didn't do anything exciting here, this notebook has a couple of useful take-away messages:

* For large(r) text corpora, it is good practice to treat the preprocessing and preparation of the final dataset (incl. vectorization) as a separate step. This step requires a lot of consideration and can be very time and resource intensive on its own, even without any training of neural network models. In the follow-up notebooks, we will use the datasets generated in this notebook.

* Even when using the same corpus, different tasks are likely to require different preprocessing steps. For example, one of the main differences in this notebook was that we lemmatize the data for sentiment analysis (arguably debatable) but not for training language models (arguably mandatory).

* The preprocessing and vectorization of text corpora generally involve the same steps. Using well-established packages such as `torchtext` is highly recommended, as the provided methods are mostly very flexible. Custom implementations should only be required for very non-standard preprocessing and vectorization steps.