# Rudiments of Natural Language Processing: Data

This is the first part of the NLP workshop at the Center for Machine Learning. In this notebook we preprocess the IMDB movie reviews to later classify them as positive or negative. We do the following:

*   Gather some prerequisites
*   Retrieve the IMDB dataset of movie reviews
*   Clean and tokenize the texts
*   Split the data into training and validation sets
*   Create a vocabulary and save it to disk
*   Convert words to indices and save them to disk



## Prerequisites

Install the missing dependencies:

In [None]:
!pip install portalocker

Import the necessary modules:

In [None]:
import csv
import google.colab as colab
import itertools
import pickle
import torchtext

We will create some files and store them on Google Drive so that they are available in the next notebooks. Mount your drive:

In [None]:
colab.drive.mount('/content/drive')

Your home directory on Google Drive is now mounted as `drive/MyDrive`. Check it by listing its contents:

In [None]:
!ls drive/MyDrive/

For simplicity we will store all the necessary files in the home directory but you can create a dedicated subfolder if you wish.

## Get the data

The IMDB dataset can be retrieved in various ways. To avoid some technical problems with the existing distributions we prepared a single file `data.zip`. Download it to your home directory on Google Drive:

In [None]:
!wget www.fuw.edu.pl/~polbrat/data.zip --directory-prefix=drive/MyDrive/

Unzip it to the current folder of the execution environment:

In [None]:
!unzip drive/MyDrive/data.zip

Note that this directory will be lost when the environment gets disconnected. List its contents:

In [None]:
!ls

Print the first ten reviews from the CSV file:

In [None]:
!head data.csv

The labels in the first column equal 0 for negative and 1 for positive reviews.

## Clean the texts

Further processing will be easier if the texts are first cleaned a bit. In a simple approach we replace the HTML tags `<hr>` and `<br />`, digits, and punctuation with spaces, delete apostrophes, and make everything lowercase. Read the CSV file and store the cleaned texts together with their labels in a list of tuples:

In [None]:
data = list()
with open('data.csv', 'rt', encoding = 'utf-8') as stream:
    reader = csv.reader(stream)
    for label0, word1 in reader:
        word1 = word1.replace('<hr>', ' ').replace('<br />', ' ')
        word1 = word1.replace('!', ' ').replace('"', ' ').replace('#', ' ').replace('$', ' ').replace('%', ' ')
        word1 = word1.replace('&', ' ').replace("'", '').replace('(', ' ').replace(')', ' ').replace('*', ' ')
        word1 = word1.replace('+', ' ').replace(',', ' ').replace('-', ' ').replace('.', ' ').replace('/', ' ')
        word1 = word1.replace('0', ' ').replace('1', ' ').replace('2', ' ').replace('3', ' ').replace('4', ' ')
        word1 = word1.replace('5', ' ').replace('6', ' ').replace('7', ' ').replace('8', ' ').replace('9', ' ')
        word1 = word1.replace(':', ' ').replace(';', ' ').replace('<', ' ').replace('=', ' ').replace('>', ' ')
        word1 = word1.replace('?', ' ').replace('@', ' ').replace('[', ' ').replace('\\', ' ').replace(']', ' ')
        word1 = word1.replace('^', ' ').replace('_', ' ').replace('`', ' ').replace('{', ' ').replace('|', ' ')
        word1 = word1.replace('}', ' ').replace('~', ' ')
        word1 = word1.lower()
        word1 = word1.split()
        word1 = ' '.join(word1)
        data.append((label0, word1))
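The chain of `replace` calls can also be written more compactly with `str.translate`. The sketch below assumes that the characters to handle are exactly the ASCII punctuation marks and digits listed above; the helper name `clean` and the sample sentence are made up for illustration and are not part of the workshop code.

```python
import string

# Map every ASCII punctuation mark and digit to a space ...
TABLE = str.maketrans({char: ' ' for char in string.punctuation + string.digits})
# ... except the apostrophe, which is deleted so that "didn't" becomes "didnt".
TABLE[ord("'")] = None

def clean(text):
    """Hypothetical helper mirroring the cleaning steps of the cell above."""
    # Replace the two HTML tags first, before '<' and '>' are turned into spaces.
    text = text.replace('<hr>', ' ').replace('<br />', ' ')
    text = text.translate(TABLE).lower()
    # Collapse runs of whitespace into single spaces.
    return ' '.join(text.split())

sample = "I didn't like it.<br /><br />2/10, AVOID!"
print(clean(sample))  # i didnt like it avoid
```

Replacing the tags before the punctuation matters: otherwise `<br />` would leave the stray word `br` in the text.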

Save the cleaned data in the same CSV file:

In [None]:
with open('data.csv', 'wt', encoding = 'utf-8') as stream:
    writer = csv.writer(stream)
    for label0, word1 in data:
        writer.writerow([label0, word1])

Print the first ten cleaned reviews:

In [None]:
!head data.csv

By cleaning in this way we certainly lose some information. Doing significantly better would be cumbersome without specialized tools, and this approach is sufficient for our purposes.

## Tokenize the texts

For further processing we want to represent each review as a list of tokens, which in our simple approach are just words. Split the texts on spaces and store the labels and sequences of words in a list of tuples:

In [None]:
data = list()
with open('data.csv', 'rt', encoding = 'utf-8') as stream:
    reader = csv.reader(stream)
    for label0, word1 in reader:
        word1 = word1.split()
        data.append((label0, word1))

Save the tokenized data in the same CSV file so that the labels are in the first column and the words in the following columns:

In [None]:
with open('data.csv', 'wt', encoding = 'utf-8') as stream:
    writer = csv.writer(stream)
    for label0, word1 in data:
        writer.writerow([label0] + word1)

Print the first ten tokenized reviews:

In [None]:
!head data.csv

## Split the data into training and validation sets

Read the tokenized CSV file back just to see how it is done. At the same time, convert the labels from text to integers:

In [None]:
data = list()
with open('data.csv', 'rt', encoding = 'utf-8') as stream:
    reader = csv.reader(stream)
    for label0, *word1 in reader:
        label0 = int(label0)
        data.append((label0, word1))

Print the first ten reviews just to check:

In [None]:
for label0, word1 in data[:10]:
    print(label0, word1)

Print the total number of reviews as well as the numbers of negative and positive ones:

In [None]:
print(len(data),
      sum(label0 == 0 for label0, word1 in data),
      sum(label0 == 1 for label0, word1 in data))

There are as many negative reviews as positive ones, so the dataset is exactly balanced. Split the data into training and validation sets of 40000 and 10000 reviews respectively:

In [None]:
train_data = data[:40000]
valid_data = data[40000:]

Print the total number of reviews in each set as well as the numbers of negative and positive reviews:

In [None]:
print(len(train_data),
      sum(label0 == 0 for label0, word1 in train_data),
      sum(label0 == 1 for label0, word1 in train_data))

print(len(valid_data),
      sum(label0 == 0 for label0, word1 in valid_data),
      sum(label0 == 1 for label0, word1 in valid_data))

The downloaded file was prepared so that both sets are exactly balanced. If this were not the case, you would need a better splitting technique than just slicing the data list.
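One such technique is a stratified split, which shuffles and divides each class separately so that both sets keep the same class proportions. The following is a minimal sketch on toy data; the function name `stratified_split`, its parameters, and the toy reviews are assumptions made for illustration only.

```python
import random

def stratified_split(data, valid_fraction, seed=0):
    """Split (label, words) pairs so that each label keeps its share in both sets."""
    rng = random.Random(seed)
    # Group the items by their label.
    by_label = {}
    for item in data:
        by_label.setdefault(item[0], []).append(item)
    train, valid = [], []
    for items in by_label.values():
        items = list(items)
        rng.shuffle(items)
        # The same fraction of every class goes to the validation set.
        cut = round(len(items) * valid_fraction)
        valid.extend(items[:cut])
        train.extend(items[cut:])
    # Shuffle again so that the classes are not stored in blocks.
    rng.shuffle(train)
    rng.shuffle(valid)
    return train, valid

# Toy data: 6 negative and 4 positive reviews.
toy = [(0, ['bad'])] * 6 + [(1, ['good'])] * 4
toy_train, toy_valid = stratified_split(toy, valid_fraction=0.5)
print(len(toy_train), len(toy_valid))  # 5 5
```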

## Build a vocabulary

In order to feed the reviews to any machine-learning model we must convert the texts to numbers. We do this by replacing each word with its index in a list of all considered words, which constitutes the vocabulary of the problem. The vocabulary contains only words from the training set, because we cannot predict which other words will appear in new reviews. We will now create the vocabulary. Drop the labels from the training set and keep only the lists of words of the successive reviews:

In [None]:
word2 = [word1 for label0, word1 in train_data]

Print the lists of words in the first ten reviews just to see if they are correct:

In [None]:
for word1 in word2[:10]:
    print(word1)

Create a vocabulary object from all the words in the training set using a ready-made `torchtext` function:

In [None]:
vocab = torchtext.vocab.build_vocab_from_iterator(word2)

The vocabulary object assigns a unique index to each unique word in the provided corpus and is then able to convert words to their indices. For the first ten reviews, print their words, pass them through the vocabulary, and print the resulting word indices:

In [None]:
for word1 in word2[:10]:
    print(word1)
    index1 = vocab(word1)
    print(index1)
    print()
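Conceptually, such a vocabulary is a word list sorted by frequency together with the inverse mapping from words to positions. The pure-Python sketch below only approximates this idea on toy data: the function `build_simple_vocab` is hypothetical, and the alphabetical tie-breaking between equally frequent words is an assumption rather than a statement about `torchtext` internals.

```python
from collections import Counter

def build_simple_vocab(token_lists):
    """Return (stoi, itos) for a corpus given as lists of tokens."""
    # Count how many times each word occurs in the whole corpus.
    counter = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent words first; ties are broken alphabetically (assumption).
    itos = sorted(counter, key=lambda word: (-counter[word], word))
    stoi = {word: index for index, word in enumerate(itos)}
    return stoi, itos

toy_reviews = [['a', 'good', 'film'], ['a', 'bad', 'film'], ['a', 'film']]
toy_stoi, toy_itos = build_simple_vocab(toy_reviews)
print(toy_itos)                                     # ['a', 'film', 'bad', 'good']
print([toy_stoi[word] for word in toy_reviews[0]])  # [0, 3, 1]
```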

The vocabulary can also transform a single word into its index. Print the index of a common word `message`:

In [None]:
index0 = vocab['message']
print(index0)

Print the index of an odd word `jlo` which may be a typo or a proper name appearing by chance in this particular training set:

In [None]:
index0 = vocab['jlo']
print(index0)

From the vocabulary object retrieve a Python dictionary that maps words to their indices. Print the first ten keys of this dictionary together with their corresponding values:

In [None]:
stoi = vocab.get_stoi()
for word0 in itertools.islice(stoi.keys(), 10):
    index0 = stoi[word0]
    print(word0, index0)

Note that despite our cleaning some non-ASCII UTF-8 characters remain. You may get rid of them by cleaning more thoroughly.

From the vocabulary object retrieve a list of words arranged according to their indices. Print the first ten indices and words:

In [None]:
itos = vocab.get_itos()
for index0 in range(10):
    word0 = itos[index0]
    print(index0, word0)

These are actually the most frequent words in our text corpus. The vocabulary can be manually extended with arbitrary words that usually play some special role. Add a word `<unk>`:


In [None]:
specials = ['<unk>']
vocab = torchtext.vocab.build_vocab_from_iterator(word2, specials = specials)

Print the first ten words again:

In [None]:
itos = vocab.get_itos()
for index0 in range(10):
    word0 = itos[index0]
    print(index0, word0)

The special words are placed at the beginning of the list. They are treated as all other words and can be mapped to their indices as well:

In [None]:
index0 = vocab['<unk>']
print(index0)

New reviews may contain words absent from the training set and therefore from the vocabulary. An attempt to map an unknown word causes an error:

In [None]:
index0 = vocab['ferdydurke']
print(index0)

But the vocabulary may be told to map unknown words to a default index which is usually set as the index of the special word `<unk>`:

In [None]:
index0 = vocab['<unk>']
vocab.set_default_index(index0)

Check it by mapping any unknown word:

In [None]:
index0 = vocab['ferdydurke']
print(index0)
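The effect of a default index can be imitated with a plain dictionary. This is only a sketch of the idea with a made-up three-word vocabulary and a hypothetical `lookup` helper; a plain dictionary raises `KeyError`, whereas the exception type raised by the library may differ.

```python
# Toy word-to-index mapping with the special word at index 0.
toy_stoi = {'<unk>': 0, 'the': 1, 'film': 2}

def lookup(word, default_index=None):
    """Return the index of a word, or the default index if the word is unknown."""
    if word in toy_stoi:
        return toy_stoi[word]
    if default_index is None:
        # Without a default index an unknown word is an error.
        raise KeyError(word)
    return default_index

print(lookup('film'))                                         # 2
print(lookup('ferdydurke', default_index=toy_stoi['<unk>']))  # 0
```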

The number of unique words in the vocabulary can be obtained as its length:

In [None]:
indices = len(vocab)
print(indices)

The vocabulary contains a large number of words, but some of them appear in the training set very rarely. If a rare word happens to appear only in positive reviews, the model may erroneously conclude that any review containing this word is positive. It is therefore better to treat such words as unknown and exclude them from the vocabulary. Limit the vocabulary to words that occur at least five times in the training set by setting the minimum frequency `min_freq` to 5:

In [None]:
vocab = torchtext.vocab.build_vocab_from_iterator(word2, min_freq = 5, specials = specials)
index0 = vocab['<unk>']
vocab.set_default_index(index0)

Note that this dramatically reduces the vocabulary size:

In [None]:
indices = len(vocab)
print(indices)

This reduction makes the calculations lighter and helps prevent overfitting. The odd word `jlo` is now absent from the vocabulary, so it is mapped to the default index, that of `<unk>`:

In [None]:
index0 = vocab['jlo']
print(index0)
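The frequency threshold can be illustrated with a word counter. The sketch below assumes that the threshold applies to the total number of occurrences of a word in the corpus; the function `build_limited_vocab` and the toy reviews are hypothetical.

```python
from collections import Counter

def build_limited_vocab(token_lists, min_freq, specials):
    """Map words occurring at least min_freq times to indices, special words first."""
    counter = Counter(token for tokens in token_lists for token in tokens)
    # Keep only sufficiently frequent words, most frequent first.
    kept = sorted((word for word in counter if counter[word] >= min_freq),
                  key=lambda word: (-counter[word], word))
    itos = list(specials) + kept
    return {word: index for index, word in enumerate(itos)}

toy_reviews = [['jlo', 'was', 'fine'], ['it', 'was', 'fine'], ['fine', 'fine']]
toy_stoi = build_limited_vocab(toy_reviews, min_freq=2, specials=['<unk>'])
print(toy_stoi)  # {'<unk>': 0, 'fine': 1, 'was': 2}

# Rare words such as 'jlo' now fall back to the index of '<unk>'.
unk_index = toy_stoi['<unk>']
print([toy_stoi.get(word, unk_index) for word in toy_reviews[0]])  # [0, 2, 1]
```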

Note that the most frequent words like `the`, `and`, `a` etc. carry little information on whether a review is negative or positive, so it would be beneficial to exclude them from the vocabulary as well. We will not do so because the library offers no ready mechanism for it. Instead our models will learn that these words are not important.

During training, reviews of different lengths will be grouped into batches in which all sequences must have equal length. Shorter reviews will therefore be padded with an index that does not correspond to any real word but to a special one, usually called `<pad>`. Create a vocabulary with two special words `<pad>` and `<unk>`:

In [None]:
specials = ['<pad>', '<unk>']
vocab = torchtext.vocab.build_vocab_from_iterator(word2, min_freq = 5, specials = specials)
index0 = vocab['<unk>']
vocab.set_default_index(index0)

Note that they correspond to indices 0 and 1:

In [None]:
index1 = vocab(['<pad>', '<unk>'])
print(index1)
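To illustrate why the padding index is needed, here is a minimal sketch of padding a batch of index sequences to a common length. The helper `pad_batch` and the toy indices are assumptions made for illustration; the actual batching is left to the later notebooks.

```python
def pad_batch(batch, pad_index=0):
    """Append pad_index to every sequence until all have the length of the longest one."""
    longest = max(len(indices) for indices in batch)
    return [indices + [pad_index] * (longest - len(indices)) for indices in batch]

toy_batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(toy_batch)
print(padded)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```

Using index 0 for `<pad>` is convenient because the padded positions are then easy to recognize and ignore.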

This is the final form of our vocabulary. Save it to Google Drive in the pickle format:

In [None]:
with open('drive/MyDrive/vocab.pkl', 'wb') as stream:
    pickle.dump(vocab, stream, protocol = pickle.HIGHEST_PROTOCOL)

## Convert words to indices

We will now use the created vocabulary to convert the training and validation reviews to word indices. Convert the training data:

In [None]:
train_data = [(label0, vocab(word1)) for label0, word1 in train_data]

Save it to Google Drive:

In [None]:
with open('drive/MyDrive/train_data.csv', 'wt', encoding = 'utf-8') as stream:
    writer = csv.writer(stream)
    for label0, index1 in train_data:
        writer.writerow([label0] + index1)

Print the first ten converted reviews:

In [None]:
!head drive/MyDrive/train_data.csv

The first column contains the label and the next columns contain subsequent word indices. Do the same with the validation data:

In [None]:
valid_data = [(label0, vocab(word1)) for label0, word1 in valid_data]

with open('drive/MyDrive/valid_data.csv', 'wt', encoding = 'utf-8') as stream:
    writer = csv.writer(stream)
    for label0, index1 in valid_data:
        writer.writerow([label0] + index1)

!head drive/MyDrive/valid_data.csv
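The saved files can later be read back into the same list-of-tuples structure. The sketch below parses an in-memory sample instead of the real file; the helper `read_indexed` is hypothetical and the sample rows are made up.

```python
import csv
import io

def read_indexed(stream):
    """Parse rows of the form label,index,index,... into (label, indices) tuples."""
    data = []
    for label0, *index1 in csv.reader(stream):
        # Everything in a CSV file is text, so convert back to integers.
        data.append((int(label0), [int(index0) for index0 in index1]))
    return data

sample_stream = io.StringIO('1,12,5,0\n0,7,3\n')
restored = read_indexed(sample_stream)
print(restored)  # [(1, [12, 5, 0]), (0, [7, 3])]
```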

Now that the vocabulary and the data are saved to Google Drive, switch to the next notebook and train a very simple model that will classify the reviews as negative or positive.