# Text Processing and Word Embeddings

Welcome to this new exercise! Here we work with text instead of images, using Recurrent Neural Networks (RNNs). Working with text, speech, and similar data is generally called Natural Language Processing (NLP). Text differs structurally from images: text is a string, while an image is an array of numbers. Hence, we need some preprocessing steps to transform the raw text into a numerical format. This notebook introduces these basic concepts of NLP pipelines. Specifically, you will learn about:

1. How to preprocess text classification datasets
2. How to create a simple word embedding layer that maps words to dense vectors

## (Optional) Mount folder in Colab

Uncomment the following cell to mount your Google Drive if you are using the notebook in Google Colab:

In [None]:
# Use the following lines if you want to use Google Colab
# We presume you created a folder "i2dl" within your main drive folder, and put the exercise there.
# NOTE: terminate all other colab sessions that use GPU!
# NOTE 2: Make sure the correct exercise folder (e.g exercise_11) is given.

"""
from google.colab import drive
import os

gdrive_path='/content/gdrive/MyDrive/i2dl/exercise_11'

# This will mount your google drive under 'MyDrive'
drive.mount('/content/gdrive', force_remount=True)
# In order to access the files in this notebook we have to navigate to the correct folder
os.chdir(gdrive_path)
# Check manually if all files are present
print(sorted(os.listdir()))
"""

### Set up PyTorch environment in Colab
- (OPTIONAL) Enable GPU via Runtime --> Change runtime type --> GPU
- Uncomment the following cell if you are using the notebook in Google Colab:

In [None]:
# Optional: install correct libraries in google colab
# !python -m pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
# !python -m pip install tensorboard==2.8.0
# !python -m pip install pytorch-lightning==1.6.0

# 0. Setup

As usual, we first import some packages to set up this notebook.

In [1]:
import os
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader

from exercise_code.rnn.sentiment_dataset import (
    create_dummy_data,
    download_data
)

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2


# 1. Preprocessing a Text Classification Dataset

As a starting point, let's load a dummy text classification dataset and get a sense of how it looks. We take these samples from the IMDb movie review dataset, which contains movie reviews with labels indicating whether they are negative (0) or positive (1). You will investigate this task further in the second notebook.

In this section, our goal is to create a text processing dataset. You are not required to write any code in this section. However, the concepts introduced here are very important for working with NLP datasets in the future as well as in the rest of this exercise. Take your time to understand the procedure.

First, let us download the data and take a look at some data samples.

In [None]:
i2dl_exercises_path = os.path.dirname(os.path.abspath(os.getcwd()))
data_root = os.path.join(i2dl_exercises_path, "datasets", "SentimentData")
path = download_data(data_root)
data = create_dummy_data(path)
for text, label in data:
    print('Text: {}'.format(text))
    print('Label: {}'.format(label))
    print()

## 1.1 Tokenizing Data

As seen above, we loaded 3 positive and 3 negative reviews. Since the basic semantic unit of text is a word, the first thing we need to do is to **tokenize** the dataset, which means converting each review to a list of words.

In [None]:
import re

# use regular expression to split the sentence
# check https://docs.python.org/3/library/re.html for more information
def tokenize(text):
    return [s.lower() for s in re.split(r'\W+', text) if len(s) > 0]

tokenized_data = []
for text, label in data:
    tokenized_data.append((tokenize(text), label))
    print(tokenized_data[-1], '\n')
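To make the splitting rule concrete, here is a minimal sketch that applies the same `tokenize` function to a made-up sentence (the sentence is an assumption, not taken from the dataset). It shows that punctuation is dropped, everything is lowercased, and contractions such as "It's" are split at the apostrophe.

```python
import re

def tokenize(text):
    # Same rule as in the cell above: split on runs of non-word characters,
    # lowercase every piece, and drop empty strings
    return [s.lower() for s in re.split(r'\W+', text) if len(s) > 0]

# Hypothetical example sentence, only for illustration
sample = "It's a GREAT movie, isn't it?!"
tokens = tokenize(sample)
print(tokens)
# ['it', 's', 'a', 'great', 'movie', 'isn', 't', 'it']
```

This simple rule is sufficient for the exercise, but it treats fragments such as `s` and `t` as words of their own.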

## 1.2 Creating a Vocabulary

We have converted the dataset into pairs of token lists and corresponding labels. But the tokens are still strings of varying length, which are hard to handle directly. It would be nicer to represent words with numbers. So, we need to create a <b>vocabulary</b>, which is a dictionary that maps each word to an integer id.

Large datasets contain too many distinct words, and most of them occur only rarely. A common way to tackle this problem is to keep only the N most common words in the dataset and thereby restrict the vocabulary size.

First, let's compute the word frequencies in our dummy dataset. To compute frequencies, we use the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) data structure.

In [None]:
from collections import Counter

freqs = Counter()
for tokens, _ in tokenized_data:
    freqs.update(tokens)

freqs

To create the dictionary, let's select the 20 most common words. In addition to the words that appear in our data, we need two special tokens:

- `<eos>`: end-of-sequence symbol, used for padding
- `<unk>`: stands in for words that are not in our vocabulary

In [None]:
vocab = {'<eos>': 0, '<unk>': 1}
for token, freq in freqs.most_common(20):
    vocab[token] = len(vocab)
vocab
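The same two steps, counting and then assigning ids, can be traced by hand on a tiny example. The sketch below uses made-up token lists (an assumption, not the dummy dataset) and keeps only the 2 most common words so that the result is easy to check.

```python
from collections import Counter

# Hypothetical tokenized reviews, only for illustration
toy_tokens = [['good', 'movie'], ['bad', 'movie'], ['good', 'good', 'plot']]

# Count how often each token occurs across all reviews
toy_freqs = Counter()
for tokens in toy_tokens:
    toy_freqs.update(tokens)

# Reserve ids 0 and 1 for the special tokens, then add the most common words
toy_vocab = {'<eos>': 0, '<unk>': 1}
for token, _ in toy_freqs.most_common(2):
    toy_vocab[token] = len(toy_vocab)

print(toy_vocab)
# {'<eos>': 0, '<unk>': 1, 'good': 2, 'movie': 3}
```

The words `bad` and `plot` did not make the cut, so they have no id of their own and will later be mapped to `<unk>`.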

## 1.3 Creating the Dataset

Putting it all together, we can now create a dataset class. First, let's create index-label pairs:

In [None]:
indexed_data = []
for tokens, label in tokenized_data:
    # tokens that are not in the vocabulary are mapped to <unk>
    indices = [vocab.get(token, vocab['<unk>']) for token in tokens]
    indexed_data.append((indices, label))

for indices, label in indexed_data:
    print(indices, ' -> ', label)
    print()
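A quick way to see what the indexing step keeps and what it loses is to map the indices back to tokens. The following sketch assumes a small hypothetical vocabulary and an `inverse_vocab` helper dictionary, neither of which is part of the exercise code.

```python
# Hypothetical small vocabulary, only for illustration
vocab = {'<eos>': 0, '<unk>': 1, 'good': 2, 'movie': 3}

# Helper (an assumption, not part of the exercise code): id -> token
inverse_vocab = {idx: token for token, idx in vocab.items()}

tokens = ['a', 'good', 'movie']

# Encode: unknown tokens fall back to the <unk> id
indices = [vocab.get(token, vocab['<unk>']) for token in tokens]

# Decode: the original word 'a' cannot be recovered
decoded = [inverse_vocab[i] for i in indices]

print(indices)  # [1, 2, 3]
print(decoded)  # ['<unk>', 'good', 'movie']
```

Every out-of-vocabulary word collapses to the same id, which is the price paid for restricting the vocabulary size.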

<div class="alert alert-success"> 
    <h3>Task: Check Code</h3>
    <p>We now use the PyTorch dataset class that we provide in the <code>exercise_code/rnn/sentiment_dataset.py</code> file. Please also take a look at the code.</p>
 </div>
    


The dataset class also sorts the sequences by length in descending order. Thanks to this sorting, sequences of similar length end up in the same batch, which reduces the total number of padded elements and hence the computation spent on padding values.

In [None]:
from exercise_code.rnn.sentiment_dataset import SentimentDataset

combined_data = [
    (raw_text, tokens, indices, label)
    for (raw_text, label), (tokens, _), (indices, _)
    in zip(data, tokenized_data, indexed_data)
]

dataset = SentimentDataset(combined_data)

for elem in dataset:
    print(elem)
    print()

## 1.4 Minibatching
Note that in the dataset we created, not all sequences have the same length. Therefore, we cannot minibatch the data trivially, which means we cannot use the `DataLoader` class out of the box.

<b>If you uncomment the following cell and run it, you will very likely get an error!</b>

In [None]:
# loader = DataLoader(dataset, batch_size=3)

# for batch in loader:
#     print(batch)
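The reason for the error can be reproduced in isolation. By default, `DataLoader` combines the tensors of a batch by stacking them, and stacking requires equal shapes. The sketch below uses two made-up index sequences (an assumption, not the actual dataset) and calls `torch.stack` directly.

```python
import torch

# Two hypothetical index sequences of different lengths
seqs = [torch.tensor([2, 3, 4]), torch.tensor([5, 6])]

try:
    torch.stack(seqs)
    stack_failed = False
except RuntimeError as err:
    # Stacking needs all tensors to have the same size
    stack_failed = True
    print('Stacking failed:', err)
```

Padding the shorter sequence to the length of the longer one removes the mismatch.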

<div class="alert alert-success"> 
    <h3>Task: Check Code</h3>
    <p>To solve the problem, we need to pad the sequences with <code>&lt;eos&gt;</code> tokens, which we indexed as zero. To integrate this approach into the PyTorch <a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader" target="_blank">DataLoader</a> class, we make use of the <code>collate_fn</code> argument. For more details, check out the <code>collate</code> function in <code>exercise_code/rnn/sentiment_dataset</code>. </p>
    <p> In addition, we use the <a href="https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html" target="_blank">pad_sequence</a> function, which pads shorter sequences with 0. </p>
 </div>
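As a small illustration of the padding step described above, the sketch below applies `pad_sequence` to two made-up sequences (an assumption, not the dataset). Note that with the default `batch_first=False`, the result has shape `(max_length, batch_size)`, so each column is one sequence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two hypothetical index sequences of lengths 4 and 2
seqs = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6])]

# Default: batch_first=False, padding_value=0 -> shape (max_length, batch_size)
padded = pad_sequence(seqs)
print(padded.shape)     # torch.Size([4, 2])
print(padded.tolist())  # [[1, 5], [2, 6], [3, 0], [4, 0]]

# With batch_first=True each row is one sequence -> shape (batch_size, max_length)
padded_bf = pad_sequence(seqs, batch_first=True)
print(padded_bf.tolist())  # [[1, 2, 3, 4], [5, 6, 0, 0]]
```

Because the padding value 0 coincides with the id of `<eos>`, the padded positions are exactly the `<eos>` tokens mentioned in the task.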

In [None]:
from torch.nn.utils.rnn import pad_sequence

def collate(batch):
    assert isinstance(batch, list)
    data = pad_sequence([b['data'] for b in batch])
    lengths = torch.tensor([len(b['data']) for b in batch])
    label = torch.stack([b['label'] for b in batch])
    return {
        'data': data,
        'label': label,
        'lengths': lengths
    }

loader = DataLoader(dataset, batch_size=3, collate_fn=collate)
for batch in loader:
    print('Data: \n', batch['data'])
    print('\nLabels: \n', batch['label'])
    print('\nSequence Lengths: \n', batch['lengths'])
    print('\n')

We can see that the two batches are padded to different lengths. This is how the descending sort mentioned in `1.3 Creating the Dataset` saves memory and computation.
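The saving can be quantified with a few lines of plain Python. The sketch below counts how many padding elements are needed when batching sequences of hypothetical lengths (the numbers are made up), once in their original order and once sorted by length in descending order. The helper `padded_elements` is an assumption introduced for this illustration.

```python
def padded_elements(lengths, batch_size):
    """Count the padding elements needed when consecutive sequences form batches."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence is padded up to the longest one in its batch
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical sequence lengths, only for illustration
lengths = [12, 3, 11, 4, 10, 5]

unsorted_pad = padded_elements(lengths, batch_size=3)
sorted_pad = padded_elements(sorted(lengths, reverse=True), batch_size=3)

print(unsorted_pad)  # 21
print(sorted_pad)    # 6
```

In this toy case sorting reduces the padding from 21 to 6 elements, because long and short sequences no longer share a batch.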

# 2. Embeddings

In the previous section, we explored how to convert text into a sequence of integers. In this form, the sequences are still not ready to be used as inputs to the RNNs you implemented in the optional notebook.

An integer id can be seen as a compact stand-in for a one-hot encoding, but the two are not equivalent: raw integers imply an order and a magnitude (id 7 would be "larger" than id 3), whereas one-hot vectors weight all words equally.
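The difference can be made visible by comparing pairwise distances. The sketch below takes three arbitrary ids (an assumption for illustration) and shows that the integer ids are unevenly spaced, while their one-hot vectors are all equally far apart.

```python
import torch
import torch.nn.functional as F

# Three hypothetical word ids from a vocabulary of size 10
ids = torch.tensor([2, 3, 9])

# Pairwise distances between the raw integers: depends on the arbitrary id values
int_dist = (ids[:, None] - ids[None, :]).abs()
print(int_dist.tolist())  # [[0, 1, 7], [1, 0, 6], [7, 6, 0]]

# Pairwise Euclidean distances between the one-hot vectors: always sqrt(2)
one_hot = F.one_hot(ids, num_classes=10).float()
oh_dist = (one_hot[:, None, :] - one_hot[None, :, :]).norm(dim=-1)
print(oh_dist)
```

Every off-diagonal entry of `oh_dist` is the square root of 2, so a one-hot encoding treats all pairs of distinct words as equally dissimilar.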

Moreover, neither representation expresses the semantic relations between words, and the numerical order of the ids has no meaning. We would like a representation that preserves the semantic meaning of each word. For example, as shown in the following picture, the difference between man and woman should be close to the difference between king and queen, since in both cases the difference is only the gender. If we use a vector for each word, this relation can be expressed as $vec(\text{woman})-vec(\text{man}) \approx vec(\text{queen}) - vec(\text{king})$. We usually call such vector representations embeddings.

<img src='https://developers.google.com/machine-learning/crash-course/images/linear-relationships.svg' width=80% height=80%/>

While one can use pre-trained embedding vectors such as [word2vec](https://arxiv.org/abs/1301.3781) or [GloVe](https://nlp.stanford.edu/projects/glove/), in this exercise we use randomly initialized embedding vectors that are trained from scratch together with our networks. As we train our model, it will learn the semantic relations between words.

<div class="alert alert-info">

<h3> Task: Implement Embedding</h3>
 <p>In this part, you will implement a simple embedding layer. An embedding is a simple lookup table that stores a dense vector to represent each word in the vocabulary.</p> 

 <p>Your task is to implement the <code>Embedding</code> class in the <code>exercise_code.rnn.rnn_nn</code> file. Once you are done, run the cell below to test your implementation. Note that we ensure that the <code>&lt;eos&gt;</code> embedding is zero by using the <code>padding_idx</code> argument.</p>

 </div>

In [None]:
import torch.nn as nn

from exercise_code.rnn.rnn_nn import Embedding
from exercise_code.rnn.tests import embedding_output_test


i2dl_embedding = Embedding(len(vocab), 16, padding_idx=0)
pytorch_embedding = nn.Embedding(len(vocab), 16, padding_idx=0)

loader = DataLoader(dataset, batch_size=len(dataset), collate_fn=collate)
for batch in loader:
    x = batch['data']

embedding_output_test(i2dl_embedding, pytorch_embedding, x)
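To see what `padding_idx` does in the PyTorch reference layer, the following sketch passes a small made-up batch through `nn.Embedding`. The vocabulary size of 22 and the index values are assumptions for illustration. The padded positions come out as zero vectors, and the padding row receives no gradient.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 22 vocabulary entries, 16-dimensional vectors
embedding = nn.Embedding(num_embeddings=22, embedding_dim=16, padding_idx=0)

# Made-up batch of shape (seq_len=3, batch_size=2); index 0 is padding
x = torch.tensor([[5, 7],
                  [3, 0],
                  [2, 0]])

out = embedding(x)
print(out.shape)  # torch.Size([3, 2, 16])

# The padded position maps to an all-zero vector
pad_is_zero = bool(torch.all(out[1, 1] == 0))
print(pad_is_zero)

# After a backward pass, the padding row of the weight matrix has zero gradient
out.sum().backward()
pad_grad_is_zero = bool(torch.all(embedding.weight.grad[0] == 0))
print(pad_grad_is_zero)
```

The output keeps the layout of the input and appends the embedding dimension, so a `(seq_len, batch_size)` batch becomes `(seq_len, batch_size, embedding_dim)`.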


# 3. Conclusion

In this notebook, you learned how to prepare text data and how to create an embedding layer. In the next notebook, you will combine your Embedding and RNN implementations to create a sentiment analysis network!