To perform sentence classification, and many other classification tasks for NLP, we need to do three main steps:

- Preprocess the data
- Prepare the dataloader
- Build the model

Of course, each of these steps involves several smaller steps, and each of them can be solved in many different ways.

To help you get started on this task, I will provide a fairly clean dataset, the Amazon Reviews one, which is widely available online and is also included in the `torchtext.datasets` module.

For this example, I will use only a small part of it, to give some guidance on how to start, without actually training a model on the whole dataset.

### Load the data



In [1]:
import pandas as pd

In [2]:
import spacy

In [4]:
from torchtext.datasets import AmazonReviewFull

In [7]:
AmazonReviewFull(root='/Users/franciscovarelacid/Desktop/Strive/datasets')

amazon_review_full_csv.tar.gz: 644MB [00:13, 48.9MB/s]
3000000lines [07:36, 6567.34lines/s]
3000000lines [10:41, 4677.75lines/s]
650000lines [02:27, 4416.54lines/s]


(<torchtext.datasets.text_classification.TextClassificationDataset at 0x7fdafa84d100>,
 <torchtext.datasets.text_classification.TextClassificationDataset at 0x7fdaf93d5280>)

In [8]:
df = pd.read_csv("/Users/franciscovarelacid/Desktop/Strive/datasets/amazon_review_full_csv/train.csv", nrows=10000, header=None)
df

Unnamed: 0,0,1,2
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...
...,...,...,...
9995,4,Needed additional equipment,We selected this product for my daughter becau...
9996,3,Installation & Rebate Problems,Installation took several hours and was frustr...
9997,1,Functionally worthless as of June 2009,The dual tuner feature won't work on this mode...
9998,1,Very Bad Experience,Tivo worked for two weeks using the tivo wirel...


In [9]:
df.rename({0:"star", 1:"rating1", 2:"rating2"}, axis=1, inplace=True)

Since we are going to predict the number of stars a certain product has got based on the semantics of the text, we could merge the title of the review together with the body of the review, just by concatenating them:

In [10]:
df["review"] = df["rating1"] + " " +  df["rating2"]

In [11]:
df

Unnamed: 0,star,rating1,rating2,review
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...,more like funchuck Gave this to my dad for a g...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...,Inspiring I hope a lot of people hear this cd....
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...,The best soundtrack ever to anything. I'm read...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...,Chrono Cross OST The music of Yasunori Misuda ...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...,Too good to be true Probably the greatest soun...
...,...,...,...,...
9995,4,Needed additional equipment,We selected this product for my daughter becau...,Needed additional equipment We selected this p...
9996,3,Installation & Rebate Problems,Installation took several hours and was frustr...,Installation & Rebate Problems Installation to...
9997,1,Functionally worthless as of June 2009,The dual tuner feature won't work on this mode...,Functionally worthless as of June 2009 The dua...
9998,1,Very Bad Experience,Tivo worked for two weeks using the tivo wirel...,Very Bad Experience Tivo worked for two weeks ...


and then of course we can drop the other two columns:

In [12]:
df.drop(columns=["rating1", "rating2"], inplace=True)

In [13]:
df

Unnamed: 0,star,review
0,3,more like funchuck Gave this to my dad for a g...
1,5,Inspiring I hope a lot of people hear this cd....
2,5,The best soundtrack ever to anything. I'm read...
3,4,Chrono Cross OST The music of Yasunori Misuda ...
4,5,Too good to be true Probably the greatest soun...
...,...,...
9995,4,Needed additional equipment We selected this p...
9996,3,Installation & Rebate Problems Installation to...
9997,1,Functionally worthless as of June 2009 The dua...
9998,1,Very Bad Experience Tivo worked for two weeks ...


👏

The `star` column is what we want to predict, given the text of the review. We are probably all Amazon users and know how many stars a rating can have, but let's double check:

In [14]:
df.star.unique()

array([3, 5, 4, 1, 2])

In [15]:
# Shift the labels from 1..5 to 0..4: the loss expects class indices starting at 0
df.star = df.star.apply(lambda x: int(x) - 1)

Ok, now that our data are in order, we need to preprocess them. We can take advantage of spaCy for basically all of the steps:

In [16]:
nlp = spacy.load("en_core_web_sm")

Let's create a function that, given a sentence, preprocesses it by:
- tokenizing it
- removing stopwords
- removing special characters and punctuation
- making everything lower case
- lemmatizing it

With spacy, we can do it in a very compact form:

In [17]:
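def preprocessing(sentence):
    """
    Tokenize a sentence, drop punctuation and stop words, and lemmatize the rest.

    :param sentence: a str containing the sentence we want to preprocess
    :return: the list of lemmas (str)
    """
    doc = nlp(sentence)  # spaCy tokenizes and tags the text
    # Keep the lemma of every token that is neither punctuation nor a stop word.
    # Casing is whatever the lemmatizer returns: there is no explicit .lower() call.
    tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]
    return tokens
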

In [18]:
preprocessing("This is an example! Hello")

['example', 'hello']
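If you want to see the same steps without any library, here is a rough sketch using only the standard library. It is an approximation, not a replacement for spaCy: the stop-word set is a tiny made-up list, the tokenizer is a simple regular expression, and there is no lemmatization at all. The name `simple_preprocessing` is hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; spaCy's list is much larger).
STOP_WORDS = {"a", "an", "the", "this", "is", "are", "was", "to", "of", "and"}


def simple_preprocessing(sentence):
    """Lowercase, keep runs of letters/digits as tokens, drop stop words.

    Unlike the spaCy version, this does NOT lemmatize.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(simple_preprocessing("This is an example! Hello"))
```

On this particular sentence it gives the same tokens as the spaCy pipeline, but on real reviews the two will differ (plurals, verb forms, contractions).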

The preprocessing phase is not finished yet. We want to create a neural network, and a neural network works with numbers (as computers do in general).

So we need to use embeddings to transform a sentence into a tensor. Each token is mapped to a one-dimensional vector, which in the following example has size 300. This means that a sentence of 10 tokens (after preprocessing) becomes a tensor of shape $10\times 300$. There is one more dimension, the batch size, so the model you train and run receives as input a tensor of shape:

`batch_size × length_of_the_sentence × embedding_size`.

Let's do things in order:

In [19]:
import torch
from collections import Counter
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm, tqdm_notebook

If you are using the whole dataset, you should not need to split it into train and test, because it already comes split. If you are using only a part of it, or any other dataset, remember to split it into train and test (and possibly validation).
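When you do have to split by yourself, shuffling first avoids inheriting whatever order the file happens to have. Below is a minimal sketch with the standard library only; the function name `train_val_test_split`, the default fractions and the seed are my own choices, not something the dataset prescribes.

```python
import random


def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Split row positions rather than the rows themselves
train_rows, val_rows, test_rows = train_val_test_split(range(10000))
print(len(train_rows), len(val_rows), len(test_rows))
```

With a dataframe, the resulting positions can then be used with `df.iloc[train_rows]` and so on. In the cell below I keep things simple and just slice the first 8000 rows.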

In [20]:
train_df, test_df = df.iloc[:8000], df.iloc[8000:]

To get the vectors for each token, we are going to use some pretrained embeddings. Specifically, we are going to use the FastText embeddings that you can find at this link https://pytorch.org/text/stable/vocab.html#fasttext .

We need to download and load them by doing:

In [23]:
from torchtext.vocab import FastText

In [24]:
fasttext = FastText("simple")

.vector_cache/wiki.simple.vec: 293MB [00:27, 10.8MB/s]                           
100%|██████████| 111051/111051 [00:22<00:00, 4967.99it/s]


You can run `help(fasttext)` and/or `dir(fasttext)` to get more info about the methods and the attributes this object contains.

In [25]:
dir(fasttext)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'cache',
 'dim',
 'get_vecs_by_tokens',
 'itos',
 'stoi',
 'unk_init',
 'url_base',
 'vectors']

I want to highlight a couple of things:

- `dim` is the dimension of the vectors (in our case it is 300)
- `itos` stands for *index to string*: it is a list that maps an integer to the corresponding string. The reason for having such a mapping is that it's much lighter to store integers and use them to index the vectors than to store a string per word. (In addition, heuristics can be used so that the most frequent words get lower indices, resulting in better memory management. I know, it sounds like a minor thing, but the model is going to perform billions of operations!)
- `stoi` is the opposite: it's a dictionary that, given the string, returns the index
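To make the `itos`/`stoi` pair concrete, here is a toy sketch that builds both from a small corpus, giving lower indices to more frequent words. It only illustrates the idea: I am not claiming this is how the FastText vocabulary was built, and `build_vocab` and `toy_corpus` are hypothetical names.

```python
from collections import Counter


def build_vocab(tokenized_sentences):
    """Return (itos, stoi), with the most frequent tokens getting the lowest indices."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = [tok for tok, _ in counts.most_common()]    # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # string -> index
    return itos, stoi


toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
itos, stoi = build_vocab(toy_corpus)
print(itos)
print(stoi["sat"], itos[stoi["sat"]])
```

The two structures are inverses of each other: `itos[stoi[word]] == word` for every word in the vocabulary.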



Indexing the object with a word, as in `fasttext["hello"]`, returns the embedding associated with that word. Let's inspect its shape:

In [26]:
fasttext["hello"].shape

torch.Size([300])

300, as anticipated. 

Let's inspect what's the index associated with "hello":

In [27]:
fasttext.stoi["hello"]

2610

And here is the string-to-index dictionary itself (output truncated):

In [28]:
fasttext.stoi

{'</s>': 0,
 '.': 1,
 ',': 2,
 'the': 3,
 'of': 4,
 "'": 5,
 'in': 6,
 '-': 7,
 'and': 8,
 ')': 9,
 '(': 10,
 'a': 11,
 'to': 12,
 'is': 13,
 'was': 14,
 'it': 15,
 'for': 16,
 'on': 17,
 's': 18,
 'as': 19,
 'that': 20,
 'from': 21,
 'by': 22,
 'he': 23,
 'are': 24,
 'with': 25,
 'this': 26,
 '–': 27,
 'be': 28,
 'an': 29,
 'at': 30,
 'or': 31,
 'i': 32,
 'not': 33,
 'people': 34,
 '}': 35,
 'other': 36,
 'they': 37,
 'his': 38,
 'american': 39,
 'have': 40,
 'has': 41,
 'utc': 42,
 'also': 43,
 'one': 44,
 'were': 45,
 'which': 46,
 'but': 47,
 'can': 48,
 'talk': 49,
 'there': 50,
 'first': 51,
 '#': 52,
 'new': 53,
 'united': 54,
 'about': 55,
 'you': 56,
 'their': 57,
 'may': 58,
 'all': 59,
 'she': 60,
 'd': 61,
 'when': 62,
 'after': 63,
 'had': 64,
 'states': 65,
 'who': 66,
 'made': 67,
 'more': 68,
 'if': 69,
 'born': 70,
 'used': 71,
 'many': 72,
 'city': 73,
 'some': 74,
 'time': 75,
 'websites': 76,
 'two': 77,
 't': 78,
 'its': 79,
 'most': 80,
 'called': 81,
 'b': 82,
 ...

We can create an *encoder* that transforms each word into an integer:

In [29]:
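def token_encoder(token, vec):
    """Return the vocabulary index of `token` (1 for padding, 0 if unknown)."""
    if token == "<pad>":
        return 1  # index reserved for padding
    try:
        return vec.stoi[token]
    except KeyError:
        return 0  # index reserved for out-of-vocabulary tokens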

In [30]:
def encoder(tokens, vec):
    return [token_encoder(token, vec) for token in tokens]

In [31]:
text = "Antonio is learning Python"
encoder(preprocessing(text), fasttext)

[0, 1660, 0]

Why all those zeros?
Well, the function we have defined contains a try/except that basically says: if the word is not in the vocabulary, return the index 0. So `Antonio` and `Python`, in the form produced by our preprocessing, are not keys of the FastText vocabulary! (Note that the keys listed above are all lowercase.)
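One possible way to reduce the number of unknown tokens is to retry the lookup with the lowercase form before giving up. The sketch below shows the idea on a made-up dictionary: `toy_stoi` and its indices are hypothetical, and I have not checked which of these words the real FastText vocabulary contains.

```python
def encode_with_fallback(token, stoi, unk_index=0):
    """Look up `token`; if missing, retry in lowercase; otherwise return `unk_index`."""
    if token in stoi:
        return stoi[token]
    return stoi.get(token.lower(), unk_index)


# Hypothetical toy vocabulary, only to show the behaviour
toy_stoi = {"</s>": 0, ".": 1, "python": 2, "learn": 3}
tokens = ["Antonio", "learn", "Python"]
print([encode_with_fallback(tok, toy_stoi) for tok in tokens])
```

Here `Python` is recovered through its lowercase form, while `Antonio` is still mapped to the unknown index because no form of it is in the toy vocabulary.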


What about the `<pad>` thing? 

Well, not all the reviews have the same length, so we need to find a solution for that. Why? Because our neural network expects inputs that all have the same size: it needs to know how many weights to initialize!

There are several possibilities, but the easiest is to set a cap with a `max_seq_len` parameter: all the reviews shorter than that length are padded using the vector associated with the padding index, and all the ones longer than `max_seq_len` are simply cut.

Do you see any problems? I actually don't see many. I think the sentiment of a review can usually be seen already from its first words.
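If you prefer to check that intuition on your own data, you can measure how much text a given `max_seq_len` throws away. This is a small sketch with invented token counts; `truncation_report` is a hypothetical helper, and in practice `lengths` would be the number of tokens of each preprocessed review.

```python
def truncation_report(lengths, max_seq_len):
    """Summarise how many sequences get cut and how many tokens survive the cut."""
    n_truncated = sum(1 for n in lengths if n > max_seq_len)
    kept_tokens = sum(min(n, max_seq_len) for n in lengths)
    total_tokens = sum(lengths)
    return {
        "truncated_fraction": n_truncated / len(lengths),
        "kept_token_fraction": kept_tokens / total_tokens,
    }


toy_lengths = [5, 12, 40, 64, 8]  # made-up token counts
report = truncation_report(toy_lengths, max_seq_len=32)
print(report)
```

If most reviews are cut and only a small fraction of tokens is kept, a larger `max_seq_len` is probably worth the extra memory.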

In the encoder, `<pad>` is a made-up token that is very unlikely to be part of the text. I assigned the index 1 to it.

You may ask: what happens to the tokens that originally sit at indices 0 and 1? Well, let's inspect them:

In [32]:
fasttext.itos[0], fasttext.itos[1]

('</s>', '.')

After our preprocessing pipeline they can never appear, so we are fine with that!

Now let's create a function for padding:

In [33]:
def padding(list_of_indexes, max_seq_len, padding_index=1):
    output = list_of_indexes + (max_seq_len - len(list_of_indexes))*[padding_index]
    return output[:max_seq_len]

In [34]:
text = "this is a sample review"
list_of_indexes = encoder(preprocessing(text), fasttext)
list_of_indexes

[3697, 1363]

In [35]:
padding(list_of_indexes, max_seq_len=10)

[3697, 1363, 1, 1, 1, 1, 1, 1, 1, 1]

In this way, any sentence shorter than 10 tokens is padded to length 10, and anything longer...

In [36]:
text = "this is a sample review this is a sample review this is a sample review this is a sample review this is a sample review v this is a sample review this is a sample review this is a sample review this is a sample review this is a sample review"
list_of_indexes = encoder(preprocessing(text), fasttext)
padding(list_of_indexes, max_seq_len=10)

[3697, 1363, 3697, 1363, 3697, 1363, 3697, 1363, 3697, 1363]

...simply gets cut to ten!

All right. I feel confident enough to say that we have all of what we need for the preprocessing part!

Now we need to create the:


### Data Loader

Yes, they are back. [Is it a good or a bad memory?](https://github.com/Strive-School/ai_mar21/blob/main/M5_Deep_Learning/D7/Custom%20DataLoader%20and%20Dataset.ipynb)

If you take a look at that notebook, you will remember that to create a custom data loader you need to override some methods of the `Dataset` class from `torch.utils.data`. Before doing so, let's define the steps we need to perform while loading the data:

- Receive as input the dataframe that we have defined above, which contains two columns: "star" and "review"
- Separate "star" from "review"
- Preprocess the "review" column by doing what we have done so far (tokenization etc., but excluding the embeddings for now)
- Pad the sequences
- Store the list of index sequences together with the associated labels

Then we also need to override the `__len__` and `__getitem__` methods of the `Dataset` class.

Ok, stop talking, more action:

In [37]:
class TrainData(Dataset):
    def __init__(self, df, max_seq_len=32):
        # df is the input dataframe; max_seq_len is the maximum length allowed
        # for a sentence before it gets cut or padded
        self.max_seq_len = max_seq_len

        self.vec = FastText("simple")
        # Replace the vector at index 1 (padding) with a vector of -1
        self.vec.vectors[1] = -torch.ones(self.vec.dim)
        # Replace the vector at index 0 (unknown token) with zeros
        self.vec.vectors[0] = torch.zeros(self.vec.dim)
        self.vectorizer = lambda x: self.vec.vectors[x]
        # Store the labels as a plain list, so that labels[i] is positional
        # whatever the index of the dataframe is
        self.labels = df.star.tolist()
        self.sequences = [
            padding(encoder(preprocessing(sequence), self.vec), max_seq_len)
            for sequence in df.review.tolist()
        ]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        assert len(self.sequences[i]) == self.max_seq_len
        return self.sequences[i], self.labels[i]

In [38]:
dataset = TrainData(train_df, max_seq_len=32)

When we index the dataset with the `dataset[index]` notation, we get a pair containing the padded sequence of indices and the associated label:

In [43]:
import numpy as np

In [51]:
dataset[0]

([90,
  0,
  631,
  7404,
  14711,
  6168,
  1924,
  0,
  216,
  0,
  5577,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 2)

In [52]:
dataset[1][0]

[22986,
 1659,
 477,
 34,
 3282,
 2225,
 338,
 859,
 1753,
 22761,
 90,
 297,
 2942,
 3680,
 5133,
 999,
 1607,
 7344,
 690,
 10531,
 741,
 906,
 47858,
 5487,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]

What are the ones there? They are the product of the padding! 

What is the vector associated with the index 1?

In [53]:
dataset.vec.vectors[1]

tensor([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., ...

All negative ones! Makes sense! This is what we have defined!

Storing in memory a lot of tensors containing all the embedded vectors can be very costly. This is why we store integer indices and use them to look the vectors up. However, when we train our model, we need the embedded vectors!

So let's define the `collate` function, which will index our vocabulary only when it is needed!

As arguments it takes the batch (a list of `batch_size` pairs, each made of a sequence of `max_seq_len` indices and its label) and the vectorizer. What is the vectorizer in our case? It's the one we have built in the `TrainData` class, which returns the vector associated with an index.

In [54]:
def collate(batch, vectorizer=dataset.vectorizer):
    inputs = torch.stack([torch.stack([vectorizer(token) for token in sentence[0]]) for sentence in batch])
    target = torch.LongTensor([item[1] for item in batch]) # Use long tensor to avoid unwanted rounding
    return inputs, target
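The same lazy-lookup idea, stripped of PyTorch, fits in a few lines. This is only a sketch to show the shapes involved: `toy_collate`, `toy_vectors` and `toy_batch` are invented, and the embeddings have size 2 instead of 300.

```python
def toy_collate(batch, vectors):
    """Turn a list of (index_sequence, label) pairs into (embedded_sequences, labels)."""
    inputs = [[vectors[idx] for idx in seq] for seq, _ in batch]
    targets = [label for _, label in batch]
    return inputs, targets


# Hypothetical lookup table: index 0 = unknown, index 1 = padding, index 2 = a real word
toy_vectors = [[0.0, 0.0], [-1.0, -1.0], [0.3, 0.7]]
toy_batch = [([2, 0, 1], 4), ([2, 2, 1], 0)]

inputs, targets = toy_collate(toy_batch, toy_vectors)
shape = (len(inputs), len(inputs[0]), len(inputs[0][0]))
print(shape)    # batch_size, max_seq_len, embedding_size
print(targets)
```

The dataset only ever stores the small integer sequences; the vectors are materialised one batch at a time.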

And now, we can use the `DataLoader` class as we did for images:

In [55]:
batch_size = 16
train_loader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate)


In [56]:
next(iter(train_loader))[0].shape

torch.Size([16, 32, 300])
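To get a feeling for why we embed one batch at a time, here is a back-of-the-envelope estimate. It assumes 4-byte floats for the embedded vectors and 8-byte integers for the indices (both assumptions; our dataset actually keeps the indices in Python lists), and `n_bytes` is a hypothetical helper.

```python
def n_bytes(shape, bytes_per_value):
    """Memory needed for a dense array of the given shape."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total


n_reviews, max_seq_len, emb_dim = 8000, 32, 300

as_indices = n_bytes((n_reviews, max_seq_len), 8)           # assuming int64 indices
as_vectors = n_bytes((n_reviews, max_seq_len, emb_dim), 4)  # assuming float32 vectors

print(f"indices: {as_indices / 1e6:.1f} MB")
print(f"vectors: {as_vectors / 1e6:.1f} MB")
print(f"ratio:   {as_vectors // as_indices}x")
```

For our 8000 training reviews the difference is already two orders of magnitude, and it grows linearly with the number of reviews.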

Ready to train? The following is a small model, just to *make things run on my computer*. You can expect to be kicked out if you come to the debrief with this model!



In [57]:
from torch import nn
import torch.nn.functional as F
emb_dim = 300
class Classifier(nn.Module):
    def __init__(self, max_seq_len, emb_dim, hidden1=16, hidden2=16):
        super(Classifier, self).__init__()
        self.fc1 = nn.Linear(max_seq_len*emb_dim, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, 5)
        self.out = nn.LogSoftmax(dim=1)
    
    
    def forward(self, inputs):
        x = F.relu(self.fc1(inputs.squeeze(1).float()))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return self.out(x)

In [71]:
MAX_SEQ_LEN = 32
model = Classifier(MAX_SEQ_LEN, 300, 16, 16)
model

Classifier(
  (fc1): Linear(in_features=9600, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=16, bias=True)
  (fc3): Linear(in_features=16, out_features=5, bias=True)
  (out): LogSoftmax()
)

In [59]:
from torch import optim
criterion = nn.NLLLoss()

# Train all the parameters of the classifier (the pretrained embeddings are looked up in collate and are not part of the model)
optimizer = optim.Adam(model.parameters(), lr=0.003)


In [60]:
dataiter = iter(train_loader)
sentences, labels = next(dataiter)

In [61]:
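# Forward pass through the network for a single sentence
sentence_idx = 0
# Flatten each sentence to a (1, MAX_SEQ_LEN * emb_dim) row
sentences = sentences.view(sentences.size(0), 1, MAX_SEQ_LEN * emb_dim)
log_ps = model(sentences[sentence_idx])

# The model returns log-probabilities: exponentiate them to get probabilities
torch.exp(log_ps)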

tensor([[0.2050, 0.1258, 0.1624, 0.2720, 0.2347]], grad_fn=<ExpBackward>)

We got 5 probabilities: one for each possible star rating!
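Why do we exponentiate? The last layer of the model is a log-softmax, so its outputs are log-probabilities. Here is a plain-Python sketch of that computation; the `logits` are invented numbers, not values taken from our model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of scores."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


logits = [0.2, -0.3, 0.0, 0.5, 0.35]  # made-up scores, one per star rating
log_ps = log_softmax(logits)
probs = [math.exp(lp) for lp in log_ps]

print([round(p, 4) for p in probs])
print(sum(probs))
```

The exponentiated values are all positive and sum to 1 (up to floating point error), which is why we can read them as probabilities.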

In [62]:
epochs = 3
print_every = 40

for e in range(epochs):
    running_loss = 0
    print(f"Epoch: {e+1}/{epochs}")

    for i, (sentences, labels) in enumerate(train_loader):

        # Flatten each sentence to a vector of size MAX_SEQ_LEN * emb_dim
        sentences = sentences.view(sentences.size(0), MAX_SEQ_LEN * emb_dim)

        optimizer.zero_grad()

        output = model(sentences)        # 1) Forward pass
        loss = criterion(output, labels) # 2) Compute loss
        loss.backward()                  # 3) Backward pass
        optimizer.step()                 # 4) Update model

        running_loss += loss.item()

        # At i == 0 only one batch has been accumulated, but the sum is still
        # divided by print_every: this is why the first value printed is so small
        if i % print_every == 0:
            print(f"\tIteration: {i}\t Loss: {running_loss/print_every:.4f}")
            running_loss = 0

Epoch: 1/3
	Iteration: 0	 Loss: 0.0397
	Iteration: 40	 Loss: 1.6734
	Iteration: 80	 Loss: 1.6169
	Iteration: 120	 Loss: 1.6101
	Iteration: 160	 Loss: 1.6115
	Iteration: 200	 Loss: 1.6039
	Iteration: 240	 Loss: 1.6004
	Iteration: 280	 Loss: 1.5890
	Iteration: 320	 Loss: 1.5909
	Iteration: 360	 Loss: 1.6011
	Iteration: 400	 Loss: 1.6019
	Iteration: 440	 Loss: 1.6034
	Iteration: 480	 Loss: 1.5870
Epoch: 2/3
	Iteration: 0	 Loss: 0.0387
	Iteration: 40	 Loss: 1.5754
	Iteration: 80	 Loss: 1.5704
	Iteration: 120	 Loss: 1.5544
	Iteration: 160	 Loss: 1.5138
	Iteration: 200	 Loss: 1.5313
	Iteration: 240	 Loss: 1.5267
	Iteration: 280	 Loss: 1.5412
	Iteration: 320	 Loss: 1.5371
	Iteration: 360	 Loss: 1.5129
	Iteration: 400	 Loss: 1.5101
	Iteration: 440	 Loss: 1.5274
	Iteration: 480	 Loss: 1.5114
Epoch: 3/3
	Iteration: 0	 Loss: 0.0365
	Iteration: 40	 Loss: 1.4899
	Iteration: 80	 Loss: 1.4274
	Iteration: 120	 Loss: 1.4591
	Iteration: 160	 Loss: 1.3636
	Iteration: 200	 Loss: 1.3477
	...
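A useful reference point for reading these numbers: with 5 classes, a model that assigns the same probability to every class has a negative log-likelihood of ln(5), about 1.61. The sketch below computes it; `nll` is a hypothetical helper working on plain probabilities, not the PyTorch criterion.

```python
import math


def nll(probs, target):
    """Negative log-likelihood of the target class under `probs`."""
    return -math.log(probs[target])


n_classes = 5
uniform = [1.0 / n_classes] * n_classes
chance_loss = nll(uniform, target=2)  # same value for any target

print(f"{chance_loss:.4f}")
```

So the values around 1.6 in the first epoch mean that the model is basically guessing, and the slow decrease afterwards means it is starting to do slightly better than chance on the training data.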

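Finally, let's build the test set. `test_df` keeps the row labels 8000 to 9999 of the original dataframe, and `TrainData.__getitem__` reads the labels with `self.labels[i]`: on a pandas Series that still carries the old index, `test_dataset[0]` raises `KeyError: 0`. Resetting the index first avoids that:

In [73]:
test_dataset = TrainData(test_df.reset_index(drop=True), max_seq_len=32)

In [76]:
test_dataset

<__main__.TrainData at 0x7fd96e6ca2e0>

The test loader needs the same `collate` function as the train loader, otherwise the batches would contain token indices rather than embedded vectors:

In [84]:
test_batch_size = 16
test_loader = DataLoader(test_dataset, batch_size=test_batch_size, collate_fn=collate)

In [85]:
next(iter(test_loader))[0].shape

In [None]:
test_dataiter = iter(test_loader)
sentences, labels = next(test_dataiter)

In [None]:
for i, (sentences, labels) in enumerate(test_loader):
    print(sentences)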

In [None]:
model.eval()  # evaluate the model we trained above

correct = 0
total = 0
with torch.no_grad():
    for sentences, labels in test_loader:
        # Flatten each sentence, exactly as in the training loop
        sentences = sentences.view(sentences.size(0), MAX_SEQ_LEN * emb_dim)
        outputs = model(sentences)
        _, predicted = torch.max(outputs, 1)  # index of the most likely class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the %d test reviews: %d %%' % (total,
    100 * correct / total))
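An accuracy number is easier to judge next to a trivial baseline, for example always predicting the most frequent class. Here is a small sketch on invented labels and predictions; `accuracy` and `majority_baseline` are hypothetical helpers, and the toy lists are not results of our model.

```python
from collections import Counter


def accuracy(predicted, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


toy_labels = [0, 4, 4, 2, 4, 1]      # made-up ground truth
toy_predicted = [0, 4, 3, 2, 4, 4]   # made-up predictions

print(f"accuracy: {accuracy(toy_predicted, toy_labels):.2f}")
print(f"majority baseline: {majority_baseline(toy_labels):.2f}")
```

If your model does not beat the majority baseline on the test set, it has not learned much from the text yet.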

### Exercises

- Create a real training process: use a train/validation/test split of the dataset
- Create a training loop that includes validation, with a test at the end
    - You can borrow from your previous work, no need to write it from scratch
- If you want to, feel free to use a different dataset
