To perform sentence classification, and many other classification tasks for NLP, we need to do three main steps:

- Preprocess the data
- Prepare the data loader
- Build the model

Of course, each of these steps requires many smaller steps, and each can be solved in many different ways.

To help you get started on this task, I will use a fairly clean dataset, Amazon Reviews, which is widely available online and is also included in the `torchtext.datasets` module.

For this example, I will use only a small part of it, to give some guidance on how to start without actually training on the whole dataset.

### Load the data



In [1]:
import pandas as pd

In [2]:
import spacy

In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 8.6 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import torchtext

In [4]:
from torchtext.datasets import AmazonReviewFull
train, test = AmazonReviewFull()

amazon_review_full_csv.tar.gz: 644MB [00:23, 27.6MB/s]


In [5]:
label_list = []
review_list = []

for label, rev in train:
    label_list.append(label)
    review_list.append(rev)

print(label_list[:5], review_list[:5])

[3, 5, 5, 4, 5] ['more like funchuck Gave this to my dad for a gag gift after directing "Nunsense," he got a reall kick out of it!', 'Inspiring I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature.', "The best soundtrack ever to anything. I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.", 'Chrono Cross OST The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wo

In [6]:
df_dict = {"star":label_list[:3000],"review":review_list[:3000]}
df = pd.DataFrame(df_dict)

In [61]:
df.to_csv('df_amazon_3000.csv')

In [7]:
# df = pd.read_csv("test.csv", nrows=3000, header=None)
df

Unnamed: 0,star,review
0,3,more like funchuck Gave this to my dad for a g...
1,5,Inspiring I hope a lot of people hear this cd....
2,5,The best soundtrack ever to anything. I'm read...
3,4,Chrono Cross OST The music of Yasunori Misuda ...
4,5,Too good to be true Probably the greatest soun...
...,...,...
2995,1,"An Example of What is ""Classically"" Wrong with..."
2996,1,Do not buy this HP 960 printer from ANTonline ...
2997,3,color ink cartridge I've had this printer for ...
2998,2,"Good printer, so so on photos, crap for envelo..."


In [8]:
# df.rename({0:"star", 1:"rating1", 2:"rating2"}, axis=1, inplace=True)

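Since we are going to predict the number of stars a review gives based on the semantics of its text, we could merge the title of the review with its body, just by concatenating them. The reviews loaded above already come as a single text field, so the cells below are commented out; they are only useful if you read the raw CSV file, where title and body are separate columns: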

In [9]:
# df["review"] = df["rating1"] + " " +  df["rating2"]

In [10]:
# df

and then of course we can drop the other two columns:

In [11]:
# df.drop(columns=["rating1", "rating2"], inplace=True)

In [12]:
# df

👏

The `star` column is what we want to predict, given the text of the review. I think we are all Amazon users and know how many stars a rating can have, but let's double check:

In [13]:
df.star.unique()

array([3, 5, 4, 1, 2])

Ok, now that our data are in order, we need to preprocess them. We can take advantage of spacy for basically all of the steps:

In [14]:
nlp = spacy.load("en_core_web_sm")

Let's create a function that, given a sentence, preprocesses it by:
- tokenizing it
- removing stopwords
- removing special characters/punctuation
- removing tokens made only of digits
- making everything lower case
- lemmatizing it

With spacy, we can do it in a very compact form:

In [15]:
# def preprocessing(sentence):
#     """
#     params sentence: a str containing the sentence we want to preprocess
#     return the tokens list
#     """
#     doc = nlp(sentence)
#     tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]
#     return tokens

def preprocessing(sentence):
    """
    params sentence: a str containing the sentence we want to preprocess
    return the tokens list
    """
    doc = nlp(sentence)
    tokens = [
        token.lemma_.lower()
        for token in doc
        # keep ordinary words: drop punctuation (commas, question marks, ...), stop words and digits,
        # but always keep tokens containing "not", because negation flips the sentiment (e.g. "not good")
        if (not token.is_punct and not token.is_stop and not token.text.isdigit())
        or "not" in token.text.lower()
    ]

    return tokens
    

In [16]:
preprocessing("This is an example! Alessio 999 not Hello")

['example', 'alessio', 'not', 'hello']

The preprocessing phase is not finished yet. We want to create a neural network, and a neural network works with numbers. In general, computers work with numbers...

So we need embeddings to transform a sentence into a tensor. Each token is mapped to a one-dimensional vector, which in the following example has size 300. This means that a sentence of 10 tokens (after preprocessing) is represented by a tensor of shape $10\times 300$. On top of that there is one more dimension, the batch size, so you will train and run a model that receives as input a tensor of shape:

`batch_size*length_of_the_sentence*embedding_size`.
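To make these shapes concrete, here is a small sketch that uses nested Python lists instead of tensors. The helper `shape` and the value `batch_size = 4` are my own assumptions, chosen only for illustration.

```python
def shape(nested):
    """Return the dimensions of a regularly nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

embedding_size = 300
sentence_length = 10   # tokens left after preprocessing
batch_size = 4         # illustrative value

# One sentence: one 300-dimensional vector per token
sentence = [[0.0] * embedding_size for _ in range(sentence_length)]
# One batch: several sentences of the same length stacked together
batch = [[[0.0] * embedding_size for _ in range(sentence_length)]
         for _ in range(batch_size)]

print(shape(sentence))  # (10, 300)
print(shape(batch))     # (4, 10, 300)
```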

Let's do things in order:

In [17]:
import torch
from collections import Counter
from torchtext.vocab import Vocab
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm, tqdm_notebook

If you are using the whole dataset, you should not need to split it into train and test, because it already comes split. If you are using only a part of it, or any other dataset, remember to split it into train and test (and possibly validation).
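If you also want a validation set, a shuffled three-way split could look like the sketch below. The function name `train_val_test_split`, the fractions and the seed are assumptions of mine, not something this notebook uses.

```python
import random

def train_val_test_split(rows, val_fraction=0.1, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into train, validation and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed, so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Split 3000 row positions, the same number of reviews kept in this notebook
train_rows, val_rows, test_rows = train_val_test_split(range(3000))
print(len(train_rows), len(val_rows), len(test_rows))  # 2100 300 600
```

With a dataframe, the resulting positions could then be passed to `df.iloc[...]`.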

In [18]:
train_df, test_df = df.iloc[:2000], df.iloc[2000:]

Now we need to create a vocabulary. What does it mean? We need to keep track of all the tokens that occur in our dataset, so that we can refer to each of them with a number instead of a string.

To do so, we take advantage of the `Counter` class from the `collections` library, and then pass the counter to the `Vocab` class (https://pytorch.org/text/stable/vocab.html), which gives us a fast way to look up the dictionary we have created. Don't forget to update the counter *after* applying the `preprocessing` function that you created before.

In [19]:
counter = Counter()
train_iter = iter(train_df.review.values)
for text in train_iter:
    counter.update(preprocessing(text))


In [22]:
vocab = Vocab(counter)

I kept `min_freq` at 1 (the default), which means that I'm keeping in the vocabulary all the terms that occur at least once. There are cases in which you want to filter out rare words, but in this reviews dataset I don't think it's a good idea.
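To see what `min_freq` does, here is a rough plain-Python sketch of a vocabulary builder. It is not the `torchtext` implementation: `build_vocab` is a hypothetical helper, and the ordering (special tokens first, then tokens by decreasing frequency) is my assumption about how such a vocabulary is typically laid out.

```python
from collections import Counter

def build_vocab(counter, min_freq=1, specials=("<unk>", "<pad>")):
    """Map tokens to integer indices: special tokens first, then most frequent first."""
    itos = list(specials)
    # Sort alphabetically, then by descending frequency (the stable sort keeps ties alphabetical)
    for token, freq in sorted(sorted(counter.items()), key=lambda pair: -pair[1]):
        if freq >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos

toy_counter = Counter(["good", "book", "good", "bad", "book", "good", "rare"])
stoi, itos = build_vocab(toy_counter, min_freq=2)
print(itos)          # ['<unk>', '<pad>', 'good', 'book']
print(stoi["book"])  # 3
```

With `min_freq=2` the tokens seen only once ("bad" and "rare") are dropped; with `min_freq=1` they would all be kept.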

*Tip:* I'd rather improve the preprocessing, for example with a spellchecker that corrects typos, so that misspelled words are mapped to the right token. However, in practice, it's pretty hard to find a good spellchecker.

Let's check what the `vocab` we just created does:

In [25]:
text = preprocessing("hello world")

[vocab[x] for x in text]


[2496, 101]

In [26]:
vocab['book']

2

In [47]:
counter.most_common(10)



In [27]:
print(type(vocab))


<class 'torchtext.vocab.Vocab'>


It seems it transformed each token into a number. How do we get the token back as a string?

We can use the `.itos` attribute, which stands for *index to string*. You can find it among the attributes of the object:

In [29]:
dir(vocab)

['UNK',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_default_unk_index',
 'extend',
 'freqs',
 'itos',
 'load_vectors',
 'lookup_indices',
 'set_vectors',
 'stoi',
 'unk_index',
 'vectors']
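The round trip between tokens and indices can be sketched with a plain list and a plain dict. The tokens and indices below are toy values I made up, not the ones in the real vocabulary.

```python
# index -> token: a list, so the position of a token is its index (toy values)
itos = ["<unk>", "<pad>", "book", "good"]
# token -> index: the inverse mapping, stored as a dict
stoi = {token: index for index, token in enumerate(itos)}

indices = [stoi[token] for token in ["good", "book"]]
tokens = [itos[index] for index in indices]
print(indices)  # [3, 2]
print(tokens)   # ['good', 'book']
```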

👏

These things are cool, but the most important feature of `Vocab` in the `torchtext` library is that it can load pretrained embeddings with `vocab.load_vectors("name_of_the_embeddings")`:

In [31]:
vocab.load_vectors("fasttext.simple.300d")

.vector_cache/wiki.simple.vec: 293MB [00:28, 10.3MB/s]                           
100%|██████████| 111051/111051 [00:08<00:00, 13434.68it/s]


You don't even need to assign the result to a variable: the vectors are loaded and stored as an attribute of the object.

Embeddings are numerical representations of words that have been trained in an unsupervised manner on some text corpus. They help in all text classification problems, since they also encode some semantic information about the words: the vectors of two words with similar meanings are closer to each other (say, they have a higher cosine similarity) than the vectors of words with different meanings.
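Here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up 3-dimensional toy values, not real fastText embeddings, which have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: close to 1 means 'pointing the same way'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only
good = [0.9, 0.1, 0.3]
nice = [0.8, 0.2, 0.4]
printer = [-0.2, 0.9, -0.5]

# Words with a similar meaning should score higher than unrelated ones
print(cosine_similarity(good, nice) > cosine_similarity(good, printer))  # True
```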

So if we inspect them, we should get a tensor whose rows are vectors of size 300:

In [32]:
vocab.vectors

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1181, -0.3024,  0.2944,  ..., -0.1119, -0.0891, -0.1466],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.3920,  0.1647, -0.0525,  ..., -0.2147, -0.2600, -0.1543]])

In [33]:
vocab.vectors.shape

torch.Size([10812, 300])

Indeed. What's the first number? It's the number of entries in the vocabulary: the unique tokens we have in the training set, plus the special tokens (such as `<unk>` and `<pad>`). The second number (300) is the size of the embeddings. How can we retrieve the vector corresponding to the word "good"?

Well, we have the `.stoi` attribute, which stands for "string to index", to retrieve the index associated with the string "good":


In [34]:
vocab.stoi["good"]

4

Actually, there is no need for that: we can just write `vocab["good"]` and we'll get the same result:

In [35]:
vocab["good"]

4

and then we can use that index to retrieve the vector:

In [36]:
vocab.vectors[4]

tensor([ 5.9637e-01,  2.7714e-01,  2.6521e-01, -3.4600e-01, -1.0764e-01,
         1.8982e-01, -3.8168e-02,  8.3709e-02,  1.9178e-01,  1.6162e-01,
         1.6343e-01,  1.3317e-01, -3.8956e-01, -1.3596e-01, -1.2511e-01,
        -1.6472e-01, -1.4022e-01,  4.1599e-02, -1.4979e-01,  1.0635e-01,
         3.6200e-01,  1.0988e-01,  1.4841e-01,  1.1830e-01,  8.3510e-02,
        -2.1211e-01,  7.1777e-02, -7.3148e-03,  2.2641e-01,  3.7710e-02,
        -4.5244e-03,  7.0736e-02, -7.2897e-02,  3.0860e-01,  1.5270e-01,
        -9.1561e-02, -3.8422e-01, -1.6947e-01,  5.8803e-02, -9.8637e-03,
         6.0262e-02, -2.4001e-01,  1.1871e-01, -1.7887e-01, -2.3948e-01,
        -9.4501e-02, -1.5217e-01, -6.8412e-02,  8.2164e-02,  1.8725e-01,
         3.3745e-02, -1.3283e-01, -3.0824e-01,  1.0093e-01, -3.3814e-01,
        -5.6273e-02, -1.6498e-01,  1.3821e-01, -5.3919e-02,  2.6901e-01,
        -4.8584e-01, -1.4683e-01, -2.5746e-01,  2.5280e-01,  2.7210e-02,
         1.2690e-02, -1.7246e-02, -1.8162e-01, -8.4

In [37]:
vocab.vectors[4].shape

torch.Size([300])

In [38]:
vocab["good"]

4

What about "nice"?

In [39]:
vocab["nice"]

55

In [40]:
vocab.vectors[vocab["nice"]]

tensor([ 0.4940,  0.4000,  0.2400, -0.1512, -0.0875,  0.3711, -0.1918,  0.2610,
         0.3133, -0.0206, -0.1875,  0.0116,  0.0544, -0.2277,  0.0178,  0.1509,
        -0.1158, -0.2391,  0.1476, -0.5993,  0.4263, -0.2732,  0.1811, -0.1666,
         0.2579, -0.0722,  0.2222, -0.3234, -0.0363,  0.0015, -0.1955, -0.0208,
        -0.0191,  0.4611,  0.2720, -0.2414, -0.0971, -0.1544,  0.1492,  0.1033,
        -0.4095,  0.2533,  0.2096,  0.0630,  0.2702, -0.3699, -0.0760, -0.1037,
         0.1022,  0.2548, -0.0250,  0.0291,  0.2737,  0.2041,  0.0960, -0.3495,
         0.2727,  0.0887,  0.1010,  0.3411, -0.3515, -0.1092,  0.1397, -0.0361,
        -0.0667, -0.0639,  0.1377, -0.3333, -0.0857,  0.1262,  0.0286, -0.1108,
        -0.1534,  0.0137, -0.1651, -0.4551,  0.1704, -0.1705,  0.1677,  0.0466,
         0.0512, -0.2957,  0.0971,  0.1499,  0.1507, -0.3408, -0.1480, -0.3976,
         0.1525, -0.3452,  0.2638,  0.0131,  0.1264,  0.0492,  0.0179,  0.0911,
        -0.5341,  0.3287,  0.2030, -0.21

Ok, but what about the words that are not in the vocabulary? How do we handle them?

All of them share a dedicated index: 0. However, different vocabularies may follow different conventions, so you can check it by reading `vocab.unk_index`:
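The fallback for unknown words can be sketched with a plain dict and its `.get` method. The mapping below is a toy one and `lookup` is a hypothetical helper, not part of `torchtext`.

```python
stoi = {"<unk>": 0, "<pad>": 1, "good": 2, "book": 3}  # toy mapping
unk_index = stoi["<unk>"]

def lookup(token):
    """Return the index of token, or the unknown index if the token was never seen."""
    return stoi.get(token, unk_index)

print(lookup("good"), lookup("asfgahsgf"))  # 2 0
```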

In [41]:
vocab["asfgahsgf"], vocab.unk_index

(0, 0)

In [42]:
vocab['<pad>']

1

All right. I feel confident enough to say that the preprocessing part has been completed!

Now we need to create the:




### Data Loader

Yes, they are back. [Is it a good or a bad memory?](https://github.com/Strive-School/ai_mar21/blob/main/M5_Deep_Learning/D7/Custom%20DataLoader%20and%20Dataset.ipynb)

If you take a look at that notebook, you will remember that to create a custom data loader you need to override some methods of the `Dataset` class from `torch.utils.data`. Before doing so, let's define the steps we need to perform while loading the data:

- Receive as input the dataframe that we defined above, which contains two columns: "star" and "review"
- Separate "star" from "review"
- Preprocess the "review" column as we have done so far (tokenization etc., but excluding the embeddings for now)
- Shift the labels by 1: as we have seen, the possible stars go from 1 to 5, but class indices start from zero, so if the stars are 4, the label will be 3
- Pad the sequences <- this is important, I will spend a few more words on it in a few lines
- Store a list containing the sequences of indices with the associated labels

Then we also need to override the `__len__` and `__getitem__` methods of the `Dataset` class.
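Before the real thing, here is a torch-free sketch of these steps. `ToyReviewDataset`, `toy_vocab` and the whitespace-based `encode` are stand-ins I made up so the example runs on its own; padding is left out to keep it short.

```python
class ToyReviewDataset:
    """Minimal stand-in for a torch Dataset: stores (sequence, label) pairs."""

    def __init__(self, rows, encode):
        self.sequences, self.labels = [], []
        for stars, review in rows:
            sequence = encode(review)
            if not sequence:                    # skip reviews that end up with no tokens
                continue
            self.sequences.append(sequence)
            self.labels.append(int(stars) - 1)  # stars 1..5 -> labels 0..4

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        return self.sequences[i], self.labels[i]

toy_vocab = {"<unk>": 0, "<pad>": 1, "great": 2, "printer": 3}

def encode(text):
    """Toy encoder: split on whitespace and map unknown tokens to index 0."""
    return [toy_vocab.get(token, 0) for token in text.lower().split()]

data = ToyReviewDataset([(5, "great printer"), (1, ""), (4, "great value")], encode)
print(len(data))  # 2
print(data[1])    # ([2, 0], 3)
```

Note that the empty review is dropped together with its label, so sequences and labels stay aligned.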

So, a few words about **padding**: not all the reviews have the same length, so we need to find a solution for that. Why? Because our neural network expects inputs that all have the same size! Its first layer needs to know how many weights to initialize!

There are several possibilities, but the easiest is to set a cap with a `max_seq_len` parameter: all the reviews shorter than that length are padded with the padding index (whose vector is used to fill the gap), and all the ones longer than `max_seq_len` are simply cut.
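As a minimal sketch of this rule, assuming the padding index is 1 (as in the vocabulary used here) and with `pad_or_truncate` being a name I picked for illustration:

```python
def pad_or_truncate(indices, max_seq_len, pad_index=1):
    """Cut sequences longer than max_seq_len and right-pad shorter ones with pad_index."""
    indices = indices[:max_seq_len]
    return indices + [pad_index] * (max_seq_len - len(indices))

print(pad_or_truncate([6, 52, 14], max_seq_len=5))            # [6, 52, 14, 1, 1]
print(pad_or_truncate([6, 52, 14, 7, 9, 33], max_seq_len=5))  # [6, 52, 14, 7, 9]
```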

Do you see any problems? I actually don't see many problems with it. I think that the sentiment of a review can usually be seen already from its first words.

Ok, stop talking, more action:

In [43]:
class TrainData(Dataset):
    def __init__(self, df, max_seq_len=32): # df is the input df, max_seq_len is the max length allowed for a sentence before cutting or padding
        self.max_seq_len = max_seq_len
        
        counter = Counter()
        train_iter = iter(df.review.values)
        for text in train_iter:
            counter.update(preprocessing(text))
        self.vocab = Vocab(counter, min_freq=1)
        self.vocab.load_vectors("fasttext.simple.300d")
        
        label_pipeline = lambda x: int(x) - 1 # we need to preprocess the stars to start from 0
        token2idx = lambda x: self.vocab[x] # Basically renaming functions to access them quickly
        self.encode = lambda x: [token2idx(token) for token in preprocessing(x)]
        self.pad = lambda x: x + (max_seq_len - len(x))*[token2idx("<pad>")] # concatenating the original sentence with max_seq_len - len(x) padding indexes
        sequences = [self.encode(sequence)[:max_seq_len] for sequence in df.review.tolist()] # here we are encoding and cutting to the max_seq_len
        # Drop the reviews that are empty after preprocessing, keeping sequences and labels aligned
        sequences, self.labels = zip(*[(sequence, label_pipeline(label)) for sequence, label in zip(sequences, df.star.tolist()) if sequence])
        # If you get the comprehension above fast, good, otherwise write your own version of it
        self.sequences = [self.pad(sequence) for sequence in sequences]
        
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, i):
        assert len(self.sequences[i]) == self.max_seq_len
        return self.sequences[i], self.labels[i]

In [67]:
class TestData(Dataset):
    def __init__(self, df, vocab, max_seq_len=32): # df is the input df, vocab is the vocabulary built on the training set, max_seq_len is the max length allowed for a sentence before cutting or padding
        self.max_seq_len = max_seq_len
        
        # Reuse the vocabulary (and its vectors) built on the training set,
        # so that test tokens map to the same indices the model was trained on
        self.vocab = vocab
        
        label_pipeline = lambda x: int(x) - 1 # we need to preprocess the stars to start from 0
        token2idx = lambda x: self.vocab[x] # Basically renaming functions to access them quickly
        self.encode = lambda x: [token2idx(token) for token in preprocessing(x)]
        self.pad = lambda x: x + (max_seq_len - len(x))*[token2idx("<pad>")] # concatenating the original sentence with max_seq_len - len(x) padding indexes
        sequences = [self.encode(sequence)[:max_seq_len] for sequence in df.review.tolist()] # here we are encoding and cutting to the max_seq_len
        # Drop the reviews that are empty after preprocessing, keeping sequences and labels aligned
        sequences, self.labels = zip(*[(sequence, label_pipeline(label)) for sequence, label in zip(sequences, df.star.tolist()) if sequence])
        # If you get the comprehension above fast, good, otherwise write your own version of it
        self.sequences = [self.pad(sequence) for sequence in sequences]
        
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, i):
        assert len(self.sequences[i]) == self.max_seq_len
        return self.sequences[i], self.labels[i]

In [48]:
dataset = TrainData(train_df, max_seq_len=32)

In [68]:
train_vocab = dataset.vocab

In [69]:
test_set = TestData(test_df, train_vocab, max_seq_len=32)

In [70]:
test_set[0]

([4,
  2,
  3302,
  618,
  3302,
  618,
  473,
  1611,
  3108,
  2,
  271,
  1419,
  75,
  52,
  140,
  19,
  618,
  3,
  19,
  0,
  226,
  4744,
  1451,
  2651,
  0,
  348,
  13,
  2,
  32,
  2784,
  3108,
  1],
 4)

When we index dataset with a `dataset[index]` notation, we get the pair containing the padded sequence of indices with the associated label: 

In [49]:
dataset[0]

([6,
  7283,
  52,
  1783,
  4125,
  331,
  1451,
  8627,
  14,
  4679,
  2124,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 2)

In [50]:
dataset[1][0]

[1478,
 151,
 43,
 29,
 125,
 28,
 33,
 295,
 724,
 1546,
 6,
 7,
 491,
 1041,
 694,
 962,
 2043,
 7494,
 446,
 4167,
 577,
 51,
 1765,
 1648,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]

What are the ones there? They are the product of the padding! 

What is the vector associated with the index 1?

In [51]:
dataset.vocab.vectors[1]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

All zeros! Makes sense!

Keeping in memory the embedded vectors of every review can be very costly. This is why the dataset stores only integer indices. However, when we train our model, we need the embedded vectors!

So let's define the `collate` function, which will look up the vectors only when a batch needs them!

As arguments it takes the batch (a list of `batch_size` pairs, each made of a padded sequence of `max_seq_len` indices and its label) and the vectorizer. What is the vectorizer in our case? It's the embedding matrix stored in the vocabulary that we have built in the dataset object, so `dataset.vocab.vectors`. Indexing it with a token index retrieves the vector associated with that index.

In [52]:
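def collate(batch, vectorizer=dataset.vocab.vectors):
    # batch is a list of (padded index sequence, label) pairs produced by the dataset
    # Look up the embedding of every token: the result has shape (batch_size, max_seq_len, emb_dim)
    inputs = torch.stack([torch.stack([vectorizer[token] for token in sentence[0]]) for sentence in batch])
    target = torch.LongTensor([item[1] for item in batch]) # NLLLoss expects the class indices as a LongTensor
    return inputs, target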
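The same lookup can be sketched without torch, using nested lists in place of tensors. `toy_collate` and the tiny 2-dimensional `vectors` table are illustrative assumptions; the real table has 300 columns.

```python
def toy_collate(batch, vectors):
    """Replace every index with its vector and collect the labels separately."""
    inputs = [[vectors[index] for index in sequence] for sequence, _ in batch]
    targets = [label for _, label in batch]
    return inputs, targets

# Toy embedding table: 4 indices, 2 dimensions each (index 1 is the padding vector)
vectors = [[0.0, 0.0], [0.0, 0.0], [0.5, -0.1], [0.3, 0.7]]
# Two (padded sequence, label) pairs, as returned by the dataset
batch = [([2, 3, 1], 4), ([3, 1, 1], 0)]

inputs, targets = toy_collate(batch, vectors)
print(inputs[0])  # [[0.5, -0.1], [0.3, 0.7], [0.0, 0.0]]
print(targets)    # [4, 0]
```

The dataset keeps only small integers in memory, and the (much larger) vectors are materialized one batch at a time.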

And now, we can use the `DataLoader` class as we did for images:

In [53]:
batch_size = 16
train_loader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate)


In [54]:
next(iter(train_loader))[0].shape

torch.Size([16, 32, 300])

Ready to train? What follows is a small model, just to *make things run on my computer*. You can expect to be kicked out if you come to the debrief with this model!



In [55]:
from torch import nn
import torch.nn.functional as F
emb_dim = 300
class Classifier(nn.Module):
    def __init__(self, max_seq_len, emb_dim, hidden1=16, hidden2=16):
        super(Classifier, self).__init__()
        self.fc1 = nn.Linear(max_seq_len*emb_dim, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, 5)
        self.out = nn.LogSoftmax(dim=1)
    
    
    def forward(self, inputs):
        x = F.relu(self.fc1(inputs.squeeze(1).float()))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return self.out(x)

In [56]:
MAX_SEQ_LEN = 32
model = Classifier(MAX_SEQ_LEN, 300, 16, 16)
model

Classifier(
  (fc1): Linear(in_features=9600, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=16, bias=True)
  (fc3): Linear(in_features=16, out_features=5, bias=True)
  (out): LogSoftmax(dim=1)
)

In [57]:
from torch import optim
criterion = nn.NLLLoss()

# Train all the parameters of the classifier (the pretrained embeddings are looked up in collate and are not part of the model)
optimizer = optim.Adam(model.parameters(), lr=0.003)


In [58]:
dataiter = iter(train_loader)
sentences, labels = next(dataiter)

In [59]:
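# Forward pass through the network for a single sentence
sentence_idx = 0
# Flatten each sentence into one long vector: (batch, 1, max_seq_len * emb_dim)
sentences = sentences.reshape(sentences.size(0), 1, MAX_SEQ_LEN * emb_dim)
log_ps = model(sentences[sentence_idx])

# The model returns log-probabilities: exponentiate them to get probabilities
torch.exp(log_ps)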

tensor([[0.2001, 0.2372, 0.2114, 0.1770, 0.1743]], grad_fn=<ExpBackward>)

We got 5 probabilities: one for each possible star rating!
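To turn such an output into a prediction, we take the most likely label and shift it back to stars. Here is a plain-Python sketch: the helper names `predict_stars` and `accuracy` are mine, and the second true label in the usage example is made up.

```python
import math

def predict_stars(log_probs):
    """Return the predicted star rating (1..5) for one row of log-probabilities."""
    best_label = max(range(len(log_probs)), key=lambda k: log_probs[k])
    return best_label + 1  # undo the shift applied to the labels

def accuracy(batch_log_probs, labels):
    """Fraction of rows whose most likely label equals the true label (labels are 0..4)."""
    correct = sum(predict_stars(row) - 1 == label
                  for row, label in zip(batch_log_probs, labels))
    return correct / len(labels)

probs = [0.2001, 0.2372, 0.2114, 0.1770, 0.1743]  # the probabilities printed above
log_probs = [math.log(p) for p in probs]
print(predict_stars(log_probs))                  # 2
print(accuracy([log_probs, log_probs], [1, 4]))  # 0.5
```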

In [None]:
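def check_accuracy(test_loader, model):
    """Return the fraction of examples in test_loader that the model classifies correctly."""
    raise NotImplementedError  # to be implemented as part of the exercises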


In [60]:
epochs = 3
print_every = 40

for e in range(epochs):
    running_loss = 0
    print(f"Epoch: {e+1}/{epochs}")

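    for i, (sentences, labels) in enumerate(train_loader):

        # Flatten each sentence: (batch, max_seq_len, emb_dim) -> (batch, max_seq_len * emb_dim)
        sentences = sentences.reshape(sentences.size(0), MAX_SEQ_LEN * emb_dim)
        
        optimizer.zero_grad()
        
        output = model(sentences)        # 1) Forward pass
        loss = criterion(output, labels) # 2) Compute loss
        loss.backward()                  # 3) Backward pass
        optimizer.step()                 # 4) Update model
        
        running_loss += loss.item()
        
        # Note: at i == 0 only one batch has been accumulated, so dividing by
        # print_every makes the first value of each epoch look artificially small
        if i % print_every == 0:
            print(f"\tIteration: {i}\t Loss: {running_loss/print_every:.4f}")
            running_loss = 0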

Epoch: 1/3
	Iteration: 0	 Loss: 0.0412
	Iteration: 40	 Loss: 1.6075
	Iteration: 80	 Loss: 1.5956
	Iteration: 120	 Loss: 1.5880
Epoch: 2/3
	Iteration: 0	 Loss: 0.0349
	Iteration: 40	 Loss: 1.3068
	Iteration: 80	 Loss: 1.3855
	Iteration: 120	 Loss: 1.2639
Epoch: 3/3
	Iteration: 0	 Loss: 0.0221
	Iteration: 40	 Loss: 0.5572
	Iteration: 80	 Loss: 0.6940
	Iteration: 120	 Loss: 0.7886


If needed, you can load the full dataset again:

In [72]:
from torchtext import datasets

In [73]:
train, test = datasets.AmazonReviewFull()

### Exercises

- Create a real training process: split the dataset into train, validation and test sets
- Create a training loop that includes validation, and a test at the end
    - You can borrow from your previous work, no need to write it from scratch
- If you want to, feel free to use a different dataset
