# torchtext example

#### Handy Links
- https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb
- http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
- http://anie.me/On-Torchtext/

Have already preprocessed the data slightly so we have 3 files:
- train.csv
- test.csv
- classes.csv

train.csv and test.csv are the standard format of the label followed by the text:
```
LABEL1,TEXT1
LABEL2,TEXT2
...
```
The label is just 0 or 1 (pos, neg)

Lets store the path to our data folder in a variable:

In [10]:
PATH = "aclImdb"

Importing the libraries we need:

In [11]:
import torch
import torchtext
from torchtext import data

## Fields
ok good. We can now start processing our data.

We need 2 **Fields**: one for the label and one for the text itself:

In [12]:
REVIEW = data.Field(sequential=True,lower=True,tokenize="spacy")
LABEL = data.LabelField(use_vocab=False,dtype=torch.float)

These store what preprocessing (tokenisation - splitting into words, and numerisation - converting these words into numbers) will be done on each string, but doesn't do it until we pass the data in.

For more details see the torchtext docs on [Field](https://torchtext.readthedocs.io/en/latest/data.html#field).

You might want to also consider the **ReversibleField** class. This uses a different tokeniser: [revtok](https://github.com/jekbradbury/revtok) which allows you to map back to strings. This can be handy, especially when debugging.

## Dataset
Once we have the fields defined, we can initialise a **Dataset**.

This class does the heavy lifting of actually loading the data into these fields.

If your data is in some weird format you can implement your own class (it just has to match the [interface](https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Dataset). There are loads of [examples](https://github.com/pytorch/text/tree/master/torchtext/datasets)!).

Here as we have CSV files we're just gonna use the handy TabularDataset class already provided for us!

In [13]:
train, test = data.TabularDataset.splits(
    path=PATH,
    format="csv",
    train = "train.csv",
    test = "test.csv",
    fields=[('label',LABEL),('review',REVIEW)])

print("Train set size:",len(train))
print("Test set size :",len(test))

Train set size: 25000
Test set size : 25000


This gives us 2 Datasets: train and test.

NOTE1: We should really add a validation set in as well.

NOTE2: Loading and preprocessing takes time. We should pickle these objects so we only have to do it once.

Torchtext will now tokenize our data based on what we told it to do in our fields:

In [14]:
train.examples[0].review[:8]

['by', 'my', '"', 'kool', '-', 'aid', 'drinkers', '"']

## Word Vectors

We can now load in some pretrained word embeddings into our text field using the build_vocab method of the text field.

We pass the name of the embedding file. This can be an actual file on your computer or one of the [predefined names](https://torchtext.readthedocs.io/en/latest/vocab.html#pretrained-aliases) which includes common embeddings such as word2vec, GloVe and FastText. If you pick one of these torchtext will go and download these for you.

In [15]:
REVIEW.build_vocab(train, vectors="glove.6B.100d")

So now we have these vectors downloaded and stored inside the Field object, we need to create the mapping from our tokenised words to these word vectors.

PyTorch has an nn.Embedding layer which will do this job, and we will include it in our model. Here we define a function which we will call to set up this nn.Embedding layer (see the model definition later for where we call it):


In [16]:
def create_emb(textField,em_sz=300):
    """Create embedding matrix from GloVe"""
    emb = nn.Embedding(len(textField.vocab.itos),em_sz,padding_idx=1)
    weights = emb.weight.data
    miss = []
    for i,w in  enumerate(textField.vocab.itos):
        try: weights[i] = torch.from_numpy(textField.vocab.vectors[w]*3)
        except: miss.append(w)
    print("OOV:",len(miss),miss[5:10]) # just to check
    return emb

This function takes in the text field object and loops through every word in the vocabulary 
(the property itos - Integer To String, maps from integers back to the strings ).

On each loop iteration i, we set the ith row of the embedding weight matrix to the word vector (\*3 to give us space for special tokens we might want to add such as OOV,EOS etc...)

If the word doesn't have a word vector we skip it and add it to a list of misses to help us debug.

## Interators
The last piece of the preprocessing puzzle is the **Iterator**. 
This class tells torchtext how to loop over the data in **batches**.

We use the handy BucketIterator here which groups similar length reviews together meaning we don't need as much padding:

In [17]:
#Check if we can use the GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


train_iter,test_iter = data.BucketIterator.splits(
    (train,test),
    batch_sizes=(64,64),
    sort_key=lambda x: len(x.review),
    device=device #Run on the GPU
)

We can actually test out these iterators, getting the next item as you would with a standard [Python iterator](https://www.geeksforgeeks.org/iterators-in-python/). This is exactly what our model is going to do: looping over the iterator, grabbing new batches:

In [18]:
batch = next(iter(train_iter))
print(vars(batch).keys())
batch.review[:5, :3]

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'label', 'review'])


tensor([[13110,    20,    72],
        [    9,    97, 26785],
        [  780,  1983,  1610],
        [    6,    35,    56],
        [   85,  3299,    13]], device='cuda:0')

## A Simple Model
Now it's time to write a simple model to classify the reviews. We have 3 layers: an Embedding, an LSTM and finally a Linear output layer:

In [77]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable


class SimpleLSTMmodel(nn.Module):
    def __init__(self,textField,hidden_dim=1000,emb_dim=300):
        """Initialise all the layers"""
        super().__init__() #very important!
        
        self.embedding = create_emb(textField,emb_dim) #calling the function from earlier!
        self.rnn = nn.LSTM(emb_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)

    
    def forward(self,seq):
        """Connect all the layers together"""
        embedded = self.embedding(x) #embedded = [sent len, batch size, emb dim]
        output, hidden = self.rnn(embedded)
        hidden = hidden[-1]
        return self.fc(hidden.squeeze(0))

model = SimpleLSTMmodel(REVIEW)

OOV: 101867 ['and', 'a', 'of', 'to', 'is']


When writing a NN (or even a single layer) in PyTorch we always have these 2 methods: **\_\_init\_\_** where we define what layers we want, and **forward** where we define how the sequence gets passed through these layers.

## Training
Now we have our model we need to train it for our task: getting the sentiment of movie reviews.

Here's a fairly standard training loop which does the job:

In [78]:
from tqdm import  tqdm_notebook as tqdm
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()

#if we have a gpu move our stuff there
model = model.to(device)
loss_func = loss_func.to(device)


epochs = 10

for epoch in range(1,epochs+1):
    
    #***********
    # Train
    train_loss = 0
    model.train() #allow the model to train
    
    for b in tqdm(train_iter): #train a mini-batch
        x, y = b.review, b.label
        opt.zero_grad()
        preds = model(x).squeeze(1) #make a prediction
        loss = loss_func(preds,y) #find the loss
        loss.backward() # backpropogate this loss
        opt.step() # make a step of the parameters
        
        train_loss += loss.item()  
    #***********
    # Validate
    val_loss = 0
    model.eval() #switch to evaluation mode
    with torch.no_grad():
        for b in test_iter:
            x, y = b.review, b.label
            preds = model(x).squeeze(1)
            loss = loss_func(preds,y)
            val_loss += loss.item()
    
    print('(Epoch',str(epoch)+"/"+str(epochs)+")",
          "train loss:",train_loss/len(train_iter),
          "val loss:",val_loss/len(test_iter))

HBox(children=(IntProgress(value=0, max=391), HTML(value='')))

RuntimeError: CUDA error: out of memory

NOTE: fast.ai simplifies this by having a standard training loop you can use

## Predicting
Now we can make some predictions:

In [66]:
import numpy as np

preds = np.array([])
y_test = np.array([])
for b in tqdm(test_iter):
    x, y = b.review, b.label
    predsTmp = model(x).cpu().squeeze(1).data.numpy()
    y = y.cpu().data.numpy()
    preds = np.concatenate((preds,predsTmp))
    y_test = np.concatenate((y_test,y))
    
print("First 5 predictions:\n",preds[:5])

HBox(children=(IntProgress(value=0, max=391), HTML(value='')))

First 5 predictions:
 [ 0.06850207  0.79389507 -1.58640885  1.27722037 -1.15759289]


You can then calculate accuracy, F1-measure and anything else you could dream of:

In [68]:
def accuracy_score(preds,y):
    return np.mean(preds==y)

preds = np.where(preds < 0.5,0.0,1.0)
print(preds)
print(y_test)
print()
print("Accuracy",accuracy_score(preds,y_test))

[0. 1. 0. ... 0. 1. 0.]
[1. 0. 1. ... 0. 0. 0.]

Accuracy 0.5204


## Conclusions
And we're done! This model could be a lot more complicated, with more layers, maybe a biLSTM etc etc. Feel free to tweak the model class.

Check out the [fastai libary](https://github.com/fastai/fastai) which makes many of these steps much easier.