# TP5: RNNs

In this practical session, we will explore the modifications needed to use an RNN for text classification. We will compare a FFNN with an LSTM, using either BoW or continuous representations. We will also see how to use mini-batches for both classification algorithms.

The first parts of the code are based on previous sessions. The session is organized as follows:
* Part 1: using BoW representations with FFNN or LSTM
* Part 2: using continuous representations with FFNN or LSTM
* Part 3: using mini-batches, which requires padding with LSTMs
* Part 4: trying other architectures
* Part 5: sequence tagging

In [None]:
import torch
import torch.nn as nn

## 1- Define the device to be used

# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print(device)

# PART1: BoW Representation

## 1.1 Read and load the data

The code below is exactly the same as in TP2, building BoW representations.
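
As a reminder of what a BoW representation looks like, here is a minimal sketch on a toy corpus. The two sentences and the names `toy_docs`, `toy_vectorizer` and `toy_bow` are made up for illustration and are not part of the Allociné data. Each row is a document, each column a word of the vocabulary, and each value a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
toy_docs = ["un film superbe", "un film nul , vraiment nul"]

toy_vectorizer = CountVectorizer(analyzer="word", max_features=10)
toy_bow = toy_vectorizer.fit_transform(toy_docs).toarray()

# Mapping word -> column index
print(toy_vectorizer.vocabulary_)
# One row per document, one column per word kept in the vocabulary
print(toy_bow.shape)
print(toy_bow)
```

Word order is lost: only the counts are kept, which is why the BoW vector has a fixed size whatever the length of the review.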

In [None]:
import pandas as pd
import numpy as np
import re
import sklearn

from sklearn.feature_extraction.text import CountVectorizer

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# This will be the size of the vectors representing the input
MAX_FEATURES = 5000 

# Load train, dev and test set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
test_df = pd.read_csv(test_path, header=0, delimiter="\t", quoting=3)
print("Creating features from bag of words...")
vectorizer = CountVectorizer(analyzer = "word", max_features = MAX_FEATURES) 
train_data_features = vectorizer.fit_transform(train_df["review"])
x_train = train_data_features.toarray()
y_train = np.asarray(train_df["sentiment"])
print( "TRAIN:", x_train.shape )
count_train = x_train.shape[0]
# -- DEV
dev_data_features = vectorizer.transform(dev_df["review"])
x_dev = dev_data_features.toarray()
y_dev = np.asarray(dev_df["sentiment"])
print( "DEV:", x_dev.shape )
# -- TEST
test_data_features = vectorizer.transform(test_df["review"])
x_test = test_data_features.toarray()
y_test = np.asarray(test_df["sentiment"])
print( "TEST:", x_test.shape )

## 1.2 Building models

### 1.2.1 Using a FFNN

The code below defines a FFNN, taking as input a BoW representation (no embedding layer).

In [None]:
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function ==> W1
        self.fc1 = nn.Linear(input_dim, hidden_dim)

        # Non-linearity ==> g
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout) ==> W2
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, x):
        '''
        y = g(x.W1+b).W2
        '''
        # Linear function  # LINEAR ==> x.W1+b
        out = self.fc1(x)

        # Non-linearity  # NON-LINEAR ==> h1 = g(x.W1+b)
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR ==> y = h1.W2
        out = self.fc2(out)
        return out

### 1.2.2 Using an LSTM

The code below defines an LSTM model, also taking a BoW representation as input.

As you can see, we have now:
* an LSTM layer that will transform our input into a vector representation with the size hidden_dim
* in the forward pass, we need to reshape the data using:
```
x = x.view(len(x), 1, -1)
```

We need to reshape our input data before passing it to the LSTM layer, because it expects a 3D tensor of shape (sequence length, batch size, input size). This is done with the 'view' method, which reshapes a PyTorch tensor while keeping the same number of elements. (There is also a batch-first layout, obtained with 'batch_first=True', which is often easier to read.)
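
The reshaping step can be checked on a small example. This is a minimal sketch with toy sizes (input size 8, hidden size 4) chosen only for illustration. Note that with a batch size of 1, `len(x)` is 1, so the whole BoW vector is fed to the LSTM as a single time step.

```python
import torch
import torch.nn as nn

# One BoW "document" of size 8 in a batch of size 1: shape (1, 8)
x = torch.rand(1, 8)

# Reshape to (sequence length, batch size, input size) = (1, 1, 8)
x3d = x.view(len(x), 1, -1)
print(x3d.shape)

lstm = nn.LSTM(input_size=8, hidden_size=4)
out, (ht, ct) = lstm(x3d)
# out: one hidden state per time step; ht: final hidden state of each layer
print(out.shape, ht.shape)   # both (1, 1, 4) here
print(ht[-1].shape)          # (batch size, hidden size) = (1, 4)
```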

In [None]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()

        self.lstm = nn.LSTM( input_size=input_dim, 
                            hidden_size=hidden_dim, 
                            bidirectional=False)

        # Linear function (readout) ==> W2
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, x):
        # The view function is meant to reshape the tensor, keeping the 
        # same number of elements
        # e.g. try a = torch.arange(1, 17) and a = a.view(4, 4)
        # When you don't know how many elements you want for one dimension,
        # you can use -1
        # Here, an LSTM wants as input a 3D tensor with:
        # Sequence length, Batch size, Input size
        x = x.view(len(x), 1, -1)
        out, (ht, ct) = self.lstm( x )
        y = self.fc2(ht[-1])
        return y
        

The training and evaluation functions are given below. 

In [None]:
from sklearn.metrics import classification_report, accuracy_score

def train( model, train_loader, optimizer, num_epochs=5, trace=False ):
    for epoch in range(num_epochs):
        train_loss, total_acc, total_count = 0, 0, 0
        for input, label in train_loader:
            input = input.to(device)
            label = label.to(device)
            # Step1. Clearing the accumulated gradients
            optimizer.zero_grad()
            # Step 2. Forward pass to get output/logits
            outputs = model( input )
            if trace:
              print(input) # <---- call with trace=True to 'see' the input
            # Step 3. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()
            # - Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, label)
            # - Getting gradients w.r.t. parameters
            loss.backward()
            # - Updating parameters
            optimizer.step()
            # Accumulating the loss over time
            train_loss += loss.item()
            total_acc += (outputs.argmax(1) == label).sum().item()
            total_count += label.size(0)
        # Report the mean loss per batch and the accuracy on the train set at each epoch
        print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/len(train_loader), total_acc/total_count))
        total_acc, total_count = 0, 0
        train_loss = 0

def evaluate( model, dev_loader ):
    predictions = []
    gold = []
    with torch.no_grad():
        for input, label in dev_loader:
            input = input.to(device)
            label = label.to(device)
            probs = model(input)
            predictions.append( torch.argmax(probs, dim=1).cpu().numpy()[0] )
            gold.append(int(label))
    print(classification_report(gold, predictions))
    return gold, predictions

## 1.3 Run an experiment

### 1.3.1 Test with a FFNN

The code below will launch an experiment with a simple Feed Forward Neural Network.

We load the data with a batch size of 1.

In [None]:
from torch.utils.data import TensorDataset, DataLoader

# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))
batch_size = 1 #no batch, or batch = 1
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size )
# Load dev data
dev_data = TensorDataset(torch.from_numpy(x_dev).to(torch.float), torch.from_numpy(y_dev))
dev_loader = DataLoader(dev_data, shuffle=True )

In [None]:
# Set the value of the hyper-parameters
VOCAB_SIZE = MAX_FEATURES # here BoW representation
input_dim = VOCAB_SIZE 
hidden_dim = 4
output_dim = 2
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()

In [None]:
# Initialize the model
model_ffnn = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)
optimizer = torch.optim.SGD(model_ffnn.parameters(), lr=learning_rate)
model_ffnn = model_ffnn.to(device)
# Train the model
train( model_ffnn, train_loader, optimizer, num_epochs=5 )
# Evaluate on dev
gold, pred = evaluate( model_ffnn, dev_loader )

### 1.3.2 Test with LSTM 

The code below will launch the experiment with the LSTM model.

In [None]:
# Initialization of the model
model_lstm = LSTMModel(input_dim, hidden_dim, output_dim)
optimizer = torch.optim.SGD(model_lstm.parameters(), lr=learning_rate)
model_lstm = model_lstm.to(device)

# Train the model
train( model_lstm, train_loader, optimizer, num_epochs=5 )

# Evaluate on dev
gold, pred = evaluate( model_lstm, dev_loader )

As an additional exercise, you can vary some hyper-parameters and compute the final scores of both models on the test dataset.

# PART2: continuous representation

Now we will go back to the continuous representation, i.e. randomly initialized real-valued vectors. 

## 2.1 Load the data

The code below is the same as in TP4: we tokenize the data, extract the vocabulary, and build pipelines to process the data. We also define the function 'collate_batch' that is used to process batches of data. For now, we keep batch_size=1. 
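
To make the role of the text pipeline concrete, here is a plain-Python stand-in for the vocabulary lookup. It is only a sketch: `toy_reviews`, `toy_vocab` and `toy_text_pipeline` are hypothetical names, and the indices will not match those assigned by `build_vocab_from_iterator`. The principle is the same: each token is mapped to an integer index, and unseen tokens fall back to the index of `<unk>`.

```python
# Plain-Python stand-in for the torchtext vocabulary (illustration only)
toy_reviews = ["un film superbe", "un film nul"]

toy_vocab = {"<unk>": 0}          # index 0 is reserved for unknown words
for review in toy_reviews:
    for token in review.split():  # whitespace tokenization
        if token not in toy_vocab:
            toy_vocab[token] = len(toy_vocab)

def toy_text_pipeline(text):
    # Unknown words fall back to the index of <unk>
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in text.split()]

print(toy_vocab)
print(toy_text_pipeline("un film magnifique"))  # "magnifique" was never seen
```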

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# splits the string sentence by space.
tokenizer = get_tokenizer( None ) 
train_iter = []
for i in train_df.index:
    train_iter.append( tuple( [train_df["sentiment"][i], train_df["review"][i]] ) )
dev_iter = []
for i in dev_df.index:
    dev_iter.append( tuple( [dev_df["sentiment"][i], dev_df["review"][i]] ) )

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) #simple mapping to self

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
         label_list.append(label_pipeline(_label))
         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
         text_list.append(processed_text)
         offsets.append(processed_text.size(0))
    label = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label.to(device), text_list.to(device), offsets.to(device)


## 2.2 Define the models

The code below defines a FFNN with an embedding layer that transforms our input words into vectors of size 'embed_dim' and performs an operation on these vectors to build a representation for each document (default=mean).

Remember that we need the 'offsets' here to retrieve the batches (each document is concatenated to the others in a batch, the offsets are used to retrieve the separate documents).
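
The behaviour of the offsets can be checked on a tiny example. This is a minimal sketch with toy sizes (vocabulary of 10 words, embeddings of size 3): two documents are concatenated into one 1D tensor, and the offsets give the start position of each document.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3)  # default mode is "mean"

# Two documents, [1, 2, 3] and [4, 5], concatenated into one 1D tensor
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])   # start position of each document

doc_vectors = bag(text, offsets)
print(doc_vectors.shape)         # one vector per document: (2, 3)

# Same result as averaging the word vectors of the first document by hand
manual_first = bag.weight[[1, 2, 3]].mean(dim=0)
print(torch.allclose(doc_vectors[0], manual_first))
```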

In [None]:
class FeedforwardNeuralNetModel2(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel2, self).__init__()

        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        # Linear function ==> W1
        self.fc1 = nn.Linear(embed_dim, hidden_dim)

        # Non-linearity ==> g
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout) ==> W2
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        # Linear function  # LINEAR ==> x.W1+b
        out = self.fc1(embedded)

        # Non-linearity  # NON-LINEAR ==> h1 = g(x.W1+b)
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR ==> y = h1.W2
        out = self.fc2(out)
        return out

The code below defines an architecture using an LSTM fed with continuous representations. The embedding layer transforms our words into continuous vectors, which are the inputs of the LSTM. The LSTM thus replaces the averaging done by the 'embedding bag': it builds the document representation from the sequence of word vectors.
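
The shapes involved can be traced on one toy document. This is a minimal sketch with small sizes picked for illustration (vocabulary of 10 words, embeddings of size 5, hidden size 4), not the values used in the experiments.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=5)
lstm = nn.LSTM(input_size=5, hidden_size=4)

# One document of 6 word indices (batch size 1)
text = torch.tensor([1, 7, 3, 3, 9, 2])

embeds = embedding(text)             # (6, 5): one vector per word
x = embeds.view(len(text), 1, -1)    # (6, 1, 5): sequence length, batch size, input size
out, (ht, ct) = lstm(x)

print(out.shape)     # (6, 1, 4): one hidden state per word
print(ht[-1].shape)  # (1, 4): final hidden state, used as the document representation
```

Unlike the BoW setting, the sequence length now equals the number of words in the document, so the LSTM reads the words one by one.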

In [None]:
class LSTMModel2(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(LSTMModel2, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM( input_size=embedding_dim, 
                            hidden_size=hidden_dim, 
                            bidirectional=False)

        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text):
        embeds = self.embedding(text)
        x = embeds.view(len(text), 1, -1)
        out, (ht, ct) = self.lstm( x )
        y = self.fc2(ht[-1])
        return y

### Train and evaluation

To use the offsets, we need to modify the train and evaluation procedures.

In [None]:
def train_woffset( model, train_loader, optimizer, num_epochs=5 ):
    for epoch in range(num_epochs):
        train_loss, total_acc, total_count = 0, 0, 0
        for label, input, offsets in train_loader:
            input = input.to(device)
            label = label.to(device)
            # Step1. Clearing the accumulated gradients
            optimizer.zero_grad()
            # Step 2. Forward pass to get output/logits
            outputs = model( input, offsets ) # <-----
            # Step 3. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()
            # - Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, label)
            # - Getting gradients w.r.t. parameters
            loss.backward()
            # - Updating parameters
            optimizer.step()
            # Accumulating the loss over time
            train_loss += loss.item()
            total_acc += (outputs.argmax(1) == label).sum().item()
            total_count += label.size(0)
        # Report the mean loss per batch and the accuracy on the train set at each epoch
        print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/len(train_loader), total_acc/total_count))
        total_acc, total_count = 0, 0
        train_loss = 0

def evaluate_woffset( model, dev_loader ):
    predictions = []
    gold = []
    with torch.no_grad():
        for label, input, offsets in dev_loader:
            input = input.to(device)
            label = label.to(device)
            probs = model(input, offsets) # <-----
            predictions.append( torch.argmax(probs, dim=1).cpu().numpy()[0] )
            gold.append(int(label))
    print(classification_report(gold, predictions))
    return gold, predictions

## 2.3 Run an experiment

### Test with a FFNN

The code below uses the FFNN with continuous representations.

In [None]:
# Load data
batch_size = 1
train_loader = DataLoader(train_iter, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
dev_loader = DataLoader(dev_iter, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)

In [None]:
# Set the values of the hyperparameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 4
output_dim = 2
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()

In [None]:
# Initialize the model
model_ffnn2 = FeedforwardNeuralNetModel2(vocab_size, emb_dim, hidden_dim, output_dim)
optimizer = torch.optim.SGD(model_ffnn2.parameters(), lr=learning_rate)
model_ffnn2 = model_ffnn2.to(device)
# Train the model
train_woffset( model_ffnn2, train_loader, optimizer, num_epochs=5 )
# Evaluate on dev
gold, pred = evaluate_woffset( model_ffnn2, dev_loader )

### Test with LSTM

The code below will launch an experiment with the LSTM architecture and a continuous representation. 

We don't need offsets with the LSTM, since we do not directly embed each sequence using EmbeddingBag. Below is thus a version of collate_batch that does not return the offsets. In this case, we can use the train / evaluation functions defined before.


Note: be careful, the collate_batch2 function below returns (input, label) while collate_batch returns (label, input, offsets). Thus, the loop in 'train' is *for input, label in train_loader* while the loop in 'train_woffset' is *for label, input, offsets in train_loader* (this should be harmonized in future versions).

In [None]:
def collate_batch2(batch):
    # Same as collate_batch, but without the offsets
    label_list, text_list = [], []
    for (_label, _text) in batch:
         label_list.append(label_pipeline(_label))
         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
         text_list.append(processed_text)
    label = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.cat(text_list)
    return text_list.to(device), label.to(device)

In [None]:
# Load data
batch_size = 1
train_loader = DataLoader(train_iter, batch_size=batch_size, shuffle=True, collate_fn=collate_batch2)
dev_loader = DataLoader(dev_iter, batch_size=batch_size, shuffle=True, collate_fn=collate_batch2)

In [None]:
# Set the values for the hyperparameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 32
output_dim = 2
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()

In [None]:
# Initialize the model
model_lstm2 = LSTMModel2(vocab_size, emb_dim, hidden_dim, output_dim)
optimizer = torch.optim.SGD(model_lstm2.parameters(), lr=learning_rate)
model_lstm2 = model_lstm2.to(device)

# Train the model
train( model_lstm2, train_loader, optimizer, num_epochs=5 )
# Evaluate on dev
gold, pred = evaluate( model_lstm2, dev_loader )

As an additional exercise, vary the values of the hyper-parameters and evaluate the models on the test set. 

# PART3: using mini-batches

## With FFNN
We already have the code required to use mini-batches with FFNN. The function 'collate_batch' defined earlier concatenates the input data, and the offsets are used to retrieve the separate documents to be embedded. 
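
The bookkeeping done by 'collate_batch' can be reproduced on two toy documents. This is a minimal sketch with made-up indices; the recovery step at the end is only there to show that the offsets are enough to separate the documents again.

```python
import torch

# Two already-indexed documents of different lengths (toy values)
docs = [torch.tensor([4, 8, 15]), torch.tensor([16, 23])]

# Same bookkeeping as in collate_batch
lengths = [0] + [doc.size(0) for doc in docs]        # [0, 3, 2]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)   # start position of each document
text = torch.cat(docs)                               # one 1D tensor for the whole batch

print(text, offsets)

# Recovering the documents from the concatenation and the offsets
bounds = offsets.tolist() + [len(text)]
recovered = [text[start:end] for start, end in zip(bounds[:-1], bounds[1:])]
print(recovered)
```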

▶▶ **Load the data with a batch of size 2 and run a FFNN.**

In [None]:
# Load data


In [None]:
# Hyper-parameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 32
output_dim = 2
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()

In [None]:
# Initialize the model

# Train the model

# Evaluate on dev


## With LSTM

When using LSTMs, we need a bit more work: all the documents in a batch need to have the same length, because the length of the input defines the size of the unrolled network (each input xi is associated with a state si). 

The solution is called **padding**: we add zeros at the end of the sequences that are shorter than the longest sequence in the batch. 

The easiest solution to do so is to pad the sequences using *torch.nn.utils.rnn.pad_sequence* as done below within the *collate_batch_pad* function. This function returns a tensor of padded sequences, that can be directly used as input of our model.

https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html
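
Here is a minimal sketch of what `pad_sequence` returns for two toy sequences of different lengths. Note that with the vocabulary built in Part 2, index 0 is also the index of `<unk>`, so padding positions and unknown words share the same embedding in this setup.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]

# Default layout is (max length, batch size);
# shorter sequences are filled with padding_value
padded = pad_sequence(seqs, padding_value=0)
print(padded.shape)   # (3, 2)
print(padded)
# tensor([[5, 8],
#         [6, 9],
#         [7, 0]])
```

Each column is one document, which matches the (sequence length, batch size) layout expected once the embedding layer has added the last dimension.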

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence

def collate_batch_pad(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
         label_list.append(label_pipeline(_label))
         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
         text_list.append(processed_text)
    label = torch.tensor(label_list, dtype=torch.int64)
    #text_list = torch.cat(text_list) # Instead of concatenating, we use padding
    text_list = pad_sequence(text_list, padding_value=0) # <-------
    return text_list.to(device), label.to(device)



We slightly modify our model, just to take into account a custom batch size. See in the forward pass:
* the *view* method now has, as a 2nd argument, the batch size (while it was previously set to 1)
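
The shapes obtained with a batch of two padded documents can be traced as follows. This is a minimal sketch with toy sizes; the padded tensor is written by hand and stands for the output of the collate function.

```python
import torch
import torch.nn as nn

batch_size = 2
embedding = nn.Embedding(num_embeddings=10, embedding_dim=5)
lstm = nn.LSTM(input_size=5, hidden_size=4)

# Padded batch in the (max length, batch size) layout
text = torch.tensor([[5, 8],
                     [6, 9],
                     [7, 0]])

embeds = embedding(text)                     # (3, 2, 5)
x = embeds.view(len(text), batch_size, -1)   # shape unchanged here: (3, 2, 5)
out, (ht, ct) = lstm(x)
print(x.shape, ht[-1].shape)                 # ht[-1]: one vector per document, (2, 4)
```

Since the embedding layer already returns a 3D tensor in the expected layout, the `view` call mostly documents the intended shape in this case.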

In [None]:
class LSTMModel3(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, batch_size):
        super(LSTMModel3, self).__init__()
        self.trace = True
        self.batch_size = batch_size # <------
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM( input_size=embedding_dim, 
                            hidden_size=hidden_dim, 
                            bidirectional=False)

        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text):
        embeds = self.embedding(text)
        if self.trace:
          print( len(text), self.batch_size, embeds.shape)
          self.trace = False
        x = embeds.view(len(text), self.batch_size, -1) # <------
        out, (ht, ct) = self.lstm( x )
        y = self.fc2(ht[-1])
        return y

You can now run an experiment with a batch size of 2.

Note that we have another modification here in the DataLoader:
* drop_last=True: drop the last incomplete batch 
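
The effect of `drop_last` can be seen on a toy dataset of 5 items (a hypothetical list of integers, used only for illustration). With a batch size of 2, the last batch contains a single item unless it is dropped, and the model above assumes a fixed batch size.

```python
from torch.utils.data import DataLoader

toy_data = list(range(5))   # 5 items, batch size 2 -> the last batch has 1 item

loader_keep = DataLoader(toy_data, batch_size=2)
loader_drop = DataLoader(toy_data, batch_size=2, drop_last=True)

print(len(loader_keep), [b.tolist() for b in loader_keep])   # 3 batches
print(len(loader_drop), [b.tolist() for b in loader_drop])   # 2 batches
```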

▶▶ **Look at the shapes printed once by the forward function above (controlled by 'self.trace'), and call 'train' with 'trace=True' to see what the data looks like (stop training when a few tensors are printed).**

In [None]:
# Load data
batch_size = 2
train_loader = DataLoader(train_iter, batch_size=batch_size, shuffle=True, 
                          collate_fn=collate_batch_pad, drop_last=True)
dev_loader = DataLoader(dev_iter, shuffle=True, batch_size=2, 
                        collate_fn=collate_batch_pad, drop_last=True)

In [None]:
# Hyper-parameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 32
output_dim = 2
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()

In [None]:
# Initialize the model
model_lstm3 = LSTMModel3( vocab_size, emb_dim, hidden_dim, output_dim, batch_size )
optimizer = torch.optim.SGD(model_lstm3.parameters(), lr=learning_rate)
model_lstm3 = model_lstm3.to(device)
# Train the model
train( model_lstm3, train_loader, optimizer, num_epochs=5 )

We also modify the evaluation function to take batches as input.

In [None]:
def evaluate_batch( model, dev_loader ):
    predictions = []
    gold = []
    with torch.no_grad():
        for input, label in dev_loader:
            input = input.to(device)
            label = label.to(device)
            probs = model(input)
            # print( probs)
            # print( torch.argmax(probs, dim=1).cpu().numpy())
            # predictions.append( torch.argmax(probs, dim=1).cpu().numpy()[0] )
            predictions.extend( torch.argmax(probs, dim=1).cpu().numpy() ) # <-----
            # gold.append( int(label) )
            gold.extend([int(l) for l in label])  # <-----
    print(classification_report(gold, predictions))
    return gold, predictions


# Evaluate on dev
gold, pred = evaluate_batch( model_lstm3, dev_loader )

# PART4: trying other architectures

* Try to implement a GRU instead of an LSTM
* Try to implement a bi-GRU

As an additional exercise:
* Try with multiple GRU layers
* Try to add a hidden layer over the RNN

https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
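
As a starting point, the main API difference between the two layers can be checked on a random input. This is a minimal sketch with toy sizes, not a solution to the exercises: a GRU has no cell state, so it returns a single hidden state where the LSTM returns a pair.

```python
import torch
import torch.nn as nn

x = torch.rand(3, 2, 5)   # (sequence length, batch size, input size)

lstm = nn.LSTM(input_size=5, hidden_size=4)
gru = nn.GRU(input_size=5, hidden_size=4)

out_lstm, (h_lstm, c_lstm) = lstm(x)   # LSTM: hidden state and cell state
out_gru, h_gru = gru(x)                # GRU: hidden state only

print(out_gru.shape, h_gru.shape)      # (3, 2, 4) and (1, 2, 4)
```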

## 4.1 Using a GRU

▶▶ **Modify the code to use a GRU instead of an LSTM**

In [None]:
class GRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, batch_size):
          

    def forward(self, text):
        

In [None]:
batch_size=2

dataloader = DataLoader(train_iter, batch_size=batch_size, shuffle=False, 
                        collate_fn=collate_batch_pad, drop_last=True)

In [None]:
# Hyper-parameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 32
output_dim = 2

learning_rate = 0.1
num_epochs = 5

criterion = nn.CrossEntropyLoss()

In [None]:
model_gru = GRUModel(vocab_size, emb_dim, hidden_dim, output_dim, batch_size)
optimizer = torch.optim.SGD(model_gru.parameters(), lr=learning_rate)
model_gru = model_gru.to(device)

In [None]:
# Start training
for epoch in range(num_epochs):
    train_loss, total_acc, total_count = 0, 0, 0
    for text, label in dataloader:
        text = text.to(device)
        label = label.to(device)

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model_gru( text )
        #print(text)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, label)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        # Accumulating the loss over time
        train_loss += loss.item()
        total_acc += (outputs.argmax(1) == label).sum().item()
        total_count += label.size(0)

    # Report the mean loss per batch and the accuracy on the train set at each epoch
    print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/len(dataloader), total_acc/total_count))
        
    total_acc, total_count = 0, 0
    train_loss = 0

## 4.2 Using a bi-GRU

▶▶ **Modify the code to implement a bi-directional GRU. Hint: what is the size of the output of a bi-RNN? What should be used for predictions?**
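
To explore the hint, one option is to inspect the shapes returned by a bidirectional layer on a random input. This is only a sketch with toy sizes; deciding how to combine these tensors for the prediction is left as part of the exercise.

```python
import torch
import torch.nn as nn

x = torch.rand(3, 2, 5)   # (sequence length, batch size, input size)

bigru = nn.GRU(input_size=5, hidden_size=4, bidirectional=True)
out, h = bigru(x)

print(out.shape)   # (3, 2, 8): forward and backward states are concatenated
print(h.shape)     # (2, 2, 4): one final hidden state per direction
```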

In [None]:
class BiGRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, batch_size):
 

    def forward(self, text):


In [None]:
batch_size=2

dataloader = DataLoader(train_iter, batch_size=batch_size, shuffle=False, 
                        collate_fn=collate_batch_pad, drop_last=True)

In [None]:
# Hyper-parameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 32
output_dim = 2

learning_rate = 0.1
num_epochs = 5

criterion = nn.CrossEntropyLoss()

In [None]:
model_bigru = BiGRUModel(vocab_size, emb_dim, hidden_dim, output_dim, batch_size)
optimizer = torch.optim.SGD(model_bigru.parameters(), lr=learning_rate)
model_bigru = model_bigru.to(device)

In [None]:
# Start training
for epoch in range(num_epochs):
    train_loss, total_acc, total_count = 0, 0, 0
    for text, label in dataloader:
        text = text.to(device)
        label = label.to(device)

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model_bigru( text )
        #print(text)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, label)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        # Accumulating the loss over time
        train_loss += loss.item()
        total_acc += (outputs.argmax(1) == label).sum().item()
        total_count += label.size(0)

    # Report the mean loss per batch and the accuracy on the train set at each epoch
    print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/len(dataloader), total_acc/total_count))
        
    total_acc, total_count = 0, 0
    train_loss = 0

# PART5: Sequence tagging

## POS Tagging

From: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html 
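
The main difference with classification is the readout: for tagging, we need one prediction per word, so the linear layer is applied to the LSTM output at every position instead of the final hidden state only. The sketch below illustrates this with toy sizes and a random input (assumed values, not the tutorial data). It also checks that `log_softmax` followed by `nn.NLLLoss` gives the same loss as `nn.CrossEntropyLoss` applied to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lstm = nn.LSTM(input_size=6, hidden_size=6)
hidden2tag = nn.Linear(6, 3)             # 3 tags

x = torch.rand(5, 1, 6)                  # a 5-word sentence, batch size 1
lstm_out, (ht, ct) = lstm(x)

# Tagging: read out every time step -> one score vector per word
raw_scores = hidden2tag(lstm_out.view(5, -1))
tag_scores = F.log_softmax(raw_scores, dim=1)
print(tag_scores.shape)                  # (5, 3)

# Classification: read out the final hidden state only
print(hidden2tag(ht[-1]).shape)          # (1, 3)

# log_softmax + NLLLoss matches CrossEntropyLoss on the raw scores
targets = torch.tensor([0, 1, 2, 0, 1])
same = torch.allclose(nn.NLLLoss()(tag_scores, targets),
                      nn.CrossEntropyLoss()(raw_scores, targets))
print(same)
```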

In [None]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Assign each tag with a unique index
ix_to_tag = {v: k for k, v in tag_to_ix.items()}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        #print('embeds.shape', embeds.shape)
        #print(embeds.view(len(sentence), 1, -1).shape)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1)) # the whole output, vs output[-1] for classif
        tag_scores = F.log_softmax(tag_space, dim=1) # required with nn.NLLLoss()
        return tag_scores

In [None]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss() # does not include the softmax
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

In [None]:
# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    predictions = torch.argmax(tag_scores, dim=1).cpu().numpy()
    print(tag_scores)
    print(predictions)
    print(training_data[0][0])
    print( [ix_to_tag[p] for p in predictions])

    # The sentence is "The dog ate the apple".  i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, the predicted sequence printed above should be 0 1 2 0 1
    # since 0 is the index of the maximum value of the first row,
    # 1 is the index of the maximum value of the second row, etc.
    # Which is DET NN V DET NN, the correct sequence!
    #print(tag_scores)