**RNN**

---



*Let's first create an RNN with an input feature vector of size 5 and a hidden vector (unit) of size 2. The code below also stacks two such layers (num_layers=2).*

*The output of the RNN is the hidden unit values of the last layer at all time steps. Its shape is (batch_size, sequence_length, num_directions x hidden_size) if batch_first=True; otherwise, (sequence_length, batch_size, num_directions x hidden_size).*

*num_directions is 2 for a bidirectional RNN, in which a second network reads the sequence in reverse order; otherwise it is 1.*

For a complete worked example of sentiment analysis with an RNN, check out the following notebook:

https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb
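
To make the shape rules above concrete, here is a minimal sketch with a bidirectional layer. The batch size of 4 and the names `birnn` and `x` are assumptions chosen for illustration; they are not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

# A single-layer bidirectional RNN: num_directions is 2
birnn = nn.RNN(input_size=5, hidden_size=2, batch_first=True, bidirectional=True)

# (batch_size, sequence_length, input_size) = (4, 3, 5)
x = torch.randn(4, 3, 5)
out, h_n = birnn(x)

# out: (batch_size, sequence_length, num_directions x hidden_size)
print(out.shape)  # torch.Size([4, 3, 4])
# h_n: (num_layers x num_directions, batch_size, hidden_size)
print(h_n.shape)  # torch.Size([2, 4, 2])
```

The last dimension of `out` doubles because the forward and backward hidden states are concatenated at every time step.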


In [None]:
import torch
import torch.nn as nn

# create an RNN layer 
#   input_size is the feature length; hidden_size is the number of hidden units
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=2, batch_first=True)

print(rnn_layer)

print('# Print the initial input-to-hidden weights and biases @ the 1st layer')
print(rnn_layer.weight_ih_l0)
print(rnn_layer.bias_ih_l0)
print('# Print the initial hidden-to-hidden weights and biases @ the 1st layer')
print(rnn_layer.weight_hh_l0)
print(rnn_layer.bias_hh_l0)

# For the second layer, rnn_layer.weight_ih_l1, ...

# Note that there is no separate output; the hidden state is used as the output
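
The stored weights can be checked against the recurrence that `nn.RNN` applies with its default tanh nonlinearity, h_t = tanh(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh). The sketch below is only an illustration: it uses a separate single-layer RNN (`rnn_1layer`, a name assumed here) so that only the `_l0` parameters are involved, and it assumes the default all-zero initial hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn_1layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)
x = torch.randn(1, 3, 5)  # (batch_size, sequence_length, input_size)

with torch.no_grad():
    out, h_n = rnn_1layer(x)

    # Recompute the hidden state step by step from the layer's own weights
    h = torch.zeros(1, 2)  # initial hidden state is zero by default
    manual_steps = []
    for t in range(x.size(1)):
        h = torch.tanh(x[:, t, :] @ rnn_1layer.weight_ih_l0.T + rnn_1layer.bias_ih_l0
                       + h @ rnn_1layer.weight_hh_l0.T + rnn_1layer.bias_hh_l0)
        manual_steps.append(h)
    manual_out = torch.stack(manual_steps, dim=1)  # (1, 3, 2)

matches = torch.allclose(out, manual_out, atol=1e-6)
print(matches)
```

Because the hidden state itself is returned at every step, there is no extra output projection to account for in the manual computation.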


*Let's create an input sequence*

In [None]:
# When batch_first = True, the expected shape of input is
#   (batch_size, sequence_length, input_size)
# Let's create a sequence of size (1,3,5)

x_seq = torch.randn(1,3,5)
print(x_seq)

# At the first time instant, the feature vector is
print(x_seq[0,0,:])

# At the second time instant, the feature vector is
print(x_seq[0,1,:])

*Let's get the output*

RNN has two outputs:

**out**: the output from all time steps, with shape (batch_size, sequence_length, num_directions x hidden_size) when batch_first=True. If there are multiple layers, this is the output of the last layer at all time steps.

**h_n**: hidden unit values from the last time step, with shape (num_layers x num_directions, batch_size, hidden_size); this shape is not affected by batch_first. If there are multiple layers, it contains the final hidden state of every layer.

In [None]:
out, h_n = rnn_layer(x_seq)
print(out)
print(h_n)
# Above, compare the last time step of out with the last layer of h_n.
#   Try it with batch_first=False and with different values of num_layers
print(out.shape)
print(h_n.shape)
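
The relationship between the two outputs can also be checked numerically. The following is a small sketch, assuming a unidirectional two-layer RNN like the one above (the names `rnn_check` and `x_check` are made up for this example): the last time step of `out` should equal the final hidden state of the top layer, `h_n[-1]`.

```python
import torch
import torch.nn as nn

rnn_check = nn.RNN(input_size=5, hidden_size=2, num_layers=2, batch_first=True)
x_check = torch.randn(1, 3, 5)

with torch.no_grad():
    out, h_n = rnn_check(x_check)

last_step = out[:, -1, :]  # top-layer output at the last time step, shape (1, 2)
top_layer = h_n[-1]        # final hidden state of the last layer, shape (1, 2)
same = torch.allclose(last_step, top_layer)
print(same)
```

For a bidirectional RNN this simple comparison no longer holds as written, because the backward direction finishes at the first time step rather than the last.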


*Now, let's do sentiment analysis with real data*


In [None]:
import torch
import pandas as pd

PATH = '/content/drive/MyDrive/Colab Notebooks/sample_data/'
train_dataset = pd.read_csv(PATH+"imdb_train.csv")
valid_dataset = pd.read_csv(PATH+"imdb_valid.csv")
test_dataset = pd.read_csv(PATH+"imdb_test.csv")

#print(dir(train_dataset))
print(train_dataset.head())
print(train_dataset.label.unique()) # there are two labels: 0, 1

# Print a specific review and its review class
print('First sentence:',train_dataset.text[0])
print('First label:',train_dataset.label[0])

print('Training set size:',len(train_dataset.label))


*Next, we have to build a dictionary; that is, assign a codeword (an integer index) to each unique word (token). One-hot encoding is a popular choice for turning these indices into vectors. However, the number of unique words can be very large, so it is better to limit it. We can keep, for example, only the n most common words and replace every word that is not in the dictionary with an "unknown" token.*
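
Before working with the real data, the idea can be sketched on a toy corpus in plain Python. Everything here (`toy_corpus`, `word_to_index`, `encode`, and the choice n = 3) is a made-up illustration, not part of the pipeline built in the next cells.

```python
from collections import Counter

toy_corpus = ["the movie was good", "the movie was bad", "the plot was thin"]

# Count how often each word appears
counts = Counter(word for sentence in toy_corpus for word in sentence.split())

# Keep only the n most common words; index 0 is reserved for unknown words
n = 3
word_to_index = {"<unk>": 0}
for word, _ in counts.most_common(n):
    word_to_index[word] = len(word_to_index)

def encode(sentence):
    # Words outside the dictionary are mapped to the "<unk>" index
    return [word_to_index.get(word, 0) for word in sentence.split()]

print(word_to_index)
print(encode("the movie was great"))  # [1, 3, 2, 0]
```

Here "great" never appears in the toy corpus, so it falls back to index 0.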

In [None]:
import re # Regular expressions

def tokenizer(text):
  # Returns a list of lowercase tokens (duplicates are kept), with HTML tags
  #   and special characters removed and any emoticons appended at the end
  text = re.sub(r'<[^>]*>', '', text)
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
  text = re.sub(r'[\W]+', ' ', text.lower()) +' '.join(emoticons).replace('-','')
  tokenized = text.split()
  return tokenized

print(tokenizer('Hi! Selam...!  Are you still working? Good luck:)'))




In [None]:
from collections import Counter

# Now, find the counts of each token in the train dataset
token_counts = Counter() # a dictionary of words (as key) with counts (as value)

for text in train_dataset.text:
  tokens = tokenizer(text)
  token_counts.update(tokens)

print('Dictionary size:', len(token_counts))

print(token_counts) # printed from the most common to the least common token

print('The number of appearance for "hi" is:',token_counts['hi'])



*Now, let's create a vocabulary*

In [None]:
from collections import OrderedDict
#from torchtext import data
from torchtext.vocab import vocab


MAX_NO_OF_WORDS = 25000

# token_counts keeps the order in which tokens were first seen, not frequency order,
#   so sort the (token, count) pairs by count first
tokens_sorted_by_freq = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)

# Take the first MAX_NO_OF_WORDS
tokens_sorted_by_freq = tokens_sorted_by_freq[0:MAX_NO_OF_WORDS]

# Now build a vocabulary, first get an ordered dictionary of tokens
tokens_ordered_dict = OrderedDict(tokens_sorted_by_freq)
print(tokens_ordered_dict)

# And then, get the vocabulary
vocabulary = vocab(tokens_ordered_dict) # maps tokens to indices
print(vocabulary['the'])  


# Add two new tokens to the vocabulary
vocabulary.insert_token("<pad>", 0) # this is for padding to adjust seq. length
vocabulary.insert_token("<unk>", 1) # this is for unknown words
vocabulary.set_default_index(1)

print(vocabulary['<pad>'])
print(vocabulary['the'])
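
In case `torchtext` is not available in your environment, the same behavior can be approximated in plain Python. The class below (`SimpleVocab`) is a hypothetical stand-in written for this note, not a torchtext API; it only supports lookup by token and `len()`, which is all the rest of this notebook relies on.

```python
from collections import Counter

class SimpleVocab:
    """Hypothetical minimal replacement for a token-to-index vocabulary."""

    def __init__(self, token_counts, max_size, specials=("<pad>", "<unk>")):
        # Special tokens come first, so "<pad>" is 0 and "<unk>" is 1
        self.itos = list(specials)
        self.itos += [tok for tok, _ in token_counts.most_common(max_size)]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index = self.stoi["<unk>"]

    def __getitem__(self, token):
        # Unknown tokens fall back to the "<unk>" index
        return self.stoi.get(token, self.default_index)

    def __len__(self):
        return len(self.itos)

toy_counts = Counter("the cat sat on the mat the end".split())
toy_vocab = SimpleVocab(toy_counts, max_size=3)

print(toy_vocab["<pad>"])  # 0
print(toy_vocab["the"])    # 2
print(toy_vocab["zebra"])  # 1 (unknown)
print(len(toy_vocab))      # 5
```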


*Now, let's create a custom dataset. Note that train_dataset is a pandas dataframe, not a PyTorch dataset, so we need to define the `__init__`, `__len__` and `__getitem__` methods. We also need to transform the text according to the vocabulary we have built.*

In [None]:

from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
  def __init__(self,text_dataset,vocab):
    # text_dataset is a pandas dataframe, with .text and .label attributes
    self.text_dataset = text_dataset 
    self.vocab = vocab
  
  def __len__(self):
    return len(self.text_dataset.label)

  def __getitem__(self,idx):
    a_text = self.text_dataset.text[idx] 

    text = self.transform_text(a_text)
    label = torch.tensor(self.text_dataset.label[idx])
    textlength = torch.tensor(len(text))  # number of tokens in this review

    return text, label, textlength

  def transform_text(self,text):
    # Define a function to transform text to vocab indices
    text_pipeline = lambda x: [self.vocab[token] for token in tokenizer(x)]
    
    # the following should be a list of indices
    transformed_text = text_pipeline(text)
    transformed_text = torch.tensor(transformed_text,dtype=torch.int64)

    return transformed_text

train_textdataset = TextDataset(train_dataset,vocabulary)
valid_textdataset = TextDataset(valid_dataset,vocabulary)
test_textdataset = TextDataset(test_dataset,vocabulary)

# Create a dataloader with batch_size = 1
dataloader = DataLoader(train_textdataset, batch_size=1,shuffle=True)

# Check it out
text_batch, label_batch, textlength_batch = next(iter(dataloader))
print('text batch:',text_batch)
print('label batch:',label_batch)
print('text length batch',textlength_batch)

# The above works for batch_size=1; for larger batch sizes, the default
#   collate function raises an error because it cannot stack texts of
#   different lengths into one tensor.
#   Therefore, we need to define a collate (combine) function

def collate_batch(batch):
  text_list, label_list,length_list = [], [], []
  for text, label, length in batch:
    text_list.append(text)
    label_list.append(label)
    length_list.append(length)
  
  # Pad text to make sure all samples in the batch have same size
  padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True) 

  # convert the label_list and length_list to tensors
  #   (labels are floats for BCELoss; lengths are integer token counts)
  label_list = torch.tensor(label_list,dtype=torch.float32)
  length_list = torch.tensor(length_list,dtype=torch.int64)

  return padded_text_list, label_list, length_list


# Now, create a dataloader that can handle batches
dataloader2 = DataLoader(train_textdataset, 
                         batch_size=2,
                         shuffle=True,
                         collate_fn=collate_batch)

# Check it out
text_batch, label_batch, textlength_batch = next(iter(dataloader2))
print(text_batch)
print(label_batch)
print(textlength_batch)
print(text_batch.shape)
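
What the padding step inside the collate function does is easier to see on a toy batch. This is a minimal sketch with made-up index sequences (`seqs`); it assumes, as above, that index 0 is the padding token.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two "reviews" of different lengths, already converted to vocabulary indices
seqs = [torch.tensor([4, 7, 9]), torch.tensor([5, 2])]

# Shorter sequences are filled with padding_value up to the longest length
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
lengths = torch.tensor([len(s) for s in seqs])

print(padded)
# tensor([[4, 7, 9],
#         [5, 2, 0]])
print(lengths)  # tensor([3, 2])
```

Keeping the original lengths alongside the padded tensor is what later allows the model to ignore the padded positions.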


*Let's create dataloaders for train, validation and test datasets*

In [None]:
batch_size = 256

train_dl = DataLoader(train_textdataset, 
                      batch_size=batch_size,
                      shuffle=True, 
                      collate_fn=collate_batch)
valid_dl = DataLoader(valid_textdataset, 
                      batch_size=batch_size,
                      shuffle=False, 
                      collate_fn=collate_batch)
test_dl = DataLoader(test_textdataset, 
                     batch_size=batch_size,
                     shuffle=False, 
                     collate_fn=collate_batch)


# Check it out
text_batch, label_batch, length_batch = next(iter(train_dl))
print(text_batch.shape)
print(label_batch.shape)
print(length_batch.shape)
print(text_batch)
print(label_batch)
print(length_batch)
print(text_batch[0].dtype)
print(label_batch[0].dtype)
print(length_batch[0].dtype)

*We will do **feature embedding**, which maps each word index to a dense, low-dimensional vector and so acts as a dimensionality reduction approach for word vectors.*

*At the moment, the words in a text are integers. One could use one-hot encoding to convert these integers to vectors of ones and zeros. However, each vector would be as long as the vocabulary, so the dimensionality would be very large.*
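
The link between one-hot encoding and embedding can be shown directly: looking up a row of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix. The sketch below assumes a toy vocabulary of 10 tokens; the names `emb`, `idx`, `via_matmul` and `via_lookup` are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=10, embedding_dim=2)
idx = torch.tensor([1, 5, 9])

# One-hot route: (3, 10) matrix of zeros and ones times the (10, 2) weight matrix
one_hot = F.one_hot(idx, num_classes=10).float()
via_matmul = one_hot @ emb.weight

# Embedding route: a direct row lookup, with no large one-hot matrix needed
via_lookup = emb(idx)

print(one_hot.shape)     # torch.Size([3, 10])
print(via_lookup.shape)  # torch.Size([3, 2])
print(torch.allclose(via_matmul, via_lookup))
```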

In [None]:
# An example of embedding, which is a lookup table from indices to vectors
#   (equivalent to a linear mapping applied to one-hot vectors)
#   The weights are initialized randomly. (Run this cell multiple times, and see)
#   If part of a model, the weights are optimized through training
embedding = nn.Embedding(num_embeddings=10,embedding_dim=2,padding_idx=0)
# In the above, num_embeddings will be the vocabulary size,
#   embedding_dim is what we choose
#   padding_idx is 0; the vector for this index is all zeros and
#   receives no gradient updates during training

# a sample input
text_input = torch.LongTensor([[1,2,0,0],[5,4,2,1],[9,1,1,0]])
print(text_input.shape)
out = embedding(text_input)
print(out)
print(out.shape)
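
The effect of `padding_idx` can be verified with a quick backward pass. This is a small sketch using a throwaway layer (`emb_pad`, a name assumed here): the row for index 0 starts at zero and its gradient stays zero, while rows for real tokens do receive gradients.

```python
import torch
import torch.nn as nn

emb_pad = nn.Embedding(num_embeddings=10, embedding_dim=2, padding_idx=0)
ids = torch.tensor([[1, 2, 0, 0]])  # two real tokens followed by two pads

# Sum all embedding outputs and backpropagate
emb_pad(ids).sum().backward()

pad_row_is_zero = bool((emb_pad.weight[0] == 0).all())
pad_grad_is_zero = bool((emb_pad.weight.grad[0] == 0).all())
token_grad_is_one = bool((emb_pad.weight.grad[1] == 1).all())

print(pad_row_is_zero, pad_grad_is_zero, token_grad_is_one)
```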


*An LSTM model with feature embedding*

In [None]:
class model_LSTM(nn.Module):
  def __init__(self,vocab_size,embed_dim,hidden_size,fc_size):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size,
                                  embed_dim,
                                  padding_idx=0)
    self.lstm = nn.LSTM(embed_dim,
                        hidden_size,
                        batch_first = True)
    self.fc1 = nn.Linear(hidden_size,fc_size)
    self.relu = nn.ReLU()
    # We have two sentiment outcomes: 0 (negative) or 1 (positive).
    #   So, use a single output unit, and then apply nn.Sigmoid()
    #   Then, for the cost function, use nn.BCELoss
    self.fc2 = nn.Linear(fc_size,1)
    self.sigmoid = nn.Sigmoid()

  def forward(self,text,textlength):

    # textlength is a 1-D tensor of sequence lengths; convert it to a list of ints
    textlength = [int(x.item()) for x in textlength]

    out = self.embedding(text)


    # Packing provides some computational efficiency
    #   Check out: https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
    out = nn.utils.rnn.pack_padded_sequence(out, 
            textlength,
            enforce_sorted=False, 
            batch_first=True)
    # Example of padding followed by packing:
    #   a = [torch.tensor([1,2,3]), torch.tensor([3,4])]
    #   b = torch.nn.utils.rnn.pad_sequence(a, batch_first=True)
    #   # b is tensor([[1, 2, 3],
    #   #             [3, 4, 0]])
    #   torch.nn.utils.rnn.pack_padded_sequence(b, batch_first=True, lengths=[3,2])
    #   # gives a PackedSequence with data=tensor([1, 3, 2, 4, 3])
    #   #   and batch_sizes=tensor([2, 2, 1])

    _, (hidden,cell) = self.lstm(out)
    out = hidden[-1,:,:]
    out = self.fc1(out)
    out = self.relu(out)
    out = self.fc2(out)
    out = self.sigmoid(out)

    return out

# parameters
vocab_size = len(vocabulary)
embed_dim = 20
hidden_size = 64
fc_size = 32
model = model_LSTM(vocab_size,embed_dim,hidden_size,fc_size)
print(model)


# Check out the model with a sample input
text_batch, label_batch, length_batch = next(iter(train_dl))
# take the first two elements of the batch
text_batch = text_batch[0:2,:]
label_batch = label_batch[0:2]
length_batch = length_batch[0:2]

print(text_batch.shape)

print(text_batch)
print(label_batch)
print(length_batch)


model(text_batch,length_batch)
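
Besides efficiency, packing also changes what the final hidden state means: with a packed input, the LSTM stops at each sequence's true length instead of running over the padding. The sketch below illustrates this on a toy LSTM; the sizes and the names `lstm_demo`, `short` and `padded` are assumptions made for this example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm_demo = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)

short = torch.randn(1, 2, 3)                              # true sequence, length 2
padded = torch.cat([short, torch.zeros(1, 3, 3)], dim=1)  # padded to length 5

with torch.no_grad():
    # Reference: run the LSTM on the unpadded sequence
    _, (h_ref, _) = lstm_demo(short)

    # Packed: padded input plus the true length
    packed = pack_padded_sequence(padded, lengths=[2],
                                  batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm_demo(packed)

    # Unpacked: the LSTM also steps through the three padding positions
    _, (h_padded, _) = lstm_demo(padded)

packed_matches = torch.allclose(h_ref, h_packed, atol=1e-6)
padded_matches = torch.allclose(h_ref, h_padded, atol=1e-6)
print(packed_matches, padded_matches)
```

This is why the model above can safely use `hidden[-1]` as the summary of each review even though the reviews in a batch have different lengths.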



*Now, let's define a train function and a test function, each taking a dataloader as input. Each function goes over the entire dataset once (one epoch) using the dataloader, which loads the data in batches.*

In [None]:
num_epochs = 10
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(),lr=0.01)

def train_for_an_epoch(dataloader):
  # put the model in train mode
  model.train()

  total_acc = 0.0
  total_loss = 0.0

  for text_batch, label_batch, length_batch in dataloader:
    optimizer.zero_grad()
    pred = model(text_batch,length_batch)[:,0]
    loss = loss_fn(pred, label_batch)
    loss.backward()
    optimizer.step()

    total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
    total_loss += loss.item()*label_batch.size(0)

  return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)


def test_for_an_epoch(dataloader):
  model.eval()

  total_acc = 0.0
  total_loss = 0.0

  with torch.no_grad():
    for text_batch, label_batch, length_batch in dataloader:
      pred = model(text_batch, length_batch)[:,0]
      loss = loss_fn(pred, label_batch)

      total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
      total_loss += loss.item()*label_batch.size(0)
  
  return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)
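
The loss and accuracy bookkeeping used in both functions can be checked by hand on a tiny batch. The values in `toy_pred` and `toy_labels` below are made up for illustration; the sketch compares `nn.BCELoss` (with its default mean reduction) against the binary cross-entropy formula and counts correct predictions with the same 0.5 threshold.

```python
import torch
import torch.nn as nn

toy_pred = torch.tensor([0.9, 0.2, 0.6, 0.4])    # sigmoid outputs of a model
toy_labels = torch.tensor([1.0, 0.0, 0.0, 1.0])  # true sentiment labels

# Built-in loss (averaged over the batch by default)
loss = nn.BCELoss()(toy_pred, toy_labels)

# The same quantity written out explicitly
manual = -(toy_labels * torch.log(toy_pred)
           + (1 - toy_labels) * torch.log(1 - toy_pred)).mean()

# Number of correct predictions with a 0.5 decision threshold
correct = ((toy_pred >= 0.5).float() == toy_labels).float().sum().item()

print(torch.allclose(loss, manual))
print(correct)  # 2.0
```

Since the loss is a batch average, the functions above multiply it by the batch size before accumulating, so that dividing by the dataset size at the end gives a per-sample average.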


*Let's do the training.*

In [None]:
#text_batch, label_batch, length_batch = next(iter(train_dl))


for epoch in range(num_epochs):
  acc_train, loss_train = train_for_an_epoch(train_dl)
  acc_valid, loss_valid = test_for_an_epoch(valid_dl)

  print(f'Epoch: {epoch}, train_accuracy: {acc_train:.4f}, val_accuracy: {acc_valid:.4f}')

