# Implementing the Concept of Embeddings
Embeddings are the standard way to do transfer learning in natural language processing. In image processing, transfer learning is already quite mature, and it is steadily gaining ground in NLP as well. In this recipe, I will demonstrate how embeddings can help in training.

I will use one example to illustrate how transfer learning can help. Let's say Tom lives in the USA and is a native English speaker. He has an internship opportunity in France and will travel there in 3 months. Tom has no knowledge of the French language. Being smart, he spends those 3 months listening to a French radio channel. Initially he cannot understand anything, but slowly and gradually his brain starts making sense of it. By the time he reaches France, he has in effect pre-trained his brain. This low-level understanding of French helps him learn the language faster. Embeddings work in a similar way. They are formed by forcing a network to learn from context. When these learned representations are passed to an RNN-like network, it can learn better than it would from scratch.
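
To make "learning from context" a little more concrete, here is a minimal sketch of how (center word, context word) training pairs can be built from a sentence, which is one common way embedding models are given context to learn from. The function name `context_pairs` and the window size are illustrative assumptions and are not part of this recipe's pipeline.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "tom listens to french radio".split()
pairs = context_pairs(sentence, window=1)
print(pairs)
```

An embedding model trained on such pairs is pushed to give similar vectors to words that appear in similar contexts.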

# Importing Requirements

In [None]:

import gzip
import json
import os
import random
import shutil
import tarfile
import urllib.request

import chakin
import matplotlib.pyplot as plt
import nltk
import torch
from torch import nn, optim
from torchtext import data
from torchtext import vocab
from tqdm import tqdm

nltk.download('popular')
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


# Downloading required datasets
To demonstrate how embeddings can help, we will run an experiment on a sentiment analysis task. I have used a movie review dataset containing 5331 positive and 5331 negative processed sentences. The entire experiment is divided into 5 sections.

Downloading Dataset: The dataset discussed above is available at http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz.



In [None]:
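data_exists = os.path.isfile('data/rt-polaritydata.tar.gz')
if not data_exists:
    # Make sure the target directory exists before downloading into it
    os.makedirs('data', exist_ok=True)
    urllib.request.urlretrieve("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz",
                               "data/rt-polaritydata.tar.gz")
    # The context manager closes the archive once extraction is done
    with tarfile.open("data/rt-polaritydata.tar.gz") as tar:
        tar.extractall(path='data/')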

# Downloading embedding
Pre-trained embeddings are available and can easily be used in our model. I found a package on GitHub, Chakin, that makes downloading the required embeddings very easy. Chakin is a simple downloader for pre-trained word vectors and can be installed with pip: `pip install chakin`. You can use Chakin as shown below:

In [None]:
embed_exists = os.path.isfile('../embeddings/cc.en.300.vec.gz')
if not embed_exists:
    print("Downloading FastText embeddings, if not downloaded properly, then delete the `embeddings/cc.en.300.vec.gz`")
    chakin.search(lang='English')  # lists the available English vectors with their index numbers
    chakin.download(number=2, save_dir='../embeddings')  # same directory that is read from below
    # Decompress the archive so the plain-text vectors can be loaded later
    with gzip.open('../embeddings/cc.en.300.vec.gz', 'rb') as f_in, \
            open('../embeddings/cc.en.300.vec', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Preprocessing
I am using TorchText to preprocess the downloaded data. The preprocessing includes the following steps:

- Reading and parsing data
- Defining sentiment and label fields
- Dividing data into train, validation and test subsets
- Loading the embeddings
- Forming the train, validation and test iterators

In [None]:
SEED = 1
split = 0.80

In [None]:
data_block = []
# Negative reviews get label 0
with open('data/rt-polaritydata/rt-polarity.neg', encoding='utf8', errors='ignore') as f:
    negative_data = f.read().splitlines()
for i in negative_data:
    data_block.append({"sentiment": str(i.strip()), "label": 0})
# Positive reviews get label 1
with open('data/rt-polaritydata/rt-polarity.pos', encoding='utf8', errors='ignore') as f:
    positive_data = f.read().splitlines()
for i in positive_data:
    data_block.append({"sentiment": str(i.strip()), "label": 1})

In [None]:
random.shuffle(data_block)

split_index = int(len(data_block) * split)
# Write one JSON object per line; the `with` block closes (and flushes) both files
# so that they are complete before TorchText reads them
with open('data/train.json', 'w') as train_file, open('data/test.json', 'w') as test_file:
    for example in data_block[:split_index]:
        train_file.write(json.dumps(example) + "\n")
    for example in data_block[split_index:]:
        test_file.write(json.dumps(example) + "\n")
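
The shuffle-and-split step can also be tried out in memory, without touching the dataset files. The sketch below is only an illustration: the helper names `split_records` and `to_json_lines` and the toy records are assumptions, not part of the recipe. It produces the same one-JSON-object-per-line layout that is written to `train.json` and `test.json`.

```python
import json
import random


def split_records(records, ratio=0.8, seed=1):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves the global state untouched
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def to_json_lines(records):
    """Serialize records as JSON lines: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)


# Toy records with the same keys as the real data
records = [{"sentiment": f"review {i}", "label": i % 2} for i in range(10)]
train_records, test_records = split_records(records)
train_text = to_json_lines(train_records)
print(len(train_records), len(test_records))
print(train_text.splitlines()[0])
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible regardless of what else has consumed the global random state.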

In [None]:
def tokenize(sentiments):
    return sentiments


def pad_to_equal(x):
    # Pad or truncate every token list to exactly 61 tokens
    if len(x) < 61:
        return x + ['<pad>' for i in range(0, 61 - len(x))]
    else:
        return x[:61]


def to_categorical(x):
    # One-hot encode the binary label: 0 -> [1, 0], 1 -> [0, 1]
    if x == 1:
        return [0, 1]
    if x == 0:
        return [1, 0]
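
To see what the padding and label-encoding helpers produce, here is a small standalone sketch. The names `pad_to_length` and `one_hot` are hypothetical generalizations of `pad_to_equal` and `to_categorical`, with the length and number of classes exposed as parameters.

```python
PAD_TOKEN = "<pad>"


def pad_to_length(tokens, max_len=61, pad_token=PAD_TOKEN):
    """Truncate or right-pad `tokens` so the result has exactly `max_len` items."""
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))


def one_hot(label, num_classes=2):
    """Encode an integer class label as a one-hot list."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


short = "a gorgeous , witty film".split()
padded = pad_to_length(short, max_len=8)
print(padded)
print(one_hot(0), one_hot(1))
```

Fixing every review to the same length lets the reviews be stacked into a single tensor per batch.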
    

In [None]:
SENTIMENT = data.Field(sequential=True , preprocessing =pad_to_equal , use_vocab = True, lower=True)
LABEL = data.Field(is_target=True,use_vocab = False, sequential=False, preprocessing =to_categorical)
fields = {'sentiment': ('sentiment', SENTIMENT), 'label': ('label', LABEL)}

In [None]:
train_data , test_data = data.TabularDataset.splits(
                            path = 'data',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields                                
)

In [None]:
print("Printing an example data : ",vars(train_data[1]))

**Splitting the training data into train and validation sets**

In [None]:
# random.seed() returns None, so seed first and then pass the generator state
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())

In [None]:
print('Number of training examples: ', len(train_data))
print('Number of validation examples: ', len(valid_data))
print('Number of testing examples: ',len(test_data))

**Loading embeddings into the vocab**

In [None]:
vec = vocab.Vectors(name = "cc.en.300.vec",cache = "../embeddings/")

In [None]:
SENTIMENT.build_vocab(train_data, valid_data, test_data, max_size=100000, vectors=vec)
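
Conceptually, attaching vectors to a vocabulary means building a matrix whose row `i` holds the vector of the `i`-th vocabulary word. The sketch below illustrates this on a tiny in-memory sample in the `.vec` text layout (a header line with the word count and dimension, then one word per line followed by its values). The helper names `parse_vec` and `build_matrix` are assumptions for illustration and not TorchText functions. Words without a pre-trained vector get zeros here, which I believe matches TorchText's default behaviour.

```python
def parse_vec(text):
    """Parse `.vec`-style text into ({word: vector}, dimension)."""
    lines = text.strip().splitlines()
    count, dim = (int(n) for n in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(v) for v in values]
    assert len(vectors) == count, "header count does not match the number of rows"
    return vectors, dim


def build_matrix(itos, vectors, dim):
    """Row i is the vector for vocabulary word itos[i]; unknown words get zeros."""
    return [vectors.get(word, [0.0] * dim) for word in itos]


sample = """3 4
film 0.1 0.2 0.3 0.4
good 0.5 0.1 0.0 0.2
bad -0.5 0.1 0.0 -0.2
"""
vectors, dim = parse_vec(sample)
itos = ["<unk>", "<pad>", "film", "good", "dull"]  # toy index-to-string vocabulary
matrix = build_matrix(itos, vectors, dim)
for word, row in zip(itos, matrix):
    print(word, row)
```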

**Constructing Iterators**

In [None]:
train_iter, val_iter, test_iter = data.Iterator.splits(
        (train_data, valid_data, test_data), sort_key=lambda x: len(x.sentiment),
        batch_sizes=(32,32,32), device=device)

In [None]:
sentiment_vocab = SENTIMENT.vocab

In [None]:
sentiment_vocab.vectors.shape

# Training
Training will be conducted for two models: one with no pre-trained embeddings and one with FastText embeddings. I am using 300-dimensional FastText embeddings (`cc.en.300`), which were trained on Common Crawl and Wikipedia.

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.argmax(preds, dim=1)
    correct = (rounded_preds == torch.argmax(y, dim=1)).float()  # convert into float for division
    acc = correct.sum()/len(correct)
    return acc
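
As a quick sanity check of this accuracy computation, the sketch below applies the same argmax comparison to a hand-made batch of three examples. The function name `batch_accuracy` and the toy tensors are assumptions for illustration.

```python
import torch


def batch_accuracy(scores, one_hot_targets):
    """Fraction of rows where the highest-scoring class matches the one-hot target."""
    predicted = torch.argmax(scores, dim=1)
    expected = torch.argmax(one_hot_targets, dim=1)
    return (predicted == expected).float().mean()


scores = torch.tensor([[0.2, 0.8],   # predicts class 1
                       [0.9, 0.1],   # predicts class 0
                       [0.4, 0.6]])  # predicts class 1
targets = torch.tensor([[0.0, 1.0],  # class 1
                        [1.0, 0.0],  # class 0
                        [1.0, 0.0]]) # class 0
acc = batch_accuracy(scores, targets)
print(acc.item())
```

Two of the three predictions match their targets, so the result is two thirds.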

In [None]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()       
        predictions = model(batch.sentiment.to(device)).squeeze(1)
        loss = criterion(predictions.type(torch.FloatTensor), batch.label.type(torch.FloatTensor))
        acc = binary_accuracy(predictions.type(torch.FloatTensor), batch.label.type(torch.FloatTensor))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

## Training From Scratch
The network with no pre-trained embeddings can be defined as given below. 
The `SCRATCH_RNN` class builds embeddings from scratch using PyTorch's `nn.Embedding` module. This module is commonly used to store word embeddings and retrieve them by index: the input is a list of indices, and the output is the corresponding word embeddings. The embedding weights are trainable, so they are updated throughout training and gradually yield better word vectors. The embedding vectors are passed to the RNN, which returns an output tensor and a hidden tensor. The hidden tensor holds the crux of what has been learned, so the final hidden state is passed through one linear transformation, and the predicted output is calculated from the resulting class scores.

In [None]:
class SCRATCH_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, sentiment_vocab):
        super(SCRATCH_RNN, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))

        output, hidden = self.rnn(embedded)
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # Return raw logits: nn.BCEWithLogitsLoss applies the sigmoid itself, so no softmax is applied here
        return self.fc(hidden.squeeze(0))
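
The index-to-vector lookup that `nn.Embedding` performs can be seen in isolation with a toy vocabulary. This is a minimal sketch; the sizes and token ids are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy lookup table: 10 "words", each mapped to a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Token ids laid out as [seq_len=3, batch=2]
token_ids = torch.tensor([[1, 2],
                          [3, 4],
                          [5, 0]])
embedded = embedding(token_ids)
print(embedded.shape)  # one 4-dimensional vector per token id

# Each output vector is simply the matching row of the weight matrix
same_row = torch.equal(embedded[0, 0], embedding.weight[1])
print(same_row, embedding.weight.requires_grad)
```

Because the weight matrix is a trainable parameter, the rows that were looked up receive gradients and change during training.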

In [None]:
INPUT_DIM = len(SENTIMENT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 2
BATCH_SIZE = 32
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

scratch_rnn = SCRATCH_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, sentiment_vocab)
scratch_rnn = scratch_rnn.to(device)

In [None]:
optimizer = optim.SGD(scratch_rnn.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

In [None]:
new_embedding_loss = []
new_embedding_accuracy = []
for i in tqdm(range(0,100)):
    loss, accuracy =  train(scratch_rnn, train_iter, optimizer, criterion)
    new_embedding_loss.append(loss)
    new_embedding_accuracy.append(accuracy)

## Using Pre-trained Embeddings
The other network is one where we pass in pre-trained embeddings. It looks like the previous network, except that the embedding layer is initialized with the pre-trained vectors instead of random weights.

In [None]:
class GLOVE_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, sentiment_vocab):
        super(GLOVE_RNN, self).__init__()

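        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Initialize the embedding weights with the pre-trained vectors stored in the vocab
        self.embedding.weight.data.copy_(sentiment_vocab.vectors)
        # Keep the embeddings trainable so they are fine-tuned on this task
        self.embedding.weight.requires_grad = True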
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        embedded = self.dropout(self.embedding(x))
        output, hidden = self.rnn(embedded)
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden.squeeze(0))
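
For comparison, PyTorch also offers `nn.Embedding.from_pretrained`, which builds the layer directly from a weight tensor and lets you choose whether the vectors stay frozen. The sketch below uses a tiny made-up weight matrix as a stand-in for `sentiment_vocab.vectors`. It is an alternative shown for illustration, not the approach used in this recipe.

```python
import torch
from torch import nn

# Stand-in for pre-trained vectors: 3 words, 3 dimensions each
pretrained = torch.tensor([[0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6]])

# Same pattern as in the class above: create the layer, then copy the weights in
copied = nn.Embedding(3, 3)
copied.weight.data.copy_(pretrained)

# Alternative constructor: trainable (fine-tuned) or frozen embeddings
fine_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

print(torch.equal(copied.weight.data, pretrained))
print(fine_tuned.weight.requires_grad, frozen.weight.requires_grad)
print(frozen(torch.tensor([2])))
```

Freezing keeps the pre-trained vectors intact, while leaving them trainable lets them adapt to the sentiment task.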

In [None]:
INPUT_DIM = len(SENTIMENT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 2
BATCH_SIZE = 32
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

glove_rnn = GLOVE_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, sentiment_vocab)
glove_rnn = glove_rnn.to(device)

In [None]:
optimizer = optim.SGD(glove_rnn.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

In [None]:
glove_embedding_loss = []
glove_embedding_accuracy = []
for i in tqdm(range(0,100)):
    loss, accuracy =  train(glove_rnn, train_iter, optimizer, criterion)
    glove_embedding_loss.append(loss)
    glove_embedding_accuracy.append(accuracy)

# Result

I trained both networks for 100 epochs and plotted the per-epoch training accuracy of both models. The plot is given below.
![Figure:  Showing the difference in the accuracy when embeddings are trained from scratch and when Pretrained FastText embeddings are used.](figures/advantage_of_embeddings.png)


The plot makes it clear that pre-trained embeddings really help in learning. Training from scratch resulted in 83% accuracy, whereas training with FastText embeddings reached 88% accuracy.

In [None]:
plt.plot(new_embedding_accuracy , label = "New Embedding Accuracy")
plt.plot(glove_embedding_accuracy , label = "Pre-trained Embedding Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(loc='upper left')
plt.show()
