## Feedforward Neural Networks - Supervised Learning II - MDS Computational Linguistics

### Goal of this tutorial
- Implement single layer and multilayer neural networks for sentiment analysis
- Implement word embeddings based on word2vec

### General
- This notebook was last tested on Python 3.6.9, PyTorch 1.2.0 and sklearn 0.21.3

We would like to acknowledge the following materials which helped as a reference in preparing this tutorial:
- https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/feedforward_nets.pdf
- https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/word2vec.pdf

## Feedforward neural networks

Neural networks are a family of classifiers which can model **non-linear decision boundaries**. 

Recommended reading for the theory behind neural networks: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/feedforward_nets.pdf

We will first define the practical task that we are going to deal with in this tutorial.

In this tutorial, we will focus only on sentiment analysis task. Specifically, we focus on classifying the sentiment of the tweet. We make use of the dataset provided by ``SemEval-2016 Task 4 on Sentiment Analysis on Twitter`` (http://alt.qcri.org/semeval2016/task4/). We focus on the subtask A which is coined as **message polarity classification task**. In this task, given a tweet, we need to predict whether the tweet is of **positive, negative or neutral sentiment**. We have 6,000, 1,999 and 20,632 tweets in train set, validation set and test set respectively. We have already preprocessed (tokenization, removing URLs, mentions, hashtags and so on) the tweets and placed it under ``data/sentiment-twitter-2016-task4`` folder in three files as ``train.tsv``, ``dev.tsv`` and ``test.tsv``. Some example tweets include:

| class index | class name | tweet example |
| ----------------- | ----------- |-------------|
| 0  | Negative   | --MENTION-- --MENTION-- the reason i ask is because it may be the manufacturer's fault and they could help you |
| 1  | Neutral | just ordered my ever tablet --MENTION-- surface pro --DIGIT-- ssd hopefully it works out for dev to replace my laptop |
| 2  | Positive | dear --MENTION-- the newooffice for mac is great and all but no lync update c'mon |


### Getting Started

In [1]:
# all the necessary imports
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch import optim

# set the seed for reproducibility
manual_seed = 123
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

# hyperparameters
BATCH_SIZE = 5 # size of the mini-batch for training
MAX_EPOCHS = 15 # maximum no. of passes over the training data
LEARNING_RATE = 0.5 # learning rate
MAX_FEATURES = 5000 # x_j, the number of j, maximum number of features to represent a tweet (or) size of the tweet feature vector

# other parameters
NUM_CLASSES = 3

# dataset paths
DATA_FOLDER = "data/sentiment-twitter-2016-task4"
TRAIN_FILE = DATA_FOLDER + "/train.tsv"
VALID_FILE = DATA_FOLDER + "/dev.tsv"
TEST_FILE = DATA_FOLDER + "/test.tsv"

### Feature Extractor
The training example is a **tweet**, which contains an ordered list of terms (e.g., words). We need to **vectorize** the tweet, that is, convert the tweet to a fixed-length vector to be fed as input to a ML model. We also call this vector as a **feature vector**. In this tutorial, we will utilize simple features based on **term frequency-inverse document frequency (tf-idf)** to represent a tweet. In our case a *document* is a *tweet* and a *word* is a *term*. There are two functions:
- tf: A function for **'term frequency'** (basically counts, i.e., how many times the term occurred in our document) 
- idf:  A function for **'inverse document frequency'** (the total number of documents divided by the number of documents in which the term occurred).

Tf-idf of a tweet $d$ is a $|V|$ dimensional feature vector, where each component corresponds to a term $t$ in the vocabulary V whose size is $|V|$. Tf-idf is formulated as,

tf-idf(d,t) $ = (1 + \log f_{d,t}) * \log(1 + \frac{N}{n_t})$

where,
- $f_{d,t}$ corresponds to the raw frequency of the term $t$ in tweet $d$
- $N$ corresponds to the number of tweets in train set
- $n_t$ corresponds to the number of tweets in train set that has the term $t$

Note: There are variations of tf-idf, but here:
- We take the log and **add one** to idf to avoid dividing by zero if the term never occurs in any document.
- We also similarly add 1 and take the log of tf.
Ultimately, we will let sklearn take care of [tf-idf construction](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

Essentially, **high weight** in tf–idf feature vector is attained by a **high term frequency (in the given tweet) and a low tweet frequency of the term in the whole collection of tweets**. Hence, the weights tend to filter out common terms.

Let us construct the tf-idf feature vectorizer from the tweets in train set. First, let us read the training corpus.

In [2]:
# tool from sklearn library to compute tfidf of a document
from sklearn.feature_extraction.text import TfidfVectorizer

# function for loading file
def read_corpus(file):
    corpus = [] 
    for line in open(file):
        content, label = line.strip().split("\t")
        corpus.append(content)
    return corpus

# reads the train corpus
train_corpus = read_corpus(TRAIN_FILE) 

# print the number of training records
print("No. of training records: %d"%(len(train_corpus)))

No. of training records: 6000


Let us instantiate an object from the TfidfVectorizer class by specifying the maximum number of features. Other (optional) arguments to the class can be found in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [3]:
# define the vectorizer
# builds a vocabulary that only considers the **top max_features** ordered by **term frequency** across the corpus.
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)

Let us fit the vectorizer with the training records, that is, compute $f_{d,t}$, $N$ and $n_t$.

In [4]:
# fit the vectorizer on train set
vectorizer.fit(train_corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=5000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

Let us check the number of features our vectorizer will output for a given tweet:

In [5]:
# print the number of features
print(len(vectorizer.get_feature_names()))

5000


Let us print the feature names (term, in our case) corresponding to first 10 features:

In [6]:
# print feature names (term) corresponding to first 10 features
print(vectorizer.get_feature_names()[0:10])

['aa', 'aapl', 'abc', 'abigail', 'able', 'about', 'above', 'absence', 'absolute', 'absolutely']


It's important to note that the tf-idf representation of a tweet ignores the word order present in the tweet (this shallow feature representation might be limiting the performance of a model for many NLP tasks). In the latter part of the tutorial, we will look at word embeddings, a key concept in building the state-of-the-art deep learning models for NLP.

### DataLoader

Next, let us construct the dataloader for sentiment dataset. In the previous tutorial, we studied dataloader for loading regression datasets. Here again, we create a custom class **TweetSentimentDataset** that inherits the Dataset class and define the functionality of the constructor function (**__init__**),  get item function (**__getitem__**) and length function (**__len__**).

In [7]:
# create a new class inheriting torch.utils.data.Dataset
class TweetSentimentDataset(Dataset):
    """ sentiment-twitter-2016-task4 dataset."""
    def __init__(self, file, vectorizer):
        # read the corpus
        corpus, labels = [], []
        for line in open(file):
            content, label = line.strip().split("\t")
            corpus.append(content)
            labels.append(int(label))

        # set the size of the corpus
        self.n = len(corpus)

        # vectorize all the tweets
        features = vectorizer.transform(corpus)

        # convert features and labels to torch.tensor
        self.features = torch.from_numpy(features.toarray()).float()
        self.features.to(device)
        self.labels = torch.tensor(labels, device=device, requires_grad=False)

    # return input and output of a single example
    # Input: Feature vectors, where each vector corresponds to a tweet. 
    # Output: Labels, where each label is one index for each of our tags in the set {positive, negative, neutral}
    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    # return the total number of examples
    def __len__(self):
        return self.n


Let us now instantiate the loader for training, validation and test set. Note that we only shuffle the training dataset (hence, we set shuffle argument to True for creating training data loader).

In [8]:
# create the dataloader object
train_loader = DataLoader(dataset=TweetSentimentDataset(TRAIN_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=True, num_workers=2) 
valid_loader = DataLoader(dataset=TweetSentimentDataset(VALID_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False, num_workers=1)
test_loader = DataLoader(dataset=TweetSentimentDataset(TEST_FILE, vectorizer), batch_size=BATCH_SIZE, shuffle=False, num_workers=1) 

Let us access a single batch from the training dataset and print the properties of the batch (such as input and output size):

In [9]:
# iterate over the dataset for one epoch
num_batches = 0
for i, data in enumerate(train_loader, 0):
    input, output = data
    if i == 0:
        print("Input (sample batch) vector size: ",input.size(), "\nGolden label (sample batch) vector size: ",output.size())
    num_batches += 1

Input (sample batch) vector size:  torch.Size([5, 5000]) 
Golden label (sample batch) vector size:  torch.Size([5])


We can also check the number of batches (remember the batch size is set as 5) 

In [10]:
print("Number of batches: ", num_batches) # prints the number of batches per epoch

Number of batches:  1200


The **computational graph** for a single (hidden) layer neural network for classification that we will now build in this tutorial, looks like this:

<img src="images/sl2_textclassification_neuralnets_cg.jpg" alt="MLP" title="Single Layer neural network - Computational Graph" width="650" height="450"/>

For our setting, **x** and **y** correspond to tweet feature vector and tweet sentiment label respectively. 

The LogSoftmax function is akin to Softmax function multiplied by **log** and has better numerical stability in practice. The documentation for LogSoftmax can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html).

Let us look at an example for using **LogSoftmax** layer. First, let's create sample input:

In [11]:
sample_input = torch.Tensor([1.0,2.0,3.0]) # 1x3
print(sample_input)

tensor([1., 2., 3.])


Second, let's create a LogSoftmax layer

In [12]:
logsoftmax_layer = nn.LogSoftmax()

Third, let's pass the **sample_input** to the logsoftmax_layer

In [13]:
print(logsoftmax_layer(sample_input))

tensor([-2.4076, -1.4076, -0.4076])


  """Entry point for launching an IPython kernel.


Let's check the output from logsoftmax_layer matches with multiplying the output from softmax_layer with log 

In [14]:
print(nn.Softmax()(sample_input).log())

tensor([-2.4076, -1.4076, -0.4076])


  """Entry point for launching an IPython kernel.


Now, let's look at another module in the computational graph we've not seen before: **NLLLoss**

The **NLLLoss** corresponds to negative log-likelihood loss, a commonly used loss to train a classification problem with fixed number of classes (in our setting, the number of classes is 3). The documentation for NLLLoss can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html).

Like **MSLELoss**, **NLLLoss** takes the model prediction (class-wise log proababilities) and the ground truth (or target) (vector containing true class indices) and measures the degree to which the model prediction deviates from the ground truth.

The loss function for a single example (input $x_n$, target $y_n$) is given by, $l = -w_{y_n} * x_{n,y_n}$, where $y_n$ represent the target class index, $x_{y_n}$ is the input dimension corresponding to the target class index and $w_{y_n}$ represent the weight corresponding to the target class (hyperparameter, default is 1 for all the classes. for imbalanced datasets, we might need to tune these weights so that mistakes in minority class can be adequately penalized).

Let's see an example by creating the **NLLLoss** criterion:

In [15]:
criterion = nn.NLLLoss()

Let's pass the model prediction (output from logsoftmax) and ground truth (assuming the right class is 2 (0-indexed))

In [16]:
# calculate the model prediction
model_prediction = logsoftmax_layer(sample_input).unsqueeze(0) # add extra dimension to match layer syntax (check documentation)
print("model_prediction = ", model_prediction)

# store the ground truth for this example
target = torch.tensor([2], dtype=torch.long)
print("target = ", target)

# pass the model prediction and the ground truth to the loss layer to compute l
print("loss, l =", criterion(model_prediction, target))

model_prediction =  tensor([[-2.4076, -1.4076, -0.4076]])
target =  tensor([2])
loss, l = tensor(0.4076)


  


As expected, $l = -1.0 * -0.4076 = 0.4076$ 

Let us implement the single layer neural network in PyTorch.

<img src="images/sl2_textclassification_neuralnets_cg.jpg" alt="MLP" title="Single Layer neural network - Computational Graph" width="650" height="450"/>

Let us start with defining the class for the main class for the neural network, defining the layers in the network along with the forward propagation logic.


In [17]:
"""
create a custom model class inheriting torch.nn.Module
"""
class SingleLayerNeuralNetworkModel(nn.Module):
  
  def __init__(self, num_inputs, hidden_layers, num_outputs, debug=False):
    # In the constructor we define the layers for our model
    super(SingleLayerNeuralNetworkModel, self).__init__()
    self.input_to_hidden = nn.Linear(num_inputs, hidden_layers[0]) # includes W_input and bias_input 
    self.sigmoid_layer = nn.Sigmoid()  
    self.hidden_to_output = nn.Linear(hidden_layers[0], num_outputs) # includes W_hidden and bias_hidden
    self.softmax_layer = nn.LogSoftmax(dim=1)
    self.debug = debug
  
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    if self.debug:
        print("input to hidden layer input (or x) shape = ", x.size())
    out = self.input_to_hidden(x) # Wx + c
    if self.debug:
        print("input to hidden layer output shape = ", out.size())
    out = self.sigmoid_layer(out) # h = g(Wx + c)
    if self.debug:
        print("sigmoid layer output (or hidden to output layer input) shape = ", out.size())
    out = self.hidden_to_output(out) # W^T h + b
    if self.debug:
        print("logsoftmax layer input (or hidden to output layer output) shape = ", out.size())
    out = self.softmax_layer(out) # out = softmax(W^T h + b)
    if self.debug:
        print("logsoftmax layer output shape = ", out.size())
    return out

Let us define the training and evaluation logic. Most of the steps are similar to the steps we saw in the previous tutorial for regression.

In [18]:
# train logic (similar to that of linear regression model)
def train(loader):
    total_loss = 0.0
    # iterate throught the data loader
    num_batches = 0
    for batch in loader:
        # load the current batch
        batch_input, batch_output = batch

        # forward propagation
        # pass the data through the model
        model_outputs = model(batch_input)
        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()
        # compute the gradients
        cur_loss.backward()
        # update the weights
        optimizer.step()

        num_batches += 1
    return total_loss/num_batches

# evaluation logic based on classification accuracy
def evaluate(loader):
    accuracy, num_examples = 0.0, 0
    with torch.no_grad(): # impacts the autograd engine and deactivate it. reduces memory usage and speeds up computation
        for batch in loader:
            # load the current batch
            batch_input, batch_output = batch
            # forward propagation
            # pass the data through the model
            model_outputs = model(batch_input)
            # identify the predicted class for each example in the batch (row-wise)
            _, predicted = torch.max(model_outputs.data, 1) # Returns a (values, indices) 
            # compare with batch_output (gold labels) to compute accuracy
            accuracy += (predicted == batch_output).sum().item()
            num_examples += batch_output.size(0)
    return accuracy/num_examples

Let us set the important hyperparameter: number of hidden neurons (or units) in the hidden layer of the model.  

In [19]:
# hyperparameter of neural network
hidden_layers = [50]  # set number of hidden features of the first and only hidden layer to 50

Let us instantiate the model, create the criterion and optimizer.

In [20]:
# instantiate a model based on the class and move it to the right device (cpu or gpu)
model = SingleLayerNeuralNetworkModel(MAX_FEATURES, hidden_layers, NUM_CLASSES)
model.to(device)
print("Model specs: ", model) # prints the model properties

# define the loss function (last node of the graph)
criterion = nn.NLLLoss() # The negative log likelihood loss

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

Model specs:  SingleLayerNeuralNetworkModel(
  (input_to_hidden): Linear(in_features=5000, out_features=50, bias=True)
  (sigmoid_layer): Sigmoid()
  (hidden_to_output): Linear(in_features=50, out_features=3, bias=True)
  (softmax_layer): LogSoftmax()
)


Let us perform a feedforward pass of the neural network with a sample input and print layer outputs (carefully look at the shape of the intermediate outputs):

In [21]:
# toggle debug mode so that all intermediate layer outputs are printed out
model.debug = True

# create sample input and target
sample_input = torch.Tensor(1, MAX_FEATURES) # random tensor of required number of features
sample_target = torch.tensor([2], dtype=torch.long) # assuming the target class for this example is 2 (0-indexed)

# perform feedforward propagation
model_prediction = model(sample_input)
print("model prediction = ", model_prediction.data)

# calculate loss
print("loss = ", criterion(model_prediction, sample_target).data)

# turn off debug mode
model.debug = False

input to hidden layer input (or x) shape =  torch.Size([1, 5000])
input to hidden layer output shape =  torch.Size([1, 50])
sigmoid layer output (or hidden to output layer input) shape =  torch.Size([1, 50])
logsoftmax layer input (or hidden to output layer output) shape =  torch.Size([1, 3])
logsoftmax layer output shape =  torch.Size([1, 3])
model prediction =  tensor([[-1.1530, -1.0574, -1.0878]])
loss =  tensor(1.0878)


Let us start the training of our first neural network model (in addition to training loss, we will also print the training accuracy and validation accuracy): 

In [22]:
# start the training
for epoch in range(MAX_EPOCHS):
    # train the model for one pass over the data
    train_loss = train(train_loader)
    # compute the training accuracy
    train_acc = evaluate(train_loader)
    # compute the validation accuracy 
    val_acc = evaluate(valid_loader)
    # print the loss for every epoch
    
    print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f},  Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))

Epoch [1/15], Loss: 1.1222, Training Accuracy: 0.5157,  Validation Accuracy: 0.4217
Epoch [2/15], Loss: 1.0186, Training Accuracy: 0.5157,  Validation Accuracy: 0.4217
Epoch [3/15], Loss: 0.9626, Training Accuracy: 0.6305,  Validation Accuracy: 0.5033
Epoch [4/15], Loss: 0.8864, Training Accuracy: 0.5432,  Validation Accuracy: 0.4302
Epoch [5/15], Loss: 0.8240, Training Accuracy: 0.5955,  Validation Accuracy: 0.4577
Epoch [6/15], Loss: 0.7746, Training Accuracy: 0.7260,  Validation Accuracy: 0.5248
Epoch [7/15], Loss: 0.7328, Training Accuracy: 0.4887,  Validation Accuracy: 0.4212
Epoch [8/15], Loss: 0.6813, Training Accuracy: 0.7442,  Validation Accuracy: 0.4952
Epoch [9/15], Loss: 0.6452, Training Accuracy: 0.8222,  Validation Accuracy: 0.5063
Epoch [10/15], Loss: 0.5984, Training Accuracy: 0.8393,  Validation Accuracy: 0.5043
Epoch [11/15], Loss: 0.5608, Training Accuracy: 0.7920,  Validation Accuracy: 0.4777
Epoch [12/15], Loss: 0.5247, Training Accuracy: 0.8703,  Validation Accura

### Adding more hidden layers

Single layer neural network uses **one hidden layer**. A deep neural network typically stacks multiple hidden layers leading to good performance on many tasks. And the computational graph for a **2 layer neural network** for classification can look like this:

<img src="images/sl2_textclassification_neuralnets_multi_cg.jpg" alt="MLP" title="2 Layer neural network - Computational Graph" />

Let us implement this two layer neural network in PyTorch:

In [23]:
"""
create a custom model class inheriting torch.nn.Module
"""
class MultiLayerNeuralNetworkModel(nn.Module):
  
  def __init__(self, num_inputs, hidden_layers, num_outputs, debug=False):
    # In the constructor we define the layers for our model
    super(MultiLayerNeuralNetworkModel, self).__init__()
    
    self.modules = [] # stores all the layers for the neural network
    input_dim = num_inputs
    # add input layer followed by hidden layers
    for hidden_layer in hidden_layers:
      # add one layer followed by non-linearity (nn.Sigmoid)
      self.modules.append(nn.Linear(input_dim, hidden_layer))
      self.modules.append(nn.Sigmoid())
      input_dim = hidden_layer
    # add the classification module (output layer)
    self.modules.append(nn.Linear(input_dim, num_outputs))
    self.modules.append(nn.LogSoftmax(dim=1))
    
    # create the model from all the modules
    self.model = nn.Sequential(*self.modules) # container of layers, for more details: https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential
    
    # set the debug flag
    self.debug = debug
 
  def forward(self, x):
    # In the forward function we define the forward propagation logic
    if self.debug:
        out = x
        for li, layer in enumerate(self.modules[0:-2]):
            if li == 0:
                print("hidden layer 1 input (or x) shape = ", out.size())
                out = layer(out)
                print("hidden layer 1 output shape = ", out.size())
            elif li%2 == 0:
                print("hidden layer %d input shape = "%(1+(li/2)), out.size())
                out = layer(out)
                print("hidden layer %d output shape = "%(1+(li/2)), out.size())
            else:
                print("sigmoid layer %d input shape = "%(1+(li/2)), out.size())
                out = layer(out)
                print("sigmoid layer %d output shape = "%(1+(li/2)), out.size())
        print("sigmoid layer output (or hidden to output layer input) shape = ", out.size())
        out = self.modules[-2](out)
        print("logsoftmax layer input (or hidden to output layer output) shape = ", out.size())
        out = self.modules[-1](out)
        print("logsoftmax layer output shape = ", out.size())
        return out
    out = self.model(x)
    return out



Let us set the important hyperparameter: number of hidden neurons (or units) in each hidden layer of the model.  

In [24]:
# hyperparameter of neural network
hidden_layers = [50, 50]  # [num. of hidden units in first layer, num. of hidden units in second layer]

Let us instantiate the model, create the criterion and optimizer.

In [25]:
# define the loss function (last node of the graph)
model = MultiLayerNeuralNetworkModel(MAX_FEATURES, hidden_layers, NUM_CLASSES)
model.to(device)
print("Model specs: ", model)
criterion = nn.NLLLoss()

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

Model specs:  MultiLayerNeuralNetworkModel(
  (model): Sequential(
    (0): Linear(in_features=5000, out_features=50, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=50, out_features=50, bias=True)
    (3): Sigmoid()
    (4): Linear(in_features=50, out_features=3, bias=True)
    (5): LogSoftmax()
  )
)


Let us perform a feedforward pass of the neural network with a sample input and print layer outputs (carefully look at the shape of the intermediate outputs):

In [26]:
# toggle debug mode so that all intermediate layer outputs are printed out
model.debug = True

# create sample input and target
sample_input = torch.Tensor(1, MAX_FEATURES) # random tensor of required number of features
sample_target = torch.tensor([2], dtype=torch.long) # assuming the target class for this example is 2 (0-indexed)

# perform feedforward propagation
model_prediction = model(sample_input)
print("model prediction = ", model_prediction.data)

# calculate loss
print("loss = ", criterion(model_prediction, sample_target).data)

# turn off debug mode
model.debug = False

hidden layer 1 input (or x) shape =  torch.Size([1, 5000])
hidden layer 1 output shape =  torch.Size([1, 50])
sigmoid layer 1 input shape =  torch.Size([1, 50])
sigmoid layer 1 output shape =  torch.Size([1, 50])
hidden layer 2 input shape =  torch.Size([1, 50])
hidden layer 2 output shape =  torch.Size([1, 50])
sigmoid layer 2 input shape =  torch.Size([1, 50])
sigmoid layer 2 output shape =  torch.Size([1, 50])
sigmoid layer output (or hidden to output layer input) shape =  torch.Size([1, 50])
logsoftmax layer input (or hidden to output layer output) shape =  torch.Size([1, 3])
logsoftmax layer output shape =  torch.Size([1, 3])
model prediction =  tensor([[-1.2292, -1.1165, -0.9675]])
loss =  tensor(0.9675)


Let us start the training of our multilayer neural network model (in addition to training loss, we will also print the training accuracy and validation accuracy): 

In [27]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)
  # compute the training accuracy
  train_acc = evaluate(train_loader)
  # compute the validation accuracy 
  val_acc = evaluate(valid_loader)
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}, Training Accuracy: {:.4f}, Validation Accuracy: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss, train_acc, val_acc))

Epoch [1/15], Loss: 1.0212, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [2/15], Loss: 1.0044, Training Accuracy: 0.3405, Validation Accuracy: 0.3827
Epoch [3/15], Loss: 1.0032, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [4/15], Loss: 1.0043, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [5/15], Loss: 1.0019, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [6/15], Loss: 1.0019, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [7/15], Loss: 1.0002, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [8/15], Loss: 0.9952, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [9/15], Loss: 0.9736, Training Accuracy: 0.5157, Validation Accuracy: 0.4217
Epoch [10/15], Loss: 0.9281, Training Accuracy: 0.5567, Validation Accuracy: 0.4362
Epoch [11/15], Loss: 0.8724, Training Accuracy: 0.6172, Validation Accuracy: 0.4937
Epoch [12/15], Loss: 0.8304, Training Accuracy: 0.6478, Validation Accuracy: 0.5063
E

Let us shift gears and take a look at word embeddings, a key concept in building the state-of-the-art deep learning models. 

## Word Embeddings

In this tutorial, we will focus on **word embeddings (or vectors)** generated by a [word2vec model](https://code.google.com/archive/p/word2vec/). Specifically, we will focus on **continuous bag of words (CBOW)** model. At its core, word2vec relies on the hypothesis that words which occur in **similar contexts** (defined by a select set of words before and after the word in the sentences in which the target word appears) tend to have **similar meaning** and therefore must have similar embeddings. CBOW model learns word embedding (of a certain size called **embedding size** which is a hyperparameter) by setting up an auxiliary task. The auxiliary task is to predict the word (target) given the surrounding words of the target word in a sentence as input. We typically consider **window size** (another hyperparameter) words before and after the target word as surrounding words (or context).

The model architecture of CBOW based word2vec model from the [original paper](https://arxiv.org/pdf/1301.3781.pdf) is:

<img src="images/sl2_textclassification_neuralnets_cbow.png" alt="word2vec" title="CBOW model - Model Architecture" width="250" height="100" />

For instance, consider our corpus has only one sentence **"I had so much fun in my semester break"**. If the window size is set to 2, then the train data for CBOW model corresponding to the auxiliary classification task looks like this:

| example no. | train input | train label |
| ----------------- | ----------- |-------------|
| 1  | 'I', 'had', 'much', 'fun'   | 'so' |
| 2  | 'had', 'so', 'fun', 'in' | 'much' |
| 3  | 'so', 'much', 'in', 'my' | 'fun' |
| 4  | 'much', 'fun', 'my', 'semester' | 'in' |
| 5  | 'fun', 'in', 'semester', 'break' | 'my' |

Our vocabulary has 9 words and CBOW model learns a word embedding for every word in the vocabulary. These word embeddings are typically stored in a giant matrix $W_{input} \in 9\times3$, that is **vocabulary size $\times$ embedding size** (assuming embedding size is 3). Each row in this giant matrix corresponds to a unique word in the vocabulary. The feature vector for the train input is constructed by **averaging the word embeddings** corresponds to the tokens in the train input. For instance, the feature vector for the first training example is given by $\frac{1}{4} （W_{(I,:)} + W_{(had,:)} + W_{(much,:)} + W_{(fun,:)}）$. 

This input is fed to a classification layer to identify the target word. Note that the CBOW model does not have bias term in the affine transformation module (``nn.Linear``) corresponding to the classification layer in the neural networks model. The number of categories ($K$) in CBOW model is equivalent to the size of the vocabulary ($9$ in our case).

The word embeddings are **intialized** with small numbers sampled randomly **from Gaussian distribution** before training. During training, the word embeddings are jointly learned along with the parameters of the classification layer to predict the target word well given the context. Once the training is done, the word embeddings can capture some semantic features of the word. Note that post training, we do not care for the auxiliary task, whose main goal is to induce meaningful word embeddings.

Recommended reading for word embeddings: https://github.com/UBC-NLP/dlnlp2019/blob/master/slides/word2vec.pdf, https://arxiv.org/pdf/1301.3781.pdf and https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html


We will use [``torch.nn.Embedding``](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) module to simulate the giant matrix, $W_{input}$. Let us look at an example now:

In [28]:
# an Embedding module containing 9 tensors of size 3 to represent W_{input}

# Note, the parameters to Embedding class below are:
# num_embeddings (int): size of the dictionary of embeddings
# embedding_dim (int): the size of each embedding vector
# For more details on Embedding class, see: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/sparse.py
embedding = nn.Embedding(9, 3, sparse=True)

# print all the word embeddings from the lookup table
print("*" *20 ," word lookup table ", "*" *20)
print(embedding.weight.data)

********************  word lookup table  ********************
tensor([[-0.5321, -0.3469, -1.2427],
        [-0.6171, -1.3775,  0.8161],
        [ 0.5251,  3.0070,  1.4533],
        [ 0.5846, -1.0117,  1.0847],
        [ 1.0463, -1.3028, -0.9977],
        [ 0.0837, -1.3473, -1.5042],
        [-0.3642, -0.1346, -1.0112],
        [-0.8088,  0.9030,  2.2841],
        [ 1.3468,  1.1902, -2.2917]])


Let us construct the word to index and index to word mappings

In [29]:
# let us construct the word to index mapping
word2id, id2word = {}, {}
for word in 'I had so much fun in my semester break'.split():
    if word not in word2id:
        word2id[word] = len(word2id)
        id2word[word2id[word]] = word
print("*" *20 ," word2id dictionary ", "*" *20)
print(word2id)
print("*" *20 ," id2word dictionary ", "*" *20)
print(id2word)

********************  word2id dictionary  ********************
{'I': 0, 'had': 1, 'so': 2, 'much': 3, 'fun': 4, 'in': 5, 'my': 6, 'semester': 7, 'break': 8}
********************  id2word dictionary  ********************
{0: 'I', 1: 'had', 2: 'so', 3: 'much', 4: 'fun', 5: 'in', 6: 'my', 7: 'semester', 8: 'break'}


Let us print all the word embeddings from the lookup table against the corresponding word:

In [30]:
# print all the word embeddings from the lookup table against the words
print("*" *20 ," word lookup table ", "*" *20)
for wi in range(embedding.weight.data.size(0)):
    print(id2word[wi], embedding.weight.data[wi])

********************  word lookup table  ********************
I tensor([-0.5321, -0.3469, -1.2427])
had tensor([-0.6171, -1.3775,  0.8161])
so tensor([0.5251, 3.0070, 1.4533])
much tensor([ 0.5846, -1.0117,  1.0847])
fun tensor([ 1.0463, -1.3028, -0.9977])
in tensor([ 0.0837, -1.3473, -1.5042])
my tensor([-0.3642, -0.1346, -1.0112])
semester tensor([-0.8088,  0.9030,  2.2841])
break tensor([ 1.3468,  1.1902, -2.2917])


Let us create the input tensor containing indices of context words for a sample input:

In [31]:
# construct the train input for example 1
train_input = torch.LongTensor([word2id['I'], word2id['had'], word2id['much'], word2id['fun']])
print("*" *20 ," train_input ", "*" *20)
print(train_input)

********************  train_input  ********************
tensor([0, 1, 3, 4])


We then convert the input tensor of word indices to corresponding word embeddings

In [32]:
# obtain the word embeddings corresponding to the words in the train input
word_embeddings = embedding(train_input)
print("*" *20 ," word_embeddings ", "*" *20)
print(word_embeddings)
print(word_embeddings.shape)

********************  word_embeddings  ********************
tensor([[-0.5321, -0.3469, -1.2427],
        [-0.6171, -1.3775,  0.8161],
        [ 0.5846, -1.0117,  1.0847],
        [ 1.0463, -1.3028, -0.9977]], grad_fn=<EmbeddingBackward>)
torch.Size([4, 3])


To compute the input feature vector, CBOW model performs mean of all the word embedings in the context:

In [33]:
# If we want to do text classification on the sentence, we can take the mean of the vectors of all the words in the sentence,
# thus acquiring a single vector:
# construct the feature input (mean of word embeddings) to be fed to the classification module
feature_input = word_embeddings.mean(0) # column-wise, dimension-wise
print("*" *20 ," feature_input ", "*" *20)
print(feature_input)

********************  feature_input  ********************
tensor([ 0.1204, -1.0097, -0.0849], grad_fn=<MeanBackward1>)


The size of the feature vector for a training input in CBOW model is equal to the size of a word embedding (**3** in our case). The rest of the network is a classification module (hidden to output layer + logsoftmax) to predict the middle word (**so** in this case).

Thus, the computational graph for the CBOW model will look like this:

<img src="images/sl2_textclassification_word2vec.jpg" alt="MLP" title="Word2vec - Computational Graph" />

Let us train a CBOW model from scratch on our one-sentence corpus.

Before that, we will define the dataset reader for our one-sentence dataset.

In [34]:
"""
create a dataset reader
"""
class CBOWDataset(Dataset):
  """ one-sentence dataset."""
  def __init__(self, window_size=2):
    # read the corpus
    corpus = ['I had so much fun in my semester break']

    window = 2 * window_size + 1
    token2id = {}
    # populate the word to id mapping and generate train inputs/targets
    for sentence in corpus:
      tokens = sentence.strip().split()
      for token in tokens:
        if token not in token2id.keys(): # add new word
          token2id[token] = len(token2id) # new word index = length of token2id
    
      # map to index
      tokens = [token2id[token] for token in tokens]
    
      # generate all training samples for this sentence
      train_inputs, train_outputs = [], []
      for num_win in range(len(tokens) - window + 1):
        cur_tokens = tokens[num_win:num_win + window]
        tgt_token = cur_tokens[window_size]
        train_inputs.append(cur_tokens[0:window_size]+cur_tokens[window_size+1:])
        train_outputs.append(tgt_token)
    self.token2id = token2id
    
    # set the vocab. size
    self.vocab_size = len(token2id)
    
    # set the total number of training examples
    self.n = len(train_inputs)
    
    # convert features and labels to torch.tensor
    self.features = torch.LongTensor(train_inputs, device=device)
    self.labels = torch.LongTensor(train_outputs, device=device)
    
  # return input and output of a single example
  # Input: Feature vectors, where each vector corresponds to a tweet. 
  # Output: Labels, where each label is one index for each of our tags in the set {positive, negative, neutral}
  def __getitem__(self, index):
    return self.features[index], self.labels[index]
  
  # return the total number of examples
  def __len__(self):
    return self.n

Let us define the CBOW model (layers and forward propagation logic):

In [35]:
"""
create a model for CBOW
"""
class CBOWmodel(nn.Module):
  
  def __init__(self, embedding_size, vocab_size, output_size, debug=False):
    # In the constructor we define the layers for our model
    super(CBOWmodel, self).__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_size, sparse=True)
    
    self.embedding.weight.data.normal_(0.0,0.05) # mean=0.0, mu=0.05
    
    self.linear_layer = nn.Linear(embedding_size, output_size, bias=False) # the layer will not learn an additive bias
    self.softmax_layer = nn.LogSoftmax(dim=1)
    
    self.debug = debug

  def forward(self, x):
    # In the forward function we define the forward propagation logic
    if self.debug:
        print('input (word ids) shape = ', x.size())
        print('input word embeddings shape = ', self.embedding(x).size())
        print('input feature vector (mean of word embeddings) shape = ', self.embedding(x).mean(1).size())
        out = self.embedding(x).mean(1)
        print('hidden to output layer input shape = ', out.size())
        out = self.linear_layer(out)
        print('hidden to output layer output (or logsoftmax input) shape = ', out.size())
        out = self.softmax_layer(out)
        print('logsoftmax output shape = ', out.size())
        return out
    out = self.embedding(x).mean(1)
    out = self.linear_layer(out)
    out = self.softmax_layer(out)
    return out

And the hyperparameters of the CBOW model will be:

In [36]:
# hyperparameter of CBOW model
EMBEDDING_SIZE = 3 # size of the word embedding
LEARNING_RATE = 0.5 # learning rate of gradient descent
WINDOW_SIZE = 2  # number of words to be considerd before (or after) the target word for making the context
MAX_EPOCHS = 5 # number of passes over the training data

Let's instantiate an object from the CBOW dataset class and print properties of the training data.  

In [37]:
# load the dataset reader, corpus: 'I had so much fun in my semester break'
dataset = CBOWDataset(window_size = WINDOW_SIZE)
print("number of samples in the dataset:", dataset.n)
print("feature matrix:", dataset.features)
print("label matrix:", dataset.labels)

# create loader for reading the training dataset
train_loader = DataLoader(dataset=dataset, batch_size=1, shuffle=True, num_workers=1) 

number of samples in the dataset: 5
feature matrix: tensor([[0, 1, 3, 4],
        [1, 2, 4, 5],
        [2, 3, 5, 6],
        [3, 4, 6, 7],
        [4, 5, 7, 8]])
label matrix: tensor([2, 3, 4, 5, 6])


Let's create the model object, criterion and optimizer

In [38]:
# create the model object
model = CBOWmodel(EMBEDDING_SIZE, dataset.vocab_size, dataset.vocab_size)
model.to(device)
print("model specs: ", model)

# create the loss function (last node of the graph)
criterion = nn.NLLLoss()

# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

model specs:  CBOWmodel(
  (embedding): Embedding(9, 3, sparse=True)
  (linear_layer): Linear(in_features=3, out_features=9, bias=False)
  (softmax_layer): LogSoftmax()
)


Before kick-starting the training, we can see the forward propagation logic with a sample input and target (as usual, closely pay attention to the size of intermediate vectors)

In [39]:
# toggle debug mode so that all intermediate layer outputs are printed out
model.debug = True

# create sample input and target
sample_input = torch.LongTensor([word2id['I'], word2id['had'], word2id['much'], word2id['fun']]).unsqueeze(0)
sample_target = torch.LongTensor([word2id['so']]) 

# perform feedforward propagation
model_prediction = model(sample_input)
print("model prediction = ", model_prediction.data) # size is no. of classes

# calculate loss
print("loss = ", criterion(model_prediction, sample_target).data)

# turn off debug mode
model.debug = False

input (word ids) shape =  torch.Size([1, 4])
input word embeddings shape =  torch.Size([1, 4, 3])
input feature vector (mean of word embeddings) shape =  torch.Size([1, 3])
hidden to output layer input shape =  torch.Size([1, 3])
hidden to output layer output (or logsoftmax input) shape =  torch.Size([1, 9])
logsoftmax output shape =  torch.Size([1, 9])
model prediction =  tensor([[-2.1980, -2.1916, -2.1857, -2.2005, -2.1972, -2.1909, -2.2009, -2.2082,
         -2.2023]])
loss =  tensor(2.1857)


Let us kick-start training the CBOW model on our one-sentence dataset.

In [40]:
# start the training
for epoch in range(MAX_EPOCHS):
  # train the model for one pass over the data
  train_loss = train(train_loader)   
  # print the loss for every epoch
  print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, MAX_EPOCHS, train_loss))

Epoch [1/5], Loss: 2.1992
Epoch [2/5], Loss: 2.1746
Epoch [3/5], Loss: 2.1468
Epoch [4/5], Loss: 2.1166
Epoch [5/5], Loss: 2.0769


We trained CBOW model on a toyish corpus possible. In general, the model is trained on a **large corpus** of raw text and the quality of resulting word embeddings tend to get better with increase in the size of training corpus. Several researchers have released their word embeddings trained on large amounts of data. We can directly use their embeddings in the future work. For example, Google offers a [pre-trained model](https://code.google.com/archive/p/word2vec/) which was trained on Google News dataset containing about 100 billion words.

You can access the word embedding in the following way:

In [41]:
# prints the embedding of word 'I'
print("embedding for the word 'I':", model.embedding.weight.data[dataset.token2id['I']]) 

embedding for the word 'I': tensor([ 0.1686, -0.1465,  0.2619])


In [42]:
# prints the embedding of word 'semester'
print("embedding for the word 'semester':", model.embedding.weight.data[dataset.token2id['semester']]) 

embedding for the word 'semester': tensor([-0.2040, -0.4036,  0.1287])


That's it.