<h2 id="tocheading">Table of Contents</h2>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

In [2]:
import pickle
import random
import random
import spacy
import csv
import string
import os
import torch
import numpy as np
import pandas as pd
import spacy
from torch.utils.data import Dataset
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

## Part 1: Data Upload & Preprocessing
The datasets provided are already tokenized. Thus, without running the data through a tokenizer, we use pretrained word embeddings (e.g. fast-Text) to embed the tokens. 

#### Word Vectors

The web page for recommended word vector sets can be found here: https://fasttext.cc/docs/en/english-vectors.html wiki-news-300d-1M.vec from Mikolov et al (2018, Advances in Pre-Training Distributed Word Representations) 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens) is used in this assignment. 

In [3]:
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', 
                  newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        ## convert all maps to lists
        data[tokens[0]] = [*map(float, tokens[1:])]
    return data

In [4]:
## get the wiki word vectors
fname = "wiki-news-300d-1M.vec"
word_vectors = load_vectors(fname)

In [5]:
all_vocab_tokens = [*word_vectors.keys()]

In [6]:
print ("The number of unique tokens in the wiki news English vectors is " + str(len(all_vocab_tokens) ))

The number of unique tokens in the wiki news English vectors is 999994


#### Construct Table from Vocab Dict

In [7]:
word_vector_df = pd.DataFrame(word_vectors)

In [8]:
word_vector_df = word_vector_df.T

In [9]:
table_lookup = np.array(word_vector_df)

In [10]:
def index_vocab(table_df):
    
    token_array = np.array([*table_df.index])
    num_index_array = np.array([*range(table_df.shape[0])])
    
    token2id = {}
    id2token = {}
    for i in [*range(len(token_array))]:
        token2id[token_array[i]] = num_index_array[i]
        id2token[num_index_array[i]] = token_array[i]

    return token2id, id2token

In [11]:
token2id_wiki, id2token_wiki = index_vocab(word_vector_df)

__Check for table correctness!__

Do token2id and id2token match each other?

In [12]:
token2id_wiki["Alberto"]

93141

In [13]:
id2token_wiki[93141]

'Alberto'

Does the table fit the initial word vector vocab?

In [14]:
all(word_vectors["Alberto"] == table_lookup[93141])==True

True

### Part 1.1: SNLI Dataset

In [15]:
label_dict = {"entailment":1,
             "neutral":0,
             "contradiction":-1}

In [16]:
snli_train = pd.read_table("data/snli_train.tsv")
snli_val = pd.read_table("data/snli_val.tsv")

In [17]:
## get tokenized training data
snli_train["sentence1"] = snli_train["sentence1"].apply(lambda x: x.split(" "))
snli_train["sentence2"] = snli_train["sentence2"].apply(lambda x: x.split(" "))

In [18]:
## get labels
snli_train["label_num"] = snli_train["label"].apply(lambda x: label_dict[x])
snli_val["label_num"] = snli_val["label"].apply(lambda x: label_dict[x])

In [19]:
## get tokenized validation data
snli_val["sentence1"] = snli_val["sentence1"].apply(lambda x: x.split(" "))
snli_val["sentence2"] = snli_val["sentence2"].apply(lambda x: x.split(" "))

In [20]:
## get label arrays
snli_train_labels = np.array(snli_train["label_num"])
snli_val_labels = np.array(snli_val["label_num"])

In [21]:
snli_val.head(3)

Unnamed: 0,sentence1,sentence2,label,label_num
0,"[Three, women, on, a, stage, ,, one, wearing, ...","[There, are, two, women, standing, on, the, st...",contradiction,-1
1,"[Four, people, sit, on, a, subway, two, read, ...","[Multiple, people, are, on, a, subway, togethe...",entailment,1
2,"[bicycles, stationed, while, a, group, of, peo...","[People, get, together, near, a, stand, of, bi...",entailment,1


### Part 1.2: MultiNLI Dataset

In [22]:
mnli_train = pd.read_table("data/mnli_train.tsv")
mnli_val = pd.read_table("data/mnli_val.tsv")

In [23]:
mnli_train.head(3)

Unnamed: 0,sentence1,sentence2,label,genre
0,and now that was in fifty one that 's forty ye...,It was already a problem forty years ago but n...,neutral,telephone
1,Jon could smell baked bread on the air and his...,Jon smelt food in the air and was hungry .,neutral,fiction
2,it will be like Italian basketball with the uh...,This type of Italian basketball is nothing lik...,contradiction,telephone


In [24]:
mnli_train["genre"].value_counts()

telephone     4270
slate         4026
travel        3985
government    3883
fiction       3836
Name: genre, dtype: int64

In [25]:
## get tokenized training data
mnli_train["sentence1"] = mnli_train["sentence1"].apply(lambda x: x.split(" "))
mnli_train["sentence2"] = mnli_train["sentence2"].apply(lambda x: x.split(" "))

In [26]:
## get tokenized validation data
mnli_val["sentence1"] = mnli_val["sentence1"].apply(lambda x: x.split(" "))
mnli_val["sentence2"] = mnli_val["sentence2"].apply(lambda x: x.split(" "))

In [27]:
## get labels
mnli_train["label_num"] = mnli_train["label"].apply(lambda x: label_dict[x])
mnli_val["label_num"] = mnli_val["label"].apply(lambda x: label_dict[x])

Get train and val datasets for each __MNLI genre__. 

In [28]:
## telephone
mnli_train_telephone = mnli_train[mnli_train["genre"]=="telephone"]
mnli_val_telephone = mnli_val[mnli_val["genre"]=="telephone"]
## slate
mnli_train_slate = mnli_train[mnli_train["genre"]=="slate"]
mnli_val_slate = mnli_val[mnli_val["genre"]=="slate"]
## travel
mnli_train_travel = mnli_train[mnli_train["genre"]=="travel"]
mnli_val_travel = mnli_val[mnli_val["genre"]=="travel"]
## government
mnli_train_government = mnli_train[mnli_train["genre"]=="government"]
mnli_val_government = mnli_val[mnli_val["genre"]=="government"]
## fiction
mnli_train_fiction = mnli_train[mnli_train["genre"]=="fiction"]
mnli_val_fiction = mnli_val[mnli_val["genre"]=="fiction"]

In [29]:
## get label arrays for each train and val dataset

## whole MNLI dataset
mnli_train_labels = np.array(mnli_train["label_num"])
mnli_val_labels = np.array(mnli_val["label_num"])
## telephone
mnli_train_tel_labels = np.array(mnli_train_telephone["label_num"])
mnli_val_tel_labels = np.array(mnli_val_telephone["label_num"])
## slate
mnli_train_slate_labels = np.array(mnli_train_slate["label_num"])
mnli_val_slate_labels = np.array(mnli_val_slate["label_num"])
## travel
mnli_train_travel_labels = np.array(mnli_train_travel["label_num"])
mnli_val_travel_labels = np.array(mnli_val_travel["label_num"])
## gov
mnli_train_gov_labels = np.array(mnli_train_government["label_num"])
mnli_val_gov_labels = np.array(mnli_val_government["label_num"])
## fiction
mnli_train_fiction_labels = np.array(mnli_train_fiction["label_num"])
mnli_val_fiction_labels = np.array(mnli_val_fiction["label_num"])

#### Data Loaders

In [30]:
## idx = token2id_wiki

def token2index_dataset(tokens_data,idx_dict=None):
    indices_data = []
    for tokens in tokens_data:
        ## get index list for each sentence.
        index_list = [idx_dict[token] if token in \
                      idx_dict else idx_dict["unk"] for token in tokens]
        indices_data.append(index_list)
    return indices_data

__Note:__ I am getting the indices for Sentence 1 and Sentence 2 separately (not concatenating them at first from the beginning) since, in hyperparameter search I want to try more than one ways of interacting the hidden representations of the two sentences. 

In [31]:
## get train and val indices for both datasets

## SNLI
snli_train_sentence1_indices = token2index_dataset([*snli_train["sentence1"]],idx_dict=token2id_wiki)
snli_train_sentence2_indices = token2index_dataset([*snli_train["sentence2"]],idx_dict=token2id_wiki)
snli_val_sentence1_indices = token2index_dataset([*snli_val["sentence1"]],idx_dict=token2id_wiki)
snli_val_sentence2_indices = token2index_dataset([*snli_val["sentence2"]],idx_dict=token2id_wiki)

## MNLI
mnli_train_sentence1_indices = token2index_dataset([*mnli_train["sentence1"]],idx_dict=token2id_wiki)
mnli_train_sentence2_indices = token2index_dataset([*mnli_train["sentence2"]],idx_dict=token2id_wiki)
mnli_val_sentence1_indices = token2index_dataset([*mnli_val["sentence1"]],idx_dict=token2id_wiki)
mnli_val_sentence2_indices = token2index_dataset([*mnli_train["sentence2"]],idx_dict=token2id_wiki)

In [32]:
## GENRES

## telephone
mnli_train_s1_tel_ix = token2index_dataset([*mnli_train_telephone["sentence1"]],idx_dict=token2id_wiki)
mnli_train_s2_tel_ix = token2index_dataset([*mnli_train_telephone["sentence2"]],idx_dict=token2id_wiki)
mnli_val_s1_tel_ix = token2index_dataset([*mnli_val_telephone["sentence1"]],idx_dict=token2id_wiki)
mnli_val_s2_tel_ix = token2index_dataset([*mnli_val_telephone["sentence2"]],idx_dict=token2id_wiki)
## slate
mnli_train_s1_slate_ix = token2index_dataset([*mnli_train_slate["sentence1"]],idx_dict=token2id_wiki)
mnli_train_s2_slate_ix = token2index_dataset([*mnli_train_slate["sentence2"]],idx_dict=token2id_wiki)
mnli_val_s1_slate_ix = token2index_dataset([*mnli_val_slate["sentence1"]],idx_dict=token2id_wiki)
mnli_val_s2_slate_ix = token2index_dataset([*mnli_val_slate["sentence2"]],idx_dict=token2id_wiki)
## travel
mnli_train_s1_travel_ix = token2index_dataset([*mnli_train_travel["sentence1"]],idx_dict=token2id_wiki)
mnli_train_s2_travel_ix = token2index_dataset([*mnli_train_travel["sentence2"]],idx_dict=token2id_wiki)
mnli_val_s1_travel_ix = token2index_dataset([*mnli_val_travel["sentence1"]],idx_dict=token2id_wiki)
mnli_val_s2_travel_ix = token2index_dataset([*mnli_val_travel["sentence2"]],idx_dict=token2id_wiki)
## gov
mnli_train_s1_gov_ix = token2index_dataset([*mnli_train_government["sentence1"]],idx_dict=token2id_wiki)
mnli_train_s2_gov_ix = token2index_dataset([*mnli_train_government["sentence2"]],idx_dict=token2id_wiki)
mnli_val_s1_gov_ix = token2index_dataset([*mnli_val_government["sentence1"]],idx_dict=token2id_wiki)
mnli_val_s2_gov_ix = token2index_dataset([*mnli_val_government["sentence2"]],idx_dict=token2id_wiki)
## fiction
mnli_train_s1_fiction_ix = token2index_dataset([*mnli_train_fiction["sentence1"]],idx_dict=token2id_wiki)
mnli_train_s2_fiction_ix = token2index_dataset([*mnli_train_fiction["sentence2"]],idx_dict=token2id_wiki)
mnli_val_s1_fiction_ix = token2index_dataset([*mnli_val_fiction["sentence1"]],idx_dict=token2id_wiki)
mnli_val_s2_fiction_ix = token2index_dataset([*mnli_val_fiction["sentence2"]],idx_dict=token2id_wiki)

In [33]:
## get concatenated tokens
snli_train_concat_indices = [snli_train_sentence1_indices[i]+snli_train_sentence2_indices[i] for i in [*range(len(snli_train_sentence1_indices))]]
snli_val_concat_indices = [snli_val_sentence1_indices[i]+snli_val_sentence2_indices[i] for i in [*range(len(snli_val_sentence1_indices))]]

mnli_train_concat_indices = [mnli_train_sentence1_indices[i]+mnli_train_sentence2_indices[i] for i in [*range(len(mnli_train_sentence1_indices))]]
mnli_val_concat_indices = [mnli_val_sentence1_indices[i]+mnli_val_sentence2_indices[i] for i in [*range(len(mnli_val_sentence1_indices))]]

## GENRES

## telephone
mnli_train_tel_concat_indices = [mnli_train_s1_tel_ix[i]+mnli_train_s2_tel_ix[i] for i in [*range(len(mnli_train_s1_tel_ix))]]
mnli_val_tel_concat_indices = [mnli_val_s1_tel_ix[i]+mnli_val_s2_tel_ix[i] for i in [*range(len(mnli_val_s1_tel_ix))]]
## slate
mnli_train_slate_concat_indices = [mnli_train_s1_slate_ix[i]+mnli_train_s2_slate_ix[i] for i in [*range(len(mnli_train_s2_slate_ix))]]
mnli_val_slate_concat_indices = [mnli_val_s1_slate_ix[i]+mnli_val_s2_slate_ix[i] for i in [*range(len(mnli_val_s2_slate_ix))]]
## travel
mnli_train_travel_concat_indices = [mnli_train_s1_travel_ix[i]+mnli_train_s2_travel_ix[i] for i in [*range(len(mnli_train_s2_travel_ix))]]
mnli_val_travel_concat_indices = [mnli_val_s1_travel_ix[i]+mnli_val_s2_travel_ix[i] for i in [*range(len(mnli_val_s2_travel_ix))]]
## gov
mnli_train_gov_concat_indices = [mnli_train_s1_gov_ix[i]+mnli_train_s2_gov_ix[i] for i in [*range(len(mnli_train_s2_gov_ix))]]
mnli_val_gov_concat_indices = [mnli_val_s1_gov_ix[i]+mnli_val_s2_gov_ix[i] for i in [*range(len(mnli_val_s2_gov_ix))]]
## fiction
mnli_train_fiction_concat_indices = [mnli_train_s1_fiction_ix[i]+mnli_train_s2_fiction_ix[i] for i in [*range(len(mnli_train_s2_fiction_ix))]]
mnli_val_fiction_concat_indices = [mnli_val_s1_fiction_ix[i]+mnli_val_s2_fiction_ix[i] for i in [*range(len(mnli_val_s2_fiction_ix))]]


Getting training and validation set __labels__ (targets) for both datasets. 

In [34]:
## SNLI
snli_train_labels = np.array(snli_train["label_num"])
snli_val_labels = np.array(snli_val["label_num"])

## MNLI
mnli_train_labels = np.array(mnli_train["label_num"])
mnli_val_labels = np.array(mnli_val["label_num"])

## GENRES

## telephone
mnli_train_tel_labels = np.array(mnli_train_telephone["label_num"])
mnli_val_tel_labels = np.array(mnli_val_telephone["label_num"])
## slate
mnli_train_slate_labels = np.array(mnli_train_slate["label_num"])
mnli_val_slate_labels = np.array(mnli_val_slate["label_num"])
## travel
mnli_train_travel_labels = np.array(mnli_train_travel["label_num"])
mnli_val_travel_labels = np.array(mnli_val_travel["label_num"])
## gov
mnli_train_gov_labels = np.array(mnli_train_government["label_num"])
mnli_val_gov_labels = np.array(mnli_val_government["label_num"])
## fiction
mnli_train_fiction_labels = np.array(mnli_train_fiction["label_num"])
mnli_val_fiction_labels = np.array(mnli_val_fiction["label_num"])

Function to get pretrained word embeddings from the table

In [54]:
table_lookup.shape

(999994, 300)

In [37]:
def sentence_embed(dataset_indices=None,
                            method="concat"):
    
    """Takes as input dataset indices and uses the pretrained 
    word embedding matrix to look up the vector representation for each word,
    then interacts the two sentences according to the specified method.
    For ex: dataset_indices = snli_train_concat_indices
            method = 'concat' 
            
    Returns a numpy array of embeddings
            
    To be used instead of nn.Embedding in the model class"""
    
    sentence_matrixrep = []
    
    for i in [*range(len(dataset_indices))]:
        sentence_rep = [table_lookup[x] if x != "-" else None for x in dataset_indices[i]]
        sentence_matrixrep.append(sentence_rep)
            
    embeddings = np.array([np.array(n) for n in sentence_matrixrep])
    
    return embeddings

In [38]:
sentence_embed(snli_train_concat_indices[:1])

array([[[-0.0595, -0.0428,  0.06  , ...,  0.0235,  0.215 ,  0.0581],
        [ 0.038 , -0.0449, -0.0897, ...,  0.0795,  0.0255, -0.027 ],
        [-0.0433, -0.0663, -0.0548, ...,  0.0326,  0.2213,  0.0096],
        ...,
        [ 0.0897,  0.016 , -0.0571, ...,  0.1559, -0.0254, -0.0259],
        [ 0.0324,  0.1127,  0.0299, ...,  0.0468,  0.1144, -0.0125],
        [ 0.0004,  0.0032, -0.0204, ...,  0.207 ,  0.0689, -0.0467]]])

In [58]:
## code taken from lab3

## SNLI
MAX_SENTENCE_LENGTH = 200
snli_train_targets = snli_train_labels
snli_val_targets = snli_val_labels

from torch.utils.data import Dataset

class SNLI_Dataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    
    def __init__(self, data_list, target_list):
        """
        @param data_list: list of newsgroup tokens 
        @param target_list: list of newsgroup targets 

        """
        self.data_list = data_list
        self.target_list = target_list
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        
        token_idx = self.data_list[key][:MAX_SENTENCE_LENGTH]
        label = self.target_list[key]
        return [token_idx, len(token_idx), label]

def snli_func(batch):
    """
    Customized function for DataLoader that dynamically pads the batch so that all 
    data have the same length
    """
    data_list = []
    label_list = []
    length_list = []

    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
#         data_list.append(datum[0])
        
    # padding
    for datum in batch:
        padded_vec = np.pad(np.array(datum[0]), 
                                pad_width=((0,MAX_SENTENCE_LENGTH-datum[1])), 
                                mode="constant", constant_values=0)
        data_list.append(padded_vec)
        
    return [torch.from_numpy(np.array(data_list)), 
            torch.LongTensor(length_list), 
            torch.LongTensor(label_list)]

# create pytorch dataloader

BATCH_SIZE = 32
snli_train_dataset = SNLI_Dataset(snli_train_concat_indices,snli_train_labels)
snli_train_loader = torch.utils.data.DataLoader(dataset=snli_train_dataset,
                                               batch_size=BATCH_SIZE,
                                               collate_fn=snli_func,
                                               shuffle=True)

snli_val_dataset = SNLI_Dataset(snli_val_concat_indices, snli_val_labels)
snli_val_loader = torch.utils.data.DataLoader(dataset=snli_val_dataset,
                                             batch_size=BATCH_SIZE,
                                             collate_fn=snli_func,
                                             shuffle=True)

In [59]:
[*snli_val_loader][1]

[tensor([[ 76579, 816560, 683763,  ...,      0,      0,      0],
         [662682, 937843, 802154,  ...,      0,      0,      0],
         [ 76579, 980614, 773898,  ...,      0,      0,      0],
         ...,
         [ 76579, 822411, 794698,  ...,      0,      0,      0],
         [ 76579, 691841, 739190,  ...,      0,      0,      0],
         [ 76579, 822411, 672856,  ...,      0,      0,      0]]),
 tensor([23, 28, 27, 27, 26, 22, 27, 22, 19, 23, 16, 21, 18, 21, 13, 63, 14, 33,
         21, 18, 18, 12, 26, 20, 17, 21, 30, 30, 16, 36, 18, 17]),
 tensor([ 1, -1,  1,  0,  0,  1, -1,  0, -1, -1, -1, -1, -1,  0, -1, -1,  1,  0,
          0, -1,  1,  1,  1,  0, -1,  0,  1,  0, -1,  0, -1,  0])]

In [45]:
## TODO

## code taken from lab3
## mnli
MAX_SENTENCE_LENGTH = 200
mnli_train_targets = mnli_train_labels
mnli_val_targets = mnli_val_labels

from torch.utils.data import Dataset

class MNLI_Dataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    
    def __init__(self, data_list, target_list):
        """
        @param data_list: list of newsgroup tokens 
        @param target_list: list of newsgroup targets 

        """
        self.data_list = data_list
        self.target_list = target_list
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        
        token_idx = self.data_list[key][:MAX_SENTENCE_LENGTH]
        label = self.target_list[key]
        return [token_idx, len(token_idx), label]

def mnli_func(batch):
    """
    Customized function for DataLoader that dynamically pads the batch so that all 
    data have the same length
    """
    data_list = []
    label_list = []
    length_list = []

    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
        
    # padding
    for datum in batch:
        padded_vec = np.pad(np.array(datum[0]), 
                                pad_width=((0,MAX_SENTENCE_LENGTH-datum[1])), 
                                mode="constant", constant_values=0)
            
        data_list.append(padded_vec)
        
        
        
    return [torch.from_numpy(np.array(data_list)), 
            torch.LongTensor(length_list), 
            torch.LongTensor(label_list)]

# create pytorch dataloader

BATCH_SIZE = 32
mnli_train_dataset = MNLI_Dataset(mnli_train_concat_indices,mnli_train_labels)
mnli_train_loader = torch.utils.data.DataLoader(dataset=mnli_train_dataset,
                                               batch_size=BATCH_SIZE,
                                               collate_fn=mnli_func,
                                               shuffle=True)

mnli_val_dataset = MNLI_Dataset(mnli_val_concat_indices, mnli_val_labels)
mnli_val_loader = torch.utils.data.DataLoader(dataset=mnli_val_dataset,
                                             batch_size=BATCH_SIZE,
                                             collate_fn=mnli_func,
                                             shuffle=True)

In [46]:
[*mnli_train_loader][29]

[tensor([[961193, 578022,   2924,  ...,      0,      0,      0],
         [187482, 944934, 512775,  ...,      0,      0,      0],
         [255405, 898138,   2924,  ...,      0,      0,      0],
         ...,
         [562796, 723455, 704565,  ...,      0,      0,      0],
         [ 88713, 662682, 868293,  ...,      0,      0,      0],
         [190300, 853891, 316602,  ...,      0,      0,      0]]),
 tensor([ 30,  12,  32,  28,  48,  15,  16,  50, 132,  23,  23,  54,  44,  24,
          36,  82,  20,  49,  28,  28,  44,  16,  23,  21,  23,  41,  21,  19,
          32,  34,  30,  36]),
 tensor([-1, -1, -1,  0,  1,  1,  1,  0, -1,  0, -1,  1, -1,  0,  1,  0,  0,  0,
          1, -1, -1,  0,  1, -1, -1,  0,  1,  0, -1, -1,  1, -1])]

In [None]:
def sentence_embed(dataset_indices=None,
                            method="concat"):
    
    """Takes as input dataset indices and uses the pretrained 
    word embedding matrix to look up the vector representation for each word,
    then interacts the two sentences according to the specified method.
    For ex: dataset_indices = snli_train_concat_indices
            method = 'concat' 
            
    Returns a numpy array of embeddings
            
    To be used instead of nn.Embedding in the model class"""
    
    sentence_matrixrep = []
    
    for i in [*range(len(dataset_indices))]:
        sentence_rep = [table_lookup[x] if x != "-" else None for x in dataset_indices[i]]
        sentence_matrixrep.append(sentence_rep)
            
    embeddings = np.array([np.array(n) for n in sentence_matrixrep])
    
    return embeddings

## Part 2: Model

The model is trained on SNLI training set. The best model is chosen using SNLI validation set, then the best model is evaluated on each genre in MultiNLI validation set. 

We will use an encoder (either a CNN or an RNN) to map each string of text (hypothesis and premise) to a fixed-dimension vector representation. We will interact the two hidden representations and output a 3-class soft- max. (To keep things simple, we will simply concatenate the two repre- sentations, and feed them through a network of 2 fully-connected layers.) For the encoder, we want the following:

### Part 2.1: CNN
For the CNN, a 2-layer 1-D convolutional network with ReLU activations will suffice. We can perform a max-pool at the end to compress the hidden representation into a single vector.

In [None]:
#  # FloatTensor containing pretrained weights
# >> weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
# >> embedding = nn.Embedding.from_pretrained(weight)
# >> # Get embeddings for index 1
# >> input = torch.LongTensor([1])
# >> embedding(input)

In [611]:
train_dataset = [*snli_train_dataset]

class CNN(nn.Module):
    
    def __init__(self,emb_size,
                 hidden_size, 
                 num_layers, 
                 num_classes,
                vocab_size):

        super(CNN, self).__init__()

        self.num_layers, self.hidden_size = num_layers, hidden_size
#         self.emb_size = emb_size
        ## use pretrained wiki embeddings
        wiki_embed_table = torch.tensor(table_lookup)
        embedding = nn.Embedding.from_pretrained(wiki_embed_table)
        self.embedding = embedding
    
        self.conv1 = nn.Conv1d(emb_size, hidden_size, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)

        self.linear = nn.Linear(hidden_size, num_classes)
        

    def forward(self, x, lengths):
        batch_size, seq_len = x.size()
        ## initialize batch embeddings
        embeds = []
        ## get embeddings for each sentence couple in batch
        ## exclude zero-pads at the end, if you don't slice by length 
        ## the model will get table_lookup[0] for 0-pads
        MAX_SENTENCE_LENGTH = 200
        for arr in [*range(len(x))]: 
            input = torch.LongTensor(x[arr][:int(lengths[arr])])
#             embed = embedding(torch.LongTensor(input))
            embed = embedding(input)
            ## pad again
            length_to_pad = MAX_SENTENCE_LENGTH - embed.size()[0]
            embed = np.vstack((embed.numpy(),np.zeros((length_to_pad,300))))
            embeds.append(embed)
            
#         print ("embeds-"+str(embeds))
        embeds = torch.FloatTensor(embeds)
#         print (embeds.size()) # 32
#         hidden = self.conv1(embed.transpose(1,-2)).transpose(1,-2)
        hidden = self.conv1(embeds)


        print ("hidden = "+str(hidden))
        hidden = F.relu(hidden.contiguous().view(-1, hidden.size(-1))).view(batch_size, seq_len, hidden.size(-1))

        hidden = self.conv2(hidden.transpose(-2,1)).transpose(-2,1)
        hidden = F.relu(hidden.contiguous().view(-1, hidden.size(-1))).view(batch_size, seq_len, hidden.size(-1))

        hidden = torch.sum(hidden, dim=1)
        logits = self.linear(hidden)
        return logits

In [612]:
model = CNN(emb_size=300, hidden_size=200, num_layers=2, num_classes=3, vocab_size=wiki_embed_table.size()[0])

In [613]:
learning_rate = 3e-4
num_epochs = 10 # number epoch to train

In [614]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [615]:
def test_model(loader, model):
    """
    Help function that tests the model's performance on a dataset
    @param: loader - data loader for the dataset to test against
    """
    correct = 0
    total = 0
    model.eval()
    for data, lengths, labels in loader:
        data_batch, lengths_batch, label_batch = data, lengths, labels
        outputs = F.softmax(model(data_batch, lengths_batch), dim=1)
        predicted = outputs.max(1, keepdim=True)[1]

        total += labels.size(0)
        correct += predicted.eq(labels.view_as(predicted)).sum().item()
    return (100 * correct / total)

total_step = len(snli_train_loader)
num_epochs = 10
for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(snli_train_loader):
        model.train()
        optimizer.zero_grad()
        # Forward pass
        outputs = model(data, lengths)
        loss = criterion(outputs, labels)

        # Backward and optimize
        loss.backward()
        optimizer.step()
        # validate every 100 iterations
        if i > 0 and i % 100 == 0:
            # validate
            val_acc = test_model(snli_val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
                       epoch+1, num_epochs, i+1, len(train_loader), val_acc))



embeds-[array([[-0.038 , -0.0278, -0.0227, ..., -0.1878,  0.3144,  0.1036],
       [-0.0965, -0.0777,  0.0006, ...,  0.0649,  0.1015, -0.0266],
       [-0.007 ,  0.0228, -0.1499, ...,  0.1013, -0.0968,  0.0194],
       ...,
       [ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ],
       [ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ],
       [ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ]]), array([[-0.038 , -0.0278, -0.0227, ..., -0.1878,  0.3144,  0.1036],
       [-0.0126,  0.0669, -0.1186, ...,  0.1152,  0.0349,  0.0512],
       [-0.023 ,  0.0005, -0.042 , ...,  0.1031, -0.0753,  0.0264],
       ...,
       [ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ],
       [ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ],
       [ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ]]), array([[-0.0595, -0.0428,  0.06  , ...,  0.0235,  0.215 ,  0.0581],
       [ 0.0751, -0.0591, -0.0102, ...,  0.0501,  0.133 ,  0.0352],
       [ 0.0

RuntimeError: Given groups=1, weight of size [200, 300, 3], expected input[32, 200, 300] to have 300 channels, but got 200 channels instead

#### Hyperparameter Search

The hyperparameters included in the hyperparameter search space are;

- The size of the hidden dimension of the CNN and RNN,
- The kernel size of the CNN,
- Experiment with different ways of interacting the two encoded sentences (concatenation, element-wise multiplication, outer multiplication etc)
- Regularization (e.g. weight decay, dropout).


In [278]:
table_lookup[31]

array([-5.220e-02, -4.900e-03, -1.430e-02, -1.240e-02,  2.240e-02,
        2.650e-02,  3.730e-02, -2.200e-03,  1.800e-03, -5.170e-02,
       -1.330e-02,  1.240e-02, -1.140e-02, -8.270e-02,  4.040e-02,
        2.300e-02,  1.800e-02, -1.860e-02,  6.670e-02,  3.400e-03,
       -5.980e-02, -1.340e-02, -4.430e-02, -1.070e-01, -1.950e-02,
       -3.540e-02,  5.200e-03,  6.180e-02,  2.080e-02, -9.700e-02,
       -2.410e-02,  1.500e-03, -5.690e-02, -2.290e-02, -2.210e-02,
       -2.570e-02, -2.400e-02, -7.340e-02,  3.800e-02,  2.900e-02,
        2.990e-02,  1.250e-02, -4.870e-02,  1.770e-02,  5.840e-02,
       -4.600e-02, -2.480e-02, -1.300e-02, -2.020e-02,  2.240e-02,
       -4.240e-02, -3.190e-02, -8.868e-01,  1.100e-03, -4.740e-02,
        7.310e-02, -2.340e-02, -7.280e-02, -3.210e-02,  1.690e-02,
       -4.500e-03,  4.000e-04, -2.086e-01,  1.090e-02,  1.060e-02,
       -1.980e-02, -1.280e-02, -1.250e-02, -1.310e-02,  3.030e-02,
       -4.600e-02, -6.100e-03,  3.830e-02, -2.110e-02,  5.500e

### Part 2.2 RNN
For the RNN, a single-layer, bi-directional GRU will suffice. We can take the last hidden state as the encoder output. (In the case of bi-directional, the last of each direction, although PyTorch takes care of this.)


#### Hyperparameter Search

The hyperparameters included in the hyperparameter search space are;

- The size of the hidden dimension of the CNN and RNN,
- Experiment with different ways of interacting the two encoded sentences (concatenation, element-wise multiplication, outer multiplication etc)
- Regularization (e.g. weight decay, dropout).