# PS3: Neural Networks for Classification and Natural Language Inference

In [1]:
import json
import csv
import os
import glob

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F

from sklearn.metrics import f1_score, precision_score, recall_score

import numpy as np

The purpose of this problem set is to gain an understanding of how neural networks are trained. You will also learn about the PyTorch framework.

## Submission Instructions

After completing the exercises below, generate a PDF of the code **with** outputs. Then create a zip file containing both the completed exercise and the generated PDF. You are **required** to check the PDF to make sure all the code **and** outputs are clearly visible and easy to read. If your code runs off the page, shorten the lines. I generally recommend no more than 80 characters per line.

Finally, name the zip file using a combination of the assignment and your name, e.g., ps3_rios.zip

## PART I: Data Cleaning (10 points)

Load the "surnames.csv" file to train an LSTM that predicts nationality from a surname. You will need to transform the data from a list of strings into a list of character indexes. For example, the following data

```
Anthony
John
David
```

should be transformed into a list of lists.

```
[[0, 1, 2, 3, 4, 1, 5],
 [6, 4, 3, 1],
 [7, 8, 9, 10, 11]]
```

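Next, you will need to zero-pad all examples to the same length. Note that this example reuses 0 for both `A` and the padding value; the code below avoids the ambiguity by reserving index 0 for a `<PAD>` token.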

```
[[0, 1, 2, 3, 4, 1, 5],
 [6, 4, 3, 1, 0, 0, 0],
 [7, 8, 9, 10, 11, 0, 0]]
```

Finally, everything will be converted into numpy arrays.
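
The two steps above can be sketched in plain Python as follows. This is only an illustration, not the required solution: the helper names `build_char_index`, `encode` and `pad` are hypothetical, and the sketch assumes that index 0 is reserved for `<PAD>`, so the indexes differ from the example above by one.

```python
PAD = '<PAD>'


def build_char_index(names):
    """Give each character an index in order of first appearance; 0 is PAD."""
    char2index = {PAD: 0}
    for name in names:
        for char in name:
            if char not in char2index:
                char2index[char] = len(char2index)
    return char2index


def encode(names, char2index):
    """Map every name to a list of indexes; unseen characters fall back to 0."""
    return [[char2index.get(char, 0) for char in name] for name in names]


def pad(sequences, pad_index=0):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (width - len(seq)) for seq in sequences]


names = ['Anthony', 'John', 'David']
char2index = build_char_index(names)
encoded = encode(names, char2index)
padded = pad(encoded)
print(padded)
# [[1, 2, 3, 4, 5, 2, 6], [7, 5, 4, 2, 0, 0, 0], [8, 9, 10, 11, 12, 0, 0]]
```

Keeping the original lengths (7, 4 and 5 here) next to the padded rows is what later lets the LSTM skip the padding.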

In [27]:
char2index = {'<PAD>': 0}
index2char = {0: '<PAD>'}
class2index = {} # stores the class index pairs.
index2class = {}
doc_lengths = [] # Stores the lengths of all docs (train, test and dev)
X_train = [] # stores the character indexes of each training name
y_train = [] # stores an index to the correct class
X_dev = []
y_dev = []
X_test = []
y_test = []
X_train_len = [] # Stores the length of each training name
X_test_len = [] # ... length of each test name
X_dev_len = [] # ... length of each dev name

# Write code to load data here.
dataset_filename = 'surnames.csv'
datadir = 'data'
dataset_path = os.path.join(datadir, dataset_filename)
data_names = ['X_train', 'X_dev', 'X_test']

def access_data(path):
    """Read the CSV and group names and labels by split (train/dev/test)."""
    ret = {'X_train':[], 'y_train':[], 'X_dev':[], 'y_dev':[], 'X_test':[], 'y_test':[]}
    with open(path, 'r', encoding='utf-8') as f:
        reader = csv.reader(f, dialect='excel')
        for row in reader:
            # Column 0 is the split, column 1 the surname, column 2 the class.
            split = row[0]
            if split in ('train', 'dev', 'test'):
                ret['X_' + split].append(row[1])
                ret['y_' + split].append(row[2])
    return ret
    
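def load_data(path):
    """Load the splits and map every class name to an integer index."""
    results = access_data(path)

    # Sort the class names so the mapping is the same on every run.
    class_names = sorted(set(results['y_train'] + results['y_dev'] + results['y_test']))
    class2index = {class_name: ind for ind, class_name in enumerate(class_names)}
    index2class = {ind: class_name for class_name, ind in class2index.items()}

    # Replace the string labels with class indexes so they can become tensors.
    for split in ('y_train', 'y_dev', 'y_test'):
        results[split] = [class2index[label] for label in results[split]]
    return results, class2index, index2class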
    
    
def update_char_maps(X_train, char2index, index2char):
    chars = set([char for name in X_train for char in name])
    for char in chars:
        char2index[char] = len(char2index)
        index2char[len(index2char)] = char
        
def convert_to_index_map(names, char2index):
    index_mappings = []
    for name in names:
        index_map = [char2index[char] if char in char2index else 0 for char in name]
        index_mappings.append(index_map)
    return index_mappings
            
    
    
    
data, class2index, index2class = load_data(dataset_path)
X_train, y_train = data['X_train'], data['y_train']
X_dev, y_dev = data['X_dev'], data['y_dev']
X_test, y_test = data['X_test'], data['y_test']

# Record the length of every name, per split and overall.
X_train_len = [len(x) for x in X_train]
X_dev_len = [len(x) for x in X_dev]
X_test_len = [len(x) for x in X_test]
doc_lengths = X_train_len + X_dev_len + X_test_len

# The character vocabulary is built from the training names only.
update_char_maps(X_train, char2index, index2char)
X_train_nums = convert_to_index_map(X_train, char2index)
X_dev_nums = convert_to_index_map(X_dev, char2index)
X_test_nums = convert_to_index_map(X_test, char2index)

In [28]:
print(len(X_dev))
print(len(X_dev_len))
print(X_train_nums[0])
print(X_train[1])
print(X_train_nums[1])

3060
3060
[55, 64, 80, 64, 80]
Prikazchikov
[4, 11, 3, 15, 64, 25, 49, 82, 3, 15, 35, 42]


In [29]:
# PADDING

# Longest name across train, dev and test.
max_seq_len = max(max(X_train_len), max(X_dev_len), max(X_test_len))
# Every split is padded (or truncated) to the longest training name.
len_to_pad = max(len(x) for x in X_train_nums)
print('longest in training set:', len_to_pad)

def pad_example(ex, len_to_pad, pad):
    """Truncate ex to len_to_pad items, then right-pad it with pad."""
    return ex[:len_to_pad] + [pad] * (len_to_pad - len(ex))

# Pad all three splits. Padding only the training set leaves X_dev and
# X_test empty, which is what raised the IndexError when sorting them.
pad_idx = char2index['<PAD>']
X_train_eq_size = [pad_example(x, len_to_pad, pad_idx) for x in X_train_nums]
X_dev_eq_size = [pad_example(x, len_to_pad, pad_idx) for x in X_dev_nums]
X_test_eq_size = [pad_example(x, len_to_pad, pad_idx) for x in X_test_nums]

X_train = np.array(X_train_eq_size)
X_dev = np.array(X_dev_eq_size)
X_test = np.array(X_test_eq_size)

y_train = np.array(y_train)
y_dev = np.array(y_dev)
y_test = np.array(y_test)

# Clip the lengths so they never exceed the padded width.
X_train_len = np.minimum(np.array(X_train_len), len_to_pad)
X_dev_len = np.minimum(np.array(X_dev_len), len_to_pad)
X_test_len = np.minimum(np.array(X_test_len), len_to_pad)

# Sort dev and test by decreasing length, as pack_padded_sequence expects.
idx = np.argsort(X_dev_len)[::-1]
X_dev = X_dev[idx]
y_dev = y_dev[idx]
X_dev_len = X_dev_len[idx]

idx = np.argsort(X_test_len)[::-1]
X_test = X_test[idx]
y_test = y_test[idx]
X_test_len = X_test_len[idx]

doc_lengths = np.concatenate([X_train_len, X_dev_len, X_test_len])
print(X_train.shape)

longest in training set: 20
(15000, 20)

## PART II: Classification (25 points)

In [12]:
class LSTM(nn.Module):
    def __init__(self, nb_layers, word2index, class2index, nb_lstm_units=100,
                 embedding_dim=3, batch_size=3):
        super(LSTM, self).__init__()
        self.vocab = word2index
        self.tags = class2index

        self.nb_layers = nb_layers
        self.nb_lstm_units = nb_lstm_units
        self.embedding_dim = embedding_dim
        self.batch_size = batch_size

        self.nb_tags = len(self.tags)

        # build actual NN
        self.__build_model()

    def __build_model(self):
        nb_vocab_words = len(self.vocab)

        padding_idx = self.vocab['<PAD>']
        self.word_embedding = nn.Embedding(
            num_embeddings=nb_vocab_words,
            embedding_dim=self.embedding_dim,
            padding_idx=padding_idx
        )

        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.nb_lstm_units,
            num_layers=self.nb_layers,
            batch_first=True
        )

        self.hidden_to_tag = nn.Linear(self.nb_lstm_units*self.nb_layers, self.nb_tags)
        
        self.logsoftmax = nn.LogSoftmax(dim=1)
        self.softmax = nn.Softmax(dim=1)
        self.inference = False

    def init_hidden(self, X):
        # Initial ht (hidden state) and ct (cell state), one per layer and example
        h0 = torch.zeros(self.nb_layers, X.size(0), self.nb_lstm_units).float()
        c0 = torch.zeros(self.nb_layers, X.size(0), self.nb_lstm_units).float()
        return (h0,c0)

    def forward(self, X, X_lengths):
        # reset the LSTM hidden state. Must be done before you run a new batch.
        # Otherwise the LSTM will treat
        # a new batch as a continuation of a sequence
        self.hidden = self.init_hidden(X)

        batch_size, seq_len = X.shape
        
        # ---------------------
        # 1. embed the input
        # Dim transformation: (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
        X = self.word_embedding(X)

        # ---------------------
        # 2. Run through RNN
        # Dim transformation: (batch_size, seq_len, embedding_dim) -> (batch_size, seq_len, nb_lstm_units)

        # pack_padded_sequence so that padded items in the sequence won't be shown to the LSTM.
        # X_lengths must be sorted in decreasing order (the default enforce_sorted=True).
        X = torch.nn.utils.rnn.pack_padded_sequence(X, X_lengths, batch_first=True)

        # now run through LSTM
        # X is the packed output sequence; ht holds the final hidden state of each
        # layer for every example, with shape (nb_layers, batch_size, nb_lstm_units).
        X, (ht, ct) = self.lstm(X, self.hidden)

        # Concatenate the final state of every layer for each example:
        # (nb_layers, batch_size, nb_lstm_units) -> (batch_size, nb_layers * nb_lstm_units).
        # The batch axis must come first; a plain view would mix examples when nb_layers > 1.
        out = ht.permute(1, 0, 2).reshape(batch_size, self.nb_lstm_units * self.nb_layers)

        # pass final states to output layer
        out = self.hidden_to_tag(out)
        
        # Use logsoftmax for training and softmax for testing
        if not self.inference:
            Y_hat = self.logsoftmax(out)
        else:
            Y_hat = self.softmax(out)

        return Y_hat

In [14]:
epochs = 25
batch_size = 16
lstm_unit_size = 512
embedding_size = 512
print_iter = 128
num_layers = 1

m = LSTM(num_layers, char2index, class2index, nb_lstm_units=lstm_unit_size,
         embedding_dim=embedding_size, batch_size=batch_size)

# Sum (rather than average) the per-example losses in each batch.
criterion = nn.NLLLoss(reduction='sum')
optim = torch.optim.Adam(m.parameters(), lr=0.001)

# Convert the dev set once, outside the training loop.
X_dev_t = torch.tensor(X_dev).long()
X_dev_len_t = torch.tensor(X_dev_len).long()

shuffle_idx = np.arange(X_train.shape[0])

for epoch in range(epochs):

    # Shuffle inputs, labels and lengths with the same permutation,
    # leaving the original arrays untouched.
    np.random.shuffle(shuffle_idx)
    x_shuf = X_train[shuffle_idx]
    y_shuf = y_train[shuffle_idx]
    len_shuf = X_train_len[shuffle_idx]

    current_batch = 0
    for iteration in range(y_shuf.shape[0] // batch_size):
        batch = slice(current_batch, current_batch + batch_size)
        current_batch += batch_size

        # pack_padded_sequence expects each batch sorted by decreasing length.
        order = np.argsort(len_shuf[batch])[::-1]
        batch_lengths = torch.tensor(len_shuf[batch][order]).long()
        batch_x = torch.tensor(x_shuf[batch][order]).long()
        batch_y = torch.tensor(y_shuf[batch][order]).long()

        # Train on log-probabilities, which is what NLLLoss expects.
        m.train(True)
        m.inference = False
        optim.zero_grad()
        batch_pred = m(batch_x, batch_lengths)
        loss = criterion(batch_pred, batch_y)
        loss.backward()
        optim.step()

        if iteration % print_iter == 0:
            with torch.no_grad():
                m.train(False)
                m.inference = True
                dev_pred = m(X_dev_t, X_dev_len_t).argmax(dim=1).numpy()
                f1 = f1_score(y_dev, dev_pred, average='micro')
                #precision = precision_score(y_dev, dev_pred, average='micro')
                #recall = recall_score(y_dev, dev_pred, average='micro')
                print(loss.item(), '\titeration:', iteration,
                      '\tepoch', epoch, 'f1', f1)

Answer the following questions below:

1. What were the micro and macro F1 scores on the test and dev sets?
2. Implement a bidirectional LSTM model. You will need to modify the hidden states and the `self.lstm` variable. Does it work better?
3. Experiment with various hyperparameters (hidden state size, learning rate, etc.). Which hyperparameters result in the best performance?
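
For question 1, it helps to see how the two averaging modes differ. The from-scratch sketch below is only meant to show what is being computed; in the notebook itself you would normally call `f1_score` with `average='micro'` or `average='macro'`. The function name `f1_scores` and the toy labels are made up for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # pred was predicted but is wrong
            fn[gold] += 1  # gold was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data: class 0 is frequent and easy, class 1 is always missed.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.833 0.556
```

Micro F1 is dominated by the frequent class, while macro F1 drops sharply because the rare class 1 is never predicted. With imbalanced nationalities, the two numbers can therefore tell quite different stories.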

## PART III: Natural Language Inference (25 points)

Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise" [1, 2]. Models trained for this task have been shown to work well for zero-shot classification [3].

Example:

| Premise | Label | Hypothesis |
| ------- | ----- | ---------- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |

Your task is to load and train a model on the "multinli_1.0_train.jsonl" dataset and evaluate on "multinli_1.0_dev_matched.jsonl" using accuracy.
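
A possible starting point for loading the data is sketched below. It assumes each line of the JSONL file is a JSON object with the fields `sentence1`, `sentence2` and `gold_label`, and that pairs without annotator agreement carry the label `-`; check both assumptions against the actual file. The function name `parse_nli_lines`, the label-to-index mapping and the last sample record are illustrative only.

```python
import json

# Assumed label set; the index assigned to each label is arbitrary.
LABELS = {'entailment': 0, 'neutral': 1, 'contradiction': 2}


def parse_nli_lines(lines):
    """Turn JSONL lines into (premise, hypothesis, label_index) triples."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        label = record.get('gold_label')
        if label not in LABELS:  # skip pairs without a usable label
            continue
        examples.append(
            (record['sentence1'], record['sentence2'], LABELS[label]))
    return examples


# In-memory stand-in for a few lines of the file.
sample = [
    json.dumps({'sentence1': 'A soccer game with multiple males playing.',
                'sentence2': 'Some men are playing a sport.',
                'gold_label': 'entailment'}),
    json.dumps({'sentence1': 'A man inspects the uniform of a figure in '
                             'some East Asian country.',
                'sentence2': 'The man is sleeping.',
                'gold_label': 'contradiction'}),
    json.dumps({'sentence1': 'A made-up premise.',
                'sentence2': 'A made-up hypothesis.',
                'gold_label': '-'}),
]
examples = parse_nli_lines(sample)
print(len(examples), [label for _, _, label in examples])  # 2 [0, 2]
```

Because the function only iterates over lines, an open file handle can be passed in place of `sample`.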

I am leaving this task relatively open. One solution is to modify the LSTM code above to pass the two documents through an LSTM model and return the last hidden state for each. Next, concatenate the two vectors and pass the result through a softmax layer. Finally, train using the same formulation as in Part II.
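
One way the suggested architecture could look is sketched below. It is a minimal sketch under several assumptions, not the expected solution: the class name `PairEncoderClassifier` and all sizes are hypothetical, premise and hypothesis share a single LSTM (separate encoders would work too), and `enforce_sorted=False` is used so that neither side has to be sorted by length.

```python
import torch
import torch.nn as nn


class PairEncoderClassifier(nn.Module):
    """Encode premise and hypothesis with one shared LSTM, then classify."""

    def __init__(self, vocab_size, embedding_dim=32, hidden_size=64,
                 num_classes=3, padding_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim,
                                      padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        # The output layer sees both final hidden states side by side.
        self.output = nn.Linear(2 * hidden_size, num_classes)

    def encode(self, tokens, lengths):
        embedded = self.embedding(tokens)
        # enforce_sorted=False keeps the original batch order, so
        # premise i still lines up with hypothesis i.
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths, batch_first=True, enforce_sorted=False)
        _, (ht, _) = self.lstm(packed)
        return ht[-1]  # last layer: (batch_size, hidden_size)

    def forward(self, premise, premise_len, hypothesis, hypothesis_len):
        p = self.encode(premise, premise_len)
        h = self.encode(hypothesis, hypothesis_len)
        logits = self.output(torch.cat([p, h], dim=1))
        # Log-probabilities, suitable for nn.NLLLoss as in Part II.
        return torch.log_softmax(logits, dim=1)


# Two toy premise/hypothesis pairs, zero-padded to a common width.
model = PairEncoderClassifier(vocab_size=10)
premise = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])
hypothesis = torch.tensor([[6, 7, 0], [8, 9, 1]])
log_probs = model(premise, torch.tensor([3, 2]),
                  hypothesis, torch.tensor([2, 3]))
print(log_probs.shape)  # torch.Size([2, 3])
```

Sharing the encoder halves the number of LSTM parameters. Whether that helps or hurts accuracy on this dataset is something to test rather than assume.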

**NOTE:** You do not need to train until convergence. You can train for one or two epochs at most; train less if it takes too long. I simply want to see that it runs and is learning.


[1] Williams, Adina, Nikita Nangia, and Samuel Bowman. "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.

[2] Bowman, Samuel R., et al. "A large annotated corpus for learning natural language inference." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.

[3] Yin, Wenpeng, Jamaal Hay, and Dan Roth. "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

In [None]:
# COPY AND EDIT CODE HERE

1. Describe your solution.

**ANSWER HERE**

## EXTRA CREDIT 1 (10 points)

Modify the LSTM model to train a language model, then write code to generate new text from the model. Do not forget to mask the loss function when training the language model to handle the different lengths of the sequences. Use the "en-ud-train.upos.tsv" dataset.

Generate 10 examples from your model.
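
To illustrate what masking the loss means, here is a framework-free sketch that averages the negative log-likelihood over real tokens only. The function name `masked_nll` and the toy inputs are hypothetical, and the padding index is assumed to be 0. In PyTorch, passing `ignore_index` to `nn.NLLLoss` should achieve the same effect.

```python
import math


def masked_nll(log_probs, targets, pad_index=0):
    """Mean negative log-likelihood over non-padding target positions.

    log_probs[b][t] is a list of log-probabilities over the vocabulary;
    targets[b][t] is the index of the next token (pad_index where padded).
    """
    total, count = 0.0, 0
    for seq_log_probs, seq_targets in zip(log_probs, targets):
        for step_log_probs, target in zip(seq_log_probs, seq_targets):
            if target == pad_index:
                continue  # padded position: contributes nothing
            total -= step_log_probs[target]
            count += 1
    return total / count if count else 0.0


# A model that is uniform over a 4-token vocabulary at every step.
uniform = [math.log(0.25)] * 4
log_probs = [[uniform, uniform, uniform], [uniform, uniform, uniform]]
# The second sequence has one real token followed by two padded positions.
targets = [[1, 2, 3], [2, 0, 0]]
loss = masked_nll(log_probs, targets)
print(round(loss, 4))  # 1.3863, i.e. log(4)
```

Without the mask, the padded positions would count as predictions of token 0, rewarding the model for predicting padding and skewing the loss on short sequences.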

In [None]:
# COPY AND EDIT CODE HERE