## Programming Assignment (20 points)

In this assignment, you will solve an irony detection task: given a tweet, your job is to classify whether it is ironic or not.

You will implement a new classifier that does not rely on feature engineering as in previous homeworks. Instead, you will use pretrained word embeddings downloaded from using the `irony.py` script as your input feature vectors. Then, you will encode your sequence of word embeddings with an (already implemented) LSTM and classify based on its final hidden state.


In [3]:
# This is so that you don't have to restart the kernel everytime you edit hmm.py

%load_ext autoreload
%autoreload 2

## Data

We will use the dataset from SemEval-2018: https://github.com/Cyvhee/SemEval2018-Task3

In [4]:
from irony import load_datasets

train_sentences, train_labels, test_sentences, test_labels, label2i = load_datasets()

# TODO: Split train into train/dev

2022-11-01 17:42:32.367018: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Baseline: Naive Bayes

We have provided the solution for the Naive Bayes part from HW2 in [bayes.py](bayes.py)

There are two implementations: NaiveBayesHW2 is what was expected from HW2. However, we will use a more effecient implementation of it that uses vector operations to calculate the probabilities. Please go through it if you would like to

In [5]:
from irony import run_nb_baseline

run_nb_baseline()

Vectorizing Text: 100%|██████████| 3834/3834 [00:00<00:00, 9158.57it/s]
Vectorizing Text: 100%|██████████| 3834/3834 [00:00<00:00, 10912.28it/s]
Vectorizing Text: 100%|██████████| 784/784 [00:00<00:00, 13658.48it/s]


Baseline: Naive Bayes Classifier
F1-score Ironic: 0.6402966625463535
Avg F1-score: 0.6284487265300938


### Task 1: Implement avg_f1_score() in [util.py](util.py). Then re-run the above cell  (2 Points)

So the micro F1-score for the test set of the Ironic Class using a Naive Bayes Classifier is **0.64**

## Logistic Regression with Word2Vec  (Total: 18 Points)

Unlike sentiment, Irony is very subjective, and there is no word list for ironic and non-ironic tweets. This makes hand-engineering features tedious, therefore, we will use word embeddings as input to the classifier, and make the model automatically extract features aka learn weights for the embeddings

## Tokenizer for Tweets


Tweets are very different from normal document text. They have emojis, hashtags and bunch of other special character. Therefore, we need to create a suitable tokenizer for this kind of text.

Additionally, as described in class, we also need to have a consistent input length of the text document in order for the neural networks built over it to work correctly.

### Task 2: Create a Tokenizer with Padding (5 Points)

Our Tokenizer class is meant for tokenizing and padding batches of inputs. This is done
before we encode text sequences as torch Tensors.

Update the following class by completing the todo statements.

In [6]:
from typing import Dict, List, Optional, Tuple
from collections import Counter

import torch
import numpy as np
import spacy


class Tokenizer:
    """Tokenizes and pads a batch of input sentences."""

    def __init__(self, pad_symbol: Optional[str] = "<PAD>"):
        """Initializes the tokenizer

        Args:
            pad_symbol (Optional[str], optional): The symbol for a pad. Defaults to "<PAD>".
        """
        self.pad_symbol = pad_symbol
        self.nlp = spacy.load("en_core_web_sm")
    
    def __call__(self, batch: List[str]) -> List[List[str]]:
        """Tokenizes each sentence in the batch, and pads them if necessary so
        that we have equal length sentences in the batch.

        Args:
            batch (List[str]): A List of sentence strings

        Returns:
            List[List[str]]: A List of equal-length token Lists.
        """
        batch = self.tokenize(batch)
        batch = self.pad(batch)

        return batch

    def tokenize(self, sentences: List[str]) -> List[List[str]]:
        """Tokenizes the List of string sentences into a Lists of tokens using spacy tokenizer.

        Args:
            sentences (List[str]): The input sentence.

        Returns:
            List[str]: The tokenized version of the sentence.
        """
        # TODO: Tokenize the input with spacy.
        # TODO: Make sure the start token is the special <SOS> token and the end token
        #       is the special <EOS> token
        tokenized_sentences = []
        for sentence in sentences:
            tokenized_sentence = ['<SOS>']
            sentence = self.nlp(sentence)
            for token in sentence:
                tokenized_sentence.append(token.text)
            tokenized_sentence.append('<EOS>')
            tokenized_sentences.append(tokenized_sentence)
        return tokenized_sentences

    def pad(self, batch: List[List[str]]) -> List[List[str]]:
        """Appends pad symbols to each tokenized sentence in the batch such that
        every List of tokens is the same length. This means that the max length sentence
        will not be padded.

        Args:
            batch (List[List[str]]): Batch of tokenized sentences.

        Returns:
            List[List[str]]: Batch of padded tokenized sentences. 
        """
        # TODO: For each sentence in the batch, append the special <P>
        #       symbol to it n times to make all sentences equal length
        pad_max = max([len(l) for l in batch])
        for sentence in batch:
            sentence.extend(['<PAD>'] * (pad_max - len(sentence)))
        return batch

In [7]:
# create the vocabulary of the dataset: use both training and test sets here

SPECIAL_TOKENS = ['<UNK>', '<PAD>', '<SOS>', '<EOS>']

all_data = train_sentences + test_sentences
my_tokenizer = Tokenizer()

tokenized_data = my_tokenizer.tokenize(all_data)
vocab = sorted(set([w for ws in tokenized_data + [SPECIAL_TOKENS] for w in ws]))

with open('vocab.txt', 'w') as vf:
    vf.write('\n'.join(vocab))

## Embeddings

We use GloVe embeddings https://nlp.stanford.edu/projects/glove/. But these do not necessarily have all of the tokens that will occur in tweets! Hoad the GloVe embeddings, pruning them to only those words in vocab.txt. This is to reduce the memory and runtime of your model.

Then, find the out-of-vocabulary words (oov) and add them to the encoding dictionary and the embeddings matrix.

In [8]:
# Dowload the gloVe vectors for Twitter tweets. This will download a file called glove.twitter.27B.zip

# ! wget https://nlp.stanford.edu/data/glove.twitter.27B.zip

In [9]:
# unzip glove.twitter.27B.zip
# if there is an error, please download the zip file again

# ! unzip glove.twitter.27B.zip

In [10]:
# Let's see what files are there:

# ! ls . | grep "glove.*.txt"

In [11]:
# For this assignment, we will use glove.twitter.27B.50d.txt which has 50 dimensional word vectors
# Feel free to experiment with vectors of other sizes

embeddings_path = 'glove.twitter.27B.50d.txt'
vocab_path = "./vocab.txt"

## Creating a custom Embedding Layer

Now the GloVe file has vectors for about 1.2 million words. However, we only need the vectors for a very tiny fraction of words -> the unique words that are there in the classification corpus. Some of the next tasks will be to create a custom embedding layer that has the vectors for this small set of words

### Task 2: Extracting word vectors from GloVe (3 Points)

In [12]:
from typing import Dict, Tuple

import torch


def read_pretrained_embeddings(
    embeddings_path: str,
    vocab_path: str
) -> Tuple[Dict[str, int], torch.FloatTensor]:
    """Read the embeddings matrix and make a dict hashing each word.

    Note that we have provided the entire vocab for train and test, so that for practical purposes
    we can simply load those words in the vocab, rather than all 27B embeddings

    Args:
        embeddings_path (str): _description_
        vocab_path (str): _description_

    Returns:
        Tuple[Dict[str, int], torch.FloatTensor]: _description_
    """
    word2i = {}
    vectors = []
    
    with open(vocab_path, encoding='utf8') as vf:
        vocab = set([w.strip() for w in vf.readlines()]) 
    
#     print('vocab we have', vocab)
    print(f"Reading embeddings from {embeddings_path}...")
    with open(embeddings_path, "r") as f:
        i = 0
        for line in f:
            word, *weights = line.rstrip().split(" ")
            # TODO: Build word2i and vectors such that
            #       each word points to the index of its vector,
            #       and only words that exist in `vocab` are in our embeddings
            if word in vocab:
                word2i[word] = i
                vectors.insert(i, torch.Tensor([float(x) for x in weights]))
                i += 1
    return word2i, torch.stack(vectors)

### Task 3: Get GloVe Out of Vocabulary (oov) words (0 Points)

The task is to find the words in the Irony corpus that are not in the GloVe Word list

In [13]:
def get_oovs(vocab_path: str, word2i: Dict[str, int]) -> List[str]:
    """Find the vocab items that do not exist in the glove embeddings (in word2i).
    Return the List of such (unique) words.

    Args:
        vocab_path: List of batches of sentences.
        word2i (Dict[str, int]): _description_

    Returns:
        List[str]: _description_
    """
    with open(vocab_path, encoding='utf8') as vf:
        vocab = set([w.strip() for w in vf.readlines()])
    
    glove_and_vocab = set(word2i.keys())
    vocab_and_not_glove = vocab - glove_and_vocab
    return list(vocab_and_not_glove)

### Task 4: Update the embeddings with oov words (3 Points)

In [14]:
def intialize_new_embedding_weights(num_embeddings: int, dim: int) -> torch.FloatTensor:
    """xavier initialization for the embeddings of words in train, but not in gLove.

    Args:
        num_embeddings (int): _description_
        dim (int): _description_

    Returns:
        torch.FloatTensor: _description_
    """
#     TODO: Initialize a num_embeddings x dim matrix with xiavier initiialization
#          That is, a normal distribution with mean 0 and standard deviation of dim^-0.5
    w = torch.empty(num_embeddings, dim)
    return torch.nn.init.xavier_normal_(w)


def update_embeddings(
    glove_word2i: Dict[str, int],
    glove_embeddings: torch.FloatTensor,
    oovs: List[str]
) -> Tuple[Dict[str, int], torch.FloatTensor]:
    # TODO: Add the oov words to the dict, assigning a new index to each

    # TODO: Concatenate a new row to embeddings for each oov
    #       initialize those new rows with `intialize_new_embedding_weights`

    # TODO: Return the tuple of the dictionary and the new embeddings matrix
    index = len(glove_word2i)
    row = 0
    for word in oovs:
        glove_word2i[word] = index
        index += 1
        row += 1 
    x = intialize_new_embedding_weights(row, glove_embeddings.size(dim=1))
    glove_embeddings = torch.cat((glove_embeddings, x), 0)
    return  glove_word2i, glove_embeddings

In [15]:
glove_word2i, glove_embeddings = read_pretrained_embeddings(
    embeddings_path,
    vocab_path
)

# Find the out-of-vocabularies
oovs = get_oovs(vocab_path, glove_word2i)

# Add the oovs from training data to the word2i encoding, and as new rows
# to the embeddings matrix
word2i, embeddings = update_embeddings(glove_word2i, glove_embeddings, oovs)

Reading embeddings from glove.twitter.27B.50d.txt...


### Encoding words to integers: DO NOT EDIT

In [16]:
# Use these functions to encode your batches before you call the train loop.

def encode_sentences(batch: List[List[str]], word2i: Dict[str, int]) -> torch.LongTensor:
    """Encode the tokens in each sentence in the batch with a dictionary

    Args:
        batch (List[List[str]]): The padded and tokenized batch of sentences.
        word2i (Dict[str, int]): The encoding dictionary.

    Returns:
        torch.LongTensor: The tensor of encoded sentences.
    """
    UNK_IDX = word2i["<UNK>"]
    tensors = []
    for sent in batch:
        tensors.append(torch.LongTensor([word2i.get(w, UNK_IDX) for w in sent]))
        
    return torch.stack(tensors)


def encode_labels(labels: List[int]) -> torch.FloatTensor:
    """Turns the batch of labels into a tensor

    Args:
        labels (List[int]): List of all labels in the batch

    Returns:
        torch.FloatTensor: Tensor of all labels in the batch
    """
    return torch.LongTensor([int(l) for l in labels])

In [17]:
import random
vocab_path = "./vocab.txt"


def make_batches(sequences: List[str], batch_size: int) -> List[List[str]]:
    """Yield batch_size chunks from sequences."""
    # TODO
    samples = sequences
    batches = []
    for i in range(0, len(samples), batch_size):
        batches.append(samples[i:i+batch_size])
    return batches

# TODO: Set your preferred batch size
batch_size = 16
tokenizer = Tokenizer()

# We make batches now and use those.
encode_tokenized_batch_train_sentences = []
encode_tokenized_batch_train_labels = []
encode_tokenized_batch_dev_sentences = []
encode_tokenized_batch_dev_labels = []

# Note: Labels need to be batched in the same way to ensure
# We have train sentence and label batches lining up.
for batch in make_batches(train_sentences, batch_size):
    encode_tokenized_batch_train_sentences.append(encode_sentences(tokenizer(batch), word2i))

for batch in make_batches(train_labels, batch_size):
    encode_tokenized_batch_train_labels.append(encode_labels(batch))

for batch in make_batches(test_sentences, batch_size):
    encode_tokenized_batch_dev_sentences.append(encode_sentences(tokenizer(batch), word2i))

for batch in make_batches(test_labels, batch_size):
    encode_tokenized_batch_dev_labels.append(encode_labels(batch))

## Modeling   ( 7 Points)

In [30]:
import torch


# Notice there is a single TODO in the model
class IronyDetector(torch.nn.Module):
    def __init__(
        self,
        input_dim: int,
        hidden_dim: int,
        embeddings_tensor: torch.FloatTensor,
        pad_idx: int,
        output_size: int,
        dropout_val: float = 0.3,
    ):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.pad_idx = pad_idx
        self.dropout_val = dropout_val
        self.output_size = output_size
        # TODO: Initialize the embeddings from the weights matrix.
        #       Check the documentation for how to initialize an embedding layer
        #       from a pretrained embedding matrix. 
        #       Be careful to set the `freeze` parameter!
        #       Docs are here: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding.from_pretrained
        self.embeddings = torch.nn.Embedding.from_pretrained(embeddings_tensor)
        # Dropout regularization
        # https://jmlr.org/papers/v15/srivastava14a.html
        self.dropout_layer = torch.nn.Dropout(p=self.dropout_val, inplace=False)
        # Bidirectional 2-layer LSTM. Feel free to try different parameters.
        # https://colah.github.io/posts/2015-08-Understanding-LSTMs/
        self.lstm = torch.nn.LSTM(
            self.input_dim,
            self.hidden_dim,
            num_layers=2,
            dropout=dropout_val,
            batch_first=True,
            bidirectional=True,
        )
        # For classification over the final LSTM state.
        self.classifier = torch.nn.Linear(hidden_dim*2, self.output_size)
        self.log_softmax = torch.nn.LogSoftmax(dim=2)
    
    def encode_text(
        self,
        symbols: torch.Tensor
    ) -> torch.Tensor:
        """Encode the (batch of) sequence(s) of token symbols with an LSTM.
            Then, get the last (non-padded) hidden state for each symbol and return that.

        Args:
            symbols (torch.Tensor): The batch size x sequence length tensor of input tokens

        Returns:
            torch.Tensor: The final hiddens tate of the LSTM, which represents an encoding of
                the entire sentence
        """
        # First we get the embedding for each input symbol
        embedded = self.embeddings(symbols)
        embedded = self.dropout_layer(embedded)
        # Packs embedded source symbols into a PackedSequence.
        # This is an optimization when using padded sequences with an LSTM
        lens = (symbols != self.pad_idx).sum(dim=1).to("cpu")
        packed = torch.nn.utils.rnn.pack_padded_sequence(
            embedded, lens, batch_first=True, enforce_sorted=False
        )
        # -> batch_size x seq_len x encoder_dim, (h0, c0).
        packed_outs, (H, C) = self.lstm(packed)
        encoded, _ = torch.nn.utils.rnn.pad_packed_sequence(
            packed_outs,
            batch_first=True,
            padding_value=self.pad_idx,
            total_length=None,
        )
        # Now we have the representation of eahc token encoded by the LSTM.
        encoded, (H, C) = self.lstm(embedded)
   
        # This part looks tricky. All we are doing is getting a tensor
        # That indexes the last non-PAD position in each tensor in the batch.
        last_enc_out_idxs = lens - 1
        # -> B x 1 x 1.
        last_enc_out_idxs = last_enc_out_idxs.view([encoded.size(0)] + [1, 1])
        # -> 1 x 1 x encoder_dim. This indexes the last non-padded dimension.
        last_enc_out_idxs = last_enc_out_idxs.expand(
            [-1, -1, encoded.size(-1)]
        )
        # Get the final hidden state in the LSTM
        last_hidden = torch.gather(encoded, 1, last_enc_out_idxs)
        return last_hidden
    
    def forward(
        self,
        symbols: torch.Tensor,
    ) -> torch.Tensor:
        encoded_sents = self.encode_text(symbols)
        output = self.classifier(encoded_sents)
        print(output)
        print(self.log_softmax(output))
        return self.log_softmax(output)

## Evaluation

In [19]:
def predict(model: torch.nn.Module, dev_sequences: List[torch.Tensor]):
    preds = []
    # TODO: Get the predictions for the dev_sequences using the model
    for dev_sequence in dev_sequences:
        preds.append(torch.argmax(model(dev_sequence).squeeze(1), dim = 1))
    return preds

## Training

In [31]:
from tqdm import tqdm_notebook as tqdm

import random
from util import avg_f1_score, f1_score


def training_loop(
    num_epochs,
    train_features,
    train_labels,
    dev_features,
    dev_labels,
    optimizer,
    model,
):
    print("Training...")
    loss_func = torch.nn.NLLLoss()
    batches = list(zip(train_features, train_labels))
    random.shuffle(batches)
    dev_labels = np.vstack(ten.numpy() for ten in dev_labels).flatten()
    for i in range(num_epochs):
        losses = []
        for features, labels in tqdm(batches):
            # Empty the dynamic computation graph
            optimizer.zero_grad()
            preds = model(features).squeeze(1)
            loss = loss_func(preds, labels)
            # Backpropogate the loss through our model
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        
        print(f"epoch {i}, loss: {sum(losses)/len(losses)}")
        # Estimate the f1 score for the development set
        print("Evaluating dev...")
        preds = predict(model, dev_features)
        preds= np.vstack(ten.numpy() for ten in preds).flatten()
        dev_f1 = f1_score(preds, dev_labels, label2i['1'])
        dev_avg_f1 = avg_f1_score(preds, dev_labels, list(label2i.values()))
        print(f"Dev F1 {dev_f1}")
        print(f"Avf Dev F1 {dev_avg_f1}")
        
    # Return the trained model
    return model

In [32]:
# TODO: Load the model and run the training loop 
#       on your train/dev splits. Set and tweak hyperparameters.
model = IronyDetector(        
        input_dim=50,
        hidden_dim=25,
        embeddings_tensor=embeddings,
        pad_idx=word2i['<PAD>'],
        output_size=2
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model = training_loop(num_epochs = 10,
                      train_features = encode_tokenized_batch_train_sentences,
                      train_labels = encode_tokenized_batch_train_labels,
                      dev_features = encode_tokenized_batch_dev_sentences ,
                      dev_labels = encode_tokenized_batch_dev_labels,
                      optimizer = optimizer,
                      model = model)

Training...


  dev_labels = np.vstack(ten.numpy() for ten in dev_labels).flatten()
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for features, labels in tqdm(batches):


  0%|          | 0/240 [00:00<?, ?it/s]

tensor([[[ 0.0407, -0.0259]],

        [[ 0.0900, -0.0073]],

        [[ 0.0530, -0.0468]],

        [[ 0.0732, -0.0174]],

        [[ 0.0371, -0.0481]],

        [[ 0.0575, -0.0139]],

        [[ 0.0744, -0.0142]],

        [[ 0.0906,  0.0018]],

        [[ 0.0570, -0.0400]],

        [[ 0.0513, -0.0498]],

        [[ 0.0544, -0.0400]],

        [[ 0.0841, -0.0406]],

        [[ 0.0924, -0.0079]],

        [[ 0.0570, -0.0472]],

        [[ 0.0324, -0.0328]],

        [[ 0.0631, -0.0331]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6604, -0.7270]],

        [[-0.6457, -0.7430]],

        [[-0.6445, -0.7443]],

        [[-0.6489, -0.7394]],

        [[-0.6514, -0.7367]],

        [[-0.6581, -0.7295]],

        [[-0.6499, -0.7384]],

        [[-0.6497, -0.7385]],

        [[-0.6458, -0.7428]],

        [[-0.6438, -0.7450]],

        [[-0.6470, -0.7415]],

        [[-0.6327, -0.7575]],

        [[-0.6442, -0.7446]],

        [[-0.6424, -0.7466]],

        [[-0.6611, -0.7262]],

        [[-0.6

tensor([[[ 0.0351, -0.0083]],

        [[ 0.0361,  0.0241]],

        [[ 0.0029, -0.0003]],

        [[ 0.0007, -0.0044]],

        [[ 0.0017, -0.0087]],

        [[-0.0143, -0.0251]],

        [[ 0.0378,  0.0061]],

        [[ 0.0124, -0.0015]],

        [[ 0.0160, -0.0046]],

        [[-0.0106, -0.0212]],

        [[ 0.0061, -0.0363]],

        [[ 0.0165, -0.0051]],

        [[ 0.0088, -0.0141]],

        [[ 0.0411,  0.0085]],

        [[ 0.0070, -0.0167]],

        [[ 0.0020, -0.0009]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6717, -0.7151]],

        [[-0.6872, -0.6992]],

        [[-0.6916, -0.6947]],

        [[-0.6906, -0.6957]],

        [[-0.6880, -0.6984]],

        [[-0.6877, -0.6986]],

        [[-0.6774, -0.7091]],

        [[-0.6862, -0.7001]],

        [[-0.6829, -0.7035]],

        [[-0.6879, -0.6985]],

        [[-0.6722, -0.7146]],

        [[-0.6824, -0.7040]],

        [[-0.6817, -0.7047]],

        [[-0.6770, -0.7096]],

        [[-0.6814, -0.7051]],

        [[-0.6

tensor([[[ 0.0178, -0.0341]],

        [[ 0.0177, -0.0102]],

        [[ 0.0063, -0.0058]],

        [[ 0.0188, -0.0111]],

        [[-0.0034, -0.0237]],

        [[ 0.0472,  0.0528]],

        [[ 0.0008, -0.0199]],

        [[-0.0005, -0.0097]],

        [[ 0.0050,  0.0040]],

        [[ 0.0117, -0.0022]],

        [[ 0.0134,  0.0046]],

        [[-0.0108, -0.0271]],

        [[ 0.0340,  0.0230]],

        [[ 0.0244,  0.0006]],

        [[ 0.0113, -0.0091]],

        [[-0.0013,  0.0049]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6676, -0.7194]],

        [[-0.6793, -0.7072]],

        [[-0.6871, -0.6992]],

        [[-0.6783, -0.7082]],

        [[-0.6830, -0.7034]],

        [[-0.6960, -0.6903]],

        [[-0.6829, -0.7035]],

        [[-0.6885, -0.6978]],

        [[-0.6926, -0.6937]],

        [[-0.6862, -0.7001]],

        [[-0.6888, -0.6975]],

        [[-0.6850, -0.7014]],

        [[-0.6877, -0.6987]],

        [[-0.6813, -0.7051]],

        [[-0.6830, -0.7034]],

        [[-0.6

tensor([[[ 0.0578,  0.0262]],

        [[ 0.0078, -0.0058]],

        [[ 0.0346,  0.0160]],

        [[-0.0009, -0.0080]],

        [[-0.0079, -0.0134]],

        [[ 0.0076, -0.0030]],

        [[ 0.0151,  0.0150]],

        [[ 0.0194,  0.0021]],

        [[ 0.0095,  0.0182]],

        [[ 0.0295, -0.0069]],

        [[ 0.0525,  0.0516]],

        [[-0.0146,  0.0008]],

        [[ 0.0295,  0.0164]],

        [[ 0.0071, -0.0014]],

        [[ 0.0375,  0.0199]],

        [[ 0.0275, -0.0202]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6775, -0.7091]],

        [[-0.6863, -0.7000]],

        [[-0.6839, -0.7025]],

        [[-0.6896, -0.6967]],

        [[-0.6904, -0.6959]],

        [[-0.6879, -0.6984]],

        [[-0.6931, -0.6932]],

        [[-0.6845, -0.7019]],

        [[-0.6975, -0.6888]],

        [[-0.6751, -0.7115]],

        [[-0.6927, -0.6936]],

        [[-0.7009, -0.6855]],

        [[-0.6866, -0.6997]],

        [[-0.6889, -0.6974]],

        [[-0.6844, -0.7020]],

        [[-0.6

tensor([[[-0.0106,  0.0164]],

        [[-0.0330, -0.0153]],

        [[ 0.0020, -0.0096]],

        [[-0.0116, -0.0019]],

        [[ 0.0270,  0.0187]],

        [[ 0.0120,  0.0212]],

        [[-0.0199, -0.0213]],

        [[-0.0207, -0.0075]],

        [[-0.0057,  0.0172]],

        [[ 0.0208,  0.0319]],

        [[-0.0030,  0.0084]],

        [[ 0.0276,  0.0379]],

        [[-0.0325, -0.0009]],

        [[-0.0090, -0.0032]],

        [[ 0.0212,  0.0394]],

        [[ 0.0052,  0.0142]]], grad_fn=<ViewBackward0>)
tensor([[[-0.7068, -0.6797]],

        [[-0.7020, -0.6844]],

        [[-0.6873, -0.6990]],

        [[-0.6980, -0.6883]],

        [[-0.6890, -0.6973]],

        [[-0.6978, -0.6886]],

        [[-0.6924, -0.6939]],

        [[-0.6998, -0.6866]],

        [[-0.7047, -0.6818]],

        [[-0.6987, -0.6876]],

        [[-0.6988, -0.6875]],

        [[-0.6983, -0.6880]],

        [[-0.7091, -0.6775]],

        [[-0.6961, -0.6902]],

        [[-0.7023, -0.6841]],

        [[-0.6

tensor([[[ 0.0306, -0.0233]],

        [[-0.0002, -0.0318]],

        [[ 0.0357, -0.0333]],

        [[ 0.0214, -0.0416]],

        [[ 0.0101,  0.0016]],

        [[ 0.0113, -0.0140]],

        [[ 0.0149, -0.0217]],

        [[ 0.0266, -0.0240]],

        [[ 0.0263,  0.0045]],

        [[ 0.0404,  0.0036]],

        [[ 0.0201,  0.0032]],

        [[ 0.0410, -0.0274]],

        [[ 0.0393, -0.0126]],

        [[ 0.0747, -0.0154]],

        [[ 0.0419, -0.0168]],

        [[ 0.0240, -0.0047]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6666, -0.7205]],

        [[-0.6775, -0.7091]],

        [[-0.6593, -0.7282]],

        [[-0.6622, -0.7251]],

        [[-0.6889, -0.6974]],

        [[-0.6806, -0.7059]],

        [[-0.6750, -0.7116]],

        [[-0.6682, -0.7188]],

        [[-0.6823, -0.7041]],

        [[-0.6749, -0.7117]],

        [[-0.6847, -0.7016]],

        [[-0.6595, -0.7279]],

        [[-0.6675, -0.7194]],

        [[-0.6491, -0.7392]],

        [[-0.6642, -0.7229]],

        [[-0.6

tensor([[[ 0.0436, -0.0041]],

        [[ 0.0756, -0.0442]],

        [[ 0.0629, -0.0020]],

        [[ 0.0829, -0.0379]],

        [[ 0.0311, -0.0451]],

        [[ 0.0513, -0.0611]],

        [[ 0.0502, -0.0171]],

        [[ 0.0427, -0.0684]],

        [[ 0.0442, -0.0383]],

        [[ 0.0678, -0.0166]],

        [[ 0.0484, -0.0483]],

        [[ 0.0554, -0.0513]],

        [[ 0.0662, -0.0350]],

        [[ 0.0281, -0.0751]],

        [[ 0.0207, -0.0450]],

        [[ 0.0971,  0.0066]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6696, -0.7173]],

        [[-0.6351, -0.7548]],

        [[-0.6612, -0.7261]],

        [[-0.6346, -0.7554]],

        [[-0.6558, -0.7320]],

        [[-0.6385, -0.7509]],

        [[-0.6601, -0.7273]],

        [[-0.6391, -0.7503]],

        [[-0.6528, -0.7352]],

        [[-0.6518, -0.7362]],

        [[-0.6460, -0.7427]],

        [[-0.6412, -0.7479]],

        [[-0.6438, -0.7450]],

        [[-0.6429, -0.7461]],

        [[-0.6608, -0.7265]],

        [[-0.6

tensor([[[ 0.0719, -0.0723]],

        [[ 0.0828, -0.0250]],

        [[ 0.0674, -0.1156]],

        [[ 0.0968, -0.0620]],

        [[ 0.0690, -0.0743]],

        [[ 0.0827, -0.0591]],

        [[ 0.0625, -0.0559]],

        [[ 0.0695, -0.0754]],

        [[ 0.0511, -0.0916]],

        [[ 0.0905, -0.0187]],

        [[ 0.0930, -0.0294]],

        [[ 0.0831, -0.0279]],

        [[ 0.1017, -0.1040]],

        [[ 0.0612, -0.0531]],

        [[ 0.0728, -0.0079]],

        [[ 0.1200, -0.0888]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6237, -0.7678]],

        [[-0.6407, -0.7485]],

        [[-0.6058, -0.7888]],

        [[-0.6169, -0.7757]],

        [[-0.6241, -0.7673]],

        [[-0.6248, -0.7665]],

        [[-0.6357, -0.7541]],

        [[-0.6233, -0.7682]],

        [[-0.6243, -0.7670]],

        [[-0.6400, -0.7493]],

        [[-0.6338, -0.7563]],

        [[-0.6392, -0.7502]],

        [[-0.5956, -0.8013]],

        [[-0.6376, -0.7519]],

        [[-0.6536, -0.7343]],

        [[-0.5

tensor([[[ 0.0793, -0.0873]],

        [[ 0.1111, -0.0908]],

        [[ 0.0761, -0.1033]],

        [[ 0.1545, -0.1194]],

        [[ 0.1316, -0.0755]],

        [[ 0.1187, -0.0953]],

        [[ 0.0875, -0.1140]],

        [[ 0.0981, -0.0867]],

        [[ 0.0991, -0.0807]],

        [[ 0.0653, -0.0765]],

        [[ 0.1212, -0.0818]],

        [[ 0.1380, -0.1120]],

        [[ 0.1056, -0.0725]],

        [[ 0.1125, -0.0158]],

        [[ 0.0959, -0.1136]],

        [[ 0.0860, -0.0942]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6133, -0.7799]],

        [[-0.5973, -0.7992]],

        [[-0.6074, -0.7869]],

        [[-0.5656, -0.8394]],

        [[-0.5949, -0.8021]],

        [[-0.5919, -0.8059]],

        [[-0.5974, -0.7990]],

        [[-0.6050, -0.7898]],

        [[-0.6073, -0.7871]],

        [[-0.6248, -0.7666]],

        [[-0.5968, -0.7998]],

        [[-0.5759, -0.8259]],

        [[-0.6081, -0.7862]],

        [[-0.6311, -0.7594]],

        [[-0.5939, -0.8033]],

        [[-0.6

tensor([[[ 0.1243, -0.0962]],

        [[ 0.1054, -0.0671]],

        [[ 0.0866, -0.0719]],

        [[ 0.0948, -0.0387]],

        [[ 0.0570, -0.0645]],

        [[ 0.0842, -0.0450]],

        [[ 0.0590, -0.0313]],

        [[ 0.0832, -0.0767]],

        [[ 0.1306, -0.0725]],

        [[ 0.1105, -0.0795]],

        [[ 0.0894, -0.0863]],

        [[ 0.1170, -0.0872]],

        [[ 0.0769, -0.0499]],

        [[ 0.0711, -0.0700]],

        [[ 0.0685, -0.0897]],

        [[ 0.0948, -0.0758]]], grad_fn=<ViewBackward0>)
tensor([[[-0.5890, -0.8094]],

        [[-0.6106, -0.7831]],

        [[-0.6171, -0.7755]],

        [[-0.6286, -0.7621]],

        [[-0.6343, -0.7557]],

        [[-0.6306, -0.7598]],

        [[-0.6490, -0.7393]],

        [[-0.6164, -0.7763]],

        [[-0.5968, -0.7998]],

        [[-0.6026, -0.7927]],

        [[-0.6092, -0.7848]],

        [[-0.5962, -0.8005]],

        [[-0.6318, -0.7585]],

        [[-0.6251, -0.7662]],

        [[-0.6172, -0.7754]],

        [[-0.6

tensor([[[ 0.0239,  0.0171]],

        [[ 0.0105,  0.0440]],

        [[ 0.0368,  0.0009]],

        [[ 0.0177,  0.0488]],

        [[ 0.0131,  0.0250]],

        [[ 0.0646, -0.0011]],

        [[-0.0058,  0.0629]],

        [[ 0.0117,  0.0087]],

        [[ 0.0215, -0.0187]],

        [[ 0.0230,  0.0023]],

        [[ 0.0574, -0.0093]],

        [[ 0.0187,  0.0029]],

        [[ 0.0263, -0.0180]],

        [[-0.0117, -0.0061]],

        [[ 0.0517,  0.0014]],

        [[-0.0059,  0.0325]]], grad_fn=<ViewBackward0>)
tensor([[[-0.6897, -0.6966]],

        [[-0.7100, -0.6765]],

        [[-0.6753, -0.7113]],

        [[-0.7088, -0.6778]],

        [[-0.6991, -0.6872]],

        [[-0.6609, -0.7265]],

        [[-0.7281, -0.6594]],

        [[-0.6917, -0.6946]],

        [[-0.6732, -0.7135]],

        [[-0.6828, -0.7036]],

        [[-0.6604, -0.7270]],

        [[-0.6853, -0.7011]],

        [[-0.6712, -0.7155]],

        [[-0.6959, -0.6904]],

        [[-0.6683, -0.7186]],

        [[-0.7

tensor([[[-0.0232,  0.0957]],

        [[-0.0071,  0.1044]],

        [[ 0.0142,  0.0613]],

        [[-0.0640,  0.0723]],

        [[-0.1063,  0.1216]],

        [[-0.0363,  0.0772]],

        [[-0.0228,  0.0544]],

        [[-0.0115,  0.0006]],

        [[-0.0404,  0.0851]],

        [[-0.0236,  0.0714]],

        [[-0.0172,  0.0471]],

        [[-0.0719,  0.0646]],

        [[-0.0969,  0.1139]],

        [[-0.0084,  0.0415]],

        [[-0.0272,  0.0163]],

        [[-0.0228,  0.0643]]], grad_fn=<ViewBackward0>)
tensor([[[-0.7543, -0.6355]],

        [[-0.7504, -0.6390]],

        [[-0.7170, -0.6699]],

        [[-0.7636, -0.6273]],

        [[-0.8136, -0.5857]],

        [[-0.7515, -0.6380]],

        [[-0.7325, -0.6553]],

        [[-0.6992, -0.6871]],

        [[-0.7578, -0.6324]],

        [[-0.7418, -0.6468]],

        [[-0.7258, -0.6615]],

        [[-0.7637, -0.6272]],

        [[-0.8041, -0.5933]],

        [[-0.7184, -0.6685]],

        [[-0.7151, -0.6716]],

        [[-0.7

tensor([[[-0.0494,  0.1185]],

        [[-0.0492,  0.1019]],

        [[-0.0904,  0.1828]],

        [[-0.0498,  0.0610]],

        [[-0.1151,  0.1942]],

        [[-0.0394,  0.0417]],

        [[-0.0191,  0.1197]],

        [[-0.0888,  0.1070]],

        [[-0.0764,  0.1407]],

        [[-0.1057,  0.1960]],

        [[-0.0912,  0.1435]],

        [[-0.0974,  0.2021]],

        [[-0.1392,  0.1677]],

        [[-0.1064,  0.2204]],

        [[-0.0600,  0.0945]],

        [[-0.0688,  0.1915]]], grad_fn=<ViewBackward0>)
tensor([[[-0.7806, -0.6128]],

        [[-0.7715, -0.6205]],

        [[-0.8390, -0.5659]],

        [[-0.7501, -0.6393]],

        [[-0.8597, -0.5504]],

        [[-0.7345, -0.6534]],

        [[-0.7650, -0.6261]],

        [[-0.7958, -0.6000]],

        [[-0.8075, -0.5905]],

        [[-0.8553, -0.5536]],

        [[-0.8173, -0.5827]],

        [[-0.8540, -0.5546]],

        [[-0.8583, -0.5514]],

        [[-0.8698, -0.5430]],

        [[-0.7733, -0.6189]],

        [[-0.8

tensor([[[-0.0912,  0.1336]],

        [[-0.1590,  0.2706]],

        [[-0.0717,  0.1171]],

        [[-0.1262,  0.2418]],

        [[-0.1297,  0.2131]],

        [[-0.1479,  0.2546]],

        [[-0.0922,  0.1331]],

        [[-0.1124,  0.1018]],

        [[-0.0728,  0.0716]],

        [[-0.1200,  0.1148]],

        [[-0.1154,  0.1024]],

        [[-0.0710,  0.1352]],

        [[-0.1035,  0.1602]],

        [[-0.1533,  0.2022]],

        [[-0.1109,  0.1630]],

        [[-0.1207,  0.1502]]], grad_fn=<ViewBackward0>)
tensor([[[-0.8118, -0.5871]],

        [[-0.9308, -0.5012]],

        [[-0.7920, -0.6032]],

        [[-0.8940, -0.5260]],

        [[-0.8791, -0.5364]],

        [[-0.9145, -0.5120]],

        [[-0.8121, -0.5868]],

        [[-0.8060, -0.5918]],

        [[-0.7679, -0.6236]],

        [[-0.8174, -0.5826]],

        [[-0.8079, -0.5902]],

        [[-0.8015, -0.5954]],

        [[-0.8337, -0.5700]],

        [[-0.8866, -0.5311]],

        [[-0.8394, -0.5656]],

        [[-0.8

tensor([[[-0.1118,  0.1686]],

        [[-0.1692,  0.2808]],

        [[-0.1634,  0.2442]],

        [[-0.1066,  0.1394]],

        [[-0.1259,  0.2289]],

        [[-0.1453,  0.3047]],

        [[-0.0879,  0.1449]],

        [[-0.1797,  0.2665]],

        [[-0.1215,  0.0982]],

        [[-0.1069,  0.1351]],

        [[-0.0330,  0.0998]],

        [[-0.1585,  0.2044]],

        [[-0.1300,  0.2296]],

        [[-0.1194,  0.2333]],

        [[-0.1351,  0.1627]],

        [[-0.2262,  0.3147]]], grad_fn=<ViewBackward0>)
tensor([[[-0.8432, -0.5627]],

        [[-0.9433, -0.4932]],

        [[-0.9176, -0.5099]],

        [[-0.8237, -0.5777]],

        [[-0.8862, -0.5314]],

        [[-0.9432, -0.4933]],

        [[-0.8163, -0.5835]],

        [[-0.9409, -0.4947]],

        [[-0.8090, -0.5893]],

        [[-0.8214, -0.5795]],

        [[-0.7618, -0.6289]],

        [[-0.8909, -0.5281]],

        [[-0.8890, -0.5294]],

        [[-0.8850, -0.5322]],

        [[-0.8531, -0.5553]],

        [[-0.9

KeyboardInterrupt: 

## Written Assignment (30 Points)

### 1. Describe what the task is, and how it could be useful.

1. implement average f1 score, it can minimize the biase for using f1-score only, espcially for imbalanced dataset
2. Create a Tokenizer with Padding, tokenizing and padding batches of inputs sentences, pad make every list of tokens is the same length, make model easier.  
3. Extracting word vectors from GloVe, we load the created vocabulary, and get embeddings from the existing 1.2 million words from GloVe file, so we have generated our embeddings based on our own vocabulary. 
4. Update the embeddings with oov words, xavier initialization for the embedidings of words in train, and add the oov words to the dict, assigning a new index to each. So we have a dictionary for word to index and a torch tensor containing the embeddings.  
5. Make_batches, we batch train/dev set and labels along with tokenizer and encode methods. So the model can directly use these inputs 
6. Get the predictions for the dev_sequences using the model, we use torch argmax to get prediction result because the model forward method return a softmax result so we need torch argmax to determine index 0 or 1 for the result. 
7. Load the model and run the training loop on your train/dev splits. Set and tweak hyperparameters. Set epochs as 10 for feasible time, the parameter we can change is learning rate and batch size, please see question 5 for result table. 

### 2. Describe, at the high level, that is, without mathematical rigor, how pretrained word embeddings like the ones we relied on here are computed. Your description can discuss the Word2Vec class of algorithms, GloVe, or a similar method.

Word2Vec is a simple neural network with a single hidden layer, and like all neural networks, it has weights, and during training, its goal is to adjust those weights to reduce a loss function. It takes as its input a large corpus of words and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

### 3. What are some of the benefits of using word embeddings instead of e.g. a bag of words?

1. They retain semantic similarity
2. They have dense vectors
3. They have a constant vector size
4. Their Vector representations are absolute
5. They have multiple embedding models

### 4. What is the difference between Binary Cross Entropy loss and the negative log likelihood loss we used here (`torch.nn.NLLLoss`)?

The cross_entropy combines nn.LogSoftmax() and nn.NLLLoss() in one single class. If we use NLLLoss we need to use softmax manually to transform the output.  

Use NLLLoss if two-dimensional input encodes log-likelihood, it essentially performs the masking step followed by mean reduction. Use CrossEntropyLoss if two-dimensional input encodes raw prediction values that need to be activated using the softmax function

### 5. Show your experimental results. Indicate any changes to hyperparameters, data splits, or architectural changes you made, and how those effected results.

| --- | loss |  Dev F1 | Avf Dev F1 |  
|-----------|----------- |----------- |----------- |  
| lr = 0.01 & batch = 8  | 0.5908 |0.5997 | 0.6355 |  
| lr = 0.01 & batch = 16 | 0.5690 | 0.5749 |   0.6381 |  
| lr = 0.01 & batch = 4 | 0.6312 | 0.5594 |   0.6532 |  
| lr = 0.001 & batch = 8  | 0.5723 |0.5932 | 0.6291 |  
| lr = 0.001 & batch = 16 | 0.5833 | **0.6779** |  **0.7252**  |  
| lr = 0.001 & batch = 4 | **0.5553** | 0.6177 |   0.6721 |  