
# HW3 Programming Assignment

In this assignment, we will train LSTM POS-taggers, and evaluate their performance.

We will use English text from the Wall Street Journal, marked with POS tags such as `NNP` (proper noun) and `DT` (determiner).

## Part 1 Building a Basic POS Tagger  [15 points]

Define BasicPOSTagger -- 5 points


Train and evaluate BasicPOSTagger -- 5 points


Error analysis for BasicPOSTagger -- 5 points



### Part 1.1 Setup

**To begin this project, make a copy of this notebook and save it to your local drive so that you can edit it.**


If you want a GPU (which will improve training speed), you can change your instance type to GPU by going to Runtime -> Change runtime type -> Hardware accelerator.

If you're new to PyTorch, or simply want a refresher, we recommend you start by looking through these [Introduction to PyTorch](https://cocoxu.github.io/CS4650_spring2022/slides/PyTorch_tutorial.pdf) slides and this interactive [PyTorch Basics notebook](http://bit.ly/pytorchbasics). Additionally, this [Text Sentiment](http://bit.ly/pytorchexample) notebook will provide some insight into working with PyTorch for NLP specific problems. 

In [None]:
# DO NOT MODIFY #
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import random

RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# This is how we select a GPU if one is available on your computer or in the Colab environment.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

You can check to make sure a GPU is available using the following code block.

If the message below is shown, it means you are using a CPU.
```
/bin/bash: nvidia-smi: command not found
```

---





In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Fri Oct 28 03:42:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    32W /  70W |   1016MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Part 1.2 Preparing Data

`train.txt`: This file contains the training data: sequences of words and their respective tags. We split the data 80%/20% into training and development sets, which are used to train the model and tune the hyperparameters, respectively. See `load_tag_data` for details on how to read the file.
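
As a quick illustration of the file layout that `load_tag_data` reads, the sketch below parses a tiny in-memory sample. It assumes one token per line with three whitespace-separated fields (word, POS tag, and a third column that is ignored) and a blank line after each sentence. The sample text and the helper name `parse_tagged_lines` are made up for this example, and the helper is slightly more forgiving than `load_tag_data` about repeated or missing blank lines.

```python
def parse_tagged_lines(lines):
    """Group 'word tag extra' lines into sentences; a blank line ends a sentence."""
    sentences, tags = [], []
    sent, sent_tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if sent:  # skip repeated blank lines
                sentences.append(sent)
                tags.append(sent_tags)
                sent, sent_tags = [], []
            continue
        word, tag, _ = line.split()
        sent.append(word)
        sent_tags.append(tag)
    if sent:  # keep a final sentence that is not followed by a blank line
        sentences.append(sent)
        tags.append(sent_tags)
    return sentences, tags


# Hypothetical sample; the third column is a placeholder that gets ignored.
sample = """The DT X
market NN X
fell VBD X

Prices NNS X
rose VBD X
"""

sample_sentences, sample_tags = parse_tagged_lines(sample.splitlines())
print(sample_sentences)
print(sample_tags)
```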

In [None]:
!curl https://raw.githubusercontent.com/chaojiang06/chaojiang06.github.io/master/TA/spring2022_CS4650/train.txt > train.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2775k  100 2775k    0     0  12.3M      0 --:--:-- --:--:-- --:--:-- 12.3M


In [None]:
def load_tag_data(tag_file):
    all_sentences = []
    all_tags = []
    sent = []
    tags = []
    with open(tag_file, 'r') as f:
        for line in f:
            if line.strip() == "":
                all_sentences.append(sent)
                all_tags.append(tags)
                sent = []
                tags = []
            else:
                word, tag, _ = line.strip().split()
                sent.append(word)
                tags.append(tag)
    return all_sentences, all_tags

train_sentences, train_tags = load_tag_data('train.txt')

unique_tags = set([tag for tag_seq in train_tags for tag in tag_seq])

# Create train-val split from train data
train_val_data = list(zip(train_sentences, train_tags))
random.shuffle(train_val_data)
split = int(0.8 * len(train_val_data))
training_data = train_val_data[:split]
val_data = train_val_data[split:]

print("Train Data: ", len(training_data))
print("Val Data: ", len(val_data))
print("Total tags: ", len(unique_tags))

Train Data:  7148
Val Data:  1788
Total tags:  44


### Part 1.3 Word-to-Index and Tag-to-Index mapping
In order to work with text in Tensor format, we need to map each word to an index.

In [None]:
word_to_idx = {}
for sent in train_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
            
tag_to_idx = {}
for tag in unique_tags:
    if tag not in tag_to_idx:
        tag_to_idx[tag] = len(tag_to_idx)

idx_to_tag = {}
for tag in tag_to_idx:
    idx_to_tag[tag_to_idx[tag]] = tag

print("Total tags", len(tag_to_idx))
print("Vocab size", len(word_to_idx))

Total tags 44
Vocab size 19122


In [None]:
def prepare_sequence(sent, idx_mapping):
    idxs = [idx_mapping[word] for word in sent]
    return torch.tensor(idxs, dtype=torch.long)

### Part 1.4 Set up model
We will build and train a Basic POS Tagger which is an LSTM model to tag the parts of speech in a given sentence.


First we need to define some default hyperparameters.

In [None]:
EMBEDDING_DIM = 4
HIDDEN_DIM = 8
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 10

### Part 1.5 Define Model

The model takes as input a sentence as a tensor in the index space. This sentence is then converted to embedding space, where each word maps to its word embedding. The word embeddings are learned as part of the model training process. These word embeddings act as input to the LSTM, which produces a representation for each word. The word representations are then passed to a Linear layer.
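
To make the data flow concrete, here is a minimal shape trace of the three stages (embedding, LSTM, linear). It is only a sketch with toy sizes picked for illustration, not the assignment solution; the variable names are hypothetical, and an explicit batch dimension of 1 is added so the LSTM input follows the `(seq_len, batch, feature)` layout.

```python
import torch
import torch.nn as nn

# Toy sizes chosen only for this illustration.
vocab_size, tagset_size = 10, 5
embedding_dim, hidden_dim = 4, 8

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
to_tags = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([1, 7, 3], dtype=torch.long)  # three word indices

# Add an explicit batch dimension of 1: (seq_len, batch, feature).
embedded = embedding(sentence).unsqueeze(1)      # (3, 1, 4)
hidden_states, _ = lstm(embedded)                # (3, 1, 8)
tag_scores = to_tags(hidden_states).squeeze(1)   # (3, 5): raw scores per word

print(embedded.shape, hidden_states.shape, tag_scores.shape)
```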

In [None]:
class BasicPOSTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BasicPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # an LSTM layer: that takes word embeddings as input and outputs hidden states
        # a linear layer: maps from hidden state space to tag space
        #############################################################################

        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # an LSTM layer: that takes word embeddings as input and outputs hidden states
        self.lstm_layer = nn.LSTM(embedding_dim, hidden_dim, num_layers = LSTM_LAYERS)

        # a linear layer: maps from hidden state space to tag space
        self.linear_layer = nn.Linear(hidden_dim, tagset_size)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def forward(self, sentence):
        # tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence as the argument, 
        # compute the corresponding raw scores for tags (without softmax)
        # returns:: tag_scores (Tensor)
        #############################################################################
        
        x = self.word_embeddings(sentence)
        x, _ = self.lstm_layer(x)
        tag_scores = self.linear_layer(x)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores

### Part 1.6 Training

We define train and evaluate procedures that allow us to train our model using our created train-val split.
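
One detail worth keeping in mind when averaging losses: `nn.CrossEntropyLoss` defaults to `reduction='mean'`, so each call already returns a per-token average for that sentence. The arithmetic sketch below uses made-up per-token losses to show how summing those per-sentence means and dividing by the token count differs from the per-token average obtained by summing token losses (which is what `reduction='sum'` would accumulate).

```python
# Made-up per-token losses for two sentences of different lengths.
token_losses = [
    [2.0, 1.0, 3.0, 2.0],  # sentence with 4 tokens
    [4.0, 2.0],            # sentence with 2 tokens
]

total_tokens = sum(len(sent) for sent in token_losses)

# What reduction='mean' returns per sentence: the mean over its tokens.
sentence_means = [sum(sent) / len(sent) for sent in token_losses]

# Summing per-sentence means and dividing by the token count shrinks the value.
mean_then_divide = sum(sentence_means) / total_tokens

# Summing token losses first gives the true per-token average.
sum_then_divide = sum(sum(sent) for sent in token_losses) / total_tokens

print(sentence_means)
print(mean_then_divide)
print(sum_then_divide)
```

Either quantity decreases as the model improves, so both are usable for tracking progress; they are simply on different scales.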

In [None]:
def train(epoch, model, loss_function, optimizer):
    model.train()
    train_loss = 0
    train_examples = 0
    for sentence, tags in training_data:
        #############################################################################
        # TODO: Implement the training method
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences. Find the gradient with respect to the loss and update the
        # model parameters using the optimizer.
        #############################################################################
        
        #zero out the parameter gradients
        optimizer.zero_grad()

        #prepare input data (sentences and gold labels)
        sentence_in = prepare_sequence(sentence, word_to_idx)
        targets = prepare_sequence(tags, tag_to_idx)

        #do forward pass with current batch of input
        pred_prob = model.forward(sentence_in.to(device))

        #get loss with model predictions and true labels
        loss = loss_function(pred_prob, targets.to(device))
        loss.backward()

        #update model parameters
        optimizer.step()

        #increase running total loss and the number of past training samples 
        train_loss += loss.item()
        train_examples += len(targets)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate(model, loss_function)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate(model, loss_function):
  # returns:: avg_val_loss (float)
  # returns:: val_accuracy (float)
    model.eval()
    correct = 0
    val_loss = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate method
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################
            
            #prepare input data (sentences and gold labels)
            sentence_in = prepare_sequence(sentence, word_to_idx)
            targets = prepare_sequence(tags, tag_to_idx)

            #do forward pass with current batch of input
            pred_prob = model.forward(sentence_in.to(device))

            #get loss with model predictions and true labels
            loss = loss_function(pred_prob, targets.to(device))

            #get the predicted labels
            pred_labels = torch.argmax(pred_prob, dim=1)

            #count the correct predictions (compare on the CPU, where targets lives)
            correct += (pred_labels.cpu() == targets).sum().item()

            #increase running total loss and the number of past valid samples
            val_loss += loss.item()
            val_examples += len(targets)


            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy

In [None]:
'''
EMBEDDING_DIM = 4
HIDDEN_DIM = 8
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 10
'''
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
from torch.optim import Adam

model = BasicPOSTagger(embedding_dim = EMBEDDING_DIM,
                       hidden_dim    = HIDDEN_DIM,
                       vocab_size    = len(word_to_idx.keys()),
                       tagset_size   = len(tag_to_idx.keys())).to(device)

loss_function = nn.CrossEntropyLoss()

optimizer = torch.optim.Adagrad(model.parameters(), lr = LEARNING_RATE)


#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
for epoch in range(1, EPOCHS + 1): 
    train(epoch, model, loss_function, optimizer)

Epoch: 1/10	Avg Train Loss: 0.0630	Avg Val Loss: 0.0492	 Val Accuracy: 68
Epoch: 2/10	Avg Train Loss: 0.0432	Avg Val Loss: 0.0402	 Val Accuracy: 75
Epoch: 3/10	Avg Train Loss: 0.0353	Avg Val Loss: 0.0352	 Val Accuracy: 79
Epoch: 4/10	Avg Train Loss: 0.0303	Avg Val Loss: 0.0321	 Val Accuracy: 82
Epoch: 5/10	Avg Train Loss: 0.0268	Avg Val Loss: 0.0298	 Val Accuracy: 83
Epoch: 6/10	Avg Train Loss: 0.0242	Avg Val Loss: 0.0282	 Val Accuracy: 84
Epoch: 7/10	Avg Train Loss: 0.0222	Avg Val Loss: 0.0270	 Val Accuracy: 85
Epoch: 8/10	Avg Train Loss: 0.0206	Avg Val Loss: 0.0261	 Val Accuracy: 86
Epoch: 9/10	Avg Train Loss: 0.0193	Avg Val Loss: 0.0254	 Val Accuracy: 86
Epoch: 10/10	Avg Train Loss: 0.0182	Avg Val Loss: 0.0249	 Val Accuracy: 87


**Sanity Check!** Under the default hyperparameter setting, after 5 epochs you should be able to get at least 75% accuracy on the validation set.

### Part 1.7 Error analysis

In this step, we will analyze what kinds of errors the model makes on the validation set.

Step 1, write a method to generate predictions from the validation set. For every sentence, get its words, predicted tags (model_tags), and the ground truth tags (gt_tags). To make the next step easier, you may want to concatenate the words from all sentences into one long list, and do the same for model_tags and gt_tags.


Step 2, analyze what kinds of errors the model makes. For example, it may frequently label NN as VB. Let's get the top-10 most frequent types of errors, the frequency of each, and some example words. One example is shown below. It means that the model labeled words whose ground-truth tag is NNP as VBG 626 times; five random example words are shown.

```
['VBG', 'NNP', 626, ['Rowe', 'Livermore', 'Parker', 'F-16', 'HEYNOW']]
```
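
The counting in Step 2 can be sketched with `collections.Counter`. This is one possible approach under simple assumptions, not the required implementation: the toy word and tag lists are invented for the example, and the first five example words are kept rather than a random sample.

```python
from collections import Counter, defaultdict

# Toy predictions, invented for this example.
words      = ["Rowe", "runs", "fast", "Parker", "deal", "Ivy"]
model_tags = ["NN",   "VBZ",  "NN",   "NN",     "NN",   "NN"]
gt_tags    = ["NNP",  "VBZ",  "RB",   "NNP",    "NN",   "NNP"]

error_counts = Counter()
error_examples = defaultdict(list)
for word, predicted, gold in zip(words, model_tags, gt_tags):
    if predicted != gold:
        error_counts[(predicted, gold)] += 1
        error_examples[(predicted, gold)].append(word)

# most_common sorts the (model_tag, ground_truth_tag) pairs by frequency.
top_errors = [
    [predicted, gold, count, error_examples[(predicted, gold)][:5]]
    for (predicted, gold), count in error_counts.most_common(10)
]
for row in top_errors:
    print(row)
```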

In [None]:
#############################################################################
# TODO: Generate predictions for val_data
# Create lists of words, tags predicted by the model and ground truth tags.
# Hint: It should look very similar to the evaluate function.
#############################################################################
def generate_predictions(model, val_data):
    # returns:: word_list (str list)
    # returns:: model_tags (str list) predicted tags
    # returns:: gt_tags (str list)
    # Your code here
    word_list = []
    model_tags = []
    gt_tags = []
    with torch.no_grad():
        for sentence, tags in val_data:
            
            #prepare input data (sentences and gold labels)
            sentence_in = prepare_sequence(sentence, word_to_idx)
            targets = prepare_sequence(tags, tag_to_idx)

            #do forward pass with current batch of input
            tag_scores = model.forward(sentence_in.to(device))

            #get the predicted labels as a plain Python list
            pred_labels = torch.argmax(tag_scores, 1).tolist()

            #record the word, predicted tag, and gold tag for every token
            for i in range(len(targets)):
              model_tags.append(idx_to_tag[pred_labels[i]])
              gt_tags.append(tags[i])
              word_list.append(sentence[i])
    
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################

    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
# ['VBG', 'NNP', 626, ['Rowe', 'Livermore', 'Parker', 'F-16', 'HEYNOW']]
# For example, it may frequently label NN as VB. 
# Let's get the top-10 most frequent types of errors, each of their frequency, and some example words.
#############################################################################
def error_analysis(word_list, model_tags, gt_tags):
    # returns: errors (list of tuples)
    # Your code here

    error_dict = {}
    example_words = {}
    for i in range(len(model_tags)):
      if (model_tags[i] != gt_tags[i]): 
        if (model_tags[i], gt_tags[i]) not in error_dict.keys():
          error_dict[(model_tags[i], gt_tags[i])] = 1
          example_words[(model_tags[i], gt_tags[i])] = [word_list[i]]
        else:
          error_dict[(model_tags[i], gt_tags[i])] += 1
          example_words[(model_tags[i], gt_tags[i])].append(word_list[i])

    sorted_error_dict = sorted(error_dict.items(), key=lambda kv:(kv[1], kv[0]), reverse=True)
    errors = []
    for (tags, frequency) in sorted_error_dict:
      errors.append([tags, frequency, example_words[tags][:5]])
    #############################################################################
    #                             END OF YOUR CODE                              #
    
    # '''
    # NNP: proper noun, singular
    # NN: common noun, singular or mass
    # JJ: adjective
    # NNS: common noun, plural
    # VBZ: verb, 3rd person singular present
    # VBD: verb, past tense
    # VBN: verb, past participle
    # '''
    #############################################################################

    return errors

word_list, model_tags, gt_tags = generate_predictions(model, val_data)
errors = error_analysis(word_list, model_tags, gt_tags)

for i in errors[:10]:
  print(i)




[('NN', 'NNP'), 537, ['Ivy', 'Egon', 'Pennsylvania', 'Wenz', 'Barron']]
[('JJ', 'NN'), 221, ['commercial', 'mother', 'hotdog', 'Source', 'net']]
[('NNS', 'JJ'), 182, ['much-beloved', 'diverse', 'unable', 'sure', 'snooty']]
[('NN', 'JJ'), 173, ['snake-oil', 'huge', 'bargain-basement', 'desperate', 'deflationary']]
[('NNP', 'NN'), 146, ['novelist', 'Market', 'magazine', 'revolution', 'treasury']]
[('VBD', 'VBN'), 123, ['ended', 'requested', 'continued', 'acquired', 'held']]
[('NN', 'NNS'), 112, ['pleas', 'loopholes', 'restrictions', 'bills', 'Plans']]
[('NNS', 'NNP'), 110, ['Galles', 'Allenport', 'Himont', 'Mitsuoka', 'Malapai']]
[('NNS', 'CD'), 102, ['334,000', '617', '9.9', '142.43', '220']]
[('NNS', 'NN'), 97, ['club', 'reflection', 'rolling', 're-election', 'chicken']]


**Report your findings here.**  
What kinds of errors did the model make and why do you think it made them?

The most frequent errors came from the model over-predicting common nouns and proper nouns. It mixed up the different types of nouns (common vs. proper, singular vs. plural), which could be because common nouns are sometimes parts of titles that are tagged as proper nouns. It also predicted adjectives as nouns, maybe because those adjectives are uncommon and were rarely or never seen in the training data.

## Part 2 Hyper-parameter Tuning [10 points]

To improve your model's performance, try modifying `EMBEDDING_DIM`, `HIDDEN_DIM`, and `LEARNING_RATE`. You will receive 50%/75%/100% credit for this section if your model, after being trained for 10 epochs, achieves 80%/85%/90% accuracy on the validation set.

In [None]:
YOUR_EMBEDDING_DIM = 10
YOUR_HIDDEN_DIM = 20
YOUR_LEARNING_RATE = 0.001

#############################################################################
# TODO: Set three hyper-parameters. Initialize the model, optimizer and the loss function
# Hint, you may want to use reduction='sum' in the CrossEntropyLoss function
#############################################################################

from torch.optim import Adam

model_tuned = BasicPOSTagger(embedding_dim = YOUR_EMBEDDING_DIM,
                       hidden_dim    = YOUR_HIDDEN_DIM,
                       vocab_size    = len(word_to_idx.keys()),
                       tagset_size   = len(tag_to_idx.keys())).to(device)

loss_function = nn.CrossEntropyLoss()

optimizer_tuned = torch.optim.Adam(model_tuned.parameters(), lr = YOUR_LEARNING_RATE)

#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
for epoch in range(1, EPOCHS + 1): 
    train(epoch, model_tuned, loss_function, optimizer_tuned)

Epoch: 1/10	Avg Train Loss: 0.0743	Avg Val Loss: 0.0505	 Val Accuracy: 66
Epoch: 2/10	Avg Train Loss: 0.0416	Avg Val Loss: 0.0353	 Val Accuracy: 77
Epoch: 3/10	Avg Train Loss: 0.0297	Avg Val Loss: 0.0275	 Val Accuracy: 83
Epoch: 4/10	Avg Train Loss: 0.0230	Avg Val Loss: 0.0231	 Val Accuracy: 85
Epoch: 5/10	Avg Train Loss: 0.0188	Avg Val Loss: 0.0203	 Val Accuracy: 87
Epoch: 6/10	Avg Train Loss: 0.0159	Avg Val Loss: 0.0185	 Val Accuracy: 89
Epoch: 7/10	Avg Train Loss: 0.0137	Avg Val Loss: 0.0172	 Val Accuracy: 89
Epoch: 8/10	Avg Train Loss: 0.0120	Avg Val Loss: 0.0162	 Val Accuracy: 90
Epoch: 9/10	Avg Train Loss: 0.0105	Avg Val Loss: 0.0155	 Val Accuracy: 91
Epoch: 10/10	Avg Train Loss: 0.0093	Avg Val Loss: 0.0149	 Val Accuracy: 91


## Part 3 Character-level POS Tagger  [15 points]

Define CharPOSTagger -- 5 points

Train and evaluate CharPOSTagger -- 5 points

Error analysis for CharPOSTagger -- 5 points

Use the character-level information to augment word embeddings. For example, words that end with -ing or -ly give quite a bit of information about their POS tags. To incorporate this information, run a character-level LSTM on every word to create a character-level representation of the word. Take the last hidden state from the character-level LSTM as the representation and concatenate with the word embedding (as in the BasicPOSTagger) to create a new word representation that captures more information.

In [None]:
# Create char to index mapping
char_to_idx = {}
unique_chars = set()
MAX_WORD_LEN = 0

for sent in train_sentences:
    for word in sent:
        for c in word:
            unique_chars.add(c)
        if len(word) > MAX_WORD_LEN:
            MAX_WORD_LEN = len(word)

for c in unique_chars:
    char_to_idx[c] = len(char_to_idx)
char_to_idx[' '] = len(char_to_idx)


### How to do padding correctly for the characters?


Assume we have got a sentence ["We", "love", "NLP"]. You are supposed to first prepend a certain number of blank characters to each of the words in this sentence.

How do we determine the number of blank characters we need? `MAX_WORD_LEN`, which the starter code already computes, is there to help. For the given sentence, MAX_WORD_LEN equals 4. Therefore we prepend two blank characters to "We", zero blank characters to "love", and one blank character to "NLP". So the resulting padded sentence should be ["  We", "love", " NLP"].
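
A minimal sketch of this left-padding step, using Python's built-in `str.rjust`. The lowercase name `max_word_len` is local to this example and is computed from the toy sentence only, whereas the notebook computes `MAX_WORD_LEN` over the whole training set.

```python
sentence = ["We", "love", "NLP"]

# For this toy sentence the longest word has 4 characters.
max_word_len = max(len(word) for word in sentence)

# str.rjust pads on the left with the given fill character.
padded = [word.rjust(max_word_len, " ") for word in sentence]
print(padded)  # ['  We', 'love', ' NLP']
```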

Then, we feed all characters in ["  We", "love", " NLP"] into a char-embedding layer, and get a tensor of shape (3, 4, char_embedding_dim). To make this tensor's shape proper for the char-level LSTM (nn.LSTM), we need to transpose this tensor, i.e. swap the first and the second dimension. So we get a tensor of shape (4, 3, char_embedding_dim), where 4 corresponds to seq_len and 3 corresponds to batch_size.

The last thing you need to do is to obtain the last hidden state from the char-level LSTM, and concatenate it with the word embedding, so that you can get an augmented representation of that word.

[This](https://raw.githubusercontent.com/chaojiang06/chaojiang06.github.io/master/TA/spring2022_CS4650/char_padding.png) is an illustration for left padding characters.

### Why do the padding?
You may wonder why we pad at all, instead of passing the character sequence of each word through an LSTM one by one to get the last hidden state. The reason is that without padding you can only implement this process with a `for` loop. For CharPOSTagger, a `for`-loop implementation takes approximately 150s (GPU) / 250s (CPU) per epoch, whereas padding and feeding the data in batches takes around 30s (GPU) / 150s (CPU) per epoch. Therefore, we strongly recommend that you learn how to pad your data and transform it into batches. These are important concepts to become familiar with, even though it might take you some time.
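
The sketch below illustrates, under toy assumptions, that batching by itself does not change the result: once words are padded to a common length, a single batched LSTM call yields the same last hidden states as calling the LSTM on each padded word in a loop. All sizes and variable names are hypothetical, and random character indices stand in for real padded words.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for this illustration.
num_words, max_word_len = 3, 4
char_vocab_size, char_embedding_dim, char_hidden_dim = 30, 2, 6

char_embedding = nn.Embedding(char_vocab_size, char_embedding_dim)
char_lstm = nn.LSTM(char_embedding_dim, char_hidden_dim)

# Stand-in for three padded words, each a row of character indices.
char_ids = torch.randint(0, char_vocab_size, (num_words, max_word_len))

with torch.no_grad():
    # Batched: one LSTM call over input of shape (seq_len, batch, feature).
    batched_in = char_embedding(char_ids).transpose(0, 1)   # (4, 3, 2)
    _, (h_n, _) = char_lstm(batched_in)
    batched_last = h_n[-1]                                  # (3, 6)

    # Looped: one LSTM call per word, each with a batch of size 1.
    looped_last = []
    for word_ids in char_ids:
        single_in = char_embedding(word_ids).unsqueeze(1)   # (4, 1, 2)
        _, (h_single, _) = char_lstm(single_in)
        looped_last.append(h_single[-1])                    # (1, 6)
    looped_last = torch.cat(looped_last, dim=0)             # (3, 6)

print(torch.allclose(batched_last, looped_last, atol=1e-6))
```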

### Why left padding?
Our hypothesis is that the suffixes of English words (e.g., -ly, -ing) are more indicative of the part of speech (POS) than prefixes. Although an LSTM is supposed to be able to handle long sequences, it still loses information along the way, and the information closer to the last state (which you use as the char-level representation) is retained better.
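
A small string-only sketch of this point, with example words chosen for illustration: after left padding, the final characters of every padded word are its own suffix, while right padding would leave short words ending in blanks.

```python
words = ["quickly", "running", "run"]
max_word_len = max(len(word) for word in words)

left_padded = [word.rjust(max_word_len) for word in words]
right_padded = [word.ljust(max_word_len) for word in words]

# The last three characters are what the LSTM reads just before its final state.
left_tails = [word[-3:] for word in left_padded]
right_tails = [word[-3:] for word in right_padded]
print(left_tails)
print(right_tails)
```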

### How to understand the dimension change?
Assume we have a sentence with 3 words ["We", "love", "NLP"], and assume the character embedding has dimension 2, the word embedding has dimension 4, the word-level LSTM's hidden layer has dimension 5, and the character-level LSTM's hidden layer has dimension 6.

In BasicPOSTagger, the dimension change would be (3x1x4) ----word-level LSTM----> (3x1x5) ----linear layer----> (3x1x44).

In CharPOSTagger, after padding, character embedding, and swapping, the dimension change would be (MAX_WORD_LEN, 3, 2) ----character-level LSTM----> (MAX_WORD_LEN, 3, 6) ----take the last hidden state----> (3, 6) ----concatenate with word embeddings----> (3x1x10) ----word-level LSTM----> (3x1x5) ----linear layer----> (3x1x44).
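
The CharPOSTagger dimension changes can be traced with a short sketch that uses the sizes from this example. It is an illustration under assumptions rather than the reference solution: the vocabulary sizes, the padded length of 4, and all variable names are made up, and random indices stand in for real words and characters.

```python
import torch
import torch.nn as nn

# Sizes from the example above; the vocabulary sizes are arbitrary.
num_words, max_word_len = 3, 4
char_embedding_dim, word_embedding_dim = 2, 4
word_hidden_dim, char_hidden_dim, tagset_size = 5, 6, 44
word_vocab_size, char_vocab_size = 100, 30

word_embedding = nn.Embedding(word_vocab_size, word_embedding_dim)
char_embedding = nn.Embedding(char_vocab_size, char_embedding_dim)
char_lstm = nn.LSTM(char_embedding_dim, char_hidden_dim)
word_lstm = nn.LSTM(word_embedding_dim + char_hidden_dim, word_hidden_dim)
to_tags = nn.Linear(word_hidden_dim, tagset_size)

word_ids = torch.randint(0, word_vocab_size, (num_words,))
char_ids = torch.randint(0, char_vocab_size, (num_words, max_word_len))

char_in = char_embedding(char_ids).transpose(0, 1)   # (4, 3, 2)
char_out, _ = char_lstm(char_in)                     # (4, 3, 6)
char_repr = char_out[-1]                             # (3, 6): last time step
combined = torch.cat((word_embedding(word_ids), char_repr), dim=1)  # (3, 10)
word_out, _ = word_lstm(combined.unsqueeze(1))       # (3, 1, 5)
tag_scores = to_tags(word_out)                       # (3, 1, 44)

for name, tensor in [("char_in", char_in), ("char_out", char_out),
                     ("char_repr", char_repr), ("combined", combined),
                     ("word_out", word_out), ("tag_scores", tag_scores)]:
    print(name, tuple(tensor.shape))
```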

In [None]:
# New Hyperparameters
EMBEDDING_DIM = 4
HIDDEN_DIM = 8
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 10
CHAR_EMBEDDING_DIM = 4
CHAR_HIDDEN_DIM = 4

In [None]:
class CharPOSTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, char_embedding_dim, 
                 char_hidden_dim, char_size, vocab_size, tagset_size):
        super(CharPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer for word: that maps words to their embedding space
        # an embedding layer for character: that maps characters to their embedding space
        # a character-level LSTM layer: that finds the character-level embedding for a word
        # a word-level LSTM layer: that takes the concatenated representation per word (word embedding + char-lstm) as input and outputs hidden states
        # a linear layer: maps from hidden state space to tag space
        #############################################################################

        # an embedding layer for word: that maps words to their embedding space
        self.word_embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        
        # an embedding layer for character: that maps characters to their embedding space
        self.char_embedding_layer = nn.Embedding(char_size, char_embedding_dim)

        # a character-level LSTM layer: that finds the character-level embedding for a word
        self.char_lstm_layer = nn.LSTM(char_embedding_dim, char_hidden_dim)
        
        # a word-level LSTM layer: that takes the concatenated representation per word (word embedding + char-lstm) as input and outputs hidden states
        self.lstm_layer = nn.LSTM(embedding_dim + char_hidden_dim, hidden_dim)
        
        # a linear layer: maps from hidden state space to tag space
        self.linear_layer = nn.Linear(hidden_dim, tagset_size)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def forward(self, sentence, chars):
        # tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence and a character sequence as the arguments, 
        # find the corresponding raw scores for tags (without softmax)
        # returns:: tag_scores (Tensor)

        # Then, we feed all characters in ["  We", "love", " NLP"] into a char-embedding layer,
        # and get a tensor of shape (3, 4, char_embedding_dim). To make this tensor's shape proper for the char-level LSTM (nn.LSTM), 
        # we need to transpose this tensor, i.e. swap the first and the second dimension. 
        # So we get a tensor of shape (4, 3, char_embedding_dim), where 4 corresponds to seq_len and 3 corresponds to batch_size.
        #############################################################################

        word_embeds = self.word_embedding_layer(sentence)
        char_embeds = self.char_embedding_layer(chars)
        # torch.transpose is not in-place, so keep its result:
        # (num_words, MAX_WORD_LEN, char_dim) -> (MAX_WORD_LEN, num_words, char_dim)
        char_embeds = torch.transpose(char_embeds, 0, 1)
        # print("char_embeds: ", char_embeds.shape)
        char_lstm_out, _ = self.char_lstm_layer(char_embeds)
        
        # take the last time step of the sequence: (num_words, char_hidden_dim)
        end_char_embedding = char_lstm_out[-1]
        # print("word_embeds: ", word_embeds.shape)
        # print("end_char_embedding: ", end_char_embedding.shape)
        combined = torch.cat((word_embeds, end_char_embedding), dim=1)
        lstm_out, _ = self.lstm_layer(combined)
        tag_scores = self.linear_layer(lstm_out)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores



In [None]:

def train_char(epoch, model, loss_function, optimizer):
    model.train()
    train_loss = 0
    train_examples = 0
    for sentence, tags in training_data:
        #############################################################################
        # TODO: Implement the training method
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences. For constructing character input, you may want to left pad
        # each word to MAX_WORD_LEN first, then use prepare_sequence method to create
        # index  mappings. 
        #############################################################################
        
        #zero out the parameter gradients
        optimizer.zero_grad()

        #prepare input data (sentences, characters, and gold labels)
        sentence_in = prepare_sequence(sentence, word_to_idx)
        targets = prepare_sequence(tags, tag_to_idx)

        encoded_words = list()
        # print("sentence: ", len(sentence))

        for word in sentence:
          # word = word.rjust(MAX_WORD_LEN, ' ')
          word = ' ' * (MAX_WORD_LEN - len(word)) + word
          character_in = prepare_sequence(word, char_to_idx)
          encoded_words.append(character_in.resize_(1, MAX_WORD_LEN))
        encoded_words = torch.cat(encoded_words, 0)

        #do forward pass with current batch of input
        pred_prob = model(sentence_in.to(device), encoded_words.to(device))

        #get loss with model predictions and true labels
        loss = loss_function(pred_prob, targets.to(device))
        loss.backward()

        #update model parameters
        optimizer.step()

        #increase running total loss and the number of past training samples 
        train_loss += loss.item()
        train_examples += len(targets)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate_char(model, loss_function)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate_char(model, loss_function):
    # returns:: avg_val_loss (float)
    # returns:: val_accuracy (float)
    model.eval()
    correct = 0
    val_loss = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate method
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used. 
            #############################################################################

            #prepare input data (sentences, characters, and gold labels)
            sentence_in = prepare_sequence(sentence, word_to_idx)
            targets = prepare_sequence(tags, tag_to_idx)

            encoded_words = list()
            for word in sentence:
              word = ' ' * (MAX_WORD_LEN - len(word)) + word
              character_in = prepare_sequence(word, char_to_idx)
              encoded_words.append(character_in.resize_(1, MAX_WORD_LEN))
            encoded_words = torch.cat(encoded_words, 0)

            #do forward pass with current batch of input
            pred_prob = model(sentence_in.to(device), encoded_words.to(device))

            #get loss with model predictions and true labels
            loss = loss_function(pred_prob, targets.to(device))

            #get the predicted labels
            pred_labels = torch.argmax(pred_prob, dim=1)

            #get number of correct predictions
            correct += (pred_labels.cpu() == targets).sum().item()
            #increase running total loss and the number of past valid samples
            val_loss += loss.item()
            val_examples += len(targets)

            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy

In [None]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
# Hint: you may want to use reduction='sum' in the CrossEntropyLoss function

# New Hyperparameters
# EMBEDDING_DIM = 4
# HIDDEN_DIM = 8
# LEARNING_RATE = 0.1
# LSTM_LAYERS = 1
# DROPOUT = 0
# EPOCHS = 10
# CHAR_EMBEDDING_DIM = 4
# CHAR_HIDDEN_DIM = 4

# self, embedding_dim, hidden_dim, char_embedding_dim, 
#                  char_hidden_dim, char_size, vocab_size, tagset_size

#############################################################################
model = CharPOSTagger(embedding_dim = EMBEDDING_DIM,
                      hidden_dim    = HIDDEN_DIM,
                      char_embedding_dim = CHAR_EMBEDDING_DIM,
                      char_hidden_dim = CHAR_HIDDEN_DIM,
                      char_size = len(char_to_idx.keys()),
                      vocab_size    = len(word_to_idx.keys()),
                      tagset_size   = len(tag_to_idx.keys())).to(device)

loss_function = nn.CrossEntropyLoss(reduction="sum")

optimizer = torch.optim.Adagrad(model.parameters(), lr = LEARNING_RATE)


#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
for epoch in range(1, EPOCHS + 1): 
    train_char(epoch, model, loss_function, optimizer)

Epoch: 1/10	Avg Train Loss: 1.1944	Avg Val Loss: 0.8803	 Val Accuracy: 74
Epoch: 2/10	Avg Train Loss: 0.7410	Avg Val Loss: 0.6854	 Val Accuracy: 82
Epoch: 3/10	Avg Train Loss: 0.5827	Avg Val Loss: 0.5883	 Val Accuracy: 85
Epoch: 4/10	Avg Train Loss: 0.4935	Avg Val Loss: 0.5308	 Val Accuracy: 86
Epoch: 5/10	Avg Train Loss: 0.4324	Avg Val Loss: 0.4905	 Val Accuracy: 88
Epoch: 6/10	Avg Train Loss: 0.3886	Avg Val Loss: 0.4628	 Val Accuracy: 88
Epoch: 7/10	Avg Train Loss: 0.3506	Avg Val Loss: 0.4364	 Val Accuracy: 89
Epoch: 8/10	Avg Train Loss: 0.3223	Avg Val Loss: 0.4200	 Val Accuracy: 90
Epoch: 9/10	Avg Train Loss: 0.3005	Avg Val Loss: 0.4081	 Val Accuracy: 90
Epoch: 10/10	Avg Train Loss: 0.2824	Avg Val Loss: 0.3983	 Val Accuracy: 90


**Sanity Check!** Under the default hyperparameter setting, after 5 epochs you should be able to get at least 85% accuracy on the validation set.
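
The average losses printed above are a summed cross-entropy divided by the number of tokens, which is why the hint suggests `reduction='sum'`. The following is a small sketch in plain Python (no PyTorch) of the difference, using made-up logits and a hypothetical helper `token_nll`: dividing the summed loss by the token count gives a true per-token average, whereas averaging per-sentence means would give short sentences more weight.

```python
import math


def token_nll(logits, gold):
    """Cross-entropy of one token: -log softmax(logits)[gold]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]


# Two toy "sentences": made-up logits over 3 tags, plus gold tag indices.
sentences = [
    ([[2.0, 0.5, 0.1]], [0]),                                          # 1 token
    ([[0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]], [1, 2, 0]),  # 3 tokens
]

total_loss, total_tokens = 0.0, 0
per_sentence_means = []
for logits, gold in sentences:
    losses = [token_nll(l, g) for l, g in zip(logits, gold)]
    total_loss += sum(losses)                             # summed loss, as with reduction="sum"
    total_tokens += len(losses)
    per_sentence_means.append(sum(losses) / len(losses))  # per-sentence mean, as with reduction="mean"

per_token_avg = total_loss / total_tokens
mean_of_means = sum(per_sentence_means) / len(per_sentence_means)

print(f"per-token average: {per_token_avg:.4f}")
print(f"mean of sentence means: {mean_of_means:.4f}")
```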

### Part 3.1 Error analysis
Write a method to generate predictions for the validation set.
Create lists of words, tags predicted by the model and ground truth tags. 

Then use these lists to carry out error analysis to find the top-10 types of errors made by the model.

This part is very similar to part 1.7. You may want to refer to your implementation there.
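
The counting step can be sketched independently of the model. The example below is one possible approach using `collections.Counter`, not the required implementation; the function name `top_confusions` and the toy word and tag lists are hypothetical.

```python
from collections import Counter, defaultdict


def top_confusions(words, predicted, gold, k=10, n_examples=5):
    """Return up to k (predicted_tag, gold_tag, frequency, example_words) tuples."""
    counts = Counter()
    examples = defaultdict(list)
    for word, pred, true in zip(words, predicted, gold):
        if pred != true:
            counts[(pred, true)] += 1
            # keep only the first few example words for each error type
            if len(examples[(pred, true)]) < n_examples:
                examples[(pred, true)].append(word)
    # most_common sorts error types by descending frequency
    return [(pred, true, freq, examples[(pred, true)])
            for (pred, true), freq in counts.most_common(k)]


# Hypothetical toy data standing in for the lists produced by generate_predictions.
words = ["The", "plans", "run", "Market", "plans"]
predicted = ["DT", "NNS", "NN", "NNP", "NNS"]
gold = ["DT", "VBZ", "VB", "NN", "VBZ"]

result = top_confusions(words, predicted, gold)
for row in result:
    print(row)
```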

In [None]:
#############################################################################
# TODO: Generate predictions for val_data
# Create lists of words, tags predicted by the model and ground truth tags.
# Hint: It should look very similar to the evaluate function.
#############################################################################
def generate_predictions(model, val_data):
    # returns:: word_list (str list)
    # returns:: model_tags (str list)
    # returns:: gt_tags (str list)
    # Your code here

    word_list = []
    model_tags = []
    gt_tags = []
    model.eval()
    with torch.no_grad():
        for sentence, tags in val_data:
            
            #prepare input data (sentences and gold labels)
            sentence_in = prepare_sequence(sentence, word_to_idx)
            targets = prepare_sequence(tags, tag_to_idx)

            encoded_words = list()
            for word in sentence:
              word = ' ' * (MAX_WORD_LEN - len(word)) + word
              character_in = prepare_sequence(word, char_to_idx)
              encoded_words.append(character_in.resize_(1, MAX_WORD_LEN))
            encoded_words = torch.cat(encoded_words, 0)

            #do forward pass with current batch of input
            pred_prob = model(sentence_in.to(device), encoded_words.to(device))

            #get the predicted labels
            pred_labels = torch.argmax(pred_prob, 1)

            #collect each word with its predicted tag and its gold tag
            pred_tags = pred_labels.tolist()
            for i in range(len(targets)):
              model_tags.append(idx_to_tag[pred_tags[i]])
              gt_tags.append(tags[i])
              word_list.append(sentence[i])


    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################


    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
#############################################################################
def error_analysis(word_list, model_tags, gt_tags):
    # returns: errors (list of tuples)
    # Your code here

    error_dict = {}
    example_words = {}
    for i in range(len(model_tags)):
      if (model_tags[i] != gt_tags[i]):
        if (model_tags[i], gt_tags[i]) not in error_dict.keys():
          error_dict[(model_tags[i], gt_tags[i])] = 1
          example_words[(model_tags[i], gt_tags[i])] = [word_list[i]]
        else:
          error_dict[(model_tags[i], gt_tags[i])] += 1
          example_words[(model_tags[i], gt_tags[i])].append(word_list[i])

    sorted_error_dict = sorted(error_dict.items(), key=lambda kv:(kv[1], kv[0]), reverse=True)
    errors = []
    for (tags, frequency) in sorted_error_dict:
      errors.append([tags, frequency, example_words[tags][:5]])

    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################

    return errors

word_list, model_tags, gt_tags = generate_predictions(model, val_data)
errors = error_analysis(word_list, model_tags, gt_tags)

for i in errors[:10]:
  print(i)

[('NN', 'NNP'), 246, ['Harrison', 'Hickman', 'English', 'CSC', 'Himont']]
[('JJ', 'NN'), 198, ['commercial', 'hotdog', 'corridor', 'stockpile', 'net']]
[('NNP', 'NN'), 176, ['award', 'park', 'Market', 'softness', 'tandem']]
[('JJ', 'NNP'), 158, ['Bryan', 'mature', 'British', 'Telerate', 'Employee']]
[('NN', 'JJ'), 154, ['snake-oil', 'gullible', 'previous', '60-inch', 'net']]
[('NNP', 'JJ'), 153, ['plain', 'California', 'West', 'FEDERAL', 'South']]
[('VBN', 'VBD'), 149, ['held', 'permitted', 'ended', 'conceded', 'Warned']]
[('NNS', 'VBZ'), 131, ['implies', 'plans', 'approaches', 'targets', 'follows']]
[('VBD', 'VBN'), 123, ['made', 'ended', 'requested', 'acquired', 'offered']]
[('JJ', 'VBP'), 108, ['have', 'see', 'have', 'have', 'get']]


**Report your findings here.**  
What kinds of errors does the character-level model make as compared to the original model, and why do you think it made them? 

In the original model, many of the errors were words incorrectly tagged as nouns. The character-level model still mislabels words as nouns, but less often than the original model did. This is probably because the character-level LSTM can pick up on suffixes that signal a part of speech, such as -ly for adverbs or -ing for present participles.
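
One way to put a number on the noun over-prediction described above is to total the rows printed earlier by predicted tag. The sketch below hard-codes the ten (predicted, gold, frequency) rows from the output above. It covers only the top-10 error types, not every error, so the share is relative to those rows. Treating any predicted tag that starts with `NN` as a noun is an assumption of this sketch.

```python
# Top-10 error types copied from the printed output: ((predicted, gold), frequency).
top10 = [
    (("NN", "NNP"), 246),
    (("JJ", "NN"), 198),
    (("NNP", "NN"), 176),
    (("JJ", "NNP"), 158),
    (("NN", "JJ"), 154),
    (("NNP", "JJ"), 153),
    (("VBN", "VBD"), 149),
    (("NNS", "VBZ"), 131),
    (("VBD", "VBN"), 123),
    (("JJ", "VBP"), 108),
]

total = sum(freq for _, freq in top10)
# Assumption: a predicted tag beginning with "NN" (NN, NNS, NNP, ...) counts as a noun.
noun_pred = sum(freq for (pred, _), freq in top10 if pred.startswith("NN"))

print(f"{noun_pred} of {total} top-10 errors ({100 * noun_pred / total:.1f}%)")
# → 860 of 1596 top-10 errors (53.9%)
```

Running the same computation on the Part 1.7 output would give the matching figure for the original model.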

### Part 4: Submit Your Homework
This is the end. Congratulations!  

Now, follow the steps below to submit your homework in [Gradescope](https://www.gradescope.com/courses/345683):

1. Rename this ipynb file to 'CS4650_p2_GTusername.ipynb'. We recommend removing any extraneous cells and print statements, clearing all outputs, and using the Runtime --> Run all tool to make sure all output is up to date. Leaving comments in your code to help us understand your operations will also assist the teaching staff in grading; this is recommended but not required.
2. Click on the menu 'File' --> 'Download' --> 'Download .py'.
3. Click on the menu 'File' --> 'Download' --> 'Download .ipynb'.
4. Download the notebook as a .pdf document. Make sure the output from Parts 1.6, 2, and 3 is captured so we can see how the loss and accuracy change during training.
5. Upload all 3 files to Gradescope.

