
In [None]:
# Licensing Information:  You are free to use or extend this project for
# educational purposes provided that (1) you do not distribute or publish
# solutions, (2) you retain this notice, and (3) you provide clear
# attribution to The Georgia Institute of Technology, including a link to https://aritter.github.io/CS-7650/
# Attribution Information: This assignment was developed at The Georgia Institute of Technology
# by Alan Ritter (alan.ritter@cc.gatech.edu)

# 📘 Named Entity Recognition (NLP Project 2)
**Course: CS 4650 "Natural Language Processing"**  
**Institution: Georgia Institute of Technology**  
**Semester: Spring 2025**  
**Instructor: Dr. Wei Xu**  
**Teaching Assistants: Tarek Naous, Jonathan Zheng, Xiaofeng Wu, Yao Dou**

---

## 🚀 Introduction
In this assignment, you will implement a bidirectional LSTM-CNN-CRF for sequence labeling, following [this paper by Xuezhe Ma and Ed Hovy](https://www.aclweb.org/anthology/P16-1101.pdf), on the CoNLL named entity recognition dataset. Before starting the assignment, we recommend reading the Ma and Hovy paper; it is quite helpful for understanding the big picture of what you are going to implement.

---

## 💻 Utilizing GPUs
For enhanced training performance, consider utilizing GPUs. To switch your instance to use a GPU, navigate to: Runtime -> Change runtime type -> Hardware accelerator.

---

## 🕯️ Getting Started with PyTorch
This project is centered around PyTorch, a powerful library for building neural networks. If you're new to PyTorch or need a quick refresher, the following resources are highly recommended:
- **Introduction to PyTorch**: Explore these informative [slides](https://cocoxu.github.io/CS4650_spring2025/slides/PyTorch_tutorial.pdf) for a comprehensive introduction.
- **PyTorch Basics**: Gain hands-on experience with this interactive [notebook](http://bit.ly/pytorchbasics).
- **NLP Task Example**: Understand PyTorch's application in NLP through this [text sentiment analysis notebook](http://bit.ly/pytorchexample).

---

**As a first step, ensure you have a personal copy of this notebook. You can do so by downloading it to your local drive (File -> Download -> Download .ipynb) or by saving a copy to your Google Drive (File -> Save a copy in Drive).**


## Imports + GPU

First, let's import some libraries and make sure the runtime has access to a GPU.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

print(f'GPU available: {torch.cuda.is_available()}')

Sat Feb 22 18:14:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P8             12W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Download the Data

Run the following code to download the English part of the CoNLL 2003 dataset, the evaluation script and pre-filtered GloVe embeddings we are providing for this data.

In [None]:
#CoNLL 2003 data
!wget https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train
!wget https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.testa
!wget https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.testb
!cat eng.train | awk '{print $1 "\t" $4}' > train
!cat eng.testa | awk '{print $1 "\t" $4}' > dev
!cat eng.testb | awk '{print $1 "\t" $4}' > test

#Evaluation Script
!wget https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl

#Pre-filtered GloVe embeddings
!wget https://raw.githubusercontent.com/aritter/aritter.github.io/master/files/glove.840B.300d.conll_filtered.txt


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## CoNLL Data Format

Run the following cell to see a sample of the data in CoNLL format. As you can see, each line in the file represents a word and its labeled named entity tag in BIO format. A blank line is used to separate sentences.
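
To make the tagging scheme concrete, here is a minimal sketch of how a tag sequence can be turned into entity spans. It is not part of the starter code: the function name `extract_spans` is an assumption made for illustration. Note that in the sample of this file, entities begin with `I-` rather than `B-`, so the sketch treats a run of same-type `I-` tags as one entity and only uses `B-` to split two adjacent entities of the same type.

```python
def extract_spans(tags):
    """Return (entity_type, start, end) spans, end exclusive, from a tag sequence."""
    spans = []
    start, current = None, None
    # A trailing "O" sentinel flushes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("-")
        inside = prefix in ("B", "I") and etype != ""
        # Close the open span if this tag does not continue it.
        if current is not None and (not inside or prefix == "B" or etype != current):
            spans.append((current, start, i))
            start, current = None, None
        # Open a new span if we are inside an entity and none is open.
        if inside and current is None:
            start, current = i, etype
    return spans


# Tags for "EU rejects German call to boycott British lamb ."
example_tags = ["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]
print(extract_spans(example_tags))
```

Phrase-level evaluation scripts such as `conlleval.pl` score predictions on spans like these rather than on individual tokens.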

In [None]:
!head -n 20 train

-DOCSTART-	O
	
EU	I-ORG
rejects	O
German	I-MISC
call	O
to	O
boycott	O
British	I-MISC
lamb	O
.	O
	
Peter	I-PER
Blackburn	I-PER
	
BRUSSELS	I-LOC
1996-08-22	O
	
The	O
European	I-ORG


## Data Preprocessing Diagram

Here is a simplified diagram showing what is happening in the data processing sections coming up. Make sure to read through these sections, as you will need to use these functions in your code!

![](https://drive.google.com/uc?export=view&id=13yYcgvEyLC0m3RZegH8V6fErf-8DRtLd)

## Reading in the Data

Below we provide a bit of code to read in data in the CoNLL format. This also reads in the filtered GloVe embeddings to save you some effort; we will discuss this more later.

In [None]:
#Read in the training data
def read_conll_format(filename):
    #Each sentence is wrapped in -START-/-END- tokens (with tags START/END).
    (words, tags, currentSent, currentTags) = ([],[],['-START-'],['START'])
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line == "":
                #A blank line ends the current sentence.
                currentSent.append('-END-')
                currentTags.append('END')
                words.append(currentSent)
                tags.append(currentTags)
                (currentSent, currentTags) = (['-START-'], ['START'])
            else:
                (word, tag) = line.split()
                currentSent.append(word)
                currentTags.append(tag)
    return (words, tags)

def sentences2char(sentences):
    return [[['start'] + [c for c in w] + ['end'] for w in l] for l in sentences]


(sentences_train, tags_train) = read_conll_format("train")
(sentences_dev, tags_dev)     = read_conll_format("dev")

print("The second sentence in train set:", sentences_train[2])
print("The NER label of the sentence:   ", tags_train[2])

sentencesChar = sentences2char(sentences_train)

print("The char representation of the sentence:", sentencesChar[2])

The second sentence in train set: ['-START-', 'Peter', 'Blackburn', '-END-']
The NER label of the sentence:    ['START', 'I-PER', 'I-PER', 'END']
The char representation of the sentence: [['start', '-', 'S', 'T', 'A', 'R', 'T', '-', 'end'], ['start', 'P', 'e', 't', 'e', 'r', 'end'], ['start', 'B', 'l', 'a', 'c', 'k', 'b', 'u', 'r', 'n', 'end'], ['start', '-', 'E', 'N', 'D', '-', 'end']]


In [None]:
#Read GloVe embeddings.
def read_GloVe(filename):
    #Each line holds a word followed by its space-separated vector components.
    embeddings = {}
    with open(filename) as f:
        for line in f:
            fields = line.strip().split(" ")
            word = fields[0]
            embeddings[word] = [float(x) for x in fields[1:]]
    return embeddings

GloVe = read_GloVe("glove.840B.300d.conll_filtered.txt")

print("The GloVe word embedding of the word 'the':", GloVe["the"])
print("dimension of glove embedding:", len(GloVe["the"]))

The GloVe word embedding of the word 'the': [0.27204, -0.06203, -0.1884, 0.023225, -0.018158, 0.0067192, -0.13877, 0.17708, 0.17709, 2.5882, -0.35179, -0.17312, 0.43285, -0.10708, 0.15006, -0.19982, -0.19093, 1.1871, -0.16207, -0.23538, 0.003664, -0.19156, -0.085662, 0.039199, -0.066449, -0.04209, -0.19122, 0.011679, -0.37138, 0.21886, 0.0011423, 0.4319, -0.14205, 0.38059, 0.30654, 0.020167, -0.18316, -0.0065186, -0.0080549, -0.12063, 0.027507, 0.29839, -0.22896, -0.22882, 0.14671, -0.076301, -0.1268, -0.0066651, -0.052795, 0.14258, 0.1561, 0.05551, -0.16149, 0.09629, -0.076533, -0.049971, -0.010195, -0.047641, -0.16679, -0.2394, 0.0050141, -0.049175, 0.013338, 0.41923, -0.10104, 0.015111, -0.077706, -0.13471, 0.119, 0.10802, 0.21061, -0.051904, 0.18527, 0.17856, 0.041293, -0.014385, -0.082567, -0.035483, -0.076173, -0.045367, 0.089281, 0.33672, -0.22099, -0.0067275, 0.23983, -0.23147, -0.88592, 0.091297, -0.012123, 0.013233, -0.25799, -0.02972, 0.016754, 0.01369, 0.32377, 0.039546, 0.

## Mapping Tokens to Indices

As in the last project, we will need to convert words in the dataset to numeric indices, so they can be presented as input to a neural network.  Code to handle this for you with sample usage is provided below.
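
As a rough illustration of the idea, the sketch below builds a toy vocabulary with index 0 reserved for padding and index 1 for unknown words, and randomly maps singleton words to UNK during training. The names `build_vocab` and `encode` and the toy sentences are assumptions for illustration only; unlike the provided code, the sketch sorts the vocabulary so that the indices are deterministic.

```python
import random
from collections import Counter

PAD, UNK = 0, 1  # reserved indices, mirroring the convention used in this notebook


def build_vocab(sentences):
    """Map each distinct token to an index starting at 2 (0 and 1 are reserved)."""
    tokens = sorted({w for sent in sentences for w in sent})
    return {w: i + 2 for i, w in enumerate(tokens)}


def encode(sentence, vocab, singletons=frozenset(), train=False, rng=random):
    """Look up indices; in training, replace singleton words with UNK about half the time."""
    ids = []
    for w in sentence:
        if train and w in singletons and rng.random() > 0.5:
            ids.append(UNK)
        else:
            ids.append(vocab.get(w, UNK))  # unseen words fall back to UNK
    return ids


toy = [["EU", "rejects", "German", "call"], ["German", "call", "rejected"]]
vocab = build_vocab(toy)
counts = Counter(w for sent in toy for w in sent)
singletons = {w for w, c in counts.items() if c == 1}

print(encode(["German", "lamb"], vocab))  # "lamb" is not in the toy vocabulary
```

Replacing rare training words with UNK gives the model some exposure to the UNK embedding, which it will otherwise only meet at test time.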

In [None]:
#Create mappings between tokens and indices.

from collections import Counter
import random

#Will need this later to remove 50% of words that only appear once in the training data from the vocabulary (and don't have GloVe embeddings).
wordCounts = Counter([w for l in sentences_train for w in l])
charCounts = Counter([c for l in sentences_train for w in l for c in w])
singletons = set([w for (w,c) in wordCounts.items() if c == 1 and w not in GloVe])
charSingletons = set([w for (w,c) in charCounts.items() if c == 1])

#Build dictionaries to map from words, characters to indices and vice versa.
#Save first two words in the vocabulary for padding and "UNK" token.
word2i = {w:i+2 for i,w in enumerate(set([w for l in sentences_train for w in l] + list(GloVe.keys())))}
char2i = {w:i+2 for i,w in enumerate(set([c for l in sentencesChar for w in l for c in w]))}
i2word = {i:w for w,i in word2i.items()}
i2char = {i:w for w,i in char2i.items()}

vocab_size = max(word2i.values()) + 1
char_vocab_size = max(char2i.values()) + 1

#Tag dictionaries.
tag2i = {w:i for i,w in enumerate(set([t for l in tags_train for t in l]))}
i2tag = {i:t for t,i in tag2i.items()}

#When training, randomly replace singletons with UNK tokens sometimes to simulate situation at test time.
def getDictionaryRandomUnk(w, dictionary, train=False):
    if train and (w in singletons and random.random() > 0.5):
        return 1
    else:
        return dictionary.get(w, 1)

#Map a list of sentences from words to indices.
def sentences2indices(words, dictionary, train=False):
    #Index 1 => UNK
    return [[getDictionaryRandomUnk(w,dictionary, train=train) for w in l] for l in words]

#Map a list of sentences (each word given as a list of characters) to character indices.
def sentences2indicesChar(chars, dictionary):
    #Index 1 => UNK
    return [[[dictionary.get(c,1) for c in w] for w in l] for l in chars]

#Indices
X       = sentences2indices(sentences_train, word2i, train=True)
X_char  = sentences2indicesChar(sentencesChar, char2i)
Y       = sentences2indices(tags_train, tag2i)

print("vocab size:", vocab_size)
print("char vocab size:", char_vocab_size)
print()

print("index of word 'the':", word2i["the"])
print("word of index 253:", i2word[253])
print()

#Print out some examples of what the training inputs will look like
for i in range(10):
    print(" ".join([i2word.get(w,'UNK') for w in X[i]]))

vocab size: 29148
char vocab size: 88

index of word 'the': 10705
word of index 253: cantons

-START- -DOCSTART- -END-
-START- EU rejects German call to boycott British lamb . -END-
-START- Peter Blackburn -END-
-START- BRUSSELS 1996-08-22 -END-
-START- The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep . -END-
-START- Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . -END-
-START- " We do n't support any such recommendation because we do n't see any grounds for it , " the Commission 's chief spokesman Nikolaus van der Pas told a news briefing . -END-
-START- He said further scientific study was required and if it was found that action was needed it should be taken by the European Union . -EN

## Padding and Batching

In this assignment, you should train your models using minibatch SGD. To present multiple sentences to the network at the same time, we need to pad them to the same length. We use [torch.nn.utils.rnn.pad_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html) to do so.

Below we provide code to prepare batches of data for the network: inputs are padded with index 0, and a mask marks which positions are real tokens (1) and which are padding (0).

**Side Note:** PyTorch includes utilities in [`torch.utils.data`](https://pytorch.org/docs/stable/data.html) to help with padding, batching, shuffling and some other things, but for this assignment we will do everything from scratch to help you see exactly how this works.
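
To show what padding and masking compute, here is a plain-Python sketch that works on lists rather than tensors. The function name `pad_batch` is an assumption for illustration; the notebook's own `prepare_input` uses `pad_sequence` on tensors instead.

```python
def pad_batch(seqs, pad_value=0):
    """Right-pad each sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    # 1.0 marks a real token, 0.0 marks a padded position.
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in seqs]
    return padded, mask


padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
```

The mask is what lets later steps (for example, a loss or a CRF) ignore the padded positions.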

In [None]:
#Pad inputs to max sequence length (for batching)
def prepare_input(X_list):
    X_padded = torch.nn.utils.rnn.pad_sequence([torch.as_tensor(l) for l in X_list], batch_first=True).type(torch.LongTensor) # padding the sequences with 0
    X_mask   = torch.nn.utils.rnn.pad_sequence([torch.as_tensor([1.0] * len(l)) for l in X_list], batch_first=True).type(torch.FloatTensor) # consisting of 0 and 1, 0 for padded positions, 1 for non-padded positions
    return (X_padded, X_mask)

#Maximum word length (for character representations)
MAX_CLEN = 32

def prepare_input_char(X_list):
    MAX_SLEN = max([len(l) for l in X_list])
    X_padded  = [l + [[]]*(MAX_SLEN-len(l))  for l in X_list]
    X_padded  = [[w[0:MAX_CLEN] for w in l] for l in X_padded]
    X_padded  = [[w + [1]*(MAX_CLEN-len(w)) for w in l] for l in X_padded]
    return torch.as_tensor(X_padded).type(torch.LongTensor)

#Pad outputs using one-hot encoding
def prepare_output_onehot(Y_list, NUM_TAGS=max(tag2i.values())+1):
    Y_onehot = [torch.zeros(len(l), NUM_TAGS) for l in Y_list]
    for i in range(len(Y_list)):
        for j in range(len(Y_list[i])):
            # print(type(Y_list[i][j]))
            Y_onehot[i][j,Y_list[i][j]] = 1.0
    Y_padded = torch.nn.utils.rnn.pad_sequence(Y_onehot, batch_first=True).type(torch.FloatTensor)
    return Y_padded

print("max slen:", max([len(x) for x in X_char]))


(X_padded, X_mask) = prepare_input(X)
X_padded_char      = prepare_input_char(X_char)
Y_onehot           = prepare_output_onehot(Y)

print("X_padded:", X_padded.shape)
print("X_mask:", X_mask.shape)
print("X_padded_char:", X_padded_char.shape)
print("Y_onehot:", Y_onehot.shape)

max slen: 115
X_padded: torch.Size([14987, 115])
X_mask: torch.Size([14987, 115])
X_padded_char: torch.Size([14987, 115, 32])
Y_onehot: torch.Size([14987, 115, 10])


## Model Diagram

Below is a simplified diagram of the model you will be implementing. A lot is abstracted away here, but it should help you get the bigger picture of the model you are implementing. Red is part 1, green is part 2, and purple is the extra credit, part 3.


![](https://drive.google.com/uc?export=view&id=1GEkDtTxu-x060pltpiHi5LtcaDpLJdw3)

## **Your code starts here:** Basic LSTM Tagger (10 points)

OK, now you should have everything you need to get started.

Recall that your goal is to implement the BiLSTM-CNN-CRF, as described in [(Ma and Hovy, 2016)](https://www.aclweb.org/anthology/P16-1101.pdf). This is a relatively complex network with various components. Below we provide starter code that breaks your implementation down into increasingly complex versions of the final model, starting with a basic LSTM tagger. This way you can be confident that each part works correctly before you add more complexity. This is generally a good approach for complex models: buggy PyTorch code often partially works but produces worse results than a correct implementation, so it is hard to tell whether an added component is helping or hurting. Also, if you cannot match published results, it is hard to know which component of your model has the problem (or even whether the problem lies in the published result!).

**Fill in the functions marked as `TODO` in the code block below. Please make your code changes only within the given commented block (#####).** If everything is working correctly, you should be able to achieve an **F1 score of 0.86 on the dev set and 0.82 on the test set (with GloVe embeddings)**. You are required to initialize word embeddings with GloVe later, but you can randomly initialize the word embeddings in the beginning.
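
For reference, the F1 score reported by `conlleval.pl` is computed at the phrase (entity) level from three counts: correct, predicted, and gold phrases. The helper below is a small sketch of that arithmetic (the name `prf1` is an assumption); the counts are taken from the epoch 0 dev-set log printed later in this notebook.

```python
def prf1(num_correct, num_predicted, num_gold):
    """Phrase-level precision, recall, and F1 from raw counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# "5942 phrases; found: 5783 phrases; correct: 5055"
p, r, f = prf1(num_correct=5055, num_predicted=5783, num_gold=5942)
print(f"precision={p:.2%} recall={r:.2%} F1={f:.2%}")
```

These values match the `precision`, `recall`, and `FB1` fields of that log line (87.41%, 85.07%, 86.23%).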

In [None]:
#####################################################################################
#TODO: Add imports if needed:

#####################################################################################
seed = 42
torch.manual_seed(seed)
random.seed(seed)

class BasicLSTMtagger(nn.Module):
    def __init__(self, DIM_EMB=10, DIM_HID=10):
        super(BasicLSTMtagger, self).__init__()
        NUM_TAGS = max(tag2i.values())+1


        (self.DIM_EMB, self.NUM_TAGS) = (DIM_EMB, NUM_TAGS)
        #####################################################################################
        #TODO: initialize parameters - embedding layer, nn.LSTM, nn.Linear and nn.LogSoftmax

        self.vocab_size = max(word2i.values())+1

        self.word_embeddings = nn.Embedding(self.vocab_size, self.DIM_EMB, padding_idx = 0)
        self.init_glove(GloVe)

        self.lstm = nn.LSTM(self.DIM_EMB, DIM_HID, bidirectional=True, batch_first=True)

        self.linear = nn.Linear(2 * DIM_HID, self.NUM_TAGS)

        self.LogSoftmax = nn.LogSoftmax(dim = 2)

        #####################################################################################

    def forward(self, X, train=False):
        #####################################################################################
        #TODO: Implement the forward computation.

        embed = self.word_embeddings(X)

        lstm_out, _ = self.lstm(embed)

        linear_out = self.linear(lstm_out)

        output = self.LogSoftmax(linear_out)

        return output


        #####################################################################################

    def init_glove(self, GloVe):
        #####################################################################################
        #TODO: initialize word embeddings using GloVe (you can skip this part in your first version, if you want, see instructions below).


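        #Start from small random vectors; words without a GloVe entry keep these values.
        #Note: copy_ below also overwrites the padding row (index 0) with random values.
        embedding_matrix = torch.FloatTensor(self.vocab_size, self.DIM_EMB).uniform_(-0.1, 0.1)

        #Overwrite rows for words that have a pre-trained vector (DIM_EMB must match the GloVe dimension).
        for word, idx in word2i.items():
            if word in GloVe:
                embedding_matrix[idx] = torch.tensor(GloVe[word], dtype=torch.float32)

        #Load the matrix into the embedding layer and freeze it.
        self.word_embeddings.weight.data.copy_(embedding_matrix)
        self.word_embeddings.weight.requires_grad = False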

        #####################################################################################

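    def inference(self, sentences):
        X, X_mask       = prepare_input(sentences2indices(sentences, word2i))
        device = next(self.parameters()).device  #Run on whichever device the model lives on
        with torch.no_grad():                    #No gradients are needed for prediction
            pred = self.forward(X.to(device)).argmax(dim=2)
        return [[i2tag[pred[i,j].item()] for j in range(len(sentences[i]))] for i in range(len(sentences))]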

    def print_predictions(self, words, tags):
        Y_pred = self.inference(words)
        for i in range(len(words)):
            print("----------------------------")
            print(" ".join([f"{words[i][j]}/{Y_pred[i][j]}/{tags[i][j]}" for j in range(len(words[i]))]))
            print("Predicted:\t", Y_pred[i])
            print("Gold:\t\t", tags[i])

    def write_predictions(self, sentences, outFile):
        #Use a context manager so the file is flushed and closed before it is read by the evaluation script.
        with open(outFile, 'w') as fOut:
            for s in sentences:
                y = self.inference([s])[0]
                fOut.write("\n".join(y[1:len(y)-1]))  #Skip start and end tokens
                fOut.write("\n\n")

#The following code will initialize a model and test that your forward computation runs without errors.
lstm_test   = BasicLSTMtagger(DIM_HID=7, DIM_EMB=300)
lstm_output = lstm_test.forward(prepare_input(X[0:5])[0])
Y_onehot    = prepare_output_onehot(Y[0:5])

#Check the shape of the lstm_output and one-hot label tensors.
print("lstm output shape:", lstm_output.shape)
print("Y onehot shape:", Y_onehot.shape)

lstm output shape: torch.Size([5, 32, 10])
Y onehot shape: torch.Size([5, 32, 10])


In [None]:
#Read in the data

(sentences_dev, tags_dev)     = read_conll_format('dev')
(sentences_train, tags_train) = read_conll_format('train')
(sentences_test, tags_test)   = read_conll_format('test')

## Train your Model (10 points)

Next, implement the function below to train your basic BiLSTM tagger. See [torch.nn.LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). Make sure to save your predictions on the test set (`test_pred_lstm.txt`) for submission to GradeScope. Feel free to change the number of epochs, optimizer, learning rate, and batch size.
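
One detail worth checking in any batched training loop is how padded positions enter the loss. With one-hot targets padded by all-zero rows, an `argmax` over the tag dimension yields index 0 at padded positions, so those positions would be scored against tag 0 unless they are masked out. As a hedged sketch (not part of the starter code), the function below averages the negative log-likelihood over real tokens only, using a 0/1 mask like the one `prepare_input` returns. The name `masked_nll` and the toy tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def masked_nll(log_probs, targets, mask):
    """Average negative log-likelihood over non-padded positions only.

    log_probs: (batch, seq_len, num_tags) log-probabilities
    targets:   (batch, seq_len) gold tag indices (any value at padded positions)
    mask:      (batch, seq_len) 1.0 for real tokens, 0.0 for padding
    """
    # Per-token loss of shape (batch, seq_len); nll_loss expects (N, C) inputs.
    per_token = F.nll_loss(
        log_probs.reshape(-1, log_probs.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    # Zero out padded positions and normalize by the number of real tokens.
    return (per_token * mask).sum() / mask.sum()


# Toy batch: 2 sentences, max length 3, 4 tags; the second sentence has length 2.
torch.manual_seed(0)
log_probs = torch.log_softmax(torch.randn(2, 3, 4), dim=2)
targets = torch.tensor([[0, 1, 2], [3, 1, 0]])
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
loss = masked_nll(log_probs, targets, mask)
print(loss.item())
```

Whether masking changes the final scores in practice is an empirical question; the sketch only shows one way to keep padding out of the objective.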

In [None]:
#Training

#####################################################################################
#TODO: Add imports if needed:

from torch.optim import Adam, AdamW, SGD


#####################################################################################

from random import sample
from tqdm import tqdm
import os
import subprocess
import random

def shuffle_sentences(sentences, tags):
    shuffled_sentences = []
    shuffled_tags      = []
    indices = list(range(len(sentences)))
    random.shuffle(indices)
    for i in indices:
        shuffled_sentences.append(sentences[i])
        shuffled_tags.append(tags[i])
    return (shuffled_sentences, shuffled_tags)


def train_basic_lstm(sentences, tags, lstm):
    #####################################################################################
    #TODO: initialize optimizer and other parameters

    optim = AdamW(lstm.parameters(), lr=0.01)
    nEpochs = 15
    batchSize = 64
    loss_function = nn.NLLLoss()

    lstm.train()
    lstm.cuda()


    #####################################################################################

    for epoch in range(nEpochs):
        totalLoss = 0.0

        (sentences_shuffled, tags_shuffled) = shuffle_sentences(sentences, tags)
        for batch in tqdm(range(0, len(sentences), batchSize), leave=False):
            #####################################################################################
            #TODO: Implement gradient update on a batch of data.

            optim.zero_grad()

            input_data = prepare_input(sentences2indices(sentences_shuffled[batch:batch+batchSize], word2i, train=True))[0].cuda()

            lstm_output = lstm.forward(input_data)

            Y_onehot = prepare_output_onehot(sentences2indices(tags_shuffled[batch:batch+batchSize], tag2i)).cuda()

            loss = loss_function(lstm_output.view(-1, lstm.NUM_TAGS), Y_onehot.view(-1, lstm.NUM_TAGS).argmax(dim=1))

            loss.backward()

            optim.step()

            totalLoss += loss.item()  #Accumulate a Python float so the autograd graph is not kept alive

            #####################################################################################

        print(f"loss on epoch {epoch} = {totalLoss}")
        lstm.write_predictions(sentences_dev, 'dev_pred')   #Performance on dev set
        print('conlleval:')
        print(subprocess.Popen('paste dev dev_pred | perl conlleval.pl -d "\t"', shell=True, stdout=subprocess.PIPE,stderr=subprocess.STDOUT).communicate()[0].decode('UTF-8'))

        if epoch % 10 == 0:
            s = sample(range(len(sentences_dev)), 5)
            lstm.print_predictions([sentences_dev[i] for i in s], [tags_dev[i] for i in s])


lstm = BasicLSTMtagger(DIM_HID=500, DIM_EMB=300).cuda()
train_basic_lstm(sentences_train, tags_train, lstm)



loss on epoch 0 = 17.932065963745117
conlleval:
processed 51578 tokens with 5942 phrases; found: 5783 phrases; correct: 5055.
accuracy:  97.63%; precision:  87.41%; recall:  85.07%; FB1:  86.23
              LOC: precision:  93.77%; recall:  89.33%; FB1:  91.50  1750
             MISC: precision:  84.33%; recall:  77.66%; FB1:  80.86  849
              ORG: precision:  79.41%; recall:  71.89%; FB1:  75.46  1214
              PER: precision:  88.02%; recall:  94.14%; FB1:  90.98  1970

----------------------------
-START-/START/START -DOCSTART-/O/O -END-/END/END
Predicted:	 ['START', 'O', 'END']
Gold:		 ['START', 'O', 'END']
----------------------------
-START-/START/START BASEBALL/O/O -/O/O MAJOR/I-MISC/I-MISC LEAGUE/I-MISC/I-MISC RESULTS/O/O THURSDAY/O/O ./O/O -END-/END/END
Predicted:	 ['START', 'O', 'O', 'I-MISC', 'I-MISC', 'O', 'O', 'O', 'END']
Gold:		 ['START', 'O', 'O', 'I-MISC', 'I-MISC', 'O', 'O', 'O', 'END']
----------------------------
-START-/START/START Russian/I-MISC/I-MISC



loss on epoch 1 = 3.171403408050537
conlleval:
processed 51578 tokens with 5942 phrases; found: 6068 phrases; correct: 5357.
accuracy:  98.33%; precision:  88.28%; recall:  90.15%; FB1:  89.21
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  1
              LOC: precision:  90.03%; recall:  96.30%; FB1:  93.06  1965
             MISC: precision:  84.50%; recall:  82.75%; FB1:  83.62  903
              ORG: precision:  83.07%; recall:  80.84%; FB1:  81.93  1305
              PER: precision:  91.92%; recall:  94.52%; FB1:  93.20  1894





loss on epoch 2 = 1.727961540222168
conlleval:
processed 51578 tokens with 5942 phrases; found: 6130 phrases; correct: 5430.
accuracy:  98.38%; precision:  88.58%; recall:  91.38%; FB1:  89.96
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  1
              LOC: precision:  92.30%; recall:  95.32%; FB1:  93.79  1897
             MISC: precision:  79.29%; recall:  85.14%; FB1:  82.11  990
              ORG: precision:  83.86%; recall:  85.61%; FB1:  84.72  1369
              PER: precision:  93.22%; recall:  94.79%; FB1:  94.00  1873





loss on epoch 3 = 1.0144685506820679
conlleval:
processed 51578 tokens with 5942 phrases; found: 6049 phrases; correct: 5417.
accuracy:  98.49%; precision:  89.55%; recall:  91.16%; FB1:  90.35
              LOC: precision:  94.27%; recall:  94.88%; FB1:  94.57  1849
             MISC: precision:  87.37%; recall:  83.30%; FB1:  85.29  879
              ORG: precision:  81.46%; recall:  87.17%; FB1:  84.22  1435
              PER: precision:  92.10%; recall:  94.30%; FB1:  93.19  1886





loss on epoch 4 = 0.5038003325462341
conlleval:
processed 51578 tokens with 5942 phrases; found: 6055 phrases; correct: 5435.
accuracy:  98.52%; precision:  89.76%; recall:  91.47%; FB1:  90.61
              LOC: precision:  94.02%; recall:  94.99%; FB1:  94.50  1856
             MISC: precision:  85.67%; recall:  84.27%; FB1:  84.96  907
              ORG: precision:  83.58%; recall:  85.76%; FB1:  84.65  1376
              PER: precision:  92.01%; recall:  95.71%; FB1:  93.83  1916





loss on epoch 5 = 0.3063792288303375
conlleval:
processed 51578 tokens with 5942 phrases; found: 6043 phrases; correct: 5437.
accuracy:  98.52%; precision:  89.97%; recall:  91.50%; FB1:  90.73
              LOC: precision:  94.44%; recall:  93.47%; FB1:  93.95  1818
             MISC: precision:  84.59%; recall:  85.14%; FB1:  84.86  928
              ORG: precision:  82.68%; recall:  88.67%; FB1:  85.57  1438
              PER: precision:  93.92%; recall:  94.79%; FB1:  94.35  1859





loss on epoch 6 = 0.19363655149936676
conlleval:
processed 51578 tokens with 5942 phrases; found: 6019 phrases; correct: 5448.
accuracy:  98.55%; precision:  90.51%; recall:  91.69%; FB1:  91.10
              LOC: precision:  95.37%; recall:  95.26%; FB1:  95.32  1835
             MISC: precision:  84.81%; recall:  85.36%; FB1:  85.08  928
              ORG: precision:  85.30%; recall:  86.13%; FB1:  85.71  1354
              PER: precision:  92.32%; recall:  95.33%; FB1:  93.80  1902





loss on epoch 7 = 0.27926748991012573
conlleval:
processed 51578 tokens with 5942 phrases; found: 6032 phrases; correct: 5459.
accuracy:  98.56%; precision:  90.50%; recall:  91.87%; FB1:  91.18
              LOC: precision:  93.74%; recall:  95.43%; FB1:  94.58  1870
             MISC: precision:  88.24%; recall:  84.60%; FB1:  86.38  884
              ORG: precision:  84.54%; recall:  86.88%; FB1:  85.69  1378
              PER: precision:  92.68%; recall:  95.60%; FB1:  94.12  1900





loss on epoch 8 = 0.9046532511711121
conlleval:
processed 51578 tokens with 5942 phrases; found: 6103 phrases; correct: 5341.
accuracy:  98.26%; precision:  87.51%; recall:  89.89%; FB1:  88.68
              LOC: precision:  92.43%; recall:  93.74%; FB1:  93.08  1863
             MISC: precision:  82.50%; recall:  80.80%; FB1:  81.64  903
              ORG: precision:  80.78%; recall:  83.07%; FB1:  81.91  1379
              PER: precision:  89.89%; recall:  95.55%; FB1:  92.63  1958





loss on epoch 9 = 1.12472403049469
conlleval:
processed 51578 tokens with 5942 phrases; found: 6108 phrases; correct: 5417.
accuracy:  98.40%; precision:  88.69%; recall:  91.16%; FB1:  89.91
              LOC: precision:  92.30%; recall:  95.86%; FB1:  94.05  1908
             MISC: precision:  83.39%; recall:  83.84%; FB1:  83.61  927
              ORG: precision:  82.83%; recall:  86.73%; FB1:  84.74  1404
              PER: precision:  92.03%; recall:  93.38%; FB1:  92.70  1869





loss on epoch 10 = 0.6710819005966187
conlleval:
processed 51578 tokens with 5942 phrases; found: 6064 phrases; correct: 5442.
accuracy:  98.51%; precision:  89.74%; recall:  91.59%; FB1:  90.65
              LOC: precision:  92.22%; recall:  96.79%; FB1:  94.45  1928
             MISC: precision:  85.64%; recall:  85.36%; FB1:  85.50  919
              ORG: precision:  85.09%; recall:  83.82%; FB1:  84.45  1321
              PER: precision:  92.46%; recall:  95.17%; FB1:  93.79  1896

----------------------------
-START-/START/START Everything/O/O else/O/O is/O/O muddy/O/O ,/O/O the/O/O waters/O/O of/O/O the/O/O fjord/O/O leaden/O/O ./O/O -END-/END/END
Predicted:	 ['START', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
Gold:		 ['START', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
----------------------------
-START-/START/START Most/O/O of/O/O the/O/O Marines/I-ORG/I-MISC are/O/O on/O/O three/O/O ships/O/O in/O/O the/O/O Tarawa/I-ORG/I-ORG A



loss on epoch 11 = 0.3149062991142273
conlleval:
processed 51578 tokens with 5942 phrases; found: 6045 phrases; correct: 5466.
accuracy:  98.62%; precision:  90.42%; recall:  91.99%; FB1:  91.20
              LOC: precision:  94.29%; recall:  96.24%; FB1:  95.26  1875
             MISC: precision:  84.67%; recall:  84.49%; FB1:  84.58  920
              ORG: precision:  84.57%; recall:  87.47%; FB1:  86.00  1387
              PER: precision:  93.72%; recall:  94.79%; FB1:  94.25  1863





loss on epoch 12 = 0.16931620240211487
conlleval:
processed 51578 tokens with 5942 phrases; found: 6049 phrases; correct: 5452.
accuracy:  98.60%; precision:  90.13%; recall:  91.75%; FB1:  90.93
              LOC: precision:  94.63%; recall:  94.94%; FB1:  94.78  1843
             MISC: precision:  83.80%; recall:  85.25%; FB1:  84.52  938
              ORG: precision:  84.35%; recall:  87.25%; FB1:  85.78  1387
              PER: precision:  93.14%; recall:  95.11%; FB1:  94.12  1881





loss on epoch 13 = 0.09059637784957886
conlleval:
processed 51578 tokens with 5942 phrases; found: 6019 phrases; correct: 5463.
accuracy:  98.65%; precision:  90.76%; recall:  91.94%; FB1:  91.35
              LOC: precision:  94.39%; recall:  96.14%; FB1:  95.25  1871
             MISC: precision:  86.86%; recall:  84.60%; FB1:  85.71  898
              ORG: precision:  85.65%; recall:  86.80%; FB1:  86.22  1359
              PER: precision:  92.70%; recall:  95.17%; FB1:  93.92  1891





loss on epoch 14 = 0.05419852212071419
conlleval:
processed 51578 tokens with 5942 phrases; found: 6042 phrases; correct: 5481.
accuracy:  98.66%; precision:  90.71%; recall:  92.24%; FB1:  91.47
              LOC: precision:  94.83%; recall:  95.92%; FB1:  95.37  1858
             MISC: precision:  87.49%; recall:  84.92%; FB1:  86.19  895
              ORG: precision:  85.47%; recall:  87.32%; FB1:  86.39  1370
              PER: precision:  91.97%; recall:  95.82%; FB1:  93.86  1919



In [None]:
#Evaluation on test data
lstm.write_predictions(sentences_test, 'test_pred_lstm.txt')
!wget https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl
!paste test test_pred_lstm.txt | perl conlleval.pl -d "\t"

--2025-02-22 18:17:51--  https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12754 (12K) [text/plain]
Saving to: ‘conlleval.pl.1’


2025-02-22 18:17:51 (16.8 MB/s) - ‘conlleval.pl.1’ saved [12754/12754]

processed 46666 tokens with 5648 phrases; found: 5769 phrases; correct: 5022.
accuracy:  97.79%; precision:  87.05%; recall:  88.92%; FB1:  87.97
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  2
              LOC: precision:  90.56%; recall:  92.03%; FB1:  91.29  1695
             MISC: precision:  75.42%; recall:  76.50%; FB1:  75.95  712
              ORG: precision:  83.81%; recall:  86.94%; FB1:  85.34  1723
              PER: precision:  92.

## Initialization with GloVe Embeddings (5 points)

If you haven't already, implement the `init_glove()` method in `BasicLSTMtagger` above.

Rather than initializing word embeddings randomly, it is common to start from pre-trained word embeddings (such as GloVe or Word2Vec), as discussed in lecture.  To make this simpler, we have already pre-filtered [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings to contain only words in the vocabulary of the CoNLL NER dataset, and loaded them into a dictionary (`GloVe`) at the beginning of this notebook.
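
As a rough illustration of the idea, the sketch below copies pre-trained vectors into the rows of an `nn.Embedding`. It assumes the dictionary maps each word to a fixed-length vector and that a word-to-index mapping like `word2i` is available; the helper name `init_embeddings_from_glove` and the toy dictionaries are hypothetical stand-ins, not the notebook's actual objects. Words without a pre-trained vector simply keep their random initialization.

```python
import torch
import torch.nn as nn

def init_embeddings_from_glove(embedding, glove, word2i):
    """Copy pre-trained vectors into the rows of an nn.Embedding, in place.

    Returns the number of rows that were overwritten.
    """
    n_found = 0
    with torch.no_grad():  # plain copy; it should not be tracked by autograd
        for word, idx in word2i.items():
            vec = glove.get(word)
            if vec is not None:
                embedding.weight[idx] = torch.as_tensor(vec, dtype=embedding.weight.dtype)
                n_found += 1
    return n_found

# Toy stand-ins (hypothetical) for the notebook's `GloVe` and `word2i`.
toy_glove = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
toy_word2i = {"<pad>": 0, "the": 1, "cat": 2, "zyzzyva": 3}

toy_emb = nn.Embedding(len(toy_word2i), 3)
n_copied = init_embeddings_from_glove(toy_emb, toy_glove, toy_word2i)
```

The embedding weights remain trainable after the copy, so they can still be fine-tuned on the NER data.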



## Character Embeddings (10 points)

Now that you have your basic LSTM tagger working, the next step is to add a convolutional network that computes word embeddings from character representations of words.  See Figure 2 and Figure 3 in the [Ma and Hovy](https://www.aclweb.org/anthology/P16-1101.pdf) paper.  We have provided code in `sentences2input_tensors` to convert sentences into lists of word and character indices.  See also [nn.Conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) and [MaxPool1d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html).

Hint: `nn.Conv1d` expects input of shape $(N, C_{in}, L_{in})$, but our input has shape $(N, \text{SLEN}, \text{CLEN}, \text{EMB_DIM})$. We can reshape and [permute](https://pytorch.org/docs/stable/generated/torch.permute.html) our input to match what `nn.Conv1d` expects, and recover the original dimensions afterwards.
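
The reshape-and-permute step from the hint might look like the sketch below. The tensor sizes and the filter count are toy values chosen for illustration, and the random tensor stands in for the output of a character embedding layer. The last two lines show why a `reshape` (or `view`) alone is not enough: it reinterprets the memory layout rather than swapping axes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, SLEN, CLEN, EMB_DIM, N_FILTERS = 2, 5, 7, 30, 30  # toy sizes (assumed)

char_emb = torch.randn(N, SLEN, CLEN, EMB_DIM)  # stand-in for char embedding output
conv = nn.Conv1d(in_channels=EMB_DIM, out_channels=N_FILTERS, kernel_size=3, padding=1)

# Merge batch and sentence dims so each word is one Conv1d example: (N*SLEN, CLEN, EMB_DIM)
x = char_emb.reshape(N * SLEN, CLEN, EMB_DIM)
# Conv1d wants channels before length: (N*SLEN, EMB_DIM, CLEN)
x = x.permute(0, 2, 1)

h = torch.relu(conv(x))                           # (N*SLEN, N_FILTERS, CLEN)
word_vec = h.max(dim=2).values                    # max over characters: (N*SLEN, N_FILTERS)
word_vec = word_vec.reshape(N, SLEN, N_FILTERS)   # recover (N, SLEN, N_FILTERS)

# A reshape straight to (N*SLEN, EMB_DIM, CLEN) has the right shape but the wrong contents.
wrong = char_emb.reshape(N * SLEN, EMB_DIM, CLEN)
same_layout = torch.equal(wrong, x)
```

Taking the maximum over the character axis yields one fixed-size vector per word, which can then be concatenated with the word embedding.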

Make sure to save your predictions on the test set, for submission to GradeScope. You should be able to achieve **90 F1 / 85 F1 on the dev/test sets**.

**Fill in the functions marked as `TODO` in the code block below. Please make your code changes only within the given commented block (#####).**

In [None]:
#####################################################################################
#TODO: Add imports if needed:
import torch.nn.functional as F
#####################################################################################


class CharLSTMtagger(BasicLSTMtagger):
    def __init__(self, DIM_EMB=10, DIM_CHAR_EMB=30, DIM_HID=10):
        super(CharLSTMtagger, self).__init__(DIM_EMB=DIM_EMB, DIM_HID=DIM_HID)
        NUM_TAGS = max(tag2i.values())+1

        (self.DIM_EMB, self.NUM_TAGS) = (DIM_EMB, NUM_TAGS)
        #####################################################################################
        #TODO: Initialize parameters.


        self.char_embeddings = nn.Embedding(char_vocab_size, DIM_CHAR_EMB)
        self.char_cnn = nn.Conv1d(in_channels = DIM_CHAR_EMB, out_channels = 30, kernel_size = 3, padding = 1)
        self.maxpool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)
        self.char_dropout = nn.Dropout(p=0.5)
        self.word_embeddings = nn.Embedding(vocab_size, DIM_EMB)
        self.dropout_in = nn.Dropout(p=0.5)
        self.dropout_out = nn.Dropout(p=0.5)
        self.log_softmax = nn.LogSoftmax(dim = 2)
        self.linear = nn.Linear(2 * DIM_HID, NUM_TAGS)
        self.lstm = nn.LSTM(DIM_EMB + 30, DIM_HID, batch_first = True, bidirectional = True)
        self.init_glove(GloVe)

        #####################################################################################

    def forward(self, X, X_char, train=False):
        #####################################################################################
        #TODO: Implement the forward computation.

        word_embeddings = self.word_embeddings(X)


        char_embeddings = self.char_embeddings(X_char)
        if train:
            char_embeddings = self.char_dropout(char_embeddings)


        batch_size, seq_len, word_len, char_dim = char_embeddings.shape
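        # Flatten to one word per row, then move channels first for Conv1d: (N*SLEN, char_dim, word_len).
        # A bare view(-1, char_dim, word_len) has the right shape but scrambles the axes instead of transposing them.
        char_embeddings = char_embeddings.view(-1, word_len, char_dim).permute(0, 2, 1)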


        char_cnn_out = F.relu(self.char_cnn(char_embeddings))
        char_pooled = self.maxpool(char_cnn_out)


        char_rep = torch.max(char_pooled, dim=2)[0]
        char_rep = char_rep.view(batch_size, seq_len, -1)


        combined_embeddings = torch.cat((word_embeddings, char_rep), dim=2)

        if train:
            combined_embeddings = self.dropout_in(combined_embeddings)


        lstm_out, _ = self.lstm(combined_embeddings)

        if train:
            lstm_out = self.dropout_out(lstm_out)


        linear_out = self.linear(lstm_out)
        output = self.log_softmax(linear_out)

        return output

        #####################################################################################

    def sentences2input_tensors(self, sentences):
        (X, X_mask)   = prepare_input(sentences2indices(sentences, word2i))
        X_char        = prepare_input_char(sentences2indicesChar(sentences, char2i))
        return (X, X_mask, X_char)

    def inference(self, sentences):
        (X, X_mask, X_char) = self.sentences2input_tensors(sentences)
        pred = self.forward(X.cuda(), X_char.cuda()).argmax(dim=2)
        return [[i2tag[pred[i,j].item()] for j in range(len(sentences[i]))] for i in range(len(sentences))]

    def print_predictions(self, words, tags):
        Y_pred = self.inference(words)
        for i in range(len(words)):
            print("----------------------------")
            print(" ".join([f"{words[i][j]}/{Y_pred[i][j]}/{tags[i][j]}" for j in range(len(words[i]))]))
            print("Predicted:\t", Y_pred[i])
            print("Gold:\t\t", tags[i])

char_lstm_test = CharLSTMtagger(DIM_HID=7, DIM_EMB=300)
lstm_output    = char_lstm_test.forward(prepare_input(X[0:5])[0], prepare_input_char(X_char[0:5]))
Y_onehot       = prepare_output_onehot(Y[0:5])

print("lstm output shape:", lstm_output.shape)
print("Y onehot shape:", Y_onehot.shape)

lstm output shape: torch.Size([5, 32, 10])
Y onehot shape: torch.Size([5, 32, 10])


In [None]:
#Training LSTM w/ character embeddings. Feel free to change number of epochs, optimizer, learning rate and batch size.

#####################################################################################
#TODO: Add imports if necessary.
from torch.optim import Adam, AdamW, SGD


#####################################################################################


def train_char_lstm(sentences, tags, lstm):
    #####################################################################################
    #TODO: initialize optimizer and other hyperparameters.
    optim = AdamW(lstm.parameters(), lr=0.001)
    nEpochs = 15
    batchSize = 64
    loss_function = nn.NLLLoss()

    # lstm.train()
    lstm.cuda()

    #####################################################################################

    for epoch in range(nEpochs):
        totalLoss = 0.0

        (sentences_shuffled, tags_shuffled) = shuffle_sentences(sentences, tags)
        for batch in tqdm(range(0, len(sentences), batchSize), leave=False):

            #####################################################################################
            #TODO: Implement gradient update on a batch of data.

            optim.zero_grad()

            input_data = prepare_input(sentences2indices(sentences_shuffled[batch:batch+batchSize], word2i, train=True))[0].cuda()

            X_char = prepare_input_char(sentences2indicesChar(sentences_shuffled[batch:batch+batchSize], char2i))
            X_char = X_char.cuda()
            lstm_output = lstm.forward(input_data, X_char, train=True)

            Y_onehot = prepare_output_onehot(sentences2indices(tags_shuffled[batch:batch+batchSize], tag2i)).cuda()

            loss = loss_function(lstm_output.view(-1, lstm.NUM_TAGS), Y_onehot.view(-1, lstm.NUM_TAGS).argmax(dim=1))

            loss.backward()

            optim.step()

            totalLoss += loss.item()  # accumulate a Python float so each batch's graph can be freed


            #####################################################################################

        print(f"loss on epoch {epoch} = {totalLoss}")
        lstm.write_predictions(sentences_dev, 'dev_pred')   #Performance on dev set
        print('conlleval:')
        print(subprocess.Popen('paste dev dev_pred | perl conlleval.pl -d "\t"', shell=True, stdout=subprocess.PIPE,stderr=subprocess.STDOUT).communicate()[0].decode('UTF-8'))

        if epoch % 10 == 0:
            s = sample(range(len(sentences_dev)), 5)
            lstm.print_predictions([sentences_dev[i] for i in s], [tags_dev[i] for i in s])

char_lstm = CharLSTMtagger(DIM_HID=500, DIM_EMB=300).cuda()
train_char_lstm(sentences_train, tags_train, char_lstm)



loss on epoch 0 = 35.71760559082031
conlleval:
processed 51578 tokens with 5942 phrases; found: 6011 phrases; correct: 4542.
accuracy:  95.90%; precision:  75.56%; recall:  76.44%; FB1:  76.00
              LOC: precision:  80.18%; recall:  84.32%; FB1:  82.20  1932
             MISC: precision:  74.54%; recall:  65.40%; FB1:  69.67  809
              ORG: precision:  60.05%; recall:  59.51%; FB1:  59.78  1329
              PER: precision:  82.02%; recall:  86.43%; FB1:  84.17  1941

----------------------------
-START-/START/START "/O/O KDP/O/I-ORG (/O/O Kurdistan/I-ORG/I-ORG Democratic/I-ORG/I-ORG Party/I-ORG/I-ORG )/O/O is/O/O trying/O/O to/O/O overtake/O/O the/O/O city/O/O ./O/O -END-/END/END
Predicted:	 ['START', 'O', 'O', 'O', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
Gold:		 ['START', 'O', 'I-ORG', 'O', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
----------------------------
-START-/START/START Sanders/I-PER/I-PER (/O



loss on epoch 1 = 11.586722373962402
conlleval:
processed 51578 tokens with 5942 phrases; found: 5744 phrases; correct: 4804.
accuracy:  96.97%; precision:  83.64%; recall:  80.85%; FB1:  82.22
              LOC: precision:  90.18%; recall:  86.99%; FB1:  88.56  1772
             MISC: precision:  72.57%; recall:  74.62%; FB1:  73.58  948
              ORG: precision:  75.18%; recall:  63.68%; FB1:  68.95  1136
              PER: precision:  88.14%; recall:  90.34%; FB1:  89.22  1888





loss on epoch 2 = 8.849044799804688
conlleval:
processed 51578 tokens with 5942 phrases; found: 5937 phrases; correct: 5080.
accuracy:  97.58%; precision:  85.57%; recall:  85.49%; FB1:  85.53
              LOC: precision:  89.79%; recall:  93.85%; FB1:  91.78  1920
             MISC: precision:  77.46%; recall:  76.03%; FB1:  76.74  905
              ORG: precision:  78.52%; recall:  70.62%; FB1:  74.36  1206
              PER: precision:  89.61%; recall:  92.73%; FB1:  91.14  1906





loss on epoch 3 = 7.123241901397705
conlleval:
processed 51578 tokens with 5942 phrases; found: 5955 phrases; correct: 5281.
accuracy:  98.10%; precision:  88.68%; recall:  88.88%; FB1:  88.78
              LOC: precision:  91.92%; recall:  95.43%; FB1:  93.64  1907
             MISC: precision:  81.03%; recall:  80.15%; FB1:  80.59  912
              ORG: precision:  83.32%; recall:  78.60%; FB1:  80.89  1265
              PER: precision:  92.73%; recall:  94.19%; FB1:  93.46  1871





loss on epoch 4 = 6.0636186599731445
conlleval:
processed 51578 tokens with 5942 phrases; found: 5937 phrases; correct: 5315.
accuracy:  98.25%; precision:  89.52%; recall:  89.45%; FB1:  89.49
              LOC: precision:  92.03%; recall:  95.59%; FB1:  93.78  1908
             MISC: precision:  85.32%; recall:  80.04%; FB1:  82.60  865
              ORG: precision:  85.63%; recall:  79.57%; FB1:  82.49  1246
              PER: precision:  91.45%; recall:  95.22%; FB1:  93.30  1918





loss on epoch 5 = 5.178715229034424
conlleval:
processed 51578 tokens with 5942 phrases; found: 5925 phrases; correct: 5366.
accuracy:  98.36%; precision:  90.57%; recall:  90.31%; FB1:  90.44
              LOC: precision:  93.64%; recall:  95.37%; FB1:  94.50  1871
             MISC: precision:  87.15%; recall:  80.15%; FB1:  83.50  848
              ORG: precision:  85.11%; recall:  82.70%; FB1:  83.89  1303
              PER: precision:  92.80%; recall:  95.87%; FB1:  94.31  1903





loss on epoch 6 = 4.571539878845215
conlleval:
processed 51578 tokens with 5942 phrases; found: 6080 phrases; correct: 5436.
accuracy:  98.42%; precision:  89.41%; recall:  91.48%; FB1:  90.43
              LOC: precision:  94.44%; recall:  95.21%; FB1:  94.82  1852
             MISC: precision:  85.31%; recall:  81.89%; FB1:  83.56  885
              ORG: precision:  80.48%; recall:  87.32%; FB1:  83.76  1455
              PER: precision:  93.27%; recall:  95.60%; FB1:  94.42  1888





loss on epoch 7 = 3.7956318855285645
conlleval:
processed 51578 tokens with 5942 phrases; found: 6009 phrases; correct: 5456.
accuracy:  98.62%; precision:  90.80%; recall:  91.82%; FB1:  91.31
              LOC: precision:  93.78%; recall:  96.90%; FB1:  95.31  1898
             MISC: precision:  86.01%; recall:  82.00%; FB1:  83.95  879
              ORG: precision:  87.27%; recall:  84.86%; FB1:  86.05  1304
              PER: precision:  92.43%; recall:  96.74%; FB1:  94.54  1928





loss on epoch 8 = 3.260950803756714
conlleval:
processed 51578 tokens with 5942 phrases; found: 6015 phrases; correct: 5475.
accuracy:  98.71%; precision:  91.02%; recall:  92.14%; FB1:  91.58
              LOC: precision:  94.90%; recall:  96.30%; FB1:  95.60  1864
             MISC: precision:  84.61%; recall:  84.06%; FB1:  84.33  916
              ORG: precision:  86.82%; recall:  85.46%; FB1:  86.13  1320
              PER: precision:  93.21%; recall:  96.91%; FB1:  95.02  1915





loss on epoch 9 = 2.8344316482543945
conlleval:
processed 51578 tokens with 5942 phrases; found: 6016 phrases; correct: 5497.
accuracy:  98.72%; precision:  91.37%; recall:  92.51%; FB1:  91.94
              LOC: precision:  94.28%; recall:  96.90%; FB1:  95.57  1888
             MISC: precision:  87.67%; recall:  83.30%; FB1:  85.43  876
              ORG: precision:  87.46%; recall:  86.35%; FB1:  86.90  1324
              PER: precision:  92.89%; recall:  97.23%; FB1:  95.01  1928





loss on epoch 10 = 2.506464719772339
conlleval:
processed 51578 tokens with 5942 phrases; found: 6022 phrases; correct: 5501.
accuracy:  98.78%; precision:  91.35%; recall:  92.58%; FB1:  91.96
              LOC: precision:  94.62%; recall:  96.68%; FB1:  95.64  1877
             MISC: precision:  87.74%; recall:  83.84%; FB1:  85.75  881
              ORG: precision:  87.09%; recall:  86.50%; FB1:  86.79  1332
              PER: precision:  92.75%; recall:  97.29%; FB1:  94.97  1932

----------------------------
-START-/START/START "/O/O I/O/O changed/O/O my/O/O strategy/O/O against/O/O him/O/O today/O/O and/O/O had/O/O him/O/O rattled/O/O ,/O/O "/O/O he/O/O added/O/O ./O/O -END-/END/END
Predicted:	 ['START', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
Gold:		 ['START', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
----------------------------
-START-/START/START -DOCSTART-/O/O -END-/END/END




loss on epoch 11 = 2.143665075302124
conlleval:
processed 51578 tokens with 5942 phrases; found: 6028 phrases; correct: 5507.
accuracy:  98.78%; precision:  91.36%; recall:  92.68%; FB1:  92.01
              LOC: precision:  93.92%; recall:  96.79%; FB1:  95.34  1893
             MISC: precision:  88.84%; recall:  83.73%; FB1:  86.21  869
              ORG: precision:  86.84%; recall:  86.58%; FB1:  86.71  1337
              PER: precision:  93.11%; recall:  97.50%; FB1:  95.25  1929





loss on epoch 12 = 2.010096311569214
conlleval:
processed 51578 tokens with 5942 phrases; found: 6011 phrases; correct: 5517.
accuracy:  98.84%; precision:  91.78%; recall:  92.85%; FB1:  92.31
              LOC: precision:  94.46%; recall:  96.52%; FB1:  95.48  1877
             MISC: precision:  86.66%; recall:  85.25%; FB1:  85.95  907
              ORG: precision:  87.56%; recall:  87.10%; FB1:  87.33  1334
              PER: precision:  94.56%; recall:  97.18%; FB1:  95.85  1893





loss on epoch 13 = 1.7580928802490234
conlleval:
processed 51578 tokens with 5942 phrases; found: 6018 phrases; correct: 5540.
accuracy:  98.88%; precision:  92.06%; recall:  93.23%; FB1:  92.64
              LOC: precision:  94.24%; recall:  97.06%; FB1:  95.63  1892
             MISC: precision:  87.78%; recall:  86.44%; FB1:  87.10  908
              ORG: precision:  88.83%; recall:  87.17%; FB1:  87.99  1316
              PER: precision:  94.16%; recall:  97.23%; FB1:  95.67  1902





loss on epoch 14 = 1.5483956336975098
conlleval:
processed 51578 tokens with 5942 phrases; found: 5987 phrases; correct: 5531.
accuracy:  98.86%; precision:  92.38%; recall:  93.08%; FB1:  92.73
              LOC: precision:  94.58%; recall:  96.95%; FB1:  95.75  1883
             MISC: precision:  88.14%; recall:  85.47%; FB1:  86.78  894
              ORG: precision:  90.10%; recall:  86.88%; FB1:  88.46  1293
              PER: precision:  93.74%; recall:  97.56%; FB1:  95.61  1917



In [None]:
#Evaluation on test set
char_lstm.write_predictions(sentences_test, 'test_pred_cnn_lstm.txt')
!wget https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl
!paste test test_pred_cnn_lstm.txt | perl conlleval.pl -d "\t"

--2025-02-22 18:22:47--  https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12754 (12K) [text/plain]
Saving to: ‘conlleval.pl.2’


2025-02-22 18:22:47 (15.2 MB/s) - ‘conlleval.pl.2’ saved [12754/12754]

processed 46666 tokens with 5648 phrases; found: 5755 phrases; correct: 5043.
accuracy:  97.83%; precision:  87.63%; recall:  89.29%; FB1:  88.45
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  2
              LOC: precision:  89.23%; recall:  93.41%; FB1:  91.27  1746
             MISC: precision:  74.76%; recall:  76.78%; FB1:  75.76  721
              ORG: precision:  85.71%; recall:  85.55%; FB1:  85.63  1658
              PER: precision:  93.

## Conditional Random Fields (+2% Course Grade - optional extra credit)

Now we are ready to add a CRF layer to the `CharLSTMtagger`.  To train the model, implement `conditional_log_likelihood`, using the score (unnormalized log probability) of the gold sequence, in addition to the partition function, $Z(X)$, which is computed using the forward algorithm.  Then, you can simply use PyTorch's automatic differentiation to compute gradients by running backpropagation through the computation graph of the dynamic program (this should be very simple, so long as you are able to correctly implement the forward algorithm using a computation graph that is supported by PyTorch).  This approach to computing gradients for CRFs is discussed in Section 7.5.3 of the [Eisenstein Book](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf).
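
One way to sanity-check a forward-algorithm implementation is to compare it against brute-force enumeration on a tiny problem. The sketch below does this with random toy scores; the names `emissions`, `transitions`, `sequence_score`, and `log_partition` are hypothetical and only illustrate the idea, with `emissions` standing in for the per-token tag scores produced by the LSTM.

```python
import itertools
import torch

torch.manual_seed(0)
T, K = 4, 3                      # toy sentence length and number of tags (assumed)
emissions = torch.randn(T, K)    # stand-in for per-token tag scores from the LSTM
transitions = torch.randn(K, K)  # transitions[i, j]: score of moving from tag i to tag j

def sequence_score(tags):
    """Unnormalized log-score of one tag sequence."""
    score = emissions[0, tags[0]]
    for t in range(1, T):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

def log_partition():
    """log Z(X) via the forward algorithm, O(T * K^2)."""
    alpha = emissions[0]
    for t in range(1, T):
        # alpha[i] + transitions[i, j], reduced over the previous tag i
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

# Brute force: enumerate all K**T sequences (only feasible for toy sizes).
all_scores = torch.stack([sequence_score(y) for y in itertools.product(range(K), repeat=T)])
log_z_brute = torch.logsumexp(all_scores, dim=0)
log_z = log_partition()

gold = (0, 2, 1, 1)                  # arbitrary "gold" sequence for illustration
nll = log_z - sequence_score(gold)   # negative conditional log-likelihood
```

Because $\log Z(X)$ sums over every sequence, including the gold one, the quantity `log_z - sequence_score(gold)` cannot be negative.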

You will also need to implement the Viterbi algorithm for inference during decoding.
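
Viterbi follows the same recursion as the forward algorithm, with `max` in place of `logsumexp` and backpointers recorded at each step. The following is a minimal sketch on toy random scores, checked against exhaustive search; `viterbi_decode` and `brute_force` are hypothetical helper names, not part of the assignment skeleton.

```python
import itertools
import torch

torch.manual_seed(0)
T, K = 5, 3                      # toy sentence length and number of tags (assumed)
emissions = torch.randn(T, K)
transitions = torch.randn(K, K)  # transitions[i, j]: score of moving from tag i to tag j

def viterbi_decode(emissions, transitions):
    """Return (best tag sequence, its score)."""
    score = emissions[0]
    backpointers = []
    for t in range(1, emissions.size(0)):
        # candidates[i, j]: best score ending in tag i at t-1, then moving to tag j
        candidates = score.unsqueeze(1) + transitions
        best_prev_score, best_prev_tag = candidates.max(dim=0)
        score = best_prev_score + emissions[t]
        backpointers.append(best_prev_tag)
    best_score, last_tag = score.max(dim=0)
    path = [last_tag.item()]
    for bp in reversed(backpointers):   # follow backpointers from the end
        path.append(bp[path[-1]].item())
    path.reverse()
    return path, best_score.item()

def brute_force(emissions, transitions):
    """Score every possible tag sequence and keep the best one."""
    best = (None, float("-inf"))
    n_steps, n_tags = emissions.shape
    for tags in itertools.product(range(n_tags), repeat=n_steps):
        s = emissions[0, tags[0]].item()
        for t in range(1, n_steps):
            s += transitions[tags[t - 1], tags[t]].item() + emissions[t, tags[t]].item()
        if s > best[1]:
            best = (list(tags), s)
    return best

path, score = viterbi_decode(emissions, transitions)
bf_path, bf_score = brute_force(emissions, transitions)
```

Note that the best final score is read off before backtracking, so it is not affected by the tag index changing as the path is traced back.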

After including CRF training and Viterbi decoding, you should be getting about **92 F1 / 88 F1 on the dev and test sets**, respectively.

---




**IMPORTANT:** Training will be substantially slower this time - depending on the efficiency of your implementation, it could take about 5 minutes per epoch (e.g. 50 minutes for 10 epochs).  We recommend starting out by training on a single batch of data (and testing on this same batch), so that you can quickly debug, making sure your model can memorize the labels on a single batch, and then optimize your code.  Once you are fairly confident your code is working properly, you can train using the full dataset.  We have provided a (commented out) line of code to switch between training on a single batch and the full dataset below.

**Hint #1:** While debugging your implementation of the Forward algorithm, it is helpful to look at the loss during training.  The loss (the negative conditional log-likelihood) should never be less than zero, since the log-likelihood itself can never be positive.

**Hint #2:** To sum log-probabilities in a numerically stable way at the end of the Forward algorithm, you will want to use [`torch.logsumexp`](https://pytorch.org/docs/stable/generated/torch.logsumexp.html).
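
To see why this matters, the small sketch below compares a naive `log(sum(exp(...)))` with `torch.logsumexp` on deliberately large toy scores. The values are made up for illustration, and the manual version at the end spells out the usual max-subtraction trick.

```python
import torch

# Large log-scores, chosen so that exponentiating them overflows float32.
scores = torch.tensor([1000.0, 1000.5, 999.0])

naive = torch.log(torch.exp(scores).sum())   # exp(1000) overflows to inf
stable = torch.logsumexp(scores, dim=0)      # numerically stable equivalent

# The same result written out by hand: subtract the max before exponentiating.
m = scores.max()
manual = m + torch.log(torch.exp(scores - m).sum())
```

The naive version overflows to infinity, while the stable and manual versions stay finite and agree with each other.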

**Fill in the functions marked as `TODO` in the code block below. Please make your code changes only within the given commented block (#####).**

In [None]:
import torch.nn.functional as F

#####################################################################################
#TODO: Add imports if needed.

#####################################################################################

class LSTM_CRFtagger(CharLSTMtagger):
    def __init__(self, DIM_EMB=10, DIM_CHAR_EMB=30, DIM_HID=10):
        super(LSTM_CRFtagger, self).__init__(DIM_EMB=DIM_EMB, DIM_HID=DIM_HID, DIM_CHAR_EMB=DIM_CHAR_EMB)
        #####################################################################################
        #TODO: Initialize parameters.

        self.transitions = nn.Parameter(torch.randn(self.NUM_TAGS, self.NUM_TAGS))

        #####################################################################################

    def gold_score(self, lstm_scores, Y):
        #####################################################################################
        #TODO: compute score of gold sequence Y (unnormalized conditional log-probability)

        score = lstm_scores[0, Y[0]]
        seq_len = Y.size(0)
        for t in range(1, seq_len):
            score = score + self.transitions[Y[t-1], Y[t]] + lstm_scores[t, Y[t]]
        return score


        #####################################################################################


    #Forward algorithm for a single sentence
    #Efficiency will eventually be important here.  We recommend you start by
    #training on a single batch and make sure your code can memorize the
    #training data.  Then you can go back and re-write the inner loop using
    #tensor operations to speed things up.
    def forward_algorithm(self, lstm_scores, sLen):
        #####################################################################################
        #TODO: compute partition function Z

        alpha = lstm_scores[0]  # shape: (NUM_TAGS,)
        for t in range(1, sLen):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.transitions, dim=0) + lstm_scores[t]

        return torch.logsumexp(alpha, dim=0)


        #####################################################################################

    def conditional_log_likelihood(self, sentences, tags, train=True):
        #####################################################################################
        #TODO: compute conditional log likelihood of Y (use forward_algorithm and gold_score)
        (X, X_mask, X_char) = self.sentences2input_tensors(sentences)
        lstm_scores = self.forward(X.cuda(), X_char.cuda(), train=train)
        total_loss = 0.0
        batch_size = lstm_scores.size(0)
        for i in range(batch_size):
            sLen = int(X_mask[i].sum().item())
            scores = lstm_scores[i][:sLen]  # (sLen, NUM_TAGS)
            gold_tags = torch.tensor(sentences2indices([tags[i]], tag2i)[0], device=scores.device)
            gold = self.gold_score(scores, gold_tags)
            partition = self.forward_algorithm(scores, sLen)
            total_loss = total_loss + (partition - gold)
        return total_loss / batch_size

        #####################################################################################

    def viterbi(self, lstm_scores, sLen):
        #####################################################################################
        #TODO: Implement the Viterbi algorithm
        backpointers = []

        best_score = lstm_scores[0]
        for t in range(1, sLen):

            scores_t = best_score.unsqueeze(1) + self.transitions  # shape: (NUM_TAGS, NUM_TAGS)
            best_prev_scores, best_prev_tags = torch.max(scores_t, dim=0)
            best_score = best_prev_scores + lstm_scores[t]
            backpointers.append(best_prev_tags)

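        # Record the best final score before backtracking, since the tag index changes while tracing the path.
        best_final_score, best_final_tag = torch.max(best_score, dim=0)
        best_tag = best_final_tag.item()
        best_path = [best_tag]
        for bp in reversed(backpointers):
            best_tag = bp[best_tag].item()
            best_path.append(best_tag)
        best_path.reverse()
        return best_path, best_final_score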

        #####################################################################################

    #Computes Viterbi sequences on a batch of data.
    def viterbi_batch(self, sentences):
        viterbiSeqs = []
        (X, X_mask, X_char) = self.sentences2input_tensors(sentences)
        lstm_scores = self.forward(X.cuda(), X_char.cuda(), train=False)  # no dropout while decoding
        for s in range(len(sentences)):
            (viterbiSeq, ll) = self.viterbi(lstm_scores[s], len(sentences[s]))
            viterbiSeqs.append(viterbiSeq)

        max_len = max(len(seq) for seq in viterbiSeqs)

        padded_seqs = [seq + [0] * (max_len - len(seq)) for seq in viterbiSeqs]


        return torch.tensor(padded_seqs, dtype=torch.long)


    def forward(self, X, X_char, train=False):
        #####################################################################################
        #TODO: Implement the forward computation.
        return super(LSTM_CRFtagger, self).forward(X, X_char, train=train)

        #####################################################################################


    def print_predictions(self, words, tags):
        Y_pred = self.inference(words)
        for i in range(len(words)):
            print("----------------------------")
            print(" ".join([f"{words[i][j]}/{Y_pred[i][j]}/{tags[i][j]}" for j in range(len(words[i]))]))
            print("Predicted:\t", [Y_pred[i][j] for j in range(len(words[i]))])
            print("Gold:\t\t", tags[i])

    #Need to use Viterbi this time.
    def inference(self, sentences, viterbi=True):
        pred = self.viterbi_batch(sentences)
        return [[i2tag[pred[i][j].item()] for j in range(len(sentences[i]))] for i in range(len(sentences))]

lstm_crf = LSTM_CRFtagger(DIM_EMB=300).cuda()
# print(lstm_crf.conditional_log_likelihood(sentences_dev[0:5], tags_dev[0:5]))

In [None]:
# This is a cell for debugging, feel free to change it as you like
print(lstm_crf.conditional_log_likelihood(sentences_dev[0:5], tags_dev[0:5]))

tensor(44.7500, device='cuda:0', grad_fn=<DivBackward0>)


In [None]:
#CharLSTM-CRF Training

#####################################################################################
# TODO: Add imports if needed.
import torch.optim as optim
import torch.nn.utils as utils
#####################################################################################

#Get CoNLL evaluation script
os.system('wget https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl')

def train_crf_lstm(sentences, tags, lstm):
    #####################################################################################
    #TODO: initialize optimizer and hyperparameters.

    optimizer = optim.AdamW(lstm.parameters(), lr=0.01)
    # optimizer = optim.SGD(lstm.parameters(), lr=0.01, momentum=0.9)
    nEpochs = 20
    batchSize = 128

    lstm.train()
    lstm.cuda()



    #####################################################################################
    for epoch in range(nEpochs):
        totalLoss = 0.0
        lstm.train()

        #Shuffle the sentences
        (sentences_shuffled, tags_shuffled) = shuffle_sentences(sentences, tags)
        for batch in tqdm(range(0, len(sentences), batchSize), leave=False):
            #####################################################################################
            #TODO: Implement gradient update on a batch of data.

            optimizer.zero_grad()

            batch_sentences = sentences_shuffled[batch:batch+batchSize]
            batch_tags = tags_shuffled[batch:batch+batchSize]


            loss = lstm.conditional_log_likelihood(batch_sentences, batch_tags, train=True)


            loss.backward()
            utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)
            optimizer.step()



            totalLoss += loss.item()

            #####################################################################################



        print(f"loss on epoch {epoch} = {totalLoss}")
        lstm.write_predictions(sentences_dev, 'dev_pred')   #Performance on dev set
        print('conlleval:')
        print(subprocess.Popen('paste dev dev_pred | perl conlleval.pl -d "\t"', shell=True, stdout=subprocess.PIPE,stderr=subprocess.STDOUT).communicate()[0].decode('UTF-8'))

        if epoch % 5 == 0:
            lstm.eval()
            s = random.sample(range(50), 5)
            lstm.print_predictions([sentences_train[i] for i in s], [tags_train[i] for i in s])   #Print predictions on train data (useful for debugging)

crf_lstm = LSTM_CRFtagger(DIM_HID=500, DIM_EMB=300, DIM_CHAR_EMB=30).cuda()
train_crf_lstm(sentences_train, tags_train, crf_lstm)             #Train on the full dataset
# train_crf_lstm(sentences_train[0:50], tags_train[0:50], crf_lstm)   #Train only the first batch (use this during development/debugging)



loss on epoch 0 = 694.3001825809479
conlleval:
processed 51578 tokens with 5942 phrases; found: 5692 phrases; correct: 4684.
accuracy:  96.42%; precision:  82.29%; recall:  78.83%; FB1:  80.52
              LOC: precision:  91.30%; recall:  82.80%; FB1:  86.84  1666
             MISC: precision:  73.99%; recall:  67.57%; FB1:  70.63  842
              ORG: precision:  72.07%; recall:  67.34%; FB1:  69.62  1253
              PER: precision:  84.77%; recall:  88.87%; FB1:  86.77  1931

----------------------------
-START-/START/START It/O/O brought/O/O in/O/O 4,275/O/O tonnes/O/O of/O/O British/I-MISC/I-MISC mutton/O/O ,/O/O some/O/O 10/O/O percent/O/O of/O/O overall/O/O imports/O/O ./O/O -END-/END/END
Predicted:	 ['START', 'O', 'O', 'O', 'O', 'O', 'O', 'I-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
Gold:		 ['START', 'O', 'O', 'O', 'O', 'O', 'O', 'I-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
----------------------------
-START-/START/START FRANKFURT/I-LOC/I



loss on epoch 1 = 143.81007074564695
conlleval:
processed 51578 tokens with 5942 phrases; found: 5710 phrases; correct: 4947.
accuracy:  97.19%; precision:  86.64%; recall:  83.25%; FB1:  84.91
              LOC: precision:  92.16%; recall:  88.95%; FB1:  90.53  1773
             MISC: precision:  83.40%; recall:  71.91%; FB1:  77.23  795
              ORG: precision:  78.85%; recall:  73.68%; FB1:  76.18  1253
              PER: precision:  87.98%; recall:  90.23%; FB1:  89.09  1889





loss on epoch 2 = 103.66524684429169
conlleval:
processed 51578 tokens with 5942 phrases; found: 5925 phrases; correct: 5148.
accuracy:  97.66%; precision:  86.89%; recall:  86.64%; FB1:  86.76
              LOC: precision:  91.27%; recall:  90.53%; FB1:  90.90  1822
             MISC: precision:  77.10%; recall:  78.52%; FB1:  77.81  939
              ORG: precision:  82.95%; recall:  78.00%; FB1:  80.40  1261
              PER: precision:  90.12%; recall:  93.11%; FB1:  91.59  1903





loss on epoch 3 = 81.7652291432023
conlleval:
processed 51578 tokens with 5942 phrases; found: 5845 phrases; correct: 5177.
accuracy:  97.81%; precision:  88.57%; recall:  87.13%; FB1:  87.84
              LOC: precision:  94.00%; recall:  91.24%; FB1:  92.60  1783
             MISC: precision:  80.17%; recall:  80.69%; FB1:  80.43  928
              ORG: precision:  85.81%; recall:  77.11%; FB1:  81.23  1205
              PER: precision:  89.32%; recall:  93.54%; FB1:  91.38  1929





loss on epoch 4 = 71.03925773501396
conlleval:
processed 51578 tokens with 5942 phrases; found: 5941 phrases; correct: 5250.
accuracy:  97.93%; precision:  88.37%; recall:  88.35%; FB1:  88.36
              LOC: precision:  93.05%; recall:  92.54%; FB1:  92.79  1827
             MISC: precision:  84.12%; recall:  79.28%; FB1:  81.63  869
              ORG: precision:  79.12%; recall:  82.25%; FB1:  80.66  1394
              PER: precision:  92.71%; recall:  93.16%; FB1:  92.93  1851





loss on epoch 5 = 68.21696072816849
conlleval:
processed 51578 tokens with 5942 phrases; found: 5938 phrases; correct: 5243.
accuracy:  97.93%; precision:  88.30%; recall:  88.24%; FB1:  88.27
              LOC: precision:  89.61%; recall:  93.41%; FB1:  91.47  1915
             MISC: precision:  81.15%; recall:  81.24%; FB1:  81.19  923
              ORG: precision:  85.18%; recall:  77.55%; FB1:  81.19  1221
              PER: precision:  92.50%; recall:  94.35%; FB1:  93.42  1879

----------------------------
-START-/START/START BRUSSELS/I-LOC/I-LOC 1996-08-22/O/O -END-/END/END
Predicted:	 ['START', 'I-LOC', 'O', 'END']
Gold:		 ['START', 'I-LOC', 'O', 'END']
----------------------------
-START-/START/START State/O/O media/O/O quoted/O/O China/I-LOC/I-LOC 's/O/O top/O/O negotiator/O/O with/O/O Taipei/I-LOC/I-LOC ,/O/O Tang/I-PER/I-PER Shubei/I-PER/I-PER ,/O/O as/O/O telling/O/O a/O/O visiting/O/O group/O/O from/O/O Taiwan/I-LOC/I-LOC on/O/O Wednesday/O/O that/O/O it/O/O was/O/O time/



loss on epoch 6 = 59.262015253305435
conlleval:
processed 51578 tokens with 5942 phrases; found: 5936 phrases; correct: 5285.
accuracy:  98.10%; precision:  89.03%; recall:  88.94%; FB1:  88.99
              LOC: precision:  92.98%; recall:  93.69%; FB1:  93.33  1851
             MISC: precision:  82.24%; recall:  81.34%; FB1:  81.79  912
              ORG: precision:  85.35%; recall:  79.49%; FB1:  82.32  1249
              PER: precision:  90.85%; recall:  94.90%; FB1:  92.83  1924





loss on epoch 7 = 56.06701733916998
conlleval:
processed 51578 tokens with 5942 phrases; found: 5916 phrases; correct: 5287.
accuracy:  98.12%; precision:  89.37%; recall:  88.98%; FB1:  89.17
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  1
              LOC: precision:  91.81%; recall:  92.71%; FB1:  92.25  1855
             MISC: precision:  83.09%; recall:  80.48%; FB1:  81.76  893
              ORG: precision:  86.25%; recall:  80.46%; FB1:  83.26  1251
              PER: precision:  92.01%; recall:  95.71%; FB1:  93.83  1916





loss on epoch 8 = 54.454126477241516
conlleval:
processed 51578 tokens with 5942 phrases; found: 5871 phrases; correct: 5215.
accuracy:  98.00%; precision:  88.83%; recall:  87.77%; FB1:  88.29
              LOC: precision:  91.86%; recall:  92.16%; FB1:  92.01  1843
             MISC: precision:  87.22%; recall:  79.18%; FB1:  83.00  837
              ORG: precision:  79.97%; recall:  82.48%; FB1:  81.20  1383
              PER: precision:  93.25%; recall:  91.53%; FB1:  92.38  1808





loss on epoch 9 = 53.42488244920969
conlleval:
processed 51578 tokens with 5942 phrases; found: 5923 phrases; correct: 5279.
accuracy:  98.07%; precision:  89.13%; recall:  88.84%; FB1:  88.98
              LOC: precision:  92.90%; recall:  93.36%; FB1:  93.13  1846
             MISC: precision:  82.31%; recall:  80.26%; FB1:  81.27  899
              ORG: precision:  84.73%; recall:  82.33%; FB1:  83.51  1303
              PER: precision:  91.73%; recall:  93.38%; FB1:  92.55  1875





loss on epoch 10 = 49.51237042248249
conlleval:
processed 51578 tokens with 5942 phrases; found: 5906 phrases; correct: 5269.
accuracy:  98.11%; precision:  89.21%; recall:  88.67%; FB1:  88.94
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  1
              LOC: precision:  92.08%; recall:  92.98%; FB1:  92.52  1855
             MISC: precision:  83.09%; recall:  80.48%; FB1:  81.76  893
              ORG: precision:  84.34%; recall:  79.94%; FB1:  82.08  1271
              PER: precision:  92.63%; recall:  94.84%; FB1:  93.72  1886

----------------------------
-START-/START/START BEIJING/I-LOC/I-LOC 1996-08-22/O/O -END-/END/END
Predicted:	 ['START', 'I-LOC', 'O', 'END']
Gold:		 ['START', 'I-LOC', 'O', 'END']
----------------------------
-START-/START/START -DOCSTART-/O/O -END-/END/END
Predicted:	 ['START', 'O', 'END']
Gold:		 ['START', 'O', 'END']
----------------------------
-START-/START/START China/I-LOC/I-LOC says/O/O Taiwan/I-LOC/I-LOC spoils/O/O atmosphere/



loss on epoch 11 = 47.332240611314774
conlleval:
processed 51578 tokens with 5942 phrases; found: 5906 phrases; correct: 5303.
accuracy:  98.19%; precision:  89.79%; recall:  89.25%; FB1:  89.52
              LOC: precision:  91.16%; recall:  94.88%; FB1:  92.98  1912
             MISC: precision:  86.55%; recall:  78.85%; FB1:  82.52  840
              ORG: precision:  85.13%; recall:  81.95%; FB1:  83.51  1291
              PER: precision:  93.08%; recall:  94.14%; FB1:  93.60  1863





loss on epoch 12 = 48.974113926291466
conlleval:
processed 51578 tokens with 5942 phrases; found: 5894 phrases; correct: 5272.
accuracy:  98.09%; precision:  89.45%; recall:  88.72%; FB1:  89.08
              LOC: precision:  94.46%; recall:  91.83%; FB1:  93.13  1786
             MISC: precision:  83.63%; recall:  80.91%; FB1:  82.25  892
              ORG: precision:  81.40%; recall:  83.89%; FB1:  82.63  1382
              PER: precision:  93.46%; recall:  93.05%; FB1:  93.25  1834





loss on epoch 13 = 48.682118490338326
conlleval:
processed 51578 tokens with 5942 phrases; found: 5895 phrases; correct: 5253.
accuracy:  98.05%; precision:  89.11%; recall:  88.40%; FB1:  88.76
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  1
              LOC: precision:  93.99%; recall:  92.00%; FB1:  92.98  1798
             MISC: precision:  81.36%; recall:  80.48%; FB1:  80.92  912
              ORG: precision:  83.38%; recall:  81.21%; FB1:  82.28  1306
              PER: precision:  92.23%; recall:  94.03%; FB1:  93.12  1878





loss on epoch 14 = 48.52940855920315
conlleval:
processed 51578 tokens with 5942 phrases; found: 5884 phrases; correct: 5259.
accuracy:  98.12%; precision:  89.38%; recall:  88.51%; FB1:  88.94
              LOC: precision:  92.78%; recall:  92.27%; FB1:  92.52  1827
             MISC: precision:  83.37%; recall:  81.56%; FB1:  82.46  902
              ORG: precision:  83.64%; recall:  81.58%; FB1:  82.60  1308
              PER: precision:  93.02%; recall:  93.27%; FB1:  93.14  1847





loss on epoch 15 = 48.969045743346214
conlleval:
processed 51578 tokens with 5942 phrases; found: 5957 phrases; correct: 5321.
accuracy:  98.17%; precision:  89.32%; recall:  89.55%; FB1:  89.44
              LOC: precision:  91.91%; recall:  94.01%; FB1:  92.95  1879
             MISC: precision:  84.81%; recall:  81.78%; FB1:  83.27  889
              ORG: precision:  84.66%; recall:  81.06%; FB1:  82.82  1284
              PER: precision:  92.02%; recall:  95.17%; FB1:  93.57  1905

----------------------------
-START-/START/START China/I-LOC/I-LOC has/O/O said/O/O it/O/O was/O/O time/O/O for/O/O political/O/O talks/O/O with/O/O Taiwan/I-LOC/I-LOC and/O/O that/O/O the/O/O rival/O/O island/O/O should/O/O take/O/O practical/O/O steps/O/O towards/O/O that/O/O goal/O/O ./O/O -END-/END/END
Predicted:	 ['START', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'END']
Gold:		 ['START', 'I-LOC', 'O', 'O', 'O', 'O



loss on epoch 16 = 48.595340356230736
conlleval:
processed 51578 tokens with 5942 phrases; found: 5941 phrases; correct: 5228.
accuracy:  97.90%; precision:  88.00%; recall:  87.98%; FB1:  87.99
              LOC: precision:  92.54%; recall:  91.78%; FB1:  92.16  1822
             MISC: precision:  78.75%; recall:  82.00%; FB1:  80.34  960
              ORG: precision:  80.87%; recall:  83.22%; FB1:  82.03  1380
              PER: precision:  93.87%; recall:  90.66%; FB1:  92.24  1779





loss on epoch 17 = 50.954543486237526
conlleval:
processed 51578 tokens with 5942 phrases; found: 5953 phrases; correct: 5275.
accuracy:  98.05%; precision:  88.61%; recall:  88.77%; FB1:  88.69
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  2
              LOC: precision:  91.98%; recall:  93.09%; FB1:  92.53  1859
             MISC: precision:  84.94%; recall:  80.15%; FB1:  82.48  870
              ORG: precision:  81.66%; recall:  81.36%; FB1:  81.51  1336
              PER: precision:  91.99%; recall:  94.19%; FB1:  93.08  1886





loss on epoch 18 = 47.84092094004154
conlleval:
processed 51578 tokens with 5942 phrases; found: 5886 phrases; correct: 5295.
accuracy:  98.17%; precision:  89.96%; recall:  89.11%; FB1:  89.53
              LOC: precision:  93.07%; recall:  92.87%; FB1:  92.97  1833
             MISC: precision:  86.87%; recall:  80.37%; FB1:  83.49  853
              ORG: precision:  84.23%; recall:  81.28%; FB1:  82.73  1294
              PER: precision:  92.24%; recall:  95.44%; FB1:  93.81  1906





loss on epoch 19 = 45.419065117836
conlleval:
processed 51578 tokens with 5942 phrases; found: 5885 phrases; correct: 5232.
accuracy:  98.01%; precision:  88.90%; recall:  88.05%; FB1:  88.48
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  2
              LOC: precision:  91.98%; recall:  93.63%; FB1:  92.80  1870
             MISC: precision:  79.78%; recall:  80.48%; FB1:  80.13  930
              ORG: precision:  85.37%; recall:  78.75%; FB1:  81.92  1237
              PER: precision:  92.85%; recall:  93.05%; FB1:  92.95  1846
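
The dev FB1 in the log above does not improve monotonically: it peaks at epoch 18 and drops again at epoch 19. A minimal sketch for tracking the best epoch from the conlleval text is shown below. The helper names `parse_overall_fb1` and `best_epoch` are hypothetical and not part of the assignment code, and the regular expression assumes the overall summary line has the same layout as the output printed above.

```python
import re

# Matches the overall summary line printed by conlleval, e.g.
# "accuracy:  98.17%; precision:  89.96%; recall:  89.11%; FB1:  89.53"
# Per-entity lines (LOC, MISC, ...) are skipped because they do not start with "accuracy:".
OVERALL_RE = re.compile(
    r"^accuracy:\s*([\d.]+)%; precision:\s*([\d.]+)%; recall:\s*([\d.]+)%; FB1:\s*([\d.]+)",
    re.MULTILINE,
)


def parse_overall_fb1(conlleval_output):
    """Return the overall FB1 as a float, or None if no summary line is found."""
    match = OVERALL_RE.search(conlleval_output)
    return float(match.group(4)) if match else None


def best_epoch(fb1_by_epoch):
    """Return (epoch, fb1) for the highest FB1 in a list indexed by epoch."""
    epoch = max(range(len(fb1_by_epoch)), key=lambda i: fb1_by_epoch[i])
    return epoch, fb1_by_epoch[epoch]


if __name__ == "__main__":
    # Overall dev FB1 per epoch, copied from the training log above.
    dev_fb1 = [80.52, 84.91, 86.76, 87.84, 88.36, 88.27, 88.99, 89.17, 88.29, 88.98,
               88.94, 89.52, 89.08, 88.76, 88.94, 89.44, 87.99, 88.69, 89.53, 88.48]
    print(best_epoch(dev_fb1))
```

Inside `train_crf_lstm`, the decoded conlleval string could be passed to `parse_overall_fb1` after each epoch, and the model weights could be saved whenever the value improves, so that test predictions come from the best dev epoch rather than the last one. Whether that is worthwhile depends on how much the final epochs fluctuate.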



In [None]:
crf_lstm.eval()
crf_lstm.write_predictions(sentences_test, 'test_pred_cnn_lstm_crf.txt')
!wget https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/wnut16/conlleval.pl
!paste test test_pred_cnn_lstm_crf.txt | perl conlleval.pl -d "\t"

## Gradescope

This is the end. Congratulations!  

Now, follow the steps below to submit your homework in [Gradescope](https://www.gradescope.com/courses/939466):

1. Rename this ipynb file to 'CS4650_p2_GTusername.ipynb'. We recommend removing any extraneous cells and print statements, clearing all outputs, and using the Runtime --> Run all tool to make sure all output is up to date. Leaving comments in your code to help us understand your operations will also assist the teaching staff in grading; this is recommended but not required.
2. Click on the menu 'File' --> 'Download' --> 'Download .py'.
3. Click on the menu 'File' --> 'Download' --> 'Download .ipynb'.
4. Download the notebook as a .pdf document. Make sure the training and evaluation output is captured so we can see how the loss and accuracy change during training.
5. Download the predictions from Colab by clicking the folder icon on the left and finding them under Files, including 'test_pred_lstm.txt', 'test_pred_cnn_lstm.txt', and 'test_pred_cnn_lstm_crf.txt' (optional).
6. Upload all 5 or 6 files to Gradescope:
> CS4650_p2_GTusername.ipynb
>
> CS4650_p2_GTusername.py
>
> CS4650_p2_GTusername.pdf
>
> test_pred_lstm.txt
>
> test_pred_cnn_lstm.txt
>
> test_pred_cnn_lstm_crf.txt (optional)
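
Before uploading, a quick check that the files listed above are all in one folder can catch naming mistakes. The sketch below is only an illustration: `missing_submission_files` is a hypothetical helper, it assumes you have gathered the downloaded files into a single local directory, and it treats 'test_pred_cnn_lstm_crf.txt' as optional, as stated above.

```python
from pathlib import Path


def missing_submission_files(username, directory="."):
    """Return the required submission files that are not present in `directory`."""
    required = [
        f"CS4650_p2_{username}.ipynb",
        f"CS4650_p2_{username}.py",
        f"CS4650_p2_{username}.pdf",
        "test_pred_lstm.txt",
        "test_pred_cnn_lstm.txt",
        # 'test_pred_cnn_lstm_crf.txt' is optional, so it is not checked here.
    ]
    return [name for name in required if not (Path(directory) / name).is_file()]


if __name__ == "__main__":
    # Replace "GTusername" with your own GT username.
    missing = missing_submission_files("GTusername")
    print("Missing files:", missing if missing else "none")
```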


**Please make sure your implementation meets the accuracy requirements to get full credit.**

**Please make sure that you name the files as specified above. You will be able to see the test set accuracy for your predictions on the leaderboard. However, the final score will be assigned later based on accuracy in the notebook / PDF and implementation.**

You can submit multiple times before the deadline and choose which submission you want graded by going to `Submission History` on Gradescope.
