# 2-1. NER with LSTM + FC layer
For Part 2, I tried two different models. The first model is a LSTM unit + fully-connected layer and the second model is a Bidirectional LSTM unit with CRF. The second model learns the word embeddings through bidirectional LSTM and use the output word vectors as input sequences to the CRF based model to predict the probability over the tag sequences and the weights on features.  This notebook focuses on the first model.

In [45]:
import sys,os
import random
print(sys.path)
if ".." not in sys.path:
    sys.path.append("..")

['..', '', '/home/hayley/miniconda3/envs/fastai/lib/python36.zip', '/home/hayley/miniconda3/envs/fastai/lib/python3.6', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/lib-dynload', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/site-packages', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/site-packages/defusedxml-0.5.0-py3.6.egg', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/site-packages/IPython/extensions', '/home/hayley/.ipython', '..']


In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# sklearn imports
import sklearn
from sklearn.metrics import make_scorer, f1_score
# from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_predict

# sklearn_crfsuite imports
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# pytorch imports 
import torch
import torch.autograd as autograd
from torch.utils.data import Dataset, DataLoader

from torch import Tensor
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# train logging
import logging
from tqdm import trange
# import .utils as my_utils
from nlp_utils.model_evaluate import evaluate
from nlp_utils import model_utils
# set a random seed
torch.manual_seed(10);

# model saving and inspection
import joblib
import eli5

import pdb 

# auto-reloads
%reload_ext autoreload
%autoreload 2

In [3]:
print(f"sklearn version: {sklearn.__version__}")
print(f"pytorch version: {torch.__version__}")
# make sure we are using pytorch > 0.4.0

sklearn version: 0.20.0
pytorch version: 0.4.1


In [4]:
print(sys.path)

['', '/home/hayley/miniconda3/envs/fastai/lib/python36.zip', '/home/hayley/miniconda3/envs/fastai/lib/python3.6', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/lib-dynload', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/site-packages', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/site-packages/defusedxml-0.5.0-py3.6.egg', '/home/hayley/miniconda3/envs/fastai/lib/python3.6/site-packages/IPython/extensions', '/home/hayley/.ipython', '..']


## Globals

In [5]:
# Extra word for PAD (padding) and UNK (unrecognized word)
PAD_WORD = '<PAD>'
PAD_TAG = 'O'
UNK_WORD = '<UNK>'

START_TAG = '<START>'
STOP_TAG = '<STOP>'

## Load data

Unlike the CRF models, we don't need to hand-engineer the features for our word representation. We use the index mapper (from word to index and tag to index) as we defined in `create_vocab.ipynb` and in the Step section above. To summarize, we use the most common $N$ words out of the total words that occured in all the datasets provided (`eng:train`, `eng:testa`, `eng:testb`). In our experiments $N$ is set $15,000$ which is half of the number of words that occured at least once in any of the given datasets. I added several extra words: PAD_WORD and UNK_WORD. PAD_WORDs are appended to sentences in a mini-batch to make every sentence of equal length (as to the length of the longest sentence in the mini-batch). UNK_WORD is used to map words that are not in our vocab (because we excluded 15,000 uncommon words). In addition, I assigned a PAD_TAG to indicate the padding words, and START_TAG and STOP_TAG. 


## Load processed sentences and labels as well as the word2idx and tag2idx dictionaries.

In [6]:
train_sentences = joblib.load('../data/train_sentences.sav')
train_labels = joblib.load('../data/train_labels.sav')

dev_sentences = joblib.load('../data/testa_sentences.sav')
dev_labels = joblib.load('../data/testa_labels.sav')

test_sentences = joblib.load('../data/testb_sentences.sav')

# indice mappers
word2idx = joblib.load('../data/word2idx.sav')
tag2idx = joblib.load('../data/tag2idx.sav')

In [7]:
# Basic statistics on the datasets and lookup tables
print('train sentences: ', len(train_sentences), len(train_labels))
print('dev sentences: ', len(dev_sentences), len(dev_labels))
print('test sentences: ', len(test_sentences))
print('vocab size: ', len(word2idx))
print('number of tags: ', len(tag2idx))

train sentences:  10490 10490
dev sentences:  3464 3464
test sentences:  3683
vocab size:  15002
number of tags:  11


## Prepare data loaders for mini-batch of sequences

When we sample a batch of sentences, the sentences usually have a different length. In a batch of sentences, (`batch_sentences`) with correspoonding batch of tags `batch_tags`, we add PAD_WORD for sentences that have fewer words than SQE_LENGTH (set to maximum length of a sentence in the current `batch_sentences`). Below shows this processing in code. 

In [8]:
# This is just to show the processing and is not meant to actually run.
# Maximum sentence lengths in current batch 
batch_max_len = max([len(s) for s in batch_sentences])

# Intial matrix
batch_data = word2idx[PAD_WORD]*np.ones((len(batch_sentences), batch_max_len))
batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))

# Fill in the matrix with current batch sentences and labels
for i in range(len(batch_sentences)):
    curr_len = len(batch_sentences[i])
    batch_data[i][:curr_len] = batch_sentences[i]
    batch_labels[i][:curr_len] = batch_tags[i]

# Convert to torch.LongTensors (since each entry is an index)
batch_data = torch.LongTensor(batch_data), torch.LongTensor(batch_labels)
               

NameError: name 'batch_sentences' is not defined

As a result of the processing above, our `batch_data` now consists of the same length of sequences,

In [9]:
class SentenceDataLoader(object):
    """
    Loads a mini-batch of data at each iterations. 
    Stores the following properties:
    - dataset_params
    - word2idx: word to index mapping
    - tag2idx: tag to index mapping
    
    Args:
    - data_dir (str): path to the directory containing the dataset
    - parsm (dict): hyperparameters for data loading
    """
    def __init__(self, data_dir, word2idx_file, tag2idx_file):
        self.data_dir = data_dir
        
        word2idx_path = os.path.join(data_dir, word2idx_file)
        tag2idx_path = os.path.join(data_dir, tag2idx_file)
        self.word2idx = joblib.load(word2idx_path)
        self.tag2idx = joblib.load(tag2idx_path)
        self.vocab_size = len(self.word2idx)
        self.n_tags = len(self.tag2idx)
        
    def load_sentences_labels(self, sentences_path, labels_path, d):
        """
        Load sentences and labels for this dataset to the input dictionary d
        """
        d['data'] = joblib.load(sentences_path)
        d['labels'] = joblib.load(labels_path)
        d['size'] = len(d['data'])
        
        
    def load_data(self,types):
        """
        Load dataset(s) from data_dir.
        Args:
        - data_dir (str): path to the directory that contains dataset files
        - types (list): a list of string(s) which is one of 'train', 'dev', 'test'
        
        Returns:
        - data (dict): contains the sentences and labels for each type in types
        """            
        data = {}
        for split in ['train', 'dev', 'testa']:
            if split in types:
                sentences_path = os.path.join(self.data_dir, split+"_sentences.sav")
                labels_path = os.path.join(self.data_dir, split+"_labels.sav")
                print(sentences_path, "\n", labels_path)
                                     
                data[split] = {}
                self.load_sentences_labels(sentences_path, labels_path, data[split])
        return data
    
    def data_iterator(self, data, params, shuffle=False):
        """
        Returns a generator that yields a mini-batch of data (sentences and labels).
        It iteratates once over the data
        
        Args:
        - data (dict): a dictionary with keys of 'data', 'labels', 'size'
        - params (dict): hyperparams of the training 
        - shuffle (bool): to shuffle the mini-batch or not
        
        Yields:
        - batch_data (torch.LongTensor): word indices of size batch_size * seq_len 
        - batch_labels (torch.LongTensor): tag indices of size batch_size * seq_len
        """
        order = list(range(data['size']))
        if shuffle:
            random.seed(0)
            random.shuffle(order)
        
        # One iteration over data
        for i in range( (data['size']+1)//params.batch_size ):
            # Get a batch of sentences and tags 
            batch_sentences = [data['data'][i] for i in order[i*params.batch_size: (i+1)*params.batch_size]]
            batch_tags = [data['labels'][i] for i in order[i*params.batch_size: (i+1)*params.batch_size]]
            
            # Perform the two modification mentioned above
            # Append PAD words so that all sentences are of the same length in each batch
            # mark unseen word's tag as -1
            # Maximum sentence lengths in current batch 
            batch_max_len = max([len(s) for s in batch_sentences])

            # Intial matrix
            ## Use -1 for initial tags to differentiate it with tags from PAD_WORDs
            batch_data = self.word2idx[PAD_WORD]*np.ones((len(batch_sentences), batch_max_len))
            batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))
#             print(type(batch_data), type(batch_labels))


            # Fill in the matrix with current batch sentences and labels
            for i in range(len(batch_sentences)):
                curr_len = len(batch_sentences[i])
                batch_data[i][:curr_len] = batch_sentences[i]
                batch_labels[i][:curr_len] = batch_tags[i]

            # Convert to torch.LongTensors (since each entry is an index)
#             batch_data = torch.LongTensor(batch_data), torch.LongTensor(batch_labels)
            batch_data, batch_labels = torch.from_numpy(batch_data), torch.from_numpy(batch_labels)
            batch_data = batch_data.type(torch.LongTensor)
            batch_labels = batch_labels.type(torch.LongTensor)
#             print(type(batch_data), type(batch_labels))

            # If gpu available
            if params.cuda:
                batch_data, batch_labels = batch_data.cuda(), batch_labels.cuda()
            yield batch_data, batch_labels

Now we define a data_iterator function using this logic


In [10]:
# train_data contains train_sentences and tarin_labels
# params is a dictionary that contains a key of 'batch_size'
loader = SentenceDataLoader('../data', 'word2idx.sav', 'tag2idx.sav')

In [11]:
loader.vocab_size
# train_iterator = data_iterator(train_data, dataiter_params, shuffle=True)

15002

In [12]:
d = loader.load_data(['train', 'dev', 'testa'])

../data/train_sentences.sav 
 ../data/train_labels.sav
../data/dev_sentences.sav 
 ../data/dev_labels.sav
../data/testa_sentences.sav 
 ../data/testa_labels.sav


In [13]:
train_data = d['train']
dev_data = d['dev']
test_data = d['testa']

In [14]:
params = model_utils.Params('../data/base_params.json')
diter=loader.data_iterator(train_data, params)
bdata, blabels = next(diter)

In [15]:
bdata=bdata.type(torch.LongTensor);print(bdata.type())

torch.LongTensor


Great!!


## Model 1: RNN (LSTM + FC)
This model is largely taken from [cs230](https://github.com/cs230-stanford/cs230-code-examples/blob/master/pytorch/nlp/model/net.py).

In [105]:
class Net(nn.Module):
    def __init__(self, params):
        """
        RNN model (lstm) that predicts the NER tags for each token in the sentence. It is composed of:
        
        - an embedding layer: maps each index in range(params.vocab_size) to a params.embedding_dim vector
        - lstm: applying the LSTM on the sequential input returns an output for each token in the sentence
        - fc: a fully connected layer that converts the LSTM output for each token to a distribution over NER tags
        
        Args:
            params (dict): contains vocab_size, embedding_dim, lstm_hidden_dim
        """
        super(Net, self).__init__()

        self.embed = nn.Embedding(params.vocab_size, params.embedding_dim)
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)
        
    def forward(self, s):
        """
        Args:
            s (torch.tensor): a batch of sentences, of dimension batch_size x seq_len, where seq_len is
               the length of the longest sentence in the batch. For sentences shorter than seq_len, the remaining
               tokens are PAD_WORD. Each row is a sentence with each element corresponding to the index of
               the token in the vocab.
        Returns:
            out (torch.tensor): of dimension batch_size*seq_len x num_tags with the log probabilities of tokens for each token
                 of each sentence.
        Note: the dimensions after each step are provided
        """
        #                                -> batch_size x seq_len
        # apply the embedding layer that maps each token to its embedding
        s = self.embed(s)            # dim: batch_size x seq_len x embedding_dim

        # run the LSTM along the sentences of length seq_len
        s, _ = self.lstm(s)              # dim: batch_size x seq_len x lstm_hidden_dim

        # make the Variable contiguous in memory (a PyTorch artefact)
        s = s.contiguous()

        # reshape the Variable so that each row contains one token
        s = s.view(-1, s.shape[2])       # dim: batch_size*seq_len x lstm_hidden_dim

        # apply the fully connected layer and obtain the output (before softmax) for each token
        s = self.fc(s)                   # dim: batch_size*seq_len x num_tags

        # apply log softmax on each token's output (this is recommended over applying softmax
        # since it is numerically more stable)
        return F.log_softmax(s, dim=1)   # dim: batch_size*seq_len x num_tags


def loss_fn(outputs, labels):
    """
    Compute the cross entropy loss given outputs from the model and labels for all tokens. Exclude loss terms
    for PADding tokens.
    Args:
        outputs: (Variable) dimension batch_size*seq_len x num_tags - log softmax output of the model
        labels: (Variable) dimension batch_size x seq_len where each element is either a label in [0, 1, ... num_tag-1],
                or -1 in case it is a PADding token.
    Returns:
        loss: (Variable) cross entropy loss for all tokens in the batch
    Note: you may use a standard loss function from http://pytorch.org/docs/master/nn.html#loss-functions. This example
          demonstrates how you can easily define a custom loss function.
    """

    # reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)

    # since PADding tokens have label -1, we can generate a mask to exclude the loss from those terms
    mask = (labels >= 0).float()

    # indexing with negative values is not supported. Since PADded tokens have label -1, we convert them to a positive
    # number. This does not affect training, since we ignore the PADded tokens with the mask.
    labels = labels % outputs.shape[1]

    num_tokens = int(torch.sum(mask).data[0])

    # compute cross entropy loss for all tokens (except PADding tokens), by multiplying with mask.
    return -torch.sum(outputs[range(outputs.shape[0]), labels]*mask)/num_tokens

def accuracy(outputs, labels):
    """
    Compute the accuracy, given the outputs and labels for all tokens. Exclude PADding terms.
    Args:
        outputs: (np.ndarray) dimension batch_size*seq_len x num_tags - log softmax output of the model
        labels: (np.ndarray) dimension batch_size x seq_len where each element is either a label in
                [0, 1, ... num_tag-1], or -1 in case it is a PADding token.
    Returns: (float) accuracy in [0,1]
    """

    # reshape labels to give a flat vector of length batch_size*seq_len
#     print(f"label shape (bs, seq_len): {labels.shape}")
    labels = labels.ravel()

    # since PADding tokens have label -1, we can generate a mask to exclude the loss from those terms
    mask = (labels >= 0)

    # np.argmax gives us the class predicted for each token by the model
    outputs = np.argmax(outputs, axis=1)
    
    # for debug
#     print(f"outputs size: {outputs.shape}")
#     print(f"raveled label size: {labels.shape}")
#     pdb.set_trace()

    # compare outputs with labels and divide by number of tokens (excluding PADding tokens)
    return np.sum(outputs==labels)/float(np.sum(mask))

def f1_entity(outputs, labels, selected_tags=None):
    """
    Compute entity-level F1 score per class first and aggregate over classes using either micro or macro.
    Args:
        outputs: (np.ndarray) dimension batch_size*seq_len x num_tags - log softmax output of the model
        labels: (np.ndarray) dimension batch_size x seq_len where each element is either a label in
                [0, 1, ... num_tag-1], or -1 in case it is a PADding token.
    Returns: (float) average f1 score over class f1 scores
    """
    # Implement me!
    labels = labels.ravel()
    outputs = np.argmax(outputs, axis=1)
    
    # for debug
#     print(f"outputs size: {outputs.shape}")
#     print(f"raveled label size: {labels.shape}")
#     pdb.set_trace()

    if selected_tags is None:
        selected_tags = [0,1,2,3,4,5,6,7,8] # To exclude -1 (PAD_TAG); todo: don't hard-code it though..
        
    return f1_score(outputs, labels, labels=selected_tags, average='weighted')

    
# maintain all metrics required in this dictionary- these are used in the training and evaluation loops
metrics = {
    'accuracy': accuracy,
    # could add more metrics such as accuracy for each token type
    'f1_entity':f1_entity,
}


In [99]:
def test_metrics():
    # test metric functions
    tag_set = np.unique([t for tlist in train_labels for t in tlist])
    params = model_utils.Params('../data/base_params.json')
    diter=loader.data_iterator(train_data, params)
    with torch.no_grad():
        for bdata, blabels in diter:
            output = test_model(bdata)
            print(np.unique(blabels))
            acc = accuracy(output.numpy(), blabels.numpy())
            f1 = f1_entity(output.numpy(), blabels.numpy())
            print(f"acc: {acc}")
            print(f"f1: {f1}")
            pdb.set_trace()

[-1  1  3  4  5]
label shape (bs, seq_len): (5, 26)
outputs size: (130,)
raveled label size: (130,)
> <ipython-input-97-b18f79c6f00f>(105)accuracy()
-> return np.sum(outputs==labels)/float(np.sum(mask))


(Pdb)  c


outputs size: (130,)
raveled label size: (130,)
> <ipython-input-97-b18f79c6f00f>(122)f1_entity()
-> if selected_tags is None:


(Pdb)  c


acc: 1.0
f1: 0.6211985688729875
> <ipython-input-99-34c0270796e8>(6)<module>()
-> for bdata, blabels in diter:


(Pdb)  c


[-1  0  1  3  4  5  6]
label shape (bs, seq_len): (5, 34)
outputs size: (170,)
raveled label size: (170,)
> <ipython-input-97-b18f79c6f00f>(105)accuracy()
-> return np.sum(outputs==labels)/float(np.sum(mask))


(Pdb)  c


outputs size: (170,)
raveled label size: (170,)
> <ipython-input-97-b18f79c6f00f>(122)f1_entity()
-> if selected_tags is None:


(Pdb)  c


acc: 1.0
f1: 0.4363407430708469
> <ipython-input-99-34c0270796e8>(6)<module>()
-> for bdata, blabels in diter:


(Pdb)  c


[-1  0  1  3  4  5  6]
label shape (bs, seq_len): (5, 19)
outputs size: (95,)
raveled label size: (95,)
> <ipython-input-97-b18f79c6f00f>(105)accuracy()
-> return np.sum(outputs==labels)/float(np.sum(mask))


(Pdb)  c


outputs size: (95,)
raveled label size: (95,)
> <ipython-input-97-b18f79c6f00f>(122)f1_entity()
-> if selected_tags is None:


(Pdb)  c


acc: 1.0
f1: 0.5780205230056273
> <ipython-input-99-34c0270796e8>(6)<module>()
-> for bdata, blabels in diter:


(Pdb)  c


[-1  0  1  2  3  6]
label shape (bs, seq_len): (5, 24)
outputs size: (120,)
raveled label size: (120,)
> <ipython-input-97-b18f79c6f00f>(105)accuracy()
-> return np.sum(outputs==labels)/float(np.sum(mask))


(Pdb)  c


outputs size: (120,)
raveled label size: (120,)
> <ipython-input-97-b18f79c6f00f>(122)f1_entity()
-> if selected_tags is None:


(Pdb)  c


acc: 1.0
f1: 0.6915708812260536
> <ipython-input-99-34c0270796e8>(6)<module>()
-> for bdata, blabels in diter:


(Pdb)  c


[-1  0  1  2  3  4]
label shape (bs, seq_len): (5, 26)
outputs size: (130,)
raveled label size: (130,)
> <ipython-input-97-b18f79c6f00f>(105)accuracy()
-> return np.sum(outputs==labels)/float(np.sum(mask))


(Pdb)  c


outputs size: (130,)
raveled label size: (130,)
> <ipython-input-97-b18f79c6f00f>(122)f1_entity()
-> if selected_tags is None:


(Pdb)  c


acc: 0.9787234042553191
f1: 0.5113859136643947
> <ipython-input-99-34c0270796e8>(6)<module>()
-> for bdata, blabels in diter:


(Pdb)  q


BdbQuit: 

In [17]:
# define training 
def train(model, optimizer, loss_fn, data_iterator, metrics, params, num_steps):
    """Train the model on `num_steps` batches
    Args:
        model: (torch.nn.Module) the neural network
        optimizer: (torch.optim) optimizer for parameters of model
        loss_fn: a function that takes batch_output and batch_labels and computes the loss for the batch
        data_iterator: (generator) a generator that generates batches of data and labels
        metrics: (dict) a dictionary of functions that compute a metric using the output and labels of each batch
        params: (Params) hyperparameters
        num_steps: (int) number of batches to train on, each of size params.batch_size
    """

    # set model to training mode
    model.train()

    # summary for current training loop and a running average object for loss
    summ = []
    loss_avg = model_utils.RunningAverage()
    
    # Use tqdm for progress bar
    t = trange(num_steps) 
    for i in t:
        # fetch the next training batch

        train_batch, labels_batch = next(data_iterator)        

        # compute model output and loss
        output_batch = model(train_batch)
        loss = loss_fn(output_batch, labels_batch)

        # clear previous gradients, compute gradients of all variables wrt loss
        optimizer.zero_grad()
        loss.backward()

        # performs updates using calculated gradients
        optimizer.step()

        # Evaluate summaries only once in a while
        if i % params.save_summary_steps == 0:
            # extract data from torch Variable, move to cpu, convert to numpy arrays
            output_batch = output_batch.data.cpu().numpy()
            labels_batch = labels_batch.data.cpu().numpy()

            # compute all metrics on this batch
            summary_batch = {metric:metrics[metric](output_batch, labels_batch)
                             for metric in metrics}
            summary_batch['loss'] = loss.data[0]
            summ.append(summary_batch)

        # update the average loss
        loss_avg.update(loss.data[0])
        t.set_postfix(loss='{:05.3f}'.format(loss_avg()))

    # compute mean of all metrics in summary
    metrics_mean = {metric:np.mean([x[metric] for x in summ]) for metric in summ[0]} 
    metrics_string = " ; ".join("{}: {:05.3f}".format(k, v) for k, v in metrics_mean.items())
    logging.info("- Train metrics: " + metrics_string)
    

def train_and_evaluate(model, train_data, val_data, optimizer, loss_fn, metrics, params, model_dir, restore_file=None):
    """Train the model and evaluate every epoch.
    Args:
        model: (torch.nn.Module) the neural network
        train_data: (dict) training data with keys 'data' and 'labels'
        val_data: (dict) validaion data with keys 'data' and 'labels'
        optimizer: (torch.optim) optimizer for parameters of model
        loss_fn: a function that takes batch_output and batch_labels and computes the loss for the batch
        metrics: (dict) a dictionary of functions that compute a metric using the output and labels of each batch
        params: (Params) hyperparameters
        model_dir: (string) directory containing config, weights and log
        restore_file: (string) optional- name of file to restore from (without its extension .pth.tar)
    """
    # reload weights from restore_file if specified
#     if restore_file is not None:
#         restore_path = os.path.join(model_dir, restore_file + '.pth.tar')
#         logging.info("Restoring parameters from {}".format(restore_path))
#         nlp_utils.load_checkpoint(restore_path, model, optimizer)
        
    best_val_acc = 0.0

    for epoch in range(params.num_epochs):
        # Run one epoch
        logging.info("Epoch {}/{}".format(epoch + 1, params.num_epochs))

        # compute number of batches in one epoch (one full pass over the training set)
        num_steps = (params.train_size + 1) // params.batch_size
        train_data_iterator = data_loader.data_iterator(train_data, params, shuffle=True)
        train(model, optimizer, loss_fn, train_data_iterator, metrics, params, num_steps)
            
        # Evaluate for one epoch on validation set
        num_steps = (params.val_size + 1) // params.batch_size
        val_data_iterator = data_loader.data_iterator(val_data, params, shuffle=False)
        val_metrics = evaluate(model, loss_fn, val_data_iterator, metrics, params, num_steps)
        
        val_acc = val_metrics['f1_entity']
        is_best = val_acc >= best_val_acc

        # Save weights
        model_utils.save_checkpoint({'epoch': epoch + 1,
                               'state_dict': model.state_dict(),
                               'optim_dict' : optimizer.state_dict()}, 
                               is_best=is_best,
                               checkpoint=model_dir)
            
        # If best_eval, best_save_path        
        if is_best:
            logging.info("- Found new best f1_entity")
            best_val_acc = val_acc
            
            # Save best val metrics in a json file in the model directory
            best_json_path = os.path.join(model_dir, "metrics_val_best_weights.json")
            model_utils.save_dict_to_json(val_metrics, best_json_path)

        # Save latest val metrics in a json file in the model directory
        last_json_path = os.path.join(model_dir, "metrics_val_last_weights.json")
        model_utils.save_dict_to_json(val_metrics, last_json_path)

## Let's train this LSTM + FC model

In [107]:
# Set params
params = model_utils.Params('../data/rnn_1_params_2.json')

# set a directory to save our model
model_dir = '../log/progress_rnn1_10_09_12_12'
data_dir = '../data'

# use GPU if available
# params.cuda = torch.cuda.is_available()

# Set the random seed for reproducible experiments
torch.manual_seed(230)
# if params.cuda: torch.cuda.manual_seed(230)

# Set the logger
model_utils.set_logger(os.path.join(model_dir, 'train.log'))

# Create the input data pipeline
logging.info("Loading the datasets...")

# load data
data_loader = SentenceDataLoader(data_dir, 'word2idx.sav', 'tag2idx.sav')
# data = data_loader.load_data(datadir, ['train', 'val'])
# train_data = data['train']
# val_data = data['val']

# specify the train and val dataset sizes
params.train_size = train_data['size']
params.val_size = dev_data['size']

Loading the datasets...


In [None]:
# %debug
# Define the model and optimizer
model = Net(params).cuda() if params.cuda else Net(params)
optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)

# fetch loss function and metrics
loss_fn = loss_fn
metrics = metrics

# Train the model
logging.info("Starting training for {} epoch(s)".format(params.num_epochs))
train_and_evaluate(model, train_data, dev_data, optimizer, loss_fn, metrics, params, model_dir)

Starting training for 200 epoch(s)
Epoch 1/200




  0%|          | 0/3497 [00:00<?, ?it/s, loss=2.404][A[A

  0%|          | 0/3497 [00:00<?, ?it/s, loss=2.358][A[A

  0%|          | 2/3497 [00:00<03:59, 14.58it/s, loss=2.358][A[A

  0%|          | 2/3497 [00:00<03:59, 14.58it/s, loss=2.319][A[A

  0%|          | 2/3497 [00:00<03:59, 14.58it/s, loss=2.300][A[A

  0%|          | 4/3497 [00:00<03:42, 15.71it/s, loss=2.300][A[A

  0%|          | 4/3497 [00:00<03:42, 15.71it/s, loss=2.282][A[A

  0%|          | 4/3497 [00:00<03:42, 15.71it/s, loss=2.234][A[A

  0%|          | 4/3497 [00:00<03:42, 15.71it/s, loss=2.210][A[A

  0%|          | 7/3497 [00:00<03:23, 17.17it/s, loss=2.210][A[A

  0%|          | 7/3497 [00:00<03:23, 17.17it/s, loss=2.167][A[A

  0%|          | 7/3497 [00:00<03:23, 17.17it/s, loss=2.142][A[A

  0%|          | 9/3497 [00:00<03:17, 17.62it/s, loss=2.142][A[A

  0%|          | 9/3497 [00:00<03:17, 17.62it/s, loss=2.117][A[A

  0%|        

## Predict on `testb`

In [52]:
# Load the best model
model_dir = '../log/progress_rnn1'
# data_dir = '../data'
params = model_utils.Params('../data/rnn_1_params_1-Copy1.json')
old_model = Net(params)
new_model = Net(params)
old_checkpoint = model_utils.load_checkpoint('../log/progress_rnn1/old_best.pth.tar', old_model);
new_checkpoint = model_utils.load_checkpoint('../log/progress_rnn1/best.pth.tar', new_model);
print(old_checkpoint['epoch'], new_checkpoint['epoch'])

21 70


In [53]:
new_model

Net(
  (embed): Embedding(15002, 500)
  (lstm): LSTM(500, 200, batch_first=True)
  (fc): Linear(in_features=200, out_features=11, bias=True)
)

In [20]:
"""
    Evaluate the model on the test set.
"""
# Load the parameters
# model_dir = '../log/
# json_path = os.path.join(model_dir, 'rnn1_params_1.json')
# assert os.path.isfile(json_path), "No json configuration file found at {}".format(json_path)
# params = model_utils.Params(json_path)

SyntaxError: EOL while scanning string literal (<ipython-input-20-a703289ccb45>, line 5)

In [54]:
# load testb dataset
testb_sentences = joblib.load('../data/testb_sentences.sav')

In [55]:
# inverse dictionary
idx2tag = {i:t for t,i in tag2idx.items()}
idx2word = {i:w for w,i in word2idx.items()}

In [59]:
# test_model = old_model
test_model = new_model

In [24]:
#predict
with torch.no_grad():
    test_preds = []
    for i in range(len(testb_sentences)):
        prob = best_model(torch.LongTensor([test_sentences[i]])).numpy()
        tags_i = np.argmax(prob, axis=1)
        tags = [idx2tag[idx] for idx in tags_i]
        test_preds.append(tags)
        
        # debugging
#         print("="*50)
#         print(i, "\n", prob.shape)
#         print("number of words: ", len(tags))
#         print(len(tags), "\n", tags)
#         words = [idx2word[j] for j in test_sentences[i]]
#         print(words)
#         pdb.set_trace()
        

print("done")
print(len(test_preds))

done
3683


In [92]:
#check accuracy metric
tag_set = np.unique([t for tlist in train_labels for t in tlist])
params = model_utils.Params('../data/base_params.json')
diter=loader.data_iterator(train_data, params)
with torch.no_grad():
    for bdata, blabels in diter:
        output = test_model(bdata)
        print(np.unique(blabels))
        acc = accuracy(output.numpy(), blabels.numpy())
        f1 = f1_entity(output.numpy(), blabels.numpy())
        print(f"acc: {acc}")
        print(f"f1: {f1}")
        pdb.set_trace()


[-1  1  3  4  5]
label shape (bs, seq_len): (5, 26)
outputs size: (130,)
raveled label size: (130,)
> <ipython-input-88-a39a786fd046>(105)accuracy()
-> return np.sum(outputs==labels)/float(np.sum(mask))


(Pdb)  l


100  	    print(f"outputs size: {outputs.shape}")
101  	    print(f"raveled label size: {labels.shape}")
102  	    pdb.set_trace()
103  	
104  	    # compare outputs with labels and divide by number of tokens (excluding PADding tokens)
105  ->	    return np.sum(outputs==labels)/float(np.sum(mask))
106  	
107  	def f1_entity(outputs, labels, selected_tags=None):
108  	    """
109  	    Compute entity-level F1 score per class first and aggregate over classes using either micro or macro.
110  	    Args:


(Pdb)  c


outputs size: (130,)
raveled label size: (130,)


ValueError: Found input variables with inconsistent numbers of samples: [130, 5]

NameError: name 'blabels' is not defined

In [72]:
tag2idx

{'B-ORG': 0,
 'O': 1,
 'B-MISC': 2,
 'B-PER': 3,
 'I-PER': 4,
 'B-LOC': 5,
 'I-ORG': 6,
 'I-MISC': 7,
 'I-LOC': 8,
 '<START>': 9,
 '<STOP>': 10}

In [82]:
print(f"output size: {output.shape}")
print(f"batch labels: {blabels.shape}")
blabels = blabels.numpy().ravel()
output = np.argmax(output, axis=1)

In [84]:
print("-"*80)
print(f"output size: {output.shape}")
print(f"batch labels: {blabels.shape}")
f1_score(output, blabels, labels=tag_set, average='weighted')
    

--------------------------------------------------------------------------------
output size: torch.Size([130])
batch labels: (130,)


0.6211985688729875

In [85]:
output

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 5, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 1, 5, 1, 1, 3, 4, 1, 5, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [78]:
    # reshape labels to give a flat vector of length batch_size*seq_len
    print(f"label shape (bs, seq_len): {labels.shape}")
    labels = labels.ravel()

    # since PADding tokens have label -1, we can generate a mask to exclude the loss from those terms
    mask = (labels >= 0)

    # np.argmax gives us the class predicted for each token by the model
    outputs = np.argmax(outputs, axis=1)
    print(f"outputs size: {outputs.shape}")
    print(f"raveled label size: {labels.shape}")
    pdb.set_trace()

    # compare outputs with labels and divide by number of tokens (excluding PADding tokens)
    return np.sum(outputs==labels)/float(np.sum(mask))

torch.Size([130, 11])
torch.Size([5, 26])


## Write predictions to a new column in `testb` file.

In [41]:
# First, convert the tagging scheme from BIO to IBO
from nlp_utils import data_converter

test_preds_ibo = data_converter.tags_to_conll(test_preds)

In [42]:
# Let's add the predicted labels for all words
testb_data = data_converter.read_conll('../data/eng.testb')[1:]

In [43]:
augmented = data_converter.add_column(testb_data, test_preds_ibo)
print("Sanity check: ", len(augmented))

Sanity check:  3683


In [44]:
_ = data_converter.conll_to_data_stream(augmented, write_to_file='./testb_rnn1_preds.txt');