## Documentation

### Task
The syntax of a natural language, like the syntax of a programming language, involves the arrangement of tokens into meaningful groups. Phrasal chunking is the task of finding non-recursive syntactic groups of words.
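To make the idea of non-recursive groups concrete, here is a minimal sketch that turns per-word B-/I-/O chunk tags (the tag style that appears in the training data) into flat chunks. The helper name `bio_to_chunks` and the shortened example sentence are illustrative assumptions, not part of the provided code, and this is a simplification of what a full evaluator such as `conlleval` does.

```python
def bio_to_chunks(words, tags):
    """Group words into (chunk_type, [words]) spans from B-/I-/O tags."""
    chunks = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag == "O":
            prefix, chunk_type = "O", None
        else:
            prefix, chunk_type = tag.split("-", 1)
        # A chunk continues only on an I- tag of the same type.
        continues = prefix == "I" and chunk_type == current_type
        if not continues:
            # Close the open chunk (if any) and start a new one.
            if current_type is not None:
                chunks.append((current_type, current_words))
            current_type, current_words = chunk_type, []
        if chunk_type is not None:
            current_words.append(word)
    if current_type is not None:
        chunks.append((current_type, current_words))
    return chunks


# Shortened, hypothetical example sentence
words = ["the", "pound", "is", "widely", "expected", "."]
tags = ["B-NP", "I-NP", "B-VP", "I-VP", "I-VP", "O"]
chunks = bio_to_chunks(words, tags)
print(chunks)
```

Each word belongs to at most one chunk and chunks never nest, which is what "non-recursive" means here.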

In this task we are given a clean training set and a dirty test set in which noise has been added to the text. We are supposed to implement a character-level method that creates three vectors for each word: two separate one-hot vectors and one multi-hot vector.

The 3 vectors are:

First Char (One Hot)

Middle (Multi Hot)

Last (One Hot)
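As a rough illustration of the three parts, the sketch below splits a word into first, middle, and last characters and shows that the encoding ignores the order of the middle characters. The function names are hypothetical, words are assumed to be non-empty, and the swapped-letter example is only an assumed kind of noise; the source does not say what noise the test set actually contains.

```python
from collections import Counter


def split_word(word):
    """Return (first, middle, last); missing parts are empty strings."""
    first = word[0]
    last = word[-1] if len(word) >= 2 else ""
    middle = word[1:-1] if len(word) >= 3 else ""
    return first, middle, last


def encode(word):
    """Order-free view of the word: first char, middle char counts, last char."""
    first, middle, last = split_word(word)
    # Counting the middle characters is what the multi-hot vector stores.
    return first, Counter(middle), last


print(split_word("pound"))
print(encode("pound") == encode("ponud"))  # middle letters swapped
```

Because the middle vector only counts characters, any reordering of the interior letters maps to the same representation, while changes to the first or last character do not.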

### Method

These vectors are concatenated with the word embedding for each word and then fed through an LSTM. At each time step, the LSTM outputs a tag for that word.

As an alternative to feeding the raw one-hot and multi-hot vectors directly, we can add a character embedding matrix for them and learn it during training. Applying it is a basic matrix multiplication. Both methods were implemented, and the character embedding matrix gave us better results.
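The matrix multiplication mentioned above has a simple interpretation: a one-hot vector selects one row of the embedding matrix, and a multi-hot vector sums the rows of the characters it counts. A minimal pure-Python sketch, with a made-up 4-character alphabet and 2-dimensional embeddings (all values are hypothetical):

```python
def count_vector(chars, char_to_ix):
    """One-hot (single char) or multi-hot (several chars) count vector."""
    vec = [0.0] * len(char_to_ix)
    for c in chars:
        vec[char_to_ix[c]] += 1.0
    return vec


def matvec(vector, matrix):
    """Multiply a length-V row vector by a V x D matrix, giving length D."""
    dims = len(matrix[0])
    return [sum(vector[i] * matrix[i][d] for i in range(len(vector)))
            for d in range(dims)]


# Hypothetical alphabet and embedding matrix (one row per character)
char_to_ix = {"a": 0, "b": 1, "c": 2, "d": 3}
embedding = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.5]]

first_vec = matvec(count_vector("c", char_to_ix), embedding)     # row of "c"
middle_vec = matvec(count_vector("abb", char_to_ix), embedding)  # a + 2 * b
print(first_vec)
print(middle_vec)
```

In the notebook the same product is computed with `torch.matmul`, so the embedding rows receive gradients during training.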

In [27]:
from chunker import *
import os

## Run the default solution on dev

In [28]:
chunker = LSTMTagger(os.path.join('../data', 'train.txt.gz'), os.path.join('../data', 'chunker'), '.tar')
decoder_output = chunker.decode('../data/input/dev.txt')

100%|██████████| 1027/1027 [00:01<00:00, 822.65it/s]


## Evaluate the default output

In [29]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('../data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11672 phrases; correct: 8568.
accuracy:  84.35%; (non-O)
accuracy:  85.65%; precision:  73.41%; recall:  72.02%; FB1:  72.71
             ADJP: precision:  36.49%; recall:  11.95%; FB1:  18.00  74
             ADVP: precision:  71.36%; recall:  39.45%; FB1:  50.81  220
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  70.33%; recall:  76.80%; FB1:  73.42  6811
               PP: precision:  92.40%; recall:  87.14%; FB1:  89.69  2302
              PRT: precision:  65.00%; recall:  57.78%; FB1:  61.18  40
             SBAR: precision:  84.62%; recall:  41.77%; FB1:  55.93  117
               VP: precision:  63.66%; recall:  58.25%; FB1:  60.83  2108


(73.40644276901988, 72.02420981842637, 72.70875763747455)
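The summary numbers can be reproduced from the counts on the first output line. This is just the standard precision / recall / F1 arithmetic written out; the variable names are mine.

```python
total_phrases = 11896    # phrases in the reference
found_phrases = 11672    # phrases predicted by the model
correct_phrases = 8568   # predicted phrases that match the reference

precision = 100.0 * correct_phrases / found_phrases
recall = 100.0 * correct_phrases / total_phrases
fb1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(fb1, 2))
```

The three values correspond to the tuple returned by `conlleval.evaluate` above.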

### Looking at given code to figure out what's going on

The next few cells need this function to read the CoNLL-format data.

In [30]:
import os, sys, gzip, logging
import re

def read_conll(handle, input_idx=0, label_idx=2):
    conll_data = []
    contents = re.sub(r'\n\s*\n', r'\n\n', handle.read())
    contents = contents.rstrip()
    for sent_string in contents.split('\n\n'):
        annotations = list(zip(*[ word_string.split() for word_string in sent_string.split('\n') ]))
        assert(input_idx < len(annotations))
        if label_idx < 0:
            conll_data.append( annotations[input_idx] )
            logging.info("CoNLL: {}".format( " ".join(annotations[input_idx])))
        else:
            assert(label_idx < len(annotations))
            conll_data.append(( annotations[input_idx], annotations[label_idx] ))
            logging.info("CoNLL: {} ||| {}".format( " ".join(annotations[input_idx]), " ".join(annotations[label_idx])))
    return conll_data
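To see what the `zip(*...)` line in the reader is doing, here is a small self-contained sketch on an in-memory sample. The two sentences are made up, and the three-column layout (word, POS tag, chunk tag) is assumed from the printed annotations further down.

```python
import io
import re

# Hypothetical two-sentence sample; sentences are separated by a blank line
sample = io.StringIO(
    "the DT B-NP\n"
    "pound NN I-NP\n"
    "\n"
    "is VBZ B-VP\n"
    "expected VBN I-VP\n"
)

contents = re.sub(r"\n\s*\n", "\n\n", sample.read()).rstrip()
sentences = []
for sent_string in contents.split("\n\n"):
    rows = [line.split() for line in sent_string.split("\n")]
    columns = list(zip(*rows))  # transpose: one tuple per column
    sentences.append((columns[0], columns[2]))  # (words, chunk tags)

print(sentences)
```

Transposing the rows gives one tuple per column, which is why `input_idx=0` selects the words and `label_idx=2` selects the chunk tags.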


First we need to determine the number of unique characters.

In class, Anoop seemed to suggest it is 100 (the instructions later changed to just using the Python symbols; I have not updated this notebook to match).

Just loop through the words and add each character to a set.

In [31]:
trainfile = '../data/train.txt.gz'

chars = set()
with gzip.open(trainfile, 'rt') as f:
    contents = re.sub(r'\n\s*\n', r'\n\n', f.read())
    contents = contents.rstrip()
    for sent_string in contents.split('\n\n'):
        # Tuple of (words, labels)
        annotations = list(zip(*[ word_string.split() for word_string in sent_string.split('\n') ]))
        words = annotations[0]
        for word in words:
            for char in word:
                chars.add(char)

chars = list(chars)
chars = sorted(chars)
print(len(chars))
print(chars)
charToDex = {char: dex for dex, char in enumerate(chars)}

print(charToDex)

81
['!', '#', '$', '%', '&', "'", '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
{'!': 0, '#': 1, '$': 2, '%': 3, '&': 4, "'": 5, '*': 6, ',': 7, '-': 8, '.': 9, '/': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '=': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 'M': 37, 'N': 38, 'O': 39, 'P': 40, 'Q': 41, 'R': 42, 'S': 43, 'T': 44, 'U': 45, 'V': 46, 'W': 47, 'X': 48, 'Y': 49, 'Z': 50, '[': 51, '\\': 52, ']': 53, '`': 54, 'a': 55, 'b': 56, 'c': 57, 'd': 58, 'e': 59, 'f': 60, 'g': 61, 'h': 62, 'i': 63, 'j': 64, 'k': 65, 'l':

Seems short but also complete? Where is he getting the extra 19 characters from?

It turns out he is using Python's built-in string constants.

Using a single random vector for unknown characters instead of 19 is likely better anyway.

I don't know of any work that uses 19 unknowns.

We can just make 3 separate vectors for each word (one-hot, multi-hot, one-hot).

Multiply them by the character embedding matrix, then use a torch view to rearrange.

Say we have 3 words: that gives a 9 x 81 matrix. Multiplying by the embedding matrix gives 9 x embDim. Then `x.view(dim[0] // 3, 3, embDim).sum(1)` gives one embedding per word.
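The regrouping step can be sketched without PyTorch. Below, a flat list of `3 * N` rows (first, middle, last for each word) is grouped back into words and then either summed, as in the note above, or concatenated, as the model cells later in the notebook do. The numbers and helper names are hypothetical.

```python
def group_by_three(rows):
    """Split a flat list of 3*N rows into N groups of (first, middle, last)."""
    assert len(rows) % 3 == 0
    return [rows[i:i + 3] for i in range(0, len(rows), 3)]


def sum_parts(group):
    """Element-wise sum of the three part vectors (like .sum(1))."""
    return [sum(values) for values in zip(*group)]


def concat_parts(group):
    """Concatenate the three part vectors (like .view(n_words, -1))."""
    return [value for part in group for value in part]


# 2 words x 3 parts, embedding size 2 (made-up values)
rows = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0],
        [2.0, 2.0], [0.0, 0.0], [1.0, 3.0]]

summed = [sum_parts(g) for g in group_by_three(rows)]
concatenated = [concat_parts(g) for g in group_by_three(rows)]
print(summed)
print(concatenated)
```

Summing keeps the per-word vector at the embedding size but discards which part a character came from; concatenating triples the width and keeps first, middle, and last distinct.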

In [32]:
import numpy as np

trainfile = '../data/train.txt.gz'

with gzip.open(trainfile, 'rt') as f:
    contents = re.sub(r'\n\s*\n', r'\n\n', f.read())
    contents = contents.rstrip()
    for sent_string in contents.split('\n\n'):
        # Tuple of (words, labels)
        annotations = list(zip(*[ word_string.split() for word_string in sent_string.split('\n') ]))
        
        print(annotations)
        print(annotations[0][3])
        word = annotations[0][3]
        
        oneHot = np.zeros((3, len(charToDex)))
        
        if len(word) >= 3:    
            first = word[0]
            last = word[-1]
            mid = word[1:-1]
            print("Word: {} First: {} Middle: {} Last: {}".format(word, first, mid, last))

            # Could be more than one so just add
            for c in mid:
                oneHot[1, charToDex[c]] += 1.0
                
            oneHot[2, charToDex[last]] += 1.0
        elif len(word) == 2:
            first = word[0]
            last = word[-1]
            oneHot[2, charToDex[last]] += 1.0
        else:
            first = word[0]
            
        oneHot[0, charToDex[first]] += 1.0
        
        print(oneHot)
        break
        

[('[UNK]', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', '[UNK]', '.'), ('NN', 'IN', 'DT', 'NN', 'VBZ', 'RB', 'VBN', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NN', 'NNS', 'IN', 'NNP', ',', 'JJ', 'IN', 'NN', 'NN', ',', 'VB', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', 'CC', 'NNP', 'POS', 'JJ', 'NNS', '.'), ('B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'I-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-SBAR', 'B-NP', 'I-NP', 'B-PP', 'B-NP', 'O', 'B-ADJP', 'B-PP', 'B-NP', 'B-NP', 'O', 'B-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'I-NP', 'O')]
pound
Word: pound First: p Middle: oun Last: d
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

The above looks good to me. Let's make this into a function.

In [None]:
def read_conll_char(handle, input_idx=0, label_idx=2):
    conll_data = []
    contents = re.sub(r'\n\s*\n', r'\n\n', handle.read())
    contents = contents.rstrip()
    for sent_string in contents.split('\n\n'):
        annotations = list(zip(*[ word_string.split() for word_string in sent_string.split('\n') ]))
        assert(input_idx < len(annotations))
        if label_idx < 0:
            conll_data.append( annotations[input_idx] )
            logging.info("CoNLL: {}".format( " ".join(annotations[input_idx])))
        else:
            assert(label_idx < len(annotations))
            
            charTups = []
            for word in annotations[input_idx]:
        
                first = mid = last = None
                refLen = len(word)
                if refLen >= 3:    
                    first = word[0]
                    last = word[-1]
                    mid = word[1:-1]
                elif refLen == 2:
                    first = word[0]
                    last = word[-1]
                else:
                    first = word[0]

                charTups.append( (first, mid, last, refLen) )
                
            conll_data.append( ( annotations[input_idx], annotations[label_idx] , charTups) )
            logging.info("CoNLL: {} ||| {}".format( " ".join(annotations[input_idx]), " ".join(annotations[label_idx])))
            
    return conll_data


In [34]:
with gzip.open(trainfile, 'rt') as f:
    conll_w_char = read_conll_char(f)
    
print(len(conll_w_char))
print(conll_w_char[0][0])
print()
print(conll_w_char[0][2])


8936
('[UNK]', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', '[UNK]', '.')

[('[', 'UNK', ']', 5), ('i', None, 'n', 2), ('t', 'h', 'e', 3), ('p', 'oun', 'd', 5), ('i', None, 's', 2), ('w', 'idel', 'y', 6), ('e', 'xpecte', 'd', 8), ('t', None, 'o', 2), ('t', 'ak', 'e', 4), ('a', 'nothe', 'r', 7), ('s', 'har', 'p', 5), ('d', 'iv', 'e', 4), ('i', None, 'f', 2), ('t', 'rad', 'e', 5), ('f', 'igure', 's', 7), ('f', 'o', 'r', 3), ('S', 'eptembe', 'r', 9), (',', None, None, 1), ('d', 'u', 'e', 3), ('f', 'o', 'r', 3), ('r', 'eleas', 'e', 7), ('t', 'omorro', 'w', 8), (',', None, None, 1), ('f', 'ai', 'l', 4), ('t', None, 'o', 2), ('s', 'ho', 'w', 4), ('a', None, None, 1), ('s', 'ubstantia', 'l', 11), ('i', 'mprovemen', 't', 11), ('f', 'ro', 'm', 4), 

Need to create charToDex with annotation format

In [35]:
# Need to create charToDex with annotation format
chars = set()
for annDex, ann in enumerate(conll_w_char):
    if (annDex % 500) == 0:
        print("On [{}/{}]".format(annDex, len(conll_w_char)))
    # Tuple of (words, labels, charTups)
    charTups = ann[-1]
    #print(charTups)
    
    for ct in charTups:
        # Don't need length
        chars.add(ct[0])
        
        if ct[2] is not None:
            chars.add(ct[2])
        
        if ct[1] is not None:
            for char in ct[1]:
                chars.add(char)
                
chars = list(chars)
chars = sorted(chars)
print(len(chars))
print(chars)
charToDex = {char: dex for dex, char in enumerate(chars)}
# Seems short but also complete? Where is he getting these extra 19 characters from?
print(charToDex)

On [0/8936]
On [500/8936]
On [1000/8936]
On [1500/8936]
On [2000/8936]
On [2500/8936]
On [3000/8936]
On [3500/8936]
On [4000/8936]
On [4500/8936]
On [5000/8936]
On [5500/8936]
On [6000/8936]
On [6500/8936]
On [7000/8936]
On [7500/8936]
On [8000/8936]
On [8500/8936]
81
['!', '#', '$', '%', '&', "'", '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
{'!': 0, '#': 1, '$': 2, '%': 3, '&': 4, "'": 5, '*': 6, ',': 7, '-': 8, '.': 9, '/': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '=': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 

Might as well try to integrate with the other code as much as possible 

In [36]:
# Might as well try to integrate with the other code as much as possible 
word_to_ix = {}
tag_to_ix = {}
ix_to_tag = []
chars = set()

for sent, tags, charTups in conll_w_char:
    
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
            
    for tag in tags:
        if tag not in tag_to_ix:
            tag_to_ix[tag] = len(tag_to_ix)
            ix_to_tag.append(tag)

    for ct in charTups:
        # Don't need length
        chars.add(ct[0])
        
        if ct[2] is not None:
            chars.add(ct[2])
        
        if ct[1] is not None:
            for char in ct[1]:
                chars.add(char)
                
chars = list(chars)
chars = sorted(chars)
print(len(chars))
print(chars)
charToDex = {char: dex for dex, char in enumerate(chars)}
# Seems short but also complete? Where is he getting these extra 19 characters from?
print(charToDex)
charToDex['unk'] = len(charToDex)

81
['!', '#', '$', '%', '&', "'", '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
{'!': 0, '#': 1, '$': 2, '%': 3, '&': 4, "'": 5, '*': 6, ',': 7, '-': 8, '.': 9, '/': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, ';': 22, '=': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 'M': 37, 'N': 38, 'O': 39, 'P': 40, 'Q': 41, 'R': 42, 'S': 43, 'T': 44, 'U': 45, 'V': 46, 'W': 47, 'X': 48, 'Y': 49, 'Z': 50, '[': 51, '\\': 52, ']': 53, '`': 54, 'a': 55, 'b': 56, 'c': 57, 'd': 58, 'e': 59, 'f': 60, 'g': 61, 'h': 62, 'i': 63, 'j': 64, 'k': 65, 'l':

In [37]:
def prepare_sequence(seq, to_ix, unk):

    if type(seq[0]) == tuple:
        charOHs = []
        for charTup in seq:
            oneHot = np.zeros((3, len(to_ix)))
            # Last item in tuple was saved as word len
            refLen = charTup[-1]
            if refLen >= 3:    
                first = charTup[0]
                mid = charTup[1]
                last = charTup[2] 
                
                # Could be more than one so just add
                # (look characters up in the to_ix argument, not the global charToDex)
                for c in mid:
                    if c not in to_ix:
                        c = "unk"
                    oneHot[1, to_ix[c]] += 1.0

                if last not in to_ix:
                    last = "unk"
                oneHot[2, to_ix[last]] += 1.0
            elif refLen == 2:
                first = charTup[0]
                last = charTup[2]
                
                if last not in to_ix:
                    last = "unk"
                    
                oneHot[2, to_ix[last]] += 1.0
            else:
                first = charTup[0]
            
            if first not in to_ix:
                first = "unk"
                
            oneHot[0, to_ix[first]] += 1.0
            
            charOHs.append(oneHot)
        charOHs = np.stack(charOHs)
        return torch.from_numpy(charOHs).type(torch.FloatTensor)
    else:
        idxs = []
        if unk not in to_ix:
            idxs = [to_ix[w] for w in seq]
        else:
            idxs = [to_ix[w] for w in map(lambda w: unk if w not in to_ix else w, seq)]
        return torch.tensor(idxs, dtype=torch.long)


In [38]:
charSparse = prepare_sequence(conll_w_char[0][2], charToDex, unk=None)
print(charSparse.shape)
print(charSparse.view(-1,charSparse.shape[-1]).shape)
print(len(conll_w_char[0][2]))

torch.Size([37, 3, 82])
torch.Size([111, 82])
37
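The last dimension is 82 rather than 81 because of the extra `'unk'` entry added to `charToDex`. A minimal sketch of that unknown-character fallback, using a tiny made-up vocabulary (the helper names are hypothetical):

```python
def build_char_index(words, unk="unk"):
    """Map each character seen in `words` to an index, plus one unknown slot."""
    chars = sorted({c for w in words for c in w})
    char_to_ix = {c: i for i, c in enumerate(chars)}
    # The multi-character key cannot collide with any single character.
    char_to_ix[unk] = len(char_to_ix)
    return char_to_ix


def lookup(char, char_to_ix, unk="unk"):
    """Index of `char`, falling back to the unknown slot if it was never seen."""
    return char_to_ix.get(char, char_to_ix[unk])


char_to_ix = build_char_index(["pound", "in"])
print(char_to_ix)
print(lookup("p", char_to_ix), lookup("Z", char_to_ix))
```

Any character that only shows up in the noisy test data is mapped to the shared unknown index instead of raising a `KeyError`.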


Modify the original code now

In [39]:
# Code adapted from original code by Robert Guthrie

import numpy as np
import os, sys, optparse, gzip, re, logging
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import tqdm

def read_conll(handle, input_idx=0, label_idx=2):
    conll_data = []
    contents = re.sub(r'\n\s*\n', r'\n\n', handle.read())
    contents = contents.rstrip()
    for sent_string in contents.split('\n\n'):
        annotations = list(zip(*[ word_string.split() for word_string in sent_string.split('\n') ]))
        assert(input_idx < len(annotations))
        if label_idx < 0:
            
            charTups = []
            for word in annotations[input_idx]:
        
                first = mid = last = None
                refLen = len(word)
                if refLen >= 3:    
                    first = word[0]
                    last = word[-1]
                    mid = word[1:-1]
                elif refLen == 2:
                    first = word[0]
                    last = word[-1]
                else:
                    first = word[0]

                charTups.append( (first, mid, last, refLen) )
                
            conll_data.append( ( annotations[input_idx], charTups) )
            
            #conll_data.append( annotations[input_idx] )
            logging.info("CoNLL: {}".format( " ".join(annotations[input_idx])))
        else:
            assert(label_idx < len(annotations))
            
            charTups = []
            for word in annotations[input_idx]:
        
                first = mid = last = None
                refLen = len(word)
                if refLen >= 3:    
                    first = word[0]
                    last = word[-1]
                    mid = word[1:-1]
                elif refLen == 2:
                    first = word[0]
                    last = word[-1]
                else:
                    first = word[0]

                charTups.append( (first, mid, last, refLen) )
                
            conll_data.append( ( annotations[input_idx], annotations[label_idx] , charTups) )
            logging.info("CoNLL: {} ||| {}".format( " ".join(annotations[input_idx]), " ".join(annotations[label_idx])))
            
    return conll_data

def prepare_sequence(seq, to_ix, unk):

    if type(seq[0]) == tuple:
        charOHs = []
        for charTup in seq:
            oneHot = np.zeros((3, len(to_ix)))
            # Last item in tuple was saved as word len
            refLen = charTup[-1]
            if refLen >= 3:    
                first = charTup[0]
                mid = charTup[1]
                last = charTup[2] 
                
                # Could be more than one so just add
                for c in mid:
                    if c not in to_ix:
                        c = "unk"
                    oneHot[1, to_ix[c]] += 1.0

                if last not in to_ix:
                    last = "unk"
                oneHot[2, to_ix[last]] += 1.0
            elif refLen == 2:
                first = charTup[0]
                last = charTup[2]
                
                if last not in to_ix:
                    last = "unk"
                    
                oneHot[2, to_ix[last]] += 1.0
            else:
                first = charTup[0]
            
            if first not in to_ix:
                first = "unk"
                
            oneHot[0, to_ix[first]] += 1.0
            
            charOHs.append(oneHot)
        charOHs = np.stack(charOHs)
        charOHs = torch.from_numpy(charOHs).type(torch.FloatTensor)
        charOHs = charOHs.view(-1, charOHs.shape[-1])
        return charOHs
    else:
        idxs = []
        if unk not in to_ix:
            idxs = [to_ix[w] for w in seq]
        else:
            idxs = [to_ix[w] for w in map(lambda w: unk if w not in to_ix else w, seq)]
        return torch.tensor(idxs, dtype=torch.long)

Need to implement the baseline model, which (oddly) just appends the raw one-hot character vectors to the word embedding.

In [40]:
class LSTMTaggerModel(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, char_size):
        torch.manual_seed(1)
        super(LSTMTaggerModel, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.char_embeddings = nn.Parameter(torch.zeros(char_size, embedding_dim))
        
        torch.nn.init.normal_(self.char_embeddings)
        
        # The LSTM takes the word embedding concatenated with the three raw
        # character vectors (first, middle, last), each of length char_size,
        # and outputs hidden states with dimensionality hidden_dim.
        # Note: char_embeddings above is not used by this baseline's forward pass.
        self.lstm = nn.LSTM(embedding_dim + char_size * 3, hidden_dim, bidirectional=False)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, charEmbed):
        embeds = self.word_embeddings(sentence)
        
        charEmbed = charEmbed.view(int(charEmbed.shape[0] / 3), 3, charEmbed.shape[1])
        charEmbed = charEmbed.view(charEmbed.shape[0], -1)
        
        embeds = torch.cat([embeds, charEmbed],-1)
        
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

class LSTMTagger:

    def __init__(self, trainfile, modelfile, modelsuffix, unk="[UNK]", epochs=10, embedding_dim=128, hidden_dim=64):
        self.unk = unk
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.epochs = epochs
        self.modelfile = modelfile
        self.modelsuffix = modelsuffix
        self.training_data = []
        if trainfile[-3:] == '.gz':
            with gzip.open(trainfile, 'rt') as f:
                self.training_data = read_conll(f)
        else:
            with open(trainfile, 'r') as f:
                self.training_data = read_conll(f)

        self.word_to_ix = {} # replaces words with an index (one-hot vector)
        self.tag_to_ix = {} # replace output labels / tags with an index
        self.ix_to_tag = [] # during inference we produce tag indices so we have to map it back to a tag

        chars = set()
        for sent, tags, charTups in self.training_data:
            
            for word in sent:
                if word not in self.word_to_ix:
                    self.word_to_ix[word] = len(self.word_to_ix)
                    
            for tag in tags:
                if tag not in self.tag_to_ix:
                    self.tag_to_ix[tag] = len(self.tag_to_ix)
                    self.ix_to_tag.append(tag)
                    
            for ct in charTups:
    
                chars.add(ct[0])

                if ct[2] is not None:
                    chars.add(ct[2])

                if ct[1] is not None:
                    for char in ct[1]:
                        chars.add(char)

        chars = list(chars)
        chars = sorted(chars)
        charToDex = {char: dex for dex, char in enumerate(chars)}
        charToDex['unk'] = len(charToDex)
        self.charToDex = charToDex
        
        # logging formats lazily with %-style placeholders, so each value needs a %s
        logging.info("word_to_ix: %s", self.word_to_ix)
        logging.info("tag_to_ix: %s", self.tag_to_ix)
        logging.info("ix_to_tag: %s", self.ix_to_tag)
        logging.info("char_to_dex: %s", self.charToDex)
        
        print("Creating Modified Model")
        self.model = LSTMTaggerModel(self.embedding_dim, self.hidden_dim, len(self.word_to_ix), len(self.tag_to_ix), len(self.charToDex))
        self.optimizer = optim.SGD(self.model.parameters(), lr=0.01)

    def argmax(self, seq, charTups):
        output = []
        with torch.no_grad():
            inputs = prepare_sequence(seq, self.word_to_ix, self.unk)
            charEmbeds = prepare_sequence(charTups, self.charToDex, "unk")
            tag_scores = self.model(inputs, charEmbeds)
            for i in range(len(inputs)):
                output.append(self.ix_to_tag[int(tag_scores[i].argmax(dim=0))])
        return output

    def train(self):
        loss_function = nn.NLLLoss()

        self.model.train()
        loss = float("inf")
        for epoch in range(self.epochs):
            for sentence, tags, charTups in tqdm.tqdm(self.training_data):
                # Step 1. Remember that Pytorch accumulates gradients.
                # We need to clear them out before each instance
                self.model.zero_grad()

                # Step 2. Get our inputs ready for the network, that is, turn them into
                # Tensors of word indices.
                sentence_in = prepare_sequence(sentence, self.word_to_ix, self.unk)
                targets = prepare_sequence(tags, self.tag_to_ix, self.unk)
                charEmbeds = prepare_sequence(charTups, self.charToDex, "unk")

                # Step 3. Run our forward pass.
                tag_scores = self.model(sentence_in, charEmbeds)

                # Step 4. Compute the loss, gradients, and update the parameters by
                #  calling optimizer.step()
                loss = loss_function(tag_scores, targets)
                loss.backward()
                self.optimizer.step()

            if epoch == self.epochs-1:
                epoch_str = '' # last epoch so do not use epoch number in model filename
            else:
                epoch_str = str(epoch)
            savefile = self.modelfile + epoch_str + self.modelsuffix
            print("saving model file: {}".format(savefile), file=sys.stderr)
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': self.model.state_dict(),
                        'optimizer_state_dict': self.optimizer.state_dict(),
                        'loss': loss,
                        'unk': self.unk,
                        'word_to_ix': self.word_to_ix,
                        'tag_to_ix': self.tag_to_ix,
                        'ix_to_tag': self.ix_to_tag,
                        'char_to_dex': self.charToDex
                    }, savefile)

    def decode(self, inputfile):
        if inputfile[-3:] == '.gz':
            with gzip.open(inputfile, 'rt') as f:
                input_data = read_conll(f, input_idx=0, label_idx=-1)
        else:
            with open(inputfile, 'r') as f:
                input_data = read_conll(f, input_idx=0, label_idx=-1)

        if not os.path.isfile(self.modelfile + self.modelsuffix):
            raise IOError("Error: missing model file {}".format(self.modelfile + self.modelsuffix))

        saved_model = torch.load(self.modelfile + self.modelsuffix)
        self.model.load_state_dict(saved_model['model_state_dict'])
        self.optimizer.load_state_dict(saved_model['optimizer_state_dict'])
        epoch = saved_model['epoch']
        loss = saved_model['loss']
        self.unk = saved_model['unk']
        self.word_to_ix = saved_model['word_to_ix']
        self.tag_to_ix = saved_model['tag_to_ix']
        self.ix_to_tag = saved_model['ix_to_tag']
        self.charToDex = saved_model['char_to_dex']
        self.model.eval()
        print("Decoding")
        decoder_output = []
        for sent, charTups in tqdm.tqdm(input_data):
            #print(sent)
            decoder_output.append(self.argmax(sent, charTups))
        return decoder_output
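The model returns log-probabilities (`F.log_softmax`), training minimizes `nn.NLLLoss` on them, and decoding takes the argmax per token. A dependency-free sketch of those three steps, with made-up scores and a hypothetical three-tag set:

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one token's tag scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def nll_loss(log_probs_per_token, targets):
    """Mean negative log-likelihood of the target tag at each token."""
    losses = [-lp[t] for lp, t in zip(log_probs_per_token, targets)]
    return sum(losses) / len(losses)


ix_to_tag = ["B-NP", "I-NP", "O"]                 # hypothetical tag set
tag_space = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]   # hypothetical scores, 2 tokens

log_probs = [log_softmax(row) for row in tag_space]
predicted = [ix_to_tag[max(range(len(row)), key=row.__getitem__)]
             for row in log_probs]
loss = nll_loss(log_probs, [0, 1])

print(predicted)
print(loss)
```

Averaging over tokens mirrors the default `reduction='mean'` of `nn.NLLLoss`.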


In [41]:
trainfile = os.path.join('../data', 'train.txt.gz')
modelsuffix = '.tar'
unk = '[UNK]'

chunker = LSTMTagger(trainfile, "charEmbedBase", modelsuffix, unk)
chunker.train()

Creating Modified Model


100%|██████████| 8936/8936 [01:11<00:00, 124.48it/s]
saving model file: charEmbedBase0.tar
100%|██████████| 8936/8936 [01:15<00:00, 118.42it/s]
saving model file: charEmbedBase1.tar
100%|██████████| 8936/8936 [01:15<00:00, 118.31it/s]
saving model file: charEmbedBase2.tar
100%|██████████| 8936/8936 [01:15<00:00, 118.27it/s]
saving model file: charEmbedBase3.tar
100%|██████████| 8936/8936 [01:18<00:00, 114.07it/s]
saving model file: charEmbedBase4.tar
100%|██████████| 8936/8936 [01:13<00:00, 121.98it/s]
saving model file: charEmbedBase5.tar
100%|██████████| 8936/8936 [01:13<00:00, 121.26it/s]
saving model file: charEmbedBase6.tar
100%|██████████| 8936/8936 [01:22<00:00, 107.88it/s]
saving model file: charEmbedBase7.tar
100%|██████████| 8936/8936 [01:14<00:00, 119.30it/s]
saving model file: charEmbedBase8.tar
100%|██████████| 8936/8936 [01:06<00:00, 135.09it/s]
saving model file: charEmbedBase.tar


In [42]:
decoder_output = chunker.decode('../data/input/dev.txt')

  7%|▋         | 71/1027 [00:00<00:01, 707.83it/s]

Decoding


100%|██████████| 1027/1027 [00:01<00:00, 706.24it/s]


In [43]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('../data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11983 phrases; correct: 9206.
accuracy:  86.85%; (non-O)
accuracy:  87.88%; precision:  76.83%; recall:  77.39%; FB1:  77.11
             ADJP: precision:  43.33%; recall:  17.26%; FB1:  24.68  90
             ADVP: precision:  66.55%; recall:  46.48%; FB1:  54.73  278
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  75.71%; recall:  80.49%; FB1:  78.02  6631
               PP: precision:  93.03%; recall:  86.44%; FB1:  89.62  2268
              PRT: precision:  65.22%; recall:  66.67%; FB1:  65.93  46
             SBAR: precision:  81.54%; recall:  44.73%; FB1:  57.77  130
               VP: precision:  67.56%; recall:  74.48%; FB1:  70.85  2540


(76.82550279562714, 77.38735709482178, 77.1054064240546)

This time, use learned character embeddings instead of the raw one-hot vectors. This makes more sense, I think.
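For reference, the LSTM input widths implied by the model definitions can be worked out directly. The values below come from the defaults and shapes shown in this notebook (`embedding_dim=128`, 82 character slots); the variable names are mine.

```python
embedding_dim = 128   # default in LSTMTagger.__init__
char_size = 82        # 81 training characters + 1 "unk" slot

# Baseline: word embedding + three raw character vectors
one_hot_input = embedding_dim + 3 * char_size
# Character embeddings, concatenated: word + first + middle + last
concat_embed_input = embedding_dim * 4
# Character embeddings, summed (the commented-out variant): word + summed chars
sum_embed_input = embedding_dim * 2

print(one_hot_input, concat_embed_input, sum_embed_input)
```

With these settings the baseline input is 374 wide, the concatenated-embedding input is 512 wide, and the summed variant would be 256 wide.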

In [44]:
class LSTMTaggerModel(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, char_size):
        torch.manual_seed(1)
        super(LSTMTaggerModel, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.char_embeddings = nn.Parameter(torch.zeros(char_size, embedding_dim))
        
        torch.nn.init.normal_(self.char_embeddings)
        
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        # If sum chars
        # self.lstm = nn.LSTM(embedding_dim * 2, hidden_dim, bidirectional=False)
        # If 4 unique embeds
        self.lstm = nn.LSTM(embedding_dim * 4, hidden_dim, bidirectional=False)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, charEmbed):
        embeds = self.word_embeddings(sentence)
        #print(embeds.shape)
        # embeds has shape [sentence_len, embedding_dim]; sentences are not batched here
        charEmbeds = torch.matmul(charEmbed, self.char_embeddings)
        #print(charEmbeds.shape)
        #charEmbeds = charEmbeds.view(int(charEmbeds.shape[0] / 3), 3, charEmbeds.shape[1]).sum(1)
        
        # Concat all this time
        charEmbeds = charEmbeds.view(int(charEmbeds.shape[0] / 3), 3, charEmbeds.shape[1])
        charEmbeds = charEmbeds.view(charEmbeds.shape[0], -1)
        
        #print(charEmbeds.shape)
        embeds = torch.cat([embeds, charEmbeds],-1)
        #print(embeds.shape)
        #print("Concat Done")
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

class LSTMTagger:

    def __init__(self, trainfile, modelfile, modelsuffix, unk="[UNK]", epochs=10, embedding_dim=128, hidden_dim=64):
        self.unk = unk
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.epochs = epochs
        self.modelfile = modelfile
        self.modelsuffix = modelsuffix
        self.training_data = []
        if trainfile[-3:] == '.gz':
            with gzip.open(trainfile, 'rt') as f:
                self.training_data = read_conll(f)
        else:
            with open(trainfile, 'r') as f:
                self.training_data = read_conll(f)

        self.word_to_ix = {} # replaces words with an index (one-hot vector)
        self.tag_to_ix = {} # replace output labels / tags with an index
        self.ix_to_tag = [] # during inference we produce tag indices so we have to map it back to a tag

        chars = set()
        for sent, tags, charTups in self.training_data:
            
            for word in sent:
                if word not in self.word_to_ix:
                    self.word_to_ix[word] = len(self.word_to_ix)
                    
            for tag in tags:
                if tag not in self.tag_to_ix:
                    self.tag_to_ix[tag] = len(self.tag_to_ix)
                    self.ix_to_tag.append(tag)
                    
            for ct in charTups:
    
                chars.add(ct[0])

                if ct[2] is not None:
                    chars.add(ct[2])

                if ct[1] is not None:
                    for char in ct[1]:
                        chars.add(char)

        chars = sorted(chars)
        charToDex = {char: dex for dex, char in enumerate(chars)}
        charToDex['unk'] = len(charToDex)
        self.charToDex = charToDex
        
        # logging uses %-style formatting, so each message needs a placeholder.
        logging.info("word_to_ix: %s", self.word_to_ix)
        logging.info("tag_to_ix: %s", self.tag_to_ix)
        logging.info("ix_to_tag: %s", self.ix_to_tag)
        logging.info("char_to_dex: %s", self.charToDex)
        
        print("Creating Modified Model")
        self.model = LSTMTaggerModel(self.embedding_dim, self.hidden_dim, len(self.word_to_ix), len(self.tag_to_ix), len(self.charToDex))
        self.optimizer = optim.SGD(self.model.parameters(), lr=0.01)

    def argmax(self, seq, charTups):
        output = []
        with torch.no_grad():
            inputs = prepare_sequence(seq, self.word_to_ix, self.unk)
            charEmbeds = prepare_sequence(charTups, self.charToDex, "unk")
            tag_scores = self.model(inputs, charEmbeds)
            for i in range(len(inputs)):
                output.append(self.ix_to_tag[int(tag_scores[i].argmax(dim=0))])
        return output

    def train(self):
        loss_function = nn.NLLLoss()

        self.model.train()
        loss = float("inf")
        for epoch in range(self.epochs):
            for sentence, tags, charTups in tqdm.tqdm(self.training_data):
                # Step 1. Remember that PyTorch accumulates gradients.
                # We need to clear them out before each training instance.
                self.model.zero_grad()

                # Step 2. Get our inputs ready for the network, that is, turn them into
                # Tensors of word indices.
                sentence_in = prepare_sequence(sentence, self.word_to_ix, self.unk)
                targets = prepare_sequence(tags, self.tag_to_ix, self.unk)
                charEmbeds = prepare_sequence(charTups, self.charToDex, "unk")

                # Step 3. Run our forward pass.
                tag_scores = self.model(sentence_in, charEmbeds)

                # Step 4. Compute the loss, gradients, and update the parameters by
                #  calling optimizer.step()
                loss = loss_function(tag_scores, targets)
                loss.backward()
                self.optimizer.step()

            if epoch == self.epochs-1:
                epoch_str = '' # last epoch so do not use epoch number in model filename
            else:
                epoch_str = str(epoch)
            savefile = self.modelfile + epoch_str + self.modelsuffix
            print("saving model file: {}".format(savefile), file=sys.stderr)
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': self.model.state_dict(),
                        'optimizer_state_dict': self.optimizer.state_dict(),
                        'loss': loss,
                        'unk': self.unk,
                        'word_to_ix': self.word_to_ix,
                        'tag_to_ix': self.tag_to_ix,
                        'ix_to_tag': self.ix_to_tag,
                        'char_to_dex': self.charToDex
                    }, savefile)

    def decode(self, inputfile):
        if inputfile[-3:] == '.gz':
            with gzip.open(inputfile, 'rt') as f:
                input_data = read_conll(f, input_idx=0, label_idx=-1)
        else:
            with open(inputfile, 'r') as f:
                input_data = read_conll(f, input_idx=0, label_idx=-1)

        if not os.path.isfile(self.modelfile + self.modelsuffix):
            raise IOError("Error: missing model file {}".format(self.modelfile + self.modelsuffix))

        saved_model = torch.load(self.modelfile + self.modelsuffix)
        self.model.load_state_dict(saved_model['model_state_dict'])
        self.optimizer.load_state_dict(saved_model['optimizer_state_dict'])
        epoch = saved_model['epoch']
        loss = saved_model['loss']
        self.unk = saved_model['unk']
        self.word_to_ix = saved_model['word_to_ix']
        self.tag_to_ix = saved_model['tag_to_ix']
        self.ix_to_tag = saved_model['ix_to_tag']
        self.charToDex = saved_model['char_to_dex']
        self.model.eval()
        print("Decoding")
        decoder_output = []
        for sent, charTups in tqdm.tqdm(input_data):
            #print(sent)
            decoder_output.append(self.argmax(sent, charTups))
        return decoder_output

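The code above consumes `charTups` without showing how they are built or turned into vectors, since that happens inside `read_conll` and `prepare_sequence`. The following is a minimal sketch of one way it could work, assuming each tuple is (first char, middle chars, last char) with `None` for missing parts, and that a missing part becomes an all-zero row. The helper names `split_word` and `encode_word` are hypothetical and are not part of the `chunker` module.

```python
def split_word(word):
    """Split a non-empty word into (first, middle, last); missing parts are None."""
    first = word[0]
    last = word[-1] if len(word) > 1 else None
    middle = word[1:-1] if len(word) > 2 else None
    return first, middle, last

def encode_word(word, char_to_dex, unk="unk"):
    """Return three indicator rows (first, middle, last) for one word."""
    rows = []
    for part in split_word(word):
        row = [0.0] * len(char_to_dex)
        for ch in part or "":  # a None part yields an all-zero row
            # Characters not seen in training fall back to the unk index.
            row[char_to_dex.get(ch, char_to_dex[unk])] = 1.0
        rows.append(row)
    return rows

# Toy character vocabulary (hypothetical); "!" is unseen and maps to "unk".
char_to_dex = {"a": 0, "c": 1, "t": 2, "unk": 3}
rows = encode_word("cat!", char_to_dex)
print(rows)
```

Stacking these three rows for every word in a sentence gives a matrix with `3 * len(sentence)` rows, which matches the reshape by 3 in `LSTMTaggerModel.forward`.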

In [None]:
trainfile = os.path.join('../data', 'train.txt.gz')
modelsuffix = '.tar'
unk = '[UNK]'

chunker = LSTMTagger(trainfile, "charEmbedMod.", modelsuffix, unk)
chunker.train()

  0%|          | 9/8936 [00:00<01:43, 86.48it/s]

Creating Modified Model


100%|██████████| 8936/8936 [01:08<00:00, 130.40it/s]
saving model file: charEmbedMod.0.tar
100%|██████████| 8936/8936 [01:14<00:00, 120.57it/s]
saving model file: charEmbedMod.1.tar
100%|██████████| 8936/8936 [01:12<00:00, 123.40it/s]
saving model file: charEmbedMod.2.tar
  6%|▌         | 525/8936 [00:03<01:07, 124.09it/s]

In [None]:
decoder_output = chunker.decode('../data/input/dev.txt')

In [None]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('../data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)