# chunker: default program

In [3]:
from default import *
import os

## Run the default solution on dev

In [4]:
chunker = LSTMTagger(os.path.join('data', 'train.txt.gz'), os.path.join('data', 'chunker'), '.tar')
decoder_output = chunker.decode('data/input/dev.txt')

100%|██████████| 1027/1027 [00:03<00:00, 304.95it/s]


## Evaluate the default output

In [5]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 12077 phrases; correct: 9383.
accuracy:  87.41%; (non-O)
accuracy:  88.41%; precision:  77.69%; recall:  78.88%; FB1:  78.28
             ADJP: precision:  45.00%; recall:  19.91%; FB1:  27.61  100
             ADVP: precision:  71.80%; recall:  47.99%; FB1:  57.53  266
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  76.63%; recall:  82.36%; FB1:  79.39  6704
               PP: precision:  91.33%; recall:  88.45%; FB1:  89.86  2364
              PRT: precision:  70.27%; recall:  57.78%; FB1:  63.41  37
             SBAR: precision:  77.62%; recall:  46.84%; FB1:  58.42  143
               VP: precision:  69.59%; recall:  74.39%; FB1:  71.91  2463


(77.69313571251139, 78.87525218560862, 78.27973136445169)
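
The returned tuple is (precision, recall, FB1). As a sanity check, the sketch below recomputes these values from the phrase counts in the report above, assuming the standard definitions of precision, recall and F1. The helper name `precision_recall_fb1` is hypothetical and is not part of `conlleval`.

```python
def precision_recall_fb1(correct, found, gold):
    """Return (precision, recall, FB1) as percentages from phrase counts."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    denom = precision + recall
    fb1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fb1

# Counts taken from the report: 11896 gold phrases, 12077 found, 9383 correct
scores = precision_recall_fb1(correct=9383, found=12077, gold=11896)
print(tuple(round(s, 2) for s in scores))  # (77.69, 78.88, 78.28)
```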

## Documentation

We have implemented:<br>
Option 1: A baseline model that concatenates character vectors with the word embeddings. Files: `chunker_baseline.py` and `default_baseline.py`<br>
Option 2: A model that concatenates the hidden state of an RNN, which takes the character vectors as input, with the word embeddings. Files: `chunker.py` and `default.py`<br>

#### Function for preparing character vector 
The `prepare_character_vectors(sentence, width=100)` method creates a character-level representation of each word in the sentence by concatenating three vectors of length `width`:

- `v1` is a one-hot vector for the first character of the word.
- `v2` is a count vector: for each character between the first and last, its index holds the number of times that character occurs in the word.
- `v3` is a one-hot vector for the last character of the word.

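```
import string
import torch

def prepare_character_vectors(sentence, width=100):
    # width matches len(string.printable) == 100: one slot per printable character
    character_vectors = []
    for word in sentence:
        v1 = torch.zeros(width)  # first character (one-hot)
        v2 = torch.zeros(width)  # counts for the in-between characters
        v3 = torch.zeros(width)  # last character (one-hot)

        # Compare by value; `is not` tests identity and is unreliable for strings
        if word != '[UNK]':
            # find() returns -1 for non-printable characters, which maps to the last slot
            v1[string.printable.find(word[0])] = 1

            unique_chars = list(set(word[1:-1]))
            for unique_char in unique_chars:
                v2[string.printable.find(unique_char)] = word.count(unique_char)

            v3[string.printable.find(word[-1])] = 1

        # Each word becomes a single vector of length 3 * width
        character_vectors.append(torch.cat((v1, v2, v3), 0))
    return torch.stack(character_vectors)
```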
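
To make the layout of the three segments concrete, here is a small worked example. It repeats the function so that it runs standalone (comparing against `'[UNK]'` by value), and the sample sentence is made up for illustration.

```python
import string
import torch

def prepare_character_vectors(sentence, width=100):
    character_vectors = []
    for word in sentence:
        v1, v2, v3 = torch.zeros(width), torch.zeros(width), torch.zeros(width)
        if word != '[UNK]':  # unknown words keep an all-zero vector
            v1[string.printable.find(word[0])] = 1
            for ch in set(word[1:-1]):
                v2[string.printable.find(ch)] = word.count(ch)
            v3[string.printable.find(word[-1])] = 1
        character_vectors.append(torch.cat((v1, v2, v3), 0))
    return torch.stack(character_vectors)

# Illustrative sentence (not taken from the data set)
vectors = prepare_character_vectors(["cat", "[UNK]", "seen"])
print(vectors.shape)  # one row per word, 3 * width columns

# Non-zero positions for "seen": 's' in segment 1, 'e' (count 2) in segment 2, 'n' in segment 3
seen_nonzero = {int(i): vectors[2, i].item() for i in vectors[2].nonzero().flatten()}
print(seen_nonzero)
```

Each row has 300 entries, which matches the `in_features=300` of the `Wih` layer in the Option 2 model printout.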

#### RNN Network for Option 2
For Option 2 we additionally implemented a separate RNN that takes the character vector representation as input and outputs its hidden state, which is concatenated with the word embeddings before being passed to the LSTM.

In the first step, the hidden state is seeded as a matrix of zeros so that it can be fed into the RNN cell together with the first input in the sequence. The hidden state and the input are each multiplied by their own weight matrix, and the sum of the two products is passed through an activation function (tanh in our case) to introduce non-linearity. The result is the new hidden state of the RNN cell. We do not compute a separate output for the cell, since only the hidden state is needed.

`hidden_t = tanh(weight_hidden * hidden_{t-1} + weight_input * input_t)`

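```
import torch.nn as nn

class CharacterRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(CharacterRNN, self).__init__()

        # Input-to-hidden projection (weight_input in the formula above)
        self.Wih = nn.Linear(input_dim, hidden_dim)
        # Hidden-to-hidden projection (weight_hidden); despite its name it produces no output
        self.Wio = nn.Linear(hidden_dim, hidden_dim)
        self.tanh = nn.Tanh()

    def forward(self, input_seq, hidden):
        # One recurrence step: only the new hidden state is computed and returned
        hidden = self.tanh(self.Wih(input_seq) + self.Wio(hidden))

        return hidden
```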
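
The following is a minimal sketch of how the hidden states could be combined with the word embeddings before the LSTM. It assumes the cell is stepped once per word, starting from a zero hidden state; the actual wiring in `default.py` may differ. The random tensors stand in for the real character vectors and embeddings, and the dimensions (300, 64, 128) are taken from the model printout in the Analysis section.

```python
import torch
import torch.nn as nn

class CharacterRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.Wih = nn.Linear(input_dim, hidden_dim)   # input-to-hidden
        self.Wio = nn.Linear(hidden_dim, hidden_dim)  # hidden-to-hidden
        self.tanh = nn.Tanh()

    def forward(self, input_seq, hidden):
        return self.tanh(self.Wih(input_seq) + self.Wio(hidden))

torch.manual_seed(0)
seq_len, char_dim, hidden_dim, embed_dim = 5, 300, 64, 128

# Stand-ins (assumptions): real inputs would come from prepare_character_vectors
# and from the word embedding layer
char_vectors = torch.rand(seq_len, char_dim)
word_embeds = torch.rand(seq_len, embed_dim)

rnn = CharacterRNN(char_dim, hidden_dim)
hidden = torch.zeros(1, hidden_dim)  # hidden state seeded with zeros
states = []
for x in char_vectors:               # one recurrence step per word
    hidden = rnn(x.unsqueeze(0), hidden)
    states.append(hidden)
char_states = torch.cat(states, dim=0)                     # (seq_len, 64)

# Concatenate along the feature axis: 128 + 64 = 192 features per word
lstm_input = torch.cat((word_embeds, char_states), dim=1)  # (seq_len, 192)

lstm = nn.LSTM(embed_dim + hidden_dim, 64)
lstm_out, _ = lstm(lstm_input.view(seq_len, 1, -1))        # (seq_len, batch=1, 64)
print(lstm_input.shape, lstm_out.shape)
```

The 192-dimensional input agrees with the `LSTM(192, 64)` layer reported for Option 2, whereas the baseline feeds the raw 300-dimensional character vector, giving `LSTM(428, 64)`.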

## Analysis

Option 1 gave us an FB1 score of `77.2100` on dev.<br>
For Option 2 we experimented with various sizes for the RNN's hidden layer. A hidden size of 64 gave us the best FB1 score of `78.2797` on dev, about 1.07 points above the baseline.<br>

Below are their individual runs:

### Baseline Option 1 - Concatenating character vectors

In [6]:
from default_baseline import *
import os

chunker = LSTMTagger(os.path.join('data', 'train.txt.gz'), os.path.join('data', 'chunker_baseline'), '.tar')
print("Model:", chunker.model)
decoder_output = chunker.decode('data/input/dev.txt')

flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

Model: LSTMTaggerModel(
  (word_embeddings): Embedding(9675, 128)
  (lstm): LSTM(428, 64)
  (hidden2tag): Linear(in_features=64, out_features=22, bias=True)
)


100%|██████████| 1027/1027 [00:03<00:00, 313.14it/s]


processed 23663 tokens with 11896 phrases; found: 11961 phrases; correct: 9210.
accuracy:  86.86%; (non-O)
accuracy:  87.85%; precision:  77.00%; recall:  77.42%; FB1:  77.21
             ADJP: precision:  42.11%; recall:  17.70%; FB1:  24.92  95
             ADVP: precision:  69.74%; recall:  47.49%; FB1:  56.50  271
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  75.36%; recall:  80.66%; FB1:  77.92  6676
               PP: precision:  91.10%; recall:  88.49%; FB1:  89.78  2371
              PRT: precision:  69.23%; recall:  60.00%; FB1:  64.29  39
             SBAR: precision:  84.80%; recall:  44.73%; FB1:  58.56  125
               VP: precision:  69.51%; recall:  71.92%; FB1:  70.69  2384


(77.00025081514924, 77.42098184263618, 77.21004317391123)

### Option 2 - Concatenating the hidden layer of an RNN with character vectors as input

In [7]:
from default import *
import os

chunker = LSTMTagger(os.path.join('data', 'train.txt.gz'), os.path.join('data', 'chunker'), '.tar')
print("Model:", chunker.model)
decoder_output = chunker.decode('data/input/dev.txt')

flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

Model: LSTMTaggerModel(
  (word_embeddings): Embedding(9675, 128)
  (character_rnn): CharacterRNN(
    (Wih): Linear(in_features=300, out_features=64, bias=True)
    (Wio): Linear(in_features=64, out_features=64, bias=True)
    (tanh): Tanh()
  )
  (lstm): LSTM(192, 64)
  (hidden2tag): Linear(in_features=64, out_features=22, bias=True)
)


100%|██████████| 1027/1027 [00:03<00:00, 271.24it/s]


processed 23663 tokens with 11896 phrases; found: 12077 phrases; correct: 9383.
accuracy:  87.41%; (non-O)
accuracy:  88.41%; precision:  77.69%; recall:  78.88%; FB1:  78.28
             ADJP: precision:  45.00%; recall:  19.91%; FB1:  27.61  100
             ADVP: precision:  71.80%; recall:  47.99%; FB1:  57.53  266
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  76.63%; recall:  82.36%; FB1:  79.39  6704
               PP: precision:  91.33%; recall:  88.45%; FB1:  89.86  2364
              PRT: precision:  70.27%; recall:  57.78%; FB1:  63.41  37
             SBAR: precision:  77.62%; recall:  46.84%; FB1:  58.42  143
               VP: precision:  69.59%; recall:  74.39%; FB1:  71.91  2463


(77.69313571251139, 78.87525218560862, 78.27973136445169)