# CoNLL 2000 Syntactic Chunker Task
The **`CoNLL 2000`** chunking task consists of data annotated with 11 types of syntactic chunks (noun phrases, verb phrases, etc.). The data is in [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format: each token that belongs to a chunk carries a `'B-'` tag if it starts the chunk or an `'I-'` tag if it is inside the chunk, and tokens outside any chunk are tagged `'O'`.

Tags include:

* **Noun Phrase**: `'B-NP', 'I-NP'`

* **Verb Phrase**: `'B-VP', 'I-VP'`

* **Adjective Phrase**: `'B-ADJP', 'I-ADJP'`

* **Adverb Phrase**: `'B-ADVP', 'I-ADVP'`

* **Prepositional Phrase**: `'B-PP', 'I-PP'`

* **Conjunction Phrase**: `'B-CONJP', 'I-CONJP'`

* **Interjection Phrase**: `'B-INTJ', 'I-INTJ'`

* **Verb Particle Phrase**: `'B-PRT', 'I-PRT'`

* **List Marker Phrase**: `'B-LST', 'I-LST'`

* **Unlike Coordinated Phrase**: `'B-UCP', 'I-UCP'`

* **Complementizer Phrase**: `'B-SBAR', 'I-SBAR'`

* **Outside Tokens**:`'O'`



See [website](https://www.clips.uantwerpen.be/conll2000/chunking/) and [paper](https://www.clips.uantwerpen.be/conll2000/pdf/12732tjo.pdf) for more info.
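The IOB2 convention described above can be checked mechanically. The following is a minimal sketch, not part of the original notebook; the helper name `is_valid_iob2` is hypothetical. It assumes strict IOB2, where every `'I-X'` tag must continue a chunk of the same type `X` that was opened by a `'B-X'` tag.

```python
def is_valid_iob2(tags):
    """Return True if every I-X tag continues a chunk of the same type X."""
    previous_type = None  # type of the chunk we are currently inside, if any
    for tag in tags:
        if tag == "O":
            previous_type = None
            continue
        prefix, _, chunk_type = tag.partition("-")
        if prefix not in ("B", "I") or not chunk_type:
            return False  # not an IOB2 tag at all
        if prefix == "I" and chunk_type != previous_type:
            return False  # I-X without a preceding B-X / I-X of the same type
        previous_type = chunk_type
    return True


print(is_valid_iob2(["B-NP", "I-NP", "B-VP", "O"]))  # well-formed sequence
print(is_valid_iob2(["O", "I-NP"]))                  # I-NP does not follow B-NP
```

A check like this is mainly useful on model predictions, since a token classifier predicts each tag independently and may emit an `'I-'` tag without its opening `'B-'` tag.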

The data is already tokenized and tagged with part-of-speech (POS) and chunk labels:

In [None]:
# token         POS Chunk  
#------------------------
# Rockwell      NNP B-NP
# International NNP I-NP
# Corp.         NNP I-NP
# 's            POS B-NP
# Tulsa         NNP I-NP
# unit          NN  I-NP
# said          VBD B-VP
# it            PRP B-NP
# signed        VBD B-VP
# a             DT  B-NP
# tentative     JJ  I-NP
# agreement     NN  I-NP
# extending     VBG B-VP
# its           PRP B-NP
# contract      NN  I-NP
# with          IN  B-PP
# Boeing        NNP B-NP
# Co.           NNP I-NP
# to            TO  B-VP
# provide       VB  I-VP
# structural    JJ  B-NP
# parts         NNS I-NP
# for           IN  B-PP
# Boeing        NNP B-NP
# 's            POS B-NP
# 747           CD  I-NP
# jetliners     NNS I-NP
# .             .   O

The first column is the token, the second is the part-of-speech (POS) tag, and the third is the syntactic chunk tag.

So for the chunking task the data consists of features `X` and labels `y`:


* **`X`**: a list of lists of tokens, one inner list per sentence


* **`y`**: a list of lists of syntactic chunk tags, aligned one-to-one with the tokens
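To make the shape of `X` and `y` concrete, here is a small sketch that parses an in-memory snippet in the three-column layout shown above. The function name `parse_conll2000` and the two short "sentences" (fragments of the example, separated by a blank line) are for illustration only.

```python
def parse_conll2000(text):
    """Split blank-line-separated sentences into token lists and chunk-tag lists."""
    X, y = [], []
    for block in text.strip().split("\n\n"):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        X.append([row[0] for row in rows])  # column 0: token
        y.append([row[2] for row in rows])  # column 2: chunk tag
    return X, y


sample = """\
Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP

Boeing NNP B-NP
's POS B-NP
747 CD I-NP
jetliners NNS I-NP
. . O
"""

X, y = parse_conll2000(sample)
print(X)
print(y)
```

Each inner list of `y` has the same length as the matching inner list of `X`, which is what a token classifier expects.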


In [2]:
%%bash
DATADIR="chunker_english"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz
    gunzip train.txt.gz
    wget https://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz
    gunzip test.txt.gz
fi

In [3]:
"""
Train data: 8936 sentences, 211727 tokens
Test data: 2012 sentences, 47377 tokens
"""
import os
import sys

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

sys.path.append("../") 
from bert_sklearn import BertTokenClassifier, load_model

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read a file in CoNLL shared task column format.

    CoNLL-2000 files use the same layout with three columns,
    so pass idx=2 to select the chunk tag.
    """
    # use a context manager so the file handle is closed
    with open(filename) as f:
        text = f.read().strip()

    # sentences are separated by blank lines
    sentences = text.split("\n\n")

    # throw out -DOCSTART- lines
    sentences = [s for s in sentences if not s.startswith("-DOCSTART-")]

    # split each sentence into its token lines
    sentences = [s.split("\n") for s in sentences]

    # get tokens (first column)
    tokens = [[line.split()[0] for line in sent] for sent in sentences]

    # get labels/tags (column idx)
    labels = [[line.split()[idx] for line in sent] for sent in sentences]

    return pd.DataFrame({'tokens': tokens, 'labels': labels})

DATADIR = "./chunker_english/"

def get_data(trainfile=DATADIR + "train.txt",
            testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile,idx=2)
    print("Train data: %d sentences, %d tokens"%(len(train),len(flatten(train.tokens))))

    test = read_CoNLL2003_format(testfile,idx=2)
    print("Test data: %d sentences, %d tokens"%(len(test),len(flatten(test.tokens))))
    
    return train, test

train, test = get_data()
X_train, y_train = train.tokens, train.labels
X_test, y_test = test.tokens, test.labels

# chunk tags may be spread across the different sets, i.e. we can't rely on all
# the chunk tags being present in the training set
label_list = set(flatten(y_train)).union(set(flatten(y_test)))
label_list = list(label_list)
print("\nChunker tags/labels:\n", label_list)

Train data: 8936 sentences, 211727 tokens
Test data: 2012 sentences, 47377 tokens

Chunker tags/labels:
 ['I-UCP', 'O', 'B-LST', 'I-ADJP', 'B-INTJ', 'I-CONJP', 'I-VP', 'I-LST', 'B-CONJP', 'I-INTJ', 'B-PP', 'I-SBAR', 'I-PP', 'B-UCP', 'B-ADVP', 'B-VP', 'I-PRT', 'B-NP', 'I-ADVP', 'B-ADJP', 'B-PRT', 'I-NP', 'B-SBAR']


In [4]:
train.head()

                                              tokens                                              labels
0  [Confidence, in, the, pound, is, widely, expec...  [B-NP, B-PP, B-NP, I-NP, B-VP, I-VP, I-VP, I-V...
1  [Chancellor, of, the, Exchequer, Nigel, Lawson...  [O, B-PP, B-NP, I-NP, B-NP, I-NP, B-NP, I-NP, ...
2  [But, analysts, reckon, underlying, support, f...  [O, B-NP, B-VP, B-NP, I-NP, B-PP, B-NP, B-VP, ...
3  [This, has, increased, the, risk, of, the, gov...  [B-NP, B-VP, I-VP, B-NP, I-NP, B-PP, B-NP, I-N...
4  [``, The, risks, for, sterling, of, a, bad, tr...  [O, B-NP, I-NP, B-PP, B-NP, B-PP, B-NP, I-NP, ...


Let's look at one (tokens, labels) pair to make sure it makes sense:

In [3]:
i = 151
tokens = X_test[i]
labels = y_test[i]

print(" ".join(tokens))
print(" ".join(labels))

His meals are most often prepared by women he trusts -- his full-time mistress , Vicky Amado , and her mother , Norma .
B-NP I-NP B-VP I-VP I-VP I-VP B-PP B-NP B-NP B-VP O B-NP I-NP I-NP O B-NP I-NP O O B-NP I-NP O B-NP O


Let's define our model using the **`BertTokenClassifier`** class:

* We will include an `ignore_label` option to ignore the `'O'` labels.


* We will also use the cased model as we did with NER.

In [4]:
# define model
model = BertTokenClassifier(bert_model='bert-base-cased',
                            epochs=3,
                            label_list = label_list,
                            learning_rate=2e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            ignore_label=['O'])

Building sklearn token classifier...


One issue that we need to be mindful of is the maximum token length in the token lists.
There are two complications:

* BERT has a **`max_seq_length`** parameter that dictates how long a token sequence we can handle. Input sequences longer than this are truncated. The upper limit is 512 tokens, but shorter sequences are preferable because they are much faster and consume less memory on the GPU.


* Each token is tokenized again by the BERT wordpiece tokenizer, which can split one input token into several pieces. The resulting sequences are therefore longer than the input token lists.

Let's check our BERT token lengths by running the data through the BERT wordpiece tokenizer:

In [11]:
%%time
print("Bert wordpiece tokenizer max token length in train: %d tokens"% model.get_max_token_len(X_train))
print("Bert wordpiece tokenizer max token length in test: %d tokens"% model.get_max_token_len(X_test))

Bert wordpiece tokenizer max token length in train: 109 tokens
Bert wordpiece tokenizer max token length in test: 88 tokens
CPU times: user 3.59 s, sys: 20 ms, total: 3.61 s
Wall time: 3.97 s


So based on this we will set `max_seq_length` to at least 111 = 109 + 2 (for the `'[CLS]'` and `'[SEP]'` tokens that BERT adds).
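The effect of wordpiece tokenization on sequence length can be illustrated with a toy version of the algorithm. This is only a sketch of greedy longest-match-first subword splitting: the `wordpiece` function and the tiny `toy_vocab` are hypothetical, and the real BERT tokenizer uses its full pretrained vocabulary and also splits punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# hypothetical toy vocabulary, for illustration only
toy_vocab = {"a", "tentative", "agreement", "jet", "##liner", "##s"}
sentence = ["a", "tentative", "agreement", "jetliners"]

pieces = [p for word in sentence for p in wordpiece(word, toy_vocab)]
required_len = len(pieces) + 2  # room for [CLS] and [SEP]

print(pieces)
print(len(sentence), "input tokens ->", required_len, "positions needed")
```

Here four input tokens become six wordpieces, so eight positions are needed once the two special tokens are counted. The same reasoning gives 109 + 2 = 111 for the training data.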

## Finetune model

In [10]:
%%time
# set max_seq_length
model.max_seq_length = 111

# finetune model on train data
model.fit(X_train, y_train)

# score model on test data
f1_test = model.score(X_test, y_test)
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# calculate the probability of each class
y_probs = model.predict_proba(X_test)

# print report on tag classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

Loading bert-base-cased model...
Defaulting to linear classifier/regressor
train data size: 8043, validation data size: 893


Training: 100%|██████████| 503/503 [03:30<00:00,  2.45it/s, loss=0.0724]
                                                           

Epoch 1, Train loss: 0.0724, Val loss: 0.0168, Val accy: 97.92%, f1: 97.92


Training: 100%|██████████| 503/503 [04:25<00:00,  2.19it/s, loss=0.0132]
                                                           

Epoch 2, Train loss: 0.0132, Val loss: 0.0148, Val accy: 98.05%, f1: 98.07


Training: 100%|██████████| 503/503 [05:08<00:00,  1.80it/s, loss=0.00871]
                                                           

Epoch 3, Train loss: 0.0087, Val loss: 0.0144, Val accy: 98.15%, f1: 98.17


Predicting:   0%|          | 0/126 [00:00<?, ?it/s]          

Test f1: 97.76


                                                             

              precision    recall  f1-score   support

      B-ADJP       0.86      0.83      0.85       438
      B-ADVP       0.87      0.87      0.87       866
     B-CONJP       0.55      0.67      0.60         9
      B-INTJ       0.00      0.00      0.00         2
       B-LST       0.00      0.00      0.00         5
        B-NP       0.99      0.98      0.99     12422
        B-PP       0.98      0.99      0.99      4811
       B-PRT       0.78      0.92      0.84       106
      B-SBAR       0.94      0.96      0.95       535
        B-VP       0.98      0.98      0.98      4658
      I-ADJP       0.85      0.74      0.79       167
      I-ADVP       0.81      0.71      0.75        89
     I-CONJP       0.62      0.77      0.69        13
       I-LST       0.00      0.00      0.00         2
        I-NP       0.98      0.99      0.99     14376
        I-PP       0.84      0.77      0.80        48
      I-SBAR       0.20      0.75      0.32         4
        I-VP       0.97    

In [11]:
labels = list(set(label_list) - set(model.ignore_label))
labels

['I-CONJP',
 'B-SBAR',
 'B-PP',
 'I-LST',
 'B-PRT',
 'B-NP',
 'I-ADVP',
 'I-PP',
 'I-SBAR',
 'B-INTJ',
 'I-PRT',
 'I-ADJP',
 'B-UCP',
 'I-VP',
 'B-CONJP',
 'B-ADJP',
 'I-INTJ',
 'B-VP',
 'B-LST',
 'I-NP',
 'B-ADVP',
 'I-UCP']

If we want span-level precision, recall, and F1, we can install the helpful `seqeval` utility:

In [8]:
!pip install seqeval

Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/55/dd/3bf1c646c310daabae47fceb84ea9ab66df7f518a31a89955290d82b8100/seqeval-0.0.10-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-0.0.10


In [9]:
from seqeval.metrics import classification_report as seqeval_report
print(seqeval_report(flatten(y_test),flatten(y_preds)))

           precision    recall  f1-score   support

       NP       0.97      0.97      0.97     12422
     ADJP       0.82      0.81      0.81       438
     ADVP       0.86      0.86      0.86       866
       PP       0.98      0.99      0.98      4811
       VP       0.96      0.97      0.97      4658
      PRT       0.78      0.92      0.84       106
     SBAR       0.94      0.96      0.95       535
    CONJP       0.38      0.56      0.45         9
      LST       0.00      0.00      0.00         5
     INTJ       0.00      0.00      0.00         2

micro avg       0.96      0.97      0.96     23852
macro avg       0.96      0.97      0.96     23852
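To see why the span-level scores are lower than the token-level ones, the sketch below computes span-level precision, recall, and F1 by hand. It is not seqeval's implementation; the helpers `chunks` and `span_prf` are hypothetical. A predicted chunk counts as correct only if its type and both boundaries match the gold chunk exactly.

```python
def chunks(tags):
    """Return the set of (type, start, end) chunks in one IOB2 tag sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == kind
        if kind is not None and not continues:
            found.add((kind, start, i))  # end index is exclusive
            kind = None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return found


def span_prf(y_true, y_pred):
    """Micro-averaged span-level precision, recall and F1 over all sentences."""
    tp = n_true = n_pred = 0
    for gold, pred in zip(y_true, y_pred):
        g, p = chunks(gold), chunks(pred)
        tp += len(g & p)
        n_true += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# made-up example: one wrong tag splits a three-token NP into two chunks
gold = [["B-NP", "I-NP", "I-NP", "B-VP", "B-NP"]]
pred = [["B-NP", "I-NP", "B-NP", "B-VP", "B-NP"]]

token_acc = sum(g == p for g, p in zip(gold[0], pred[0])) / len(gold[0])
precision, recall, f1 = span_prf(gold, pred)
print("token accuracy: %.2f" % token_acc)
print("span precision: %.2f, recall: %.2f, f1: %.2f" % (precision, recall, f1))
```

In this made-up example a single wrong tag leaves token accuracy at 4 of 5, but it destroys one gold chunk and creates two spurious ones, so the span-level scores drop much further.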



Let's also take a look at the example from the test set we looked at before and compare the predicted tags with the actuals:

In [20]:
i = 151
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]
prob   = y_probs[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

        token label predict
0         His  B-NP    B-NP
1       meals  I-NP    I-NP
2         are  B-VP    B-VP
3        most  I-VP    I-VP
4       often  I-VP    I-VP
5    prepared  I-VP    I-VP
6          by  B-PP    B-PP
7       women  B-NP    B-NP
8          he  B-NP    B-NP
9      trusts  B-VP    B-VP
10         --     O       O
11        his  B-NP    B-NP
12  full-time  I-NP    I-NP
13   mistress  I-NP    I-NP
14          ,     O       O
15      Vicky  B-NP    B-NP
16      Amado  I-NP    I-NP
17          ,     O       O
18        and     O       O
19        her  B-NP    B-NP
20     mother  I-NP    I-NP
21          ,     O       O
22      Norma  B-NP    B-NP
23          .     O       O


Let's look at the probability of each class label for this observation:

In [21]:
## pprint out probs for this observation
tokens_prob = model.tokens_proba(tokens, prob)

        token  I-CONJP  B-SBAR  B-PP  I-LST  B-PRT  B-NP    O  I-ADVP  I-PP  \
0         His     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
1       meals     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
2         are     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
3        most     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
4       often     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.14  0.00   
5    prepared     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
6          by     0.00    0.00  1.00   0.00   0.00  0.00 0.00    0.00  0.00   
7       women     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
8          he     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
9      trusts     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
10         --     0.00    0.00  0.00   0.00   0.00  0.00 1.00    0.00  0.00   
11        his     0.00    0.00  0.00   0.00   0.00  
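A per-token probability table like this can be reduced to the most likely tag and a confidence value. The sketch below uses made-up numbers and a shortened label set; the helper `top_predictions` is hypothetical, and it assumes the columns of each probability row are in the same order as `labels`.

```python
def top_predictions(tokens, probs, labels, threshold=0.9):
    """For each token return (token, best label, its probability, low-confidence flag)."""
    rows = []
    for token, dist in zip(tokens, probs):
        best = max(range(len(dist)), key=dist.__getitem__)  # index of the largest probability
        rows.append((token, labels[best], dist[best], dist[best] < threshold))
    return rows


# made-up probabilities, for illustration only
labels = ["B-NP", "I-NP", "B-VP", "I-VP", "I-ADVP", "O"]
tokens = ["His", "meals", "often"]
probs = [
    [0.99, 0.01, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.86, 0.14, 0.00],
]

for token, label, p, uncertain in top_predictions(tokens, probs, labels):
    print("%-8s %-6s %.2f%s" % (token, label, p, "  <- low confidence" if uncertain else ""))
```

Flagging tokens whose top probability falls below a threshold is one simple way to find predictions worth reviewing by hand.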

Finally, let's predict the tags and tag probabilities on some new text:

In [22]:
text = "I really want to go to the museum today, but I first need to visit my mom and dad."       

tag_predicts  = model.tag_text(text)       
prob_predicts = model.tag_text_proba(text)    

Predicting:   0%|          | 0/1 [00:00<?, ?it/s]        

     token predicted tags
0        I           B-NP
1   really         B-ADVP
2     want           B-VP
3       to           I-VP
4       go           I-VP
5       to           B-PP
6      the           B-NP
7   museum           I-NP
8    today           B-NP
9        ,              O
10     but              O
11       I           B-NP
12   first         B-ADVP
13    need           B-VP
14      to           I-VP
15   visit           I-VP
16      my           B-NP
17     mom           I-NP
18     and           I-NP
19     dad           I-NP
20       .              O


                                                         

     token  I-CONJP  B-SBAR  B-PP  I-LST  B-PRT  B-NP    O  I-ADVP  I-PP  \
0        I     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
1   really     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
2     want     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
3       to     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
4       go     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
5       to     0.00    0.00  1.00   0.00   0.00  0.00 0.00    0.00  0.00   
6      the     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
7   museum     0.00    0.00  0.00   0.00   0.00  0.00 0.00    0.00  0.00   
8    today     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
9        ,     0.00    0.00  0.00   0.00   0.00  0.00 1.00    0.00  0.00   
10     but     0.00    0.00  0.00   0.00   0.00  0.00 1.00    0.00  0.00   
11       I     0.00    0.00  0.00   0.00   0.00  1.00 0.00    0.00  0.00   
12   first  

