# KEY Assignment 2 about NERC (max 20 points for your final grade)

This notebook describes Assignment 3, which is part of Lab 3 of the text mining course. 


**Learning goals**
* represent linguistic input in a feature space
* work with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* perform feature ablation and gain insight into the contribution of various features

We assume you have worked through the following notebook:
* **Lab3-Supervised NERC system.ipynb**

## [Points: 3] Exercise 1: NER and NERC definitions
* **[1 point] a) Explain what NER is**
* **[1 point] b) Explain what NERC is**
* **[1 point] c) Explain what the IOB format is and how it represents both the NER and NERC task.**

**Answer**

* NER is a text mining task of recognizing/marking entity mentions in text.
* NERC is a text mining task of recognizing/marking entity mentions in text and classifying them as one of the predefined entity types (e.g., person or location).
* The IOB format assigns a label to each token, as one of the following: O (meaning not an entity), B-* (meaning first token of an entity mention), or I-* (meaning non-first token of an entity mention). In the above, the asterisk '\*' should be replaced with an entity type, such as PER or LOC. This allows annotation of mentions as well as their class in the same label. An example label would be B-LOC, for a first token of some location entity.
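
The way IOB labels encode both tasks can be illustrated with a small sketch. The helper `iob_to_spans` below is hypothetical (it is not part of the lab code) and the token list is a toy example: the B-/I- prefixes give the mention boundaries (NER), and the suffix gives the entity type (NERC).

```python
def iob_to_spans(tokens, labels):
    """Group IOB-labelled tokens into (entity_type, mention_text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        # A B-* label, or an I-* label of a different type, starts a new mention
        starts_new = label.startswith('B-') or (
            label.startswith('I-') and label[2:] != current_type)
        if label == 'O' or starts_new:
            # Close the mention that was open, if any
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
        if label != 'O':
            current_type = label[2:]
            current_tokens.append(token)
    # Close a mention that runs until the end of the sequence
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


# Toy example (assumed tokens and labels, for illustration only)
tokens = ['EU', 'rejects', 'German', 'call', '.', 'Peter', 'Blackburn']
labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'B-PER', 'I-PER']
spans = iob_to_spans(tokens, labels)
print(spans)
```

Note that the two-token mention is recovered as a single span because its second token carries an I-* label of the same type.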

## [Points: 17] Exercise 2: Training and evaluating an SVM using CoNLL-2003

**[2 points] a) Load the CoNLL-2003 data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'word': 'EU', 'pos': 'NNP'}, 
    {'word': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [4]:
from nltk.corpus.reader import ConllCorpusReader

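# CoNLL-2003 columns are: word, POS tag, syntactic chunk tag, NER tag.
# The chunk column is mapped to 'ignore' and the NER column to 'chunk',
# so that iob_words() yields (word, pos, NER label) triples.
train = ConllCorpusReader('nerc_datasets/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])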
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
        'word': token,
        'pos': pos,
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)
    
print(training_features[:10])
print(training_gold_labels[:10])

[{'word': 'EU', 'pos': 'NNP'}, {'word': 'rejects', 'pos': 'VBZ'}, {'word': 'German', 'pos': 'JJ'}, {'word': 'call', 'pos': 'NN'}, {'word': 'to', 'pos': 'TO'}, {'word': 'boycott', 'pos': 'VB'}, {'word': 'British', 'pos': 'JJ'}, {'word': 'lamb', 'pos': 'NN'}, {'word': '.', 'pos': '.'}, {'word': 'Peter', 'pos': 'NNP'}]
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']


In [5]:
test = ConllCorpusReader('nerc_datasets/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        'word': token,
        'pos': pos,
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)
    
print(test_features[:10])
print(test_gold_labels[:10])

[{'word': 'SOCCER', 'pos': 'NN'}, {'word': '-', 'pos': ':'}, {'word': 'JAPAN', 'pos': 'NNP'}, {'word': 'GET', 'pos': 'VB'}, {'word': 'LUCKY', 'pos': 'NNP'}, {'word': 'WIN', 'pos': 'NNP'}, {'word': ',', 'pos': ','}, {'word': 'CHINA', 'pos': 'NNP'}, {'word': 'IN', 'pos': 'IN'}, {'word': 'SURPRISE', 'pos': 'DT'}]
['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O']


**[3 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data are balanced (an equal number of instances for each NERC label) and to what extent the training and test data differ.

Tip: you can use the `Counter` functionality shown below to generate a frequency list from a list:

In [6]:
print('Number of training examples', len(training_gold_labels))
print('Number of test examples', len(test_gold_labels))

Number of training examples 203621
Number of test examples 46435


In [7]:
from collections import Counter 

print(Counter(training_gold_labels))
print(Counter(test_gold_labels))

Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155})
Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617, 'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702, 'I-LOC': 257, 'I-MISC': 216})


**Answer**

Both the training and the test data consist predominantly of 'O' labels (about 83% of the tokens in each set), which is expected since most tokens do not belong to an entity mention. The data is therefore highly imbalanced.

The training data is about 4.4 times larger than the test data, and roughly the same ratio holds for the counts of individual labels such as B-LOC (7140 vs. 1668). While the exact ratio and the frequency rank of some labels differ (e.g., B-PER is the second most frequent entity label in training, but B-ORG is in test), overall we observe a similar label distribution in the training and the test data.
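
To make the comparison concrete, the raw counts can be turned into relative frequencies. The following is a minimal sketch that re-enters the counts printed above by hand; the helper `label_shares` is a hypothetical name introduced here for illustration.

```python
from collections import Counter

# Counts copied from the Counter output above
train_counts = Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321,
                        'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438,
                        'I-LOC': 1157, 'I-MISC': 1155})
test_counts = Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617,
                       'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702,
                       'I-LOC': 257, 'I-MISC': 216})


def label_shares(counts):
    """Return the fraction of all instances that carries each label."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


train_shares = label_shares(train_counts)
test_shares = label_shares(test_counts)

# One row per label, ordered by training frequency
for label, _ in train_counts.most_common():
    print(f"{label:7s} train {train_shares[label]:6.1%}  test {test_shares[label]:6.1%}")
```

Comparing the two columns per label shows directly where the train and test distributions agree and where they diverge.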

**[1 point] c. Explain what one-hot encoding is. Explain why we cannot use string features to train the SVM.**

**Answer**

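One-hot encoding represents each categorical feature value as its own dimension of a binary vector: the position that corresponds to the observed value is 1 and all other positions are 0. Every distinct feature value in the data (e.g., `word=EU` or `pos=NNP`) corresponds to one position in the vector, so the resulting representation is sparse: mostly zeros with occasional ones.

The string features need to be encoded as numbers because the SVM requires numeric input: it learns a weight for every input dimension and computes products between weights and feature values, which is not defined for strings.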
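
As a small illustration, the sketch below one-hot encodes three toy instances with scikit-learn's `DictVectorizer`. The variable names (`toy_features`, `toy_vec`, `toy_onehot`) are made up for this example and are not used elsewhere in the notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Three toy instances in the same format as the training features
toy_features = [
    {'word': 'EU', 'pos': 'NNP'},
    {'word': 'rejects', 'pos': 'VBZ'},
    {'word': 'German', 'pos': 'JJ'},
]

toy_vec = DictVectorizer()
toy_onehot = toy_vec.fit_transform(toy_features)  # scipy sparse matrix

# vocabulary_ maps each "feature=value" string to its column index
feature_names = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(feature_names)
print(toy_onehot.toarray())
```

Each row contains exactly two ones (one for the word, one for the POS tag), and every distinct string value gets its own column.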

**[3 points] d. Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: After concatenating train and test into one list and applying the DictVectorizer, the order of the rows is maintained. You can hence use an index (the number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.


In [2]:
from sklearn.feature_extraction import DictVectorizer

In [3]:
vec = DictVectorizer()
the_array = training_features + test_features
the_array = vec.fit_transform(the_array)

training_onehot = the_array[:len(training_features)]
test_onehot = the_array[len(training_features):]
the_array.shape


**[4 points] e. Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable; training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [13]:
from sklearn import svm
from sklearn.metrics import classification_report

In [14]:
lin_clf = svm.LinearSVC()

In [16]:
lin_clf.fit(training_onehot,training_gold_labels)
y_pred = lin_clf.predict(test_onehot)
report = classification_report(test_gold_labels,y_pred,digits = 3)

print(report)

             precision    recall  f1-score   support

      B-LOC      0.812     0.775     0.793      1668
     B-MISC      0.782     0.664     0.718       702
      B-ORG      0.792     0.519     0.627      1661
      B-PER      0.860     0.437     0.579      1617
      I-LOC      0.618     0.529     0.570       257
     I-MISC      0.570     0.588     0.579       216
      I-ORG      0.703     0.467     0.561       835
      I-PER      0.333     0.871     0.481      1156
          O      0.985     0.984     0.985     38323

avg / total      0.939     0.920     0.922     46435
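
One way to investigate why a label scores poorly is to look at which labels it gets confused with. The following is a minimal sketch using `sklearn.metrics.confusion_matrix` on made-up toy labels (`toy_gold` and `toy_pred` are not the actual predictions); for the real analysis one would pass `test_gold_labels` and `y_pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Toy gold and predicted labels, invented for illustration only
toy_gold = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'B-PER']
toy_pred = ['I-PER', 'I-PER', 'O', 'B-LOC', 'O', 'B-PER']
toy_labels = ['B-LOC', 'B-PER', 'I-PER', 'O']

# Rows are gold labels, columns are predicted labels (in the order of toy_labels)
cm = confusion_matrix(toy_gold, toy_pred, labels=toy_labels)
for gold_label, row in zip(toy_labels, cm):
    print(f"{gold_label:6s}", row)
```

Off-diagonal cells show the errors: in this toy case one gold B-PER token was predicted as I-PER.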



In [17]:
# Print the first 21 training instances that carry a location label (B-LOC or I-LOC)
i = 0
for w, l in zip(training_features, training_gold_labels):
    if 'LOC' in l:
        print(w)
        print(l)
        i += 1
    if i > 20:
        break

{'word': 'BRUSSELS', 'pos': 'NNP'}
B-LOC
{'word': 'Germany', 'pos': 'NNP'}
B-LOC
{'word': 'Britain', 'pos': 'NNP'}
B-LOC
{'word': 'Britain', 'pos': 'NNP'}
B-LOC
{'word': 'France', 'pos': 'NNP'}
B-LOC
{'word': 'France', 'pos': 'NNP'}
B-LOC
{'word': 'Britain', 'pos': 'NNP'}
B-LOC
{'word': 'Europe', 'pos': 'NNP'}
B-LOC
{'word': 'Germany', 'pos': 'NNP'}
B-LOC
{'word': 'Bonn', 'pos': 'NNP'}
B-LOC
{'word': 'Germany', 'pos': 'NNP'}
B-LOC
{'word': 'Britain', 'pos': 'NNP'}
B-LOC
{'word': 'LONDON', 'pos': 'NNP'}
B-LOC
{'word': 'U.S.', 'pos': 'NNP'}
B-LOC
{'word': 'Florida', 'pos': 'NNP'}
B-LOC
{'word': 'London', 'pos': 'NNP'}
B-LOC
{'word': 'Nottingham', 'pos': 'NNP'}
B-LOC
{'word': 'China', 'pos': 'NNP'}
B-LOC
{'word': 'Taiwan', 'pos': 'NNP'}
B-LOC
{'word': 'BEIJING', 'pos': 'VBG'}
B-LOC
{'word': 'China', 'pos': 'NNP'}
B-LOC


**[4 points] f. Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 2e. Generate a classification report and compare the results with the classifier you built in 2e.**

In [18]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)  

In [37]:
def get_inputs(data, model):
    """Return one 300-dimensional embedding vector per token in `data`."""
    inputs = []
    for token, pos, ne_label in data.iob_words():
        if token in model:
            vector = model[token]
        else:
            # the word is not in the embeddings vocabulary: use an all-zeros vector
            vector = [0] * 300
        inputs.append(vector)
    return inputs

In [38]:
training_inputs=get_inputs(train, model)
test_inputs=get_inputs(test,model)
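
The zero-vector fallback in `get_inputs` means that all out-of-vocabulary tokens look identical to the classifier, so it can be worth checking how often it is triggered. The sketch below is a stand-in under stated assumptions: a plain dictionary of 3-dimensional vectors (`toy_embeddings`) replaces the real word2vec model, and `embed_tokens` is a hypothetical helper, not part of the lab code.

```python
import numpy as np


def embed_tokens(tokens, embeddings, dim):
    """Look up one vector per token; unknown tokens get an all-zeros vector."""
    zero = np.zeros(dim, dtype=np.float32)
    rows = [embeddings[t] if t in embeddings else zero for t in tokens]
    return np.vstack(rows)  # shape: (number of tokens, dim)


# Toy stand-in for the pretrained embeddings (assumed values)
toy_embeddings = {
    'EU': np.array([0.1, 0.2, 0.3], dtype=np.float32),
    'rejects': np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
toy_tokens = ['EU', 'rejects', 'Blackburn']

toy_matrix = embed_tokens(toy_tokens, toy_embeddings, dim=3)
# Fraction of tokens that fell back to the zero vector
oov_rate = sum(t not in toy_embeddings for t in toy_tokens) / len(toy_tokens)
print(toy_matrix.shape, oov_rate)
```

Stacking the vectors into a single array also avoids mixing NumPy arrays and Python lists in the classifier input.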

In [40]:
# Train and evaluate a linear SVM on the embedding vectors
lin_clf = svm.LinearSVC()

lin_clf.fit(training_inputs,training_gold_labels)
y_pred = lin_clf.predict(test_inputs)
report = classification_report(test_gold_labels,y_pred,digits = 3)

print(report)

              precision    recall  f1-score   support

       B-LOC      0.759     0.801     0.779      1668
      B-MISC      0.724     0.695     0.709       702
       B-ORG      0.690     0.638     0.663      1661
       B-PER      0.746     0.669     0.705      1617
       I-LOC      0.514     0.424     0.465       257
      I-MISC      0.604     0.537     0.569       216
       I-ORG      0.480     0.332     0.392       835
       I-PER      0.586     0.501     0.540      1156
           O      0.973     0.991     0.982     38323

   micro avg      0.927     0.927     0.927     46435
   macro avg      0.675     0.621     0.645     46435
weighted avg      0.921     0.927     0.923     46435

