# Lab3-Supervised NERC classifier (SVM)

In this notebook, we provide more information about Named Entity Recognition and Classification (NERC).

**At the end of this notebook, you will be able to**:
* understand the IOB format used to format NERC data
* represent linguistic features as vectors
* train a NERC classifier (SVM)
* apply the classifier to unseen data

**Useful links**:
* [blog about SVM](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)
* [blog about SVM in scikit-learn](https://medium.com/@aneesha/visualising-top-features-in-linear-svm-with-scikit-learn-and-matplotlib-3454ab18a14d)
* [blog about inspecting top features using scikit-learn](https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2)
* [one hot encoding](https://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts)

## 1. NERC
In Named Entity Recognition and Classification, the goal is to determine which noun phrases refer to named entities and to classify them.
Named entities can be persons, locations, organizations, etc. (see [NLTK Chapter 7, Section 5](https://www.nltk.org/book/ch07.html) for more information on the task)

![title](https://researchkb.files.wordpress.com/2014/02/ner.png) 

It is not trivial to represent NERC data in a way that we can easily train NLP systems as well as evaluate them. One of the most used formats is called [Inside–outside–beginning (IOB)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Let's look at an example from one of the most popular datasets, which is [CoNLL-2003](http://aclweb.org/anthology/W03-0419).
```
Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```

The first observation is that all information is represented at the **token-level**. For each token, e.g., *Germany*, we receive information about:
* **the word**: e.g., *Germany*
* **the part of speech**: e.g., *NNP* (from [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))
* **the phrase type**: e.g., a noun phrase
* **the NERC label**: e.g., a location (LOC).

This example contains two named entities: *Germany* and *European Union*.

Every first token of a named entity is prefixed with *B-*. Every token after that within the same entity, e.g., *Union* in *European Union*, is prefixed with *I-*. Tokens that are not part of any named entity are labeled *O*.

Please note that the IOB format is at the **token-level**, which means that we are also going to train and evaluate an NLP system at the token-level! The goal will hence not be to classify *European Union* as an *Organization*, but to classify:
* *European* as the first token of an entity that is an *Organization*
* *Union* as a token inside of an entity that is an *Organization*
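
To see how token-level labels relate to whole entities, here is a minimal sketch that groups IOB labels back into entity spans. The function name `iob_to_entities` is introduced here for illustration only; it is not part of the lab material or of any library.

```python
def iob_to_entities(tokens, labels):
    """Group token-level IOB labels into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- label always opens a new entity; close any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_type:
            # An I- label of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- label that does not continue an open entity) closes it.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
labels = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
entities = iob_to_entities(tokens, labels)
print(entities)  # [('Germany', 'LOC'), ('European Union', 'ORG')]
```

The classifier in this lab only ever sees one token and one label at a time; a grouping step like this one would only be needed if you wanted to report results per entity.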

## 2. SVM
We are going to train a Support Vector Machine (SVM) for the NERC task. An SVM looks for a hyperplane in an n-dimensional feature space that separates the data points of different classes. This matches the problem at hand: each token is a data point, and we want to assign it the correct NERC label.
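
As a rough sketch of what this means for a linear SVM (the symbols below are introduced here for illustration and are not used elsewhere in this lab): the model learns a weight vector $\mathbf{w}$ and a bias $b$, and the hyperplane is the set of points where the score is zero.

$$
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b, \qquad \text{hyperplane: } f(\mathbf{x}) = 0
$$

With more than two labels, a common strategy is one-vs-rest: one scoring function is learned per label $k$, and the label with the highest score is predicted. The `multi_class='ovr'` setting in the `LinearSVC` output further down refers to this strategy.

$$
\hat{y} = \arg\max_{k} \left( \mathbf{w}_k^\top \mathbf{x} + b_k \right)
$$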

### 2.1 Scikit-learn
We are going to use the **svm** module from **sklearn**, from which we will select the **LinearSVC** (Linear Support Vector Classification) class.

In [1]:
from sklearn import svm

In [2]:
svm.LinearSVC

sklearn.svm.classes.LinearSVC

### 2.2 Representing features in sklearn.svm

Similar to when we trained a Sentiment Analyzer in Lab 2, we need to represent training instances using a vector representation. For each training instance, we need:
* **its feature vector** (the representation of some input)
* **the NERC label** (the corresponding output class)

We show how to train and evaluate an SVM using a made-up example of multi-class classification for a non-linguistic dataset. The goal is to predict someone's weight category (say: skinny, fit, average, overweight) based on their properties.

We use three features:
* **age in years**
* **height in cms**
* **number of ice cream cones eaten per year**

The feature representation is:

In [6]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

Please note that each row represents a training instance, i.e., the age, height, and the number of ice cream cones eaten in a year for a specific person. Each column represents a feature, i.e., the first column represents the age feature.

The labels are represented in the following way, i.e., the correct weight categories of each training instance:

In [7]:
y = ["overweight", 
     "skinny",
     "fit",
     "average",
     "average"]

### 2.3 Training and testing the model

Let's instantiate the model that we'll be using:

In [8]:
lin_clf = svm.LinearSVC()

We train the model. You might get a warning stating that:
```
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
```

This is not a problem for this toy example, in which we train on only five instances with unscaled features.

In [9]:
lin_clf.fit(X, y)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
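
If the ConvergenceWarning mentioned above bothers you, one option (a sketch, not part of the original lab) is to scale the features and allow more iterations. The variable names `scaled_clf` and `prediction` and the value `max_iter=10000` are assumptions chosen for this illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Same toy data as above: age, height in cm, ice cream cones per year.
X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
y = ["overweight", "skinny", "fit", "average", "average"]

# Scale each feature to zero mean and unit variance before the SVM,
# and raise the iteration limit (10000 is an arbitrary choice).
scaled_clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_clf.fit(X, y)

prediction = scaled_clf.predict([[18, 171, 350]])
print(prediction)
```

Because the scaler is part of the pipeline, new instances are scaled in the same way as the training data before they reach the classifier. With so few training instances, the predicted label may differ from the unscaled model.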

Let's now **apply the model to new instances**: what does it think is the weight category of someone who is 18 years old, 171 cm tall, and eats 350 ice cream cones per year?

In [24]:
predicted_label = lin_clf.predict([[18, 
                                    171, 
                                    350]])
print(predicted_label)

['average']


Apparently the SVM thinks it is **average**, which is not surprising since **number of ice cream cones eaten per year** and **height** seem to correlate highly with the weight categories.

## 3. Incorporating linguistic features: one-hot encoding

So far, we dealt with features that were numbers. However, in NLP problems, we often deal with features such as:
* part of speech
* lemma
* ....

Can an SVM deal with strings? The answer is: not really.
How can we then represent linguistic information about each token in the following phrase:
* **... Germany's representative to the European Union ...**

In the IOB-format, the phrase has the following representation:

```
Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```

We can represent it using something called **one hot encoding**! When we represented *age* in the example from above, we used one column for that feature (see the first column of the matrix `X` above).

### 3.1 Generating features per instance (e.g., token)

In one hot encoding, you use a column for **each possible value** of a feature. This means that it is important to know the possible values of a feature, since they are treated as a closed set. For each feature value, we represent whether that value occurs in a training instance. 
We will now try to represent the features **part of speech** and **lemma**.

Let's first generate those values for each of our tokens, with spaCy:

In [42]:
import spacy

nlp=spacy.load('en_core_web_sm')  # the 'en' shortcut from older spaCy versions was removed in spaCy 3

In [53]:
text="Germany's representative to the European Union"
doc=nlp(text)

training_instances=[]
for token in doc:
    one_training_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_}
    training_instances.append(one_training_instance)

In [54]:
training_instances

[{'part-of-speech': 'PROPN', 'lemma': 'Germany'},
 {'part-of-speech': 'PART', 'lemma': "'s"},
 {'part-of-speech': 'NOUN', 'lemma': 'representative'},
 {'part-of-speech': 'ADP', 'lemma': 'to'},
 {'part-of-speech': 'DET', 'lemma': 'the'},
 {'part-of-speech': 'PROPN', 'lemma': 'European'},
 {'part-of-speech': 'PROPN', 'lemma': 'Union'}]

Our instance information is now a list of dictionaries, with each dictionary representing a training instance (token) with two values (POS tag and lemma).
We can now try to convert these values to a numeric vector representation.

### 3.2 Vectorizing our features

To accomplish this, we use the **DictVectorizer** from sklearn ([link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) ). 

Please recall that in lab session 2 we used two other vectorizers, that create bag-of-words or tf-idf vectors from a vocabulary of words. 

In [56]:
from sklearn.feature_extraction import DictVectorizer

In [97]:
vec = DictVectorizer()

the_array = vec.fit_transform(training_instances).toarray()

### 3.3 Analyzing the vectorized format

Let's now print the resulting vector representation. Each row represents the features for one token.

In [58]:
print(the_array)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]]


Please note that the third column now informs us whether a training instance contains the feature *lemma* with the value *Germany*. If this is the case, the value is 1; otherwise, it is 0. Please note that the number of columns grows immensely when using one hot encoding (you can easily play with this by changing the input sentence above).


Generally speaking, each column represents a **specific value** of a lemma or POS tag. We can get more information on this from the vectorizer:

In [60]:
print(vec.get_feature_names())  # in scikit-learn 1.0 and later, use vec.get_feature_names_out() instead

["lemma='s", 'lemma=European', 'lemma=Germany', 'lemma=Union', 'lemma=representative', 'lemma=the', 'lemma=to', 'part-of-speech=ADP', 'part-of-speech=DET', 'part-of-speech=NOUN', 'part-of-speech=PART', 'part-of-speech=PROPN']


We can see that the second column, for example, stands for the lemma 'European'. Most words do not have this lemma, but the second-to-last word has it. For that reason, the second column in the second-to-last row has the value 1.

Similarly, the last column represents the tokens with a PROPN (proper noun) part-of-speech. We can see that three words have this part-of-speech tag, namely the words represented in the rows: 1, 6 and 7.


As a final analysis step, let's inspect the first row, i.e. the one hot encoding representation of the following training instance,
```
{'part-of-speech': 'PROPN', 'lemma': 'Germany'}
```
The feature vector using one hot encoding is:
```
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```
* **first value: 0**: the feature *lemma* with the value *'s* does not occur in the training instance
* **second value: 0**: the feature *lemma* with the value *European* does not occur in the training instance
* **third value: 1**: the feature *lemma* with the value *Germany* does occur in the training instance
* ...
* **last value: 1**: the feature *part-of-speech* with the value *PROPN* does occur in the training instance
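
To make the mechanics concrete, the following sketch rebuilds the same encoding by hand, without `DictVectorizer`. It assumes one column per observed `feature=value` pair, sorted alphabetically, which mirrors the column order printed above. The names `feature_names`, `column_index`, and `one_hot` are introduced here for illustration.

```python
training_instances = [
    {'part-of-speech': 'PROPN', 'lemma': 'Germany'},
    {'part-of-speech': 'PART', 'lemma': "'s"},
    {'part-of-speech': 'NOUN', 'lemma': 'representative'},
    {'part-of-speech': 'ADP', 'lemma': 'to'},
    {'part-of-speech': 'DET', 'lemma': 'the'},
    {'part-of-speech': 'PROPN', 'lemma': 'European'},
    {'part-of-speech': 'PROPN', 'lemma': 'Union'},
]

# One column per observed feature=value pair, in sorted order.
feature_names = sorted({f"{name}={value}"
                        for instance in training_instances
                        for name, value in instance.items()})
column_index = {name: i for i, name in enumerate(feature_names)}


def one_hot(instance):
    """Return a vector with a 1.0 in the column of each feature=value pair."""
    vector = [0.0] * len(feature_names)
    for name, value in instance.items():
        key = f"{name}={value}"
        if key in column_index:  # values never seen before get no column
            vector[column_index[key]] = 1.0
    return vector


matrix = [one_hot(instance) for instance in training_instances]
print(feature_names)
print(matrix[0])
```

Note that a value that was not observed when the columns were built (for example a new lemma) simply leaves all lemma columns at 0.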

### 3.4 Training an SVM model

Hopefully, you can see the resemblance of the vectors here to the ones we generated with bag-of-words and tf-idf last week. Not surprisingly, we can now train and test a machine learning model, such as SVM. Given that our model is trained on only 7 input vectors, it will not be a meaningful one yet; we will build a model with sufficient data in the assignment.

To train, we also need the 'gold' label for each token. Let's define them manually here, following the IOB example above:

In [93]:
y=['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

Let's now train the model:

In [98]:
lin_clf = svm.LinearSVC()
lin_clf.fit(the_array, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

### 3.5 Testing our model on new examples

We can now reuse the vectorizer that was fitted on the training data to vectorize any new example we want to classify, and predict its label using the trained SVM model:

In [95]:
new_training_instances={'lemma': 'America', 'part-of-speech': 'PROPN'}
vectorized=vec.transform(new_training_instances)
print(vectorized.toarray())

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]


In [99]:
pred=lin_clf.predict(vectorized)
pred

array(['B-ORG'], dtype='<U5')

## 4. Embeddings-based NERC model

### 4.1 Quick introduction to embeddings

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used many other kinds of information: previous words (on the left), next words (on the right), whether the word starts with a capital letter, whether it is an abbreviation, etc.
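
A few of these features can be sketched as a feature dictionary per token, in the same format we used for one hot encoding. The function name `token_features`, the feature names, and the `__START__`/`__END__` placeholders are assumptions made for this illustration.

```python
def token_features(tokens, index):
    """Build a feature dictionary for the token at position `index`."""
    token = tokens[index]
    return {
        'word': token,
        # Placeholder values mark the sentence boundaries.
        'prev-word': tokens[index - 1] if index > 0 else '__START__',
        'next-word': tokens[index + 1] if index < len(tokens) - 1 else '__END__',
        'is-capitalized': token[:1].isupper(),
    }


tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
context_instances = [token_features(tokens, i) for i in range(len(tokens))]
print(context_instances[0])
print(context_instances[6])
```

A list of dictionaries like `context_instances` has the same shape as the `training_instances` list from section 3.1, so it could be vectorized in the same way.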

One recent way to create a 'semantic' representation of a word is to use word embeddings.

In this section, we will load pre-trained word embeddings called word2vec, created by Google. 

First, please download the file from [their google drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). Create a folder in the same directory as this notebook, called 'model' and unpack the word2vec file in that folder.

In [128]:
import gensim

We can now load the file using the gensim library:

In [103]:
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)  

Word embeddings are expected to capture certain aspects of word meaning. Previous research has shown that, to some extent, they perform well on similarity, relatedness, and analogy tasks. For example, we can compute the cosine similarity between two word vectors. We would expect, for example, that *cat* and *tiger* are more similar than *cat* and *Germany*. Feel free to play a bit with the words below to get a feeling for the information these embeddings capture.

In [109]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [142]:
word1='tapas'
word2='pintxos'
vector1=np.array(model[word1]).reshape(1, -1)  # cosine_similarity expects 2D arrays
vector2=np.array(model[word2]).reshape(1, -1)
print(cosine_similarity(vector1, vector2))

[[0.6477412]]
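
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for small made-up vectors (not real word embeddings); the function name `cosine` is introduced here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the second points in the same direction as the first.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors share no direction at all.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure is popular for comparing embeddings.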


We can also get the most similar words to some word, say 'apple':

In [141]:
print(model.most_similar('apple', topn=10))

[('apples', 0.720359742641449), ('pear', 0.6450697183609009), ('fruit', 0.641014575958252), ('berry', 0.6302294135093689), ('pears', 0.6133961081504822), ('strawberry', 0.6058261394500732), ('peach', 0.6025872230529785), ('potato', 0.5960935354232788), ('grape', 0.5935864448547363), ('blueberry', 0.5866668224334717)]


### 4.2 Using embeddings in our NERC model

We will now replace our one-hot input representation of our words with embeddings. We will do that by simply looking up each word in the embeddings model.

In [146]:
training_inputs=[]
for token in doc:
    word=token.text
    if word in model:
        vector=model[word]
    else:
        vector=[0]*300
        print('no fun', word)
    training_inputs.append(vector)

no fun 's
no fun to
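
Since the same lookup is needed again for the test data, it can be wrapped in a small helper that also reports which tokens were missing. This is a sketch: the helper name `lookup_vectors` and the three-dimensional `toy_embeddings` dictionary are made up so that the example runs without the word2vec file.

```python
def lookup_vectors(tokens, embeddings, dim):
    """Look up each token; fall back to an all-zero vector if it is missing."""
    vectors, missing = [], []
    for token in tokens:
        if token in embeddings:
            vectors.append(list(embeddings[token]))
        else:
            vectors.append([0.0] * dim)
            missing.append(token)
    return vectors, missing


# Hypothetical stand-in for the word2vec model (which has 300 dimensions).
toy_embeddings = {
    'Germany': [0.2, -0.1, 0.4],
    'representative': [0.0, 0.3, 0.1],
    'the': [0.1, 0.1, 0.1],
    'European': [0.3, -0.2, 0.5],
    'Union': [0.2, 0.0, 0.6],
}

tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
vectors, missing = lookup_vectors(tokens, toy_embeddings, dim=3)
print(missing)  # ["'s", 'to']
```

Keeping track of the missing tokens makes it easy to check how much of the data is represented by zero vectors, which carry no information for the classifier.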


In [148]:
lin_clf = svm.LinearSVC()
lin_clf.fit(training_inputs, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

**Testing the model** Let's say we want to test our model with the sentence: 'I love beer from Munich'.

In [162]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)
gold_labels=['O', 'O', 'O', 'O', 'B-LOC']

test_inputs=[]

for token in test_doc:
    word=token.text
    if word in model:
        vector=model[word]
    else:
        vector=[0]*300
    test_inputs.append(vector)
    
pred=lin_clf.predict(test_inputs)

In [163]:
pred

array(['O', 'O', 'O', 'O', 'B-LOC'], dtype='<U5')
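
Because we work at the token level, a simple way to compare predictions with the gold labels is token-level accuracy. The sketch below uses plain Python lists; the function name `token_accuracy` and the second, partly wrong prediction are made up for illustration.

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


gold_labels = ['O', 'O', 'O', 'O', 'B-LOC']
perfect_prediction = ['O', 'O', 'O', 'O', 'B-LOC']
all_o_prediction = ['O', 'O', 'O', 'O', 'O']  # hypothetical system that always says 'O'

print(token_accuracy(gold_labels, perfect_prediction))  # 1.0
print(token_accuracy(gold_labels, all_o_prediction))    # 0.8
```

The second example shows why accuracy alone can be misleading for NERC: a system that never finds an entity still scores 0.8 here, because most tokens are labeled *O*.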

## 5. NERC datasets

### 5.1 CoNLL-2003

Now that we've seen how to represent linguistic features, we also need to access relevant linguistic training data for the NERC task. One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which was provided with the zip file you downloaded from Canvas.
You can load it using the following code snippet.

In [171]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('NERC_datasets/CONLL2003', # the folder where CoNLL-2003 is stored (you downloaded this with the zip file from Canvas) 
                          'train.txt', # this will load the file 'train.txt', for the exercise you also need to load 'test.txt' 
                          ['words', 'pos', 'ignore', 'chunk'])
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break

EU NNP B-ORG


We can, for example, iterate through this data and make a list of the tokens as inputs and of the `ne_label` values as desired outputs. The input tokens could, for example, be looked up in our word embeddings model.

In [174]:
input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='DOCSTART':
        if token in model:
            vector=model[token]
        else:
            vector=[0]*300
        input_vectors.append(vector)
        labels.append(ne_label)

We have successfully loaded our data. Let's see how many tokens/labels we have:

In [175]:
print(len(labels))

203621


In a next step, we could easily train a model on this data as shown in section 4.2 above.
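
Before training, it can be useful to look at how the labels are distributed. The sketch below counts labels with `collections.Counter`; the short `sample_labels` list is a stand-in for the full `labels` list built above, so that the example runs on its own.

```python
from collections import Counter

# Stand-in for the full `labels` list (assumption: a handful of labels only).
sample_labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

label_counts = Counter(sample_labels)
for label, count in label_counts.most_common():
    share = count / len(sample_labels)
    print(f"{label}\t{count}\t{share:.2f}")
```

In NERC data the *O* label is usually by far the most frequent one, which is worth keeping in mind when you interpret evaluation scores.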

### 5.2 Kaggle
Another interesting dataset is the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), which we also provided in the zip file you downloaded from Canvas. You can load it in the following way:

In [166]:
import pandas

In [167]:
path = 'NERC_datasets/kaggle/ner_v2.csv'

In [168]:
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)  # in pandas 1.3 and later, use on_bad_lines='skip' instead

b'Skipping line 281837: expected 25 fields, saw 34\n'


You will see the following output after running the above code cell:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this; it only means that one malformed line of the csv file was skipped.

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [169]:
kaggle_dataset.columns

Index(['id', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
       'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
       'next-word', 'pos', 'prev-iob', 'prev-lemma', 'prev-pos',
       'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
       'prev-prev-word', 'prev-shape', 'prev-word', 'sentence_idx', 'shape',
       'word', 'tag'],
      dtype='object')

[Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [170]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break


0
id                             0
lemma                   thousand
next-lemma                    of
next-next-lemma         demonstr
next-next-pos                NNS
next-next-shape        lowercase
next-next-word     demonstrators
next-pos                      IN
next-shape             lowercase
next-word                     of
pos                          NNS
prev-iob              __START1__
prev-lemma            __start1__
prev-pos              __START1__
prev-prev-iob         __START2__
prev-prev-lemma       __start2__
prev-prev-pos         __START2__
prev-prev-shape         wildcard
prev-prev-word        __START2__
prev-shape              wildcard
prev-word             __START1__
sentence_idx                   1
shape                capitalized
word                   Thousands
tag                            O
Name: 0, dtype: object
NERC label O


We could for instance use these features as inputs in a machine learning model with our DictVectorizer, or by transforming them using embeddings.
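
As a sketch of the first option, the rows can be turned into feature dictionaries of the kind `DictVectorizer` expects. To keep the example self-contained, it reads a tiny inline csv sample with the standard `csv` module instead of the real file. The first row reuses values from the output above; the second row and the choice of `feature_columns` are assumptions made for illustration.

```python
import csv
import io

# Tiny stand-in for ner_v2.csv with only a few of its columns.
sample_csv = """word,lemma,pos,shape,tag
Thousands,thousand,NNS,capitalized,O
of,of,IN,lowercase,O
"""

feature_columns = ['lemma', 'pos', 'shape']  # hypothetical feature selection

features = []
labels = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    # Keep only the selected columns as features; 'tag' is the NERC label.
    features.append({column: row[column] for column in feature_columns})
    labels.append(row['tag'])

print(features[0])
print(labels)
```

With the pandas DataFrame loaded above, a similar list of dictionaries could be obtained with `kaggle_dataset[feature_columns].to_dict('records')`.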