# Lab3-Supervised NERC classifier (SVM)

In this notebook, we provide more information about Named Entity Recognition and Classification (NERC).

**At the end of this notebook, you will be able to**:
* understand the IOB format used to format NERC data
* represent linguistic features as vectors
* train a NERC classifier (SVM)
* apply the classifier to unseen data

**Useful links**:
* [blog about SVM](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)
* [blog about SVM in scikit-learn](https://medium.com/@aneesha/visualising-top-features-in-linear-svm-with-scikit-learn-and-matplotlib-3454ab18a14d)
* [blog about inspecting top features using scikit-learn](https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2)
* [one hot encoding](https://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts)

## 1. NERC
In Named Entity Recognition and Classification, the goal is to determine which noun phrases refer to named entities and to classify them.
Named entities can be persons, locations, organizations, etc. (see [NLTK Chapter 7, Section 5](https://www.nltk.org/book/ch07.html) for more information on the task)

![title](https://researchkb.files.wordpress.com/2014/02/ner.png) 

It is not trivial to represent NERC data in a way that makes it easy to both train and evaluate NLP systems. One of the most used formats is called [Inside–outside–beginning (IOB)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Let's look at an example from one of the most popular datasets, which is [CoNLL-2003](http://aclweb.org/anthology/W03-0419).
```
Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```

The first observation is that all information is represented at the **token-level**. For each token, e.g., *Germany*, we receive information about:
* **the word**: e.g., *Germany*
* **the part of speech**: e.g., *NNP* (from [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))
* **the phrase type**: e.g., a noun phrase
* **the NERC label**: e.g., a location (LOC).

This example contains two named entities: *Germany* and *European Union*.

Every first token of a named entity is prefixed with *B-*. Every token after that, e.g., *Union* in *European Union*, is prefixed with *I-*. Tokens that are not part of a named entity receive the label *O*.

Please note that the IOB format is at the **token-level**, which means that we also are going to train and evaluate an NLP system at the token-level! The goal will hence not be to classify *European Union* as an *Organization*, but to classify:
* *European* as the first token of an entity that is an *Organization*
* *Union* as a token inside of an entity that is an *Organization*
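
To make the relation between token-level labels and whole entities concrete, here is a minimal sketch that groups IOB labels back into entities. The function name `iob_to_entities` is made up for this illustration and is not part of any library; a stray *I-* label without a preceding *B-* is simply treated as the start of a new entity, which is one possible choice among several.

```python
def iob_to_entities(tokens, labels):
    """Group token-level IOB labels into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        # An I- label of the same type continues the entity that is open.
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue
        # Any other label closes the entity that was open, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # A B- label (or a stray I- label) opens a new entity.
        if label != "O":
            current_tokens, current_type = [token], label[2:]
    # Close an entity that runs until the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
labels = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
entities = iob_to_entities(tokens, labels)
print(entities)
```

The classifier in this notebook only ever sees the individual rows; a grouping step like this one would only be needed if you want to report whole entities afterwards.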

## 2. SVM
We are going to train a Support Vector Machine (SVM) for the NERC task. An SVM looks for a hyperplane in an n-dimensional feature space that separates the data points of different classes. This matches the problem at hand: each token is a data point, and we want to assign it the correct NERC label.

### 2.1 Scikit-learn
We are going to use the **svm** module from **sklearn**, from which we will select the **LinearSVC** (Linear Support Vector Classification) class.

In [None]:
from sklearn import svm

In [None]:
svm.LinearSVC

### 2.2 Representing features in sklearn.svm

Similar to when we trained a Sentiment Analyzer in Lab 2, we need to represent training instances using a vector representation. For each training instance, we need:
* **its feature vector**
* **the NERC label**

We show how to train and evaluate an SVM using a made-up example of multi-class classification for a non-linguistic dataset. The goal is to predict someone's favorite color based on their properties.

We use three features:
* **age in years**
* **height in cms**
* **number of ice cream cones eaten per year**

The feature representation is:

In [None]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

Please note that each row represents a training instance, i.e., the age, height, and the number of ice cream cones eaten in a year for a specific person. Each column represents a feature, i.e., the first column represents the age feature.

The labels are represented in the following way, i.e., the favorite colors of each training instance:

In [None]:
y = ["green", 
     "blue",
     "blue",
     "red",
     "red"]

### 2.3 Training the model

Let's instantiate the model that we'll be using:

In [None]:
lin_clf = svm.LinearSVC()

We train the model. You might get a warning stating that:
```
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
```

This is to be expected given that we only train using five instances.

In [None]:
lin_clf.fit(X, y)
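
If the convergence warning bothers you, one common remedy is to bring the features onto a comparable scale and to allow more iterations. The sketch below is not part of the lab's required setup; the name `scaled_clf` and the value `max_iter=10000` are arbitrary choices for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Same toy data as above: age, height in cm, ice cream cones per year.
X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
y = ["green", "blue", "blue", "red", "red"]

# StandardScaler rescales every column to zero mean and unit variance
# before the data reaches the SVM; max_iter raises the iteration limit.
scaled_clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_clf.fit(X, y)

scaled_prediction = scaled_clf.predict([[35, 180, 100]])
print(scaled_prediction)
```

Because the scaler is part of the pipeline, new instances passed to `predict` are rescaled in the same way as the training data.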

### 2.4 Applying the model to new instances

Now let's see what the model predicts as the favorite color of someone who is 35 years old, is 180 cm tall, and eats 100 ice cream cones per year.

In [None]:
predicted_label = lin_clf.predict([[35, 
                                    180, 
                                    100]])
print(predicted_label)

Apparently the SVM thinks it is **blue**, which is not surprising since **number of ice cream cones eaten per year** seems to correlate highly with the favorite colors.
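
To look at what the model has learned, you can inspect the weight it assigns to each feature for each class. The following is a rough sketch: the list `feature_names` is introduced here only for readability, and because the features are unscaled, the raw weights are not directly comparable across features, so treat them as a first impression rather than as evidence.

```python
from sklearn import svm

X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
y = ["green", "blue", "blue", "red", "red"]

lin_clf = svm.LinearSVC()
lin_clf.fit(X, y)  # may emit the ConvergenceWarning discussed earlier

feature_names = ["age", "height", "ice cream cones"]

# classes_ holds the labels in sorted order; coef_ has one row of
# feature weights per class (one-vs-rest).
for label, weights in zip(lin_clf.classes_, lin_clf.coef_):
    rounded = {name: round(float(w), 3) for name, w in zip(feature_names, weights)}
    print(label, rounded)
```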

### 2.5 Incorporating linguistic features
So far, we dealt with features that were numbers. However, in NLP problems, we often deal with features such as:
* part of speech
* lemma
* ....

Can an SVM deal with strings? The answer is: not really.
How can we then represent linguistic information about each token in the following sentence:
* **The cat sleeps**

We can if we use something called **one hot encoding**! When we represented *age* in the example from above, we used one column for that feature (see first column of the following feature vectors). 

In [None]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

In one hot encoding, you use a column for **each possible value** of a feature. This means that it is important to know the possible values of a feature, since they are treated as a closed class. For each feature value, we represent whether it occurs in a training instance.
Let's try to represent the features **part of speech** and **lemma**.

To accomplish this, we use the **DictVectorizer** from sklearn. Our input format is now a list with dictionaries, with each dictionary representing a training instance.

In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
training_instances = [
 {'part-of-speech': 'determiner', 'lemma': 'the'},
 {'part-of-speech': 'noun', 'lemma': 'cat'},
 {'part-of-speech': 'verb', 'lemma': 'sleep'},
]

vec = DictVectorizer()

the_array = vec.fit_transform(training_instances).toarray()
print(the_array)
print(vec.get_feature_names_out()) # use get_feature_names() in scikit-learn versions before 1.0

Please note that the first column now informs us whether a training instance contains the feature *lemma* with the value *cat*. If this is not the case, the value is 0. If this is the case, the value is 1. Please note that the number of columns grows immensely when using one hot encoding.

Let's inspect the first row, i.e. the one hot encoding representation of the following training instance,
```
 {'part-of-speech': 'determiner', 'lemma': 'the'}
```
The feature vector using one hot encoding is:
```
[0. 0. 1. 1. 0. 0.]
```
* **first value: 0**: the feature *lemma* with the value *cat* does not occur in the training instance
* **second value: 0**: the feature *lemma* with the value *sleep* does not occur in the training instance
* **third value: 1**: the feature *lemma* with the value *the* does occur in the training instance
* ....
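
Putting the pieces together, the sketch below trains a `LinearSVC` on one hot encoded dictionaries and applies it to an unseen token. The five training tokens are taken from the IOB example in Section 1; the unseen lemma *France* and the variable names are made up for this illustration, and five instances are of course far too few to learn anything meaningful.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

train_features = [
    {'part-of-speech': 'NNP', 'lemma': 'Germany'},
    {'part-of-speech': 'POS', 'lemma': "'s"},
    {'part-of-speech': 'NN', 'lemma': 'representative'},
    {'part-of-speech': 'NNP', 'lemma': 'European'},
    {'part-of-speech': 'NNP', 'lemma': 'Union'},
]
train_labels = ['B-LOC', 'O', 'O', 'B-ORG', 'I-ORG']

vec = DictVectorizer()
# fit_transform learns the columns and returns a sparse matrix,
# which LinearSVC accepts directly.
X_train = vec.fit_transform(train_features)

clf = LinearSVC()
clf.fit(X_train, train_labels)

# For unseen data, use transform (not fit_transform) so that the columns
# stay the same. Feature values that never occurred in training, such as
# the lemma 'France', have no column and are silently dropped.
unseen = [{'part-of-speech': 'NNP', 'lemma': 'France'}]
X_unseen = vec.transform(unseen)

print(X_unseen.toarray())
unseen_prediction = clf.predict(X_unseen)
print(unseen_prediction)
```

Only the column for *part-of-speech=NNP* is set in the unseen vector, so the prediction rests entirely on the part of speech.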


## 3. NERC datasets
Now that we've seen how to represent linguistic features, we also need to access relevant linguistic training data for the NERC task. One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which was provided with the zip file you downloaded from Canvas.
You can load it using the following code snippet.

In [None]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('NERC_datasets/CONLL2003', # the folder where CoNLL-2003 is stored (you downloaded this with the zip file from Canvas) 
                          'train.txt', # this will load the file 'train.txt', for the exercise you also need to load 'test.txt' 
                          ['words', 'pos', 'ignore', 'chunk']) # the fourth column (NERC label) is read as 'chunk' so that iob_words() returns it
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break
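
The comment in the loop above asks you to store each token as a dictionary. A minimal sketch of that step is shown below. To keep it runnable without the corpus files, the list `iob_triples` stands in for `train.iob_words()`, and the feature names `'token'` and `'pos'` are only one possible choice.

```python
# Stand-in for train.iob_words(): (token, part of speech, NERC label) triples.
iob_triples = [
    ("Germany", "NNP", "B-LOC"),
    ("'s", "POS", "O"),
    ("representative", "NN", "O"),
]

training_features = []
gold_labels = []
for token, pos, ne_label in iob_triples:
    # One dictionary per token, in the input format DictVectorizer expects.
    training_features.append({'token': token, 'pos': pos})
    # The labels are kept in a separate list, in the same order.
    gold_labels.append(ne_label)

print(training_features[0], gold_labels[0])
```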

## Kaggle
Another interesting dataset is the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), which we also provided in the zip file you downloaded from Canvas. You can load it in the following way:

In [None]:
import pandas

In [None]:
path = 'NERC_datasets/kaggle/ner_v2.csv'

In [None]:
kaggle_dataset = pandas.read_csv(path, on_bad_lines='warn') # use error_bad_lines=False in pandas versions before 1.3

You will see output similar to the following after running the above code cell:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this.

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [None]:
kaggle_dataset.columns

[Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [None]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break
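
As a rough sketch of the conversion mentioned in the comment above, the rows of a DataFrame can be turned into feature dictionaries in one step. The small DataFrame `toy_dataset` stands in for `kaggle_dataset` so that the example runs without the csv file. The column name `'tag'` is the one used above; `'word'` and `'pos'` are assumptions, so check `kaggle_dataset.columns` for the actual names.

```python
import pandas

# Stand-in for kaggle_dataset with assumed column names.
toy_dataset = pandas.DataFrame({
    'word': ['Germany', "'s", 'representative'],
    'pos': ['NNP', 'POS', 'NN'],
    'tag': ['B-geo', 'O', 'O'],
})

feature_columns = ['word', 'pos']

# to_dict('records') returns one dictionary per row, which is the
# input format DictVectorizer expects.
kaggle_features = toy_dataset[feature_columns].to_dict('records')
kaggle_labels = toy_dataset['tag'].tolist()

print(kaggle_features[0], kaggle_labels[0])
```

This avoids an explicit `iterrows()` loop, which tends to be slow on large DataFrames.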