# Lab4a Supervised NERC classifier (SVM)

In this notebook, we focus on Named Entity Recognition and Classification (NERC).

**At the end of this notebook, you will be able to**:
* understand the IOB format used to format NERC data
* represent linguistic features as vectors
* use pretrained word embeddings
* train NERC classifiers (SVM)
* apply the classifiers to unseen data

**Useful links**:
* [blog about SVM](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)
* [blog about SVM in scikit-learn](https://medium.com/@aneesha/visualising-top-features-in-linear-svm-with-scikit-learn-and-matplotlib-3454ab18a14d)
* [blog about inspecting top features using scikit-learn](https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2)
* [one hot encoding](https://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts)

## 1. NERC
In Named Entity Recognition and Classification, the goal is to determine which noun phrases refer to named entities and to classify them.
Named entities can be persons, locations, organizations, etc. (see [NLTK Chapter 7, Section 5](https://www.nltk.org/book/ch07.html) for more information on the task)

![title](https://researchkb.files.wordpress.com/2014/02/ner.png) 

It is not trivial to represent NERC data in a way that we can easily train NLP systems as well as evaluate them. One of the most used formats is called [Inside–outside–beginning (IOB)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Let's look at an example from one of the most popular datasets, which is [CoNLL-2003](http://aclweb.org/anthology/W03-0419).
```
Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```

The first observation is that all information is represented at the **token-level**. For each token, e.g., *Germany*, we receive information about:
* **the word**: e.g., *Germany*
* **the part of speech**: e.g., *NNP* (from [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))
* **the phrase type**: e.g., a noun phrase
* **the NERC label**: e.g., a location (LOC).

This example contains two named entities: *Germany* and *European Union*.

Every first token of a named entity is prefixed with *B-*. Every token after that, e.g., *Union* in *European Union*, is prefixed with *I-*. Tokens that are not part of any named entity receive the label *O*.

Please note that the IOB format is at the **token-level**, which means that we also are going to train and evaluate an NLP system at the token-level! The goal will hence not be to classify *European Union* as an *Organization*, but to classify:
* *European* as the first token of an entity that is an *Organization*
* *Union* as a token inside of an entity that is an *Organization*

Please make sure you understand the format before you proceed ;)
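To check your understanding of the format, the following minimal sketch groups token-level IOB labels back into complete entities. The function name `iob_to_entities` is made up for this illustration and is not part of any library; a stray *I-* label without a preceding *B-* is simply treated as the start of an entity here, which is one possible convention.

```python
def iob_to_entities(tokens, labels):
    """Group token-level IOB labels into (entity_text, entity_type) pairs."""
    entities = []
    current = None  # [list_of_tokens, entity_type] while we are inside an entity
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and current is not None and current[1] == label[2:]:
            current[0].append(token)  # continue the running entity
            continue
        if current is not None:
            # the running entity ends before this token
            entities.append((" ".join(current[0]), current[1]))
            current = None
        if label != "O":
            # "B-X" starts a new entity; a stray "I-X" is treated as a start too
            current = [[token], label[2:]]
    if current is not None:
        entities.append((" ".join(current[0]), current[1]))
    return entities


tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
labels = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
print(iob_to_entities(tokens, labels))
# [('Germany', 'LOC'), ('European Union', 'ORG')]
```

Note that the classifiers in this notebook never see the grouped entities: they predict one label per token.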

## 2. SVM
We are going to train an SVM for the NERC task. The goal of an SVM is to find a hyperplane in an n-dimensional feature space that separates the data points of different classes. This fits the problem at hand: we have multiple NERC labels and we want to assign the correct label to each token.
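To give an intuition for what "finding a hyperplane" means at prediction time, here is a minimal sketch of a linear decision rule with one hyperplane per class, where the class with the highest score wins (a one-vs-rest scheme). The weights and biases below are invented for illustration and are not learned from data; an SVM would estimate them during training.

```python
def decision_scores(x, weights, biases):
    """Return one score w.x + b per class for the feature vector x."""
    return {
        label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + biases[label]
        for label, w in weights.items()
    }


def predict(x, weights, biases):
    """Pick the class whose hyperplane gives the highest score."""
    scores = decision_scores(x, weights, biases)
    return max(scores, key=scores.get)


# Made-up weights for a toy problem with two features (not learned from data)
weights = {"B-LOC": [1.0, -0.5], "O": [-1.0, 0.5]}
biases = {"B-LOC": -0.2, "O": 0.2}

print(predict([1.0, 0.0], weights, biases))  # B-LOC
print(predict([0.0, 1.0], weights, biases))  # O
```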

### 2.1 Scikit-learn
We are going to use the **svm** module from **sklearn**, from which we will select the **LinearSVC** (Linear Support Vector Classification) class.

In [None]:
from sklearn import svm

In [None]:
svm.LinearSVC

### 2.2 Representing features in sklearn.svm

Similar to when we trained a Sentiment Analyzer in Lab 2, we need to represent training instances using a vector representation. For each training instance, we need:
* **its feature vector** (the representation of some input)
* **the NERC label** (the corresponding output class)

We show how to train and evaluate an SVM using a made-up example of multi-class classification for a non-linguistic dataset. The goal is to predict someone's weight category (say: skinny, fit, average, overweight) based on their properties.

We use three features:
* **age in years**
* **height in cms**
* **number of ice cream cones eaten per year**

The feature representation (for 5 people) is:

In [None]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

Please note that each row represents a training instance, i.e., the age, height, and the number of ice cream cones eaten in a year for a specific person. Each column represents a feature, i.e., the first column represents the age feature.

The labels are represented in the following way, i.e., the correct weight categories of each training instance:

In [1]:
y = ["overweight", 
     "skinny",
     "fit",
     "average",
     "average"]

### 2.3 Training and testing the model

Let's instantiate the model that we'll be using:

In [None]:
lin_clf = svm.LinearSVC()

We train the model. You might get a warning stating that:
```
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
```

This is to be expected given that we only train using five instances.

In [None]:
lin_clf.fit(X, y)

Let's now **apply the model to new instances**: what does SVM think the weight category is of someone of 18 years, 171cm, and who eats 400 ice cream cones per year?

In [None]:
predicted_label = lin_clf.predict([[18, 
                                    171, 
                                    400]])
print(predicted_label)

Apparently the SVM thinks it is **average**, which is not surprising since **number of ice cream cones eaten per year** and **height** seem to correlate highly with the weight categories.

## 3. Incorporating linguistic features: one-hot encoding

So far, we dealt with features that were numbers. However, in NLP problems, we often deal with features such as:
* part of speech
* lemma
* ....

Can an SVM deal with strings? The answer is: not really.
How can we then represent linguistic information about each token in the following phrase:
* **... Germany's representative to the European Union ...**

In the IOB-format, the phrase has the following representation:

```
Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```

We can if we use something called **one hot encoding**! When we represented *age* in the example from above, we used one column for that feature (see the first column of the matrix `X` above). One hot encoding works a bit differently.

### 3.1 Generating features per instance (e.g., token)

In one hot encoding, you use a column for **each possible value** of a feature. This means that it is important to know the possible values of a feature, since they are treated as a closed class. For each feature value, we represent whether that value occurs in a training instance.
We will now try to represent the features **part of speech** and **lemma**.

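Let's first generate those values for each of our tokens, with spaCy:

In [None]:
import spacy

# spaCy 3 and later no longer accept the shortcut 'en'; load the model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp=spacy.load('en_core_web_sm')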

In [None]:
text="Germany's representative to the European Union"
doc=nlp(text)

training_instances=[]
for token in doc:
    one_training_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_}
    training_instances.append(one_training_instance)

In [None]:
training_instances

Our instance information is now a list of dictionaries, with each dictionary representing a training instance (token). For each instance, we store two values: POS tag and lemma.
Next, we will convert these values to a numeric vector representation.

### 3.2 Vectorizing our features

To accomplish this, we use the **DictVectorizer** from sklearn ([link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)). 

Please recall that in lab session 2 we used two other vectorizers, that create bag-of-words or tf-idf vectors from a vocabulary of words. Those two vectorizers operated on a vocabulary level: they assigned a non-zero value to the words which occur in an input, and zero to all others.

The DictVectorizer does something similar, but on a feature level: for each feature (let's say POS tag) DictVectorizer assigns a value of 1 to its value (for example, VERB) and zeros to all others (e.g., NOUN, PROPN, etc.).

In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
vec = DictVectorizer()

the_array = vec.fit_transform(training_instances).toarray() 
# The toarray() is only there for educational purposes. 
# Please do not use it in the assignment since you might get memory issues.

### 3.3 Analyzing the vectorized format

Let's now print the resulting vector representation. Each **row** represents the features for one token. Each **column** corresponds to one feature value (for example, VERB part-of-speech).

In [None]:
print(the_array)

Generally speaking, each column represents a **specific value** of a lemma or POS tag. We can get more information on this from the vectorizer:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vec.get_feature_names_out())

We can see that the second column, for example, stands for the lemma 'European'. Most words do not have this lemma, but the second-to-last word has it. For that reason, the second column in the second-to-last row has the value 1. All other rows have zeros in that column, because their lemma is different.

Similarly, the last column represents the tokens with a PROPN (proper noun) part-of-speech. We can see that three words have this part-of-speech tag, namely the words represented in the rows: 1, 6 and 7.


As a final analysis step, let's inspect the first row, i.e. the one hot encoding representation of the following training instance,
```
{'part-of-speech': 'PROPN', 'lemma': 'Germany'}
```
The feature vector using one hot encoding is:
```
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```
* **first value: 0**: the feature *lemma* with the value *'s* does not occur in the training instance
* **second value: 0**: the feature *lemma* with the value *European* does not occur in the training instance
* **third value: 1**: the feature *lemma* with the value *Germany* does occur in the training instance
* ...
* **last value: 1**: the feature *part-of-speech* with the value *PROPN* does occur in the training instance


Please note that the number of columns grows immensely when using one hot encoding (you can easily play with this by changing the input sentence above).

### 3.4 Training an SVM model

Hopefully, you can see that the vectors we end up with here resemble the ones we generated with bag-of-words and tf-idf last week. Not surprisingly, we can now use them to train and test a machine learning model, such as SVM. Given that our model is trained on only 7 input vectors, it will not be a meaningful one yet; we will build a model with sufficient data in the assignment.

To train, we also need to have the 'gold' labels for each of the tokens. Let's define them manually here, according to the IOB example above:

In [None]:
y=['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

Let's now train the model:

In [None]:
lin_clf = svm.LinearSVC()
lin_clf.fit(the_array, y)

### 3.5 Testing our model on new examples

We can now reuse the vectorizer that was fitted on the training data to vectorize any new example we want to test, and perform prediction on it using the trained SVM model:

In [None]:
new_training_instances=[{'lemma': 'America', 'part-of-speech': 'PROPN'}] # a list with one instance (dictionary)
vectorized=vec.transform(new_training_instances)
print(vectorized.toarray())

In [None]:
pred=lin_clf.predict(vectorized)
pred

Well done! You have now managed to train an SVM model based on two features and to apply it to a new example. Feel free to play around with the training data, the features, and the test data. We will work with more data and more features in the assignment.

Let us finish this section with several key observations:
* Our vectorized representations will easily become too large. For example, the lemma feature alone could easily have thousands of values. On the other hand, they are **sparse** representations, containing mostly zeros and occasional 1 values. Is there a better way to encode our data in vectors? The answer is: yes. We will work with **dense** representations in the next section!
* In the test example above, the lemma of America was not found in the training data, so no existing lemma received a value of 1 in the final vector representation. This is because the set of feature values is 'frozen' after training: any new feature value encountered at test time is considered to be *unknown* (typically called **UNK** in NLP terminology).
* Finally, a note on the algorithm. SVM can yield some powerful models if we use good features and train it well. However, it does not have an intrinsic capability to capture **sequences of words**. For this purpose, people often use a recurrent neural network. We will not work with RNNs in this course.
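The column layout from section 3.3 and the 'frozen' vocabulary from the second observation can be sketched in plain Python. This is a simplified illustration and not how DictVectorizer is implemented internally; the helper names `fit_vocabulary` and `one_hot` are made up for this example.

```python
def fit_vocabulary(instances):
    """Collect every 'feature=value' pair seen in the data and give it a column index."""
    names = sorted({f"{feat}={val}" for inst in instances for feat, val in inst.items()})
    return {name: index for index, name in enumerate(names)}


def one_hot(instance, vocabulary):
    """Encode one instance using a vocabulary that was fixed beforehand."""
    vector = [0] * len(vocabulary)
    for feat, val in instance.items():
        column = vocabulary.get(f"{feat}={val}")
        if column is not None:  # values never seen while fitting are ignored
            vector[column] = 1
    return vector


instances = [
    {"part-of-speech": "PROPN", "lemma": "Germany"},
    {"part-of-speech": "DET", "lemma": "the"},
    {"part-of-speech": "PROPN", "lemma": "Union"},
]
vocabulary = fit_vocabulary(instances)
print(list(vocabulary))
# ['lemma=Germany', 'lemma=Union', 'lemma=the', 'part-of-speech=DET', 'part-of-speech=PROPN']
print(one_hot(instances[0], vocabulary))
# [1, 0, 0, 0, 1]
print(one_hot({"part-of-speech": "PROPN", "lemma": "America"}, vocabulary))
# [0, 0, 0, 0, 1]
```

The unseen lemma *America* switches on no lemma column at all; only the known part-of-speech value contributes to the vector.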

## 4. Embeddings-based NERC model

### 4.1 Quick introduction to embeddings

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used a large number of other features: features of the previous words (on the left) or the next words (on the right), whether the current word starts with a capital, whether it is an abbreviation, etc.
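As a rough sketch of such hand-crafted features, the function below builds a feature dictionary for one token, including its left and right neighbours and two capitalization checks. The feature names and the padding symbols `<START>` and `<END>` are choices made for this illustration only.

```python
def token_features(tokens, index):
    """Build a feature dictionary for the token at position `index`."""
    token = tokens[index]
    return {
        "token": token,
        # neighbouring words, with padding symbols at the sentence boundaries
        "prev-token": tokens[index - 1] if index > 0 else "<START>",
        "next-token": tokens[index + 1] if index < len(tokens) - 1 else "<END>",
        # simple orthographic features
        "is-capitalized": token[:1].isupper(),
        "is-all-caps": token.isupper(),
    }


tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
print(token_features(tokens, 5))
```

A list of such dictionaries, one per token, has the same shape as the `training_instances` list from section 3.1.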

Another, more recent way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. For this reason, they are called dense representations.

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth).

In this section, we will load pre-trained word embeddings called word2vec, created by Google. 

First, please download the file from [their google drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). Then, create a folder in the same directory as this notebook, called 'model' and unpack the word2vec file in that folder.

In [None]:
import gensim

We can now load the file using the gensim library (this takes a bit):

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)  

Word embeddings are expected to capture certain aspects of the meaning of words. Previous research has shown that, to some extent, they can capture similarity (tapas is similar to pintxos), relatedness (tapas relates to Spain), and analogy (Paris is to France as Rome is to Italy).

To get an idea of these properties of embeddings, we can compute the cosine similarity between two word vectors. We would expect, for example, that cat and tiger are more similar than cat and Germany. Feel free to play a bit with the words below to get some feeling for the information these embeddings capture.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
word1='tapas'
word2='pintxos'
vector1=np.array(model[word1]).reshape(1, -1) # reshape to a 1 x 300 matrix, since cosine_similarity expects 2D input
vector2=np.array(model[word2]).reshape(1, -1)
print(cosine_similarity(vector1, vector2))
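For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. A minimal pure-Python sketch follows; the function name `cosine` and the decision to return 0.0 when one of the vectors is all zeros are choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction: (approximately) 1
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

The value depends only on the direction of the vectors, not on their length, which is why it is a common choice for comparing embeddings.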

We can also get the most similar words to some word, say 'apple':

In [None]:
print(model.most_similar('apple', topn=10))

### 4.2 Using embeddings in our NERC model

We will now replace the one-hot input representation of our words with embeddings. We generate our input data by simply looking up each word in the embeddings model.

In [None]:
training_inputs=[]
for token in doc:
    word=token.text
    if word in model:
        vector=model[word]
    else: # if the word does not exist in the embeddings vocabulary, use an all-zeros vector
        vector=[0]*300
        print('not in vocabulary:', word)
    training_inputs.append(vector)
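Since the same lookup with a zero-vector fallback is needed for training and test data alike, it can be wrapped in a small helper. The sketch below uses a toy dictionary with two dimensions in place of the word2vec model; the name `embed_tokens` is hypothetical. It relies only on `token in embeddings` and `embeddings[token]`, the same two operations the cell above uses on the loaded model.

```python
def embed_tokens(tokens, embeddings, dimensions):
    """Look up each token; unknown tokens get an all-zeros vector."""
    vectors, unknown = [], []
    for token in tokens:
        if token in embeddings:
            vectors.append(list(embeddings[token]))
        else:
            vectors.append([0.0] * dimensions)
            unknown.append(token)
    return vectors, unknown


# Toy stand-in for the embedding model (2 dimensions instead of 300)
toy_embeddings = {"Germany": [0.1, 0.3], "Union": [0.2, -0.4]}
vectors, unknown = embed_tokens(["Germany", "'s", "Union"], toy_embeddings, dimensions=2)
print(vectors)
print("not in vocabulary:", unknown)
```

Returning the list of unknown tokens makes it easy to check how much of a dataset falls outside the embedding vocabulary.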

Same as in the earlier cases, once we have the vector representations, we can use them to train our model:

In [None]:
lin_clf = svm.LinearSVC()
lin_clf.fit(training_inputs, y)

**Testing the model** Let's say we want to test our model with the sentence: 'I love beer from Munich'.

In [None]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)
gold_labels=['O', 'O', 'O', 'O', 'B-LOC']

test_inputs=[]

for token in test_doc:
    word=token.text
    if word in model:
        vector=model[word]
    else:
        vector=[0]*300
    test_inputs.append(vector)
    
pred=lin_clf.predict(test_inputs)

In [None]:
pred
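The `gold_labels` list defined above can be used to compare the predictions with the expected labels token by token. The sketch below uses a hypothetical prediction list (`example_pred`), because the actual output of a model trained on seven tokens may differ; the helper names are made up for this example.

```python
from collections import Counter


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def confusion_counts(gold, predicted):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold, predicted))


gold_labels = ['O', 'O', 'O', 'O', 'B-LOC']
example_pred = ['O', 'O', 'O', 'O', 'O']  # hypothetical output, not the real model prediction

print(token_accuracy(gold_labels, example_pred))  # 0.8
print(confusion_counts(gold_labels, example_pred))
```

In this made-up case the accuracy is 0.8 even though the only entity in the sentence was missed, which is why the per-label counts are worth inspecting as well.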

Congratulations! You have now trained and tested your first embeddings-based NERC model.

As mentioned above, a more modern version of this model would replace the SVM with a sequence-to-sequence architecture from the recurrent neural networks family.

## 5. NERC datasets

Here we will load two NERC datasets and quickly inspect their contents. In the assignment we will use this data to train and test models.

**Preparation** Please download the .zip file with the two datasets from [this link](http://kyoto.let.vu.nl/~llievski/text-mining/lab3_datasets.zip)

Then unpack the .zip, so that the folder `NERC_datasets` lies in the same directory as this notebook.

### 5.1 CoNLL-2003

Now that we've seen how to represent linguistic features, we also need to access relevant linguistic training data for the NERC task. One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which was provided with the zip file you downloaded from Canvas.
You can load it using the following code snippet.

In [None]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('NERC_datasets/CONLL2003', # the folder where CoNLL-2003 is stored (it comes with the zip file you downloaded)
                          'train.txt', # this will load the file 'train.txt'; for the exercise you also need to load 'test.txt'
                          ['words', 'pos', 'ignore', 'chunk']) # the fourth column (the NERC label) is read as 'chunk', so that iob_words() returns it
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break

We can, for example, iterate through this data and make a list of the tokens as inputs, and of the `ne_label` values as desired outputs. The input tokens could, for example, be looked up in our word embeddings dictionary.

In [None]:
input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='-DOCSTART-': # skip empty tokens and the document separator
        if token in model:
            vector=model[token]
        else:
            vector=[0]*300
        input_vectors.append(vector)
        labels.append(ne_label)

We have successfully loaded our data. Let's see how many tokens/labels we have:

In [None]:
print(len(labels))

In a next step, we could easily train a model on this data as shown in section 4.2 above.

### 5.2 Kaggle
Another interesting dataset is the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), which we also provided in the zip file you downloaded from Canvas. You can load it in the following way:

In [None]:
import pandas

In [None]:
path = 'NERC_datasets/kaggle/ner_v2.csv'

In [None]:
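# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' (pandas 1.3+) skips malformed lines and reports them
kaggle_dataset = pandas.read_csv(path, on_bad_lines='warn')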

You will see the following output after running the above code cell:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this.

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [None]:
kaggle_dataset.columns

[Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [None]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break

We could for instance use these features as inputs in a machine learning model with our DictVectorizer, or by transforming them using embeddings.

## End of this notebook

## 4. NERC datasets

Now that we've seen how to represent linguistic features, we also need to access real linguistic training data for the NERC task. In this section, we will look at large data sets that have been created by the community in which people have been annotating entities. In the assignment, you will use this data to train and test models that give a realistic performance.

Here, we will load two NERC datasets and quickly inspect their contents.

**Preparation** Please download the .zip file with the two datasets from [this link](http://kyoto.let.vu.nl/~vossen/rma_hlt/nerc_datasets.zip)

Then unpack the .zip, so that the folder `nerc_datasets` is created in the same directory as this notebook. If you want to store it elsewhere, you can do that but need to adapt the path in the calls below.

### 4.1 CoNLL-2003

 One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which was provided with the zip file you just downloaded. You can open the file "train.txt" in a text editor to inspect its content:

````
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
````

It follows the IOB format, with one token per line, followed by columns with the PoS, the constituent and the IOB entity tag. You can check the "test.txt" file to see that it has a similar format.

You can load it using the following code snippet, which makes use of the NLTK class ConllCorpusReader to do the magic. More information on the ConllCorpusReader can be found here: https://www.nltk.org/_modules/nltk/corpus/reader/conll.html

Its constructor takes three arguments:

* the path to the folder where CoNLL-2003 is stored (locally in my case)
* the name of the file that will be loaded from that folder
* labels for the columns that are expected in the input file

We store the result in a variable with the name 'train' which is of the type 'nltk.corpus.reader.conll.ConllCorpusReader'

In [1]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('nerc_datasets/CONLL2003',
                          'train.txt', # this will load the file 'train.txt'; for the exercise you also need to load 'test.txt'
                          ['words', 'pos', 'ignore', 'chunk'])




We can use 'dir' to see that it has many data elements, which correspond to the many different features that can be found in the CoNLL data.

In [None]:
dir(train)

We are for now only interested in the token, the pos and the ne_label. Let's check the first one in train:

In [None]:
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break

We can, for example, iterate through this data and make a list of the tokens as inputs, and of the `ne_label` values as desired outputs. The input tokens could, for example, be looked up in our word embeddings dictionary.

In [None]:
import gensim
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)  

In [None]:
input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():
    
    if token!='' and token!='-DOCSTART-': # skip empty tokens and the document separator
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        input_vectors.append(vector)
        labels.append(ne_label)

We have successfully loaded our data. Let's see how many tokens/labels we have:

In [None]:
print(len(labels))

In [None]:
print('First ten labels =', labels[:10])

Obviously, we should have the same size of input_vectors:

In [None]:
print(len(input_vectors))

In a next step, we could easily train a model on this data as shown above, by passing the input vectors and the labels to the fit function. You will see that it takes a lot longer to train the classifier with this data set, which has over 200K instances. On my machine it took about 5 minutes.

In [None]:
lin_clf.fit(input_vectors, labels)

If you want to apply this classifier to a data set for testing, you need to apply the same vectorization procedure as you have followed for the training data.

Before you apply a classifier to a data set, it is important to know the data set and especially the statistics about how the labels are distributed. In other words, how often does each human-annotated label occur among the tokens in the data set?

This tells you how frequent or rare certain data categories are and how challenging it is for a system to learn and predict each category.

Because we have created a list of labels from our data, we can use Python's *Counter* class to get the statistics:

In [None]:
from collections import Counter 
print(Counter(labels))

This clearly shows that most tokens get the label *O*, while the counts for the actual entity labels range between 1155 and 7140.
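Raw counts are easier to interpret as proportions. The sketch below, run on a small made-up label list rather than the CoNLL data, turns the counts into relative frequencies and computes the accuracy of a trivial baseline that always predicts the most frequent label. Both helper names are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each label, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


toy_labels = ["O"] * 8 + ["B-ORG", "I-ORG"]  # made-up data for illustration
print(label_distribution(toy_labels))
# {'O': 0.8, 'B-ORG': 0.1, 'I-ORG': 0.1}
print(majority_baseline_accuracy(toy_labels))
# 0.8
```

Running the same functions on the `labels` list built above gives a reference point that any trained classifier should clearly beat.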

### 4.2 Kaggle
[*Kaggle*](https://www.kaggle.com/docs) is an online platform for sharing data and competitions. It hosts thousands of datasets and frequently releases new data and challenges. We are going to have a quick look at the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus) that is hosted there and which was also provided in the zip file you downloaded, as so-called CSV files: ner.csv and ner_v2.csv.

CSV stands for comma-separated values and is a commonly used format to exchange, e.g., Excel or other spreadsheet data as text files. Each data instance is represented on a separate line, with its values separated by commas. Another format is tab-separated values or TSV, in which case tabs are used, as in the CoNLL formats. Very often people store TSV data in files with the extension ".csv", so it is always good practice to check the actual content to see what is used as a separator. The first line of a CSV or TSV file is usually the header that labels the different columns.
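A quick way to check which separator a file uses is to count the candidate characters in its header line. The function below is a deliberately simple sketch with a made-up name; it does not handle quoted fields that themselves contain separators.

```python
def guess_separator(header_line, candidates=(",", "\t", ";")):
    """Return the candidate separator that occurs most often in the header line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(repr(guess_separator("word,pos,tag")))    # ','
print(repr(guess_separator("word\tpos\ttag")))  # '\t'
```

For a real file you would pass its first line, for example the result of `readline()` on the opened file, and hand the detected character to `pandas.read_csv` through its `sep` parameter.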

The [*pandas*](https://pandas.pydata.org) package is a powerful package to handle data in various formats. You can check the website for details and documentation. Here we are going to use it to inspect the data.

You can load the data in the following way:

In [None]:
import pandas
path = 'nerc_datasets/kaggle/ner_v2.csv'
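# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' (pandas 1.3+) skips malformed lines and reports them
kaggle_dataset = pandas.read_csv(path, on_bad_lines='warn')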

You will see the following output after running the above code cell:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this.

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [None]:
kaggle_dataset.columns

You can see that a wide range of features is given for each token. [Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [None]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break

You can see that each token has many different features that people have considered useful for the task of NERC. In addition to the usual suspects that we saw before, each token also has features indicating the previous and next words and their PoS, but also the shape of the word (upper and lower case patterns), and even the previous IOB tags.

We could use all these features as inputs in a machine learning model with our DictVectorizer, or by transforming them using embeddings if the values are words.
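As a sketch of the first option, the function below selects a few columns from one row and returns them as a feature dictionary of the kind DictVectorizer expects. A plain dictionary stands in for the pandas row here. Apart from `tag`, the column names and all values are assumptions made for illustration, so check `kaggle_dataset.columns` for the real names.

```python
def row_to_features(row, feature_columns):
    """Keep only the selected columns and store their values as strings."""
    return {column: str(row[column]) for column in feature_columns if column in row}


# Toy stand-in for one row of the dataset (column names other than 'tag' are assumed)
toy_row = {"word": "London", "pos": "NNP", "shape": "capitalized", "prev-word": "in", "tag": "B-geo"}

features = row_to_features(toy_row, ["word", "pos", "shape", "prev-word"])
label = toy_row["tag"]
print(features)
print(label)
```

Keeping the label column out of the feature dictionary matters: a model that sees the gold label as an input feature would look perfect in training but would be useless on new text.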

## End of this notebook