# From Text to Features for Machine Learning

This notebook describes how all kinds of implicit and explicit features of texts can be represented as so-called *one-hot-vectors* and combined in a single representation of a text. By the end of this notebook, you will have learned:

- How text is annotated with implicit and explicit features
- How to create *one-hot-encodings* in vector representations for combining diverse textual features
- How to train a *Named Entity Recognition and Classification* (NERC) classifier with combinations of one-hot-encodings

**Background reading:**

NLTK Book
Chapter 7, section 5: https://www.nltk.org/book/ch07.html

## 1. Natural language text as features

### 1.1 Text as a bundle of features

A text consists of a sequence of words on which we impose syntax and semantics. A machine needs to learn to associate the structural properties of the text with some interpretation. We can use various properties to do this:

- the words (regardless of the order)
- the words and their frequency
- the part-of-speech of words
- word pairs
- the characters that make up the words
- sentences with words
- phrases
- the meaning of words
- the meaning of combinations of words
- etc....

Some of the above properties come for free when we split a text into tokens (the words), e.g. by using spaces. Still, we need to decide what to do with punctuation and how to treat upper and lower case (the word shape). Other properties are not explicit, such as the part-of-speech of words, phrases, syntax and meaning.
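The choices around punctuation and word shape can be made concrete with a small sketch. The helper names (`whitespace_tokenize`, `regex_tokenize`, `word_shape`) and the example sentence are illustrative assumptions, not part of any library, and the regular expression is a deliberate simplification of what a real tokenizer does.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to the words."""
    return text.split()

def regex_tokenize(text):
    """Separate runs of word characters from punctuation marks (a simplification)."""
    return re.findall(r"\w+|[^\w\s]", text)

def word_shape(token):
    """Map a token to a coarse shape: X = upper case, x = lower case, d = digit."""
    shape = []
    for char in token:
        if char.isupper():
            shape.append("X")
        elif char.islower():
            shape.append("x")
        elif char.isdigit():
            shape.append("d")
        else:
            shape.append(char)  # keep punctuation and other symbols as they are
    return "".join(shape)

sentence = "Germany's representative arrived in Brussels, Belgium."
space_tokens = whitespace_tokenize(sentence)
regex_tokens = regex_tokenize(sentence)

print(space_tokens)
print(regex_tokens)
print([word_shape(token) for token in regex_tokens])
```

Splitting on spaces leaves tokens such as `Brussels,` with the comma attached, whereas the regular expression separates the punctuation but also breaks `Germany's` into three pieces. Neither choice is neutral: it determines which words end up in the vocabulary.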

### 1.2 Annotation of text

In order to make the implicit properties of a text explicit to a machine, people need to annotate the text. In Natural Language Processing, annotated texts play a crucial role. Annotations are used as features for training a system, in addition to just the words, and they also provide the labels (the interpretations of the people) that the system needs to predict.


**The Annotation process**

The annotation process is delicate: all kinds of decisions are made along the way that impact the performance of the systems and the empirical value of the evaluation:

<ol>
<li> Collect texts: e.g. tweets, news, blogs, books
<li> Define an annotation scheme or code book which explains:
    <ol>
<li>  the tag set or set of labels (e.g. PoS labels, emotions, entity types)
<li>  the unit of the annotation: word, phrase, sentence, paragraph, document
<li>  the criteria to apply a tag to a piece of text
    </ol>
<li>  Train human annotators to use the annotation scheme (or create a crowd task)
<li>  Provide an annotation tool that loads texts and allows the annotator to assign tags
<li>  Store the annotations with the text in some structured format
<li>  Determine the Inter-Annotator-Agreement (IAA) by analysing texts annotated by at least two annotators
<li>  Fix disagreements (adjudication): if IAA is too low (e.g. a Kappa score below 0.60) the task is considered too difficult or too unclear
</ol>

The IAA is often considered the upper bound on the performance that an NLP system can be expected to reach. Ask yourself the question: can machines do better than humans?
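To give a feeling for how an agreement score is obtained, here is a minimal sketch of Cohen's kappa for two annotators. The function name `cohen_kappa` and the two label sequences are made up for illustration; in practice you would probably rely on an existing implementation rather than this hand-written version.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n_items = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n_items

    # Expected agreement: chance that both pick the same label independently.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[label] / n_items) * (counts_b[label] / n_items)
        for label in set(counts_a) | set(counts_b)
    )

    if expected == 1.0:
        return 1.0  # both annotators used one and the same label everywhere
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of the same seven tokens by two people.
annotator_1 = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
annotator_2 = ["B-LOC", "O", "O", "O", "O", "O", "B-ORG"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Although the annotators agree on five out of seven tokens, the kappa is only about 0.48, because much of that agreement on the frequent *O* label could have arisen by chance.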

### 1.3 A common representation format for text annotations

It is not trivial to represent annotated data in a way that we can easily train NLP systems as well as evaluate them. One of the most used formats is called [Inside–outside–beginning (IOB)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). In this annotation that is used to label phrases, we indicate for each word token in a text, whether it belongs to some type of phrase and, if so, if it is inside or at the beginning of this phrase. One of the most popular datasets with IOB annotation is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which is among others used for the task of Named-Entity Recognition and Classification (NERC).

NERC is considered a special case of phrase structure annotation. For all text tokens, the system needs to decide which words are part of a named-entity expression and which words are not. On top of that, systems need to decide what type of named-entity phrase we are dealing with. Common types are PERSON, LOCATION and ORGANISATION, but others are used as well.

Let's consider an example of a text in the IOB annotation:

* **... Germany's representative to the European Union ...**

In the IOB-format, the phrase has the following representation:

```
Germany NNP B-NP B-LOC
's  POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```


The first observation is that all information is represented at the **token-level**. For each token, e.g., *Germany*, we receive information about:

* **the word**: e.g., *Germany*
* **the part of speech**: e.g., *NNP* (from [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))
* **the phrase type**: e.g., a noun phrase
* **the NERC label**: e.g., a location (LOC), an organisation (ORG).

This example contains two named entities: *Germany* and *European Union*.

Every first token of a named entity is prefixed with *B-*. Every token after that, e.g., *Union* in *European Union*, is prefixed with *I-*. Tokens that are not part of any named entity receive the tag *O* (outside).

Please note that the IOB format is at the **token-level**, which means that we also are going to train and evaluate an NLP system at the token-level! The goal will hence not be to classify *European Union* as an *Organization*, but to classify:
* *European* as the first token of an entity that is an *Organization*
* *Union* as a token inside of an entity that is an *Organization*

Please make sure you understand the format before you proceed ;)
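As a check on your understanding, the following sketch reads the example lines above and groups the token-level tags back into entities. The helper names `read_iob` and `extract_entities` are illustrative, and the code assumes four whitespace-separated columns per line, as in the example.

```python
iob_lines = """Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG"""

def read_iob(lines):
    """Turn whitespace-separated IOB lines into (token, pos, chunk, ner) tuples."""
    return [tuple(line.split()) for line in lines.splitlines() if line.strip()]

def extract_entities(rows):
    """Group B-/I- tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, _pos, _chunk, ner in rows:
        if ner.startswith("B-"):
            # A new entity starts; close the one that was still open, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], ner[2:]
        elif ner.startswith("I-") and current_type == ner[2:]:
            current_tokens.append(token)
        else:
            # "O" tag (or an I- tag that does not continue the open entity).
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

rows = read_iob(iob_lines)
entities = extract_entities(rows)
print(entities)
```

Note that the classifier itself never sees these grouped entities: it predicts one tag per row, and the grouping is only a post-processing step.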

The annotation in IOB style can be quite complex, as the next example shows: many columns are given for each token, providing many different features.

![title](images/iob-example.pdf)

The first column in this example lists the identifier for each token. The second is the token itself, followed by the lemma, two columns with part-of-speech tags (POS), some other feature, the identifier of the token that is the head of the phrase to which a token belongs (token 2, "museum", is the head of token 0, "The", and token 1, "American"), the syntactic dependency relation and the dependency relation combined. The final column again indicates the named-entity expressions and their type.

The token features are usually used as additional features to help a classifier learn when an IOB tag applies. Instead of just the words, the classifier can learn that words with a certain part-of-speech or syntactic dependency are more or less likely to be associated with certain tags. Representing more features for each token can thus help improve the precision of the classifier, as it has more knowledge to combine.

More features can also increase recall and coverage. If the unseen data contains words that did not occur in the training data, the classifier can still use the other features of these unknown words to make a prediction.

The values for all these features are a mixture of words, symbols and numbers. So how can an SVM or scikit-learn deal with these diverse values in one single vector representation?

## 2. One-hot-encoding of features

The answer to this question is the notion of a **one-hot-encoding**! In one-hot-encoding, you use a column for **each possible value** of a feature. We already used this principle for our *bag-of-words* representation of a text. The vocabulary of the training data was used to create one (possibly huge) vector in which each position represents a single word. The words of a specific sentence, document or tweet can then be represented in this complete vector by putting a "1" in the slots of the words that it contains.

Now imagine we do the same for the part-of-speech of each token. In that case, we create a vector with all possible part-of-speech values that occur in the data. The next image shows some example representations for the 36 part-of-speech (PoS) tags taken from the Penn Treebank. The PoS vector thus has 36 positions, one for each tag:

![title](images/one-hot-pos.pdf)


Looking at the above IOB example, we can imagine that we do this for all the columns given as a feature, such as the constituent a token is part of. Following this principle, each token in the above IOB annotation is represented by a combination of vectors:

* the vocabulary vector: all zeros except for the token itself (one-hot)
* the PoS vector: 35 zeros and a single 1 in the slot for the PoS of the token
* the constituent vector: all zeros except for the slot of the constituent it is part of
* the constituent head: again the vocabulary vector with all zeros except for the slot that represents the lemma of the head
* etc.

You can imagine that the vector representation of each token becomes very large (tens of thousands of positions), whereas most positions are zero. Also note that some features are still rather complex and that there are various options to represent them as vectors. These details will be discussed in the machine learning course.

Again, all feature values from the training set are represented in these vectors and any other text that is classified needs to be represented with the same features.
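The combination of vectors can be sketched by hand before we turn to a library. The helpers `build_index`, `one_hot` and `encode` and the four toy training tokens are assumptions made for this illustration; the point is only that each feature gets its own block of positions, fixed by the training data, and that the blocks are concatenated.

```python
def build_index(values):
    """Assign a fixed position to every distinct value seen in training."""
    return {value: position for position, value in enumerate(sorted(set(values)))}

def one_hot(value, index):
    """Vector of zeros with a single 1 at the position of `value` (all zeros if unseen)."""
    vector = [0] * len(index)
    if value in index:
        vector[index[value]] = 1
    return vector

# Toy training tokens as (lemma, part-of-speech) pairs with Penn Treebank style tags.
training_tokens = [("Germany", "NNP"), ("representative", "NN"), ("the", "DT"), ("Union", "NNP")]

lemma_index = build_index(lemma for lemma, _ in training_tokens)
pos_index = build_index(pos for _, pos in training_tokens)

def encode(lemma, pos):
    """Concatenate the one-hot vectors of the individual features."""
    return one_hot(lemma, lemma_index) + one_hot(pos, pos_index)

seen = encode("Union", "NNP")       # both values occurred in training
unseen = encode("America", "NNP")   # the lemma did not occur in training

print(seen)
print(unseen)
```

The second token still gets a vector of the same length, but its lemma block stays empty because the index was built from the training tokens only.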

Finally, many of the features represented in the IOB example are generated not by humans but by other programs. For example, a tokenization program will split the text into sentences and tokens, a PoS tagger and lemmatiser will assign the part-of-speech and the lemma, and finally a syntactic parser will automatically add the constituent structures and dependencies. These auxiliary programs that generate explicit features for implicit properties are often machine learning modules themselves, trained on yet other data. Typically, people only contribute the labels that form the predictions that the classifier is supposed to learn.

Knowing this, what do you think will happen if the PoS tagger is trained on a different genre than the genre that is annotated for, say, NERC?


*Example*

In the next example, we will represent two token features, **part of speech** and **lemma**, as one-hot-encodings for the purpose of NERC.

Let's first generate those values for each of our tokens with spaCy. We use a very simple sentence to illustrate the one-hot-encoding of features.

In [15]:
import spacy

nlp = spacy.load('en_core_web_sm')

text="Germany's representative to the European Union"

doc=nlp(text)

Now we extract the part-of-speech and the lemma for each token to represent the token features as the training_instances:

In [3]:
training_instances=[]
for token in doc:
    one_training_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_}
    training_instances.append(one_training_instance)

In [4]:
print('Number of tokens = ', len(training_instances))
print('The tokens are represented as:', training_instances)

Number of tokens =  7
The tokens are represented as: [{'part-of-speech': 'PROPN', 'lemma': 'Germany'}, {'part-of-speech': 'PART', 'lemma': "'s"}, {'part-of-speech': 'NOUN', 'lemma': 'representative'}, {'part-of-speech': 'ADP', 'lemma': 'to'}, {'part-of-speech': 'DET', 'lemma': 'the'}, {'part-of-speech': 'PROPN', 'lemma': 'European'}, {'part-of-speech': 'PROPN', 'lemma': 'Union'}]


Our instance information is now a list with dictionaries, with each dictionary representing a training instance (token). For each instance, we store two values: POS tag and lemma. Next, we will convert these values to a numeric vector representation, as a one-hot-encoding.

### 2.1 Vectorizing our features

To accomplish this, we use the **DictVectorizer** class from sklearn ([link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)).

Please recall that in lab session 2 we used two other vectorizers, that create bag-of-words or tf-idf vectors from a vocabulary of words. Those two vectorizers operated on a vocabulary level: they assigned a non-zero value to the words which occur in an input, and zero to all others.

The DictVectorizer does something similar, but on a feature level: for each feature (let's say POS tag) DictVectorizer assigns a value of 1 to its value (for example, VERB) and zeros to all others (e.g., NOUN, PROPN, etc.).

In [16]:
from sklearn.feature_extraction import DictVectorizer

In [17]:
vec = DictVectorizer()

the_array = vec.fit_transform(training_instances).toarray() 
# The toarray() is only there for educational purposes. 
# Please do not use it in the assignment since you might get memory issues.

### 2.2 Analyzing the vectorized format

Let's now print the resulting vector representation. Each **row** represents the features for one token. Each **column** corresponds to one feature value (for example, VERB part-of-speech).

In [18]:
print(the_array)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]]


Generally speaking, each column represents a **specific value** of a lemma or POS tag. We can get more information on this from the vectorizer:

In [9]:
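# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vec.get_feature_names_out().tolist())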

["lemma='s", 'lemma=European', 'lemma=Germany', 'lemma=Union', 'lemma=representative', 'lemma=the', 'lemma=to', 'part-of-speech=ADP', 'part-of-speech=DET', 'part-of-speech=NOUN', 'part-of-speech=PART', 'part-of-speech=PROPN']


We can see that the first part of the vector represents the vocabulary (the lemmas) and the last five slots represent the part-of-speech. Following the order of the *feature_names*, we can conclude that the second column, for example, stands for the lemma 'European'. Most words do not have this lemma, but the second-to-last word does. For that reason, the second column in the second-to-last row has the value 1. All other rows have zeros in that column, because their lemma is different.

Similarly, the last column represents the tokens with a PROPN (proper noun) part-of-speech. We can see that three words have this part-of-speech tag, namely the words represented in the rows: 1, 6 and 7.


As a final analysis step, let's inspect the first row, i.e. the one hot encoding representation of the following training instance,
```
{'part-of-speech': 'PROPN', 'lemma': 'Germany'}
```
The feature vector using one hot encoding is:
```
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```
* **first value: 0**: the feature *lemma* with the value *'s* does not occur in the training instance
* **second value: 0**: the feature *lemma* with the value *European* does not occur in the training instance
* **third value: 1**: the feature *lemma* with the value *Germany* does occur in the training instance
* ...
* **last value: 1**: the feature *part-of-speech* with the value *PROPN* does occur in the training instance
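The same reading can be automated by pairing each column name with the value in the row. This is a small sketch that copies the feature names and the first row printed above into plain Python lists; `active_features` is a hypothetical helper, not a scikit-learn function.

```python
# Column names and first row, copied from the output shown above.
feature_names = [
    "lemma='s", "lemma=European", "lemma=Germany", "lemma=Union",
    "lemma=representative", "lemma=the", "lemma=to",
    "part-of-speech=ADP", "part-of-speech=DET", "part-of-speech=NOUN",
    "part-of-speech=PART", "part-of-speech=PROPN",
]
first_row = [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.]

def active_features(row, names):
    """Return the names of the columns that hold a non-zero value."""
    return [name for name, value in zip(names, row) if value != 0]

decoded = active_features(first_row, feature_names)
print(decoded)
```

Only two of the twelve columns are active, which matches the two features we stored for the token *Germany*.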


Please note that the number of columns grows immensely when using one-hot encoding (you can easily play with this by changing the input sentence above). Luckily, the encoding is generated automatically by the vectorizer.

## 3. Training an SVM model

Hopefully, you can see that the vectors we end up with here resemble the ones we generated with bag-of-words and tf-idf in the previous notebook Lab4.1. Not surprisingly, we can now use them to train and test a machine learning model, such as SVM. Given that our model is trained on only 7 input vectors, it will not be a meaningful one yet; we will build a model with sufficient data in the assignment.

To train, we also need the 'gold' labels for each of the tokens. Let's define them manually here, following the IOB example above:

In [19]:
y=['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

Let's now train the model:

In [12]:
from sklearn import svm

lin_clf = svm.LinearSVC()
lin_clf.fit(the_array, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

### 3.1 Testing our model on new examples

We can now reuse the vectorizer that was fitted on the training data to vectorize any new example we want to classify, and predict its label with the trained SVM model:

In [13]:
# A single unseen test token; the lemma 'America' did not occur in the training data
new_test_instance={'lemma': 'America', 'part-of-speech': 'PROPN'}
vectorized=vec.transform(new_test_instance)
print(vectorized.toarray())

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]


In [14]:
pred=lin_clf.predict(vectorized)
pred

array(['I-ORG'], dtype='<U5')

Well done! You have now managed to train an SVM model based on two features and to apply it to a new example. Feel free to play around with the training data, the features, and the test data. We will work with more data and more features in the assignment.

Let us finish this section with several key observations:
* Our vectorized representations will easily become too large. For example, the lemma feature alone could easily have thousands of values. On the other hand, they are **sparse** representations, containing mostly zeros and occasional 1 values. Is there a better way to encode our data in vectors? The answer is: yes. We will work with **dense** representations in the next notebook!
* In the test example above, the lemma of America was not found in the training data, so no existing lemma received a value of 1 in the final vector representation. This is because the set of feature values is 'frozen' after training: any new feature value encountered at test time is considered to be *unknown* (typically called **UNK** in NLP terminology).
* Finally, a note on the algorithm. SVM can yield powerful models if we use good features and train it well. However, it does not have an intrinsic capability to capture **sequences of words**. For this purpose, people often use a recurrent neural network. You will learn more about this in the machine learning course.
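To put the first two observations into numbers, here is a rough sketch with made-up sizes. The column counts and the column positions are hypothetical, chosen only to illustrate the scale. It contrasts storing every slot of a vector with storing only the positions of the non-zero slots, which is the idea behind the sparse matrix that DictVectorizer returns when `.toarray()` is not called.

```python
# Hypothetical sizes, chosen only to illustrate the scale.
n_lemma_columns = 20000
n_pos_columns = 17
n_columns = n_lemma_columns + n_pos_columns

def dense_vector(active_columns, size):
    """Full one-hot style vector: one slot per column, mostly zeros."""
    vector = [0] * size
    for column in active_columns:
        vector[column] = 1
    return vector

# A known token activates one lemma column and one part-of-speech column.
known_token = [4211, n_lemma_columns + 11]
# An unseen lemma has no column of its own, so only the part-of-speech column is set.
unknown_token = [n_lemma_columns + 11]

dense = dense_vector(known_token, n_columns)
unknown_dense = dense_vector(unknown_token, n_columns)

print("slots stored in the dense vector:", len(dense))
print("positions stored in the sparse form:", len(known_token))
print("non-zero values for the unknown lemma:", sum(unknown_dense))
```

However large the vocabulary becomes, a token with two features never needs more than two stored positions in the sparse form.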