# Labelling

A large family of NLP tasks falls under the category of *labelling tasks*:

  * Part-of-speech tagging
  * Named Entity Recognition
  * Sentiment analysis
  * ...
  
## PoS tagging

Assign a part-of-speech label to each token.

spaCy's labelling scheme is specified for each __[model of a language](https://spacy.io/models/en)__. Use `spacy.explain()` to get an explanation for a label, for example `spacy.explain('NN')`.

PoS tagging is often evaluated in terms of accuracy. For true labels $y_i$, corresponding predicted labels $\hat{y}_i$, and $\delta(x,y) = 1$ if $x = y$ and 0 otherwise:

$$ \text{accuracy} = \frac{\sum_{i} \delta(y_i,\hat{y}_i)}{\mid Y \mid} $$

In other words, accuracy is the number of correctly predicted PoS tags divided by the total number of tags to be predicted, $\mid Y \mid$.
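
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a spaCy API: the function name `accuracy` and the example tags are assumptions made up for this sketch, and the two sequences are assumed to be aligned token by token.

```python
def accuracy(gold, predicted):
    """Number of positions where predicted == gold, divided by the number of labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    # delta(y, y_hat) is 1 when the labels match and 0 otherwise
    correct = sum(1 for y, y_hat in zip(gold, predicted) if y == y_hat)
    return correct / len(gold)


# Hypothetical tags for "The cat sat on the mat" (illustration only,
# not actual spaCy output): one of the six predicted tags is wrong.
gold_tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]
predicted_tags = ["DT", "NN", "VBN", "IN", "DT", "NN"]

print(accuracy(gold_tags, predicted_tags))  # 5 correct tags out of 6
```

Note that this only works when both annotations use the same tokenization; if your manual tokens differ from the tagger's, they need to be aligned first.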

<div class="alert alert-block alert-success"> 1. Manually tag the following three sentences. Use spaCy's English labelling scheme.</div>

  1. It's no use going back to yesterday, because I was a different person then.
  2. The best way to explain it is to do it.
  3. Never let anyone drive you crazy; it is nearby anyway and the walk is good for you.
  

<div class="alert alert-block alert-success"> 2. Compare your manual tags with those from spaCy. What is its accuracy?</div>

<div class="alert alert-block alert-success"> 3. Write a function that takes a sentence and a gold annotation and returns the accuracy of spaCy on that sentence.</div>

***

## Named Entity Recognition

Assign a label (and possibly a subcategory) to each entity, and mark everything else as a non-entity. Entities can be individual tokens or larger spans.

Named Entity Recognition is often evaluated in terms of precision (fraction of true positives out of total entities recognized), recall (fraction of true positives out of total entities in data) and F1-score (harmonic mean of precision and recall).


$$\text{precision} = \frac{TP}{TP+FP}$$

$$\text{recall} = \frac{TP}{TP+FN}$$

$$F_1 = 2\times\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
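
As a rough sketch of how these three formulas combine, the snippet below scores a set of predicted entity spans against a gold set. Several things here are assumptions for illustration: the function name `precision_recall_f1`, the `(start, end, label)` span representation, the exact-match criterion (a prediction counts as a true positive only if boundaries and label both match), and the convention of returning 0 when a denominator is 0.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted spans that are in the gold annotation
    fp = len(predicted - gold)   # predicted spans that are not in the gold annotation
    fn = len(gold - predicted)   # gold spans that the model missed
    # Assumption: report 0.0 when a denominator would be zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start_token, end_token, label) spans, illustration only.
gold_spans = {(0, 1, "PERSON"), (5, 7, "GPE"), (10, 11, "DATE")}
predicted_spans = {(0, 1, "PERSON"), (5, 6, "GPE"), (10, 11, "DATE"), (12, 13, "ORG")}

p, r, f1 = precision_recall_f1(gold_spans, predicted_spans)
print(p, r, f1)  # TP = 2, FP = 2, FN = 1
```

Here the `GPE` span with the wrong boundary counts as both a false positive and a false negative. That is a consequence of the exact-match assumption; more lenient schemes give partial credit.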

<div class="alert alert-block alert-success"> 1. Manually tag the following fragment from Alice in Wonderland using spaCy's English labelling scheme.</div>

  1. “Who are you?” said the Caterpillar. This was not an encouraging opening for a conversation. Alice replied, rather shyly, “I—I hardly know, Sir, just at present—at least I know who I was when I got up this morning, but I think I must have been changed several times since then.” “What do you mean by that?” said the Caterpillar, sternly. “Explain yourself!” “I can’t explain myself, I’m afraid, Sir,” said Alice, “because I am not myself, you see.” 
  
<div class="alert alert-block alert-success"> 2. Compare your manual tags with those from spaCy. What is spaCy's F1-score?</div>

<div class="alert alert-block alert-success"> 3. Write a function that takes a sentence and a gold annotation and returns the F1-score of spaCy on that sentence.</div>

***

# Other evaluation measures

![](bclass.png)

There are many different ways to evaluate a model. *Precision*, *recall*, *F1* and *accuracy* are widely used, but the main question you should ask yourself is whether the measure you are using faithfully quantifies performance on the task you set out to test.

* Central tendencies: arithmetic mean, mode, median, harmonic mean
* Predictive acumen: accuracy, $R^2$, expected log predictive density, precision, recall
* Dispersion: variance, standard deviation


It is not only the measure itself that matters, but also the data you evaluate it on:

* Train/test splits
* Leave-one-out
* Leave-k-out
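
To make the last two schemes concrete, here is a small sketch that enumerates the splits. It assumes that "leave-k-out" means holding out every possible subset of $k$ items in turn (sometimes called leave-p-out), so leave-one-out is the special case $k = 1$; the function name `leave_k_out` and the toy data are made up for illustration.

```python
from itertools import combinations


def leave_k_out(items, k):
    """Yield (train, test) pairs, holding out every possible subset of k items."""
    indices = range(len(items))
    for held_out in combinations(indices, k):
        held = set(held_out)
        train = [items[i] for i in indices if i not in held]
        test = [items[i] for i in held_out]
        yield train, test


data = ["a", "b", "c", "d"]  # toy data set, illustration only

leave_one_out = list(leave_k_out(data, 1))  # one split per item
leave_two_out = list(leave_k_out(data, 2))  # one split per pair of items

print(len(leave_one_out), len(leave_two_out))
```

The number of splits grows combinatorially with $k$, which is one reason exhaustive schemes like this are mostly used on small data sets.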

And also the predictions you evaluate:

* Categorical predictions
* Probabilistic predictions


### ROC curves

**R**eceiver **o**perating **c**haracteristic (ROC) curves plot the true positive rate against the false positive rate. You can think of a ROC curve as a visualization of how well your model is doing at different decision thresholds (think: the $p$ at which you classify an entity as belonging to a class or not).
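
The idea of sweeping a decision threshold can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: labels are 1 (positive) or 0 (negative), an item is classified as positive when its score is at or above the threshold, and the function name `roc_points` and the toy data are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data, illustration only: true classes and the model's predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing and the curve runs from (0, 0) to (1, 1).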

![](ROC.jpeg)

<div class="alert alert-block alert-success"> <b>Discussion.</b> Which model is better? The red one (left plot) or the blue one (right plot)? Why?</div>


But comparing models point by point along the curve would be cumbersome. A common way to summarize the curve in a single number is the **a**rea **u**nder the (ROC) **c**urve (AUC). The AUC is classification-threshold-invariant: it aggregates the performance you can expect across all possible thresholds. It equals the probability that your model ranks a randomly chosen positive example higher than a randomly chosen negative one. In other words, given one random positive and one random negative, the AUC is the probability that the model ranks them in the right order.

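The AUC lies in $[0,1]$. If $\text{AUC} = 0$, your model ranks every negative above every positive, so it is always wrong (and inverting its predictions would make it perfect); if $\text{AUC} = 1$, it always ranks correctly. An AUC of $0.5$ corresponds to chance, so only values $> 0.5$ mean the model is doing better than chance!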
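
The two readings of the AUC, as an area and as a ranking probability, can be checked against each other on a small example. The sketch below is illustrative only: the function names and toy data are made up, labels are assumed to be 1 or 0, ties between a positive and a negative score are counted as half a win, and the area is approximated with the trapezoid rule.

```python
def auc_by_ranking(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))


def auc_by_trapezoid(labels, scores):
    """Area under the ROC curve, computed with the trapezoid rule."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(auc_by_ranking(labels, scores))    # 8 of the 9 positive/negative pairs are ordered correctly
print(auc_by_trapezoid(labels, scores))  # same value, up to floating-point error
```

Negating every score reverses the ranking and turns an AUC of $a$ into $1 - a$, which is why an AUC well below 0.5 still indicates a model whose scores carry information.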