# Annotation & data

NLP practitioners spend most of their time on data manipulation, data quality, and data annotation. After all, that is where most gains (and errors) stem from, compared with the marginal benefit of one algorithm over another. It doesn't matter how good your algorithm is if you are evaluating it on the wrong task or on the wrong data.

There is a dire need for experts who understand language, how to elicit it from annotators, and how to evaluate what models (don't) learn from it.

***


<div class="alert alert-block alert-info"> <b>Discussion.</b> Why is quality human annotation hard? Where can things go wrong?</div>


<div class="alert alert-block alert-success"> <b>Terminology.</b> Explain the following terms, in your own words and with examples:

1. Sample
2. Representative sample
3. Biased sample
4. Structured/Tiered sample
5. Distribution
</div>


## Active learning

Active learning refers to techniques that iteratively select the most useful new data for humans to label or correct, so that a machine learning model improves with each round of annotation.


![](human-in-the-loop.png)
<center>Figure 1.1 from Monarch's Human-in-the-Loop Machine Learning</center>


### Uncertainty sampling

Sample from cases in which the model is uncertain, for instance when its posterior probability is 45% for label A and 55% for label B.
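
A minimal sketch of how such cases might be picked, assuming the model returns a probability distribution over labels for each unlabeled item. The names `least_confidence`, `most_uncertain`, and the `predictions` dictionary are hypothetical, and least confidence is only one of several common ways to score uncertainty.

```python
def least_confidence(probs):
    """Uncertainty score in [0, 1]: 0 when one label has probability 1,
    1 when all labels are equally likely."""
    n = len(probs)
    return (1.0 - max(probs)) * n / (n - 1)


def most_uncertain(predictions, k):
    """Return the ids of the k items with the highest uncertainty score."""
    ranked = sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical posterior probabilities for labels A and B on four unlabeled items.
predictions = {
    "item_1": [0.45, 0.55],
    "item_2": [0.95, 0.05],
    "item_3": [0.70, 0.30],
    "item_4": [0.10, 0.90],
}

to_annotate = most_uncertain(predictions, k=2)
print(to_annotate)  # ['item_1', 'item_3']
```

The 45%/55% item comes first because its most likely label is barely ahead of the alternative.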

<div class="alert alert-block alert-info"> <b>Discussion.</b> Give an example of a linguistic phenomenon where uncertainty sampling is important to check model performance.</div>


### Diversity sampling

Sample from underrepresented cases. For instance, label A only applies to 1% of the data, whereas B and C take the lion's share with 49% and 50%, respectively. 
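
One simple way to approximate this is to draw a fixed quota per label, sketched below. This is sampling stratified by (predicted) label, not the only form of diversity sampling. The function `sample_per_label` and the item pool are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_label(items, per_label, seed=0):
    """Draw up to `per_label` items for each label, so that rare labels
    are not swamped by frequent ones."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in items:
        by_label[label].append(item_id)

    selected = {}
    for label, ids in by_label.items():
        selected[label] = rng.sample(ids, min(per_label, len(ids)))
    return selected


# Hypothetical pool of 1,000 items: 1% labeled A, 49% labeled B, 50% labeled C.
pool = [(f"item_{i}", "A") for i in range(10)]
pool += [(f"item_{i}", "B") for i in range(10, 500)]
pool += [(f"item_{i}", "C") for i in range(500, 1000)]

batch = sample_per_label(pool, per_label=5)
print({label: len(ids) for label, ids in batch.items()})  # {'A': 5, 'B': 5, 'C': 5}
```

For comparison, a uniformly random batch of 15 items from this pool would contain only 0.15 items of label A on average.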


<div class="alert alert-block alert-info"> <b>Discussion.</b> Give an example of a linguistic phenomenon where diversity sampling is important to check model performance.</div>


### Random sampling

Sample uniformly at random from the unlabeled data, so that every item is equally likely to be selected.

<div class="alert alert-block alert-info"> <b>Discussion.</b> Give an example of a linguistic phenomenon where random sampling is important to check model performance.</div>

<div class="alert alert-block alert-info"> <b>Discussion.</b> How do you decide how (in)frequently to train and evaluate your model and go through the active learning cycle?</div>

# Before learning: Train/Test split

* Before training a model, set aside the data you will evaluate it against, and keep it separate from the training data
* Keep in mind that such held-out data is not truly random
* Keep in mind that not training on all the available data is suboptimal and, in certain contexts, avoidable
* If the data is not truly random and is likely biased, consider making the split representative or tiered
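
A tiered split could be sketched as follows, using only the standard library. The function `stratified_split`, the 20% test fraction, and the toy records are illustrative assumptions, not a prescribed procedure.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records into train and test so that each group defined by `key`
    contributes roughly `test_fraction` of its members to the test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    train, test = [], []
    for members in groups.values():
        members = members[:]  # copy so the input list is left untouched
        rng.shuffle(members)
        # keep at least one test item per group, even for very small groups
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical imbalanced data: 24 speakers in group "F" and 5 in group "M".
speakers = [("F", i) for i in range(24)] + [("M", i) for i in range(5)]
train, test = stratified_split(speakers, key=lambda r: r[0], test_fraction=0.2)
print(len(train), len(test))  # 23 6
```

With a purely random split of the same size, the test set could easily contain no member of the smaller group at all. Splitting within each group rules that out.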


### Practical example: Predicting average pitch (in Hz)

```r
%%R
set.seed(123)  # random seed, for reproducibility

n_fem <- 24  # number of female students in class
n_mal <- 5   # number of male students in class

# sample n_fem pitches from a Normal distribution with a mean of 210 and an sd of 20
pitch_fem <- rnorm(mean = 210,
                   sd   = 20,
                   n    = n_fem)

# sample n_mal pitches from a Normal distribution with a mean of 110 and an sd of 20
pitch_mal <- rnorm(mean = 110,
                   sd   = 20,
                   n    = n_mal)

# data wrangling to get everything into a data frame
pitch <- c(pitch_fem, pitch_mal)
gen   <- c(rep('F', n_fem),
           rep('M', n_mal))
df    <- data.frame(pitch = pitch, gender = gen)

# intercept-only linear regression (no predictors):
# the fitted intercept is the sample mean of pitch
m_avg_pitch <- lm(data    = df,
                  formula = pitch ~ 1)

print(m_avg_pitch)
```

```
Call:
lm(formula = pitch ~ 1, data = df)

Coefficients:
(Intercept)  
      190.9  
```
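
The estimate lands much closer to 210 Hz than to 110 Hz because the class is imbalanced. A short sketch of the arithmetic follows, using only the group sizes and population means from the simulation above. The variable names are illustrative.

```python
n_fem, n_mal = 24, 5            # group sizes from the example above
mean_fem, mean_mal = 210, 110   # population means used to simulate the pitches

# Expected mean of the pooled sample: each group is weighted by its size.
pooled = (n_fem * mean_fem + n_mal * mean_mal) / (n_fem + n_mal)

# Mean if both groups counted equally (a balanced sample).
balanced = (mean_fem + mean_mal) / 2

print(round(pooled, 1), balanced)  # 192.8 160.0
```

The fitted intercept of 190.9 differs from the expected 192.8 only through sampling noise. Both are far from the balanced value of 160, so a model trained on this class would predict poorly for a population with a different gender ratio.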



# Disaster labeling

We will go through a practical demonstration of annotation, active learning, and machine learning.

You can execute the code either on your local machine or through Colab.

To run it on your local machine, first make sure you have `PyTorch` installed, then:

  * `git clone https://github.com/rmunro/pytorch_active_learning`
  * `cd pytorch_active_learning`
  * `python active_learning_basics.py`


To run it on Google's machines: 

  * Open a new Colab notebook: [https://colab.research.google.com/#create=true](https://colab.research.google.com/#create=true)
  * Create a cell with: `!git clone https://github.com/rmunro/pytorch_active_learning`
  * Create a cell with: `%cd pytorch_active_learning`
  * Create a cell with: `!python active_learning_basics.py`
  
You will go through the following cycle:

![](human-in-the-loop2.png)
<center>Figure 2.2 from Monarch's Human-in-the-Loop Machine Learning</center>


<div class="alert alert-block alert-info"> <b>Activity.</b> Go through 3-5 cycles. Each time you finish annotating a cycle, press "s" to save your data and see how well your model is now doing.

We will discuss your impressions (what data did you get? how did your model perform?) after each cycle. While you wait for your colleagues, go through the script you are running to understand the cycle under the hood (or rest; annotation is hard work).</div>

<div class="alert alert-block alert-success"> <b>Discussion.</b> What is the random/uncertain/diversity split? Does it make sense? Can you think of a case where a different split would be called for?</div>

# Rating the raters

### Inter-annotator agreement

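Cohen's $\kappa$ measures the agreement between two annotators while correcting for the agreement expected by chance: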

$$\kappa = \frac{p_0 - p_e}{1-p_e},$$

where $p_0$ is the relative observed agreement between a pair of annotators and $p_e$ is the (estimated) probability of chance agreement.

$$p_e = \frac{1}{N^2} \sum_k n_{k1}n_{k2},$$
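for $k$ categories, where $N$ is the number of observations to categorize and $n_{ki}$ is the number of times rater $i$ assigned category $k$.

For binary classification, treating one annotator's labels as the reference and counting the other's as true/false positives and negatives, this reduces to: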
$$\kappa = \frac{2 (TP \times TN - FN \times FP)}{(TP+FP) \times (FP +TN) + (TP+FN) \times (FN + TN)}$$

If the annotators are in full agreement, $\kappa = 1$. If there is no agreement other than what would be expected by chance, $\kappa = 0$; if they agree less often than chance would predict, $\kappa < 0$.
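
The definitions above translate fairly directly into code. The sketch below is a plain-Python illustration with a hypothetical function name and made-up labels. It also checks that the binary formula gives the same value.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two equal-length lists of category labels.
    Undefined (division by zero) if both raters only ever use the same
    single category, because then p_e = 1."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_1)

    # p_0: proportion of items on which the two raters agree
    p_0 = sum(a == b for a, b in zip(rater_1, rater_2)) / n

    # p_e: chance agreement, (1 / N^2) * sum over k of n_k1 * n_k2
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    p_e = sum(counts_1[k] * counts_2[k] for k in counts_1) / n**2

    return (p_0 - p_e) / (1 - p_e)


# Made-up annotations of ten items by two raters.
r1 = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "yes", "no"]
r2 = ["yes", "yes", "yes", "no", "no", "no", "no", "yes", "yes", "no"]
kappa = cohens_kappa(r1, r2)

# Same data as a confusion matrix, with r1 as the reference.
tp, tn, fp, fn = 4, 4, 1, 1
binary = 2 * (tp * tn - fn * fp) / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn))

print(round(kappa, 2), binary)  # 0.6 0.6
```

Here the raters agree on 8 of 10 items ($p_0 = 0.8$), but since each uses both labels half of the time, $p_e = 0.5$ and the chance-corrected agreement is only 0.6.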



Other metrics:
  1. Fleiss' $\kappa$
  2. Bangdiwala's B
  3. Correlation coefficients
  4. ...

# Further topics

* Building an interface
* Deep-dive into sampling techniques
* Backups and databases
* Crowdsourcing platforms, e.g., Prolific and Amazon Mechanical Turk (MTurk)