# Improving language technology with fortuitous data


## Introduction

[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) is "a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. [..] challenges in NLP [include] **enabling computers to derive meaning from human or natural language input**".

"Modern NLP algorithms are based on **machine learning**" ([NLP, Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing))

## Machine learning

<img src="pics/prog-vs-ml.png" width=600>


## Learning in the shire

<img src="pics/shire.jpg">

### As predictable as...

The shire is a quiet and wonderful place, with jolly and content people inhabiting its rolling green hills and quaint villages.

It’s also a static and somewhat predictable world in which today looks much like yesterday, and tomorrow again will look a lot like today. In such a place a respectable choice of career might be to train as a blacksmith, spend a couple of years to learn the craft of forging iron and shoeing horses, before taking over your own shop and spending the rest of your active years applying what you have learned.

The only kind of horse that lives in the shire is the stout Hackney pony. At no point will you be asked to shoe a Belgian horse, or mend a broken bike wheel. 

Wouldn’t it be great if we actually all lived in the shire? 

## The shire, formally speaking

Traditionally, machine learning theory assumes that the world behaves predictably like the shire. 

### Input and output
 
The goal of supervised machine learning is to find a function $h$ that maps from some percept or input $x$ to a label $y$. What $x$ and $y$ are depends on the task. Many banks, for instance, use a learned function to decide whether to give credit to a customer or not. Here, $x$ is the credit application and $y$ is the outcome: approved or declined. In NLP, $x$ could be a tweet and $y$ its sentiment, or $x$ could be a sentence and $y$ its syntactic parse tree; and so forth. Let $x \in \mathcal{X}$ (input space) and $y \in \mathcal{Y}$ (label space).

NLP applications almost always have **discrete** output spaces. In these lectures $y$ will either be an integer (for classification) or a vector of integers (for structured prediction). 

### Target and hypothesis function

We’ll make the assumption that there exists an **unknown target function** which is solving the problem we’re interested in:

$$f: \mathcal{X} \rightarrow \mathcal{Y}$$

This, of course, is a bit of a fiction. It doesn’t really exist anywhere, but it’s a useful fiction because it allows us to describe the goal, which is to learn a **hypothesis function** $h$ that is as close as possible to the target function. Naturally, the hypothesis function performs the same mapping as the unknown target function:

$$h: \mathcal{X} \rightarrow \mathcal{Y}$$

### Dataset 

It gets worse before it gets better. Not only is our target function unknown, we also don’t know the true distribution of our inputs $P(x)$. We don’t know which tweets will be written or the kinds of backgrounds people who apply for credit will have.   

Supervised learning rests on the idea that we can get a limited number of examples (i.e. **a sample**) 

$$x_1, \ldots, x_n \sim P(x)$$

from the unknown input distribution $P(x)$, and that we (somehow) can evaluate the unknown target function $f$ on these examples. 

Putting this together yields the concept of a **training set**:

$$\mathcal{D}_t = \{(x_1, f(x_1)), \ldots, (x_n, f(x_n)) \}$$
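As a rough illustration of the definition above, here is a minimal sketch of how a training set could be assembled. Both `sample_input` and the rule inside `f` are made-up stand-ins: in a real task neither $P(x)$ nor the target function is available as code, which is exactly why we need annotators or historical records.

```python
import random

# Hypothetical stand-ins: in reality both P(x) and f are unknown.
VOCAB = ["good", "bad", "awesome", "boring", "song", "album"]
POSITIVE = {"good", "awesome"}
NEGATIVE = {"bad", "boring"}

def sample_input(rng, length=4):
    """Draw one input x ~ P(x); here simply a random sequence of tokens."""
    return [rng.choice(VOCAB) for _ in range(length)]

def f(x):
    """Toy target function: +1 unless negative words outnumber positive ones."""
    score = sum(tok in POSITIVE for tok in x) - sum(tok in NEGATIVE for tok in x)
    return 1 if score >= 0 else -1

def make_training_set(n, seed=0):
    """Build D_t = {(x_1, f(x_1)), ..., (x_n, f(x_n))}."""
    rng = random.Random(seed)
    xs = [sample_input(rng) for _ in range(n)]
    return [(x, f(x)) for x in xs]

D_t = make_training_set(5)
for x, y in D_t:
    print(y, x)
```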

How do we gain access to the unknown target function? The bank might look at past credit applications together with the decisions. In NLP we often ask *people* to annotate.

#### Unsupervised and semi-supervised learning

It’s easy to imagine a situation where we could arrange to get a large sample of data from $P(x)$ without labels being included in the deal. The setting in which there are no labels at all is called **unsupervised learning**. When unlabeled data is available in addition to a labeled dataset this is **semi-supervised learning**. 

### Feature representation

We’ll never have to read the same Twitter message twice, hopefully. By the time a failed credit application is resubmitted, the customer’s circumstances are likely different, and so the  application isn’t the same anymore. “You cannot submit a credit application twice,” as Heraclitus might have said. 

This poses a problem in that we wish to learn from the past, but whatever happened in the past will not happen *exactly* like that again. Instead something *similar* might happen. So we need a way to break up our observations (the $x$es) to make them comparable even if they don’t match exactly. 

Luckily, our observations are typically not unique snowflakes, but can be decomposed into **features** in some **feature space** $\mathcal{F}$. Even though the learner might not have seen the new example exactly, it might have seen similar examples (or parts of the current example), and thus still be able to make a prediction.

Specifically, each input example is transformed into a suitable **input representation** for the learning algorithm by a **feature function** $\phi(x)$. The feature function $\phi(\cdot)$ maps examples from the input space to the feature space:

$$\phi: \mathcal{X} \rightarrow \mathcal{F}$$

Typically, $\phi(x)$ is a real-valued vector of some fixed dimension $d$, i.e. 

$$\mathcal{F} = \mathbb{R}^d$$

Note that the $\phi$ feature function is deterministic and not a part of the learner. Traditionally, a large body of work in NLP focused on finding better ways to map from input to feature representations for specific tasks by hand. Feature representations will continue to be a theme in this course, but the flavour will be different. 
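One simple, hand-built feature function is a bag-of-words count vector. The sketch below assumes a tiny fixed vocabulary (`VOCABULARY` is invented for illustration) whose size plays the role of the dimension $d$; it is meant only to show the shape of $\phi$, not a recommended representation.

```python
from collections import Counter

# Hypothetical fixed vocabulary; its size is the feature dimension d.
VOCABULARY = ["awesome", "boring", "small", "great", "phone"]

def phi(x):
    """Map a raw text x to a count vector in R^d, one dimension per vocabulary word."""
    counts = Counter(x.lower().split())
    return [float(counts[word]) for word in VOCABULARY]

a = phi("Awesome phone , awesome screen")
b = phi("small but great phone")
print(a)
print(b)
```

The two texts are different, yet both vectors are non-zero in the `phone` dimension, which is what makes previously unseen inputs comparable to seen ones.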

### Latent space

In addition to the input space, feature space, and label space, it might be useful if we can refer to a *latent* space. A latent space is where the *internal* representations live. 

People often talk about representation learning, but the representation is not strictly the output of the learner. If we define the predictive function, like before, as $h: \mathcal{X} \rightarrow \mathcal{Y}$, then there’s no natural way of referring to the internal representations $h$ is using. 

One way to get around this is the latent space $\mathcal{Z}$. We use two extra functions:

- $j: \mathcal{X} \rightarrow \mathcal{Z}$ from input to latent,
- $k: \mathcal{Z} \rightarrow \mathcal{Y}$ from latent to label,

and define $h$ as the composition of $j$ and $k$: $h = k \circ j$, i.e. $h(x) = k(j(x))$. 

This gives us a way to “extract”, e.g., embeddings from the `word2vec` learning task. The feature function $\phi$ allows us to “import” them into another task. 
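To make the split into two functions concrete, here is a small sketch in which the predictor first applies `j` and then `k`. The embedding table and the thresholding rule are hypothetical; the point is only that the intermediate value `z` can be inspected or reused on its own.

```python
# Hypothetical 2-dimensional word embeddings, made up for illustration.
EMBEDDINGS = {
    "awesome": [0.9, 0.1],
    "great": [0.8, 0.2],
    "boring": [-0.7, 0.3],
    "dull": [-0.8, 0.1],
}

def j(x):
    """Input -> latent: average the embeddings of the known words in x."""
    vectors = [EMBEDDINGS[w] for w in x.split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def k(z):
    """Latent -> label: threshold the first latent dimension."""
    return 1 if z[0] >= 0 else -1

def h(x):
    """The predictor applies j first and k second."""
    return k(j(x))

z = j("an awesome but boring record")  # the internal representation
print(z, h("an awesome but boring record"))
```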


## Linear hypotheses

The shape of $h$ depends on our choice of **hypothesis class**, that is which kind of learner we will be using. A simple example is the linear hypothesis class for binary classification:

$$h(x; \theta, b) = \text{sign}( \theta^\top \phi(x) + b)$$ 

This example shows how the parameters $\theta$ and $b$ of the model are combined with the feature representation produced by $\phi(x)$. Other hypothesis classes (e.g. neural networks) compute more complicated expressions, meaning that they have richer internal structure, but typically use the input in a similar way.
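The linear hypothesis can be written down in a few lines. In this sketch the feature vector is passed in directly as `features` (standing for $\phi(x)$), and the parameter values are arbitrary numbers chosen for illustration; mapping a score of exactly zero to $+1$ is also just a convention assumed here.

```python
def sign(value):
    """Return +1 for non-negative scores and -1 otherwise."""
    return 1 if value >= 0 else -1

def h(features, theta, b):
    """Linear hypothesis: sign(theta . phi(x) + b), with phi(x) given as `features`."""
    score = sum(t * f for t, f in zip(theta, features)) + b
    return sign(score)

theta = [2.0, -1.5, 0.5]  # arbitrary illustrative parameters
b = -0.25

print(h([1.0, 0.0, 0.0], theta, b))  # score = 2.0 - 0.25 = 1.75
print(h([0.0, 1.0, 1.0], theta, b))  # score = -1.5 + 0.5 - 0.25 = -1.25
```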

## Evaluation

It’s a great summer; we’re young, and it feels like the nights, perhaps even life itself, extend indefinitely. Let’s use some of that time to come up with a parameter vector $\theta$ that classifies all the examples in our training set *perfectly*. 

Is that a good choice of parameter vector? Or would we become bitter as we grow old, looking back on a summer of wasted opportunity? 

Ultimately we don’t care about how well our hypothesis $h$ performs on the training data. It could have simply remembered all of the answers, rendering it clueless if presented with something genuinely new. Thus we are interested in a system that is able to **generalize**, i.e., that provides reasonable outputs even for examples it hasn’t seen before. 

A hypothesis is evaluated in terms of how well it does on **unseen data**. Specifically, the system receives a new input $x$ and makes a prediction $\hat{y}$ (the **predicted label**). It then incurs a **loss** $l(y,\hat{y})$ (the cost of the prediction), which is typically $0$ if the predicted label is correct ($y = \hat{y}$) and $>0$ otherwise.  

Our trivial system that just memorizes the training data thus **fails to generalize**. It simply does not know what to do with an example it has not seen before. 
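The following sketch illustrates the point with a 0-1 loss and a lookup-table "learner". The class name `Memorizer`, the default label, and the tiny datasets are all invented for this illustration.

```python
def zero_one_loss(y, y_hat):
    """Loss is 0 for a correct prediction and 1 otherwise."""
    return 0 if y == y_hat else 1

class Memorizer:
    """Stores the training pairs and falls back to a default label for anything else."""

    def __init__(self, training_set, default=-1):
        self.table = {x: y for x, y in training_set}
        self.default = default

    def predict(self, x):
        return self.table.get(x, self.default)

def average_loss(model, dataset):
    """Mean 0-1 loss of the model over a list of (x, y) pairs."""
    return sum(zero_one_loss(y, model.predict(x)) for x, y in dataset) / len(dataset)

train = [("great album", 1), ("boring album", -1), ("great song", 1)]
unseen = [("great record", 1), ("boring song", -1), ("great voice", 1), ("great band", 1)]

model = Memorizer(train)
print(average_loss(model, train))   # perfect on what it has memorized
print(average_loss(model, unseen))  # right only where the default happens to match
```

On the training set the loss is zero, while on the unseen examples the model is only right when the gold label coincides with its default.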


## Venturing outside the shire

<img src="pics/shire_baggins.jpg">

Dangerous yet not unpredictable. 

### This is not what we trained for

A number of things can go wrong outside the shire. All of a sudden the horses are not ponies anymore. 

In statistical terms supervised learning is expected to work because the evaluation set is drawn from the same distribution $P(x)$ as the training set. Therefore a good result on the training set should transfer to good results outside the training set (with caveats: it is still possible to **overfit** within the shire).

#### Input distributions differ

**Condition**: $P_t(X) \neq P_e(X)$ 

Language changes. A word like “awesome” has become much more frequent, perhaps losing some of its former oomph, but not fundamentally changing meaning.  

<img src="pics/awesome.png" width=600>

Say you learned a sentiment model on English music reviews from 1960 and wished to use it *now*. What would happen?

*(Speculation)* Probably your 1960 model would have learned to pay very close attention when that word “awesome” occurred, dramatically increasing the score for positive sentiment. Now “awesome” might occur in a review several times without the record actually being special. 
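One crude way to notice this kind of shift is to compare how often a word occurs in the training data and in the evaluation data. The two corpora below are invented toy examples, not real reviews, and relative frequency is only one of many possible indicators.

```python
def relative_frequency(word, corpus):
    """Fraction of all tokens in the corpus that are equal to `word`."""
    tokens = [tok for doc in corpus for tok in doc.lower().split()]
    return tokens.count(word) / len(tokens)

# Invented toy corpora standing in for training-time and evaluation-time reviews.
train_reviews = [
    "a truly awesome symphony",
    "pleasant but dull record",
    "fine melodies throughout",
]
eval_reviews = [
    "awesome beat awesome hook",
    "awesome cover meh album",
    "ok record",
]

p_t = relative_frequency("awesome", train_reviews)
p_e = relative_frequency("awesome", eval_reviews)
print(p_t, p_e)
```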

#### Output distributions differ

**Condition**: $P_t(Y) \neq P_e(Y)$

Corporate IT projects in banks run for a long time. Suppose you were a British bank and used your recorded credit application history from before Brexit for the loan classifier you are using today. The market is uncertain, and the bank would like to approve fewer applications to reduce its overall risk. 

Here the label distribution has changed:

$$P_t(Y=\text{Approved}) > P_e(Y=\text{Approved})$$ 

This could happen without the criteria for evaluating loan risk changing. 
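Given labels from both periods, the change can be read off the empirical label distributions. The label lists below are made up to mirror the inequality above; they are not real bank data.

```python
from collections import Counter

def label_distribution(labels):
    """Empirical distribution P(Y) estimated from a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Invented label samples for training time (t) and evaluation time (e).
labels_t = ["Approved"] * 7 + ["Declined"] * 3
labels_e = ["Approved"] * 4 + ["Declined"] * 6

P_t = label_distribution(labels_t)
P_e = label_distribution(labels_e)
print(P_t["Approved"], P_e["Approved"])
```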

#### Conditional distributions differ

**Condition**: $P_t(Y|X) \neq P_e(Y|X)$

The dust has not yet settled on Brexit. Two groups of people with particularly uncertain prospects are foreigners in Britain, and Britons in Europe. Say a British family moved to Berlin and wished to purchase a property in Prenzlauer Berg. Would the fact that they are British alter their chances of getting a loan, without necessarily affecting anyone else? 

Another classic example from sentiment analysis is the adjective “small”, which might be negative in a car review but positive when it describes a compact cell phone. In this case we have **negative transfer**. 
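The "small" example can be expressed as a difference in the conditional probability of a positive label given that the word occurs. The reviews below are invented, and estimating $P(Y|X)$ from a single word is a deliberate simplification.

```python
def p_positive_given_word(word, labeled_reviews):
    """Estimate P(Y = +1 | word occurs) from (text, label) pairs."""
    labels = [y for text, y in labeled_reviews if word in text.split()]
    return sum(1 for y in labels if y == 1) / len(labels)

# Invented toy reviews with labels in {-1, +1}.
car_reviews = [
    ("small trunk and small seats", -1),
    ("small engine", -1),
    ("small but fun", 1),
    ("roomy cabin", 1),
]
phone_reviews = [
    ("small and light", 1),
    ("small enough for my pocket", 1),
    ("small battery", -1),
    ("huge screen", 1),
]

p_car = p_positive_given_word("small", car_reviews)
p_phone = p_positive_given_word("small", phone_reviews)
print(p_car, p_phone)
```

A model that learned the weight of "small" on car reviews would push phone reviews in the wrong direction.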

### A somewhat more general setting

Above we discussed how three kinds of distributions could change between training and evaluation time. For the rest of the course we’ll adopt a more general perspective, where we have a single **target task** and one or more **source datasets**. 

A target task is a label space $\mathcal{Y}$ and a loss $l(y, \hat{y})$.

A source dataset minimally consists of a sample from a source input space $\mathcal{X}$. Often we’ll also have labeled data, induced classifiers, latent representations from the classifiers, and so on. 

There is no requirement that the input and output spaces for the various source datasets are the same, or indeed that they coincide with the input and output spaces of the target task. As an example, the output of a source learner could be integrated in the feature function of the target task.

## Types of transfer

Example of fish classification. [To be fleshed out]


## Intuition. Learning to drive a motorcycle

How can we hope to use data from other tasks where both the input and output spaces are different? Consider the example of getting a motorcycle driver license when you already know how to drive a car. The input space is what you observe on the road. The output space describes the actions that you can take, like changing gears, speeding up, braking, etc. 

There are transferable skills between the two modes of driving. As a piece of evidence, most bike schools quote you different prices depending on whether you already have a car driver license or not. One category of traffic skills is completely independent of the mode of transport: What does the traffic light mean? Can I expect the other drivers not to drive in front of me? In general, your internal model of how traffic works is transferable. 

Some skills are unique to driving a motorcycle. You don’t have to worry about the vehicle tipping when stopping in a car, for instance. 

The motorcycle example also demonstrates the danger of **negative transfer**. For a car it’s desirable to stop when the light changes to yellow. On a motorcycle suddenly applying the brakes can be fatal, because the car or truck behind you might decide to just continue.  

A machine learning system has three main components:

* **data**: the available data, typically a dataset of labeled data $L$
* **features**: the feature representation $\phi(X)$
* **model**: 
    * an algorithm for optimizing an objective function to determine the parameters $\theta$, e.g., SGD, where the scoring function $\zeta$ is used to map from feature representations to predicted labels $\hat{y}$, and
    * an objective function/loss function $l(y,\hat{y})$ that gives us an estimate of how good the current model, specified by $\theta$, is

The intuition behind the algorithm is to set the model weights (parameters) $\theta$ so that the loss of the model is minimized (more details follow later).
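As a rough sketch of how these components fit together, the code below fits a linear model with stochastic gradient descent. The choice of the logistic loss, the learning rate, the number of epochs, and the four toy examples are all assumptions made for illustration; the text above only names SGD as one possible optimizer.

```python
import math
import random

def sgd_logistic(data, dim, epochs=50, lr=0.5, seed=0):
    """Fit theta and b by SGD on the logistic loss log(1 + exp(-y * score)).

    data: list of (features, y) pairs with y in {-1, +1}.
    """
    rng = random.Random(seed)
    theta, b = [0.0] * dim, 0.0
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, y in examples:
            score = sum(t * f for t, f in zip(theta, features)) + b
            # Derivative of the logistic loss with respect to the score.
            grad = -y / (1.0 + math.exp(y * score))
            theta = [t - lr * grad * f for t, f in zip(theta, features)]
            b -= lr * grad
    return theta, b

def predict(features, theta, b):
    """Scoring function followed by a sign decision."""
    score = sum(t * f for t, f in zip(theta, features)) + b
    return 1 if score >= 0 else -1

# Toy, linearly separable data: (feature vector, gold label).
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], -1), ([0.1, 0.8], -1)]

theta, b = sgd_logistic(data, dim=2)
predictions = [predict(features, theta, b) for features, _ in data]
print(theta, b, predictions)
```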

To visualize the whole:

<img src="pics/learning.png" width=800>

### Definitions

* $\mathcal{X}$: input space, $\mathcal{Y}$: output space
* dataset $\mathcal{D}$ consisting of pairs $(x,y)$: input $x$ and desired (gold) output $y$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$
* a feature function $\phi(\cdot)$ mapping an input $x$ to an internal representation or feature space $\mathcal{Z}$
* a scoring function that calculates the predicted label for a new input
* a loss function that is minimized during learning, $l(y,\hat{y})$, and
* an optimization procedure that minimizes the loss


### Challenges

There are two main challenges machine learning/NLP systems are faced with:

* **wrong assumptions**: The underlying assumption of ML is that there should be a strong relationship between the data that our algorithm sees at training time and the data it sees at test time. This is almost never the case! 

* **limited samples**: There is never enough labeled data! In fact, $L$ is *tiny* compared to the potential data out there. Why? Training data can be expensive or hard to collect, so we cannot simply annotate more. Even worse, data changes continuously, so it is not obvious how to delimit *what* we want to annotate, and by the time we have decided, our data might already be outdated. 

<img src="pics/datapool.png" width=500>

We want to avoid going the manual data annotation route and instead use non-obvious sources of information that are just waiting to be harvested, in order to build better models across different impoverished data situations.






### The continuum of non-standard data and fortuitous data to the rescue

In NLP, data can differ in many ways (a continuum of data differences). It might be that mostly the lexicon changes, e.g., when you go from book to DVD reviews in sentiment analysis, but it might go as far as dealing with completely different languages. Say you have a parser trained on English and now want a parser for Icelandic, but you don't have any annotated data to start off with.


Non-standard data situations arise whenever we want to process data that differs from standard benchmark corpora:

* processing data from other domains (going from newspaper to Twitter)
* processing data in other languages (going to low-resource languages)

<img src="pics/datadiff.png" width=500>

We start here from prototypical, canonical **newspaper English**. In NLP, this has developed into the de facto 'standard' or 'canonical' form (most probably due to a historical accident). We take this historical fact as our reference point, and define our continuum of non-standard data with respect to deviations from newspaper English, e.g., deviations in features, in labels or both, going to completely different languages or label distributions.

Related work here falls under the general umbrella of **transfer learning**, with particular instances being **domain adaptation** (learning across different domains), **cross-lingual learning** (where domains can be seen as different languages), and **multi-task learning** (where one learns from related tasks). However, the terms *domain/language* and *task* are fuzzy and it is not important here to make a hard distinction. Rather, we can view all of those research areas as related to the problem of impoverished data. 



#### What to do about non-standard data

* annotate more data; problematic; plus annotate what?
* normalize (make our data more similar to the canonical form); again problematic; normalize how? what defines norm?

**Solution**: Harvest **data from non-obvious data sources**, i.e., **fortuitous data**, that can help our learner to better generalize to new unseen data. 


### Approaches

big goal:
$$f(\cdot) : \mathcal{I} \rightarrow \mathcal{O}$$

approach taken:

$$f(\cdot) = \zeta \circ \phi$$

* modify first part: $\phi: \mathcal{X} \rightarrow \mathcal{Z}$, i.e., either modify the feature representation or modify the instances themselves
    * modify features $\phi(x)$; we could use composition here: $\phi(x) \circ \phi'(x)$, to make it more explicit that $\phi'$ might come from elsewhere than $\mathcal{X}$
        * add embeddings
        * feature dropout
        
    * modify instances $X$:
        * add instances:
            * e.g., self-training, co-training 
            * projection in cross-lingual learning
        * "drop" instances (importance weighting/data selection)
* modify the second part: $\zeta: (\mathcal{Z}, \theta) \rightarrow \mathcal{Y}$, i.e., the algorithm itself (can refer both to training and decoding): **nb. not sure if this is best way to present it; it's strictly speaking not the scoring function itself, but the learning of weights, which is part of the scoring function** (but easier to point to it as it is the second part of f(.))
    * distant supervision (wiktionary constraints)
    * ILP for cross-lingual learning
    * modify the objective/add auxiliary loss as in multi-task learning (might also involve first part if additional data from distinct sources is included);
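As one concrete reading of the "add embeddings" option in the list above, the sketch below extends a count-based $\phi$ with a second feature function $\phi'$ built from embeddings that were learned elsewhere. Combining the two by concatenating their vectors is an assumption about what is intended, and the embedding values and vocabulary are invented.

```python
# Hypothetical embeddings learned on some other (source) dataset.
SOURCE_EMBEDDINGS = {"awesome": [0.9, 0.1], "boring": [-0.7, 0.3]}
VOCABULARY = ["awesome", "boring", "record"]

def phi(x):
    """Count-based features computed from the target input itself."""
    tokens = x.split()
    return [float(tokens.count(word)) for word in VOCABULARY]

def phi_prime(x):
    """Features imported from elsewhere: the average source embedding of known tokens."""
    vectors = [SOURCE_EMBEDDINGS[t] for t in x.split() if t in SOURCE_EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def phi_combined(x):
    """Concatenate both representations into one longer feature vector."""
    return phi(x) + phi_prime(x)

features = phi_combined("awesome record")
print(features)
```

The learner downstream is unchanged; it simply sees a feature vector with two extra dimensions that carry information from the source data.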
    

## Challenges 
% these are just notes
<s>* NLP: sparse data. Need for large amounts of labeled data (expensive)
   * (e.g. self-training approaches)
* Only little labeled data, but related data is available. how to incorporate?
   * data from related resources (e.g. Wiktionary; type-constraints)
   * data from related tasks (e.g. multitask-learning day 4)
* Domain adaptation
    * change in input representation (e.g. from news to twitter; example hyperlinks)
    * change in relation between input and output (e.g., cross-lingual learning -> extreme case of not having any data available, day 5)</s>

## The hunt for a learning signal 

### A learning analogy

(cooking example) 

### Learning signal

(comparing input output)


### Classification and gradient-based learning

(day 1 classification, i.e., In the first lecture we’ll assume that both our input and output have a fixed size and structure, i.e. the problem is classification;) e.g., [shelter outcome classification problem](https://www.kaggle.com/c/shelter-animal-outcomes)? <img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/5039/media/kaggle_pets2.png">

ubiquitous in NLP and accounts for most successes in machine learning, including deep learning

(give intuition of gradient-based learning, more details in day 2?)

ML as function learning (examples from your [SciProg class](https://github.com/andersjo/scientific-programming-2015/blob/master/lectures/lecture05/ML_Classification.ipynb) ?)

### Learning from non-obvious sources: Fortuitous data

<img src="pics/fortuitous-def.png" width=600>