# Multi-task learning (MTL)

After this lecture you should:
* know how you can use data from other tasks (or views) 
* understand the basic principles behind multi-task learning (MTL)
* [be able to implement a simple MTL example in Keras] // most probably no time


## Challenge: 

How can we hope to use **data from other tasks** where **both the input and output spaces are different**? 

### Example 1: Card game

You are in beautiful Italy and want to get acquainted with local card games. You hear about 'scala 40', and are eager to learn it. 

The input space is the cards, and the output space is the set of configurations (hands) you can form with them. The type of input is familiar, but the rules have changed. Luckily, you already know how to play poker. Rather than starting from scratch (*tabula rasa*), you use your knowledge of poker (or of card games in general) to learn 'scala 40', so you grasp more quickly which configurations are valid (the possible outputs) in the new game.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Royal_straight_flush.jpg/1920px-Royal_straight_flush.jpg" width=300> 




### Example 2: riding a motorcycle

We have seen this example already during day 1. You want to learn how to ride a motorcycle: the input space is the street, and the output space is the set of possible actions you can take (accelerate, brake, change gears, etc.).

Some skills are unique to riding a motorcycle (using your hand for the clutch, taking care not to tip over when stopping, etc.). However, you can use your knowledge of how to drive a car to learn how to ride a motorcycle.

### Multi-task learning (MTL)


##### Single neural networks
The figure shows three **separate** feedforward neural networks for three different tasks.

<img src="pics/stl.png">

The idea of **multi-task learning** (Caruana, 1997; Collobert et al., 2011) is to exploit the training signal of **other tasks**.

But before we go into MTL, let's recap learning for a single task.

## Recap: Backpropagation

### Learning a single task

Link to [Notebook Recap learning](Learning-recap.ipynb)


## Multi-task learning

We have seen the computational graph abstraction and backpropagation for a single task. 

A common approach to multi-task learning is to **add additional output layers** to a neural network that otherwise shares the same underlying structure.

MTL exploits other tasks by **learning multiple tasks** in parallel while **using a shared structure/representation**.

<img src="pics/mtl.png" width=300>

By sharing representations, a model is trained **jointly** for both/all tasks.
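To make the shared structure concrete, here is a minimal NumPy sketch (not the Keras implementation mentioned in the learning goals). It assumes one shared hidden layer and two task-specific output layers; all layer sizes and the names `W_shared`, `W_a`, and `W_b` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_in, n_hidden = 4, 8
n_classes_a, n_classes_b = 3, 2   # label set sizes of task A and task B

# Shared parameters: every task uses (and will update) these.
W_shared = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_shared = np.zeros(n_hidden)

# Task-specific output layers, one per task.
W_a = rng.normal(scale=0.1, size=(n_hidden, n_classes_a))
W_b = rng.normal(scale=0.1, size=(n_hidden, n_classes_b))


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x):
    """Compute the shared representation once, then one prediction per task."""
    h = np.tanh(x @ W_shared + b_shared)      # shared representation
    return h, softmax(h @ W_a), softmax(h @ W_b)


x = rng.normal(size=(5, n_in))                # a toy batch of 5 inputs
h, p_a, p_b = forward(x)
print(h.shape, p_a.shape, p_b.shape)          # prints (5, 8) (5, 3) (5, 2)
```

Both output layers read the same `h`, so anything the shared layer learns is available to both tasks.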

### Why is this useful?

We can use MTL when we believe that information useful for one type of prediction (task) can also be useful for another.

Instead of creating a separate network for each task, we can build a single network with shared layers. With shared representations, most **parameters are shared** between the tasks, and information learned to be useful for one task might also be useful for the other tasks.


The computational graph makes it very easy to compute separate losses for each task. These losses are then summed and backpropagated through the network. 

<img src="pics/mtl-loss.png" width=300>
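The summing of losses can be sketched by hand for a tiny network. This is an illustrative NumPy example under assumptions (one shared `tanh` layer, two softmax output layers, random toy data, hypothetical names such as `head_loss_and_grads`), not a definitive implementation. The point to notice is that the gradients coming back from the two output layers are added at the shared representation before they flow into the shared weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy jointly labeled batch: every input has a label for both tasks.
x = rng.normal(size=(6, 4))
y_a = rng.integers(0, 3, size=6)   # task A: 3 classes
y_b = rng.integers(0, 2, size=6)   # task B: 2 classes

W_shared = rng.normal(scale=0.1, size=(4, 8))
W_a = rng.normal(scale=0.1, size=(8, 3))
W_b = rng.normal(scale=0.1, size=(8, 2))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def head_loss_and_grads(h, W, y):
    """Mean cross-entropy of one output layer, with gradients w.r.t. W and h."""
    n = len(y)
    p = softmax(h @ W)
    loss = -np.log(p[np.arange(n), y]).mean()
    dz = p.copy()
    dz[np.arange(n), y] -= 1.0     # d(loss)/d(logits) = (p - onehot) / n
    dz /= n
    return loss, h.T @ dz, dz @ W.T


h = np.tanh(x @ W_shared)          # shared representation
loss_a, dW_a, dh_a = head_loss_and_grads(h, W_a, y_a)
loss_b, dW_b, dh_b = head_loss_and_grads(h, W_b, y_b)

total_loss = loss_a + loss_b       # the separate task losses are summed

# Backpropagating the summed loss: the per-task gradients add up at h.
dh = dh_a + dh_b
dW_shared = x.T @ (dh * (1.0 - h ** 2))   # derivative of tanh is 1 - tanh^2

print(total_loss, dW_shared.shape)
```

Each output layer only receives the gradient of its own loss (`dW_a`, `dW_b`), whereas `W_shared` receives a contribution from both.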

Note that there can be different specific setups:

#### Same corpus labeled for different tasks (jointly labeled data)

In this case you have the same input but several different tasks annotated on top. For example, you might have a corpus of sentences annotated for both part-of-speech tags and named entities. 

An MTL setup here could be to train a single network that outputs POS and NER labels through two different output layers.

#### Several corpora (distinctly labeled data)

However, an advantage of MTL in neural networks is that you do not need jointly labeled data: your datasets might also come from different sources.

In this case the training procedure will take the input from the different corpora and perform gradient computations with respect to the particular loss function (and task) of the current training example. 
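The procedure just described can be sketched as a training loop that alternates between corpora. This is a rough NumPy illustration under assumptions: the two corpora, the task names `task_a` and `task_b`, the strict alternation schedule, the batch size, and the learning rate are all hypothetical, and the labels are random toy data, so the loss values themselves are not meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical corpora, each labeled for ONE task only.
corpora = {
    "task_a": (rng.normal(size=(40, 4)), rng.integers(0, 3, size=40)),
    "task_b": (rng.normal(size=(30, 4)), rng.integers(0, 2, size=30)),
}

params = {
    "shared": rng.normal(scale=0.1, size=(4, 8)),   # used by both tasks
    "task_a": rng.normal(scale=0.1, size=(8, 3)),   # output layer for task A
    "task_b": rng.normal(scale=0.1, size=(8, 2)),   # output layer for task B
}


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sgd_step(task, x, y, lr=0.1):
    """One SGD update that uses only the loss of `task`."""
    W_shared, W_task = params["shared"], params[task]
    n = len(y)
    h = np.tanh(x @ W_shared)
    p = softmax(h @ W_task)
    loss = -np.log(p[np.arange(n), y]).mean()
    dz = p.copy()
    dz[np.arange(n), y] -= 1.0
    dz /= n
    dh = dz @ W_task.T                       # uses the pre-update head weights
    # Update the shared layer and this task's output layer; the other
    # task's output layer is not touched.
    params[task] = W_task - lr * (h.T @ dz)
    params["shared"] = W_shared - lr * (x.T @ (dh * (1.0 - h ** 2)))
    return loss


for step in range(200):
    task = "task_a" if step % 2 == 0 else "task_b"   # alternate between corpora
    xs, ys = corpora[task]
    idx = rng.integers(0, len(ys), size=8)           # sample a mini-batch
    loss = sgd_step(task, xs[idx], ys[idx])
```

Strict alternation is only one possible schedule; sampling the task at random for each batch would fit the same loop.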

### Does MTL work? 

[let's look at an example (freqbin)] to continue

### Example with distinct sources

### Why does it work? possible explanations


### What are related tasks? 