# Chapter 8: Multitask Learning

## 8.1: Introduction
* Multitask learning = Learning several things at the same time
    * Learn part-of-speech tagging and sentiment analysis together
* The main reason to use multitask learning is to improve classifier performance
    * Every ML algorithm has an inductive bias
        * Inductive Bias = Implicit assumptions underlying its computations
            * Ex: Maximization of the distance (margin) b/w classes in an SVM.
    * In multitask learning, the tasks are trained together, so each task acts as an extra inductive bias for the other: the model is pushed toward solutions that work for both
        * Leads to better generalization for each task
        * Makes for a more robust classifier

## 8.2: Data
* Restaurant / Electronic product reviews
    * Will be used to test domain transfer
    * Domain Transfer: Transferring knowledge from one domain to another during learning in order to supplement small data sets with more data
* Reuters news dataset
    * Given combinations of pairs of topics (A+B, C+D), can we create combinations that benefit the modeling of each topic?
* Joint learning of Spanish POS tagging and named entity tagging
    * Shared data with different labelings
    * Can we get each learner to benefit from this?

## 8.3: Consumer Reviews: Yelp and Amazon
### 8.3.1: Data Handling
* Data loading flow:
    * Data
        * Documents(X)
            * Vectorized documents
            * Padded feature vectors
        * Labels
            * Labels converted to ints
            * Vectorized labels
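
The data flow above can be sketched in plain Python. This is a minimal illustration, not the chapter's actual code: the helper names (`build_vocab`, `vectorize`, `pad`, `encode_labels`), the whitespace tokenizer, the right-padding with 0, and the toy reviews are all assumptions made for the example.

```python
def build_vocab(docs):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def vectorize(docs, vocab):
    """Turn each document into a list of token ids (unknown tokens are dropped)."""
    return [[vocab[t] for t in doc.lower().split() if t in vocab] for doc in docs]


def pad(seqs, max_len):
    """Truncate or right-pad each sequence with 0 so all have length max_len."""
    return [(s + [0] * max_len)[:max_len] for s in seqs]


def encode_labels(labels):
    """Convert string labels to ints, then to one-hot vectors."""
    classes = sorted(set(labels))
    to_int = {c: i for i, c in enumerate(classes)}
    ints = [to_int[label] for label in labels]
    one_hot = [[1 if i == k else 0 for i in range(len(classes))] for k in ints]
    return ints, one_hot, classes


# Toy reviews and labels (made up for this sketch).
docs = ["Great food and friendly staff", "Terrible battery life", "Great battery"]
labels = ["pos", "neg", "pos"]

vocab = build_vocab(docs)
X = pad(vectorize(docs, vocab), max_len=5)
y_int, y_onehot, classes = encode_labels(labels)
```

Here `X` holds the padded feature vectors and `y_onehot` the vectorized labels, matching the two branches of the flow. A real pipeline would more likely rely on the tokenizer and padding utilities of the deep learning library in use.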

* Single Task Sentiment Classifier Performance
    * Amazon: 77.9%
    * Yelp: 71.5%

### 8.3.2: Hard Parameter Sharing
```
Task A Output           Task B Output
        Task A + B Hidden Layer
Task A Input            Task B Input
```
* We need our task-specific subnets to share information!
* Create a shared layer that combines some of the hidden-layer information from both tasks
* Which layer to share is found through experimentation
    * Hyperparameter Optimization
* For a fair comparison, reuse the same classifier code from 8.3.1
* Performance:
    * Amazon: 54.5%
    * Yelp: 76.5%
    * So only Yelp did better than its single-task baseline
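
The structure in the diagram above can be illustrated with a dependency-free forward pass. This is only a sketch under stated assumptions: the layer sizes, the `tanh` activation used everywhere, the random initialization, and the names `dense`, `init_layer`, and `forward` are invented for the example, and no training is shown. The point is that both tasks route their input through the same `shared` weights before reaching their own output layer.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer with tanh activation; x is a list of floats."""
    return [math.tanh(sum(xi * w for xi, w in zip(x, row)) + b)
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random weights (one row of n_in weights per output unit) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
shared = init_layer(4, 3, rng)   # ONE hidden layer used by both tasks
head_a = init_layer(3, 2, rng)   # task A output layer (e.g. Amazon sentiment)
head_b = init_layer(3, 2, rng)   # task B output layer (e.g. Yelp sentiment)


def forward(x, head):
    """Route an input through the shared layer, then a task-specific head."""
    hidden = dense(x, *shared)
    return dense(hidden, *head)


x_a = [1.0, 0.0, 0.5, -1.0]  # toy feature vector for task A
x_b = [0.2, 0.9, -0.3, 0.4]  # toy feature vector for task B
out_a = forward(x_a, head_a)
out_b = forward(x_b, head_b)
```

Because `shared` is a single set of parameters, a gradient update driven by either task would change the hidden representation seen by both. In Keras, the equivalent design is to create one layer object and call it on both task inputs.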

### 8.3.3 Soft Parameter Sharing
* Soft parameter sharing = Independent subnetworks that don't share layers but are all subject to the same constraints
```
Task A Output                 Task B Output
Task A Hidden Layer  <->   Task B Hidden Layer
     |        (Loss Function)      |
Task A Input                  Task B Input
```

```python
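import tensorflow as tf
from tensorflow import keras

def custom_loss(a, b):
    """Build a loss that ties together two hidden-layer outputs, a and b."""
    def loss(y_true, y_pred):
        # Task error: how wrong the predictions are.
        e1 = keras.losses.categorical_crossentropy(y_true, y_pred)
        # Distance terms: penalize the two layers for drifting apart.
        e2 = keras.losses.mean_squared_error(a, b)
        # Named cosine_proximity in older Keras releases.
        e3 = keras.losses.cosine_similarity(a, b)
        # Same quantity as e2, so the squared distance is counted twice.
        e4 = tf.reduce_mean(tf.square(a - b), axis=-1)
        return e1 + e2 + e3 + e4
    return loss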
```
* We can constrain the subnetworks in the same way by having them share a custom loss function!
    * The loss is how a model measures its error, so anything added to it shapes training
    * Also allows for conditioning the loss on arbitrary information
    * The composition of the loss is somewhat arbitrary and open to experimentation
        * It depends on which pair of layers is tied together and on which loss terms are combined
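
Written out, the `custom_loss` snippet above sums four terms. The notation is an assumption introduced for illustration: $y$ and $\hat{y}$ are the true and predicted label distributions, and $a, b \in \mathbb{R}^d$ are the outputs of the two layers being tied together.

$$
\mathcal{L}(y, \hat{y}; a, b) =
\underbrace{-\sum_i y_i \log \hat{y}_i}_{e_1}
+ \underbrace{\frac{1}{d}\sum_{j=1}^{d} (a_j - b_j)^2}_{e_2}
+ \underbrace{-\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}}_{e_3}
+ \underbrace{\frac{1}{d}\sum_{j=1}^{d} (a_j - b_j)^2}_{e_4}
$$

Two things are worth noticing. First, $e_2$ and $e_4$ are the same quantity, so the squared distance is effectively weighted twice. Second, the sign of $e_3$ assumes the Keras convention in which the cosine loss returns the negative cosine similarity, so minimizing it pulls $a$ and $b$ toward the same direction.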

* Performance:
    * Amazon: 77.5%
    * Yelp: 75%
    * Yelp beats its baseline (75% vs. 71.5%); Amazon is about on par (77.5% vs. 77.9%)

### 8.3.4 Mixed Parameter Sharing
* Combination of hard and soft parameter sharing
* The subnetworks share one or more layers and also adapt their remaining internal parameters to each other
```
Task A Output                 Task B Output
            Task A + B Hidden Layer
Task A Hidden Layer  <->   Task B Hidden Layer
     |                              |
Task A Input                  Task B Input
```
* Performance:
    * Amazon: 74%
    * Yelp: 69.5%
    * Both are worse than the single-task classifiers!

## 8.4 Reuters Topic Classification
* The Reuters dataset ships with Keras (`keras.datasets.reuters`)
* Texts are encoded as integer-based vectors w/ each integer representing a unique word

### 8.4.1 Data Handling
* For 2 pairs of topics, which pair can be learned as a separate additional task such that both tasks benefit?
* One-versus-one classification:
    * For N classes, train N(N-1)/2 binary classifiers, one per pair of classes
    * For each test document, find the class label assigned most often by those classifiers
    * Then assign that label to the document
* Data loading flow:
    * Data
        * Topic Combinations
            * Documents(X)
                * Vectorized documents
                * Padded feature vectors
            * Labels
                * Labels converted to ints
                * Vectorized labels
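
The one-versus-one scheme described above can be sketched as follows. The keyword-based `train_binary` is a hypothetical stand-in for a real trained model, and the three topic names and their keywords are made up for the example; only the pairing and voting logic is the point.

```python
from collections import Counter
from itertools import combinations


def train_one_vs_one(classes, train_binary):
    """Train one binary classifier per unordered pair of classes."""
    return {pair: train_binary(*pair) for pair in combinations(classes, 2)}


def predict_one_vs_one(classifiers, doc):
    """Let every pairwise classifier vote; return the most frequently assigned label."""
    votes = Counter(clf(doc) for clf in classifiers.values())
    return votes.most_common(1)[0][0]


# Hypothetical stand-in for a real trained model: one keyword rule per topic.
KEYWORDS = {"grain": "wheat", "crude": "oil", "trade": "tariff"}


def train_binary(class_a, class_b):
    """Return a toy classifier that picks class_a if its keyword occurs, else class_b."""
    def classify(doc):
        return class_a if KEYWORDS[class_a] in doc else class_b
    return classify


classes = ["grain", "crude", "trade"]
classifiers = train_one_vs_one(classes, train_binary)
prediction = predict_one_vs_one(classifiers, "oil prices rise again")
```

With 3 classes this trains 3(3-1)/2 = 3 classifiers. On a tie, `Counter.most_common` returns the label that received its first vote earliest, so a real system may want an explicit tie-breaking rule.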

### 8.4.2 Hard Parameter Sharing
* For each pair of topics, find another pair of two topics, and learn those two two-class problems together!
* Performance:
    * Pretty good job even for the low-scoring topics!

### 8.4.3 Soft Parameter Sharing
* Same procedure as 8.3.3, with the same outcome: soft parameter sharing did the best

### 8.4.4 Mixed Parameter Sharing
* Same procedure as 8.3.4, with the same outcome: mixed parameter sharing did the worst
* Each approach did a good job with some topics but not with others.
* How can we get just the best from each method?
    * It would make sense to create an ensemble classifier consisting of the 3 types of one-versus-all parameter sharing!
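
One possible way to combine the three approaches is soft voting: average the class probabilities of the three models and pick the highest. This is an assumption about how such an ensemble might be built, not something the notes specify; `ensemble_predict` and the probability values are hypothetical.

```python
def ensemble_predict(prob_lists):
    """Average class probabilities from several models; return (argmax index, averages)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical class probabilities for one document from the three sharing variants.
hard = [0.60, 0.40]
soft = [0.30, 0.70]
mixed = [0.45, 0.55]

label, avg = ensemble_predict([hard, soft, mixed])
```

Here the hard-sharing model prefers class 0, but the averaged probabilities favor class 1. Other designs, such as choosing the best-performing approach per topic on held-out data, would be equally reasonable to try.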

## 8.5 Part-of-Speech and Named Entity Recognition Data
* Comes from the Computational Natural Language Learning (CoNLL) conference
* Joint part-of-speech tagging / named entity tagging for Dutch and Spanish
* Named entities vary widely in length
    * Melbourne: One-word entity
    * United States of America: Phrasal entity

### 8.5.1 Data Handling
```
DATA -> Windowing -> Split windowed data into POS tagging and named entity tagging ->
Task 1 Docs | Task 1 Labels -> Vectorized docs
                            -> Labels converted to int -> Vectorized labels
Task 2 Docs | Task 2 Labels -> Vectorized docs
                            -> Labels converted to int -> Vectorized labels
```
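
The windowing and task-splitting steps in the flow above might look like the sketch below. It assumes each window is centred on one token and labelled with that token's tag, which is one plausible reading of the flow rather than the chapter's confirmed method. The `window` helper, the `<PAD>` token, and the toy sentence with its tags are illustrative.

```python
def window(tokens, size, pad="<PAD>"):
    """Return one window of `size` tokens centred on each position (size must be odd)."""
    half = size // 2
    padded = [pad] * half + tokens + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]


# Toy sentence in (word, POS tag, NER tag) form; the tags are illustrative.
sentence = [
    ("Melbourne", "NP", "B-LOC"),
    ("is", "VB", "O"),
    ("sunny", "AQ", "O"),
]

words = [w for w, _, _ in sentence]
windows = window(words, size=3)

# Same windowed inputs, two label sets: one per task.
pos_task = [(win, pos) for win, (_, pos, _) in zip(windows, sentence)]
ner_task = [(win, ner) for win, (_, _, ner) in zip(windows, sentence)]
```

Both tasks see identical inputs and differ only in their labels, which is the "shared data with different labelings" setup. From here, each task's documents and labels would go through the usual vectorization steps.
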
* Baseline Single Task CoNLL Data Performance:
    * POS: 91.7%
    * NER: 93.2%

### 8.5.2 Hard Parameter Sharing
* Same procedure as 8.3.2
* Performance:
    * POS: 91.8%
    * NER: 94.2%
    * NER benefits more, but both tasks improve on their baselines

### 8.5.3 Soft Parameter Sharing
* Same procedure as 8.3.3
* Performance:
    * POS: 91.7%
    * NER: 94.8%
    * POS matches its baseline; NER improves

### 8.5.4 Mixed Parameter Sharing
* Same procedure as 8.3.4
* Performance:
    * POS: 91.7%
    * NER: 94.6%
    * POS matches its baseline; NER improves

## 8.6 Summary
* You can implement and apply deep multitask learning with hard parameter sharing (sharing layers across sub-classifiers), soft parameter sharing, and mixed parameter sharing (a combination of hard and soft parameter sharing).
* Multitask learning can produce significantly better results for one or all of the subtasks learned.
* Soft parameter sharing yields the best results overall, but differences with hard parameter sharing are usually small.
* Mixed parameter sharing does not stand out compared to the other two approaches.
* It is up to the NLP engineer to come up with optimal combinations of layers, task combinations, and custom loss functions, using practical experimentation guided by trial and error.