# Stance Classification of Tweets using Transfer Learning
This notebook shows how *transfer learning*, an extension of deep learning, can be used for predicting Tweet stance toward a particular topic.

# 1. Motivation

The traditional approach to applying deep learning methods in NLP has involved feeding a model large amounts of labeled training data and fitting the model's parameters to this data. In practice, natural language data is highly variable and comes in a variety of forms (tweets, blog posts, reviews etc.), so a model trained for a particular language task does not generalize well to new data from another distribution. In addition, many natural language applications do not come with an abundance of labeled examples, and human annotation gets very expensive as datasets grow.

This offers good motivation to explore the notion of [transfer learning](http://ruder.io/transfer-learning/index.html#whatistransferlearning), a machine-learning technique that transfers knowledge to novel scenarios not encountered during training. While transfer learning has been ubiquitous throughout computer vision applications since the advent of huge datasets such as ImageNet, it is only since 2017-18 that significant progress has been made for transfer learning in NLP applications. A string of interesting papers in 2018 discuss the power of language models in natural language understanding and how they can be used to provide pre-trained representations of a language's syntax, which can be far more useful when training a neural network for previously unseen tasks.

Twitter data is a very interesting use case for transfer learning, mainly because the language syntax typically seen in Tweets is quite different from that of the text used to train language models. For these reasons, the **2016 SemEval Stance Detection task** is chosen for studying the effectiveness of our transfer learning approach. The dataset, experiments and evaluation criteria are explained in the sections below.

The aim of this notebook is to highlight the development of a model that can help answer the following questions:
- How does our approach generalize to Twitter-specific language syntax?
- Are we able to achieve reasonable results (comparable to the winning team of SemEval 2016 Task 6) with *limited amounts of training data* and *limited computing resources*?
- How much fine-tuning effort is required to achieve reasonable results?

# 2. Approach
In this notebook, we approach the stance detection problem using a [PyTorch port](https://github.com/huggingface/pytorch-openai-transformer-lm) of the **OpenAI transformer** [[paper]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) as per Radford et al., 2018.

## 2.1 Transformer Model Architecture
The OpenAI transformer is an adaptation of the well-known transformer from Google Brain's 2017 paper. 

![title](assets/transformer_arch.png)

Source: [Attention is all you need, 2017](https://arxiv.org/pdf/1706.03762.pdf)

While the original version from Google Brain used identical 6-layer encoder and decoder stacks, the OpenAI transformer uses a 12-layer decoder-only stack. Each layer has two sub-layers: a masked multi-head self-attention mechanism and a fully connected (position-wise) feed-forward network. A full description of the transformer architecture used for transfer learning is given in the paper.
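The self-attention sub-layer can be illustrated with a minimal single-head sketch in plain Python. It assumes the causal (left-to-right) masking that the OpenAI paper describes for its language model, and it skips the learned query/key/value projections and the multiple heads of the real model. The function names and the toy vectors are illustrative only and are not taken from the PyTorch port.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i may only attend to positions 0..i, which is what allows
    the model to be trained as a left-to-right language model.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scores against the visible prefix only (this is the causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w * values[j][d] for j, w in enumerate(weights))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = masked_self_attention(x, x, x)
```

Because of the mask, the first position can only attend to itself, so its output equals its own value vector; later positions mix information from everything to their left.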

Classification is performed in the following stages:
1. **Unsupervised pretraining**: The OpenAI transformer is trained as a language model on an unlabeled corpus of tokens from the BooksCorpus dataset (thousands of books), and the pretrained weights are made publicly available for further fine-tuning.
2. **Supervised fine-tuning**: We can adapt the parameters to the supervised target task. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_{l}^{m}$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_{l}^{m} W_y)$
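As a rough illustration of the added output layer, the sketch below applies a linear map followed by a softmax to a final-block activation. The 4-dimensional activation, the weight matrix and the helper names are made up for this example; the real model works with much larger vectors and learns $W_y$ during fine-tuning.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_final, W_y):
    """Return P(y | x) = softmax(h_final . W_y).

    h_final: final-block activation at the last token, length d.
    W_y:     d x num_classes weight matrix of the added linear layer.
    """
    num_classes = len(W_y[0])
    logits = [sum(h_final[i] * W_y[i][c] for i in range(len(h_final)))
              for c in range(num_classes)]
    return softmax(logits)

# Hypothetical activation and weights for 3 stance classes
# (0: AGAINST, 1: FAVOR, 2: NONE).
h = [0.5, -1.0, 0.25, 2.0]
W = [[0.1, 0.3, -0.2],
     [0.0, -0.1, 0.4],
     [0.5, 0.2, 0.1],
     [0.3, -0.4, 0.0]]
probs = classify(h, W)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The predicted stance is simply the class with the highest probability.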

The first step (unsupervised pretraining) is *very* expensive and was done by OpenAI on a large GPU cluster. It need not be repeated in our case: we can directly use the pretrained weights and fine-tune them together with the classifier.

To perform out-of-domain target tasks such as text classification, the transformer includes language modeling as an auxiliary objective during fine-tuning, which helps generalization [[OpenAI transformer, 2018]](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf). This auxiliary language modeling objective is weighted by a parameter $\lambda$ as shown below.

$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$

where $L_1$, $L_2$ and $L_3$ are the likelihoods for the language modeling objective, task-specific objective and combined objective respectively.
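A small worked example may help make the role of $\lambda$ concrete. The sketch below writes the objectives as negative log-likelihoods to be minimized (the sign-flipped form of the likelihoods above), and all probabilities are made-up numbers. The function names are hypothetical and do not correspond to the loss code in the PyTorch port.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under `probs`."""
    return -math.log(probs[target])

def combined_loss(clf_probs, label, lm_probs_per_step, next_tokens, lm_coef):
    """Combined objective L3 = L2 + lambda * L1, expressed as a loss."""
    # L2: task-specific (classification) term.
    l2 = cross_entropy(clf_probs, label)
    # L1: language modeling term, summed over next-token predictions.
    l1 = sum(cross_entropy(p, t) for p, t in zip(lm_probs_per_step, next_tokens))
    return l2 + lm_coef * l1

# Made-up model outputs for a single example.
clf_probs = [0.7, 0.2, 0.1]            # P(AGAINST), P(FAVOR), P(NONE)
lm_probs = [[0.5, 0.5], [0.25, 0.75]]  # next-token distributions, toy 2-word vocabulary
next_tokens = [0, 1]

loss_with_lm = combined_loss(clf_probs, 0, lm_probs, next_tokens, lm_coef=0.5)
loss_without_lm = combined_loss(clf_probs, 0, lm_probs, next_tokens, lm_coef=0.0)
```

Setting the coefficient to zero leaves only the classification term, so the weights are then updated purely from the stance labels.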

## 2.2 Task-specific input transformations
OpenAI designed their transformer to generalize to a range of natural language tasks. In order to do this, they allow the definition of custom task-specific "heads" as per the below schematic. The task-specific head acts on top of the base transformer language model, and is defined in the ```DoubleHeadModel``` class in ```model_pytorch.py```. 

![title](assets/openai_taskheads.png)

Source: [Improving language understanding by generative pre-training, 2018](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)


The [PyTorch port](https://github.com/huggingface/pytorch-openai-transformer-lm) of the OpenAI transformer that is used was originally tested on a multiple choice classification dataset (ROCStories). For this Tweet stance detection task, we follow the schematic above to write a classification transform, such that we pad every text (representing each Tweet, in our case) with a start symbol and tokenize it for input to the transformer.
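A classification transform along these lines could look roughly like the sketch below. It assumes the layout from the paper's schematic (start token, text tokens, then an extract/classify token) with padding up to a fixed context length and a mask marking the real tokens. The function name, the token ids and the padding id are all hypothetical, and this is not the actual transform in ```train_stance.py```.

```python
def transform_classification(token_ids, start_id, clf_id, n_ctx, pad_id=0):
    """Wrap one tokenized Tweet as [start] + tokens + [clf], padded to n_ctx.

    Returns (ids, mask), where mask is 1.0 for real tokens and 0.0 for padding.
    """
    max_len = n_ctx - 2  # leave room for the two special tokens
    ids = [start_id] + list(token_ids[:max_len]) + [clf_id]
    mask = [1.0] * len(ids)
    padding = n_ctx - len(ids)
    return ids + [pad_id] * padding, mask + [0.0] * padding

# Hypothetical ids; in practice they would come from the BPE encoder.
START, CLF = 1000, 1001
ids, mask = transform_classification([11, 22, 33], START, CLF, n_ctx=8)
```

Tweets longer than the context window are truncated before the special tokens are added, so the classify token is always present.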

# 3. Data

We perform stance detection of Tweets on five distinct topics, as per [SemEval 2016: Task 6](http://alt.qcri.org/semeval2016/task6/). Due to time constraints, we only look at Task A: "Supervised Framework" in this notebook. The train and test data (including the gold labels) are in the ```data/``` directory provided along with this repository.

The five topics for which we classify the stance are given below. 

| Topic |
|:------------: |
| Atheism |
| Climate Change is a Real Concern |
| Feminist Movement |
| Hillary Clinton |
| Legalization of Abortion |

A more detailed breakdown of the tweets for this shared task is provided in [this link](http://www.saifmohammad.com/WebPages/StanceDataset.htm). 

## Size of dataset
The total number of Tweets (in the training set) available for this task is roughly 2700, which amounts to roughly 500-600 Tweets per topic. Thus, this can be considered a small dataset. 

![title](assets/stance_balance.png)

Upon inspecting the training data, it is clear that the number of Tweets in favor of a topic differs considerably from the number against it. The class balance also varies considerably from topic to topic, as well as in the overall data.
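The class balance can be tabulated with a few lines of Python. The sketch below counts stance labels per topic from `(topic, stance)` pairs; the rows are made up for illustration and are not taken from the dataset, and `stance_balance` is a hypothetical helper rather than part of ```datasets.py```.

```python
from collections import Counter

def stance_balance(rows):
    """Count stance labels per topic from (topic, stance) pairs."""
    counts = {}
    for topic, stance_label in rows:
        counts.setdefault(topic, Counter())[stance_label] += 1
    return counts

# Made-up rows for illustration only.
rows = [("Atheism", "AGAINST"), ("Atheism", "AGAINST"), ("Atheism", "FAVOR"),
        ("Hillary Clinton", "NONE"), ("Hillary Clinton", "AGAINST")]

balance = stance_balance(rows)
for topic, counter in balance.items():
    total = sum(counter.values())
    # Share of each stance within the topic, rounded for display.
    shares = {label: round(n / total, 2) for label, n in counter.items()}
    print(topic, dict(counter), shares)
```

With the real training file, the same pairs could be read from the `Target` and `Stance` columns.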

An interactive visualization of the complete dataset (along with a topic-specific breakdown) is provided by the organizers of the competition [in this link](http://www.saifmohammad.com/WebPages/StanceDataset.htm).

## Pretrained language models

The OpenAI transformer language model pretrained weights can be downloaded directly from the [OpenAI GitHub repository](https://github.com/openai/finetune-transformer-lm). These are then stored in a directory called ```model/```.

# 4. Code

The code used for classifying Tweet stance is [this PyTorch port of the OpenAI transformer](https://github.com/huggingface/pytorch-openai-transformer-lm). The following modifications are made to the original version of the transformer that was written to perform multiple-choice classification. 

1.  ```datasets.py```: A custom dataloader was written that processes the Tweets and splits the output into training, validation and test data.

2.  ```train_stance.py```: A custom classification input transform is written as per the image in section 2.2 (above), to feed the Tweets to the transformer for classification.

3.  ```parse_output.py```: The predicted stance for each Tweet is written out in a format that can be read in by the evaluation script, for scoring the model. 

The below code shows the dataloader functionality in ```datasets.py```.

In [1]:
import os
import csv
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

### Cleaning code for the input dataloader

In [2]:
def _stance(path, topic=None):
    def clean_ascii(text):
        # function to remove non-ASCII chars from data
        return ''.join(i for i in text if ord(i) < 128)
    orig = pd.read_csv(path, delimiter='\t', header=0, encoding = "latin-1")
    orig['Tweet'] = orig['Tweet'].apply(clean_ascii)
    df = orig
    # Get only those tweets that pertain to a single topic in the training data
    if topic is not None:
        df = df.loc[df['Target'] == topic]
    X = df.Tweet.values
    stances = ["AGAINST", "FAVOR", "NONE", "UNKNOWN"]
    class_nums = {s: i for i, s in enumerate(stances)}
    Y = np.array([class_nums[s] for s in df.Stance])
    return X, Y

### Split raw Tweet data into training, validation and test sets

In [3]:
def stance(data_dir, topic=None):
    path = Path(data_dir)
    trainfile = 'semeval2016-task6-trainingdata.txt'
    testfile = 'SemEval2016-Task6-subtaskA-testdata.txt'

    X, Y = _stance(path/trainfile, topic=topic)
    teX, _ = _stance(path/testfile, topic=topic)
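    # NOTE: `seed` is read from the notebook-level scope, so it must be defined before stance() is called
    tr_text, va_text, tr_sent, va_sent = train_test_split(X, Y, test_size=0.2, random_state=seed)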
    trX = []
    trY = []
    for t, s in zip(tr_text, tr_sent):
        trX.append(t)
        trY.append(s)

    vaX = []
    vaY = []
    for t, s in zip(va_text, va_sent):
        vaX.append(t)
        vaY.append(s)
    trY = np.asarray(trY, dtype=np.int32)
    vaY = np.asarray(vaY, dtype=np.int32)
    return (trX, trY), (vaX, vaY), (teX, )

### Example output from dataloader
The dataloader methods output the train/validation/test data as a tuple of tuples, as shown below. These will be used to feed the input transforms and perform stance classification in ```train_stance.py```.

In [4]:
# Define random seed for train/val split
seed = 3535999445
data_dir = "./data"
# Dataloader output
(trX, trY), (vaX, vaY), teX = stance(data_dir)

We can print each Tweet in the training set along with its numericalized stance (0: AGAINST, 1: FAVOR, 2: NONE). Note that we do not perform any special cleaning of the data to remove '@' mentions or hashtags. All information is retained and encoded so that the language model can be fine-tuned using as much input information as possible.

In [7]:
for i, t in enumerate(trX[:10]):
    print(t, trY[i])

.@msimire says children, women, elderly people especially at risk to health impacts of #ClimateReporting2015 #SemST 1


The test data does not come with a stance (that's what we need to predict).

In [6]:
# Test set
print(teX[0][:5])

['He who exalts himself shall      be humbled; and he who humbles himself shall be exalted.Matt 23:12.     #SemST'
 'RT @prayerbullets: I remove Nehushtan -previous moves of God that have become idols, from the high places -2 Kings 18:4 #SemST'
 '@Brainman365 @heidtjj @BenjaminLives I have sought the truth of my soul and found it strong enough to stand on its own merits. #SemST'
 '#God is utterly powerless without Human intervention... #SemST'
 '@David_Cameron   Miracles of #Multiculturalism   Miracles of shady 786  #Taqiya #Tawriya #Jaziya #Kafirs #Dhimmi #Jihad #Allah #SemST']


# 5. Experimental Setup

For stance detection, we use a semi-supervised approach where we reuse weights from a pretrained language model, and perform multi-class classification for the training data over the three classes ('FAVOR', 'AGAINST' and 'NONE').

### Reference result
To inform our methods and to have a benchmark to compare our results against, we looked at the winning paper for this shared task, from team *MITRE*, who [published their methodology and approach](https://arxiv.org/pdf/1606.03784.pdf). 

### Evaluation
The metric used to score the stance classification is **F-score**. The SemEval event organizers provided an [evaluation script](http://alt.qcri.org/semeval2016/task6/index.php?id=data-and-tools) that calculates the macro-average of F-score (FAVOR) and F-score (AGAINST) for task A. This compares our model's predicted stance for each Tweet against the gold reference.
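For intuition, the metric can be reproduced in a few lines of Python. The sketch below is not the organizers' perl script; it is a minimal re-implementation under the definition given above (the mean of the FAVOR and AGAINST F-scores), and the gold and guess lists are made up.

```python
def f_score(gold, guess, label):
    """F-score of a single class from parallel lists of labels."""
    tp = sum(1 for g, p in zip(gold, guess) if g == label and p == label)
    predicted = sum(1 for p in guess if p == label)
    actual = sum(1 for g in gold if g == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(gold, guess, labels=("FAVOR", "AGAINST")):
    """Macro-average of the per-class F-scores over `labels`."""
    return sum(f_score(gold, guess, label) for label in labels) / len(labels)

# Made-up gold labels and predictions for six Tweets.
gold  = ["FAVOR", "AGAINST", "NONE", "AGAINST", "FAVOR", "NONE"]
guess = ["FAVOR", "AGAINST", "AGAINST", "NONE", "NONE", "NONE"]
score = macro_f(gold, guess)
```

Although NONE is not averaged in directly, it still matters: a NONE Tweet predicted as AGAINST lowers the AGAINST precision, and a FAVOR Tweet predicted as NONE lowers the FAVOR recall.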

We use the *perl* script provided by the organizers to generate our F-score. The evaluation script is in ```data/eval/``` and has the following usage:
    
    cd data/eval
    perl eval.pl -u

    ---------------------------
    Usage:
    perl eval.pl goldFile guessFile

    goldFile:  file containing gold standards;
    guessFile: file containing your prediction.
    
### Stance Prediction
The predicted output stances on the test dataset are written out in the format expected by the evaluation *perl* script, and the F-scores are reported as per this evaluation.

# 6. Results

## 6.1 Transfer Learning Using the Transformer
We use the pretrained weights for the model provided in [OpenAI's GitHub repository](https://github.com/openai/finetune-transformer-lm). The parameter files are placed inside the ```model/``` directory so that they can be called into ```train_stance.py```.  As described in section 4, various parts of the code were modified to suit the Tweet stance detection task.

To study the performance of the transformer model, we first study the hyperparameter selection.  The main focus is on adjusting the dropout of the various layers, and the LM coefficient $\lambda$, which controls the tradeoff between the language modeling head and the task head as per the following formula from the OpenAI paper:

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$

where $C$ is a labelled input dataset, $L_2$ is the likelihood function for the classification head, and $L_1$ is the likelihood function for the language modeling head.  Three trials are run for most configurations. Single trials are run only where the first trial showed a large enough change in results to be fairly certain that the hyperparameter change was causing a significant difference.

The transformer uses [spaCy](https://spacy.io/) for tokenization. The file ```text_utils.py``` contains a wrapper for a byte-pair encoded tokenizer.  In order to use spaCy's English tokenizer, first make sure that it is installed properly.

    python3 -m spacy download en 

To start the training, type the following into the terminal:

    cd transformer-openai
    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --submission_dir default

In order to evaluate the results, we need to get the predictions into the format required by the eval script. This is done using the script `parse_output.py`.  It takes as arguments the path of the test data file, the path of the predictions created by `train_stance.py`, the path to write the results to, and an optional topic to filter by.  Example usage is below:

    python3 parse_output.py ../data/SemEval2016-Task6-subtaskA-testdata.txt default/stance.tsv ../results/predicted_default.txt


Now we can check the F-score using the evaluation script provided with the dataset.  For example:

In [7]:
!perl data/eval/eval.pl data/eval/gold.txt results/transformer_predicted_2.txt



Results				 
FAVOR     precision: 0.5899 recall: 0.6908 f-score: 0.6364
AGAINST   precision: 0.7990 recall: 0.7007 f-score: 0.7466
------------
Macro F: 0.6915



This result appears to be marginally higher than the one reported by *MITRE* in their winning submission!

## Experiments
Below are the training commands used for several experiments searching for optimal hyperparameter settings.  For cases with three trials, the training was run with a `--seed` argument, using the values 42, 43, and 44 for the three trials.  For brevity, only the command used to train a single trial is shown below for each configuration.
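The three-seed procedure could be scripted along the lines of the sketch below, which only builds and prints the commands rather than running them. The flags are the ones used in this notebook; the helper name and the per-seed submission directory naming (`<name>_seed<seed>`) are assumptions made for illustration and are not the naming used in the experiments here.

```python
import shlex

def build_commands(submission_dir, extra_args=(), seeds=(42, 43, 44)):
    """Build one train_stance.py command line per random seed."""
    base = ["python3", "train_stance.py", "--dataset", "stance", "--desc", "stance",
            "--submit", "--data_dir", "../data"]
    commands = []
    for seed in seeds:
        # Hypothetical convention: one output directory per seed.
        cmd = base + list(extra_args) + ["--seed", str(seed),
                                         "--submission_dir", f"{submission_dir}_seed{seed}"]
        commands.append(cmd)
    return commands

commands = build_commands("default")
for cmd in commands:
    print(shlex.join(cmd))
```

Extra hyperparameters can be passed through `extra_args`, for example `["--clf_pdrop", "0.3"]`, and each printed line can be pasted into the terminal or handed to `subprocess.run`.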

### Default
    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --submission_dir default

### Dropout
The training accuracies tend to be extremely high compared to the validation and test accuracies, most likely due to the small size of the dataset relative to the number of parameters in the model.  Typically, this can be mitigated by varying dropout.  Since we are not modifying the unsupervised pretrained weights (we are reusing OpenAI's publicly shared parameter files), we only modify the dropout of the classification layer in our model.

The below four commands show the input arguments used for the first trial of each value of classification dropout tested.

    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --clf_pdrop 0.05 --submission_dir clf_less_drop
    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --clf_pdrop 0.2 --submission_dir clf_drop
    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --clf_pdrop 0.3 --submission_dir clf_3drop
    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --clf_pdrop 0.5 --submission_dir clf_5drop

The results of this are output to the `results` folder, called `predicted_lessdrop`, `predicted_drop`, `predicted_3drop`, and `predicted_5drop` respectively.  The case with 0.05 dropout included only one trial since it didn't seem very promising.

Next, we can increase the dropout in all layers, this time only trying with 0.3 and 0.5 since 0.2 did not seem to affect things much in the previous experiment.

    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --clf_pdrop 0.3 --embd_pdrop 0.3 --resid_pdrop 0.3 --submission_dir all_3drop
    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --clf_pdrop 0.5 --embd_pdrop 0.5 --resid_pdrop 0.5 --submission_dir all_5drop

These results are called `predicted_super3drop` and `predicted_super5drop` respectively.

### LM Coefficient
Next, we set the LM coefficient to zero so that the weights are updated purely based on the performance of the classification head rather than the language model head. This was something suggested in the [OpenAI paper](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) in cases where the amount of training data is small. Hence, it makes logical sense to try this.

    python3 train_stance.py --dataset stance --desc stance --submit --data_dir ../data --submission_dir no_lm --lm_coeff 0

## Results
| Case | Trial 1 F-score | Trial 2 F-score | Trial 3 F-score |
| --- | --- | --- | --- |
| default | 0.6790 | 0.6915 | 0.6649 |
| less_drop | 0.6432 | - | - |
| drop | 0.6625 | 0.6734 | 0.6799 |
| 3drop | 0.6758 | 0.6725 | 0.6701 |
| 5drop | 0.6690 | 0.6826 | - |
| super3drop | 0.6610 | 0.6808 | 0.6767 |
| super5drop | 0.3876 | - | - |
| no_lm | 0.6293 | - | - |
Overall, modifying the dropout seemed to make very little difference unless it was lowered, or raised to an extreme amount on all layers. The single best F-score was obtained when the default values in ```train_stance.py``` were applied for all hyperparameters.  The training accuracy *decreased* with higher dropout, especially in the case of 0.3 dropout on all layers, but this did nothing to improve the F-score.
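To compare configurations across trials rather than by their single best run, the F-scores from the table above can be averaged per case. The sketch below uses only the numbers reported in the table; the dictionary layout and variable names are illustrative.

```python
from statistics import mean, stdev

# F-scores copied from the results table; missing trials are simply omitted.
f_scores = {
    "default":    [0.6790, 0.6915, 0.6649],
    "less_drop":  [0.6432],
    "drop":       [0.6625, 0.6734, 0.6799],
    "3drop":      [0.6758, 0.6725, 0.6701],
    "5drop":      [0.6690, 0.6826],
    "super3drop": [0.6610, 0.6808, 0.6767],
    "super5drop": [0.3876],
    "no_lm":      [0.6293],
}

summary = {}
for case, scores in f_scores.items():
    # A sample standard deviation needs at least two trials.
    spread = stdev(scores) if len(scores) > 1 else None
    summary[case] = (mean(scores), spread)

for case, (avg, spread) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    spread_text = f"{spread:.4f}" if spread is not None else "n/a"
    print(f"{case:<11} mean={avg:.4f} stdev={spread_text} trials={len(f_scores[case])}")
```

The default configuration also has the highest mean, but its spread across seeds (roughly 0.013) is larger than its lead over the other multi-trial dropout settings, so those differences should be read with some caution.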

Despite the paper's suggestion that excluding (or reducing the weight of) the language modeling objective could help for smaller datasets, in this Tweet stance detection task, excluding the language modeling objective seemed to make things noticeably worse.  This is likely due to how different the training set distribution (of Tweets with their quirky syntax) is from the BooksCorpus data that the language model was pretrained on.

# 7. Conclusions
This notebook showed a training and classification pipeline for a PyTorch port of the OpenAI transformer for evaluating the stance of Tweets towards a particular topic. With some basic data cleaning and some custom code for task-specific goals, the transfer learning approach using the OpenAI transformer seems to provide good F-scores (compared with the results of the SemEval 2016 winning team, *MITRE*).

The main benefit of using the OpenAI transformer is that it appears to generalize well to a completely different distribution (i.e., Tweets), with very minimal modifications to the original framework and minimal hyperparameter tuning. In addition to the relative ease of adapting the transformer code, the model is also quite inexpensive to fine-tune. The fine-tuning of the language model and the classification layer happens simultaneously, and we achieve a good F-score (of 0.69) comparable to that of the winning team *MITRE* (0.67), in **just 2-3 epochs of training!** This shows that transformers are a promising tool to perform generalizable transfer learning for a wide range of classification tasks, even when the input distribution is very different from the pretrained language model's distribution.

One limitation of the transformer is that it is prone to overfitting when we have a very small input dataset (of ~500 Tweets). This is noticeable when we try to perform classification on each topic *individually*, where we only have around 500 Tweets per topic. In that case, the F-score drops, as does the validation accuracy (whereas training accuracy remains very high), indicating significant overfitting. This makes sense considering that the task head operates on 768-dimensional hidden states, so we would logically have to train it on a dataset several times this size.

We can likely further improve the classification accuracy across all topics by feeding a larger dataset, of several thousand Tweets, to the transformer model.

In general, Tweets are sufficiently different from the typical language data used to train language models, and hence are an interesting use case for analyzing the effectiveness of transfer learning techniques. It will be interesting to see how transfer learning techniques for such NLP tasks evolve with time.