<a href="https://colab.research.google.com/github/alexgastone/tweet_extraction/blob/master/TweetSentimentExtraction_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tweet Sentiment Extraction as a Question Answering task

## The Problem

https://www.kaggle.com/c/tweet-sentiment-extraction

> "My ridiculous dog is amazing." `sentiment: positive`

With tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost a company's or a person's brand by going viral (positive) or devastate profit because it strikes a negative tone. Capturing sentiment in language matters at a time when decisions and reactions are made and updated in seconds. But which words actually lead to the sentiment description? The task is to pick out the part of the tweet (a word or phrase) that reflects the sentiment.

> What words in tweets support a positive, negative, or neutral sentiment? 

## Data
In this competition, Kaggle has extracted support phrases from Figure Eight's Data for Everyone platform. The dataset, titled *Sentiment Analysis: Emotion in Text*, consists of tweets with **existing sentiment labels** and is used here under the Creative Commons Attribution 4.0 International licence. The objective in this competition is to construct a model that can do the same: look at the labeled sentiment for a given tweet and figure out what word or phrase best supports it.

*Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.*

The supervised training data therefore consists of tweets that are prelabelled with a sentiment, together with the selected text that best supports that sentiment.

## Question Answering 
The goal of question answering is to `answer` a `question` given a `context`. We can phrase our problem as a question answering task in the following way:
> `question`: sentiment label

> `context`: tweet

> `answer`: selected text
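
To make this framing concrete, here is a minimal sketch that maps a single tweet onto the three roles. The tweet is the one quoted at the top of the notebook; the selected text `"amazing"` is an assumption made purely for illustration and is not taken from the dataset.

```python
# Hypothetical example row: the tweet comes from the quote above,
# the selected text is an assumed support phrase for illustration only.
tweet = "My ridiculous dog is amazing."
sentiment = "positive"
selected_text = "amazing"

context = tweet.lower()
answer_text = selected_text.lower()

qa_example = {
    "question": sentiment,          # the sentiment label plays the role of the question
    "context": context,             # the tweet is the passage to search
    "answer": {
        "text": answer_text,        # the support phrase is the answer span
        "answer_start": context.find(answer_text),  # character offset in the context
    },
}

print(qa_example)
```

The character offset is what lets a span-extraction model learn where the answer begins inside the context.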

Question answering is one of the NLP tasks that transformer models, such as Google's pretrained BERT, handle well. Currently, one of the top-scoring BERT-style models for question answering on the SQuAD dataset is ALBERT (A Lite BERT), published in September 2019.

We will therefore fine-tune this model using the transformers library from [Hugging Face](https://huggingface.co/).

For more info on ALBERT, check out https://paperswithcode.com/paper/albert-a-lite-bert-for-self-supervised.


## 1.0 Setup

In [0]:
!git clone https://github.com/huggingface/transformers

In [0]:
!pip install ./transformers
!pip install tensorboardX

In [0]:
from google.colab import files
import io
import pandas as pd
import numpy as np
import os
import json

## 2.0 Train Model

### 2.1 Load training data

In [0]:
# import train
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# read in as dataframe
train_df_full = pd.read_csv(io.BytesIO(uploaded['train.csv']))

In [0]:
# import test
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# read in as dataframe
test_df = pd.read_csv(io.BytesIO(uploaded['test.csv']))

In [0]:
# drop nan at row index 314
train_df_full.dropna(axis=0, inplace=True)
print(f'size of train set: {train_df_full.shape[0]}')
print(f'size of test set: {test_df.shape[0]}')

In [0]:
train_df_full.head()

### 2.2 Split train into evaluation and training sets

#### 2.2.1. Train/test split by fraction

In [0]:
def train_split(df, frac=0.1):
  length = df.shape[0]
  cutoff = int(length * frac)

  # randomize rows
  np.random.seed(32)
  df = df.sample(frac=1).reset_index(drop=True)

  # split
  val_df, train_df = df.iloc[:cutoff], df.iloc[cutoff:]

  return train_df, val_df

In [0]:
train_df, val_df = train_split(train_df_full)
print(f'size of train set: {train_df.shape[0]}')
print(f'size of evaluation set: {val_df.shape[0]}')

In [0]:
val_df.isnull().sum()

### 2.3 Convert Training and Evaluation Data to SQuAD-like format for use in training

In [0]:
# check which column corresponds to question, context, answer, and ID
for col in train_df:
  print(col)

In [0]:
def squad(array):
  """
  array: dataset to convert, as numpy array
  returns SQuAD-like dictionary
  """
  output = {} 

  # SQuAD dataset additionally contains version key along with the data key
  output['version'] = 'v1.0'
  output['data'] = []

  for idx, row in enumerate(array):
    answer_list = []
    qu_ans = [] # question and answer
    paragraphs = [] # context, question, and answer

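    # row layout: id, tweet, selected text (train/eval only), sentiment (last column)
    q_id, context, answer, question = row[0], row[1], row[2], row[-1]
    
    # for test set: there is no selected-text column, so row[2] and row[-1]
    # are the same value (the sentiment)
    if answer == question: 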
      if idx==0:
        print('Testing dataset')
      ans_index=None
      answer=None
      title='Test'

    # for training/eval sets (i.e. answer is provided)
    else:
      if idx == 0:
        print('Training dataset')
      ans_index = context.lower().find(answer.lower())
      if ans_index==-1:
        print('No index found')
      answer = answer.lower()
      title='Train'

    answer_list.append({'answer_start': ans_index, 'text': answer})
    qu_ans.append({'question': question, 'id': q_id, 'is_impossible': False, 'answers': answer_list})
    paragraphs.append({'context': context.lower(), 'qas': qu_ans})

    output['data'].append({'title': title, 'paragraphs': paragraphs})

  return output
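
Because `answer_start` is found with a case-insensitive string search, it can be worth checking that every answer really sits at its recorded offset before training. The sketch below is one way to do that; `find_misaligned` is a hypothetical helper (not part of the transformers library), and the sample dictionary is hand-made in the same layout that `squad()` produces.

```python
def find_misaligned(squad_dict):
    """Return the ids whose answer text is not found at answer_start in the context."""
    bad_ids = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start, text = ans["answer_start"], ans["text"]
                    if text is None:  # test-set entries carry no answer
                        continue
                    if start < 0 or context[start:start + len(text)] != text:
                        bad_ids.append(qa["id"])
    return bad_ids


# Hand-made sample: entry "a" is aligned, entry "b" has a failed search (-1)
sample = {
    "version": "v1.0",
    "data": [
        {"title": "Train", "paragraphs": [{
            "context": "my ridiculous dog is amazing.",
            "qas": [{"question": "positive", "id": "a", "is_impossible": False,
                     "answers": [{"answer_start": 21, "text": "amazing"}]}]}]},
        {"title": "Train", "paragraphs": [{
            "context": "so sad today",
            "qas": [{"question": "negative", "id": "b", "is_impossible": False,
                     "answers": [{"answer_start": -1, "text": "very sad"}]}]}]},
    ],
}

print(find_misaligned(sample))
```

Any ids returned here correspond to rows where the `No index found` message would have been printed during conversion.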

In [0]:
train_squad, val_squad, test_squad = squad(np.array(train_df)), squad(np.array(val_df)), squad(np.array(test_df))

In [0]:
# save as json files
directory = 'dataset'

if not os.path.exists(directory):
  os.makedirs(directory)

filename = 'train.json'
with open(os.path.join(directory, filename), 'w') as outfile:
  json.dump(train_squad, outfile)
filename = 'val.json'
with open(os.path.join(directory, filename), 'w') as outfile:
  json.dump(val_squad, outfile)
filename = 'test.json'
with open(os.path.join(directory, filename), 'w') as outfile:
  json.dump(test_squad, outfile)

### 2.4 Run training

Train the model on the training set and evaluate it on the evaluation set using the Hugging Face transformers library, which provides a `run_squad.py` script to fine-tune its models for question answering on our own (SQuAD-like) data. The script also converts the SQuAD examples to features before training and saves the tensorboard events file under the `runs` folder using tensorboardX. To train with fp16 (mixed precision), make sure apex is installed first. The model will be stored in the `model_output` folder.

*Approximate running time on GPU provided by Colab for 1 epoch: 51min*

In [0]:
!git clone https://github.com/NVIDIA/apex
!pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

In [0]:
!python transformers/examples/run_squad.py \
  --model_type albert \
  --model_name_or_path albert-xxlarge-v1 \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file /content/dataset/train.json \
  --predict_file /content/dataset/val.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 3.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/model_output \
  --save_steps 1000 \
  --overwrite_output_dir \
  --fp16

With more compute, we could eventually try `albert-xxlarge-v2` (12 repeating layers, 128 embedding, 4096 hidden, 64 heads, 223M parameters; the ALBERT xxlarge model with no dropout, additional training data and longer training), or `albert-xxlarge-v1` (v1 seems to do slightly better on SQuAD than v2 for xxlarge). More info at https://github.com/google-research/ALBERT. Still to do: run k-fold cross validation, training one model per fold, and at inference time take the average of the start and end logits.
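
The logit averaging mentioned above could look roughly like the sketch below. It assumes that every fold model uses the same tokenizer and `max_seq_length`, so that the logits line up token by token; `average_fold_logits` is a hypothetical helper and the numbers are made up.

```python
import numpy as np


def average_fold_logits(fold_start_logits, fold_end_logits):
    """Average start/end logits over folds.

    Each argument is a list with one array per fold, shaped (n_examples, seq_len).
    """
    start = np.mean(np.stack(fold_start_logits), axis=0)
    end = np.mean(np.stack(fold_end_logits), axis=0)
    return start, end


# Toy logits: 2 folds, 1 example, 4 tokens (made-up values)
fold_start = [np.array([[0.0, 2.0, 1.0, 0.0]]), np.array([[0.0, 1.0, 3.0, 0.0]])]
fold_end = [np.array([[0.0, 0.0, 1.0, 2.0]]), np.array([[0.0, 0.0, 1.0, 4.0]])]

avg_start, avg_end = average_fold_logits(fold_start, fold_end)
print(avg_start, avg_end)
print(avg_start.argmax(axis=1), avg_end.argmax(axis=1))
```

In this toy case the two folds disagree on the best start token, and averaging settles on the token the second fold favours more strongly.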

### 2.5 Check similarity using Jaccard index 
The Jaccard index is the final evaluation metric. It is defined as the size of the intersection divided by the size of the union of the two sets of words: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
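
As a quick worked example with a made-up pair of strings (split on whitespace, as in the function below), take the true selected text "is amazing" and a predicted text "dog is amazing":

$$A = \{\text{is}, \text{amazing}\}, \quad B = \{\text{dog}, \text{is}, \text{amazing}\}, \quad J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{3} \approx 0.67$$

A prediction that matches the selected text word for word scores 1, and one that shares no words with it scores 0.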

In [0]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

In [0]:
import json
with open('model_output/predictions_.json', 'r') as f:
  eval_predictions = json.load(f)

In [0]:
with open('dataset/val.json', 'r') as f:  # saved under the dataset folder above
  eval_truth = json.load(f)

In [0]:
def get_jaccard_score(eval_predictions, eval_truth):
  eval_truth_list = [item for topic in eval_truth['data'] for item in topic['paragraphs']] 
  train_score = {'neutral':[], 'positive':[], 'negative':[], 'total':[]}

  for idx in range(len(eval_truth_list)):
    q_id = eval_truth_list[idx]['qas'][0]['id']
    answer = eval_truth_list[idx]['qas'][0]['answers'][0]['text']
    sentiment = eval_truth_list[idx]['qas'][0]['question']

    score = jaccard(answer, eval_predictions[q_id])

    train_score[sentiment].append(score)
    train_score['total'].append(score)

  for sentiment in ['neutral', 'positive', 'negative', 'total']:
    score = np.array(train_score[sentiment])
    print(sentiment + ' - ' + str(len(score)) + ' examples, average score: ' + str(score.mean()))

In [0]:
get_jaccard_score(eval_predictions, eval_truth)

## 3.0 Train model with K-Fold cross validation

In [0]:
def run_training(model_folder):
  # --overwrite_cache makes run_squad.py rebuild its cached features for every fold;
  # otherwise the features cached for the first fold would be reused
  !python transformers/examples/run_squad.py \
  --model_type albert \
  --model_name_or_path albert-base-v2 \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file dataset/train.json \
  --predict_file dataset/val.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 3.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $model_folder \
  --save_steps 1000 \
  --threads 4 \
  --overwrite_cache \
  --fp16

In [0]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=32)
kf.get_n_splits(train_df_full) 
print(kf)

# enumerate so that each fold gets its own index (and its own output folder below)
for fold, (train_index, val_index) in enumerate(kf.split(train_df_full)):
  print(f'Fold: {fold}')
  #print("TRAIN:", train_index, "VAL:", val_index)
  train_fold = train_df_full.iloc[train_index, :]
  val_fold = train_df_full.iloc[val_index, :]

  # convert to squad datasets
  print()
  print('Converting to SQuAD datasets...')
  train_squad, val_squad = squad(np.array(train_fold)), squad(np.array(val_fold))

  # save as json files
  directory = 'dataset'
  if not os.path.exists(directory):
    os.makedirs(directory)
  filename = 'train.json'
  with open(os.path.join(directory, filename), 'w') as outfile:
    json.dump(train_squad, outfile)
  filename = 'val.json'
  with open(os.path.join(directory, filename), 'w') as outfile:
    json.dump(val_squad, outfile)

  # run training and save model
  print()
  print('Running training and evaluation...')
  output_folder = '/content/model'+str(fold)
  run_training(output_folder)

  # calculate Jaccard score
  print()
  print('Calculating Jaccard scores...')
  prediction_file = output_folder + '/predictions_.json'
  with open(prediction_file, 'r') as f:
    eval_predictions = json.load(f) 
  get_jaccard_score(eval_predictions, val_squad)

## 4.0 Setup prediction code


Process the tweets and output the features necessary for model inference.

In [0]:
with open('dataset/test.json', 'r') as f:  # saved under the dataset folder above
  test = json.load(f)

In [0]:
test_data = [item for topic in test['data'] for item in topic['paragraphs']]

For this next part, much of the code is adapted from https://github.com/spark-ming/albert-qa-demo/ (a great demo). It retrofits parts of `run_squad.py` and uses other `squad_metrics` functions to set up the model configuration, process the tweets from the test set into the features the model needs, run predictions, construct a `SquadResult` from the logits corresponding to the start and end of the answer, and finally save the predictions to files.

### 4.1 Configuration

In [0]:
import torch
import time
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features
)
from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample
from transformers.data.metrics.squad_metrics import compute_predictions_logits

model_name_or_path = "/content/model_output"
output_dir = ""

# Config
n_best_size = 1 # choose best prediction
max_answer_length = 192 #30
do_lower_case = True
null_score_diff_threshold = 0.0

def to_list(tensor):
    return tensor.detach().cpu().tolist()

# Setup model
config_class, model_class, tokenizer_class = (
    AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer)
config = config_class.from_pretrained(model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(
    model_name_or_path, do_lower_case=do_lower_case)
model = model_class.from_pretrained(model_name_or_path, config=config)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
processor = SquadV2Processor()

### 4.2 Get predictions

In [0]:
def run_prediction(test_data):
    """Setup test data"""
    examples = []

    for i, entry in enumerate(test_data):
        example = SquadExample(
            qas_id=entry['qas'][0]['id'],
            question_text=entry['qas'][0]['question'],
            context_text=entry['context'],
            answer_text=None,
            start_position_character=None,
            title="Predict",
            is_impossible=False,
            answers=None,
        )

        examples.append(example)

    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )

    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=10)
    all_results = []
    
    model.eval()  # inference mode; only needs to be set once, before the loop
    for batch in eval_dataloader:
        batch = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }
            example_indices = batch[3]
            outputs = model(**inputs)
            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                unique_id = int(eval_feature.unique_id)
                output = [to_list(output[i]) for output in outputs]
                start_logits, end_logits = output
                result = SquadResult(unique_id, start_logits, end_logits)
                all_results.append(result)

    output_prediction_file = "predictions.json"
    output_nbest_file = "nbest_predictions.json"
    output_null_log_odds_file = "null_predictions.json"

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size,
        max_answer_length,
        do_lower_case,
        output_prediction_file,
        output_nbest_file,
        output_null_log_odds_file,
        False,  # verbose_logging
        True,  # version_2_with_negative
        null_score_diff_threshold,
        tokenizer,
    )

    return predictions

### 4.3 Run inference

In [0]:
predictions = run_prediction(test_data)

In [0]:
with open('predictions.json', 'r') as f:
  test_predictions = json.load(f)

In [0]:
test_items = test_predictions.items()
test_list = list(test_items)

test_df = pd.DataFrame(test_list, columns=(['textID', 'selected_text']))
test_df.head()