# Environment Setup
In this step we set up the environment:
1.   Transformers Model Installation
2.   Hyper Parameter Tuning Library Installation
3.   Colab Setup


In [None]:
# Install Transformers & related libraries
! pip install transformers
! pip install datasets

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/db/98c3ea1a78190dac41c0127a063abf92bd01b4b0b6970a6db1c2f5b66fa0/transformers-4.0.1-py3-none-any.whl (1.4MB)

In [None]:
# Install hyperparameter tuning & related libraries
! pip install optuna
# quote the extras specifier so the shell does not treat the brackets as a glob pattern
! pip install "ray[tune]"

Collecting optuna
  Downloading https://files.pythonhosted.org/packages/87/10/06b58f4120f26b603d905a594650440ea1fd74476b8b360dbf01e111469b/optuna-2.3.0.tar.gz (258kB)

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sat Dec 12 20:23:10 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    24W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.4 gigabytes of available RAM

You are using a high-RAM runtime!


In [None]:
# Colab setup
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# Data Load
Load the training and test datasets as pandas dataframes.

In [None]:
import os
import json
import pandas as pd
import numpy as np

In [None]:
datapath = r'/content/drive/My Drive/mcsds/cs-410-text-mining/project/ClassificationCompetition/data'
train_pddf = pd.read_json(datapath+'/train.jsonl', lines=True)
test_pddf = pd.read_json(datapath+'/test.jsonl', lines=True)

In [None]:
train_pddf

Unnamed: 0,label,response,context
0,SARCASM,@USER @USER @USER I don't get this .. obviousl...,[A minor child deserves privacy and should be ...
1,SARCASM,@USER @USER trying to protest about . Talking ...,[@USER @USER Why is he a loser ? He's just a P...
2,SARCASM,@USER @USER @USER He makes an insane about of ...,[Donald J . Trump is guilty as charged . The e...
3,SARCASM,@USER @USER Meanwhile Trump won't even release...,[Jamie Raskin tanked Doug Collins . Collins lo...
4,SARCASM,@USER @USER Pretty Sure the Anti-Lincoln Crowd...,[Man ... y ’ all gone “ both sides ” the apoca...
...,...,...,...
4995,NOT_SARCASM,@USER You don't . I have purchased a lot on Am...,[@USER Apologies for the inconvenience you fac...
4996,NOT_SARCASM,@USER #Emotions you say 🤔 never knew that I th...,"[@USER 🤔 idk tho , I think I ’ m #hungry . But..."
4997,NOT_SARCASM,"@USER @USER @USER You are so right ... "" Yes !...","[@USER @USER @USER Peace to you , and two coun..."
4998,NOT_SARCASM,@USER @USER @USER Another lazy delusional vote...,[Bernie Sanders told Elizabeth Warren in priva...


# Data Preprocessing

### Feature Engineering
Create new features:
* Last Response: extract the last tweet from the context. The current response was written in reply to it, so it can be treated separately.
* Context Reversed: reverse the context before feeding it to the transformer, so that the most recent tweets come first, receive more attention, and are the ones retained when a long context is truncated.
* All: combine the response and the reversed context into a single string.
* Response + Last Response: combine the current response and the last response into a single string.
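
The derived fields listed above can be sketched on a single hand-made record without pandas. This is only an illustration: the helper name `build_features` and the example tweets are invented here, and the sketch assumes the context list is ordered oldest first, as the use of `context[-1]` for the last response suggests.

```python
# Illustrative only: the same derived fields on one invented record.
def build_features(response, context):
    """Derive the extra text fields from a response and its context list."""
    last_response = context[-1]             # tweet the response replies to
    context_rev = " ".join(context[::-1])   # newest context tweet first
    return {
        "last_response": last_response,
        "context_rev": context_rev,
        "all": response + " " + context_rev,
        "response_last_response": response + " " + last_response,
    }

example = build_features(
    response="@USER sure , great idea",
    context=["first tweet", "second tweet", "third tweet"],
)
print(example["context_rev"])  # third tweet second tweet first tweet
```

Note the explicit `" "` separator in the combined fields; without it the last word of the response and the first word of the context would fuse into one token.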



In [None]:
# Extract Last Response from the context as a new feature
train_pddf['last_response'] = train_pddf.apply(lambda x: x.context[-1], axis=1)
test_pddf['last_response'] = test_pddf.apply(lambda x: x.context[-1], axis=1)


In [None]:
# reversed context and then concat as a single string
train_pddf['context_rev'] = train_pddf.apply(lambda x: " ".join(x.context[::-1]), axis=1)
test_pddf['context_rev'] = test_pddf.apply(lambda x: " ".join(x.context[::-1]), axis=1)


In [None]:
# all combined together: response + reversed context (space-separated in both sets)
train_pddf['all'] = train_pddf['response'] + " " + train_pddf['context_rev']
test_pddf['all'] = test_pddf['response'] + " " + test_pddf['context_rev']


In [None]:
# response & last response combined together (space-separated in both sets)
train_pddf['response_last_response'] = train_pddf['response'] + " " + train_pddf['last_response']
test_pddf['response_last_response'] = test_pddf['response'] + " " + test_pddf['last_response']


In [None]:
# encode the labels as integers: SARCASM -> 1, NOT_SARCASM -> 0
train_pddf['label'] = np.where(train_pddf['label'] == "SARCASM", 1, 0)

### Sequence Structuring
Define how the tweets are structured as model input. Two approaches are used:
* Single sequence: the response only, all tweets combined together, or the current and last response combined.
* Sentence pair: (Current, Last Response) or (Current, Context Reversed).
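
A string-level sketch of the two approaches may help. It assumes RoBERTa's convention for special tokens (`<s> A </s>` for one sequence, `<s> A </s></s> B </s>` for a pair); BERT-style models use `[CLS]` and `[SEP]` instead. The helper names and tweets are invented for the illustration, and a real tokenizer inserts these markers as token IDs rather than as text.

```python
# Illustration only: how a single sequence and a sentence pair are laid out.
def as_single(text):
    return f"<s> {text} </s>"

def as_pair(first, second):
    return f"<s> {first} </s></s> {second} </s>"

response = "current tweet"
last_response = "previous tweet"

single = as_single(response + " " + last_response)
pair = as_pair(response, last_response)
print(single)  # <s> current tweet previous tweet </s>
print(pair)    # <s> current tweet </s></s> previous tweet </s>
```

In the pair layout the model can tell where the response ends and the context begins, which the single-sequence layout does not mark.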

In [None]:
# different structural combinations of the data to be given as input to the transformer
task_to_keys = {
    "response_last_response_sep": ("response", "last_response"),
    "response_context_sep": ("response", "context"),
    "response_context_rev_sep": ("response", "context_rev"),
    "response_only": ("response", None),
    "all_only": ("all", None),
    "response_last_response_only": ("response_last_response", None),
}

### Transform to Datasets
Convert the preprocessed dataframes to Hugging Face `datasets.Dataset` objects, the format used by the Transformers training utilities.

In [None]:
from datasets import Dataset, ClassLabel
train_ds = Dataset.from_pandas(train_pddf)

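# NOTE: the label column was encoded earlier as SARCASM -> 1, NOT_SARCASM -> 0,
# but this names list maps index 0 to "SARCASM". Training uses only the integers,
# so the mismatch affects only how label names are displayed.
classLabel = ClassLabel(num_classes = 2, names = ["SARCASM", "NOT_SARCASM"])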
features = train_ds.features
features['label'] = classLabel
train_ds.map(features=features)
print(train_ds.features)
final_ds = Dataset.from_pandas(test_pddf)


{'label': ClassLabel(num_classes=2, names=['SARCASM', 'NOT_SARCASM'], names_file=None, id=None), 'response': Value(dtype='string', id=None), 'context': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'last_response': Value(dtype='string', id=None), 'context_rev': Value(dtype='string', id=None), 'all': Value(dtype='string', id=None), 'response_last_response': Value(dtype='string', id=None)}


# Model Configuration
Configure the model checkpoint, input structure, train/validation/test splits, evaluation metric, batch size, and training hyperparameters:


1.   model_checkpoint: the pretrained model to use for text sequence classification. RoBERTa models gave the best performance in our runs.
2.   task: how to structure the input sequences, as described in the Sequence Structuring step. We observed the best performance with 'response_context_rev_sep', which structures the input as a sentence pair <response, context>: the response is the tweet to be classified, and the context is the preceding tweets in reverse order of occurrence.
3. metric_name: the metric used to select the best model during training. We set it to accuracy.
4. num_labels: 2, the number of classes (Sarcasm, Not Sarcasm).
5. batch_size: 16 for RoBERTa, 64 for BERT; with larger values we ran out of memory.
6. train_test_split: fraction of the training data held out from training.
7. test_valid_split: fraction of the held-out data used as the test set; the remainder is the validation set.
8. epochs: number of epochs to train the model for.
9. weight_decay: regularization strength; each update additionally shrinks the weights toward zero.
10. learning_rate: step size that determines how much each update changes the current value of the weights.
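
As a worked example of how the split fractions and batch size interact, the sketch below computes the split sizes and the number of optimizer steps for the values used in this notebook (5000 training rows, 0.1 and 0.05 split fractions, batch size 16, 3 epochs). It assumes each split size is the fraction times the row count, rounded, and a single training device.

```python
import math

n_rows = 5000            # rows in the training data
train_test_split = 0.1   # fraction held out from training
test_valid_split = 0.05  # fraction of the held-out rows used as the test set
batch_size = 16
epochs = 3

# Assumption: each split size is the fraction times the row count, rounded.
n_heldout = round(n_rows * train_test_split)
n_train = n_rows - n_heldout
n_test = round(n_heldout * test_valid_split)
n_valid = n_heldout - n_test

# The last partial batch of an epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_train, n_valid, n_test, total_steps)  # 4500 475 25 846
```

With these settings the test set holds only 25 examples, so metrics computed on it are coarse.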

In [None]:
#model_checkpoint = "bert-base-uncased"
#model_checkpoint = "roberta-large"
model_checkpoint = "roberta-base"
task = "response_context_rev_sep"
metric_name = "accuracy"
num_labels = 2
batch_size = 16
train_test_split = 0.1
test_valid_split = 0.05
epochs=3
learning_rate = 2e-5
weight_decay = 0.01

In [None]:
# setup the model arguments

from transformers import TrainingArguments
args = TrainingArguments(
    "test",
    evaluation_strategy = "epoch",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=weight_decay,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
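
The `argmax(-1)` call in `compute_metrics` turns the model's per-class scores (logits) into predicted class IDs. A plain-Python sketch with made-up logits, using a hypothetical helper `argmax_last_axis`, shows the idea:

```python
# Illustration with invented logits: one row per example, one column per class.
logits = [
    [0.2, 1.3],    # class 1 scores higher
    [2.1, -0.4],   # class 0 scores higher
    [-0.5, 0.1],   # class 1 scores higher
]

def argmax_last_axis(rows):
    """Index of the largest score in each row (a plain-Python argmax(-1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

preds = argmax_last_axis(logits)
print(preds)  # [1, 0, 1]
```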

# Tokenization
This step converts the text into tokens. The Transformers tokenizer splits the inputs into tokens, converts them to their corresponding IDs in the pretrained vocabulary, puts them in the format the model expects, and generates the other inputs the model requires.
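
To make the output format concrete, here is a toy sketch of what a tokenizer returns. It is not the real RoBERTa tokenizer, which uses byte-level BPE and a far larger vocabulary; the vocabulary, the helper `toy_encode`, and the sentence are all invented for the illustration.

```python
# Toy illustration only: a whitespace tokenizer with a tiny invented vocabulary.
TOY_VOCAB = {"<s>": 0, "</s>": 1, "<unk>": 2, "that": 3, "was": 4, "great": 5}

def toy_encode(text, max_length=None):
    """Map words to IDs, add start/end tokens, and optionally truncate."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]
    if max_length is not None:
        ids = ids[: max_length - 2]   # leave room for <s> and </s>
    input_ids = [TOY_VOCAB["<s>"]] + ids + [TOY_VOCAB["</s>"]]
    # attention_mask marks real tokens with 1 (padding would be 0)
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_encode("That was great honestly")
truncated = toy_encode("That was great honestly", max_length=4)
print(encoded)    # {'input_ids': [0, 3, 4, 5, 2, 1], 'attention_mask': [1, 1, 1, 1, 1, 1]}
print(truncated)  # {'input_ids': [0, 3, 4, 1], 'attention_mask': [1, 1, 1, 1]}
```

Truncation keeps the beginning of the text, which is why the order of the tweets in the input matters when the context is long.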

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [None]:
sentence1_key, sentence2_key = task_to_keys[task]

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)


In [None]:
train_valid_ds = train_ds.train_test_split(test_size=train_test_split, seed=1234)
encoded_ds = train_valid_ds.map(preprocess_function, batched=True)

valid_test_encoded_ds = encoded_ds["test"].train_test_split(test_size = test_valid_split)

train_encoded_ds = encoded_ds["train"]
valid_encoded_ds = valid_test_encoded_ds["train"]
test_encoded_ds = valid_test_encoded_ds["test"]


In [None]:
# sample code to validate the tokenization output
from collections import Counter
print(train_encoded_ds)
print(valid_encoded_ds)
print(test_encoded_ds)

print(Counter(train_encoded_ds['label']))
print(Counter(valid_encoded_ds['label']))
print(Counter(test_encoded_ds['label']))


Dataset({
    features: ['all', 'attention_mask', 'context', 'context_rev', 'input_ids', 'label', 'last_response', 'response', 'response_last_response'],
    num_rows: 4500
})
Dataset({
    features: ['all', 'attention_mask', 'context', 'context_rev', 'input_ids', 'label', 'last_response', 'response', 'response_last_response'],
    num_rows: 475
})
Dataset({
    features: ['all', 'attention_mask', 'context', 'context_rev', 'input_ids', 'label', 'last_response', 'response', 'response_last_response'],
    num_rows: 25
})
Counter({0: 2257, 1: 2243})
Counter({1: 242, 0: 233})
Counter({1: 15, 0: 10})


In [None]:
print(train_encoded_ds.features)
print(train_encoded_ds[0])
print(tokenizer.decode(train_encoded_ds[0]['input_ids']))

{'all': Value(dtype='string', id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'context': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'context_rev': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'label': ClassLabel(num_classes=2, names=['SARCASM', 'NOT_SARCASM'], names_file=None, id=None), 'last_response': Value(dtype='string', id=None), 'response': Value(dtype='string', id=None), 'response_last_response': Value(dtype='string', id=None)}
{'all': '@USER some in Billy rays one and only hit . @USER Her Daddy must be so proud of the post-adolescent social miscreant she\'s turned out to be ! Tragic . #Feminist Amy Schumer was " just joking " about moving to Canada if #Trump won . Miley Cyrus says was high on #crack <URL>', 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

# Single Model Fine Tuning
Download the pretrained model and fine-tune it with the arguments configured in the previous step.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_encoded_ds,
    eval_dataset=valid_encoded_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.464741,0.806316,0.812245,0.802419,0.822314
2,0.457755,0.425604,0.814737,0.812766,0.837719,0.789256
3,0.457755,0.424817,0.844211,0.846473,0.85,0.842975


TrainOutput(global_step=846, training_loss=0.3839202772641013)

In [None]:
trainer.evaluate()

{'epoch': 3.0,
 'eval_accuracy': 0.8442105263157895,
 'eval_f1': 0.846473029045643,
 'eval_loss': 0.4248165488243103,
 'eval_precision': 0.85,
 'eval_recall': 0.8429752066115702}

In [None]:
import torch
model_save = trainer.model
torch.save(model_save.state_dict(), 'check_point.pth')

# Test Validation
Validate the results on test data and compute the metrics.

In [None]:
test_result = trainer.predict(test_dataset=test_encoded_ds)

In [None]:
compute_metrics(test_result)

{'accuracy': 0.8,
 'f1': 0.8275862068965518,
 'precision': 0.8571428571428571,
 'recall': 0.8}
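
As a sanity check, the reported test metrics can be reproduced from confusion counts. The counts below are inferred, not logged by the notebook: they are the counts consistent with the 25-example test split (15 examples with label 1 and 10 with label 0, per the label counts printed earlier) and the numbers above.

```python
# Counts inferred from the reported metrics and the test-split label counts.
tp, fn = 12, 3   # label-1 examples predicted correctly / missed
fp, tn = 2, 8    # label-0 examples predicted wrongly / correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With only 25 test examples, a single misclassification moves accuracy by 4 percentage points, so these figures are a rough indication at best.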

In [None]:
test_pddf.head()

Unnamed: 0,id,response,context,last_response,context_rev,all,response_last_response
0,twitter_1,"@USER @USER @USER My 3 year old , that just fi...","[Well now that ’ s problematic AF <URL>, @USER...",@USER @USER @USER No .. he actually in the gif...,@USER @USER @USER No .. he actually in the gif...,"@USER @USER @USER My 3 year old , that just fi...","@USER @USER @USER My 3 year old , that just fi..."
1,twitter_2,@USER @USER How many verifiable lies has he to...,[Last week the Fake News said that a section o...,@USER The mainstream media doesn't report the ...,@USER The mainstream media doesn't report the ...,@USER @USER How many verifiable lies has he to...,@USER @USER How many verifiable lies has he to...
2,twitter_3,@USER @USER @USER Maybe Docs just a scrub of a...,[@USER Let ’ s Aplaud Brett When he deserves i...,@USER @USER He did try keep korkmaz in in the ...,@USER @USER He did try keep korkmaz in in the ...,@USER @USER @USER Maybe Docs just a scrub of a...,@USER @USER @USER Maybe Docs just a scrub of a...
3,twitter_4,@USER @USER is just a cover up for the real ha...,[Women generally hate this president . What's ...,@USER I've hated him before he was placed in o...,@USER I've hated him before he was placed in o...,@USER @USER is just a cover up for the real ha...,@USER @USER is just a cover up for the real ha...
4,twitter_5,@USER @USER @USER The irony being that he even...,"[Dear media Remoaners , you excitedly sharing ...",@USER @USER Quite an articulate and considered...,@USER @USER Quite an articulate and considered...,@USER @USER @USER The irony being that he even...,@USER @USER @USER The irony being that he even...


# Final Result Generation

In [None]:
final_encoded_ds = final_ds.map(preprocess_function, batched=True)


In [None]:
final_result = trainer.predict(test_dataset=final_encoded_ds)

In [None]:
final_preds = final_result.predictions.argmax(-1)

In [None]:
test_pddf["preds"] = final_preds

In [None]:
test_pddf['label']= np.where(test_pddf['preds'] == 1, "SARCASM", "NOT_SARCASM")

In [None]:
test_pddf[['id', 'label']].to_csv('answer.txt', index = False, header=False)
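
The shape of the resulting file can be sketched with the standard library alone. The IDs and predictions below are invented stand-ins for `test_pddf['id']` and `final_preds`, and an in-memory buffer stands in for `answer.txt`.

```python
import csv
import io

# Invented IDs and predictions, standing in for the real test data.
ids = ["twitter_1", "twitter_2", "twitter_3"]
preds = [1, 0, 1]

# Same mapping as above: 1 -> SARCASM, anything else -> NOT_SARCASM.
labels = ["SARCASM" if p == 1 else "NOT_SARCASM" for p in preds]

buffer = io.StringIO()   # in-memory stand-in for answer.txt
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(zip(ids, labels))   # one "id,label" row per tweet, no header
answer_text = buffer.getvalue()
print(answer_text)
```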

# Hyper Parameter Tuning

The Transformers Trainer supports hyperparameter search with the optuna or Ray Tune libraries, which we installed in the environment setup step. During hyperparameter tuning, the Trainer runs several trainings, so the model must be defined via a function (so it can be reinitialized for each new run) instead of being passed directly. The hyperparameter_search method returns a BestRun object, which contains the value of the maximized objective and the hyperparameters used for that run.
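
The search loop described above can be sketched as a plain random search. Nothing here is the Trainer's actual implementation: `ToyBestRun`, `toy_model_init`, `toy_train_and_eval`, and the objective function are invented to show why a fresh model is needed per trial and what kind of result comes back.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for the object returned by the search.
ToyBestRun = namedtuple("ToyBestRun", ["run_id", "objective", "hyperparameters"])

def toy_model_init():
    """Return a fresh 'model' so no state leaks between trials."""
    return {"trained_epochs": 0}

def toy_train_and_eval(model, learning_rate, epochs):
    """Invented objective that peaks near learning_rate = 3e-5."""
    model["trained_epochs"] += epochs
    return 1.0 - abs(learning_rate - 3e-5) * 1e4

def toy_search(n_trials, seed=0):
    rng = random.Random(seed)
    best = None
    for run_id in range(n_trials):
        params = {
            "learning_rate": rng.uniform(1e-5, 5e-5),
            "epochs": rng.randint(1, 5),
        }
        model = toy_model_init()   # re-initialized for every trial
        score = toy_train_and_eval(model, **params)
        if best is None or score > best.objective:   # maximize the objective
            best = ToyBestRun(str(run_id), score, params)
    return best

best_run_toy = toy_search(n_trials=10)
print(best_run_toy)
```

If the same model object were reused across trials, each trial would start from the previous trial's weights and the scores would not be comparable.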

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_encoded_ds,
    eval_dataset=valid_encoded_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

[I 2020-12-12 16:31:19,955] A new study created in memory with name: no-name-fb537655-0876-447f-8f4b-6a60fbcd3102

RuntimeError: ignored

# Best Run Selection & Training

To reproduce the best training run from the hyperparameter search, we set the best hyperparameters on the trainer's TrainingArguments before training the model again.
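
The mechanism is ordinary attribute assignment. A minimal sketch, using a `SimpleNamespace` and a plain dict as hypothetical stand-ins for `trainer.args` and `best_run.hyperparameters` (the values are invented):

```python
from types import SimpleNamespace

# Hypothetical stand-ins for trainer.args and best_run.hyperparameters.
toy_args = SimpleNamespace(learning_rate=2e-5, num_train_epochs=3, weight_decay=0.01)
toy_best_hyperparameters = {"learning_rate": 3.5e-5, "num_train_epochs": 4}

# Overwrite only the attributes the search tuned; the others keep their values.
for name, value in toy_best_hyperparameters.items():
    setattr(toy_args, name, value)

print(toy_args)
```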

In [None]:
best_run

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

In [None]:
trainer.evaluate()