## Text Classification

-----------
https://www.kaggle.com/competitions/nlp-txt-classification

The goal is to classify tweets into one of 5 sentiment categories: Extremely Negative, Negative, Neutral, Positive, Extremely Positive.

This falls into the "Classifying whole sentences" category of common NLP tasks.

------------

Let's start by importing the modules we are going to use and setting the `cuda` device.

In [1]:
import gc

from tqdm import tqdm
import numpy as np
import pandas as pd

import torch

In [2]:
# Cuda maintenance
gc.collect()
torch.cuda.empty_cache()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Torch device: ", device)

Torch device:  cuda


### 0. Read and manually preprocess input data

---

In [4]:
df = pd.read_csv('../data/train.csv')  
df_test = pd.read_csv('../data/test.csv')  

In [5]:
df.tail()

Unnamed: 0.1,Unnamed: 0,Text,Sentiment
41154,41152,Airline pilots offering to stock supermarket s...,Neutral
41155,41153,Response to complaint not provided citing COVI...,Extremely Negative
41156,41154,You know itÂs getting tough when @KameronWild...,Positive
41157,41155,Is it wrong that the smell of hand sanitizer i...,Neutral
41158,41156,@TartiiCat Well new/used Rift S are going for ...,Negative


In [6]:
# Example of Positive tweets:
df[df.Sentiment == 'Positive'].Text

1        advice Talk to your neighbours family to excha...
2        Coronavirus Australia: Woolworths to give elde...
3        My food stock is not the only one which is emp...
5        As news of the regionÂs first confirmed COVID...
6        Cashier at grocery store was sharing his insig...
                               ...                        
41142    Good News! \r\r\nWe'll Soon Announce Our High ...
41147    How exactly are we going to re-open New York C...
41148    #Gold prices rose to a more than 7-year high t...
41152    I never that weÂd be in a situation &amp; wor...
41156    You know itÂs getting tough when @KameronWild...
Name: Text, Length: 11422, dtype: object

In [7]:
# Example of Neutral tweets:
df[df.Sentiment == 'Neutral'].Text

0        @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...
7        Was at the supermarket today. Didn't buy toile...
10       All month there hasn't been crowding in the su...
16       ????? ????? ????? ????? ??\r\r\n?????? ????? ?...
17       @eyeonthearctic 16MAR20 Russia consumer survei...
                               ...                        
41143    #Coronavirus ?? ????? ??? ????? ?? ??? ???????...
41145    https://t.co/8s4vKvcO1r #5gtowers?? #EcuadorUn...
41146    @_Sunrise_SV @Gamzap @NPR What does not having...
41154    Airline pilots offering to stock supermarket s...
41157    Is it wrong that the smell of hand sanitizer i...
Name: Text, Length: 7711, dtype: object

In [8]:
# Example of Negative tweets:
df[df.Sentiment == 'Negative'].Text

9        For corona prevention,we should stop to buy th...
24       @10DowningStreet @grantshapps what is being do...
26       In preparation for higher demand and a potenti...
28       Do you see malicious price increases in NYC? T...
30       There Is of in the Country  The more empty she...
                               ...                        
41129    Today at the grocery store I saw someone getti...
41133    In every human affliction there are  gainers a...
41149    YÂall really shitting that much more at home?...
41151    Still shocked by the number of #Toronto superm...
41158    @TartiiCat Well new/used Rift S are going for ...
Name: Text, Length: 9917, dtype: object

In [9]:
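labels = df["Sentiment"].unique()
# Note: unique() runs before dropna() below, so `labels` still contains NaN
# and num_labels is 6 rather than 5 (the NaN slot is never used as a target).
num_labels = len(labels)
labels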

array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
       'Extremely Positive', nan], dtype=object)

In [10]:
df = df.dropna().drop_duplicates().reset_index(drop=True)
df = df.drop(["Unnamed: 0"], axis=1)
df.rename(columns={"Sentiment": "label"}, inplace=True)
df.rename(columns={"Text": "text"}, inplace=True)
# Both columns go into a single dtype mapping; a second dict would be read as the `copy` argument
df = df.astype({"text": str, "label": str})

df["label"] = df["label"].apply(lambda x: np.where(labels == x)[0][0])
df.head()

Unnamed: 0,text,label
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,0
1,advice Talk to your neighbours family to excha...,1
2,Coronavirus Australia: Woolworths to give elde...,1
3,My food stock is not the only one which is emp...,1
4,"Me, ready to go at supermarket during the #COV...",2
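
The label encoding above relies on the position of each value in `labels`. As a minimal sketch of the same idea that skips missing values, the mapping can also be built explicitly. The helper `build_label_mapping` and the toy list below are hypothetical and only illustrate the technique; they are not part of the notebook's pipeline.

```python
import math

def build_label_mapping(raw_labels):
    """Map each distinct, non-missing label to an integer id in first-seen order."""
    label2id = {}
    for value in raw_labels:
        # Skip missing values (None or float NaN) so they never get a class id.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        if value not in label2id:
            label2id[value] = len(label2id)
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label

# Toy column standing in for df["Sentiment"]; the values are illustrative.
sentiments = ["Neutral", "Positive", float("nan"), "Negative", "Positive"]
label2id, id2label = build_label_mapping(sentiments)
print(label2id)       # {'Neutral': 0, 'Positive': 1, 'Negative': 2}
print(len(label2id))  # 3
```

With such a mapping the number of classes is simply `len(label2id)`, and `id2label` converts predicted ids back to names.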


In [11]:
df_test = df_test.astype({"Text": str})
df_test.rename(columns={"Text": "text"}, inplace=True)
df_test = df_test.drop(["id"], axis=1)

df_test.tail()

Unnamed: 0,text
3793,Meanwhile In A Supermarket in Israel -- People...
3794,Did you panic buy a lot of non-perishable item...
3795,Asst Prof of Economics @cconces was on @NBCPhi...
3796,Gov need to do somethings instead of biar je r...
3797,I and @ForestandPaper members are committed to...


### 1. Select tokenizer and pretrained model

----------------------------------------------------------



For this task we need an encoder-only model, since we only need to understand the input.
Let's use the **distilbert-base-uncased-finetuned-sst-2-english** model for classification.
Its description and hyper-parameters can be found here: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

<br/>

I also tried the https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2-mnli model; however, after loading it onto the GPU alongside the data, there was not enough memory left
to perform training.

<br/>

In [13]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification

In [14]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint, num_labels=num_labels)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, 
                                                           config=config, 
                                                           ignore_mismatched_sizes=True).to(device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([6, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([6]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
from datasets import Dataset

In [16]:
from sklearn.model_selection import train_test_split

train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)
train_dataset = Dataset.from_dict(train_df.to_dict('list'))
eval_dataset = Dataset.from_dict(eval_df.to_dict('list'))
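
To make the 80/20 split concrete, here is a rough pure-Python sketch of a seeded shuffle split. It is not scikit-learn's implementation: `split_indices` is a hypothetical helper, and the row count 41155 is an assumption taken from the sum of the training and evaluation set sizes reported in the logs further down.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = math.ceil(test_size * n_samples)  # test part is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, eval_idx = split_indices(41155)
print(len(train_idx), len(eval_idx))  # 32924 8231
```

Fixing the seed matters here because the evaluation accuracy is only comparable between runs if both use the same held-out rows.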

In [17]:
def tokenize_function(examples, device):
    # Tokenize a batch of texts: truncate to the model's maximum length and
    # pad to the longest text in the batch passed in by `Dataset.map`
    return tokenizer(examples["text"],
                     padding=True,
                     truncation=True,
                     return_tensors="pt").to(device)

# Bind the device so that the function takes a single argument, as `map` expects
tokenize_function_on_device = lambda examples: tokenize_function(examples, device)

In [18]:
# speed up the map function by setting batched=True
# to process multiple elements of the dataset at once
tokenized_train_dataset = train_dataset.map(
    tokenize_function_on_device, batched=True
)

tokenized_eval_dataset = eval_dataset.map(
    tokenize_function_on_device, batched=True
)

  0%|          | 0/33 [00:00<?, ?ba/s]

  0%|          | 0/9 [00:00<?, ?ba/s]

In [19]:
from transformers import DataCollatorWithPadding
# Data Collator is used to create a batch of examples.
# It will dynamically pad text to the length of the longest element in a batch

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
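
To illustrate what "dynamically pad" means, here is a minimal pure-Python sketch of padding one batch to its own longest sequence. The helper `pad_batch`, the padding id 0 and the token ids are assumptions for illustration only; the real collator works on tokenizer outputs and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids, not real tokenizer output
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Because the target length is chosen per batch, batches of short tweets stay short instead of being padded to the longest text in the whole dataset.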

In [20]:
import evaluate

# Load the metric once instead of on every evaluation call
metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return metric.compute(predictions=predictions, references=labels)
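
For reference, the accuracy computed here is just the share of examples whose highest logit matches the true label. A small pure-Python sketch with made-up logits (the function `accuracy_from_logits` and the toy data are hypothetical):

```python
def accuracy_from_logits(logits, references):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    correct = 0
    for row, target in zip(logits, references):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == target)
    return correct / len(references)

# Toy logits for 4 examples and 3 classes; the third example is misclassified
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.9, 1.1, 0.0], [1.2, 0.4, 0.1]]
toy_labels = [0, 2, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```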

In [21]:
from transformers import Trainer
from transformers import TrainingArguments

output_dir = "./results"

# Fine-tuning hyper-parameters: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
training_args = TrainingArguments(
    output_dir=output_dir, # where the model predictions and checkpoints will be written.     
    evaluation_strategy="epoch", # evaluation is done at the end of each epoch
    num_train_epochs=5,
    #learning_rate=5e-5, # same as default. With 1e-5 accuracy on third epoch is 0.8
    per_device_train_batch_size=16, # increasing the batch size to 32 leads to "RuntimeError: CUDA error: out of memory"
    per_device_eval_batch_size=32,
    warmup_steps=500, # number of steps used for a linear warmup
    weight_decay=0.01, # to reduce overfitting
)

trainer = Trainer(
    model, # pretrained model
    training_args, # arguments to tweak for training
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    data_collator=data_collator, # the function to use to form a batch from a list of elements of train_dataset or eval_dataset
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
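
To see what `warmup_steps=500` does to the learning rate, here is a rough sketch of a linear warmup followed by linear decay, assuming the default linear schedule and the default learning rate of 5e-5 mentioned in the comment above. The two helper functions are hypothetical; the example count (32924) comes from the training log of the next cell.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = total_optimization_steps(num_examples=32924, batch_size=16, num_epochs=5)
print(total)  # 10290

base_lr = 5e-5  # assumed default learning rate
for step in (0, 250, 500, 5395, 10290):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps=500, total_steps=total))
```

The multiplier reaches 1.0 exactly at step 500, so with these settings the warmup covers roughly the first quarter of the first epoch.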

In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 32924
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 10290
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5489,0.563454,0.787268
2,0.3852,0.482872,0.827238
3,0.2394,0.446607,0.860041
4,0.1605,0.499279,0.876929
5,0.0981,0.620188,0.869518


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke

***** Running Evaluation *****
  Num examples = 8231
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=10290, training_loss=0.34690826516109724, metrics={'train_runtime': 2025.6569, 'train_samples_per_second': 81.267, 'train_steps_per_second': 5.08, 'total_flos': 1.0214015704462224e+16, 'train_loss': 0.34690826516109724, 'epoch': 5.0})
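
The per-epoch table above shows the training loss falling steadily while the validation loss starts rising after the third epoch, which suggests overfitting in the later epochs. A small sketch that picks the best epoch under each criterion, with the numbers copied from that table:

```python
# (epoch, training loss, validation loss, accuracy) copied from the table above
history = [
    (1, 0.5489, 0.563454, 0.787268),
    (2, 0.3852, 0.482872, 0.827238),
    (3, 0.2394, 0.446607, 0.860041),
    (4, 0.1605, 0.499279, 0.876929),
    (5, 0.0981, 0.620188, 0.869518),
]

best_by_loss = min(history, key=lambda row: row[2])  # lowest validation loss
best_by_acc = max(history, key=lambda row: row[3])   # highest accuracy

print(best_by_loss[0])  # 3
print(best_by_acc[0])   # 4
```

The two criteria disagree by one epoch, so which checkpoint counts as "best" depends on the metric chosen for model selection.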

In [23]:
test_dataset = Dataset.from_dict(df_test.to_dict('list'))
tokenized_test_dataset = test_dataset.map(
    tokenize_function_on_device, batched=True
)

  0%|          | 0/4 [00:00<?, ?ba/s]

In [24]:
predictions = trainer.predict(tokenized_test_dataset)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3798
  Batch size = 32


In [25]:
predicted_probabilities = torch.nn.functional.softmax(torch.from_numpy(predictions.predictions),
                                                      dim=-1)


In [26]:
predicted_labels = []
for prediction in predicted_probabilities:
    max_idx = prediction.argmax().item()
    predicted_labels.append(labels[max_idx])
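
Softmax turns logits into probabilities but does not change their order, so the predicted class is the same whether we take the argmax of the logits or of the probabilities. A minimal pure-Python sketch with made-up logits (the helpers and the logit values are illustrative; the class names follow the order of `labels` above):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

class_names = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]
toy_logits = [0.2, 1.7, -0.5, 0.1, 3.0]  # made-up logits for one tweet
probs = softmax(toy_logits)

print(argmax(probs) == argmax(toy_logits))  # True
print(class_names[argmax(probs)])           # Extremely Positive
```

The probabilities are still useful when a confidence score per prediction is needed; for the submission file only the argmax matters.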

In [28]:
sample_submission = pd.read_csv('../data/sample_submission.csv')
sample_submission.head()

Unnamed: 0,id,Sentiment
0,787bc85b-20d4-46d8-84a0-562a2527f684,Neutral
1,17e934cd-ba94-4d4f-9ac0-ead202abe241,Neutral
2,5914534b-2b0f-4de8-bb8a-e25587697e0d,Neutral
3,cdf06cfe-29ae-48ee-ac6d-be448103ba45,Neutral
4,aff63979-0256-4fb9-a2d9-86a3d3ca5470,Neutral


In [29]:
sample_submission['Sentiment'] = predicted_labels
sample_submission.to_csv('test_submission.csv', index=False)
sample_submission.head(10)

Unnamed: 0,id,Sentiment
0,787bc85b-20d4-46d8-84a0-562a2527f684,Extremely Negative
1,17e934cd-ba94-4d4f-9ac0-ead202abe241,Positive
2,5914534b-2b0f-4de8-bb8a-e25587697e0d,Extremely Positive
3,cdf06cfe-29ae-48ee-ac6d-be448103ba45,Negative
4,aff63979-0256-4fb9-a2d9-86a3d3ca5470,Neutral
5,b130f7fb-7048-48e6-a8af-57bb56ac1e27,Neutral
6,db72c632-8719-4847-b7f2-a89af05e1504,Positive
7,e45239d8-4dcf-4685-a955-a9a08ca829ee,Neutral
8,2854b1b2-5a41-4002-90d3-17fe77a3a78e,Extremely Negative
9,ff9be7e1-81a9-4c07-beda-4fee9a923f5e,Extremely Positive


## Result

Accuracy on test dataset: 0.861

