# Text Classification using Simple Transformers (RoBERTa)


## Problem statement
The goal of this project is to predict whether a comment is sarcastic, based on roughly 1 million comments scraped from Reddit communities (subreddits). This is a binary classification problem, and it requires NLP techniques to turn the raw text into inputs for our ML/DL models.

## Goal of this notebook
Our goal is to build a model that predicts each comment's label (sarcastic or not). In this notebook, we use one of the most popular state-of-the-art models for text classification: RoBERTa, which stands for Robustly Optimized BERT Pretraining Approach. It was presented by researchers at Facebook AI and the University of Washington, whose goal was to optimize the pre-training procedure of the BERT architecture (for example by training longer, on more data, with larger batches). We use the simpletransformers library to keep the implementation as simple as possible.


## Structure of this notebook
0. Set-up
1. Install packages
2. Import and transform data
3. Create classification model
4. Evaluate the model
5. Make predictions on test data and export to Kaggle

# 0. Set-up

In [None]:
###################################################################################################
###
### Install all the necessary packages 
###
###################################################################################################

!pip install -r requirements.txt

In [None]:
###################################################################################################
###
### Import all the necessary packages and custom functions (from the functions.py file)
###
###################################################################################################

from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging
import sklearn

In [None]:
###################################################################################################
###
### Get the data from the Google Drive public folders
###
###################################################################################################

# Load train and test dataframes
# test_df = get_sarcasm_test_df()
# train_df = get_sarcasm_train_df()

train_df = pd.read_csv("train_df.csv")
test_df = pd.read_csv("test_df.csv")

### 1. Install Packages

In [24]:
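train_df = train_df[['id', 'label', 'comment']]

train_df = train_df.rename(columns={"comment": "text", "label":"labels"})
# Drop rows with a missing comment before casting: astype('str') can turn NaN into the literal string 'nan'
train_df = train_df.dropna(subset=['text'])
train_df['text'] = train_df['text'].astype('str')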

# Inspect dataset
train_df.head()

Unnamed: 0,id,labels,text
0,1,0,"""""I like my shortstops how I like my beef... i..."
1,2,1,He works in mysterious ways
2,3,0,You're right
3,4,0,Is this amount of meat in the ratio of meat to...
4,5,0,You can hug and kiss my ass X and O


In [25]:
test_df = test_df[['id', 'comment']]

test_df = test_df.rename(columns={"comment": "text"})
test_df['text'] = test_df['text'].astype('str')

# Inspect dataset
test_df.head()

Unnamed: 0,id,text
0,909745,Maybe they don't use one
1,909746,Ten feet higher!
2,909747,If you didn't want me to ask how you lost your...
3,909748,Gsync*
4,909749,Anyone feel like making Tux windows?


In [26]:
# Create a new virtual environment and install packages
# conda create -n st python pandas tqdm -y

In [27]:
# Activate the virtual environment
# conda activate st

In [28]:
# Install PyTorch and torchvision
# conda install pytorch torchvision -c pytorch

In [29]:
#Import the ST Classification Model and Data Wrangling Packages
from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging
import sklearn
from sklearn.model_selection import train_test_split

import torch
torch.multiprocessing.set_sharing_strategy('file_system')

In [30]:
# Configure logging: INFO by default, WARNING only for the transformers library
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

### 2. Import and Transform Data

In [31]:
initial_df = pd.read_csv('comments_reddit.csv')
df = initial_df[['id', 'label', 'comment']]

# Drop comments that are empty
df = df.dropna()

df = df.rename(columns={"comment": "text", "label":"labels"})
df['text'] = df['text'].astype('str')

# Inspect dataset
df.head()

Unnamed: 0,id,labels,text
0,909745,0,Maybe they don't use one
1,909746,0,Ten feet higher!
2,909747,1,If you didn't want me to ask how you lost your...
3,909748,1,Gsync*
4,909749,0,Anyone feel like making Tux windows?


In [32]:
# Drop comments that are empty
train_df = train_df.dropna()

# The columns were already renamed earlier; make sure the text column holds strings
train_df['text'] = train_df['text'].astype('str')

In [33]:
# Split the labelled data into train (70%) and test (30%) sets
train_df_old, test_df_old = train_test_split(df, test_size=0.3)
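The split above is unseeded, so `train_df_old` and `test_df_old` change on every run. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The sketch below shows the same idea in plain Python so that it runs without dependencies; it is a stand-in for illustration, not the notebook's method, and `stratified_split` and the toy `rows` are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="labels", test_size=0.3, seed=42):
    """Split rows into (train, test), keeping label proportions similar in both."""
    rng = random.Random(seed)  # fixed seed: the same split on every run
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy data standing in for the Reddit comments (hypothetical)
rows = [{"id": i, "labels": i % 2, "text": f"comment {i}"} for i in range(100)]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```

With a fixed seed, rerunning the cell gives the same rows in each part, which makes later metrics comparable across runs.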

In [34]:
test_df_old = test_df_old.drop(columns = 'labels')

In [35]:
# Check column dtypes and non-null counts
train_df_old.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70753 entries, 32573 to 100839
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      70753 non-null  int64 
 1   labels  70753 non-null  int64 
 2   text    70753 non-null  object
dtypes: int64(2), object(1)
memory usage: 2.2+ MB


In [36]:
train_df

Unnamed: 0,id,labels,text
0,1,0,"""""I like my shortstops how I like my beef... i..."
1,2,1,He works in mysterious ways
2,3,0,You're right
3,4,0,Is this amount of meat in the ratio of meat to...
4,5,0,You can hug and kiss my ass X and O
...,...,...,...
909739,909740,0,That is really sad ;(
909740,909741,0,Butt funny
909741,909742,0,#identityCrisis2015
909742,909743,0,A solid rebuttal.


In [37]:
train_df_testins = train_df[:200000]
train_df_testins

Unnamed: 0,id,labels,text
0,1,0,"""""I like my shortstops how I like my beef... i..."
1,2,1,He works in mysterious ways
2,3,0,You're right
3,4,0,Is this amount of meat in the ratio of meat to...
4,5,0,You can hug and kiss my ass X and O
...,...,...,...
200004,200005,0,I can watch old replays with LOLReplay2 just t...
200005,200006,1,But because at 10+ in a 63ish possession game ...
200006,200007,0,The Self.
200007,200008,1,Hey m'lady...


In [38]:
test_df_testins = test_df[:60000]
test_df_testins

Unnamed: 0,id,text
0,909745,Maybe they don't use one
1,909746,Ten feet higher!
2,909747,If you didn't want me to ask how you lost your...
3,909748,Gsync*
4,909749,Anyone feel like making Tux windows?
...,...,...
59998,969743,I'm thinking something in the world of an IBM ...
59999,969744,"You mean you're not worried about people """"hac..."
60000,969745,I'm SHOCKED!
60001,969746,Hopefully they'll give us some free stuff to a...


In [39]:
train_df_old

Unnamed: 0,id,labels,text
32573,942318,1,It's already priced in.
83060,992805,1,And people wonder why the US has an obesity pr...
68018,977763,1,Well you need to know if they are 13 or a girl...
65917,975662,0,The best way to handle this is as an owner is ...
51972,961717,1,Can't wait to see some HD Totally Platonic Cud...
...,...,...,...
43076,952821,1,You left out the
39458,949203,0,"Shh shh shh, here take your crayons it's okay."
88634,998379,1,"Because they shouldn't have to pay to get sex,..."
41122,950867,0,I believe so


In [40]:
#Check dataframe
# test_df

### 3. Create Classification Model

- I chose to use the RoBERTa classification model. Simple Transformers also supports other model types such as ALBERT, BERT, and BERTweet (for English tweets); see the full list of [models](https://simpletransformers.ai/docs/classification-specifics/).

In [41]:
# Create the classification model: RoBERTa architecture, roberta-base weights, CPU only
model = ClassificationModel(
    'roberta',
    'roberta-base',
    use_cuda=False,
    args={'reprocess_input_data': True, 'overwrite_output_dir': True},
)



Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

In [42]:
# Train the classification model on the first 200,000 training comments
model.train_model(train_df_testins)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/200000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_roberta_128_2_2


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/25000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(25000, 0.6617538500607014)
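The progress-bar totals in this notebook follow from the dataset size divided by the batch size. Below is a small sketch of that arithmetic, assuming a batch size of 8 for training, evaluation, and prediction. That value is inferred from the logs and is believed to be the Simple Transformers default; it is not set explicitly in the notebook. `batches_per_epoch` is an illustrative helper, not a library function.

```python
import math

def batches_per_epoch(n_examples, batch_size=8):
    """Number of batches needed to cover n_examples once (last batch may be partial)."""
    return math.ceil(n_examples / batch_size)

# Batch size 8 is an assumption inferred from the logged totals
train_steps = batches_per_epoch(200_000)  # rows passed to train_model
eval_steps = batches_per_epoch(909_697)   # rows passed to eval_model
pred_steps = batches_per_epoch(101_076)   # rows passed to predict

print(train_steps, eval_steps, pred_steps)
```

The first value matches the `0/25000` training bar and the first element of the tuple returned by `train_model`, which appears to be the number of optimisation steps. The other two match the evaluation (`113713`) and prediction (`12635`) bars that appear later in the notebook.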

### 4. Evaluate the Model

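- Evaluated on the full labelled set of 909,697 comments, the trained model reached an accuracy of about 68.5% (MCC 0.377, AUROC 0.747). Taking label 1 to mean sarcastic, it correctly identified 271,931 sarcastic comments and 351,571 non-sarcastic comments, flagged 103,294 non-sarcastic comments as sarcastic, and missed 182,901 sarcastic comments. This set includes the 200,000 comments used for training, so the figures are not a pure held-out estimate.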

In [43]:
# Evaluate the model on the full labelled set (train_df), adding accuracy as an extra metric
result, model_outputs, wrong_predictions = model.eval_model(train_df, acc=sklearn.metrics.accuracy_score)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/909697 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_roberta_128_2_2


Running Evaluation:   0%|          | 0/113713 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.3765993433817764, 'tp': 271931, 'tn': 351571, 'fp': 103294, 'fn': 182901, 'auroc': 0.7467733375503542, 'auprc': 0.7588672453083803, 'acc': 0.6853952469888326, 'eval_loss': 0.6054539253056533}
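The summary metrics can be re-derived from the four confusion-matrix counts in the log above. Here is a minimal sketch in plain Python, assuming label 1 is the positive (sarcastic) class; `binary_metrics` is an illustrative name, not part of Simple Transformers or scikit-learn.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Counts copied from the evaluation log
metrics = binary_metrics(tp=271931, tn=351571, fp=103294, fn=182901)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The accuracy and MCC computed this way agree with the logged `acc` and `mcc` values. Because false negatives outnumber false positives, recall comes out lower than precision: the model misses sarcastic comments more often than it raises false alarms.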


### 5. Make Predictions on Test Data and Export to Kaggle

In [44]:
# predictions, raw_outputs = model.predict(["could use another Pudge set, or it's been awhile since I've seen an Axe Axe or Sven Sword."])

In [45]:
predictions = model.predict(test_df['text'].to_numpy().tolist())[0]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/101076 [00:00<?, ?it/s]

  0%|          | 0/12635 [00:00<?, ?it/s]

In [46]:
my_submission = pd.DataFrame({'id': test_df.id, 'predicted': predictions})
my_submission.head()

Unnamed: 0,id,predicted
0,909745,1
1,909746,0
2,909747,1
3,909748,0
4,909749,0


In [47]:
my_submission.to_csv('my_submission.csv', index=False)
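Before uploading, it can help to sanity-check the exported file. The sketch below uses only the standard library and assumes the `id,predicted` header written above, with one row per test comment in the original order. Whether Kaggle enforces exactly this layout is an assumption, and `check_submission` is a hypothetical helper.

```python
import csv
import io

def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission CSV (empty list means OK)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "predicted"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    rows = list(reader)
    ids = [row["id"] for row in rows]
    if ids != [str(i) for i in expected_ids]:
        problems.append("ids do not match the test set (missing, extra, or reordered)")

    bad = [row["id"] for row in rows if row["predicted"] not in {"0", "1"}]
    if bad:
        problems.append(f"non-binary predictions for ids: {bad[:5]}")
    return problems

# Small inline sample standing in for my_submission.csv
sample = "id,predicted\n909745,1\n909746,0\n909747,1\n"
print(check_submission(sample, [909745, 909746, 909747]))
```

For the real file, the text could be read with `open('my_submission.csv').read()` and the expected ids taken from `test_df.id`.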

In [48]:
# initial_df

In [49]:
#Run the model on test data
# predictions, raw_outputs = model.predict([test_df['text'].to_numpy().tolist()])

In [50]:
# #Export Predictions for Kaggle
# my_submission = pd.DataFrame({'id': test_df.id, 'predicted': predictions})
# my_submission.head()
# my_submission.to_csv('my_submission.csv', index=False)