# Text Classification using Simple Transformers (RoBERTa)


## Problem statement
The goal of this project is to predict whether a comment is sarcastic, based on roughly 1 million comments scraped from Reddit communities (known as subreddits). This is a binary classification problem that requires NLP techniques to turn the raw text into input for our ML/DL models.

## Goal of this notebook
Our goal is to build a model that predicts a comment's label (sarcastic or not). In this notebook, we will use one of the most popular state-of-the-art models for text classification: RoBERTa, which stands for Robustly Optimized BERT Pretraining Approach. It was presented by researchers at Facebook AI and the University of Washington, whose goal was to optimize the pre-training procedure of the BERT architecture. We will use the simpletransformers library to keep the implementation as simple as possible.

## Important note: Training and predicting time
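Due to the size of our dataset, training the model and predicting the labels may take a very long time, especially because the model in this notebook is created with `use_cuda=False` and therefore runs on the CPU.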


## Structure of this notebook
0. Set-up and data cleansing
1. Create datasets 
2. Modelling
3. Evaluation
4. Conclusions

# 0. Set-up

In [None]:
###################################################################################################
###
### Install all the necessary packages 
###
###################################################################################################

!pip install -r requirements_roberta.txt

In [None]:
###################################################################################################
###
### Import all the necessary packages and custom functions (from the functions.py file)
###
###################################################################################################

from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging
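import sklearn.metrics  # import the submodule explicitly so sklearn.metrics.accuracy_score resolves
from sklearn.metrics import accuracy_score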
import time

from functions import *

import torch
torch.multiprocessing.set_sharing_strategy('file_system')

In [None]:
###################################################################################################
###
### Get the data from the Google Drive public folders
###
###################################################################################################

# Load train and test dataframes
test_df = get_sarcasm_test_df()
train_df = get_sarcasm_train_df()

# 1. Create datasets
Because the comments are short, we keep the full text without any cleaning.
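
The claim that the comments are short can be checked before deciding to skip cleaning. The following is a minimal sketch using only the standard library; the helper name `length_summary` and the toy comments are hypothetical and are not taken from the dataset. With pandas, something like `train_df['comment'].str.len().describe()` should give a comparable summary.

```python
import statistics


def length_summary(comments):
    """Return basic character-length statistics for a list of strings."""
    lengths = [len(c) for c in comments]
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Toy comments invented for illustration only (not from the Reddit data)
sample_comments = [
    "Oh great, another Monday.",
    "Thanks for the link!",
    "Yeah, because that always works.",
]

summary = length_summary(sample_comments)
print(summary)
```

If the maximum length is well below what the model can handle, truncation is unlikely to discard much text, which supports feeding the raw comments to the model.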

In [None]:
###################################################################################################
###
### Prepare the datasets for the model
###
###################################################################################################

# Start the timer used to measure the total execution time
start_time = time.time()

# Keep only the necessary columns
test_df = test_df[['id', 'label', 'comment']].dropna()
train_df = train_df[['id', 'label', 'comment']].dropna()

# Rename the columns and cast the text field to string
test_df = test_df.rename(columns={"comment": "text", "label":"labels"})
test_df['text'] = test_df['text'].astype('str')

train_df = train_df.rename(columns={"comment": "text", "label":"labels"})
train_df['text'] = train_df['text'].astype('str')

# Show what the dataframe looks like
train_df.head()

# 2. Modelling
In this case, we will use the 'comment' column (renamed to 'text' in the previous step) as the text input for the model. We will use the simpletransformers library to make the pipeline easier to handle and to prototype with.

In [None]:
###################################################################################################
###
### Reduce the size of the datasets for the model
###
###################################################################################################

# Due to the size of our dataset and our limited hardware, let's reduce the size of the training and test datasets
#   so we can run the model. We will keep a 70:30 ratio on the data between train and test datasets

train_df_model = train_df[:210000]
test_df_model = test_df[:90000]

# Show what the dataframe looks like
train_df_model.head()
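
The two slice sizes above follow from the 70:30 ratio applied to a combined budget of 300,000 rows. As a minimal sketch, the arithmetic can be written as a small helper; `split_sizes` is a hypothetical name introduced here for illustration.

```python
def split_sizes(total_rows, train_fraction=0.7):
    """Return (n_train, n_test) for a total row budget and a train fraction."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    n_train = round(total_rows * train_fraction)
    return n_train, total_rows - n_train


# A budget of 300,000 rows reproduces the slice sizes used in the cell above
n_train, n_test = split_sizes(300_000)
print(n_train, n_test)
```

Note that slicing with `[:n]` keeps the first rows of each dataframe. If the rows happen to be ordered (for example by date or subreddit), drawing a random sample with a fixed seed, such as `train_df.sample(n=n_train, random_state=42)`, may give a more representative subset.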

In [None]:
###################################################################################################
###
### Set up and initiate RoBERTa classification model
###
###################################################################################################

# Configure logging: show INFO messages in general, but only warnings from transformers
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Set up simpletransformers classification model - we will select RoBERTa and the necessary arguments
model = ClassificationModel('roberta', 'roberta-base', use_cuda=False, args={'reprocess_input_data': True, 'overwrite_output_dir': True})
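
The model above is created with `use_cuda=False` and only two arguments. A minimal sketch of a slightly more explicit configuration follows, assuming a GPU may or may not be present. The extra keys (`num_train_epochs`, `train_batch_size`, `max_seq_length`) are simpletransformers arguments, but the values shown are illustrative assumptions, not settings used in this notebook.

```python
# Detect a GPU if PyTorch is installed; otherwise fall back to the CPU
try:
    import torch
    use_cuda = torch.cuda.is_available()
except ImportError:
    use_cuda = False

# Illustrative values only; tune them to the available hardware
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "num_train_epochs": 1,
    "train_batch_size": 8,
    "max_seq_length": 128,
}

# The model would then be created as in the cell above, for example:
# model = ClassificationModel('roberta', 'roberta-base', use_cuda=use_cuda, args=model_args)
print(use_cuda, model_args)
```

Running on a GPU, when one is available, is likely the single largest reduction in training time for a dataset of this size.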


In [None]:
###################################################################################################
###
### Train our model and predict over the test dataset
###
###################################################################################################

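# Fine-tune RoBERTa on the training subset
model.train_model(train_df_model)

# Evaluate the model on the training subset
#   (this measures how well the model fits the training data, not how well it generalizes)
result, model_outputs, wrong_predictions = model.eval_model(train_df_model, acc=sklearn.metrics.accuracy_score)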

# Predict labels (is the comment sarcastic or not) on the test dataset
predictions_values = model.predict(test_df_model['text'].to_numpy().tolist())[0]
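
`model.predict` returns a tuple, and the cell above keeps only its first element, the predicted labels. The second element holds the raw model outputs per class. As a minimal sketch, assuming those raw outputs are unnormalized scores (logits), a softmax turns them into probabilities; the logits below are invented for illustration.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented raw outputs for two comments: [score_genuine, score_sarcastic]
raw_outputs = [[1.2, -0.8], [-0.3, 2.1]]

probabilities = [softmax(row) for row in raw_outputs]
# The predicted label is the index of the highest probability
predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
print(probabilities, predicted)
```

Keeping the probabilities, and not only the hard labels, makes it possible to inspect how confident the model is about each comment.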

In [None]:
###################################################################################################
###
### Evaluate the results
###
################################################################################################### 


# Record the time at which training and prediction finished
end_time = time.time()

# Check the accuracy of the model on the test subset
accuracy = accuracy_score(test_df_model.labels, predictions_values)
print("The RoBERTa classification model correctly predicts %.2f percent of the Reddit comments" % (accuracy * 100))

# Plot the confusion matrix
plot_confusion_matrix(test_df_model.labels, predictions_values, ['genuine','sarcastic'], figsize=(5, 5))
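
Accuracy and the confusion matrix are closely related: accuracy is the share of predictions on the matrix diagonal. The following minimal sketch computes the four cells by hand, together with precision and recall for the sarcastic class. The function names and the toy labels are hypothetical and are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision and recall from the confusion counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels invented for illustration: 1 = sarcastic, 0 = genuine
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(counts)
print(counts, metrics)
```

Precision and recall are worth reporting next to accuracy, because they show whether the errors are mostly missed sarcastic comments or genuine comments flagged as sarcastic.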

In [None]:
###################################################################################################
###
### Store the results
###
###################################################################################################

# Store the final results from the RoBERTa model without cleaning
roberta_results = pd.DataFrame({'id': test_df_model.id, 'predicted': predictions_values})
roberta_results.to_csv("roberta_results.csv")

# Store the results for comparison
score = round(accuracy_score(test_df_model.labels, predictions_values),2)
model_name = 'roberta_comment_only_without_cleaning'

# Create a table with the final results and print the results
roberta_results_table = pd.DataFrame([[model_name, score, round(end_time - start_time,0)]], columns = ['Model', 'Accuracy', 'Execution_Time_Seconds'])
roberta_results_table.to_csv("roberta_results_table.csv")
roberta_results_table