## Introduction
Sarcasm has been part of our language for many years. It is the caustic use of irony, in which words communicate the opposite of their surface meaning, either humorously or to mock someone or something. Recognising sarcasm is not always easy, because it depends on language skills and on knowledge of what other people think. For example, how would you classify the sentence “What a fantastic musician!”? Detecting sarcasm is even harder in text, where cues such as tone of voice and facial expression are missing. But what about a computer? Is it possible to train a machine learning model that detects whether a sentence is sarcastic?


## Problem statement
The goal of this project is to predict whether a comment is sarcastic, based on 1 million comments scraped from Reddit communities (also called subreddits). This is a binary classification problem in which NLP techniques turn the text into inputs for our ML/DL models.


## Goal of this notebook
Before we head straight to the modelling part, we need to analyse our data in order to understand:
- Is our dataset balanced? 
- Is part of our data skewing the results?
- Can we detect strange data distributions beforehand? 
- Do we need to clean the data before modelling?

## Structure of this notebook
- Set-up
- Data inspection 
- Exploratory Data Analysis
- Conclusions

# 0. Set-up

In [20]:
###################################################################################################
###
### Import all the necessary packages and custom functions (from the functions.py file)
###
###################################################################################################

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the submodule conventionally aliased as plt
from datetime import datetime
import plotly.graph_objs as go
import re
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # <-- You may need to download it if you encounter a LookupError
from sklearn.feature_extraction.text import CountVectorizer
from functions import *

In [21]:
###################################################################################################
###
### Get the data from the Google Drive public folders
###
###################################################################################################

# Load train and test dataframes
test_df = get_sarcasm_test_df()
train_df = get_sarcasm_train_df()





# 1. Data cleansing
Why is this important? Predictive models, regardless of the sophistication of the algorithms employed, are only as good as the data used to train them. Incorrect data yields inaccurate insights. In addition, poorly formatted data is hard for computers to process.

Since we will build two different models, the data cleaning may differ between them. While the RoBERTa model handles (and actually needs) stopwords to work out the context of a sentence, Logistic Regression will likely benefit from removing stopwords and punctuation.
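
To make the difference concrete, here is a minimal sketch of the kind of cleaning a bag-of-words model might receive. The helper name `clean_for_bow` and the tiny stopword set are assumptions made for illustration; the notebook itself imports the much larger list from `nltk.corpus.stopwords`.

```python
import string

# Tiny illustrative stopword list (an assumption); the notebook imports the
# full English list from nltk.corpus.stopwords instead.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "what", "of", "to", "and", "in"}


def clean_for_bow(comment: str) -> str:
    """Lowercase the comment, then drop ASCII punctuation and stopwords."""
    lowered = comment.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in STOPWORDS]
    return " ".join(tokens)


raw = "What a fantastic musician!"
print(clean_for_bow(raw))  # text a Logistic Regression pipeline might see
print(raw)                 # text a RoBERTa tokenizer would receive unchanged
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes would need separate handling.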

## 1.1 Inspecting the datasets

In [None]:
###################################################################################################
###
### Identifying duplicate rows in the datasets
###
###################################################################################################


# Do we have any duplicate rows in the datasets?
# duplicated() returns a boolean Series, so its sum is the number of duplicate rows
print('Duplicate rows in the train dataset:', train_df.duplicated().sum())
print('Duplicate rows in the test dataset:', test_df.duplicated().sum())


In [None]:
# Inspect the training dataset
print("The 'comment' column is the main field we want to use to classify whether the comment is sarcastic or not")
print("The column 'label' defines if a comment is sarcastic (sarcastic = 1) or not (genuine = 0)")
train_df.head()

In [None]:
# Do we have any missing values in the dataset?
train_df.info()
print("---------------------------------")
print("It seems that we are missing 47 comments in the 'comment' column. We will filter them out")
train_df = train_df[train_df.comment.notnull()]

In [None]:
# Inspect the test dataset
print("The structure of the dataset is identical to the training dataset, except here we are missing the 'label' column")
test_df.head()

In [None]:
# Do we have any missing values in the dataset?
test_df.info()
print("---------------------------------")
print("The test dataframe is missing some comments too. Since this field is critical, we will also filter those comments out.")
test_df = test_df[test_df.comment.notnull()]
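
The duplicate and missing-value checks in the cells above can also be folded into one small helper. This is only a sketch on a toy dataframe: `data_quality_summary` is a hypothetical name, and the only column it assumes is the text column `comment`.

```python
import pandas as pd


def data_quality_summary(df: pd.DataFrame, text_col: str = "comment") -> dict:
    """Count rows, fully duplicated rows and missing values in the text column."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_text": int(df[text_col].isna().sum()),
    }


# Toy data (made up for illustration): one missing comment, one duplicated row
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "comment": ["Great idea", None, "Sure, that will work", "Sure, that will work"],
})

summary = data_quality_summary(toy_df)
print(summary)

# Drop duplicated rows first, then rows without a comment
cleaned = toy_df.drop_duplicates().dropna(subset=["comment"])
print(len(cleaned))
```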

# 2. Exploratory Data Analysis (EDA)

In [None]:
###################################################################################################
#####
##### Dataset balance: Are sarcastic and non-sarcastic comments equally distributed? 
#####
###################################################################################################

# Every comment is unique, so we want to check the proportion of sarcastic and non-sarcastic comments


train_df["label"].value_counts().plot(kind = 'pie', autopct='%1.1f%%', figsize=(6, 6)).legend()
print(round(train_df.id[train_df.label == 1].count() / train_df.id.count()*100,2), "% of the comments are sarcastic - perfectly balanced! An amazing dataset! This points us to consider accuracy as one of our key performance metrics in the modelling part, as it is easy to interpret. However, we also want to check the model performance with other metrics such as precision and recall.")
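
The cell above proposes accuracy as a key metric and precision and recall as complementary ones. As a reminder of how the three relate, here is a small pure-Python sketch on made-up predictions; `binary_metrics` is a hypothetical helper, and the modelling part could just as well rely on `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for labels where 1 = sarcastic, 0 = genuine."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sarcastic, predicted sarcastic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # genuine, predicted genuine
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # genuine, predicted sarcastic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sarcastic, predicted genuine
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up example: a cautious model that rarely predicts "sarcastic"
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy looks reasonable while recall shows that half of the sarcastic comments are missed, which is why a second metric is worth tracking even on a balanced dataset.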


In [None]:
###################################################################################################
#####
##### Is score an indicator of sarcasm?
#####
###################################################################################################

# Create dataframes for sarcastic and genuine comments
# (.copy() avoids pandas' SettingWithCopyWarning when adding a column below)
score_sarcastic = train_df[train_df['label']==1].copy()
score_genuine = train_df[train_df['label']==0].copy()

# Apply function from the functions.py file to classify comments in groups by their score 
score_sarcastic['score_groups'] = score_sarcastic.apply(score_classification, axis = 1)
score_genuine['score_groups'] = score_genuine.apply(score_classification, axis = 1)

# Get the percentage over the total for each group (sarcastic comments)
score_sarcastic_gb = score_sarcastic.groupby(['score_groups'])['id'].nunique().reset_index()
score_sarcastic_gb['total'] = score_sarcastic_gb.id.sum()
score_sarcastic_gb['percentage_over_total_sarcastic'] = round(score_sarcastic_gb.id / score_sarcastic_gb.total * 100,2)
score_sarcastic_gb = score_sarcastic_gb[['score_groups', 'percentage_over_total_sarcastic']]

# Get the percentage over the total for each group (genuine comments)
score_genuine_gb = score_genuine.groupby(['score_groups'])['id'].nunique().reset_index()
score_genuine_gb['total'] = score_genuine_gb.id.sum()
score_genuine_gb['percentage_over_total_genuine'] = round(score_genuine_gb.id / score_genuine_gb.total * 100,2)
score_genuine_gb = score_genuine_gb[['score_groups', 'percentage_over_total_genuine']]

print("The score distributions of sarcastic and non-sarcastic comments differ only slightly. Therefore, score may not be a relevant attribute for the feature generation part")

# Outer join so that a score group present in only one of the two tables is not dropped
score_sarcastic_gb.merge(score_genuine_gb, how='outer', on='score_groups')
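
The same per-label percentages can be obtained in one step with `pd.cut` and `pd.crosstab`. The sketch below runs on toy data, and the bin edges and group names are assumptions: the real grouping lives in `score_classification` in `functions.py`, which is not shown here.

```python
import pandas as pd

# Toy data (made up for illustration)
toy_df = pd.DataFrame({
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
    "score": [-3, 1, 4, 12, -1, 2, 5, 30],
})

# Hypothetical bin edges and names; intervals are closed on the right,
# i.e. (-inf, 0], (0, 5], (5, inf)
bins = [float("-inf"), 0, 5, float("inf")]
names = ["negative_or_zero", "1_to_5", "above_5"]
toy_df["score_groups"] = pd.cut(toy_df["score"], bins=bins, labels=names)

# normalize="columns" turns counts into shares within each label
share_by_label = (
    pd.crosstab(toy_df["score_groups"], toy_df["label"], normalize="columns")
    .mul(100)
    .round(2)
)
print(share_by_label)
```

Each column of `share_by_label` sums to 100, so the two labels can be compared row by row.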


In [None]:
###################################################################################################
#####
##### Are certain authors skewing our data?
#####
###################################################################################################

# Count the number of sarcastic comments per author
sarcastic_authors = train_df[train_df['label']==1].groupby(['author'])['id'].nunique().reset_index()
sarcastic_authors.columns = ['author', 'sarcastic_comments']

# Count the number of non-sarcastic comments per author
non_sarcastic_authors = train_df[train_df['label']==0].groupby(['author'])['id'].nunique().reset_index()
non_sarcastic_authors.columns = ['author', 'non_sarcastic_comments']

# Create total dataset (outer join, so authors with only one kind of comment are kept)
authors_df = sarcastic_authors.merge(non_sarcastic_authors, how='outer', on ='author')

# Authors missing from one side have no comments of that kind: fill with 0 and restore integer counts
authors_df.non_sarcastic_comments = authors_df.non_sarcastic_comments.fillna(0).astype(int)
authors_df.sarcastic_comments = authors_df.sarcastic_comments.fillna(0).astype(int)

# Create column with total comments per author (sarcastic and non-sarcastic)
authors_df['total_comments'] = authors_df.sarcastic_comments + authors_df.non_sarcastic_comments

# Create column with 'sarcastic rate' per author - sarcastic comments / total comments
authors_df['sarcastic_rate'] = authors_df.sarcastic_comments / authors_df.total_comments

# Is our dataset balanced at a user level too?
print("We find out that", round(authors_df[authors_df.sarcastic_rate == 0.50].author.nunique() / authors_df.author.nunique()*100,2), "% of the users post as many genuine as sarcastic comments")
print("Only", round(authors_df[authors_df.sarcastic_rate >= 0.70].author.nunique() / authors_df.author.nunique()*100,2), "% of the users post sarcastic comments at least 70% of the time")
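
A more compact way to reach the per-author sarcastic rate is a single `groupby` with named aggregations. The sketch below uses toy data and assumes each row is one comment, so it counts rows rather than unique `id` values as the cell above does.

```python
import pandas as pd

# Toy data (made up for illustration): one row per comment
toy_df = pd.DataFrame({
    "author": ["ann", "ann", "bob", "bob", "bob", "cat"],
    "label": [1, 0, 1, 1, 1, 0],
})

# Since label is 0/1, its sum is the number of sarcastic comments per author
authors = toy_df.groupby("author")["label"].agg(
    total_comments="count",
    sarcastic_comments="sum",
)
authors["sarcastic_rate"] = authors["sarcastic_comments"] / authors["total_comments"]

# Share of authors (in %) that are perfectly balanced or mostly sarcastic
balanced_share = (authors["sarcastic_rate"] == 0.5).mean() * 100
heavy_share = (authors["sarcastic_rate"] >= 0.7).mean() * 100
print(authors)
print(round(balanced_share, 2), round(heavy_share, 2))
```

Because every author appears in the grouped table, authors with no sarcastic comments (such as `cat` here) are kept with a rate of 0.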



In [None]:
###################################################################################################
#####
##### Is data changing over time? 
#####
###################################################################################################

# Define dataframe for sarcastic and non-sarcastic comments
# train_df is the cleaned training dataframe from the inspection step above
sarcastic = train_df[train_df['label']==1]
genuine = train_df[train_df['label']==0]


# Group the data by dates for sarcastic comments
sarcastic=sarcastic.groupby(['date'])['label'].count()
sarcastic=pd.DataFrame(sarcastic)

# Group the data by dates for non-sarcastic comments
genuine=genuine.groupby(['date'])['label'].count()
genuine=pd.DataFrame(genuine)

# Plotting the time series graph
fig = go.Figure()
fig.add_trace(go.Scatter(
         x=genuine.index,
         y=genuine['label'],
         name='Genuine',
    line=dict(color='blue'),
    opacity=0.8))

fig.add_trace(go.Scatter(
         x=sarcastic.index,
         y=sarcastic['label'],
         name='Sarcastic',
    line=dict(color='red'),
    opacity=0.8))

fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)
        
    
fig.update_layout(title_text='Sarcastic and non-sarcastic comments',plot_bgcolor='rgb(248, 248, 255)',yaxis_title='Number of comments')

fig.show()

In [None]:
# Based on the graph above, we may wonder... should we filter out the data from 2016 onwards? 
#    There seem to be more genuine than sarcastic comments in that period. Let's inspect the distribution!

# Plot distribution of the data before 2016
train_df[train_df['date']<'2016-01']["label"].value_counts().plot(kind = 'pie', autopct='%1.1f%%', figsize=(6, 6)).legend()

print("If we remove the data from 2016 onwards, the perfectly balanced distribution (50%) no longer holds. Since the goal of this project is to model the text, without considering other business questions, let's keep all the data as initially intended (including 2016)")
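
To see the balance per period as numbers rather than reading it off the chart, the counts can be pivoted into one table. This is a sketch on toy data; it assumes `date` is stored as a `YYYY-MM` string, which the comparison with `'2016-01'` above suggests but does not confirm.

```python
import pandas as pd

# Toy data (made up for illustration); date assumed to be a 'YYYY-MM' string
toy_df = pd.DataFrame({
    "date": ["2015-11", "2015-11", "2015-12", "2016-01", "2016-01", "2016-01"],
    "label": [1, 0, 1, 0, 0, 1],
})

# One row per date, one column per label; missing combinations become 0
counts = (
    toy_df.groupby(["date", "label"])
    .size()
    .unstack(fill_value=0)
    .rename(columns={0: "genuine", 1: "sarcastic"})
)
counts["sarcastic_share"] = counts["sarcastic"] / (counts["genuine"] + counts["sarcastic"])
print(counts)
```

A `sarcastic_share` that drifts away from 0.5 in some periods would confirm what the time series plot hints at.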

In [None]:
###################################################################################################
#####
##### Are longer or shorter comments more sarcastic?
#####
###################################################################################################

# Create a new column with the number of words per comment (needed in both datasets)
train_df['number_of_words'] = train_df.comment.str.split().str.len()
test_df['number_of_words'] = test_df.comment.str.split().str.len()

# Apply classification groups from functions.py (each dataset is grouped from its own rows)
train_df['number_of_words_group'] = train_df.apply(number_of_words_groups, axis = 1)
test_df['number_of_words_group'] = test_df.apply(number_of_words_groups, axis = 1)

print("We do not observe any indication that we could predict the label of a comment based on its length. Every comment-length group is quite balanced")

# Aggregate data and visualize it
pd.DataFrame(train_df.groupby(['number_of_words_group','label'])['id'].count()).reset_index().sort_values(by = 'number_of_words_group')




In [None]:
###################################################################################################
#####
##### Are comment length distributions different between train and test datasets?
#####
###################################################################################################

# Group the data for the training dataset
train_df_gb = pd.DataFrame(train_df.groupby(['number_of_words_group'])['id'].count()).reset_index().sort_values(by = 'number_of_words_group')
train_df_gb['percentage_over_total_train'] = round(train_df_gb.id / train_df_gb.id.sum()*100,2)

# Group the data for the test dataset
test_df_gb = pd.DataFrame(test_df.groupby(['number_of_words_group'])['id'].count()).reset_index().sort_values(by = 'number_of_words_group')
test_df_gb['percentage_over_total_test'] = round(test_df_gb.id / test_df_gb.id.sum()*100,2)

# Combine both datasets and compare the distributions
df_words_combined = train_df_gb.merge(test_df_gb, on='number_of_words_group', how='left')

print("Comment length distributions are almost identical for the training and test datasets. We can be confident that the datasets are well prepared for the next step: modelling")
df_words_combined[['number_of_words_group', 'percentage_over_total_train', 'percentage_over_total_test']]
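
"Almost identical" can also be summarised as a single number. The sketch below computes the total variation distance between two group distributions on toy data; the group names are made up, and the metric itself is an addition for illustration, not something the notebook computes.

```python
import pandas as pd


def group_shares(groups: pd.Series) -> pd.Series:
    """Percentage of rows falling in each group."""
    return groups.value_counts(normalize=True).mul(100)


# Toy group assignments (made up for illustration)
train_groups = pd.Series(["short", "short", "medium", "long"])
test_groups = pd.Series(["short", "medium", "medium", "long"])

# Align both distributions on the group name; a group absent from one side gets 0
comparison = pd.concat(
    {"train_pct": group_shares(train_groups), "test_pct": group_shares(test_groups)},
    axis=1,
).fillna(0.0)

# Total variation distance in percentage points:
# 0 means identical distributions, 100 means no overlap at all
tv_distance = (comparison["train_pct"] - comparison["test_pct"]).abs().sum() / 2
print(comparison)
print(tv_distance)
```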




# 3. Conclusions
What did we learn from this notebook?
1. The training and test datasets are very well balanced, and none of the attributes we inspected (score, author, date, comment length) shows a clear pattern that we could turn into features.
2. There is room for data cleansing in the datasets (removing punctuation, symbols, stopwords...), but we need to decide which model we want to use beforehand. BERT-based models may need stopwords in order to capture context.
