# "You Can't Be Serious" - Detecting Sarcasm

As requested, below is the code and techniques employed to develop an algorithm to sort out sarcastic headlines from non-sarcastic ones. To get started, let's import the libraries necessary to make the magic happen:

In [1]:
import json
import pandas as pd
import regex as re
import numpy as np
from plotly import tools
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
import itertools
from collections import Counter
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from nltk.stem.porter import PorterStemmer
from sklearn.preprocessing import LabelEncoder
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfTransformer
import random



With the libraries on hand, let's now grab our data and see what we are working with:

In [2]:
data = pd.read_json('./Sarcasm_Headlines_Dataset_v2.json',lines=True)
data["source"] = data["article_link"].apply(lambda x: re.findall(r'\w+', x)[2])
data

| | is_sarcastic | headline | article_link | source |
|---|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... | theonion |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... | huffingtonpost |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... | huffingtonpost |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... | theonion |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... | theonion |
| ... | ... | ... | ... | ... |
| 28614 | 1 | jews to celebrate rosh hashasha or something | https://www.theonion.com/jews-to-celebrate-ros... | theonion |
| 28615 | 1 | internal affairs investigator disappointed con... | https://local.theonion.com/internal-affairs-in... | theonion |
| 28616 | 0 | the most beautiful acceptance speech this week... | https://www.huffingtonpost.com/entry/andrew-ah... | huffingtonpost |
| 28617 | 1 | mars probe destroyed by orbiting spielberg-gat... | https://www.theonion.com/mars-probe-destroyed-... | theonion |


As shown above, our data consists of four variables: three candidate features and one response. Upon inspection, we really only need to concern ourselves with two columns: "is_sarcastic" (the response) and "headline".

The problem now is determining whether an article is satirical or not according to the phrasing of its headline. As humans, we intuitively pick up on such nuances. However, how can we program such intuition into a machine?

We can begin by creating a form of "database" conveying the frequency, and by extension the relevance, of the words present in our headlines, and then attempt to correlate this information with the labels in question (sarcastic, non-sarcastic). We will accomplish this via a method we will refer to as a "bag of words".

Our "bag of words" data preparation entails the following steps:

- 1) Tokenizing each headline
- 2) Removing "stopwords" - Words conveying little meaning or insight (e.g. "of", "the", "for", "in", etc.)
- 3) Stemming - Reducing words to their base form so that variants can later be compared accurately (e.g. "running" and "runs" should both be treated as the term "run")
- 4) Capturing the frequency of each word per document for all the documents in the dataframe
- 5) Assigning Term Frequency - Inverse Document Frequency (TF-IDF) scores to each term per document. A TF-IDF score reflects how distinctive a word is to one document relative to the entire body of documents considered. In a nutshell, the score rises with every occurrence of the word in the document in question and falls as the word appears in more of the documents overall. This captures the uniqueness of the word to its document rather than to the body of documents, down-weighting words that are common everywhere (such as "what", "when", "yes", "no").

Here's a very reductive demonstration:

## Tokenization:

In [3]:
example = "This is merely a test for comprehension."
example2 = re.sub('[^A-Za-z]', ' ', example.lower())  # replace every non-letter character (punctuation, digits) with a space
token_example = word_tokenize(example2)
token_example

['this', 'is', 'merely', 'a', 'test', 'for', 'comprehension']
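For intuition, the same tokens can be produced for this sentence with the standard library alone. The sketch below is an illustration rather than the notebook's method: `simple_tokenize` is a hypothetical helper that splits on whitespace, whereas `word_tokenize` applies additional rules that may split some words differently.

```python
import re


def simple_tokenize(text):
    """Lowercase the text, replace non-letters with spaces, split on whitespace."""
    cleaned = re.sub(r"[^A-Za-z]", " ", text.lower())
    return cleaned.split()


tokens = simple_tokenize("This is merely a test for comprehension.")
print(tokens)
```

For headline-style text with little unusual punctuation, the two approaches should usually agree, but that is an assumption worth checking on real data.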

## Stopwords:

In [4]:
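# NOTE: remove() inside the loop shifts the remaining tokens left, so the token that follows
# each removed word is never checked; this is why "is" survives in the output below.
for word in token_example:
    if word in stopwords.words("english"):
        token_example.remove(word)

token_example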

['is', 'merely', 'test', 'comprehension']
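The output above still contains "is", which is a common stopword. The likely cause is that the loop removes items from the list it is iterating over. The sketch below reproduces that behaviour and contrasts it with a list comprehension. `STOP_WORDS` is a small hard-coded set assumed here as a stand-in for NLTK's English list.

```python
# Hypothetical stand-in for NLTK's English stopword list (assumption:
# only the four stopwords that occur in the example sentence).
STOP_WORDS = {"this", "is", "a", "for"}

tokens = ["this", "is", "merely", "a", "test", "for", "comprehension"]

# In-place removal: each remove() shifts later items left, so the loop
# never examines the token that follows a removed word.
in_place = list(tokens)
for word in in_place:
    if word in STOP_WORDS:
        in_place.remove(word)

# Building a new list examines every token exactly once.
filtered = [word for word in tokens if word not in STOP_WORDS]

print(in_place)   # still contains "is"
print(filtered)   # all four stopwords removed
```

Switching the notebook to the list comprehension would change the vocabulary and therefore the scores reported later, so the figures in this write-up should be read as results of the in-place loop.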

## Stemming:

In [5]:
stemmer = PorterStemmer()

for i in range(len(token_example)):
    token_example[i] = stemmer.stem(token_example[i])
    
token_example

['is', 'mere', 'test', 'comprehens']

## Bringing it together:

In [6]:
token_example = " ".join(token_example)

token_example

'is mere test comprehens'

At the risk of being redundant, below is another demonstration of our "bag of words" approach, but in this case with a list of documents. We will conclude this demo by also calculating the "TF-IDF" scores for the words in question:

In [7]:
new_document = []
stemmer = PorterStemmer()

documents = ["This is a super test everyone.","This other sentence is to prove a super."," Super."]


for document in list(documents):
    clean = re.sub('[^A-Za-z]', ' ', document)  # replace non-letter characters with spaces
    tokenized_clean = word_tokenize(clean)  # TOKENIZATION

    # STOPWORD REMOVAL
    # NOTE: the text is not lowercased, so "This" does not match the lowercase stopword list,
    # and remove() inside the loop skips the token after each removed word (likely why "to" survives).
    for word in tokenized_clean:
        if word in stopwords.words("english"):
            tokenized_clean.remove(word)

    for i in range(len(tokenized_clean)):  # STEMMING
        tokenized_clean[i] = stemmer.stem(tokenized_clean[i])

    new_document.append(" ".join(tokenized_clean))  # BRINGING IT TOGETHER

vectorized_test = CountVectorizer(max_features=1000)  # captures word frequencies, keeping at most 1000 terms

X_test = vectorized_test.fit_transform(new_document).toarray()  # building our matrix of frequencies
print(vectorized_test.get_feature_names_out().tolist())  # replaces get_feature_names(), deprecated in scikit-learn 1.0
print(X_test)

['everyon', 'prove', 'sentenc', 'super', 'test', 'thi', 'to']
[[1 0 0 1 1 1 0]
 [0 1 1 1 0 1 1]
 [0 0 0 1 0 0 0]]

Notice above how the "super" column has 1's all the way down since this word appears in every document.
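To make the counting step concrete, here is a minimal standard-library sketch that rebuilds the same matrix with `collections.Counter`. The `cleaned_docs` strings are an assumption about what the cleaning loop produced. The length filter mirrors `CountVectorizer`'s default behaviour of ignoring single-character tokens.

```python
from collections import Counter

# Assumed input: the three cleaned, stemmed documents as space-separated strings.
cleaned_docs = ["thi a super test everyon", "thi sentenc to prove super", "super"]

# Ignore single-character tokens, as CountVectorizer does by default.
tokenized = [[t for t in doc.split() if len(t) >= 2] for doc in cleaned_docs]

# Sorted vocabulary: one column per distinct term.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary term.
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is a document and each column a term, which is all that "bag of words" means: word order is discarded and only counts remain.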

Let's calculate the TF-IDF now:

In [8]:
tfidf = TfidfTransformer()
tfidf_X = tfidf.fit_transform(X_test).toarray()

np.round(tfidf_X,2)

array([[0.58, 0.  , 0.  , 0.35, 0.58, 0.44, 0.  ],
       [0.  , 0.5 , 0.5 , 0.3 , 0.  , 0.38, 0.5 ],
       [0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 0.  ]])

Revisiting our "super" column, the scores are 0.35, 0.3, and 1, respectively. The first two are the lowest nonzero scores in their rows because "super" appears in all three documents and therefore receives the smallest inverse document frequency weight. The third score is 1 because "super" is the only word in that document, so after normalization it carries the entire weight.
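These scores can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfTransformer` settings: a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. It uses only the standard library and the count matrix printed earlier.

```python
import math

# Count matrix from the demo (columns: everyon, prove, sentenc, super, test, thi, to).
counts = [
    [1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents containing each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf_rows = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]    # term frequency times idf
    norm = math.sqrt(sum(v * v for v in weighted))    # L2 norm of the row
    tfidf_rows.append([round(v / norm, 2) for v in weighted])

for row in tfidf_rows:
    print(row)
```

Because "super" has a document frequency of 3, its idf is exactly 1, the smallest possible weight, while terms found in a single document receive about 1.69.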

Equipped with our "bag of words" procedure, let's apply it to our data:

In [9]:
new_headline = []
stemmer = PorterStemmer()


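stop_words = set(stopwords.words("english"))  # build the lookup once instead of once per word

for headline in list(data.headline):
    clean = re.sub('[^A-Za-z]', ' ', headline)  # replace non-letter characters with spaces
    tokenized_clean = word_tokenize(clean)

    # Stopword removal. NOTE: remove() inside the loop skips the token after each removed
    # word, so some stopwords survive; the results reported below were produced this way.
    for word in tokenized_clean:
        if word in stop_words:
            tokenized_clean.remove(word)

    # Stemming
    for i in range(len(tokenized_clean)):
        tokenized_clean[i] = stemmer.stem(tokenized_clean[i])

    new_headline.append(" ".join(tokenized_clean))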

In [10]:
# Retain the 1000 most frequent features. Larger vocabularies make the models much slower
# to train for almost no gain: in tests with max_features of 500, 5000, and 10000, the
# higher settings improved accuracy by only one or two points. The 5000 setting takes
# over 2 hours to run, and the 10000 setting ran overnight without finishing.
matrix = CountVectorizer(max_features=1000)

X = matrix.fit_transform(new_headline).toarray()
Y = data.is_sarcastic
le = LabelEncoder()
Y = le.fit_transform(Y)
Y = Y.reshape(-1,1)

In [11]:
tfidf = TfidfTransformer()
tfidf_X = tfidf.fit_transform(X).toarray()
tfidf_X.shape

(28619, 1000)

This is where the fun begins. Our TF-IDF scores will take the place of the headline column in training: the models will use these scores to predict whether a headline is sarcastic or not. Before getting carried away, we will first split our data 70/30 into training and test sets:

In [12]:
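# NOTE: random.seed() only seeds Python's built-in generator; train_test_split draws from
# NumPy's, so pass random_state=7 to train_test_split if a repeatable split is needed.
random.seed(7)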

X_train,X_test,Y_train,Y_test = train_test_split(tfidf_X,Y,test_size=0.3)
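As a rough illustration of what a seeded 70/30 split does, here is a standard-library sketch. `shuffle_split` is a hypothetical helper and not how `train_test_split` is implemented. It assumes the test count is rounded up, which is consistent with the 8586 test rows shown in the classification reports.

```python
import math
import random


def shuffle_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples)."""
    n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # local generator, so the split is repeatable
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(28619, 0.3, seed=7)
print(len(train_idx), len(test_idx))
```

Passing the seed directly to the function that shuffles is what makes the split repeatable. Seeding an unrelated generator elsewhere in the notebook has no effect on it.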

The first model we will train and test is a support vector machine (SVM). Two types of SVM will be examined: a linear SVM and an SVM with a radial basis function (RBF) kernel. Both models output hard class labels rather than probabilities and use a margin cost parameter of C = 1 (the default value).

In [13]:
random.seed(7)

svm_results = SVC(kernel = "linear").fit(X_train,Y_train.ravel()).predict(X_test)
svm_results_kernel = SVC(kernel = "rbf").fit(X_train,Y_train.ravel()).predict(X_test)
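The two models differ only in how they measure similarity between headlines. The sketch below writes out the two kernel functions in plain Python. The vectors and the `gamma` value are made up for illustration, and `SVC` chooses its own `gamma` by default.

```python
import math


def linear_kernel(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


u = [0.0, 0.6, 0.8]   # toy TF-IDF-like vectors (made up for illustration)
v = [0.6, 0.0, 0.8]

print(linear_kernel(u, v))
print(rbf_kernel(u, v, gamma=0.5))
```

The RBF kernel equals 1 for identical vectors and decays toward 0 as they move apart, which lets the SVM draw non-linear boundaries in the TF-IDF space.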

Results for our linear SVM are as follows:

In [15]:
print(classification_report(Y_test,svm_results))
print('Accuracy score: {}'.format(accuracy_score(Y_test, svm_results)))
print('Precision score: {}'.format(precision_score(Y_test, svm_results)))
print('Recall score: {}'.format(recall_score(Y_test, svm_results)))
print('F1 score: {}'.format(f1_score(Y_test, svm_results)))

              precision    recall  f1-score   support

           0       0.75      0.79      0.77      4452
           1       0.76      0.71      0.74      4134

    accuracy                           0.75      8586
   macro avg       0.75      0.75      0.75      8586
weighted avg       0.75      0.75      0.75      8586

Accuracy score: 0.7542511064523643
Precision score: 0.7612287041817243
Recall score: 0.7133526850507983
F1 score: 0.7365134865134865
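For readers who want to see how the four scores relate, the sketch below recomputes them from confusion-matrix counts. The counts are not an output of the notebook. They were back-calculated from the reported scores and supports, so treat them as an inference.

```python
# Confusion counts back-calculated from the linear SVM scores and supports
# reported above (an inference, not an output of the notebook).
tn, fp, fn, tp = 3527, 925, 1185, 2949

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all headlines classified correctly
precision = tp / (tp + fp)                   # share of predicted-sarcastic that are sarcastic
recall = tp / (tp + fn)                      # share of sarcastic headlines that were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Recall is lower than precision here, meaning the model misses sarcastic headlines more often than it mislabels serious ones.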


Its RBF-kernel counterpart yields:

In [16]:
print(classification_report(Y_test,svm_results_kernel))
print('Accuracy score: {}'.format(accuracy_score(Y_test, svm_results_kernel)))
print('Precision score: {}'.format(precision_score(Y_test, svm_results_kernel)))
print('Recall score: {}'.format(recall_score(Y_test, svm_results_kernel)))
print('F1 score: {}'.format(f1_score(Y_test, svm_results_kernel)))

              precision    recall  f1-score   support

           0       0.76      0.80      0.78      4452
           1       0.77      0.73      0.75      4134

    accuracy                           0.76      8586
   macro avg       0.76      0.76      0.76      8586
weighted avg       0.76      0.76      0.76      8586

Accuracy score: 0.7626368506871651
Precision score: 0.768305171530978
Recall score: 0.7259313014029996
F1 score: 0.7465174129353233


The second model we will train is a neural network. After many test runs, a final model was chosen that uses rectified linear unit ("ReLU") activation functions for its nodes (a.k.a. neurons) and, per the hidden_layer_sizes=(25, 20) setting below, two hidden layers of 25 and 20 nodes.

In [None]:
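# NOTE: random.seed() has no effect on MLPClassifier; its weight initialization is seeded
# through random_state, which is left as None here, so results vary from run to run.
random.seed(7)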

mlp = MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=.9,
       beta_2=.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(25, 20), learning_rate='constant',
       learning_rate_init=0.01, max_iter=1000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)


mlp.fit(X_train,Y_train.ravel())

NN_predictions = mlp.predict(X_test)
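To make the architecture concrete, the sketch below derives the weight-matrix shapes implied by `hidden_layer_sizes=(25, 20)`. It assumes 1000 inputs (the TF-IDF features) and a single output unit for the binary label. `layer_shapes` is a hypothetical helper, not part of scikit-learn.

```python
def layer_shapes(n_inputs, hidden_layer_sizes, n_outputs):
    """Return the (rows, cols) shape of each weight matrix in a dense network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


# Assumed sizes: 1000 TF-IDF features in, one output unit for the binary label.
shapes = layer_shapes(1000, (25, 20), 1)

# Each layer has rows * cols weights plus one bias per output column.
n_parameters = sum(rows * cols + cols for rows, cols in shapes)

print(shapes)
print(n_parameters)
```

Each entry in the tuple is the width of one hidden layer, so a network with 20 hidden layers of 25 nodes would instead need a tuple with twenty entries.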

In [22]:
print(classification_report(Y_test,NN_predictions))
print('Accuracy score: {}'.format(accuracy_score(Y_test, NN_predictions)))
print('Precision score: {}'.format(precision_score(Y_test, NN_predictions)))
print('Recall score: {}'.format(recall_score(Y_test, NN_predictions)))
print('F1 score: {}'.format(f1_score(Y_test, NN_predictions)))

              precision    recall  f1-score   support

           0       0.73      0.74      0.73      4463
           1       0.71      0.70      0.71      4123

    accuracy                           0.72      8586
   macro avg       0.72      0.72      0.72      8586
weighted avg       0.72      0.72      0.72      8586

Accuracy score: 0.7225716282320056
Precision score: 0.7139346276726468
Recall score: 0.7045840407470289
F1 score: 0.7092285156249999


In conclusion, capturing the frequency of words across the body of documents gives us insight into the relevance of a given word to its parent document. We can build on this idea to predict whether a document is sarcastic or not based on the TF-IDF scores of its words. In theory, documents with similar TF-IDF scores for certain words should tend to share a label, which gives us a means of classification.

Based on our results, all three models do a fair job of classifying our headlines, with accuracies between 72% and 76%. The kerneled SVM proves to be the most accurate at 76%, followed by the linear SVM at 75% and the neural network at 72%. Note that the class supports in the neural network report (4463/4123) differ from those in the SVM reports (4452/4134), so the network was evaluated on a different random split and the comparison is approximate.