# Detecting Sarcastic Headlines Using Sklearn's Logistic Regression Model

## Introduction

For this project, we wanted to read a news headline and use logistic regression to predict whether it came from a genuine, non-satirical news article or a humorous, satirical one.

In essence, we are using logistic regression to make a binary classification of news headlines as "sarcastic" or "non-sarcastic".

Our dataset comes from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection. It is filled with news headlines from both The Huffington Post and The Onion, along with a sarcastic or non-sarcastic label for each one.

Before starting, we cleaned our data file to get rid of any headlines that used non-ASCII characters, like accented vowels, because they throw off our vectorizer later.
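
The cleaning step itself isn't shown in this notebook. Here is a minimal sketch of one way it could be done, assuming each row is a (label, headline) pair and that "unicode characters" means anything outside ASCII. The helper name `keep_ascii_rows` and the second sample headline are made up for illustration.

```python
def keep_ascii_rows(rows):
    """Keep only (label, headline) rows whose headline is pure ASCII."""
    return [
        (label, text)
        for label, text in rows
        if all(ord(ch) < 128 for ch in text)
    ]


rows = [
    ("0", "nordstrom just low-key dropped a huge fall sale"),
    ("0", "caf\u00e9 owner reopens after flood"),  # made-up row with an accented vowel
]
clean = keep_ascii_rows(rows)
print(len(clean), "of", len(rows), "rows kept")
```

Dropping rows is the bluntest option; replacing the accented characters would keep more data, but we haven't tried that here.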

Here, we open our data file with Python's csv.reader() and split each row into two separate lists.

'data' holds the raw text of each headline, and 'data_labels' holds the string 'sarcastic' (for rows labeled 1 in the file) or 'non-sarcastic' (for rows labeled 0).

In [22]:
import csv

data = []
data_labels = []

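# newline="" is what the csv module documentation recommends for files given to csv.reader
with open("./sarcasm_oneliners/headlines.csv", newline="") as f:
    csvreader = csv.reader(f)

    for row in csvreader:
        # Column 0 is the label ('0' or '1'); column 1 is the headline text.
        data.append(row[1])
        label = row[0]

        # Store readable string labels instead of the raw 0/1 values.
        if label == '0':
            data_labels.append('non-sarcastic')
        else:
            data_labels.append('sarcastic')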

Next, we want to vectorize each headline to extract our important features from the text. We do this by using sklearn's built-in feature extractor for text.

We feed our data into the word count vectorizer in order to build a list of words and their frequencies.

We then convert the result into a dense n-dimensional NumPy array, with one row per headline and one column per word.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = 'word', lowercase = False)

features = vectorizer.fit_transform(data)

features_nd = features.toarray()
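
To make the word-count step concrete, here is a rough plain-Python sketch of what a bag-of-words matrix looks like. It is not CountVectorizer's implementation; it only borrows its default token pattern (runs of two or more word characters) and its alphabetically sorted vocabulary. The function name `bag_of_words` and the second headline are made up for illustration.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(headlines):
    """Return (sorted vocabulary, one row of word counts per headline)."""
    tokenized = [TOKEN_PATTERN.findall(h) for h in headlines]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocab, rows = bag_of_words([
    "nordstrom just low-key dropped a huge fall sale",
    "a huge sale is a huge deal",  # made-up headline
])
print(vocab)
for row in rows:
    print(row)
```

Note that the one-letter word "a" is dropped and "low-key" is split into "low" and "key", which is also how the default pattern treats them.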

Next, we use 'sklearn.model_selection.train_test_split' to partition our data set into a training set and a test set.

train_test_split returns four collections:
- X_train: our 80% portion of 'features_nd' used to train our logistic regression model
- X_test: our 20% portion of 'features_nd' which our model will try to label as 'sarcastic' or 'non-sarcastic'
- y_train: our 80% portion of 'data_labels' used to map every data point in X_train to its real label
- y_test: our 20% portion of 'data_labels' that are the true values of the labels for X_test

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features_nd,
    data_labels,
    test_size=0.20,
    train_size=0.80,
    random_state=12
)

Next we import our logistic regression model. 

If "solver='lbfgs'" wasn't specified, the version of sklearn we used would print a warning that no solver was specified for the model and that 'lbfgs' would become the default.

(The solver is the optimization algorithm that searches for the model's coefficients during fitting; 'lbfgs' is a limited-memory quasi-Newton method. To be honest, we mostly set it on the LogisticRegression object so that it stops yelling at us with a red warning every time we run the Jupyter notebook cell.)

In [25]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(solver='lbfgs')
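
For intuition about what the model computes once it is fitted, here is a small sketch of binary logistic regression's prediction step: a weighted sum of the word counts plus an intercept, passed through the sigmoid function. The weights and intercept below are made up; in the notebook, finding them is the solver's job.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_sarcastic(counts, weights, intercept):
    """P(sarcastic) for one headline's word counts under a linear model."""
    score = intercept + sum(w * c for w, c in zip(weights, counts))
    return sigmoid(score)


# Hypothetical weights for a three-word vocabulary, for illustration only.
weights = [1.2, -0.7, 0.3]
intercept = -0.1
p = predict_proba_sarcastic([1, 0, 2], weights, intercept)
print(round(p, 3))
```

A positive weight pushes headlines containing that word toward "sarcastic", and a negative weight pushes them toward "non-sarcastic".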

Next, we train our model on our training set of data specified by train_test_split.

In [26]:
log_model = log_model.fit(X=X_train, y=y_train)

Now that our model is trained, we can use it to predict the labels for our test set.

Then we save two more arrays:
- y_pred: the predicted label our model gave to each headline in X_test
- y_prob: the probabilities assigned to each label for the given headline, in the order [non-sarcastic, sarcastic]

In [27]:
y_pred = log_model.predict(X_test)
y_prob = log_model.predict_proba(X_test)

for i in range(29,34):
    print(y_pred[i])
    print(y_prob[i])

sarcastic
[0.4611363 0.5388637]
non-sarcastic
[0.57036373 0.42963627]
sarcastic
[0.01015833 0.98984167]
sarcastic
[0.28094057 0.71905943]
non-sarcastic
[0.66878904 0.33121096]
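
The link between the two outputs can be checked by hand: the predicted label is the class with the larger probability. The sketch below uses the five probability rows printed above and assumes the class order ['non-sarcastic', 'sarcastic'], which is how sklearn sorts string labels in `log_model.classes_`. The helper name is ours.

```python
# Class order as sklearn stores it in log_model.classes_ (sorted labels).
classes = ["non-sarcastic", "sarcastic"]

# Probability rows copied from the output above.
y_prob_sample = [
    [0.4611363, 0.5388637],
    [0.57036373, 0.42963627],
    [0.01015833, 0.98984167],
    [0.28094057, 0.71905943],
    [0.66878904, 0.33121096],
]


def label_from_probabilities(prob_row, classes):
    """Return the class whose probability is largest."""
    best = max(range(len(prob_row)), key=lambda k: prob_row[k])
    return classes[best]


labels = [label_from_probabilities(row, classes) for row in y_prob_sample]
print(labels)
```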


Now that we have predictions, we can play around with them by printing the label and probability our model assigned to ten consecutive headlines, starting at a random position in our test set.

In [28]:
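import random

# Convert once, outside the loop; rebuilding this list on every iteration is slow.
feature_rows = features_nd.tolist()

# Pick a random window of 10 consecutive test headlines.
j = random.randint(0, len(X_test) - 10)
for i in range(j, j + 10):
    # The predicted class is the one with the larger probability.
    prob = max(y_prob[i])

    print(y_pred[i] + " - (" + '%.2f' % (prob*100) + "% probability)")
    # Recover the headline text by finding the first row of features_nd that
    # matches this test vector (headlines with identical word counts collide).
    ind = feature_rows.index(X_test[i].tolist())
    print(data[ind].strip())
    print()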

sarcastic - (80.73% probability)
6-year-old cries when told mtm productions kitten dead by now

sarcastic - (72.10% probability)
employees on other end of conference call just want it to be over

non-sarcastic - (51.88% probability)
miss france iris mittenaere wins miss universe crown

sarcastic - (84.28% probability)
ex-iranian president mahmoud ahmadinejad plans to run again

sarcastic - (86.23% probability)
glitch in country allows citizens to temporarily walk through tables

non-sarcastic - (57.33% probability)
12 stunning photos of 'tiny dancers' caught in action

sarcastic - (74.14% probability)
david byrne holds up old suit to show how far he's come in weight loss journey

non-sarcastic - (73.89% probability)
nordstrom just low-key dropped a huge fall sale

non-sarcastic - (96.74% probability)
pope francis pardons those who dodged the draft during crusades

non-sarcastic - (96.69% probability)
here's what actually happens to the money in wishing wells
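
Searching features_nd for a matching row works, but it is slow and can return the wrong headline when two headlines have identical word counts. A sketch of an alternative, assuming we are free to change the split: train_test_split accepts any number of equal-length arrays, so an index array can be split alongside the features and labels. The toy arrays below stand in for features_nd, data_labels and data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for features_nd, data_labels and data.
toy_features = np.arange(20).reshape(10, 2)
toy_labels = ["sarcastic", "non-sarcastic"] * 5
toy_headlines = ["headline %d" % k for k in range(10)]

# Split an index array together with the features and labels.
indices = np.arange(len(toy_headlines))
X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    toy_features, toy_labels, indices, test_size=0.20, random_state=12
)

# idx_te[i] is the position in the original data of test row i.
for i, original_index in enumerate(idx_te):
    print(toy_headlines[original_index], X_te[i])
```

With this in place, the headline for test row i is simply the entry of the original list at position idx_te[i].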



Seeing the predicted labels is fun, but now let's see how accurate our model is.

We'll use sklearn's accuracy_score function to see how on the mark our model is with its predictions.

In [29]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.78625
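
Accuracy is just the fraction of test headlines whose predicted label equals the true label. A hand-rolled sketch on made-up labels (the function name `accuracy` is ours, not sklearn's):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels, for illustration only.
y_true_toy = ["sarcastic", "sarcastic", "non-sarcastic", "non-sarcastic"]
y_pred_toy = ["sarcastic", "non-sarcastic", "non-sarcastic", "non-sarcastic"]
print(accuracy(y_true_toy, y_pred_toy))  # 3 of 4 correct

# With 800 test headlines, a score of 0.78625 means 629 correct predictions.
print(629 / 800)
```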


78.6% isn't terrible, but it can definitely be improved later on. On a previous data set of just sarcastic and non-sarcastic Reddit comments, our model reached only about 50% accuracy (which is no better than guessing), so roughly 79% is not bad at all!

For our purposes, we also use sklearn's classification_report to see our precision, recall, and F1 scores.

In [30]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['non-sarcastic', 'sarcastic']))

               precision    recall  f1-score   support

non-sarcastic       0.76      0.78      0.77       370
    sarcastic       0.81      0.79      0.80       430

    micro avg       0.79      0.79      0.79       800
    macro avg       0.79      0.79      0.79       800
 weighted avg       0.79      0.79      0.79       800
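
To show where these numbers come from, here is a sketch that computes precision, recall and F1 from per-class counts. The notebook never prints a confusion matrix, so the counts below are hypothetical: they are one breakdown of the 800 test headlines that happens to reproduce the rounded report, and the real counts may differ slightly.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts consistent with the rounded report above,
# not the notebook's actual confusion matrix.
correct_non = 289                 # non-sarcastic predicted non-sarcastic
correct_sar = 340                 # sarcastic predicted sarcastic
non_as_sar = 370 - correct_non    # non-sarcastic predicted sarcastic
sar_as_non = 430 - correct_sar    # sarcastic predicted non-sarcastic

for name, tp, fp, fn in [
    ("non-sarcastic", correct_non, sar_as_non, non_as_sar),
    ("sarcastic", correct_sar, non_as_sar, sar_as_non),
]:
    p, r, f = precision_recall_f1(tp, fp, fn)
    print("%-13s %.2f %.2f %.2f" % (name, p, r, f))
```

Precision asks "of the headlines we labeled X, how many really were X?", while recall asks "of the headlines that really were X, how many did we catch?".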



We can now add any headline to the dataset, re-vectorize the data to extract numerical features, refit our model on everything except the new headline, and predict a label for the newly added headline.

In [31]:
data.append('Pathetic: Mom And Dad Are Trying To Pass Off The Trip To Grandma\'s House As This Year\'s Family Vacation')

features = vectorizer.fit_transform(data)

features_nd = features.toarray()

In [32]:
X_train = features_nd[:len(features_nd) - 1]
X_test = features_nd[-1:]

log_model = log_model.fit(X=X_train,y=data_labels)

In [33]:
ypred = log_model.predict(X_test)
yprob = log_model.predict_proba(X_test)

# Keep only the probability of the predicted (more likely) class.
yprob = max(yprob[0])

In [40]:
out = ypred[0] + " - (" + ('%.2f' % (yprob*100)) + "% probability)"

print(out)
ind = features_nd.tolist().index(X_test[0].tolist())
print(data[ind].strip())

non-sarcastic - (53.94% probability)
Pathetic: Mom And Dad Are Trying To Pass Off The Trip To Grandma's House As This Year's Family Vacation
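
Refitting the vectorizer and the model for every new headline works, but it repeats all of the training. A sketch of an alternative, assuming the fitted vectorizer and model are kept around: `transform()` maps new text onto the existing vocabulary, so the trained model can be reused as is. The tiny training set below is made up, and LogisticRegression is fitted on the sparse matrix directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, for illustration only.
toy_data = [
    "area man wins argument with himself",
    "local dog elected mayor of nothing",
    "senate passes budget bill",
    "stocks close higher on friday",
]
toy_labels = ["sarcastic", "sarcastic", "non-sarcastic", "non-sarcastic"]

toy_vectorizer = CountVectorizer(analyzer="word", lowercase=False)
toy_model = LogisticRegression(solver="lbfgs")
toy_model.fit(toy_vectorizer.fit_transform(toy_data), toy_labels)

# transform() reuses the fitted vocabulary; words it has never seen are dropped.
new_features = toy_vectorizer.transform(["Area Man wins budget argument"])
print(toy_model.predict(new_features)[0])
print(toy_model.predict_proba(new_features)[0])
```

Because the vectorizer uses lowercase = False, "Area" and "Man" do not match the lowercase "area" and "man" in the vocabulary and are ignored. The same thing may be happening to the capitalized headline above, which could be part of why its probability sits so close to 50%, though we have not verified that.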
