# Baseline performance on New Data - Tweets
----
To verify whether our baseline models actually learn to detect sarcasm, we collected a second sarcasm-detection dataset that the models have never seen. Like the headlines, this new dataset consists of short-form content. Unlike headlines, however, tweets are written not by professionals but by individuals expressing their feelings and opinions in an informal style. The topics may differ as well, since tweets do not necessarily comment on current news.

Despite these differences, we test our best-performing baseline model, the Naive Bayes classifier with uni-, bi-, and trigrams, on the tweets to analyze its behavior and to compare its metrics with those on the test set of the headlines data it was trained on.
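
As a rough illustration of what uni-, bi-, and trigram features look like, the sketch below enumerates all n-grams of a token list. The helper `extract_ngrams` is hypothetical and not part of `src.naive_bayes`; how `NaiveBayesClassifier` handles `ngram_range` internally is an assumption here.

```python
def extract_ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams (as tuples) with n in the inclusive ngram_range.

    Hypothetical helper for illustration only; the real feature extraction
    in src.naive_bayes may differ.
    """
    min_n, max_n = ngram_range
    return [
        tuple(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# three tokens yield 3 unigrams, 2 bigrams, and 1 trigram
example_tokens = ["a", "caffeine", "addiction"]
example_ngrams = extract_ngrams(example_tokens)
print(example_ngrams)
```

Longer n-grams capture more of the phrasing, which is also why such features can end up encoding a publication's writing style.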

A further motivation for adding an extra dataset was the observation that the Onion headlines, which represent the sarcastic class, have a very specific writing style that is arguably funny but not necessarily sarcastic in every case. Since all examples of the `sarcastic` class come from the Onion, we wondered how much of the model's performance comes down to that specific style rather than to sarcasm.

In [1]:
# imports
from sklearn.model_selection import train_test_split as split

from src.data_util import load_data
from src.naive_bayes import NaiveBayesClassifier

import numpy as np

In [15]:
# load the data
headlines = load_data("../data/headline_data/headlines.conllu")

# split into training and test sets
SEED = 42
train_headlines, other_headlines = split(headlines, test_size=0.3, random_state=SEED)
val_headlines, test_headlines = split(other_headlines, test_size=0.5, random_state=SEED)
print(
    f"Number of headlines for training, validation, and test is "
    f"{len(train_headlines)}, {len(val_headlines)}, and {len(test_headlines)} resp."
)

Number of headlines for training, validation, and test is 20033, 4293, and 4293 resp.


In [3]:
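# train the Naive Bayes baseline with uni-, bi-, and trigrams
naive_bayes = NaiveBayesClassifier(ngram_range=(1, 3))
naive_bayes.fit(train_headlines)
# evaluate on the training set first, then on the held-out test set
# (the second call overwrites fp and fn)
fp, fn = naive_bayes.test(train_headlines)
fp, fn = naive_bayes.test(test_headlines)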

100%|██████████| 20033/20033 [05:31<00:00, 60.41it/s]
100%|██████████| 20033/20033 [01:29<00:00, 224.38it/s]


               precision    recall  f1-score   support

Non-sarcastic       1.00      1.00      1.00     10530
    Sarcastic       1.00      1.00      1.00      9503

     accuracy                           1.00     20033
    macro avg       1.00      1.00      1.00     20033
 weighted avg       1.00      1.00      1.00     20033



100%|██████████| 4293/4293 [00:16<00:00, 266.28it/s]


               precision    recall  f1-score   support

Non-sarcastic       0.84      0.89      0.86      2237
    Sarcastic       0.87      0.81      0.84      2056

     accuracy                           0.85      4293
    macro avg       0.85      0.85      0.85      4293
 weighted avg       0.85      0.85      0.85      4293



In [5]:
# load the tweets data
tweets = load_data("../data/tweets_data/tweets.conllu")

In [6]:
# split the tweets by class label ("1" = sarcastic, "0" = non-sarcastic)
tweets_sarcastic = [tweet for tweet in tweets if tweet[0].metadata['class'] == "1"]
tweets_non_sarcastic = [tweet for tweet in tweets if tweet[0].metadata['class'] == "0"]

In [7]:
len(tweets_sarcastic)

867

In [8]:
len(tweets_non_sarcastic)

2601

In [9]:
tweets[0][0]

TokenList<The, only, thing, I, got, from, college, is, a, caffeine, addiction, metadata={text: "The only thing I got from college is a caffeine addiction", headline_id: "1", sent_id: "0", class: "1"}>

In [12]:
# balance the dataset: undersample the non-sarcastic tweets to the size of the sarcastic class
sampled_indices = np.random.choice(len(tweets_non_sarcastic), size=len(tweets_sarcastic), replace=False)
tweets_non_sarcastic_sample = [tweets_non_sarcastic[idx] for idx in sampled_indices]

In [13]:
sampled_tweets = tweets_sarcastic + tweets_non_sarcastic_sample
len(sampled_tweets)

1734
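
The undersampling step draws a random subset of the majority class, so the exact sample can change between runs. A minimal sketch of a reproducible variant is given below. It assumes a fixed seed is acceptable, and the helper `undersample_to_balance` and the toy lists are hypothetical names introduced only for illustration.

```python
import numpy as np


def undersample_to_balance(minority, majority, seed=42):
    """Return the minority items plus an equally sized random sample of the majority.

    Hypothetical helper; a seeded generator makes the sample reproducible.
    """
    rng = np.random.default_rng(seed)
    # draw len(minority) distinct indices into the majority class
    indices = rng.choice(len(majority), size=len(minority), replace=False)
    return list(minority) + [majority[i] for i in indices]


# toy stand-ins for the sarcastic and non-sarcastic tweets
toy_sarcastic = ["s1", "s2"]
toy_non_sarcastic = ["n1", "n2", "n3", "n4", "n5", "n6"]
toy_balanced = undersample_to_balance(toy_sarcastic, toy_non_sarcastic)
print(len(toy_balanced))
```

With the same seed, repeated calls return the same balanced sample, which keeps the reported metrics comparable across runs.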

In [14]:
# test on extra data
fp, fn = naive_bayes.test(sampled_tweets)

100%|██████████| 1734/1734 [00:06<00:00, 286.85it/s]

               precision    recall  f1-score   support

Non-sarcastic       0.50      0.86      0.63       867
    Sarcastic       0.51      0.15      0.23       867

     accuracy                           0.50      1734
    macro avg       0.51      0.50      0.43      1734
 weighted avg       0.51      0.50      0.43      1734






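The Naive Bayes baseline achieved satisfactory performance on the news headlines (0.85 accuracy on the test set). On the new dataset, however, the metrics are much worse: accuracy drops to 0.50, which is chance level on a balanced sample, and recall on the sarcastic class is only 0.15. This might suggest that we are not actually learning to detect sarcasm but rather to distinguish the Onion writing style from the HuffPost writing style.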
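
To see how the numbers in the tweets report fit together, the sketch below recomputes the per-class F1 scores and the macro average from the rounded precision and recall values printed above. Because the inputs are rounded to two decimals, the results are approximate; the variable names are introduced here only for illustration.

```python
# per-class precision and recall, copied from the tweets classification report
report = {
    "Non-sarcastic": {"precision": 0.50, "recall": 0.86},
    "Sarcastic": {"precision": 0.51, "recall": 0.15},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {label: f1(m["precision"], m["recall"]) for label, m in report.items()}

# macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# with equal class support, accuracy equals the mean of the per-class recalls
mean_recall = sum(m["recall"] for m in report.values()) / len(report)

for label, score in f1_scores.items():
    print(f"{label}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.2f}, mean recall = {mean_recall:.3f}")
```

The mean recall sits at roughly 0.5 because the model labels most tweets as non-sarcastic: high recall on one class is paid for by very low recall on the other.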

Sarcasm is hard to define exactly, and we ourselves had trouble identifying it during the error analysis after milestone 1. As we discussed after our final presentation, the task on the headlines data might instead come down to detecting humor.