### Naive Bayes Bag of Words

Let's start with the imports and data loading.

In [1]:
# imports
from sklearn.model_selection import train_test_split as split

from src.data_util import load_data
from src.naive_bayes import NaiveBayesClassifier

In [2]:
# load the data
headlines = load_data("../data/dataset.conllu")

# split into training, validation, and test sets (70% / 15% / 15%)
SEED = 42
train_headlines, other_headlines = split(headlines, test_size=0.3, random_state=SEED)
val_headlines, test_headlines = split(other_headlines, test_size=0.5, random_state=SEED)
print(f"Number of headlines for training, validation, and test is {len(train_headlines)}, {len(val_headlines)}, and {len(test_headlines)} resp.")

Number of headlines for training, validation, and test is 20033, 4293, and 4293 resp.
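
As a quick sanity check on these sizes, the two-stage split can be reproduced by hand. This is a sketch under the assumption that `train_test_split` rounds the held-out part up (`ceil(test_size * n_samples)`) and gives the remainder to the other part; the total is simply the sum of the three sizes printed above.

```python
import math

# Total number of headlines, taken from the three sizes printed above.
n_total = 20033 + 4293 + 4293

# First split: 30% held out (rounded up), the rest is training data.
n_other = math.ceil(0.3 * n_total)
n_train = n_total - n_other

# Second split: half of the held-out part becomes the test set,
# the remainder becomes the validation set.
n_test = math.ceil(0.5 * n_other)
n_val = n_other - n_test

print(n_train, n_val, n_test)
```

So the two calls amount to a 70% / 15% / 15% split.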


Now we load our Naive Bayes model from the source file `src/naive_bayes.py` and fit it to our train data.

Our model is implemented with `MultinomialNB` from `sklearn.naive_bayes`.
Its features can be unigrams, bigrams, trigrams, etc., or any range of them, as specified by the argument `ngram_range`.
For example, the range `(1, 3)` tells the model to consider all unigrams, bigrams, and trigrams as the features to count.

The model first counts all features with the help of `CountVectorizer` from `sklearn.feature_extraction.text`.
We count the features in each sentence of a headline separately and then add the counts up, so that no ngram we count spans two different sentences.
Then, we pass the feature counts to the Naive Bayes model.
Finally, we can get predictions from the model.
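
The per-sentence counting step can be sketched in plain Python as follows. The helpers `ngrams` and `count_headline` are hypothetical names for this illustration and are not taken from `src/naive_bayes.py`; the second sentence of the example headline is made up.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """Yield every n-gram with n_min <= n <= n_max as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_headline(sentences, ngram_range=(1, 2)):
    """Count n-grams in each sentence separately, then add the counts up."""
    total = Counter()
    for sentence in sentences:
        total.update(ngrams(sentence, *ngram_range))
    return total


# A headline with two already tokenized sentences.
headline = [["surge", "soda", "is", "back"], ["fans", "rejoice"]]
counts = count_headline(headline)

# "soda is" lies inside one sentence; "back fans" would span two sentences.
print(counts["soda is"], counts["back fans"])
```

Because each sentence is processed on its own, a bigram such as `back fans` that would cross the sentence boundary is never counted.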

Let's start by training the simplest model, i.e. only unigrams as features.

In [3]:
# fit the Naive Bayes Bag of Words model to training data
naive_bayes = NaiveBayesClassifier(ngram_range=(1, 1))
naive_bayes.fit(train_headlines)

100%|██████████| 20033/20033 [00:11<00:00, 1796.08it/s]


...and here are the results on the training set and the test set respectively:

In [4]:
# test on training data
fp, fn = naive_bayes.test(train_headlines)

100%|██████████| 20033/20033 [00:03<00:00, 5208.47it/s]

               precision    recall  f1-score   support

Non-sarcastic       0.92      0.94      0.93     10530
    Sarcastic       0.93      0.91      0.92      9503

     accuracy                           0.92     20033
    macro avg       0.92      0.92      0.92     20033
 weighted avg       0.92      0.92      0.92     20033






In [5]:
# test on test data and get false positives and false negatives
fp, fn = naive_bayes.test(test_headlines)

100%|██████████| 4293/4293 [00:00<00:00, 5176.62it/s]

               precision    recall  f1-score   support

Non-sarcastic       0.84      0.86      0.85      2237
    Sarcastic       0.84      0.83      0.83      2056

     accuracy                           0.84      4293
    macro avg       0.84      0.84      0.84      4293
 weighted avg       0.84      0.84      0.84      4293






We have around 92% for all metrics on training and around 84% for test.
Pretty impressive for such a simple model!
Of course, we expect the results to be significantly worse when the model is tested on other datasets.
The test set we have is still very similar to the training set, since the whole dataset consists of news headlines from only two sources, each with its own unique style.
For the next milestone, we shall test the model on another dataset as well.

Now let's see a few false positives and a few false negatives:

In [6]:
N = 10
print("\n--- Some false negatives ---")
for f in fn[:N]:
    print(f[0].metadata["text"])
print("\n--- Some false positives ---")
for f in fp[:N]:
    print(f[0].metadata["text"])


--- Some false negatives ---
fist-pumping jared kushner leaves jerusalem embassy refreshed and ready to solve next global crisis
australian parliament gathers to discuss dwindling hemsworth reserves
desperate hillary to obama: 'next vote wins'
supreme court upholds bill of rights in 5-4 decision
fat kid just wants to watch you guys play
mayor daley's son appointed head of illinois nepotist party
fbi panicking after learning encrypted national security communications may have been intercepted by trump administration
rookie justice gorsuch assigned to supreme court overnight shift
supporters praise trump for upholding traditional american value of supporting murderous dictators for political gain
giuliani puts odds of trump-mueller interview at 50-65

--- Some false positives ---
20 struggles every tall girl knows to be true
surge soda is back!
cupid cop gave out roses, cards on valentine's day instead of tickets
early apple computer sells for almost $1 million at auction
airasia search

Now let's investigate what might have caused these by examining model weights.

In [7]:
naive_bayes.show_word_weights(fn[9])

          word     sarcastic non-sarcastic          diff
      giuliani         -9.40         -9.56          0.15
          puts         -7.24         -8.02          0.78
          odds        -10.61         -9.81         -0.80
            of         -3.83         -4.27          0.44
        *trump         -6.15         -4.83         -1.31
             -         -6.47         -6.17         -0.30
       mueller         -8.15         -9.11          0.96
     interview         -9.00         -8.04         -0.96
            at         -5.52         -5.64          0.12
            50         -8.93         -8.26         -0.68
             -         -6.47         -6.17         -0.30
            65        -10.61         -9.96         -0.65



We can see that the word `trump` has a much larger weight for the non-sarcastic label (a difference of -1.31 in the table above).
This agrees with our analysis in milestone 1 regarding the most common lemmas: the word `trump` is far more frequent in non-sarcastic headlines.
In fact, it is the most common lemma in non-sarcastic headlines, where it appears about three times as often as the second most frequent word.
The lemma `trump` still appears a lot in sarcastic headlines, where it is the second most common lemma.
However, it appears about four times as often in non-sarcastic headlines.
This gives the Bag of Words model a hard time detecting sarcastic headlines that contain this name.
The presence of the word `trump` should not have much bearing on whether a headline is sarcastic or not.
Therefore, this is a problem specific to this dataset.

Now let's take a look at the most decisive words, along with their weights:

In [8]:
naive_bayes.show_decisive_words(n=10)

([(-3.713402245211233, 'queer'),
  (-3.665774196221978, 'huffpost'),
  (-3.448709690984151, 'colbert'),
  (-3.3861893340028164, 'kardashian'),
  (-3.3194979595041456, '2017'),
  (-3.3194979595041456, 'jenner'),
  (-3.3194979595041456, 'lgbtq'),
  (-3.1302559598656163, 'hawaii'),
  (-3.0876963454468207, 'roundup'),
  (-3.0876963454468207, 'trans')],
 [(3.26667369535053, 'per'),
  (3.343634736486658, 'asshole'),
  (3.3800023806575323, 'coworker'),
  (3.3800023806575332, 'unable'),
  (3.5741583950984896, 'onion'),
  (3.808997986175891, 'fucking'),
  (3.843287059654523, 'area'),
  (3.8979454721923874, 'clearly'),
  (4.15867173465564, 'shit'),
  (4.454517117746582, 'fuck')])

First, we can see that, like the word `trump`, the word `donald` is very decisive in classifying a headline as non-sarcastic.

Then, let's move on to the word `three`.
This word is somewhat decisive for sarcastic headlines, with a weight difference of 0.61.
Our hypothesis is that `three` is a good round number to reach for when people make up fake sentences.
And at least in the case of our source for sarcastic headlines (The Onion), most are fake.
Therefore, this word appears a lot.
However, unlike the previous word, this does not seem to be an artifact.
This seems to be a genuine indication of something sarcastic (something fake).
Let's take a look at some cases where this word was used in a sarcastic headline:

* area dad figures he's got at least *three* more months of screwing around before son gains ability to form long-term memories
* salad rendered unhealthy in *three* steps
* '97 neons to come in *three* hideous new colors
* presence of *three* round objects triggers juggling reflex in local man
* man always *three* ingredients away from making pancakes

It makes sense that in real headlines, we would see fewer occurrences of this number.
When trying to compose a fake sentence, the number `three` is often the most natural choice:
*three* more months, *three* steps, *three* colors, *three* objects, *three* ingredients.
`two` might be too few, and `four` and above can feel like too many.

Moving on, we have other words that are more frequently used in sarcastic headlines, like the words `fuck` and `shit`.

And then, there are words like `nation` that The Onion uses frequently in its sarcastic headlines.
In fact, `nation` is the most common word in our sarcastic headlines.
It typically appears in headlines of this style:

* *nation*'s sane people to *nation*'s insane people: 'please stop shooting us'
* *nation*'s poor bastards never even saw it coming
* backup health care plan involves *nation* sharing one big jar of ointment
* report: north dakota leads *nation* in parking availability
* *nation*'s shark experts: 'you could've had this job'

Similar joke patterns can be seen with some other words, e.g. with `dad`, `study`, and `local`:
* *study*: retired *dads* busier than ever
* tech is the future, reports *local* *dad*
* area *dad* off to bad start with waitress
* groundbreaking *study* finds gratification can be deliberately postponed
* congress votes to intervene in *local* wedding
* *local* grandmother feared dead after appearing in woman's profile picture

Now let's run the model again, this time counting bigrams as the features.

In [9]:
# bigrams only
naive_bayes = NaiveBayesClassifier(ngram_range=(2, 2))
naive_bayes.fit(train_headlines)
# report on training data, then on test data (fp and fn keep the test-set errors)
fp, fn = naive_bayes.test(train_headlines)
fp, fn = naive_bayes.test(test_headlines)

100%|██████████| 20033/20033 [00:45<00:00, 435.85it/s]
100%|██████████| 20033/20033 [00:16<00:00, 1238.87it/s]


               precision    recall  f1-score   support

Non-sarcastic       0.99      1.00      1.00     10530
    Sarcastic       1.00      0.99      0.99      9503

     accuracy                           1.00     20033
    macro avg       1.00      1.00      1.00     20033
 weighted avg       1.00      1.00      1.00     20033



100%|██████████| 4293/4293 [00:03<00:00, 1253.55it/s]

               precision    recall  f1-score   support

Non-sarcastic       0.76      0.87      0.81      2237
    Sarcastic       0.83      0.70      0.76      2056

     accuracy                           0.79      4293
    macro avg       0.80      0.79      0.79      4293
 weighted avg       0.80      0.79      0.79      4293






And finally, let's run the model with unigrams, bigrams, and trigrams all counted as features.

In [10]:
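# unigrams, bigrams, and trigrams together
naive_bayes = NaiveBayesClassifier(ngram_range=(1, 3))
naive_bayes.fit(train_headlines)
# report on training data, then on test data (fp and fn keep the test-set errors)
fp, fn = naive_bayes.test(train_headlines)
fp, fn = naive_bayes.test(test_headlines)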

100%|██████████| 20033/20033 [01:55<00:00, 173.36it/s]
100%|██████████| 20033/20033 [00:38<00:00, 513.93it/s]


               precision    recall  f1-score   support

Non-sarcastic       1.00      1.00      1.00     10530
    Sarcastic       1.00      1.00      1.00      9503

     accuracy                           1.00     20033
    macro avg       1.00      1.00      1.00     20033
 weighted avg       1.00      1.00      1.00     20033



100%|██████████| 4293/4293 [00:08<00:00, 499.40it/s]

               precision    recall  f1-score   support

Non-sarcastic       0.84      0.89      0.86      2237
    Sarcastic       0.87      0.81      0.84      2056

     accuracy                           0.85      4293
    macro avg       0.85      0.85      0.85      4293
 weighted avg       0.85      0.85      0.85      4293






We can see that both of these models overfit the training set almost completely, and their decisions likely rely heavily on artifacts in our small dataset.
On the test set, bigrams alone perform worse than the unigram model we saw at the beginning (accuracy 0.79 vs. 0.84).
However, unigrams, bigrams, and trigrams together give a very slight improvement (accuracy 0.85).