<div class="alert alert-block alert-info">
Given headlines labelled as real (0) or clickbait (1), the task is to classify each headline as real or clickbait. The required classifier is a Naive Bayes classifier. This notebook contains an implementation of Naive Bayes, as well as a second, much simpler classifier that does a good job given its simplicity. Implementation details and comments are included both in the code and in markdown cells.
</div>

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
import inflect
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\palaz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\palaz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\palaz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
df = pd.read_csv("data.csv")

In [3]:
df

Unnamed: 0,Headline,Clickbait
0,"Cuba Eliminates Mexico, 7-4, at Classic",0
1,This Is What Space Travel Really Looks Like,1
2,Which Mario Are You Based On Your Zodiac Sign,1
3,European oil companies stop trade with Iran,0
4,This Dachshund Is Determined To Sleep With His...,1
...,...,...
28795,44 Perfect Songs To Listen To While You Write,1
28796,Egypt announces Internet crime initiative,0
28797,Australian rules football: West Gippsland Latr...,0
28798,Yankees Grab a Share of First Place as Everyon...,0


In [4]:
class basic_classifier:
    
    #This classifier has two instance attributes, both set in fit():
    #words_clickbait is a dictionary with string keys and integer values. Keys are words, preprocessed or not
    #(check below for more information on that), and values are their frequencies.
    #Words in this dictionary come from clickbait headlines.
    
    #words_real is the same thing for real headlines.
    def __init__(self):
        pass
    
    def fit(self, X, y):
        self.words_clickbait = {}
        self.words_real = {}

        for i, headline in enumerate(X):
            index = X.index[i]
            if y[index] == 0:
                for word in headline:
                    if word in self.words_real.keys():
                        self.words_real[word] += headline.count(word)
                    else:
                        self.words_real[word] = headline.count(word)
            elif y[index] == 1:
                for word in headline:
                    if word in self.words_clickbait.keys():
                        self.words_clickbait[word] += headline.count(word)
                    else:
                        self.words_clickbait[word] = headline.count(word)
    
    #Get the frequency of a word from the respective dictionary.
    #I used this to check whether the calculations were correct.
    def get_word_frequency(self, word, clickbait):
        if clickbait:
            for key in self.words_clickbait:
                if key == word:
                    return self.words_clickbait.get(key)
        else:
            for key in self.words_real:
                if key == word:
                    return self.words_real.get(key)
    
    #To get useful words, I used these two functions below.
    #It basically gets the most frequent words from both of the dictionaries.
    def most_freq_words_priv(self, value_number, clickbait = False):
        return_dict = {}
        
        if clickbait:
            most_freqs = sorted(self.words_clickbait, key=self.words_clickbait.get, reverse=True)[:value_number]
            for word in most_freqs:
                return_dict[word] = self.get_word_frequency(word, True)
            return return_dict
        else: 
            most_freqs = sorted(self.words_real, key=self.words_real.get, reverse=True)[:value_number]
            for word in most_freqs:
                return_dict[word] = self.get_word_frequency(word, False)
            return return_dict
    
    
    def get_n_most_frequent_words(self, first_n):
        freq_dict = {
            "clickbait": {},
            "real": {}
        }
        
        freq_dict["clickbait"] = self.most_freq_words_priv(first_n, True)
        freq_dict["real"] = self.most_freq_words_priv(first_n, False)
        
        return freq_dict
    
    #Same design with Assignment1 actually, just predict one then iterate over X_test to predict all instances.
    
    #Prediction is based on:
    #for each word in the headline, if this word is more frequent on real headlines, then headline's num_real gets +1 point
    #and if this word is more frequent on clickbait headlines, then headline's num_clickbait gets +1 point.
    #Whichever of these num_real and num_clickbait is bigger, then that is the prediction of the headline.
    def predict_one_headline(self, headline):  
        num_clickbait = 0
        num_real = 0
        
        for word in headline:
            word = word.lower()
            
            clickbait_count = self.words_clickbait.get(word, 0)
            real_count = self.words_real.get(word, 0)
            
            if real_count > clickbait_count:
                num_real += 1
            else:
                num_clickbait += 1
        
        if num_clickbait < num_real:
            return 0
        else:
            return 1
    
    def predict(self, X_test):
        y_pred = []
        for headline in X_test:
            y_pred.append(self.predict_one_headline(headline))
        return y_pred

In [5]:
def accuracy(y, y_pred):
    
    true_count = 0
    for index, el in enumerate(y):
        if y_pred[index] == el:
            true_count += 1
    return 100 * true_count / len(y)

I added the factor of 100 later, so some of the notes below still quote accuracy as a fraction (e.g. 0.9) rather than as a percentage. Sorry for the inconvenience, but it takes too long to rerun the whole notebook.

In [6]:
def tokens_to_sentence(words):
    return ' '.join(words)

### Basic prediction with no preprocessing

In [7]:
df["Headline"] = df["Headline"].apply(word_tokenize)

In [8]:
X = df["Headline"]
y = df["Clickbait"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42, stratify=y)

In [9]:
classifier_basic = basic_classifier()
classifier_basic.fit(X_train, y_train)

In [10]:
y_pred = classifier_basic.predict(X_test)

In [11]:
accuracy(y_test, y_pred)

67.70833333333333

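I wasn't expecting any better, since all I did here was tokenize the headlines. Note also that `predict_one_headline` lowercases every word while `fit` stores words in their original case, so in this run a capitalized word such as "You" is looked up as "you" and misses the counts stored under "You".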
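
For reference, the voting rule used by `basic_classifier` can be illustrated on a toy example. This is a minimal sketch rather than the class above: the two `Counter` objects and their counts are made up for illustration, `vote` is a hypothetical helper name, and the lowercasing step is left out. Ties (including words never seen in training) go to clickbait, as in `predict_one_headline`.

```python
from collections import Counter

# Made-up word counts standing in for words_real / words_clickbait.
words_real = Counter({"in": 50, "US": 30, "to": 40})
words_clickbait = Counter({"You": 60, "to": 35, "Things": 20})

def vote(headline_tokens):
    """Return 1 (clickbait) or 0 (real) by per-word majority vote."""
    num_real = 0
    num_clickbait = 0
    for word in headline_tokens:
        # Counter returns 0 for words it has never seen.
        if words_real[word] > words_clickbait[word]:
            num_real += 1
        else:
            # Ties and unseen words count towards clickbait.
            num_clickbait += 1
    return 0 if num_clickbait < num_real else 1

print(vote(["Things", "You", "Need", "to", "Know"]))
print(vote(["US", "troops", "in", "to"]))
```

Because every unseen word is a vote for clickbait, a headline made mostly of out-of-vocabulary words is labelled clickbait by this rule.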

In [12]:
classifier_basic.get_n_most_frequent_words(20)

{'clickbait': {'You': 2913,
  'The': 2671,
  'To': 1621,
  'A': 1440,
  'Your': 1374,
  'Of': 1224,
  "'s": 1177,
  '``': 1146,
  "''": 1145,
  'Are': 917,
  'In': 908,
  'That': 851,
  'This': 841,
  'And': 781,
  'Is': 767,
  'On': 634,
  'For': 617,
  'What': 588,
  'Will': 573,
  'Things': 479},
 'real': {'in': 2217,
  ',': 1952,
  'to': 1743,
  'of': 1351,
  'for': 756,
  'the': 674,
  'a': 629,
  'on': 561,
  'and': 550,
  'at': 461,
  "'s": 349,
  ':': 345,
  'US': 299,
  '.': 283,
  'New': 274,
  'by': 254,
  'after': 232,
  "'": 220,
  'as': 203,
  'Is': 178}}

Without removing stopwords:

For clickbait headlines, "You", "Your" and "What" are the key words that make a huge difference: they are very frequent in these headlines, and they are also exactly the kind of words that clickbait writers use to address the reader.

For real headlines, "US", "New" and "in" are the words that make a huge difference.

What matters is not only how frequent a word is, but also how much its frequency differs between real and clickbait headlines, because the prediction compares the two counts for each word.
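
To make the point about frequency differences concrete, here is a small sketch that ranks words by the gap between their clickbait and real counts. The counts are made up for illustration and are not taken from the dataset, and `count_gap` is a hypothetical helper.

```python
# Made-up counts for illustration only (not taken from the dataset).
clickbait_counts = {"you": 300, "the": 250, "in": 90, "new": 30}
real_counts = {"you": 5, "the": 70, "in": 220, "new": 55}

def count_gap(word):
    """Clickbait count minus real count; positive means 'more clickbait'."""
    return clickbait_counts.get(word, 0) - real_counts.get(word, 0)

vocabulary = sorted(set(clickbait_counts) | set(real_counts))
ranked = sorted(vocabulary, key=count_gap, reverse=True)

for word in ranked:
    print(f"{word:>4}  gap={count_gap(word):+d}")
```

In this toy data "the" is almost as frequent as "you" among clickbait headlines, but its gap is much smaller because real headlines use it too.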

### Preprocessed data with basic classification

In [13]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

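def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    #Build the stopword set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words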

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

In [14]:
X = X.apply(to_lowercase)
X = X.apply(remove_punctuation)
X = X.apply(replace_numbers)
X = X.apply(remove_stopwords)
X = X.apply(lemmatize_verbs)

The code is not here because I deleted it, but replacing numbers has a huge effect on classification. I guess it fixes a lot of messy parts of the input data. Replacing numbers increased accuracy from 0.49 to 0.88.

In [15]:
#Write the preprocessed tokens back into the DataFrame (X is already a pd.Series, so this may be unnecessary)
df["Headline"] = X

In [16]:
df

Unnamed: 0,Headline,Clickbait
0,"[cuba, eliminate, mexico, seventy-four, classic]",0
1,"[space, travel, really, look, like]",1
2,"[mario, base, zodiac, sign]",1
3,"[european, oil, company, stop, trade, iran]",0
4,"[dachshund, determine, sleep, teddy, bear]",1
...,...,...
28795,"[forty-four, perfect, songs, listen, write]",1
28796,"[egypt, announce, internet, crime, initiative]",0
28797,"[australian, rule, football, west, gippsland, ...",0
28798,"[yankees, grab, share, first, place, everyone,...",0


In [17]:
X = df["Headline"]
y = df["Clickbait"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

In [18]:
classifier_basic_ = basic_classifier()
classifier_basic_.fit(X_train, y_train)

In [19]:
y_pred_ = classifier_basic_.predict(X_test)

In [20]:
accuracy(y_test, y_pred_)

91.61111111111111

To be honest, this is a great score for this classifier given that only preprocessing was added. The classifier is little more than word counting, so I did not expect something this unsophisticated to reach 0.91 accuracy. This is surprising.

One of the most important preprocessing steps here was the replace_numbers function. I guess it corrects a lot of things, because without it I get about 0.47 accuracy from basic_classifier, even with the other preprocessing steps included.
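
One visible consequence of the step order can be shown with a small sketch: because `remove_punctuation` runs before `replace_numbers`, a score such as `7-4` first becomes `74`, which then passes `isdigit()` and gets spelled out (hence `seventy-four` in the table above). The snippet only reproduces the regex and the `isdigit()` check and does not call `inflect`. `strip_punctuation` is a hypothetical stand-in for `remove_punctuation`, and the token list is written by hand on the assumption that the tokenizer keeps `7-4` as one token.

```python
import re

def strip_punctuation(words):
    """Same regex as remove_punctuation: drop non-word, non-space characters."""
    cleaned = [re.sub(r"[^\w\s]", "", word) for word in words]
    return [word for word in cleaned if word != ""]

# Hand-written tokens for the first headline (an assumption, see above).
tokens = ["cuba", "eliminates", "mexico", ",", "7-4", ",", "at", "classic"]

before = [word for word in tokens if word.isdigit()]
after = [word for word in strip_punctuation(tokens) if word.isdigit()]

print(before)  # digit-only tokens before punctuation removal
print(after)   # digit-only tokens after punctuation removal
```

This does not by itself explain the accuracy jump; it only shows which tokens `replace_numbers` gets to see.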

In [21]:
classifier_basic_.get_n_most_frequent_words(20)

{'clickbait': {'things': 719,
  'make': 706,
  'people': 657,
  'know': 591,
  'time': 572,
  'twenty-one': 455,
  'seventeen': 453,
  'actually': 403,
  'base': 397,
  'nt': 389,
  'get': 389,
  'nineteen': 380,
  'two thousand and fifteen': 359,
  'need': 347,
  'like': 346,
  'best': 328,
  'look': 305,
  'new': 295,
  'life': 272,
  'fifteen': 260},
 'real': {'us': 785,
  'new': 548,
  'kill': 539,
  'win': 311,
  'die': 288,
  'say': 283,
  'two': 239,
  'dead': 238,
  'find': 227,
  'uk': 223,
  'president': 216,
  'bomb': 203,
  'obama': 194,
  'crash': 190,
  'first': 186,
  'australian': 185,
  'state': 176,
  'world': 175,
  'court': 171,
  'plan': 171}}

With stopwords removed:

For clickbait headlines, "things", "know" and "base" are the high-impact words, I guess. If you read clickbait headlines, these three words show up again and again. It is surprising to see numbers, and specific ones at that: 21, 17, 19, 15. I guess these are the most "clickbait" numbers. Just think about it: "17 Things You Don't Know About Koalas!". 2015 might be there because of the data: if the data was collected during 2015, then you may see things like "What will your Zodiac sign be in 2015?", I guess...

For real headlines, country names make perfect sense. Since these are news headlines in general, you expect to see lots of country names. For my three specific words I will choose "us", "kill" and "die", because the world is not a good place...

### Naive Bayes Part
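
As a reading aid, the decision rule implemented by the `naive_bayes` class in the next cell can be written out as follows. The symbols are introduced here for illustration and do not appear in the code: $x_{d,w}$ is the count of word $w$ in document $d$ (one cell of the `CountVectorizer` matrix), $\alpha$ is the constant that `fit` adds to every cell of the training matrix, and $c \in \{0, 1\}$ is the class.

```latex
\begin{aligned}
N_{c,w} &= \sum_{d \,:\, y_d = c} \left( x_{d,w} + \alpha \right),
\qquad N_c = \sum_{w} N_{c,w}, \\
P(c) &= \frac{N_c}{N_0 + N_1}, \\
\hat{y}(d) &= \arg\max_{c \in \{0,1\}}
\left[ \log P(c) + \sum_{w} x_{d,w} \, \log \frac{N_{c,w}}{N_c} \right].
\end{aligned}
```

Two details follow directly from the code: the prior is the class's share of the smoothed word counts rather than its share of documents, and a tie between the two scores is resolved in favour of clickbait.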

In [22]:
class naive_bayes:
    
    def __init__(self, alpha=0):
        self.alpha = alpha
    
    def get_params(self, deep=True):
        return {"alpha": self.alpha}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
    
    
    #Priors, total word counts of clickbait and real headlines, and each word's count in clickbait and real headlines are calculated here.
    #List comprehensions are way, and I mean WAY, faster than nested for loops and if statements, so I had to use them.
    def fit(self, X, y):
        
        #This is to avoid problems with the index.
        #Since y is a pandas.Series, its index is not reset when it is shuffled during train_test_split.
        #The calculations below use enumerate(X), so y must be indexed positionally, like X, for them to be correct.
        y = y.reset_index()
        y = y["Clickbait"]
        
        
        #Prior calculation
        X = X + self.alpha #Laplace Smoothing with alpha
        self.total_freq = np.sum(X)
        self.prior_clickbait = np.sum([np.sum(X[i]) for i, x in enumerate(X) if y[i] == 1]) / self.total_freq
        self.prior_real = np.sum([np.sum(X[i]) for i, x in enumerate(X) if y[i] == 0]) / self.total_freq
        
        #Word frequency calculation
        d = [[document for index, document in enumerate(X) if y[index] == class_] for class_ in np.unique(y)]
        self.word_freqs = [np.sum(d[class_], axis=0) for class_ in np.unique(y)]
        
        #Word counts in classes calculation
        self.number_clickbait = np.sum(self.word_freqs[1])
        self.number_real = np.sum(self.word_freqs[0])
        
        #THIS IS TOO SLOW
        #DEPRECATED
        #----------------
        #Word frequency calculation
        #self.word_freqs = np.ndarray(shape=(len(np.unique(y)), X.shape[1]), dtype=np.int64)
        #for class_ in np.unique(y):
        #    for i in range(X.shape[1]):
        #        sum_ = 0
        #        for j in np.arange(X.shape[0]):
        #            if y[j] == class_:
        #                sum_ += X[j][i]
        #        self.word_freqs[class_][i] = sum_
        #----------------
        
        self.y = y
        
    def predict(self, X_test):
        y_pred = []
        
        for document in X_test:
            pred_real = np.log(self.prior_real)
            pred_clickbait = np.log(self.prior_clickbait)
            for index, word_freq in enumerate(document):
                if word_freq != 0:
                    for i in range(word_freq):
                        pred_real += np.log(self.word_freqs[0][index] / self.number_real)
                        pred_clickbait += np.log(self.word_freqs[1][index] / self.number_clickbait)
                        
            if pred_real > pred_clickbait:
                y_pred.append(0)
            else:
                y_pred.append(1)
        
        return y_pred
    
    #This function is here for Part 3 of coding, to get the 10 most influential words.
    #However, it does not really work that way. It returns influential words, and there is an algorithm for that,
    #but I assumed there would be more than 10 words above mean+standard_deviation, and there are not.
    #So this works, but poorly.
    
    #Returns indexes; words are looked up in the turn_index_into_words function, which is outside the classifier.
    def get_mosts(self, presence_coefficient, absence_coefficient):
        
        #Wraps each class's frequency array in a list; this is a workaround for my indexing below and probably inefficient
        word_freqs_ = [[el] for el in self.word_freqs]
        
        
        mosts = {"presence_real": [],
                "absence_real": [],
                "presence_clickbait": [],
                "absence_clickbait": []}
        
        #Differences is a list with following structure:
        #[[real frequency - clickbait frequency of a word for word in self.word_freqs],
        #[clickbait frequency - real frequency of a word for word in self.word_freqs]]
        
        #How it works:
        #You get the words with the max frequency in their respective classes, and if such a word has a difference (from the differences list)
        #that is higher than mean+presence_coefficient*standard_deviation, then this word is influential through its presence.
        #Vice versa for lower than mean-absence_coefficient*standard_deviation.
        
        #I do think this is a proper idea, but it does not work as intended; I think this is due to my bad maths.
        #I mean, I used a Gaussian distribution as the model here, what more can I do?
        differences = [word_freqs_[i][j] - word_freqs_[class_][j] for j in range(len(word_freqs_[0])) for i, class_ in enumerate(reversed(np.unique(self.y)))]
        max_indexes = [np.where(word_freqs_[class_] == np.amax(word_freqs_[class_])) for class_ in np.unique(self.y)]
        max_indexes = [max_indexes[i][1] for i in range(len(max_indexes))]
        differences_stds = [np.std(x) for x in differences]
        differences_means = [np.mean(x) for x in differences]
        selected_indexes_presence = [[index for index in max_indexes[class_] if differences[class_][index] >= differences_means[class_]+presence_coefficient*differences_stds[class_]] for class_ in np.unique(self.y)]
        selected_indexes_absence = [[index for index in max_indexes[class_] if differences[class_][index] <= differences_means[class_]-absence_coefficient*differences_stds[class_]] for class_ in np.unique(self.y)]
        
        mosts["presence_real"] = selected_indexes_presence[0]
        mosts["presence_clickbait"] = selected_indexes_presence[1]
        mosts["absence_real"] = selected_indexes_absence[0]
        mosts["absence_clickbait"] = selected_indexes_absence[1]
        
        return mosts
        
        

In [23]:
#As the name suggests, turns the indexes we got from classifier.get_mosts into words of that vectorizer.
def turn_index_into_words(vectorizer, index_dict):
    #Fetch the vocabulary once; newer scikit-learn versions name this method get_feature_names_out()
    feature_names = vectorizer.get_feature_names()
    for key in index_dict:
        index_dict[key] = [feature_names[index] for index in sorted(index_dict[key])]

In [24]:
df = pd.read_csv("data.csv")

### With stopwords

In [25]:
df["Headline"] = df["Headline"].apply(word_tokenize)

X = df["Headline"]
y = df["Clickbait"]

X = X.apply(to_lowercase)
X = X.apply(remove_punctuation)
X = X.apply(replace_numbers)
X = X.apply(lemmatize_verbs)

X = X.apply(tokens_to_sentence)

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(X).toarray()

In [26]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [27]:
y = df["Clickbait"]

In [28]:
X.shape

(28800, 19079)

In [29]:
y.shape

(28800,)

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

In [31]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [32]:
parameters = {"alpha": [0.1, 0.01, 0.001, 0.0001, 0.00001]}

scoring = ["accuracy"]
grid_search = GridSearchCV(estimator=naive_bayes(), param_grid=parameters, cv=5, refit= "accuracy", scoring=scoring)

grid_search = grid_search.fit(X_train, y_train)

In [33]:
print("Best Score: ",grid_search.best_score_)
print("Best Estimator: ", grid_search.best_estimator_)
print("Best Parameters: ", grid_search.best_params_)

Best Score:  0.9713425925925927
Best Estimator:  <__main__.naive_bayes object at 0x000001C125F8FF10>
Best Parameters:  {'alpha': 1e-05}


In [34]:
y_test = y_test.reset_index()
y_test = y_test["Clickbait"]

In [35]:
y_test

0       1
1       1
2       1
3       0
4       0
       ..
7195    0
7196    1
7197    0
7198    0
7199    0
Name: Clickbait, Length: 7200, dtype: int64

In [36]:
y_pred = grid_search.best_estimator_.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.97      3599
           1       0.97      0.98      0.97      3601

    accuracy                           0.97      7200
   macro avg       0.97      0.97      0.97      7200
weighted avg       0.97      0.97      0.97      7200



In [37]:
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

[[3479  120]
 [  74 3527]]


In [38]:
words = grid_search.best_estimator_.get_mosts(1, -1)
turn_index_into_words(vectorizer, words)
words

{'presence_real': ['in'],
 'absence_real': [],
 'presence_clickbait': ['you'],
 'absence_clickbait': []}

In [39]:
words = grid_search.best_estimator_.get_mosts(5, -5)
turn_index_into_words(vectorizer, words)
words

{'presence_real': ['in'],
 'absence_real': [],
 'presence_clickbait': ['you'],
 'absence_clickbait': []}

In [40]:
words = grid_search.best_estimator_.get_mosts(10, -10)
turn_index_into_words(vectorizer, words)
words

{'presence_real': ['in'],
 'absence_real': [],
 'presence_clickbait': ['you'],
 'absence_clickbait': []}

### Without stopwords

In [41]:
X = df["Headline"]
y = df["Clickbait"]

X = X.apply(to_lowercase)
X = X.apply(remove_punctuation)
X = X.apply(replace_numbers)
X = X.apply(remove_stopwords)
X = X.apply(lemmatize_verbs)

X = X.apply(tokens_to_sentence)

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(X).toarray()

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

In [43]:
y_test = y_test.reset_index()
y_test = y_test["Clickbait"]

In [44]:
classifier_stopwords = naive_bayes(0.00001)
classifier_stopwords.fit(X_train, y_train)
y_pred_stopwords = classifier_stopwords.predict(X_test)

In [45]:
accuracy(y_test, y_pred_stopwords)

95.88888888888889

Removing stopwords actually decreased accuracy. I guess this is because stopwords like "you", "the", "your" etc. do have an impact on the classification here. This is probably a rare situation, though: in general, removing stopwords should not lower the accuracy.
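
The effect can be reproduced on a toy scale. The sketch below is a textbook multinomial Naive Bayes with add-alpha smoothing, not the `naive_bayes` class from this notebook; the training headlines, the test headline and the tiny stopword list are all made up for illustration, and `train_nb`, `log_odds` and `drop_stopwords` are hypothetical helpers.

```python
import math
from collections import Counter

# Made-up training headlines (already lowercased and tokenized).
train = [
    (["you", "will", "love", "these", "cats"], 1),
    (["things", "you", "should", "know"], 1),
    (["court", "rules", "on", "cats"], 0),
    (["president", "signs", "law"], 0),
]
STOPWORDS = {"you", "will", "these", "should", "on"}  # tiny hypothetical list

def train_nb(docs, alpha=1.0):
    """Collect per-class word counts and document counts."""
    counts = {0: Counter(), 1: Counter()}
    doc_totals = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        doc_totals[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, doc_totals, vocab, alpha

def log_odds(model, tokens):
    """log P(clickbait | tokens) - log P(real | tokens); > 0 means clickbait."""
    counts, doc_totals, vocab, alpha = model
    total1 = sum(counts[1].values()) + alpha * len(vocab)
    total0 = sum(counts[0].values()) + alpha * len(vocab)
    score = math.log(doc_totals[1] / doc_totals[0])
    for word in tokens:
        if word not in vocab:
            continue  # ignore words never seen in training
        p1 = (counts[1][word] + alpha) / total1
        p0 = (counts[0][word] + alpha) / total0
        score += math.log(p1 / p0)
    return score

def drop_stopwords(tokens):
    return [word for word in tokens if word not in STOPWORDS]

headline = ["you", "love", "cats"]  # hypothetical test headline

full_model = train_nb(train)
reduced_model = train_nb([(drop_stopwords(t), y) for t, y in train])

with_stop = log_odds(full_model, headline)
without_stop = log_odds(reduced_model, drop_stopwords(headline))
print(with_stop, without_stop)
```

Both scores are positive here, but the one computed without stopwords is smaller: dropping "you" removes evidence that pointed towards clickbait.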

In [46]:
print("Classification Report:")
print(classification_report(y_test, y_pred_stopwords))

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96      3599
           1       0.96      0.96      0.96      3601

    accuracy                           0.96      7200
   macro avg       0.96      0.96      0.96      7200
weighted avg       0.96      0.96      0.96      7200



In [47]:
conf_matrix = confusion_matrix(y_test, y_pred_stopwords)
print(conf_matrix)

[[3456  143]
 [ 153 3448]]


In [48]:
words_ = classifier_stopwords.get_mosts(1,-1)
turn_index_into_words(vectorizer, words_)
words_

{'presence_real': ['us'],
 'absence_real': [],
 'presence_clickbait': ['twenty'],
 'absence_clickbait': []}

In [49]:
words_ = classifier_stopwords.get_mosts(5,-5)
turn_index_into_words(vectorizer, words_)
words_

{'presence_real': ['us'],
 'absence_real': [],
 'presence_clickbait': ['twenty'],
 'absence_clickbait': []}

In [50]:
words_ = classifier_stopwords.get_mosts(10,-10)
turn_index_into_words(vectorizer, words_)
words_

{'presence_real': ['us'],
 'absence_real': [],
 'presence_clickbait': ['twenty'],
 'absence_clickbait': []}

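So, changing the coefficients does not change anything. Such a shame. A likely reason is that `max_indexes` only holds the index of the single most frequent word in each class, so at most one word per class can ever be selected, whatever the coefficients are. I couldn't get Part 3.1 done, but I still think the idea is a good one.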
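
For comparison, here is a minimal sketch of the mean-plus-k-standard-deviations idea applied to every word rather than only to the most frequent one. This is an assumption about what the selection was meant to do, not a fix of `get_mosts`; the counts are made up, `influential_words` is a hypothetical helper, and the population standard deviation is used to match the default of `np.std`.

```python
from statistics import mean, pstdev

# Made-up per-class word counts for illustration.
real_counts = {"us": 80, "kill": 50, "court": 20, "new": 55, "things": 1, "know": 2}
clickbait_counts = {"us": 3, "kill": 2, "court": 1, "new": 30, "things": 70, "know": 60}

def influential_words(own, other, k):
    """Words whose (own - other) count gap is at least mean + k * std of all gaps."""
    vocab = sorted(set(own) | set(other))
    gaps = {word: own.get(word, 0) - other.get(word, 0) for word in vocab}
    values = list(gaps.values())
    threshold = mean(values) + k * pstdev(values)
    return [word for word in vocab if gaps[word] >= threshold]

for k in (0.5, 1.0, 1.5):
    print(k,
          influential_words(real_counts, clickbait_counts, k),
          influential_words(clickbait_counts, real_counts, k))
```

With every word as a candidate, raising `k` shrinks the selected lists, which is the behaviour the coefficients were supposed to control.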