<h1>Sentiment Analysis</h1><br>
2023 NLP Coursework Part A. First, we import the necessary packages.

In [2]:
import nltk
import string
import numpy as np

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer

np.random.seed(42)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adame\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adame\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\adame\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<h1>Extracting the data</h1>

In [5]:
import os
def read_corpus(directory):
    files = [f for f in os.listdir(directory) if f.endswith('.txt')]
    corpus = []
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding='utf-8') as f:
            document = f.read()
            corpus.append(document)
    return corpus

In [6]:
positive_corpus = read_corpus("data/pos/")
negative_corpus = read_corpus("data/neg/")
corpus = positive_corpus + negative_corpus
positive_labels = len(positive_corpus)
negative_labels = len(negative_corpus)
corpus_length = len(corpus)

#sanity check, should be the same every time
print(positive_corpus[0])
print(negative_corpus[1])

Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without th

<h1>Feature Generation Using Ngrams</h1>

In [100]:
from nltk import word_tokenize
from nltk import ngrams

def text_to_ngrams(sentence, n, remove_stopwords=True):
    stoplist = set(stopwords.words('english')) #stop-words to remove
    if not remove_stopwords:
        stoplist = set()
    tokenised_words = [word for word in word_tokenize(sentence.lower()) if word not in stoplist and word not in string.punctuation
                       and word != "br"] 
                    #a list of tokenised words with stop-words, punctuation, and <br>s removed
    zipped_grams = ngrams(tokenised_words, n) #apply nltk's ngrams algorithm
    return list(zipped_grams)

In [8]:
sentence = "I am Ozymandias, king of kings, look upon my works ye mighty and despair"
grams = text_to_ngrams(sentence, 3)
print(grams)
for gram in grams:
    print(gram)

[('ozymandias', 'king', 'kings'), ('king', 'kings', 'look'), ('kings', 'look', 'upon'), ('look', 'upon', 'works'), ('upon', 'works', 'ye'), ('works', 'ye', 'mighty'), ('ye', 'mighty', 'despair')]
('ozymandias', 'king', 'kings')
('king', 'kings', 'look')
('kings', 'look', 'upon')
('look', 'upon', 'works')
('upon', 'works', 'ye')
('works', 'ye', 'mighty')
('ye', 'mighty', 'despair')
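
The sliding window behind the trigrams printed above can be sketched in plain Python. This is only an illustration of the idea, not NLTK's implementation; `sliding_ngrams` is a hypothetical helper name, and the token list is the stop-word-filtered sentence from the cell above.

```python
def sliding_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple (hypothetical helper)."""
    # zip over n staggered copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["ozymandias", "king", "kings", "look", "upon",
          "works", "ye", "mighty", "despair"]
trigrams = sliding_ngrams(tokens, 3)
print(trigrams[0])    # ('ozymandias', 'king', 'kings')
print(len(trigrams))  # 7, i.e. len(tokens) - n + 1
```

A document with fewer than n tokens yields an empty list, so very short reviews contribute no features at higher n.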


In [9]:
#converts the entire corpus to ngrams
def corpus_to_ngrams(corpus, n, remove_stopwords=True):
    new_corpus = []
    for text in corpus:
        new_corpus.append(text_to_ngrams(text, n, remove_stopwords))
    return new_corpus

In [10]:
corpus_unigrams = corpus_to_ngrams(corpus, 1)
print(corpus_unigrams[0])

[('homelessness',), ('houselessness',), ('george',), ('carlin',), ('stated',), ('issue',), ('years',), ('never',), ('plan',), ('help',), ('street',), ('considered',), ('human',), ('everything',), ('going',), ('school',), ('work',), ('vote',), ('matter',), ('people',), ('think',), ('homeless',), ('lost',), ('cause',), ('worrying',), ('things',), ('racism',), ('war',), ('iraq',), ('pressuring',), ('kids',), ('succeed',), ('technology',), ('elections',), ('inflation',), ('worrying',), ("'ll",), ('next',), ('end',), ('streets.',), ('given',), ('bet',), ('live',), ('streets',), ('month',), ('without',), ('luxuries',), ('home',), ('entertainment',), ('sets',), ('bathroom',), ('pictures',), ('wall',), ('computer',), ('everything',), ('treasure',), ('see',), ("'s",), ('like',), ('homeless',), ('goddard',), ('bolt',), ("'s",), ('lesson.',), ('mel',), ('brooks',), ('directs',), ('stars',), ('bolt',), ('plays',), ('rich',), ('man',), ('everything',), ('world',), ('deciding',), ('make',), ('bet',)

In [11]:
corpus_bigrams = corpus_to_ngrams(corpus, 2)
print(corpus_bigrams[0])

[('homelessness', 'houselessness'), ('houselessness', 'george'), ('george', 'carlin'), ('carlin', 'stated'), ('stated', 'issue'), ('issue', 'years'), ('years', 'never'), ('never', 'plan'), ('plan', 'help'), ('help', 'street'), ('street', 'considered'), ('considered', 'human'), ('human', 'everything'), ('everything', 'going'), ('going', 'school'), ('school', 'work'), ('work', 'vote'), ('vote', 'matter'), ('matter', 'people'), ('people', 'think'), ('think', 'homeless'), ('homeless', 'lost'), ('lost', 'cause'), ('cause', 'worrying'), ('worrying', 'things'), ('things', 'racism'), ('racism', 'war'), ('war', 'iraq'), ('iraq', 'pressuring'), ('pressuring', 'kids'), ('kids', 'succeed'), ('succeed', 'technology'), ('technology', 'elections'), ('elections', 'inflation'), ('inflation', 'worrying'), ('worrying', "'ll"), ("'ll", 'next'), ('next', 'end'), ('end', 'streets.'), ('streets.', 'given'), ('given', 'bet'), ('bet', 'live'), ('live', 'streets'), ('streets', 'month'), ('month', 'without'), ('

In [12]:
corpus_trigrams = corpus_to_ngrams(corpus, 3)
print(corpus_trigrams[0])

[('homelessness', 'houselessness', 'george'), ('houselessness', 'george', 'carlin'), ('george', 'carlin', 'stated'), ('carlin', 'stated', 'issue'), ('stated', 'issue', 'years'), ('issue', 'years', 'never'), ('years', 'never', 'plan'), ('never', 'plan', 'help'), ('plan', 'help', 'street'), ('help', 'street', 'considered'), ('street', 'considered', 'human'), ('considered', 'human', 'everything'), ('human', 'everything', 'going'), ('everything', 'going', 'school'), ('going', 'school', 'work'), ('school', 'work', 'vote'), ('work', 'vote', 'matter'), ('vote', 'matter', 'people'), ('matter', 'people', 'think'), ('people', 'think', 'homeless'), ('think', 'homeless', 'lost'), ('homeless', 'lost', 'cause'), ('lost', 'cause', 'worrying'), ('cause', 'worrying', 'things'), ('worrying', 'things', 'racism'), ('things', 'racism', 'war'), ('racism', 'war', 'iraq'), ('war', 'iraq', 'pressuring'), ('iraq', 'pressuring', 'kids'), ('pressuring', 'kids', 'succeed'), ('kids', 'succeed', 'technology'), ('suc

In [13]:
corps_unigrams_with_stopwords = corpus_to_ngrams(corpus, 1, remove_stopwords=False)
print(corps_unigrams_with_stopwords[0])

[('homelessness',), ('or',), ('houselessness',), ('as',), ('george',), ('carlin',), ('stated',), ('has',), ('been',), ('an',), ('issue',), ('for',), ('years',), ('but',), ('never',), ('a',), ('plan',), ('to',), ('help',), ('those',), ('on',), ('the',), ('street',), ('that',), ('were',), ('once',), ('considered',), ('human',), ('who',), ('did',), ('everything',), ('from',), ('going',), ('to',), ('school',), ('work',), ('or',), ('vote',), ('for',), ('the',), ('matter',), ('most',), ('people',), ('think',), ('of',), ('the',), ('homeless',), ('as',), ('just',), ('a',), ('lost',), ('cause',), ('while',), ('worrying',), ('about',), ('things',), ('such',), ('as',), ('racism',), ('the',), ('war',), ('on',), ('iraq',), ('pressuring',), ('kids',), ('to',), ('succeed',), ('technology',), ('the',), ('elections',), ('inflation',), ('or',), ('worrying',), ('if',), ('they',), ("'ll",), ('be',), ('next',), ('to',), ('end',), ('up',), ('on',), ('the',), ('streets.',), ('but',), ('what',), ('if',), ('yo

<h1>Feature Selection using Lemmatization and Stemming</h1>

In [14]:
def apply_stemming(text):
    st = LancasterStemmer()
    word_list = [" ".join(st.stem(gram) for gram in ngram) for ngram in text]
                # stems the list of ngram tuples using nltk's LancasterStemmer
    return word_list

In [15]:
stemmed_text = apply_stemming(grams)
for feature in stemmed_text:
    print(feature)

ozymandia king king
king king look
king look upon
look upon work
upon work ye
work ye mighty
ye mighty despair


In [16]:
def apply_lemmatization(text):
    lm = WordNetLemmatizer()
    word_list = [" ".join(lm.lemmatize(gram) for gram in ngram) for ngram in text]
                # lemmatizes the list of ngram tuples
    return word_list

In [17]:
lemmatized_text = apply_lemmatization(grams)
for feature in lemmatized_text:
    print(feature)

ozymandias king king
king king look
king look upon
look upon work
upon work ye
work ye mighty
ye mighty despair


In [101]:
#apply a given stemming or lemmatization function to the corpus
def apply_to_corpus(func, corpus):
    new_corpus = []
    for text in corpus:
        new_corpus.append(func(text))
    return new_corpus

In [19]:
lemmatized_unigrams = apply_to_corpus(apply_lemmatization, corpus_unigrams)
stemmed_unigrams = apply_to_corpus(apply_stemming, corpus_unigrams)
print(lemmatized_unigrams[0])
print(stemmed_unigrams[0])

['homelessness', 'houselessness', 'george', 'carlin', 'stated', 'issue', 'year', 'never', 'plan', 'help', 'street', 'considered', 'human', 'everything', 'going', 'school', 'work', 'vote', 'matter', 'people', 'think', 'homeless', 'lost', 'cause', 'worrying', 'thing', 'racism', 'war', 'iraq', 'pressuring', 'kid', 'succeed', 'technology', 'election', 'inflation', 'worrying', "'ll", 'next', 'end', 'streets.', 'given', 'bet', 'live', 'street', 'month', 'without', 'luxury', 'home', 'entertainment', 'set', 'bathroom', 'picture', 'wall', 'computer', 'everything', 'treasure', 'see', "'s", 'like', 'homeless', 'goddard', 'bolt', "'s", 'lesson.', 'mel', 'brook', 'directs', 'star', 'bolt', 'play', 'rich', 'man', 'everything', 'world', 'deciding', 'make', 'bet', 'sissy', 'rival', 'jeffery', 'tambor', 'see', 'live', 'street', 'thirty', 'day', 'without', 'luxury', 'bolt', 'succeeds', 'want', 'future', 'project', 'making', 'building', 'bet', "'s", 'bolt', 'thrown', 'street', 'bracelet', 'leg', 'monitor

<h1>TF-IDF</h1>

First we build a shared vocabulary that maps each unique term to its document frequency (the number of documents it occurs in), which is needed to calculate the IDF values.

In [20]:
def generate_shared_vocabulary(corpus):
    words = {}
    for text in corpus:
        for word in set(text): 
            # set(text) removes duplicates, meaning the dictionary contains document frequency values 
            # (number of documents our word occurs in)
            if word in words:
                words[word] += 1
            else:
                words[word] = 1
    return words

In [21]:
shared_vocabulary = generate_shared_vocabulary(stemmed_unigrams)

Now we generate our TF-IDF matrix, where each row represents a document in the corpus. We utilise the scipy sparse matrix data structure in order to save memory.

In [25]:
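from collections import Counter
from scipy import sparse

def generate_tf_idf_matrix(corpus, shared_vocabulary, one_hot=False):
    N = len(shared_vocabulary)
    # NOTE: N is the vocabulary size. It sets the number of columns and is also the
    # numerator of the IDF below, whereas textbook IDF uses the number of documents.
    # It is left unchanged so that the outputs recorded in this notebook still apply.
    column_of = {word: j for j, word in enumerate(shared_vocabulary)}
    # dictionary lookup of each feature's column, replacing a linear list search
    matrix = sparse.lil_matrix((len(corpus), N))
    # sparse list of lists to store our tf_idf values, allocated without a dense array

    for i, text in enumerate(corpus):
        for word, count in Counter(text).items():
            # calculate tf_idf once for each distinct feature in the document
            # and insert it at that feature's column index
            tf = count / len(text)
            idf = np.log10(N / (shared_vocabulary[word] + 1))
            matrix[i, column_of[word]] = 1 if one_hot else tf * idf
            # if using one_hot vectors for SVM and LogReg then simply insert 1
    return matrix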

In [26]:
stem_unigram_tf_idf = generate_tf_idf_matrix(stemmed_unigrams, shared_vocabulary)
stem_unigram_tf_idf[0].toarray()

array([[0.01005802, 0.00693199, 0.01359507, ..., 0.        , 0.        ,
        0.        ]])
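
As a point of comparison, here is a minimal sketch of TF-IDF on a toy corpus, using the common textbook form in which the IDF numerator is the number of documents. Note that `generate_tf_idf_matrix` above passes the vocabulary size as `N` instead, so the two will not give the same numbers. The helper names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def document_frequencies(corpus):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # set() so a term counts once per document
    return df

def tf_idf(doc, df, n_docs):
    """TF-IDF weight for every distinct term of one tokenised document."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log10(n_docs / (df[term] + 1))
        for term, count in counts.items()
    }

toy_corpus = [["good", "film"], ["bad", "film"], ["good", "plot"], ["dull", "plot"]]
df = document_frequencies(toy_corpus)
weights = tf_idf(toy_corpus[1], df, n_docs=len(toy_corpus))
print(weights)  # 'bad' is roughly 0.1505, 'film' roughly 0.0625
```

Both terms have the same term frequency (0.5), so the rarer term "bad" ends up with the larger weight purely through its IDF.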

<h1>Splitting the data</h1>

In [102]:
import numpy as np
from sklearn.model_selection import train_test_split

r_seed = 563
np.random.seed(r_seed)

def get_test_train_dev_split(X):
    y = np.concatenate([np.ones(positive_labels), np.zeros(negative_labels)])

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, 
                                                        shuffle=True, random_state=r_seed)
    
    X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, 
                                                        test_size=0.15, 
                                                       shuffle=True, random_state=r_seed)
   
    #we first split the data into train and test, then we split train into train 
    #and development
    #68% train, 12% validation, 20% test
    return X_train, y_train, X_test, y_test, X_dev, y_dev

X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(stem_unigram_tf_idf)
print(X_train.shape)
print(X_test.shape)
print(X_dev.shape)
print(X_train.toarray()[0])


(2720, 29379)
(800, 29379)
(480, 29379)
[0.        0.0105175 0.        ... 0.        0.        0.       ]
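
The shapes above can be checked against the 68% / 12% / 20% comment with a little arithmetic. This sketch only reproduces the sizes; it does not reproduce the exact rounding rules inside `train_test_split`, and the corpus size of 4000 is inferred from the three printed shapes.

```python
n_docs = 4000                      # 2720 + 800 + 480, the row counts printed above
n_test = round(n_docs * 0.20)      # held out by the first split (train_size=0.80)
n_train_dev = n_docs - n_test
n_dev = round(n_train_dev * 0.15)  # carved out of the remaining 80% (test_size=0.15)
n_train = n_train_dev - n_dev

print(n_train, n_dev, n_test)                             # 2720 480 800
print(n_train / n_docs, n_dev / n_docs, n_test / n_docs)  # 0.68 0.12 0.2
```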


<h1>Multinomial Naive Bayes</h1>

In [28]:
'''
calculate_likelihoods: calculates the smoothed likelihood p(x|C) of every feature x for a class C
labels: 1 for documents of the class whose likelihoods are being calculated, 0 for all others
data: the training data, i.e. our TF-IDF matrix of all training documents
alpha: the alpha value for Laplace smoothing
'''
def calculate_likelihoods(data, labels, alpha=1.0):
    N = data.shape[1]
    likelihoods = np.zeros([N])
    for i in range(N):
        feature = data[:, i].toarray().flatten()
        likelihoods[i] = (np.sum(feature * labels) + alpha)  / (np.sum(labels) + alpha) 
        # likelihood calculation (p(X|C)) using laplace smoothing
    return likelihoods
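
For comparison, the standard textbook multinomial estimate divides by the total feature mass of the class plus `alpha` for every vocabulary entry, p(x_i|C) = (N_Ci + alpha) / (N_C + alpha * V), so that the likelihoods of a class sum to 1. The function above divides by the number of documents in the class instead. A minimal sketch of the textbook form on toy counts follows; `multinomial_likelihoods` and the toy rows are hypothetical.

```python
def multinomial_likelihoods(class_rows, alpha=1.0):
    """Smoothed p(feature | class) from the feature rows of one class."""
    n_features = len(class_rows[0])
    # total weight of each feature across all documents of the class
    feature_totals = [sum(row[j] for row in class_rows) for j in range(n_features)]
    denominator = sum(feature_totals) + alpha * n_features
    return [(total + alpha) / denominator for total in feature_totals]

positive_rows = [[2, 0, 1], [1, 0, 0]]   # two toy documents, three features
likelihoods = multinomial_likelihoods(positive_rows)
print(likelihoods)       # [4/7, 1/7, 2/7] as floats
print(sum(likelihoods))  # 1 up to floating-point rounding
```

The unseen second feature still receives a non-zero likelihood (1/7), which is the purpose of the smoothing term.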
        

In [105]:
'''
train_multinomial_bayes: calculates the log likelihoods for both classes given the training data,
as well as the log priors for both classes
inverted_y_train: an inverted label array, denoting 1 for the negative class and 0 for the positive,
used in calculating the negative-class likelihoods and prior
'''
def train_multinomial_bayes(X_train, y_train):
    inverted_y_train = np.array([not y for y in y_train]).astype(int)
    
    pos_likelihoods = calculate_likelihoods(X_train, y_train)
    neg_likelihoods = calculate_likelihoods(X_train, inverted_y_train)
    pos_log_likelihoods = np.log(pos_likelihoods)
    neg_log_likelihoods = np.log(neg_likelihoods)

    pos_prior = np.sum(y_train) / len(y_train)
    neg_prior = np.sum(inverted_y_train) / len(inverted_y_train)
    pos_log_prior = np.log(pos_prior)
    neg_log_prior = np.log(neg_prior)
    return pos_log_likelihoods, neg_log_likelihoods, pos_log_prior, neg_log_prior
    

In [30]:
#assigns a class label to a given document using likelihoods and priors
def get_multinomial_class_label(data, document):
    pos_log_likelihoods, neg_log_likelihoods, pos_log_prior, neg_log_prior = data
    #unpacks our sparse vector:
    features = np.nonzero(document)[0] 
    pos_total = 0
    neg_total = 0
    for index in features: 
        #sum log likelihoods for each feature, for both classes
        pos_total += pos_log_likelihoods[index]
        neg_total += neg_log_likelihoods[index]
    #add priors
    pos_total += pos_log_prior 
    neg_total += neg_log_prior
    class_label = 1 if pos_total > neg_total else 0 
    return class_label
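
The reason for working with log likelihoods can be seen in a small sketch: a product of many small probabilities underflows to zero, while the equivalent sum of logs stays representable. The probabilities below are made up for illustration and `log_score` is a hypothetical helper.

```python
import math

def log_score(log_prior, log_likelihoods, active_features):
    """Unnormalised log posterior: log prior plus the summed log likelihoods."""
    return log_prior + sum(log_likelihoods[i] for i in active_features)

# Hypothetical per-feature likelihoods for the two classes
pos = [math.log(p) for p in (0.5, 0.3, 0.2)]
neg = [math.log(p) for p in (0.2, 0.3, 0.5)]
active = [0, 1]  # indices of the non-zero features in the document

pos_score = log_score(math.log(0.5), pos, active)
neg_score = log_score(math.log(0.5), neg, active)
label = 1 if pos_score > neg_score else 0
print(label)  # 1

# Multiplying 100 small probabilities underflows; summing their logs does not.
print(math.prod([1e-5] * 100))  # 0.0
print(100 * math.log(1e-5))     # about -1151.29
```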

In [31]:
#runs the entire pipeline for MNB and returns a predictions array
def test_train_multinomial_bayes(train_data, train_labels, test_data):
    data = train_multinomial_bayes(train_data, train_labels)
    predictions = []
    for i, v in enumerate(test_data):
        doc =  test_data[i].toarray().flatten()
        label = get_multinomial_class_label(data, doc)
        predictions.append(label)
    return predictions
    

In [32]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

def evaluate_model(test_labels, predictions):
    print("accuracy:", accuracy_score(test_labels, predictions))
    print("precision:", precision_score(test_labels, predictions))
    print("recall:", recall_score(test_labels, predictions))
    print("f1 score:", f1_score(test_labels, predictions))
    print()

Evaluation for Multinomial Naive Bayes on the development set for stemmed unigrams: 

In [104]:
predictions = test_train_multinomial_bayes(X_train, y_train, X_dev)
evaluate_model(y_dev, predictions)

0.5025735294117647 0.4974264705882353
accuracy: 0.875
precision: 0.9267015706806283
recall: 0.7937219730941704
f1 score: 0.8550724637681159
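
To make the four metrics concrete, they can be recomputed by hand from confusion-matrix counts. The counts below are back-calculated from the precision and recall reported above (an inference, since the notebook does not print a confusion matrix), assuming the 480 development documents from the split.

```python
# Inferred counts: 177/191 gives the reported precision, 177/223 the reported recall
tp, fp, fn = 177, 14, 46
tn = 480 - tp - fp - fn  # 243 true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the documents predicted positive, how many are positive
recall = tp / (tp + fn)      # of the positive documents, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)                                             # 0.875
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9267 0.7937 0.8551
```

Read this way, the model makes few false positives (14) but misses 46 of the 223 positive reviews, which is why precision is well above recall.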



<h1>Gaussian Naive Bayes</h1>

In [34]:
'''
calculate_guassian_distributions: calculates the mean and standard deviation of
each feature's TF-IDF scores for the positive and the negative class
labels: 1 for positive documents, 0 for negative documents
data: the training data, i.e. our TF-IDF matrix of all training documents
alpha: a small constant added to every mean and standard deviation so that neither is exactly zero
'''
def calculate_guassian_distributions(data, labels, alpha=1e-10):
    pos_distribution = []
    neg_distribution = []
    inverted_labels = np.array([not y for y in labels]).astype(int)

    for i in range(data.shape[1]): #calculate means and standard deviations for each feature
                                   #in order to compute distributions
        feature = data[:, i].toarray().flatten() 
        #collects the instances of the feature being present in a positive and negative class resp.
        pos_feature = feature * labels
        neg_feature = feature * inverted_labels
        pos_distribution.append((np.mean(pos_feature) + alpha, np.std(pos_feature) + alpha))
        neg_distribution.append((np.mean(neg_feature) + alpha, np.std(neg_feature) + alpha))
    return pos_distribution, neg_distribution



In [35]:
'''
train_gaussian_bayes: calculates the per-feature (mean, standard deviation) pairs
and the priors for both classes given the training data, and returns their logs
inverted_y_train: an inverted label array, denoting 1 for the negative class and 0 for the positive,
used in calculating the negative-class prior
'''
def train_gaussian_bayes(X_train, y_train):
    inverted_y_train = np.array([not y for y in y_train]).astype(int)

    pos_distribution, neg_distribution = calculate_guassian_distributions(X_train, y_train)
    pos_log_distribution = np.log(pos_distribution)
    neg_log_distribution = np.log(neg_distribution)

    pos_prior = np.sum(y_train) / len(y_train)
    neg_prior = np.sum(inverted_y_train) / len(inverted_y_train)
    pos_log_prior = np.log(pos_prior)
    neg_log_prior = np.log(neg_prior)

    return pos_log_distribution, neg_log_distribution, pos_log_prior, neg_log_prior

In [36]:
#evaluates the Gaussian density of value (the TF-IDF score for a given feature in the input document) under that feature's distribution
def gaussian(mean, sd, value):
    exponent = (- (value - mean)**2 ) / (2 * sd**2)
    value = (1 / np.sqrt(2 * np.pi * sd**2)) * np.exp(exponent)
    if np.isnan(value): #0 if NaN
        return 0
    return value

In [37]:
#produces a class label for a given document using our GNB
def get_gaussian_class_label(data, document):
    pos_distribution, neg_distribution, pos_log_prior, neg_log_prior = data
    features = np.nonzero(document)[0]
    pos_total = 0
    neg_total = 0
    for index in features: #calculates the likelihood using the mean, sd, and value for each feature
                           #gaussian(mean, sd, x)
                           #pos_distribution[i] = (mean, sd), where i is feature index
        pos_total += gaussian(pos_distribution[index][0], pos_distribution[index][1], 
                            document[index] ) 
        neg_total += gaussian(neg_distribution[index][0], neg_distribution[index][1],
                            document[index] ) 
    pos_total += pos_log_prior
    neg_total += neg_log_prior
    label = 1 if pos_total > neg_total else 0
    return label
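
For comparison, the textbook Gaussian Naive Bayes rule adds the log prior to the sum of per-feature log densities, with the mean and standard deviation estimated on the raw feature values. The cells above differ from this: they sum the densities themselves and receive log-transformed (mean, sd) pairs from `train_gaussian_bayes`. The sketch below shows the textbook form only; the function names and parameter values are hypothetical.

```python
import math

def gaussian_log_pdf(value, mean, sd):
    """Log of the normal density with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (value - mean) ** 2 / (2 * sd ** 2)

def gaussian_log_score(document, params, log_prior):
    """Log prior plus the summed per-feature log densities.

    params[i] is a (mean, sd) pair for feature i, on the raw feature scale.
    """
    return log_prior + sum(
        gaussian_log_pdf(x, mean, sd) for x, (mean, sd) in zip(document, params)
    )

# Hypothetical two-feature example
pos_params = [(0.02, 0.01), (0.00, 0.01)]
neg_params = [(0.00, 0.01), (0.02, 0.01)]
doc = [0.02, 0.0]

pos = gaussian_log_score(doc, pos_params, math.log(0.5))
neg = gaussian_log_score(doc, neg_params, math.log(0.5))
print(1 if pos > neg else 0)  # 1
```

Working in log space here also avoids the NaN check, because the density is never exponentiated.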

In [38]:
#full GNB pipeline
def test_train_gaussian_bayes(train_data, train_labels, test_data):
    data = train_gaussian_bayes(train_data, train_labels)
    predictions = []
    for i, v in enumerate(test_data):
        doc =  test_data[i].toarray().flatten()
        label = get_gaussian_class_label(data, doc)
        predictions.append(label)
    return predictions
    

Evaluation on the development set for Gaussian Bayes using stemmed unigrams:

In [39]:
predictions = test_train_gaussian_bayes(X_train, y_train, X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8645833333333334
precision: 0.8291666666666667
recall: 0.8923766816143498
f1 score: 0.8596112311015119



<h1>Sklearn MNB and GNB Models</h1>

In [40]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

def test_train_sklearn_models(X_train, y_train, X_test, y_test):
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print("Sklearn Multinomial Bayes")
    evaluate_model(y_test, predictions)

    clf2 = GaussianNB()
    clf2.fit(X_train.toarray(), y_train)
    predictions = clf2.predict(X_test.toarray())
    print("Sklearn Gaussian Bayes")
    evaluate_model(y_test, predictions)
test_train_sklearn_models(X_train, y_train, X_dev, y_dev)

Sklearn Multinomial Bayes
accuracy: 0.85625
precision: 0.8155737704918032
recall: 0.8923766816143498
f1 score: 0.8522483940042827

Sklearn Gaussian Bayes
accuracy: 0.65
precision: 0.611336032388664
recall: 0.6771300448430493
f1 score: 0.6425531914893617



For stemmed unigrams, our own implementation of Multinomial Bayes achieves similar accuracy and F1 score to the prebuilt sklearn model (0.875 vs 0.856 accuracy, 0.855 vs 0.852 F1), with higher precision (0.927 vs 0.816) and lower recall (0.794 vs 0.892). Our own implementation of Gaussian Bayes far outperforms sklearn's on every metric (for example, accuracy 0.865 vs 0.650).

<h1>Evaluation</h1>

In [41]:
#full training and evaluation for a given corpus using all models
def evaluate_on_corpus(corpus):
    shared_vocabulary = generate_shared_vocabulary(corpus)
    tf_idf_matrix = generate_tf_idf_matrix(corpus, shared_vocabulary)
    X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(tf_idf_matrix)

    multinomial_predictions = test_train_multinomial_bayes(X_train, y_train, X_dev)
    print("Multinomial Bayes")
    evaluate_model(y_dev, multinomial_predictions)
    gaussian_predictions = test_train_gaussian_bayes(X_train, y_train, X_dev)
    print("Gaussian Bayes")
    evaluate_model(y_dev, gaussian_predictions)
    test_train_sklearn_models(X_train, y_train, X_dev, y_dev)
    return tf_idf_matrix

Note that we ran stemmed and lemmatized bigrams and produced the following results. These are omitted as code cells due to the time taken to run (> 90 minutes).

lemmatized_bigrams = apply_to_corpus(apply_lemmatization, corpus_bigrams) <br>
evaluate_on_corpus(lemmatized_bigrams) <br>
<br>
Lemmatized Bigrams:<br>
<br>
Multinomial Bayes<br>
accuracy: 0.555<br>
precision: 0.9666666666666667<br>
recall: 0.07552083333333333<br>
f1 score: 0.1400966183574879<br>
<br>
Gaussian Bayes<br>
accuracy: 0.74375<br>
precision: 0.6660482374768089<br>
recall: 0.9348958333333334<br>
f1 score: 0.7778981581798483<br>
<br>
<br>
Stemmed Bigrams:<br>
Multinomial Bayes<br>
accuracy: 0.6375<br>
precision: 0.8955223880597015<br>
recall: 0.2643171806167401<br>
f1 score: 0.40816326530612246<br>
<br>
Gaussian Bayes<br>
accuracy: 0.7729166666666667<br>
precision: 0.7341269841269841<br>
recall: 0.8149779735682819<br>
f1 score: 0.7724425887265136<br>
<br>
Sklearn Multinomial Bayes<br>
accuracy: 0.7958333333333333<br>
precision: 0.7698744769874477<br>
recall: 0.8105726872246696<br>
f1 score: 0.7896995708154506<br>
<br>
Sklearn Gaussian Bayes<br>
accuracy: 0.7229166666666667<br>
precision: 0.706140350877193<br>
recall: 0.7092511013215859<br>
f1 score: 0.7076923076923076<br>


Having tried stemmed unigrams, let's assess the use of lemmatized unigrams:

In [43]:
lemmatized_unigrams = apply_to_corpus(apply_lemmatization, corpus_unigrams)
lem_unigrams_tf_idf = evaluate_on_corpus(lemmatized_unigrams)

Multinomial Bayes
accuracy: 0.8666666666666667
precision: 0.9392265193370166
recall: 0.7623318385650224
f1 score: 0.8415841584158416

Gaussian Bayes
accuracy: 0.8666666666666667
precision: 0.8326359832635983
recall: 0.8923766816143498
f1 score: 0.8614718614718615

Sklearn Multinomial Bayes
accuracy: 0.8645833333333334
precision: 0.8464912280701754
recall: 0.8654708520179372
f1 score: 0.8558758314855874

Sklearn Gaussian Bayes
accuracy: 0.6354166666666666
precision: 0.5930232558139535
recall: 0.6860986547085202
f1 score: 0.6361746361746361



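Stemming gives the highest accuracy of any run so far (0.875 for our Multinomial Bayes, against 0.867 with lemmatization), although the margins are small and lemmatization is marginally ahead for our Gaussian Bayes and for sklearn's Multinomial Bayes. Let's try stemming without stopword removal and assess the performance: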

In [44]:
stemmed_unigrams_stopwords = apply_to_corpus(apply_stemming, corps_unigrams_with_stopwords)
stem_uni_stopword_tf_idf = evaluate_on_corpus(stemmed_unigrams_stopwords)

Multinomial Bayes
accuracy: 0.8020833333333334
precision: 0.9266666666666666
recall: 0.6233183856502242
f1 score: 0.7453083109919572

Gaussian Bayes
accuracy: 0.8604166666666667
precision: 0.8305084745762712
recall: 0.8789237668161435
f1 score: 0.8540305010893247

Sklearn Multinomial Bayes
accuracy: 0.8354166666666667
precision: 0.7790697674418605
recall: 0.9013452914798207
f1 score: 0.8357588357588358

Sklearn Gaussian Bayes
accuracy: 0.65
precision: 0.6122448979591837
recall: 0.672645739910314
f1 score: 0.6410256410256411



Keeping stopwords does not improve any of the four models, so our best feature set is stemmed unigrams with stopword removal. Let's now evaluate on the test set:

In [45]:
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(stem_unigram_tf_idf)
multinomial_predictions = test_train_multinomial_bayes(X_train, y_train, X_test)
print("Multinomial Bayes")
evaluate_model(y_test, multinomial_predictions)
gaussian_predictions = test_train_gaussian_bayes(X_train, y_train, X_test)
print("Gaussian Bayes")
evaluate_model(y_test, gaussian_predictions)
test_train_sklearn_models(X_train, y_train, X_test, y_test)

Multinomial Bayes
accuracy: 0.80875
precision: 0.8768328445747801
recall: 0.7292682926829268
f1 score: 0.796271637816245

Gaussian Bayes
accuracy: 0.8275
precision: 0.8105022831050228
recall: 0.8658536585365854
f1 score: 0.8372641509433961

Sklearn Multinomial Bayes
accuracy: 0.82
precision: 0.8093023255813954
recall: 0.848780487804878
f1 score: 0.8285714285714286

Sklearn Gaussian Bayes
accuracy: 0.6525
precision: 0.6578947368421053
recall: 0.6707317073170732
f1 score: 0.6642512077294686



<h1>Logistic Regression</h1>

Let's first compare one hot matrices to our usual TF-IDF vectors

In [46]:
one_hot_matrix = generate_tf_idf_matrix(stemmed_unigrams, shared_vocabulary, one_hot=True)
print(one_hot_matrix[0].toarray())

[[1. 1. 1. ... 0. 0. 0.]]
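
The "one hot" row printed above is a binary presence vector: each column is 1 if the vocabulary term occurs anywhere in the document and 0 otherwise, regardless of how often it occurs. A minimal sketch, with a hypothetical helper name and toy vocabulary:

```python
def binary_vector(tokens, vocabulary):
    """1.0 where the vocabulary term occurs in the document, else 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]

vocabulary = ["bad", "film", "good", "plot"]  # hypothetical toy vocabulary
print(binary_vector(["good", "good", "film"], vocabulary))  # [0.0, 1.0, 1.0, 0.0]
```

The repeated "good" contributes a single 1, so term frequency and document rarity are both discarded in this representation.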


In [60]:
from sklearn.linear_model import LogisticRegression
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(stem_unigram_tf_idf)
clf = LogisticRegression(random_state=r_seed, solver="sag")
clf.fit(X_train, y_train)
clf.score(X_dev, y_dev)

0.8104166666666667

In [61]:
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(one_hot_matrix)
clf = LogisticRegression(random_state=r_seed, solver="sag")
clf.fit(X_train, y_train)
clf.score(X_dev, y_dev)



0.8333333333333334

One-hot matrices give higher development accuracy (0.833 vs 0.810 for TF-IDF), so we use them for the remaining experiments. Full evaluation for stemmed unigrams:

In [62]:
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8333333333333334
precision: 0.8122270742358079
recall: 0.8340807174887892
f1 score: 0.8230088495575221



Let's try for lemmatized unigrams:

In [63]:
lem_uni_one_hot = generate_tf_idf_matrix(lemmatized_unigrams, 
                                         generate_shared_vocabulary(lemmatized_unigrams), one_hot=True)
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(lem_uni_one_hot)
clf = LogisticRegression(random_state=r_seed, solver="sag").fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8583333333333333
precision: 0.8414096916299559
recall: 0.8565022421524664
f1 score: 0.8488888888888888





Stemmed unigrams without stopword removal:

In [64]:
stem_uni_stop_one_hot = generate_tf_idf_matrix(stemmed_unigrams_stopwords, 
                                               generate_shared_vocabulary(stemmed_unigrams_stopwords), one_hot=True)
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(stem_uni_stop_one_hot)
clf = LogisticRegression(random_state=r_seed, solver="sag")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)


accuracy: 0.8375
precision: 0.8111587982832618
recall: 0.8475336322869955
f1 score: 0.8289473684210525





The best performing feature set for logistic regression was thus lemmatized unigrams with stop-word removal (accuracy 0.858, F1 0.849).

<h1>SVMs</h1>

Stemmed unigrams with stopword removal

In [106]:
from sklearn import svm
def svm_classifier(feature_matrix):
    X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(feature_matrix)
    clf = svm.SVC()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_dev)
    evaluate_model(y_dev, predictions)
svm_classifier(one_hot_matrix)

accuracy: 0.8520833333333333
precision: 0.8247863247863247
recall: 0.8654708520179372
f1 score: 0.8446389496717724



SVM with lemmatized unigrams

In [53]:
svm_classifier(lem_uni_one_hot)

accuracy: 0.85
precision: 0.8212765957446808
recall: 0.8654708520179372
f1 score: 0.8427947598253275



SVM with stemmed unigrams and no stopword removal

In [107]:
svm_classifier(stem_uni_stop_one_hot)

accuracy: 0.8520833333333333
precision: 0.8247863247863247
recall: 0.8654708520179372
f1 score: 0.8446389496717724



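Across logistic regression and the SVM, the highest development accuracy (0.858) came from logistic regression on lemmatized unigrams with stopword removal; the SVM reached 0.850 to 0.852 on all three feature sets. Let's now optimise the logistic regression hyperparameters.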

<h1>Hyperparameter optimisation</h1>

Baseline performance with no hyperparameter tuning: <br>
accuracy: 0.8583333333333333 <br> 
precision: 0.8414096916299559 <br>
recall: 0.8565022421524664 <br>
f1 score: 0.8488888888888888 <br>

Changing regularization parameter C:

In [82]:
#C=0.1
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(lem_uni_one_hot)
clf = LogisticRegression(C=0.1, random_state=r_seed, solver="sag")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)


accuracy: 0.8541666666666666
precision: 0.84
recall: 0.8475336322869955
f1 score: 0.84375



In [83]:
#C=1.5
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(lem_uni_one_hot)
clf = LogisticRegression(C=1.5, random_state=r_seed, solver="sag")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)


accuracy: 0.8583333333333333
precision: 0.8414096916299559
recall: 0.8565022421524664
f1 score: 0.8488888888888888





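C=0.1 slightly lowers performance (accuracy 0.854), while C=1.5 exactly matches the baseline run with the default C (accuracy 0.858), so increasing C brings no measurable gain here. We continue with C=1.5 and now test different solvers: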

In [84]:
#solver="liblinear"
clf = LogisticRegression(C=1.5, solver="liblinear",random_state=r_seed)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.85625
precision: 0.8347826086956521
recall: 0.8609865470852018
f1 score: 0.847682119205298



In [85]:
#solver="lbfgs"
clf = LogisticRegression(C=1.5, solver="lbfgs", random_state=r_seed)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8583333333333333
precision: 0.8384279475982532
recall: 0.8609865470852018
f1 score: 0.8495575221238938



In [90]:
#solver="sag"
clf = LogisticRegression(C=1.5, solver="sag",random_state=r_seed)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8583333333333333
precision: 0.8414096916299559
recall: 0.8565022421524664
f1 score: 0.8488888888888888





"lbfgs" achieves the best F1 score (0.8496, against 0.8489 for "sag" and 0.8477 for "liblinear"), with accuracy tied with "sag". Now experimenting with different penalties.

In [91]:
#penalty=None (C has no effect when no penalty is applied)
clf = LogisticRegression(C=1.5, solver="lbfgs", random_state=r_seed,
                         penalty=None)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)



accuracy: 0.8479166666666667
precision: 0.831858407079646
recall: 0.8430493273542601
f1 score: 0.8374164810690423



In [92]:
#penalty="l2"
clf = LogisticRegression(C=1.5, solver="lbfgs", random_state=r_seed, 
                         penalty="l2")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8583333333333333
precision: 0.8384279475982532
recall: 0.8609865470852018
f1 score: 0.8495575221238938



A penalty of "l2" appears optimal. Now attempting to parallelize with n_jobs:

In [93]:
#n_jobs=-1
clf = LogisticRegression(C=1.5, solver="lbfgs", random_state=r_seed, 
                         penalty="l2", n_jobs=-1)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8583333333333333
precision: 0.8384279475982532
recall: 0.8609865470852018
f1 score: 0.8495575221238938



In [97]:
#n_jobs=None
clf = LogisticRegression(C=1.5, solver="lbfgs", random_state=r_seed, 
                         penalty="l2", n_jobs=None)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8583333333333333
precision: 0.8384279475982532
recall: 0.8609865470852018
f1 score: 0.8495575221238938



We find that parallelizing does not change the scores, which is expected: n_jobs only affects how the computation is distributed, not the fitted model. Hyperparameter tuning is complete, giving us our fine-tuned model:
+ C = 1.5
+ solver = "lbfgs"
+ penalty = "l2"
+ n_jobs = None (i.e. not parallelized)

In [99]:
predictions = clf.predict(X_test)
evaluate_model(y_test, predictions)

accuracy: 0.83625
precision: 0.8329355608591885
recall: 0.8512195121951219
f1 score: 0.8419782870928829



<h1> SVM Hyperparameters </h1><br>
Baseline performance of untuned model: <br>
accuracy: 0.8520833333333333 <br>
precision: 0.8247863247863247 <br>
recall: 0.8654708520179372 <br>
f1 score: 0.8446389496717724 <br>

Testing different values for the regularization parameter C:

In [None]:
X_train, y_train, X_test, y_test, X_dev, y_dev = get_test_train_dev_split(one_hot_matrix)
clf = svm.SVC(C=1.0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

In [None]:
clf = svm.SVC(C=0.9)
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8520833333333333
precision: 0.8220338983050848
recall: 0.8699551569506726
f1 score: 0.8453159041394336



C = 0.9 slightly outperforms the default C = 1 (F1 0.8453 vs 0.8446, with identical accuracy). Now testing different kernels.

In [None]:
clf = svm.SVC(C=0.9, kernel="linear")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

accuracy: 0.8166666666666667
precision: 0.8
recall: 0.8071748878923767
f1 score: 0.8035714285714287



In [None]:
clf = svm.SVC(C=0.9, kernel="rbf")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)
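
Since "rbf" is the default kernel of scikit-learn's `SVC`, this cell should reproduce the C=0.9 run above. A minimal check of that equivalence, assuming synthetic data from `make_classification` rather than the notebook's feature matrix:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the notebook's corpus).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

default_clf = SVC(C=0.9).fit(X, y)                 # kernel left at its default
explicit_clf = SVC(C=0.9, kernel="rbf").fit(X, y)  # kernel set explicitly

same_predictions = bool(
    np.array_equal(default_clf.predict(X), explicit_clf.predict(X)))
print("default kernel:", default_clf.kernel)
print("identical predictions:", same_predictions)
```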

The "rbf" kernel achieves the best performance. Now testing different gamma configurations.

In [None]:
clf = svm.SVC(C=0.9, kernel="rbf", gamma="auto")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

In [None]:
clf = svm.SVC(C=0.9, kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)
predictions = clf.predict(X_dev)
evaluate_model(y_dev, predictions)

gamma="scale" appears to perform the best, giving us our final tuned model:
+ C = 0.9
+ kernel = "rbf" 
+ gamma = "scale"
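
To see what the two gamma settings mean in practice: in scikit-learn, `gamma="auto"` uses `1 / n_features`, while `gamma="scale"` uses `1 / (n_features * X.var())`. The sketch below computes both by hand for a random binary matrix, which is only a hypothetical stand-in for a one-hot document-term matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for a binary one-hot feature matrix.
X = rng.integers(0, 2, size=(100, 50)).astype(float)

n_features = X.shape[1]
gamma_auto = 1.0 / n_features                 # what gamma="auto" resolves to
gamma_scale = 1.0 / (n_features * X.var())    # what gamma="scale" resolves to

print(f"auto:  {gamma_auto:.4f}")
print(f"scale: {gamma_scale:.4f}")
```

Because the variance of 0/1 values is at most 0.25, "scale" gives a gamma at least four times larger than "auto" on binary features, so the two settings can behave quite differently on one-hot input.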

In [None]:
predictions = clf.predict(X_test)
evaluate_model(y_test, predictions)

accuracy: 0.82625
precision: 0.8086560364464692
recall: 0.8658536585365854
f1 score: 0.8362779740871613



<h1>BERT Results</h1>
Our BERT experiments are contained in another notebook; however, we have included the results here for comparative purposes.
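
For a side-by-side view, the test-set figures reported in this notebook (the tuned logistic regression and SVM above, and the two BERT variants in the cells that follow) can be collected and ranked by F1. The dictionary layout and model labels are illustrative only; the numbers are copied from the reported outputs.

```python
# Test-set figures copied from the outputs reported in this notebook.
test_results = {
    "Logistic regression (tuned)": {"accuracy": 0.83625, "f1": 0.8419782870928829},
    "SVM (tuned)": {"accuracy": 0.82625, "f1": 0.8362779740871613},
    "BERT uncased": {"accuracy": 0.89125, "f1": 0.8945454545454546},
    "BERT cased": {"accuracy": 0.8825, "f1": 0.8883610451306413},
}

# Sort models from highest to lowest test-set F1.
ranked = sorted(test_results.items(),
                key=lambda item: item[1]["f1"], reverse=True)

for name, metrics in ranked:
    print(f"{name:<30} accuracy={metrics['accuracy']:.4f}  f1={metrics['f1']:.4f}")
```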

In [None]:
"""
BERT-UNCASED: DEV SET
accuracy: 0.8458333333333333
precision: 0.8311111111111111
recall: 0.8385650224215246
f1 score: 0.8348214285714286

TEST SET
accuracy: 0.89125
precision: 0.8891566265060241
recall: 0.9
f1 score: 0.8945454545454546
"""

In [None]:
"""
BERT CASED DEV SET
accuracy: 0.84375
precision: 0.8032786885245902
recall: 0.8789237668161435
f1 score: 0.8394004282655245


TEST SET
accuracy: 0.8825
precision: 0.8657407407407407
recall: 0.9121951219512195
f1 score: 0.8883610451306413
"""