# <u>Naive Bayes Classifier Report</u>
**Explain and motivate the chosen representation & data preprocessing:** <br>
I chose my current model because its improvements over a standard Naive Bayes implementation give it a much higher prediction accuracy. My program consists of the standard Naive Bayes classifier along with several improvements: n-grams, smoothing, pre-processing, the inclusion of word frequency, and working in log space. Each of these is described later in this report.
For data preprocessing, I split the training dataset into a training set containing the first 90% of the rows and a mock test set containing the last 10%. This let me test my program and obtain a training accuracy score before submitting it to Kaggle. The main preprocessing step was removing a group of extremely frequent words from the text. I determined these words by counting how many rows each word appears in and then removed every word whose count was at least half that of the most common word. In the training dataset this ended up being the 19 most common words.  
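
The two preprocessing steps above can be sketched as follows. This is a minimal illustration on toy data, not the code used for the reported results: the function names (`split_train_holdout`, `row_frequencies`, `very_frequent_words`) and the `ratio` parameter are hypothetical, and the toy texts are made up.

```python
import math
from collections import Counter


def split_train_holdout(rows, train_fraction=0.9):
    """Return (first part, remainder) of rows, keeping their original order."""
    cut = math.ceil(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def row_frequencies(texts):
    """Count, for every word, how many texts it appears in at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.split(" ")))
    return counts


def very_frequent_words(counts, ratio=0.5):
    """Words whose count is at least `ratio` times the highest count."""
    highest = max(counts.values())
    return {word for word, count in counts.items() if count >= highest * ratio}


# Toy data (assumption): "a" appears in every text, "b" in half of them.
texts = ["a b c", "a b d", "a e f", "a g h"]
train, holdout = split_train_holdout(list(range(10)))
stop_words = very_frequent_words(row_frequencies(texts))
```

With the toy data, `stop_words` contains the words that appear in at least half as many texts as the most common word, which is the same cut-off rule described above.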

**Explain the idea behind the model improvements and their implementation:** <br>
The improvements I made to the base version of my Naive Bayes classifier were: adding n-grams of up to three words (combining up to three adjacent words into one feature if every instance of those words next to each other has the same output class), Laplace smoothing, including the frequency of each word in my calculations, and changing the overall calculation of the probability of each word occurring.
1. The n-gram feature produces the same Kaggle and training accuracy as the classifier without n-grams, but it might prove more effective on a larger dataset. However, training the classifier with n-grams takes considerably longer, which makes it less practical than the model without them.
2. I implemented Laplace smoothing by adding 1 to each count and 2 to each denominator. Words that don’t exist in my training dataset then get a probability of 0.5, i.e. 1 / the number of options, where the two options are present (1) and absent (0).
3. Rather than just combining the probabilities of each word occurring, I sum the log-probability of each word multiplied by the number of times the word occurs in the text. This weights each word by its frequency: a frequent word is likely to occur several times within the text of the test data, which isn’t accounted for when you only use the probability of it being in the text at all.
4. I conducted all my multiplications in log space, since multiplying many tiny numbers can underflow. I also left out the denominator of the final probability calculation for each class: because it is the same for every class, it doesn’t affect how the class probabilities compare to each other, so it can be ignored. Finally, I changed the final equation to “probability of it being a certain class - the probability that it isn’t that class” rather than “probability of it being a certain class + the probability that it is that class”. In other words, I find the probability that each class isn’t the correct one and then pick the class with the lowest chance of being wrong. This technique is described in the paper “Tackling the Poor Assumptions of Naive Bayes Text Classifiers” by Rennie et al. (2003), which was provided in Canvas.
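
Points 2 to 4 can be combined into one small scoring function. The sketch below is a simplified illustration under stated assumptions, not the exact code from the cells further down: the names `train_complement`, `complement_log_score` and `predict_label` are hypothetical, the smoothed formula is applied to every known word, and words never seen in training are skipped.

```python
import math
from collections import Counter, defaultdict


def train_complement(rows):
    """rows is a list of (label, text) pairs; returns the counts needed for scoring."""
    class_counts = Counter()             # rows per class
    row_freq = Counter()                 # rows containing each word
    class_row_freq = defaultdict(Counter)  # rows containing each word, per class
    for label, text in rows:
        class_counts[label] += 1
        for word in set(text.split(" ")):
            row_freq[word] += 1
            class_row_freq[label][word] += 1
    return class_counts, row_freq, class_row_freq


def complement_log_score(text, label, model):
    """log P(label) minus the frequency-weighted log-probability of each word
    appearing in rows that do NOT belong to `label`."""
    class_counts, row_freq, class_row_freq = model
    total_rows = sum(class_counts.values())
    score = math.log(class_counts[label] / total_rows)
    for word, freq in Counter(text.split(" ")).items():
        if word not in row_freq:
            continue  # assumption: words unseen in training are ignored
        outside = row_freq[word] - class_row_freq[label][word]
        # Laplace smoothing: +1 on the count, +2 on the denominator
        p_not_label = (outside + 1) / (total_rows - class_counts[label] + 2)
        score -= freq * math.log(p_not_label)
    return score


def predict_label(text, model):
    """Pick the class with the highest complement score."""
    class_counts = model[0]
    return max(class_counts, key=lambda label: complement_log_score(text, label, model))


# Toy data (assumption): "x" only occurs in class A, "w" only in class B.
toy_model = train_complement([("A", "x y"), ("A", "x z"), ("B", "w y"), ("B", "w v")])
```

Here `predict_label("x x", toy_model)` favours class A, because "x" is rare outside class A and the repeated occurrence doubles its weight. Working with sums of logarithms keeps the scores in a safe numeric range even for long texts.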

**Explain the evaluation procedure:** <br>
The evaluation of my model can be split into two values: the Kaggle accuracy and the training accuracy. Kaggle calculates the Kaggle accuracy when I submit my list of predictions to the website. The training accuracy is determined by testing the classifier on the portion of the training data I set aside as the test dataset. In this model, I removed the last 10% of rows from the training dataset and put them in a new dataset, which I then tested my classifier on. I separated the output classes from the abstract text into their own list and compared that list to the predicted classes to find the training accuracy. This let me test my program and obtain a training accuracy score before submitting it to Kaggle.
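
The training accuracy described above is simply the fraction of held-out rows whose predicted class matches the true class. A minimal sketch, where the function name `accuracy` and the example labels are hypothetical:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Made-up labels: three of the four predictions match.
example_score = accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"])
```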

**Include and explain the training/validation results for the standard and improved Naive Bayes model:** <br>
The base Naive Bayes implementation: Training accuracy: 0.675, Kaggle accuracy: 0.72 <br>
These are the baseline results for the standard Naive Bayes implementation. Since I don’t have access to Mitchell's "Machine Learning" textbook, I used the classic Bayes equation to calculate my predictions: P(B|A) = P(A|B)P(B) / P(A). Here P(A|B) combines the probability of every word in the abstract text occurring and the probability of every word not in the abstract text not occurring, given the class. All these probabilities, as well as the probability of the class itself, are multiplied together and then divided by the product of the probabilities of each word occurring (or not occurring) regardless of class. This gives the probability of a certain class being the correct class. The predicted class is the one with the maximum probability.
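
The calculation described above can be written out explicitly. The notation is an assumption introduced here for illustration and does not appear in the code: $c$ is a class, $d$ is an abstract, $V$ is the vocabulary, and $x_w$ is 1 if word $w$ appears in $d$ and 0 otherwise.

$$
P(c \mid d) \approx \frac{P(c)\,\prod_{w \in V} P(w \mid c)^{x_w}\,\bigl(1 - P(w \mid c)\bigr)^{1 - x_w}}{\prod_{w \in V} P(w)^{x_w}\,\bigl(1 - P(w)\bigr)^{1 - x_w}}
$$

Because the denominator does not depend on $c$, the predicted class is the same with or without it:

$$
\hat{c} = \arg\max_{c}\; P(c)\,\prod_{w \in V} P(w \mid c)^{x_w}\,\bigl(1 - P(w \mid c)\bigr)^{1 - x_w}
$$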

The improved Naive Bayes implementation: Training accuracy: 0.965, Kaggle accuracy: 0.97 <br>
This accuracy is a large improvement over the standard implementation, due to the pre-processing and the other improvements described earlier in this report.

Other models include: <br> 
The improved Naive Bayes implementation without pre-processing and smoothing: Training accuracy: 0.9225, Kaggle accuracy: 0.94333 <br>
This accuracy, while not the best, is a significant improvement over the standard implementation, due to most of the improvements described earlier in this report. It was also a major milestone in this project, as it was the first model to give me a significant increase in accuracy over my standard implementation.

The improved Naive Bayes implementation + n-grams: Training accuracy: 0.965, Kaggle accuracy: 0.97 <br>
This model has the same accuracy values as the other improved model but also includes n-grams of up to three words. It takes much longer to produce its predictions because of the n-gram calculations, but I feel that it might produce better accuracy on a much larger dataset. <br>
*For a full list of the predictions, run the code for the desired model and uncomment the indicated print statement.*
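
The n-gram idea used in the last model (keep a group of two or three adjacent words only if it occurs more than once and always with the same class) can be sketched as below. This is a simplified illustration, not the code from the n-gram cell: the names `ngrams` and `class_pure_ngrams` are hypothetical, only forward word order is considered, and an n-gram that has been seen with two different classes is excluded permanently.

```python
def ngrams(tokens, max_n=3):
    """All runs of 1..max_n adjacent tokens, joined with single spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def class_pure_ngrams(rows, max_n=3, min_count=2):
    """rows is a list of (label, text) pairs. Returns the multi-word n-grams that
    occur at least `min_count` times and only ever with a single class."""
    seen = {}  # n-gram -> [label, count], or None once seen with two classes
    for label, text in rows:
        for gram in ngrams(text.split(" "), max_n):
            if " " not in gram:
                continue  # single words are handled by the ordinary word counts
            if gram not in seen:
                seen[gram] = [label, 1]
            elif seen[gram] is None:
                continue
            elif seen[gram][0] != label:
                seen[gram] = None
            else:
                seen[gram][1] += 1
    return {g for g, entry in seen.items() if entry is not None and entry[1] >= min_count}


# Toy data (assumption): "big dog" appears in both classes, so it is rejected.
toy_rows = [("A", "red big cat"), ("A", "red big dog"),
            ("B", "big dog runs"), ("B", "dog runs far")]
pure = class_pure_ngrams(toy_rows)
```

Note that for such n-grams to influence a prediction, the test text would also need to be split into n-grams before the lookup, not only into single words.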

In [40]:
# Base Naive Bayes implementation - Kaggle accuracy: 0.71, Training accuracy: 0.675

import pandas as pd
import math as m
df = pd.read_csv(r'~/Onedrive/Desktop/trg.csv')
training_data = df.values.tolist()
df = pd.read_csv(r'~/Onedrive/Desktop/tst.csv')
# Uncomment this line when checking kaggle accuracy
#test_data = df.values.tolist()

# Comment out these lines when checking kaggle accuracy
test_data = []
test_values = []
temp_test_data = []
for i in range(len(training_data)):
    if i >= (len(training_data) * 0.9):
        temp = [training_data[i][0], training_data[i][2]]
        test_values.append(training_data[i][1])
        test_data.append(temp)
    else:
        temp_test_data.append(training_data[i])
training_data = temp_test_data
# End of lines to be commented out

totalRows = 0
totalWords = set()
wordCounts = {}
totalWordCounts = {}
classCounts = {}
wordProbs = {}
predictions = []

# Training the classifier
def trainClassifier(data, totalRows):
    for row in data:
        totalRows += 1
        classKey = "{0}".format(row[1])
        if classKey in classCounts:
            classCounts[classKey] = classCounts.get(classKey) + 1
        else:
            classCounts[classKey] = 1
        
        wordList = list(set(row[2].split(" ")))
        for word in wordList:
            totalWords.add(word)
            key = "{0}|{1}".format(word, row[1])
            if key in wordCounts:
                wordCounts[key] = wordCounts.get(key) + 1
            else:
                wordCounts[key] = 1

            key2 = "{0}".format(word)
            if key2 in totalWordCounts:
                totalWordCounts[key2] = totalWordCounts.get(key2) + 1
            else:
                totalWordCounts[key2] = 1

    for word in totalWordCounts:
        for classifier in classCounts:
            key_positive = "P({0}=1|class={1})".format(word, classifier)
            key_negative = "P({0}=0|class={1})".format(word, classifier)
            key = "{0}|{1}".format(word, classifier)
            
            if key in wordCounts:
                numerator = wordCounts.get(key)
                denominator = classCounts.get(classifier) 
                probability_positive = numerator / denominator
                numerator = classCounts.get(classifier) - wordCounts.get(key) 
                probability_negative = numerator / denominator
            else:
                probability_positive = 0
                probability_negative = 1
            
            wordProbs[key_positive] = probability_positive
            wordProbs[key_negative] = probability_negative
        
    return totalRows
        
# Predicting the output classes for test_data
def predict(data):
    counter = 0
    for row in data:
        counter += 1
        classifierScores = {}
        for classifier in classCounts:
            numerator = classCounts.get(classifier) / totalRows
            denominator = 1
            # Use a set so the membership test below is a constant-time lookup
            wordSet = set(row[1].split(" "))
            
            for word in totalWords:
                if word in wordSet:
                    key = "P({0}=1|class={1})".format(word, classifier) 
                    denominator = denominator * (totalWordCounts.get(word) / totalRows)
                else:
                    key = "P({0}=0|class={1})".format(word, classifier)
                    denominator = denominator * ((totalRows - totalWordCounts.get(word)) / totalRows)
                    
                if key not in wordProbs:
                    wordProbs[key] = 0 
                numerator = numerator * wordProbs.get(key)
            
            finalProb = 0
            if denominator != 0:
                finalProb = numerator / denominator
            classifierScores[finalProb] = classifier
             
        prediction = classifierScores.get(max(list(classifierScores.keys())))
        temp_list = [counter, prediction]
        predictions.append(temp_list)
            
totalRows = trainClassifier(training_data, totalRows)
predict(test_data)

# Uncomment to see total list of predictions
#print(predictions)

# Comment out these lines when checking kaggle accuracy
correct = 0
total = 0
for i in range(len(predictions)):
    total += 1
    if predictions[i][1] == test_values[i]:
        correct += 1
print("Accuracy: {0}".format(correct/total))

# Uncomment these two lines to generate a csv of the predictions to check kaggle accuracy
#df = pd.DataFrame(predictions, columns=["id", "class"])
#df.to_csv('output.csv', index=False)
    
print("Done")

Accuracy: 0.675
Done


In [39]:
# Improved Naive Bayes implementation with calculations done in log space, taking into account the frequency of each word
# in the abstract - Kaggle accuracy: 0.94333, Training accuracy: 0.9225

import pandas as pd
import math as m
df = pd.read_csv(r'~/Onedrive/Desktop/trg.csv')
training_data = df.values.tolist()
df = pd.read_csv(r'~/Onedrive/Desktop/tst.csv')
# Uncomment this line when checking kaggle accuracy
#test_data = df.values.tolist()

# Comment out these lines when checking kaggle accuracy
test_data = []
test_values = []
temp_test_data = []
for i in range(len(training_data)):
    if i >= (len(training_data) * 0.9):
        temp = [training_data[i][0], training_data[i][2]]
        test_values.append(training_data[i][1])
        test_data.append(temp)
    else:
        temp_test_data.append(training_data[i])
training_data = temp_test_data
# End of lines to be commented out

totalRows = 0
totalWords = set()
wordCounts = {}
totalWordCounts = {}
classCounts = {}
wordProbs = {}
predictions = []

# Training the classifier
def trainClassifier(data, totalRows):
    for row in data:
        totalRows += 1
        classKey = "{0}".format(row[1])
        if classKey in classCounts:
            classCounts[classKey] = classCounts.get(classKey) + 1
        else:
            classCounts[classKey] = 1
        
        wordList = list(set(row[2].split(" ")))
        for word in wordList:
            totalWords.add(word)
            key = "{0}|{1}".format(word, row[1])
            if key in wordCounts:
                wordCounts[key] = wordCounts.get(key) + 1
            else:
                wordCounts[key] = 1

            key2 = "{0}".format(word)
            if key2 in totalWordCounts:
                totalWordCounts[key2] = totalWordCounts.get(key2) + 1
            else:
                totalWordCounts[key2] = 1

    for word in totalWordCounts:
        for classifier in classCounts:
            key_negative = "P({0}=0|class={1})".format(word, classifier)
            key = "{0}|{1}".format(word, classifier)
            
            if key in wordCounts:
                numerator = totalWordCounts.get(word) - wordCounts.get(key)
                denominator = totalRows - classCounts.get(classifier) 
                probability_negative = numerator / denominator
            else:
                probability_negative = 1
            
            wordProbs[key_negative] = probability_negative
        
    return totalRows
        
# Predicting the output classes for test_data
def predict(data):
    counter = 0
    for row in data:
        counter += 1
        text = row[1]
        wordList = text.split(" ")
        word_freq = {}
        for word in wordList:
            if word in word_freq:
                word_freq[word] = word_freq.get(word) + 1
            else:
                word_freq[word] = 1
        
        probs_of_class = {}
        for classifier in classCounts:
            prob = classCounts.get(classifier) / totalRows
            if prob != 0:
                total_class_prob = m.log(prob)
            
            sum_of_probs = 0
            unique_wordList = list(set(wordList))
            for word in totalWords:
                key = "P({0}=0|class={1})".format(word, classifier)
                
                prob = wordProbs.get(key)
                if word in word_freq and prob != 0:
                    sum_of_probs += m.log(prob) * word_freq.get(word)
            
            class_prob = total_class_prob - sum_of_probs
            probs_of_class[class_prob] = classifier
        
        prediction = probs_of_class.get(max(list(probs_of_class)))
        temp_list = [counter, prediction]
        predictions.append(temp_list)
        
totalRows = trainClassifier(training_data, totalRows)
predict(test_data)

# Uncomment to see total list of predictions
#print(predictions)

# Comment out these lines when checking kaggle accuracy
correct = 0
total = 0
for i in range(len(predictions)):
    total += 1
    #print("{0} = {1}".format(predictions[i][1], test_values[i]))
    if predictions[i][1] == test_values[i]:
        correct += 1
print("Accuracy: {0}".format(correct/total))

# Uncomment these two lines to generate a csv of the predictions to check kaggle accuracy
#df = pd.DataFrame(predictions, columns=["id", "class"])
#df.to_csv('output.csv', index=False)
    
print("Done")

Accuracy: 0.9225
Done


In [53]:
# Added Laplace smoothing and removed all uncorrelated words (words with a very high frequency) 
# - Kaggle accuracy: 0.97, Training accuracy: 0.965

import pandas as pd
import math as m
df = pd.read_csv(r'~/Onedrive/Desktop/trg.csv')
training_data = df.values.tolist()
df = pd.read_csv(r'~/Onedrive/Desktop/tst.csv')
# Uncomment this line when checking kaggle accuracy
#test_data = df.values.tolist()

# Comment out these lines when checking kaggle accuracy
test_data = []
test_values = []
temp_test_data = []
for i in range(len(training_data)):
    if i >= (len(training_data) * 0.9):
        temp = [training_data[i][0], training_data[i][2]]
        test_values.append(training_data[i][1])
        test_data.append(temp)
    else:
        temp_test_data.append(training_data[i])
training_data = temp_test_data
# End of lines to be commented out

totalRows = 0
totalWords = set()
wordCounts = {}
totalWordCounts = {}
classCounts = {}
wordProbs = {}
predictions = []

# Training the classifier
def trainClassifier(data, totalRows):
    for row in data:
        totalRows += 1
        classKey = "{0}".format(row[1])
        if classKey in classCounts:
            classCounts[classKey] = classCounts.get(classKey) + 1
        else:
            classCounts[classKey] = 1
        
        wordList = list(set(row[2].split(" ")))
        for word in wordList:
            totalWords.add(word)
            key = "{0}|{1}".format(word, row[1])
            if key in wordCounts:
                wordCounts[key] = wordCounts.get(key) + 1
            else:
                wordCounts[key] = 1

            key2 = "{0}".format(word)
            if key2 in totalWordCounts:
                totalWordCounts[key2] = totalWordCounts.get(key2) + 1
            else:
                totalWordCounts[key2] = 1
    
    # Removing words with high frequency
    temp_list = list(totalWordCounts.items())
    temp_list.sort(key = lambda x: x[1], reverse=True)
    highest = temp_list[0][1]
    for word in temp_list:
        if word[1] < highest/2:
            break
        totalWords.remove(word[0])
        totalWordCounts.pop(word[0])
                
    for word in totalWordCounts:
        for classifier in classCounts:
            key_negative = "P({0}=0|class={1})".format(word, classifier)
            key = "{0}|{1}".format(word, classifier)
            
            # Probability calculations (Laplace smoothing: +1 on the count, +2 on the denominator)
            if key in wordCounts:
                numerator = totalWordCounts.get(word) - wordCounts.get(key) + 1
                denominator = totalRows - classCounts.get(classifier) + 2 
                probability_negative = numerator / denominator
            else:
                probability_negative = 0.5
            
            wordProbs[key_negative] = probability_negative
        
    return totalRows
        
# Predicting the output classes for test_data
def predict(data):
    counter = 0
    for row in data:
        counter += 1
        text = row[1]
        wordList = text.split(" ")
        word_freq = {}
        for word in wordList:
            if word in word_freq:
                word_freq[word] = word_freq.get(word) + 1
            else:
                word_freq[word] = 1
        
        probs_of_class = {}
        for classifier in classCounts:
            prob = classCounts.get(classifier) / totalRows
            if prob != 0:
                total_class_prob = m.log(prob)
            
            unique_wordList = list(set(wordList))
            sum_of_probs2 = 0
            for word in unique_wordList:
                key = "P({0}=0|class={1})".format(word, classifier)
                prob = wordProbs.get(key)
                if word in word_freq and prob != 0 and key in wordProbs:
                    sum_of_probs2 += m.log(prob) * word_freq.get(word)
            
            class_prob = total_class_prob - sum_of_probs2
            probs_of_class[class_prob] = classifier
        
        prediction = probs_of_class.get(max(list(probs_of_class)))
        temp_list = [counter, prediction]
        predictions.append(temp_list)
    
totalRows = trainClassifier(training_data, totalRows)
predict(test_data)

# Uncomment to see total list of predictions
#print(predictions)

# Comment out these lines when checking kaggle accuracy
correct = 0
total = 0
for i in range(len(predictions)):
    total += 1
    #print("{0} = {1}".format(predictions[i][1], test_values[i]))
    if predictions[i][1] == test_values[i]:
        correct += 1
print("Accuracy: {0}".format(correct/total))

# Uncomment these two lines to generate a csv of the predictions to check kaggle accuracy
#df = pd.DataFrame(predictions, columns=["id", "class"])
#df.to_csv('output.csv', index=False)

print("Done")

Accuracy: 0.965
Done


In [52]:
# Added n-grams in word groups of 2 and 3 - Kaggle accuracy: 0.97, Training accuracy: 0.965

import pandas as pd
import math as m
df = pd.read_csv(r'~/Onedrive/Desktop/trg.csv')
training_data = df.values.tolist()
df = pd.read_csv(r'~/Onedrive/Desktop/tst.csv')
# Uncomment this line when checking kaggle accuracy
#test_data = df.values.tolist()

# Comment out these lines when checking kaggle accuracy
test_data = []
test_values = []
temp_test_data = []
for i in range(len(training_data)):
    if i >= (len(training_data) * 0.9):
        temp = [training_data[i][0], training_data[i][2]]
        test_values.append(training_data[i][1])
        test_data.append(temp)
    else:
        temp_test_data.append(training_data[i])
training_data = temp_test_data
# End of lines to be commented out

totalRows = 0
totalWords = set()
wordCounts = {}
totalWordCounts = {}
classCounts = {}
wordProbs = {}
predictions = []

# Training classifier
def trainClassifier(data, totalRows):
    wordPairs = {}
    for row in data:
        totalRows += 1
        classKey = "{0}".format(row[1])
        if classKey in classCounts:
            classCounts[classKey] = classCounts.get(classKey) + 1
        else:
            classCounts[classKey] = 1
        
        wordList = list(set(row[2].split(" ")))
        for word in wordList:
            totalWords.add(word)
            key = "{0}|{1}".format(word, row[1])
            if key in wordCounts:
                wordCounts[key] = wordCounts.get(key) + 1
            else:
                wordCounts[key] = 1

            key2 = "{0}".format(word)
            if key2 in totalWordCounts:
                totalWordCounts[key2] = totalWordCounts.get(key2) + 1
            else:
                totalWordCounts[key2] = 1

    # Removing words with high frequency
    temp_list = list(totalWordCounts.items())
    temp_list.sort(key = lambda x: x[1], reverse=True)
    highest = temp_list[0][1]
    for word in temp_list:
        if word[1] < highest/2:
            break
        totalWords.remove(word[0])
        totalWordCounts.pop(word[0])
    
    # Finding all pairs of words
    for row in data:
        wordList = row[2].split(" ")
        classifier = row[1]
        for i in range(len(wordList)):
            if i < len(wordList)-1:
                key = "{0}|{1}".format(wordList[i], wordList[i+1])
                if key in wordPairs:
                    if (wordPairs.get(key)[1] != classifier):
                        wordPairs.pop(key)
                    else:
                        classifier_and_count = [wordPairs.get(key)[0]+1, classifier]
                        wordPairs[key] = classifier_and_count
                else:
                    classifier_and_count = [1, classifier]
                    wordPairs[key] = classifier_and_count
            
            if i > 0:
                key = "{0}|{1}".format(wordList[i], wordList[i-1])
                if key in wordPairs:
                    if (wordPairs.get(key)[1] != classifier):
                        wordPairs.pop(key)
                    else:
                        classifier_and_count = [wordPairs.get(key)[0]+1, classifier]
                        wordPairs[key] = classifier_and_count
                else:
                    classifier_and_count = [1, classifier]
                    wordPairs[key] = classifier_and_count
                    
                    
            if i < len(wordList)-2:
                key = "{0}|{1}|{2}".format(wordList[i], wordList[i+1], wordList[i+2])
                if key in wordPairs:
                    if (wordPairs.get(key)[1] != classifier):
                        wordPairs.pop(key)
                    else:
                        classifier_and_count = [wordPairs.get(key)[0]+1, classifier]
                        wordPairs[key] = classifier_and_count
                else:
                    classifier_and_count = [1, classifier]
                    wordPairs[key] = classifier_and_count
            
            if i > 1:
                key = "{0}|{1}|{2}".format(wordList[i], wordList[i-1], wordList[i-2])
                if key in wordPairs:
                    if (wordPairs.get(key)[1] != classifier):
                        wordPairs.pop(key)
                    else:
                        classifier_and_count = [wordPairs.get(key)[0]+1, classifier]
                        wordPairs[key] = classifier_and_count
                else:
                    classifier_and_count = [1, classifier]
                    wordPairs[key] = classifier_and_count
    
    # Finding the counts of all pairs of words which occur more than once and have the same output class
    for key, value in wordPairs.items():
        if value[0] > 1:
            words = key.split("|")
            if len(words) == 2:
                new_word = "{0} {1}".format(words[0], words[1])
            elif len(words) == 3:
                new_word = "{0} {1} {2}".format(words[0], words[1], words[2])
            
            for row in data:
                if new_word in row[2]:
                    if new_word in totalWordCounts:
                        totalWordCounts[new_word] = totalWordCounts.get(new_word) + 1
                    else:
                        totalWordCounts[new_word] = 1
                    
                    # Look up the per-class count by its "word|class" key
                    key = "{0}|{1}".format(new_word, row[1])
                    if key in wordCounts:
                        wordCounts[key] = wordCounts.get(key) + 1
                    else:
                        wordCounts[key] = 1
            totalWords.add(new_word)
                
    for word in totalWordCounts:
        for classifier in classCounts:
            key_negative = "P({0}=0|class={1})".format(word, classifier)
            key = "{0}|{1}".format(word, classifier)
            
            # Calculating probabilities
            if key in wordCounts:
                numerator = totalWordCounts.get(word) - wordCounts.get(key) + 1
                denominator = totalRows - classCounts.get(classifier) + 2 
                probability_negative = numerator / denominator
            else:
                probability_negative = 1
            
            wordProbs[key_negative] = probability_negative
        
    return totalRows
        
# Predicting the output classes for test_data
def predict(data):
    counter = 0
    for row in data:
        counter += 1
        text = row[1]
        wordList = text.split(" ")
        word_freq = {}
        for word in wordList:
            if word in word_freq:
                word_freq[word] = word_freq.get(word) + 1
            else:
                word_freq[word] = 1
        
        probs_of_class = {}
        for classifier in classCounts:
            prob = classCounts.get(classifier) / totalRows
            if prob != 0:
                total_class_prob = m.log(prob)
            
            unique_wordList = list(set(wordList))
            sum_of_probs2 = 0
            for word in unique_wordList:
                key = "P({0}=0|class={1})".format(word, classifier)
                prob = wordProbs.get(key)
                if word in word_freq and prob != 0 and key in wordProbs:
                    sum_of_probs2 += m.log(prob) * word_freq.get(word)
            
            class_prob = total_class_prob - sum_of_probs2
            probs_of_class[class_prob] = classifier
        
        prediction = probs_of_class.get(max(list(probs_of_class)))
        temp_list = [counter, prediction]
        predictions.append(temp_list)
    
totalRows = trainClassifier(training_data, totalRows)
predict(test_data)

# Uncomment to see total list of predictions
#print(predictions)

# Comment out these lines when checking kaggle accuracy
correct = 0
total = 0
for i in range(len(predictions)):
    total += 1
    #print("{0} = {1}".format(predictions[i][1], test_values[i]))
    if predictions[i][1] == test_values[i]:
        correct += 1
print("Accuracy: {0}".format(correct/total))

# Uncomment these two lines to generate a csv of the predictions to check kaggle accuracy
#df = pd.DataFrame(predictions, columns=["id", "class"])
#df.to_csv('output.csv', index=False)

print("Done")

Accuracy: 0.965
Done
