# Improving Standard Naive Bayes on the Medline Dataset

## Code Explanation

### Chosen Representation Of Data

I chose to identify the k most frequently occurring words across all of the abstracts, and to use each of these k words as an attribute of the X data. 

(Note that my final improved classifier uses k = 200.)

The value of each attribute is the frequency of that word in that instance's abstract. 
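
As a rough illustration of this representation, the sketch below turns three made-up abstracts into word-frequency vectors. The abstracts, the value k = 3 and the variable names are assumptions for the example only; they are not taken from the dataset.

```python
from collections import Counter

# Hypothetical toy abstracts; the real ones come from the training CSV.
abstracts = [
    "the cell binds the protein",
    "protein levels in the cell",
    "patients received the drug",
]

# Tokenise on whitespace, as load_data does with str.split().
tokenised = [abstract.split() for abstract in abstracts]

# The k most frequent words across all abstracts become the attributes.
k = 3
counts = Counter(word for words in tokenised for word in words)
vocabulary = [word for word, _ in counts.most_common(k)]

# Each instance becomes a vector of per-word frequencies.
vectors = [[words.count(word) for word in vocabulary] for words in tokenised]

print(vocabulary)
print(vectors)
```

Each row of vectors lines up with vocabulary, so the first number in a row is the frequency of the most common word in that abstract.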

### Data Preprocessing

Train / Test Split before Preprocessing: 
- Before I preprocess the data into the chosen data representation, I first split the entire dataset into a training set and a testing set. If I were instead to preprocess all of the data and then split it into train and test sets, I would be violating a fundamental rule of machine learning: that the test data should not "see" any part of the training data.


- Otherwise, the result of my preprocessing on the test set would be influenced by the result of my preprocessing on the training set. I avoid this by splitting the data into train and test sets first. 
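
One way to honour this rule is sketched below: split first, then choose the vocabulary from the training portion only, and reuse it for the test portion. This is a sketch of the principle under assumed toy data, and the helper to_counts is a hypothetical name; it is not the code used later in this notebook.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical toy data: (ID, tokenised abstract) pairs.
data = [(i, text.split()) for i, text in enumerate([
    "alpha beta beta", "beta gamma", "alpha alpha", "gamma delta", "delta delta beta",
])]

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
order = rng.permutation(len(data))
n_train = math.floor(0.8 * len(data))     # 80% for training
train = [data[i] for i in order[:n_train]]
test = [data[i] for i in order[n_train:]]

# The vocabulary is chosen from the training abstracts only.
train_counts = Counter(w for _, words in train for w in words)
vocabulary = [w for w, _ in train_counts.most_common(2)]

def to_counts(rows, vocabulary):
    """Map each (ID, words) row to its word-frequency vector."""
    return np.array([[words.count(w) for w in vocabulary] for _, words in rows])

X_train = to_counts(train, vocabulary)
X_test = to_counts(test, vocabulary)      # test rows reuse the training vocabulary
```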

Data Preprocessing: 
- My Naive Bayes Classifier takes a feature numpy array and a class numpy array as input during its training phase. These arrays do not store the ID of each instance. This works because the order of the IDs is maintained, and the training data is not used when making predictions. 


- Because the Naive Bayes Classifier takes numpy arrays as input, I convert the data into numpy arrays as part of my preprocessing step. 


- However, the feature data contains one long piece of text as the instance's abstract. I don't want just one attribute to be used to train the model, as almost all instances contain a different abstract. Instead, I want to train the model on a number of different words, by taking the frequency of the words into account. I am using a Multinomial Naive Bayes Model, where the number of times that a word appears in an instance's abstract is the value for the instance at that word. 


- Because of this, in my preprocessing phase I first grab the k most common words from all of the abstracts in the X data, where k = 10 for the standard Naive Bayes Model. For each instance in the X data, I then remove its abstract and replace it with k new attributes, one for each of the k most common words. Because I chose to use a Multinomial Naive Bayes Model, the value of each attribute is the number of times that the attribute's word appears in the instance's abstract. 


- My final step of preprocessing is to turn the new X data into a numpy array, where row i is a numpy array consisting of the attribute values of instance i. (The ID of the instance is not used in the model.) 

### Method Extensions

Functions used for preprocessing and model evaluation, as well as for finding an improved model over the standard Naive Bayes model: 
- def load_data(filename):
This is used to load all of the data from the given filename, and return a feature array, a class array and all of the words from all of the abstracts as a single array. 
The feature array is returned in the form, [[ID, abstract], ...] and the class array is returned in the form, [[ID, class], ...]. 


- def get_train_test_split(feature_data, class_data):
This splits the entire dataset into training and testing sets. 80% of the data is used for training. I first grab a random permutation of the indices, and then split the feature_data and class_data accordingly to return training X and Y sets, as well as testing X and Y sets. 


- def get_most_common_K_words(all_abstracts, k):
I use the Counter class to quickly find the k most common words from all of the abstracts inputted, and I return a list containing these k words, ordered from most common to least common.


- def number_of_times_word_in_abstract(abstract, word):
Returns the count of how many times the word appears in an instance's abstract. This is important because I am using multinomial Naive Bayes and need the frequency of each word for an instance. 


- def preprocess(X_array, words):
Here I convert the X_array from [ID, [long piece of text]] to [ID, [word1_frequency, word2_frequency, ... wordk_frequency]]. 



- def values_only_no_IDS(X_data, Y_data):
Here I grab only the values for each instance, and not their IDs. I then return a numpy array of the values from the feature array, as well as a numpy array of the values from the class array. 


- def get_percentage_correct(predictions, actual_values):
I return the percentage of instances correctly classified in the predictions array, using an array of the actual values. 
This is my main scoring metric for the classifiers: simply how accurate they are. 


- def get_k_splits(feature_data, class_data, k):
Here, I split the data into k different splits. This is used in the next function for performing cross validation. 


- def run_cross_validation(nb, feature_data, class_data, words, k):
I run k fold cross validation on the classifier, generating a new accuracy for each fold. I use k = 10 for all cross validation performed in this assignment. 


- def find_best_number_words(X_train, X_test, Y_train, Y_test, all_abstracts):
I use this to train a few different classifiers, where each classifier uses a different number of most common words as the attributes for each instance. 

- def normalise_word_frequencies(feature_array, words):
I use this in an attempt to improve the model's accuracy, by normalising the word frequencies for each instance. 


- def run_cross_validation_normalisation(nb, feature_data, class_data, words, k):
This is the same as my previous cross validation function, except that it performs cross validation on data whose word frequencies have been normalised as per the above function.


- def load_test_data(filename):
I use this to load the test data set. The test data only contains the IDs and their abstracts, so I need to load it differently from the training data.





### Standard Naive Bayes Implementation

I first experimented by implementing the Naive Bayes classifier via many different functions. I would first grab the data and generate a dictionary containing the prior values. Then I would generate a dictionary containing the conditional probability values and so forth. 

However, the run time for this was much too high, especially when trying to debug and improve the code. 

My final Naive Bayes implementation ended up being a class, NaiveBayesClassifier. I found that the run time of my code drastically improved; however, further improvements are still possible. 

How the Implementation Works: 
- The class has a few methods: train(self, X_data, Y_data), get_posterior(self, w, c, X_data, Y_data), make_predictions_on_X_data(self, new_X_data), make_prediction_on_instance(self, instance, new_X_data) and two get methods, get_conditional_probs(self) and get_priors(self). 


- Creating: When I instantiate a Naive Bayes Classifier, I instantiate it as per normal, nb = NaiveBayesClassifier(). 


- Training: When I train the classifier, I pass the processed feature numpy array and the processed class array into the train() method. In this training phase, I set up an array of the prior probabilities for each unique class in the class numpy array. I also set up a dictionary that contains the conditional probability of each word and unique class combination. 


- Getting priors: For each unique class, I divide the number of instances that have this class by the total number of instances. 


- Getting conditional probabilities: For each attribute/word and unique class combination, I find the combination's probability by calling the get_posterior() method. This performs the P(w|c) calculation from the lecture slides. (The code calls this a posterior, although P(w|c) is the conditional probability of the word given the class.) 



- Making Predictions: When I want to make predictions on some X_data, I call the make_predictions_on_X_data() method, which loops through each instance in the X_data and returns the output of the make_prediction_on_instance() method for each one, see below. 


- The make_prediction_on_instance() method: For a given instance, this returns the most likely class, by finding the probability of each unique class and returning the class with the highest probability. The probability for a class is its prior probability multiplied by the conditional probability of each word that appears in the instance, with each conditional probability raised to the power of that word's frequency. 


- The get methods in this class are self-explanatory. 
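
The training and prediction steps above can be sketched on a tiny example. The data, the vectorised numpy formulation and the function name predict are assumptions made for illustration; the notebook's class computes the same quantities with explicit loops.

```python
import numpy as np

# Hypothetical toy data: 4 instances, 3 word attributes, 2 classes.
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 0]])
y = np.array(["A", "A", "B", "B"])

classes = np.unique(y)
num_features = X.shape[1]

# Prior: fraction of instances in each class.
priors = np.array([(y == c).mean() for c in classes])

# P(w|c) with add-one smoothing:
# (count of w in class c + 1) / (count of all words in class c + |V|).
word_totals = np.array([X[y == c].sum(axis=0) for c in classes])
conditionals = (word_totals + 1) / (word_totals.sum(axis=1, keepdims=True) + num_features)

def predict(instance):
    # Score = prior * product over words of P(w|c) ** frequency.
    scores = priors * np.prod(conditionals ** instance, axis=1)
    return classes[np.argmax(scores)]

print(predict(np.array([1, 0, 0])))
print(predict(np.array([0, 2, 0])))
```

Words with a frequency of zero contribute a factor of 1, which matches skipping them as the class does.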




### Standard Naive Bayes Performance 

Whenever I try to find the performance of a model, I run 10 fold cross validation on that model and find the percentage of instances correctly classified for each fold. I then return the average percentage (for the training and testing data). 

When running 10 fold cross validation on a model, I pass in the unprocessed data. The Naive Bayes Classifier assumes that the data it is given is already in the correct order, so I process each fold's data separately, keeping track of the order of the instances within that fold, and then make predictions on that fold.  


I run 10 fold cross validation in order to get a better understanding of how well I can expect the model to perform on future unseen data, the test data csv. 
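
A minimal sketch of the fold bookkeeping is given below. The function names k_fold_indices and average_accuracy are hypothetical, and the sketch scores a fixed set of predictions on each fold, in the same way that the notebook scores an already trained model on each fold.

```python
import numpy as np

def k_fold_indices(num_instances, k, seed=None):
    """Return k disjoint index arrays that together cover every instance once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_instances)
    return np.array_split(shuffled, k)   # fold sizes differ by at most one

def average_accuracy(predictions, actual, folds):
    """Mean percentage of correct predictions over the folds."""
    per_fold = [100.0 * np.mean(predictions[f] == actual[f]) for f in folds]
    return sum(per_fold) / len(per_fold)

# Hypothetical labels; the predictions are copied so every fold scores 100%.
actual = np.array(["A", "B"] * 11 + ["A"])
predictions = actual.copy()
folds = k_fold_indices(num_instances=len(actual), k=10, seed=0)
score = average_accuracy(predictions, actual, folds)
```

Building the folds from one permutation guarantees that no instance appears in more than one fold.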


Performance of Standard Naive Bayes Model: 
- After running repeated 10 fold cross validation on the standard Naive Bayes Model, trained using 10 words, I get an average of 59% accuracy on the training data and 60% accuracy on the test data.


- Note that due to random shuffling, the reader of this document can expect to find slightly different values. 



### Extended Naive Bayes Method

There are a few methods that I use as an attempt to improve the accuracy of the model. 


Optimal Number of Words to Train on: 
- I first find the optimal number of words to use for the classifier, which turns out to be 200. I would have guessed that the classifier's accuracy would keep increasing with the number of words. This was true up to a threshold of 200 words, after which the performance of the models began to decrease. 


- I expect this is because, beyond the threshold, the additional attributes have little explanatory power and essentially add noise to the model, detracting from the predictive ability of the attributes that are able to predict reasonably well.

- I then found the 10 fold cross validation scores on the training and testing data for a few different models, where each model used a different number of words as the attributes of each instance. 

- The model with the greatest CV score was the model trained on 200 words, so I decided to train the future models using 200 words, as this seemed to give the best accuracy on the testing data. 
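
The selection step can be sketched as a simple search that remembers which candidate scored best. The function name pick_best_number_of_words and the toy scoring function are assumptions; in the notebook the score would come from cross validation.

```python
def pick_best_number_of_words(candidates, score_fn):
    """Return (best_number, best_score) for the highest-scoring candidate."""
    best_number, best_score = None, float("-inf")
    for number in candidates:
        score = score_fn(number)
        if score > best_score:
            best_number, best_score = number, score
    return best_number, best_score

def toy_score(number):
    # Artificial stand-in for a cross validation score, peaking at 20.
    return -(number - 20) ** 2

best_number, best_score = pick_best_number_of_words([10, 20, 30], toy_score)
```

Tracking best_number alongside best_score matters, because the loop variable itself always ends on the last candidate rather than the best one.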


Normalising Word Frequencies: 

- I noticed that some instances had relatively large frequencies for some attributes compared to the other instances. I hypothesised that this is because once a word appears in an instance's abstract, it is much more likely to appear in that abstract again. This leads to instances with long abstracts receiving a larger weighting in the model's conditional probabilities.  


- Normalising word frequencies was added as an additional preprocessing step. Where s = the sum of all word frequencies for the abstract, I made the new frequency of each word in that abstract = its old frequency / sqrt(s^2). (Because the frequencies are non-negative, sqrt(s^2) is simply s.) 


- I then trained a new Naive Bayes Classifier on 200 attributes, where the values of each attribute had been processed as per above. 


- Using 10 fold cross validation of this new model on the training data and testing data, I found that normalising the word frequencies actually decreased the model's performance. I will not be using this in the final model. 
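
The effect of this normalisation can be seen on two made-up frequency vectors with the same proportions but different lengths. The function name normalise and the numbers are assumptions for the example.

```python
import math

def normalise(frequencies):
    """Divide each frequency by sqrt(s^2), where s is the sum of the frequencies."""
    s = sum(frequencies)
    if s == 0:                       # none of the chosen words appear; nothing to scale
        return list(frequencies)
    return [f / math.sqrt(s ** 2) for f in frequencies]

short_abstract = [1, 1, 0, 2]        # hypothetical counts, s = 4
long_abstract = [10, 10, 0, 20]      # same proportions, s = 40

print(normalise(short_abstract))
print(normalise(long_abstract))
```

Both vectors normalise to the same proportions, so a long abstract no longer carries more weight than a short one.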


Finding probabilities via Logarithms: 

- I decided to try to improve the model's performance by using logarithms to find the probability that the real class is some given class. 


- This is because the model was storing very small probabilities in its conditional probabilities dictionary. I thought that there could be a loss of information, and therefore an inaccurate probability, when making a prediction on an instance, because I would be multiplying many very small probabilities together. So I decided to try summing the logarithms of the probabilities instead. 


- I made a new NaiveBayesClassifierLogarithim class that behaves exactly the same way as the previous class, except for a few minor changes in the make_prediction_on_instance(self, instance, new_X_data) method of the class. Instead of multiplying together the probabilities, I summed the logarithms of the probabilities. 


- Using the CV scores for the training and testing data, it turned out that this did not increase the performance of the model. 


- So I will not be using this change in the final Naive Bayes Classifier. 
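
The concern about multiplying many small probabilities can be illustrated with made-up numbers. The function names and probabilities below are assumptions; the sketch only shows that a product can underflow to zero while the equivalent sum of logarithms stays finite.

```python
import math

def product_score(prior, word_prob, count):
    """prior * word_prob ** count, computed by repeated multiplication."""
    score = prior
    for _ in range(count):
        score *= word_prob
    return score

def log_score(prior, word_prob, count):
    """log(prior) + count * log(word_prob): the same quantity on a log scale."""
    return math.log(prior) + count * math.log(word_prob)

count = 400                                   # hypothetical number of word occurrences
product_a = product_score(0.5, 0.001, count)  # underflows to 0.0
product_b = product_score(0.5, 0.002, count)  # also underflows to 0.0
log_a = log_score(0.5, 0.001, count)
log_b = log_score(0.5, 0.002, count)
```

Here both products are 0.0, so the two classes cannot be told apart, whereas log_b is larger than log_a. In the multiplying classifier, a score of 0.0 never exceeds the initial max_prob of 0, so the default class "A" would be returned in such a case.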


#### Final Naive Bayes Model Performance

The performance can be found under the heading, Final Naive Bayes Classifier, near the bottom of the notebook. 

The final Naive Bayes Classifier was trained on 200 words, using the logarithm of the probabilities. 

## Code Section

In [1]:
import numpy as np
import math
import random
from random import randrange
from collections import Counter

### Loading Data

In [2]:
def load_data(filename):
    #load the data into a feature array, class array and list with all of the abstracts. 
    feature_array = []
    class_array = []
    all_abstracts = []
    data = np.loadtxt(filename, dtype = str, delimiter = ",")
    for row in data[1:]: 
        feature_array.append((int(row[0]), row[2].split()))
        class_array.append((int(row[0]), row[1]))
        for x in row[2].split():
            all_abstracts.append(x)        
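    #dtype=object keeps each feature row as (ID, list of words), since the abstracts differ in length. 
    return np.array(feature_array, dtype=object), np.array(class_array), np.array(all_abstracts)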

In [3]:
filename = "data/trg.csv" 
feature_array, class_array, all_abstracts = load_data(filename)




### Splitting data into Train/Test Sets

I split the data into train and test sets before preprocessing, because in preprocessing I replace the single attribute with N attributes, 
(a1, a2, ..., aN), where a_i is the ith most frequently occurring word in the total text of the data and N = the number of most common words I am using as attributes. 

If I instead were to preprocess all of the data and then split it into train and test sets, I would be violating a rule of machine learning, that the test set should not "see" any part of the train set. 
This is because the word frequencies of both the train and test sets (all of the data) would then determine the attributes, and therefore the attribute values, of each instance in the test set. 

In [4]:
def get_train_test_split(feature_data, class_data):
    #Splits the feature data and the class data into the relevant train and test splits. 
    N = len(feature_data)
    train_set_size = math.floor(0.8 * N)
    
    random_number_generator = np.random.default_rng()
    permutation = random_number_generator.permutation(N)
    train = permutation[:train_set_size]
    test = permutation [train_set_size:]
    
    X_train = feature_data[train]
    X_test = feature_data[test]
    Y_train = class_data[train]
    Y_test = class_data[test]
    
    return X_train, X_test, Y_train, Y_test

In [5]:
X_train, X_test, Y_train, Y_test =  get_train_test_split(feature_array, class_array)

### PreProcessing Data

I want the data to have attributes other than just a long piece of text. 
I make the attributes the N most frequently occurring words out of all of the words in all of the abstracts. 

I am implementing multinomial Naive Bayes, such that the value for each instance at each attribute = the number of times that the attribute's word appears in the instance's abstract.  

The data will then look like this: 

ID, Word1, Word2, ..., WordN

1, 1, 0, 3

This first row means that the instance with an ID of 1 contains Word1 once, does not contain Word2, and so on. 

Note that I only preprocess the X_train and X_test sets, as these are the only sets that actually contain the attribute values I want. 


In [6]:
def get_most_common_K_words(all_abstracts, k):
    #Returns the k most common words in all_abstracts, ordered from most common to least common. 
    counter = Counter(all_abstracts)
    return [x[0] for x in counter.most_common(k)]

In [7]:
most_common_words = get_most_common_K_words(all_abstracts, 10)

In [8]:
def number_of_times_word_in_abstract(abstract, word):
    #Returns the number of times that a word is in a given abstract. 
    count = 0
    for w in abstract: 
        if w == word: 
            count +=1 
    return count
    
def preprocess(X_array, words):
    #Replace each instance's abstract with a list of word frequencies, one per word in words (in the same order). 
    processed_array = []
    for instance in X_array:
        attributes = [number_of_times_word_in_abstract(instance[1], word) for word in words]
        processed_array.append([instance[0], attributes])
    return np.array(processed_array, dtype=object) #dtype=object keeps each row as [ID, list of frequencies]. 

def values_only_no_IDS(X_data, Y_data):
    #Return only the values of the arrays. (NaiveBayesClassifier does not use the IDS)
    new_X = [x[1] for x in X_data]
    new_Y = [y[1] for y in Y_data]    
    return np.array(new_X), np.array(new_Y)

In [9]:
preprocessed_X_train = preprocess(X_train, most_common_words)
preprocessed_X_test = preprocess(X_test, most_common_words)
final_X_train, final_Y_train = values_only_no_IDS(preprocessed_X_train, Y_train)
final_X_test, final_Y_test = values_only_no_IDS(preprocessed_X_test, Y_test)




### Training Standard Naive Bayes Classifier

In [10]:
class NaiveBayesClassifier:
    
    def train(self, X_data, Y_data): #Takes X_data as an array of arrays containing only the values from the feature_data. 
        X_data = np.array(X_data)
        Y_data = np.array(Y_data)
        
        self._num_instances, self._num_features = X_data.shape        
        self._unique_classes = np.unique(Y_data)
        self._num_unique_classes = len(self._unique_classes)
        self._prior_probs = np.zeros(self._num_unique_classes) #Initialise the prior probabilities. 

        #Get Prior probabilities:
        for i, c in enumerate(self._unique_classes):
            instances_with_class_c = X_data[Y_data == c] #All the instances from X data that have the class c. 
            self._prior_probs[i] = instances_with_class_c.shape[0] / self._num_instances

        #Get the conditional probabilities, P(w|c) (called posteriors in this code): 
        self._conditional_probs = {}
        for x in range(self._num_features):
            self._conditional_probs[x] = {c:self.get_posterior(x, c, X_data, Y_data) for c in self._unique_classes}


    def get_posterior(self, w, c, X_data, Y_data):
        #Returns the posterior probability, P(w|c):
        posterior = 0
        numerators = 0
        denominators = 0
        
        instances_with_class_c = X_data[Y_data == c]#All the instances from X data that have the class c. 
        
        for instance in instances_with_class_c:
            numerators = numerators + instance[w]
            denominators = denominators + sum(instance)
        return (numerators + 1) / (denominators + self._num_features) #Adding 1 to numerator and |V| to denominator to avoid multiplying by zeroes. 
            
    def make_predictions_on_X_data(self, new_X_data):
        #Returns list of predictions for each instance in X data. 
        predictions = [self.make_prediction_on_instance(instance, new_X_data) for instance in new_X_data]
        return np.array(predictions)

    def make_prediction_on_instance(self, instance, new_X_data):       
        #Returns the prediction of a single instance in X data. 
        
        max_prob = 0 #Initialising greatest probability. 
        max_class = "A" #Initialising most likely class. 

        for i in range(len(self._unique_classes)): #For each class: 
            prob = self._prior_probs[i] #Initialise the probability for the class to its prior value. 
            for w in self._conditional_probs.keys():
                if instance[w] != 0: #Make sure instance[word] has a frequency. 
                    prob = prob * (self._conditional_probs[w][self._unique_classes[i]] ** instance[w]) #Grab the probability. 
                    
            if prob > max_prob:
                max_prob = prob
                max_class = self._unique_classes[i]
                
                
        #Evaluate the class with the highest probability. 
        
        return max_class
       
    def get_conditional_probs(self):
        return self._conditional_probs
    
    def get_priors(self):
        return  self._prior_probs

In [11]:
standard_nb = NaiveBayesClassifier()
standard_nb.train(np.array(final_X_train), np.array(final_Y_train))

### Evaluating Model

In [12]:
def get_percentage_correct(predictions, actual_values):
    #Returns the percentage of correctly classified instances. 
    N = len(predictions)
    correct = 0
    for i in range(N):
        if predictions[i] == actual_values[i]:
            correct += 1
        
    return (correct / N) * 100

def get_k_splits(feature_data, class_data, k): #Split data into k disjoint folds. 
    splits_X = []   
    splits_Y = []
    number_of_folds = k
    
    #Copy once, outside the loop, so that an item removed for one fold cannot be drawn again for a later fold. 
    feature_data_copy = list(feature_data)
    class_data_copy = list(class_data)
    size_of_fold = len(feature_data_copy) // number_of_folds #Whole items per fold; any remainder is left out. 
    
    for i in range(number_of_folds): #Creating the k folds. 
        fold = []
        fold_Y = []
        while len(fold) < size_of_fold: 
            index = random.randrange(len(feature_data_copy)) #Random index in feature_data_copy. 
            fold.append(feature_data_copy.pop(index)) #Cannot choose the same item more than once. 
            fold_Y.append(class_data_copy.pop(index))
            
        splits_X.append(fold)
        splits_Y.append(fold_Y)
    return splits_X, splits_Y

def run_cross_validation(nb, feature_data, class_data, words, k):
    #Runs cross validation on k splits. 
    list_of_percentage_corrects_for_each_fold = [] 
    
    splits_X, splits_Y = get_k_splits(feature_data,class_data, k) #Get k splits.
    
    for i in range(len(splits_X)):
        split_X = splits_X[i]
        split_Y = splits_Y[i]
        processed_split_X = preprocess(split_X, words)      
        
        final_X_split, final_Y_split = values_only_no_IDS(processed_split_X, split_Y)  

        predictions_on_split = nb.make_predictions_on_X_data(final_X_split)

        percentage_correct = get_percentage_correct(predictions_on_split, final_Y_split)
        
        list_of_percentage_corrects_for_each_fold.append(percentage_correct)
  
    N = len(list_of_percentage_corrects_for_each_fold)
    return sum(list_of_percentage_corrects_for_each_fold) / N

In [13]:
num_splits = 10
train_CV_score = run_cross_validation(standard_nb, X_train, Y_train, most_common_words, num_splits)

print("On Training Data: ")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", train_CV_score)

test_CV_score = run_cross_validation(standard_nb, X_test, Y_test, most_common_words, num_splits)
print("On Testing/Validation Data")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", test_CV_score)




On Training Data: 
Average percentage of classes corrrectly classified on 10 splits =  60.34375
On Testing/Validation Data
Average percentage of classes corrrectly classified on 10 splits =  65.25


### Extended Naive Bayes Classifier

#### Finding The Best Number of Words to Train On

I want to improve the accuracy of the standard Naive Bayes Classifier. Many of the methods that I hope will increase the accuracy of the Naive Bayes Classifier lie in the preprocessing of the data. 

I first want to see how increasing the number of words/attributes that an instance has influences the accuracy of the model. 

I will grab the most accurate model out of a few different numbers of words. 

**Note that the code below takes a reasonable while to run. To save the marker some time: the model performs best on 200 words.**


In [14]:
def find_best_number_words(X_train, X_test, Y_train, Y_test, all_abstracts):
    #Trains one classifier per candidate number of words and returns the best model with its number of words. 
    best_CV_score = 0
    best_model = None
    best_number = None
    
    numbers = [50, 100, 150, 200, 300]
    for number in numbers:
        most_common_words = get_most_common_K_words(all_abstracts, number)
        
        preprocessed_X_train = preprocess(X_train, most_common_words)
        preprocessed_X_test = preprocess(X_test, most_common_words)
        final_X_train, final_Y_train = values_only_no_IDS(preprocessed_X_train, Y_train)
        final_X_test, final_Y_test = values_only_no_IDS(preprocessed_X_test, Y_test)

        nb = NaiveBayesClassifier()
        nb.train(np.array(final_X_train), np.array(final_Y_train))

        num_splits = 10
        CV_score = run_cross_validation(nb, X_train, Y_train, most_common_words, num_splits)


        if CV_score > best_CV_score:
            best_CV_score = CV_score
            best_model = nb
            best_number = number #Remember which number of words gave the best score. 
            
        test_CV_score = run_cross_validation(nb, X_test, Y_test, most_common_words, num_splits)

        print("--------------------")
        print("Number of words =", number)
        print("Training Data CV Score: ")
        print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", CV_score)
        
        print("testing Data CV Score: ")
        print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", test_CV_score)
        
        print("--------------------")
    
    return best_model, best_number #Previously returned the last number tried rather than the best one. 

In [15]:
###Note that this Code fragment takes a reasonably long time to load. 

best_nb, number_words = find_best_number_words(X_train, X_test, Y_train, Y_test, all_abstracts)



--------------------
Number of words = 50
Training Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  70.71875
testing Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  72.125
--------------------
--------------------
Number of words = 100
Training Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  77.03125
testing Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  77.0
--------------------
--------------------
Number of words = 150
Training Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  79.9375
testing Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  75.75
--------------------
--------------------
Number of words = 200
Training Data CV Score: 
Average percentage of classes corrrectly classified on 10 splits =  83.09375
testing Data CV Score: 
Average percentage of classes corrrectly 

You should be able to see that the best model was trained on instances having 200 words/attributes. 

I will now continue to try and improve this model. 


#### Normalising Word Frequencies

I want to see what effect normalising the word frequency values has. This is because, once a word is first used in an abstract / piece of text, it is much more likely to be reused in the same abstract / text. This leads to instances with larger abstracts dominating the probabilities of the Naive Bayes Classifier. 

I will normalise via: 
- s = sum of all word frequencies for the abstract. 
- newWordFrequency = oldWordFrequency / sqrt(s^2)


*Note that I need to change my cross validation code, as the preprocessing for the data is now different.*

In [16]:
def normalise_word_frequencies(feature_array, words):
    #Divide each instance's word frequencies by sqrt(s^2), where s is the sum of that instance's frequencies. 
    #(words is unused; it is kept so that the call sites do not change.)
    new_X_set = []
    for feature in feature_array:
        word_frequencies = feature[1]
        sum_of_frequencies = sum(word_frequencies)
        if sum_of_frequencies == 0: #None of the chosen words appear, so there is nothing to scale. 
            normalised = list(word_frequencies)
        else:
            normalised = [value / math.sqrt(sum_of_frequencies ** 2) for value in word_frequencies]
        new_X_set.append([feature[0], normalised]) #New lists, so the input is not modified in place. 
    return new_X_set  


def run_cross_validation_normalisation(nb, feature_data, class_data, words, k):
    #Runs cross validation on k splits. 
    list_of_percentage_corrects_for_each_fold = [] 
    
    splits_X, splits_Y = get_k_splits(feature_data, class_data, k) #Get k splits.
    
    for i in range(len(splits_X)):
        split_X = splits_X[i]
        split_Y = splits_Y[i]        
        processed_split_X = preprocess(split_X, words) 
        normalsied_frequenciess_X_split = normalise_word_frequencies(processed_split_X, words) #Normalise the frequencies of each fold.         
        final_X_split, final_Y_split = values_only_no_IDS(normalsied_frequenciess_X_split, split_Y)  

        predictions_on_split = nb.make_predictions_on_X_data(final_X_split)
        percentage_correct = get_percentage_correct(predictions_on_split, final_Y_split)
        
        list_of_percentage_corrects_for_each_fold.append(percentage_correct)
  
    N = len(list_of_percentage_corrects_for_each_fold)
    return sum(list_of_percentage_corrects_for_each_fold) / N


In [17]:
#Training the model with normalised word frequencies and 200 words. 

#setting up X data and Y data:
most_common_words = get_most_common_K_words(all_abstracts, 200)

new_preprocessed_X_train = preprocess(X_train, most_common_words)
new_preprocessed_X_test = preprocess(X_test, most_common_words)

normalised_X_train = normalise_word_frequencies(new_preprocessed_X_train, most_common_words)
normalsied_X_test = normalise_word_frequencies(new_preprocessed_X_test, most_common_words)

new_final_X_train, new_final_Y_train = values_only_no_IDS(normalised_X_train, Y_train)
new_final_X_test, new_final_Y_test = values_only_no_IDS(normalsied_X_test, Y_test)

normalised_200_NaiveBayes = NaiveBayesClassifier()
normalised_200_NaiveBayes.train(np.array(new_final_X_train), np.array(new_final_Y_train))

num_splits = 10
CV_score = run_cross_validation_normalisation(normalised_200_NaiveBayes, X_train, Y_train, most_common_words, num_splits)
print("Training Data CV score:")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", CV_score) 

Test_CV_score = run_cross_validation_normalisation(normalised_200_NaiveBayes, X_test, Y_test, most_common_words, num_splits)
print("Testing Data CV score:")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", Test_CV_score) 




Training Data CV score:
Average percentage of classes corrrectly classified on 10 splits =  53.875
Testing Data CV score:
Average percentage of classes corrrectly classified on 10 splits =  54.375


Normalising the frequcnies negatively impacted the models performance. I will not be using this for the final Naive Bayes Classifier. 

#### Find Probabilities with Logarithims

The probabilities that my model generates are extremely small, so multiplying them together produces even smaller values. 

It is possible that these very small values underflow, or are not stored precisely, in floating point. 

I will therefore change the Naive Bayes model so that it scores each class by adding together the logarithms of the probabilities instead. 

(I will again be using 200 words)
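
The precision concern above can be illustrated with a small sketch. The probability value and the number of word occurrences below are made up for illustration and are not taken from the Medline data; the point is only that a long product of small probabilities collapses to zero in floating point, while the equivalent sum of logarithms stays an ordinary number.

```python
import math

# Hypothetical values: 400 word occurrences, each with P(w|c) = 0.01.
word_prob = 0.01
num_occurrences = 400

# Direct product: the true value is 1e-800, far below the smallest positive
# double (about 5e-324), so the running product underflows to exactly 0.0.
direct_product = 1.0
for _ in range(num_occurrences):
    direct_product *= word_prob

# Sum of base-2 logarithms: the same quantity in log space, roughly -2657.5.
log_sum = sum(math.log2(word_prob) for _ in range(num_occurrences))

print(direct_product)
print(log_sum)
```

Because the logarithm is monotonic, the class with the largest sum of logarithms is also the class with the largest product, so comparing sums does not change which class wins.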


In [18]:
# The change is made in the method make_prediction_on_instance()

class NaiveBayesClassifierLogarithim:
    
    def train(self, X_data, Y_data): #Takes X_data as an array of arrays containing only the values from the feature_data. 
        X_data = np.array(X_data)
        Y_data = np.array(Y_data)
        
        self._num_instances, self._num_features = X_data.shape        
        self._unique_classes = np.unique(Y_data)
        self._num_unique_classes = len(self._unique_classes)
        self._prior_probs = np.zeros(self._num_unique_classes) #Initialise the prior probabilities. 

        #Get Prior probabilities:
        for i, c in enumerate(self._unique_classes):
            instances_with_class_c = X_data[Y_data == c] #All the instances from X data that have the class c. 
            self._prior_probs[i] = instances_with_class_c.shape[0] / self._num_instances

        #Get conditional probabilities P(w|c) (called "posterior" in this code): 
        self._conditional_probs = {}
        for x in range(self._num_features):
            self._conditional_probs[x] = {c:self.get_posterior(x, c, X_data, Y_data) for c in self._unique_classes}


    def get_posterior(self, w, c, X_data, Y_data):
        #Returns the add-one smoothed conditional probability P(w|c) (called "posterior" in this code):
        posterior = 0
        numerators = 0
        denominators = 0
        
        instances_with_class_c = X_data[Y_data == c]#All the instances from X data that have the class c. 
        
        for instance in instances_with_class_c:
            numerators = numerators + instance[w]
            denominators = denominators + sum(instance)
        return (numerators + 1) / (denominators + self._num_features) #Adding 1 to numerator and |V| to denominator to avoid multiplying by zeroes. 
            
    def make_predictions_on_X_data(self, new_X_data):
        #Returns list of predictions for each instance in X data. 
        predictions = [self.make_prediction_on_instance(instance, new_X_data) for instance in new_X_data]
        return np.array(predictions)

    def make_prediction_on_instance(self, instance, new_X_data):       
        #Returns the prediction of a single instance in X data. 
        
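        #NOTE: log-probabilities are never positive, so with max_prob initialised to 0 the
        #"prob > max_prob" test below cannot succeed and max_class stays at "A". 
        max_prob = 0 #Initialising greatest probability. 
        max_class = "A" #Initialising most likely class. 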

        for i in range(len(self._unique_classes)): #For each class: 
            prob = 0 #Initialise the probability. 
            for w in self._conditional_probs.keys():
                if instance[w] != 0: #Make sure instance[word] has a frequency. 
                    
                    #Add log2(P(w|c) ** count) to the running sum of logarithms. 
                    prob = prob + math.log((self._conditional_probs[w][self._unique_classes[i]] ** instance[w]), 2)
                    
            #Adding the log of the prior is equivalent to multiplying by the prior. 
            prob = prob + math.log(self._prior_probs[i], 2)
            if prob > max_prob:
                max_prob = prob
                max_class = self._unique_classes[i]
                
                
        #Evaluate the class with the highest probability. 
        
        return max_class
       
    def get_conditional_probs(self):
        return self._conditional_probs
    
    def get_priors(self):
        return  self._prior_probs
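
The add-one smoothing in get_posterior above can also be written without an explicit loop. The following is a minimal sketch, not the code used for the results in this report: the function name smoothed_conditional_probs and the toy count matrix are hypothetical, and it assumes the rows passed in are the word-count rows of a single class.

```python
import numpy as np

def smoothed_conditional_probs(X_class):
    """Add-one smoothed P(w|c) for every word, from one class's count rows."""
    word_totals = X_class.sum(axis=0)   # occurrences of each word within the class
    vocab_size = X_class.shape[1]       # |V|, the number of word features
    # (count + 1) / (total words in class + |V|), as in get_posterior.
    return (word_totals + 1) / (word_totals.sum() + vocab_size)

# Toy example: two abstracts of the same class, three word features.
X_class = np.array([[2, 0, 1],
                    [1, 0, 0]])
probs = smoothed_conditional_probs(X_class)
print(probs)  # the three values are 4/7, 1/7 and 2/7
```

The second word never occurs in this class, yet it still receives a non-zero probability of 1/7, which is what prevents a single unseen word from zeroing out the whole product.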

In [19]:
#Training the model with the sum of logarithms and 200 words as features. 


#setting up X data and Y data:
most_common_words = get_most_common_K_words(all_abstracts, 200)

new_preprocessed_X_train = preprocess(X_train, most_common_words)
new_preprocessed_X_test = preprocess(X_test, most_common_words)

new_final_X_train, new_final_Y_train = values_only_no_IDS(new_preprocessed_X_train, Y_train)
new_final_X_test, new_final_Y_test = values_only_no_IDS(new_preprocessed_X_test, Y_test)

log_200_Naive_Bayes = NaiveBayesClassifierLogarithim()
log_200_Naive_Bayes.train(np.array(new_final_X_train), np.array(new_final_Y_train))

num_splits = 10
CV_score = run_cross_validation(log_200_Naive_Bayes, X_train, Y_train, most_common_words, num_splits)
print("Training Data CV score: ")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", CV_score) 
print()
print("Testing Data CV score: ")
test_CV_score = run_cross_validation(log_200_Naive_Bayes, X_test, Y_test, most_common_words, num_splits)
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", test_CV_score) 




Training Data CV score: 
Average percentage of classes corrrectly classified on 10 splits =  3.28125

Testing Data CV score: 
Average percentage of classes corrrectly classified on 10 splits =  3.0


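This is not an improvement, so I will not be using this for the final model. The scores are low enough to suggest a bug rather than a real effect of using logarithms: every log-probability is negative, but max_prob is initialised to 0, so the prob > max_prob comparison most likely never succeeds and the classifier always returns the initial class "A".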
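
For comparison, here is a hedged sketch of how a log-space prediction could be written so that negative scores are handled. It is not the code that produced the scores above and has not been run on the Medline data. The function name predict_log_space and the toy inputs are hypothetical. It starts the running maximum at negative infinity and uses count * log2(P(w|c)), which equals log2(P(w|c) ** count) but avoids raising a small probability to a power first.

```python
import math

def predict_log_space(instance, classes, priors, cond_probs):
    """Return the class with the highest log-score for one instance of word counts."""
    best_score = -math.inf  # log-scores are negative, so start below all of them
    best_class = None
    for i, c in enumerate(classes):
        score = math.log2(priors[i])  # log of the prior P(c)
        for w, count in enumerate(instance):
            if count != 0:
                # count * log2(p) is the log of p ** count
                score += count * math.log2(cond_probs[w][c])
        if score > best_score:
            best_score = score
            best_class = c
    return best_class

# Toy example with two classes and two word features (made-up probabilities).
classes = ["A", "B"]
priors = [0.5, 0.5]
cond_probs = {0: {"A": 0.9, "B": 0.1},
              1: {"A": 0.1, "B": 0.9}}
prediction = predict_log_space([0, 3], classes, priors, cond_probs)
print(prediction)  # B
```

In the toy example the second word appears three times and is far more likely under class "B", so "B" gets the higher (less negative) score.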

### Final Naive Bayes Classifier

When trying to improve the model above, I found that it performed best when given the 200 most frequent words as features, using the original NaiveBayesClassifier class. 

In [22]:
final_nb = best_nb
final_nb.train(np.array(new_final_X_train), np.array(new_final_Y_train))
train_CV_score = run_cross_validation(final_nb, X_train, Y_train, most_common_words, num_splits)
test_CV_score = run_cross_validation(final_nb, X_test, Y_test, most_common_words, num_splits)


print("Training Data CV Score:")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", train_CV_score) 
print()
print("Testing Data CV Score:")
print("Average percentage of classes corrrectly classified on", num_splits, "splits = ", test_CV_score) 




Training Data CV Score:
Average percentage of classes corrrectly classified on 10 splits =  82.875

Testing Data CV Score:
Average percentage of classes corrrectly classified on 10 splits =  76.25


### Making Predictions on Test CSV File. 

In [23]:
def values_only_no_IDS(X_data, Y_data):
    #Return only the values of the arrays. (NaiveBayesClassifier does not use the IDS)
    new_X = [x[1] for x in X_data]
    new_Y = [y[1] for y in Y_data]    
    return np.array(new_X), np.array(new_Y)

def load_test_data(filename):
    #Load the test data into a feature array of (ID, list of words in the abstract) pairs. 
    feature_array = []
    data = np.loadtxt(filename, dtype = str, delimiter = ",")
    for row in data[1:]: #Skip the header row. 
        feature_array.append((int(row[0]), row[1].split()))
    #dtype=object because the word lists have different lengths. 
    return np.array(feature_array, dtype = object)

test_filename = "data/tst.csv" 

data = load_test_data(test_filename)

processed_data = preprocess(data, most_common_words)

processed_data_as_only_values = [x[1] for x in processed_data]

cleaned_processed_data = np.array(processed_data_as_only_values)
predictions = []

for i in range(len(processed_data)):
    instance = processed_data[i]
    ID = instance[0]
    value = instance[1]
    prediction = final_nb.make_prediction_on_instance(value, cleaned_processed_data)
    predictions.append((ID, prediction))

#Write predictions to a csv file
import csv
with open("data/test_predictions.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(("id", "class"))
    for item in predictions:
        writer.writerow(item)        




In [24]:
print(predictions)

[(1, 'B'), (2, 'E'), (3, 'A'), (4, 'E'), (5, 'E'), (6, 'E'), (7, 'E'), (8, 'B'), (9, 'B'), (10, 'B'), (11, 'A'), (12, 'E'), (13, 'B'), (14, 'B'), (15, 'E'), (16, 'E'), (17, 'E'), (18, 'B'), (19, 'E'), (20, 'B'), (21, 'B'), (22, 'E'), (23, 'B'), (24, 'B'), (25, 'E'), (26, 'E'), (27, 'E'), (28, 'V'), (29, 'B'), (30, 'E'), (31, 'E'), (32, 'B'), (33, 'E'), (34, 'E'), (35, 'B'), (36, 'V'), (37, 'E'), (38, 'E'), (39, 'B'), (40, 'E'), (41, 'E'), (42, 'B'), (43, 'E'), (44, 'E'), (45, 'E'), (46, 'E'), (47, 'E'), (48, 'B'), (49, 'B'), (50, 'E'), (51, 'E'), (52, 'E'), (53, 'E'), (54, 'B'), (55, 'B'), (56, 'B'), (57, 'E'), (58, 'A'), (59, 'B'), (60, 'B'), (61, 'E'), (62, 'E'), (63, 'E'), (64, 'B'), (65, 'E'), (66, 'B'), (67, 'E'), (68, 'B'), (69, 'E'), (70, 'E'), (71, 'E'), (72, 'E'), (73, 'E'), (74, 'E'), (75, 'E'), (76, 'V'), (77, 'E'), (78, 'B'), (79, 'E'), (80, 'B'), (81, 'E'), (82, 'B'), (83, 'E'), (84, 'E'), (85, 'B'), (86, 'E'), (87, 'E'), (88, 'V'), (89, 'B'), (90, 'E'), (91, 'E'), (92, 'E