Standard Naiive Bayes algorithm: Bag of Words implementation, without any stop words and with frequency occurrences.

Extended Naiive Bayes algorithm: Bag of Words implementation, with stop words and with TF-IDF instead of plain frequency occurrences.



Results:
Overall, the algorithm had a 97.6% accuracy on Kaggle. This is good performance. As mentioned on piazza, the Standard Naiive Bayes algorithm is expected a result of around 92%. Also in cross-validation, the Extended Naiive Bayes algorithm performed better than the standard Naiive Bayes Algorithm. (Although in the jupyter notebook, it is not the case because I have touched something).



Explaining the results:

1. Presumption of independence between all the features.
The Standard Naiive Bayes algorithm assumes conditional independence between all features. However, this is realistically impossible. Therefore, if there is any correlation between the features, the performance of the algorithm will decrease.

2. Extensions. 
Unlike the standard Naiive Bayes algorithm, the use of extensions help the extended Naiive Bayes to perform better. There are factors such as using the TF-IDF and using logarithms which are methods to reduce the likelihood of underfitting.


I .Pre-processing: The data was transformed into an accessible state by using the Bag of Words implementation. Through this method, the text in the abstract were recorded as individual words, ignoring order. The strength of this method is that it is easy to perform statistical analysis with the data because the pure frequency of the words is made available. However, this means that the order of the words is lost. Considering the presumption of conditional independent of Naiive Bayes, this choice can be justified.

There was the option to only take the top 1000 frequent words instead of all the text in the dataset. Although this would reduce the likelihood of overfitting to the training dataset, this option was not selected. This is because there are many words which have very little appearances, there is thus also a risk of underfitting. Also, most of the words which have the highest frequencies are common words which provide no information. These words get removed as explained later on. Overall, it seemed reasonable to not reduce the dimensionality of the dataset and take the risk of overfitting.

II. Cross-Validation:
Because the results of the test datset were not disclosed, we are unable to measure the performance of our classifiers. Thus, using cross validation, we can simulate and estimation of accuracy of the classifiers. Due to computing restrictions, only a split of 5 subsets were manageable or else the process would take too long. It was important to use stratified cross validation due to the imbalance of the target class within the dataset. By doing this, it ensured a similar proportion of target classes within each subset, so the classifier does not underfit for some classes . 

III. Extensions used:

1. Removal of common words, implementing stop words
Because the given task is classification by based on textual features, it is logical to remove any common words which do not give any significance to classify. This is because the Naiive Bayes algorithm is revolved around conditional probability of features in classes. For example, the frequency of the article “the” is a common word and its conditional probability will have no significance in classification. Therefore, if “the” is removed from the dataset the algorithm will not be biased to it and  it will cause the accuracy of the algorithm to increase.

2. Log probabilities instead of products.
This was a logical decision because some probability values would be very small. This would like to some conditional probability values being very close to 0 and unable to be distinguished, causing underflow. Using logarithms can solve this issue because logarithms maintain the proportionality and due to log rules, the logs can be added instead of multiplied.

3. TF-IDF

The TF-IDF extension was implemented because for the given task, it seemed logical to consider giving weightings to specific words. There is an imbalance of words and there are words which appear in many different classes, and words which are very class specific. Therefore, TF-IDF reduces the likelihood of underfitting because it enables rare words with importance to have are greater weighting during classification. This enables the algorithm to be more accurate. 

Extensions not used:

1. N-grams
The Naiive Bayes algorithm works with the assumption of conditionally independent features. Thus, the algorithm performs the best when features are conditionally independent. The N-gram extension groups words, removing reducing the likelihood of the features being conditionally independent by merging these words into one. However, this option was not chosen due to the size of the dataset. There are too many words such that computing restrictions are a big issue. There is also an issue of underfitting because there are too many possibilities of N-gram words meaning that classification off N-gram words only may not be effective. 

In [3]:
import numpy as np
import pandas as pd
import math
import random

full_test = pd.read_csv("tst.csv")
full_train = pd.read_csv("trg.csv")


stop_words = ['the', 'of', 'and', 'a', 'in', 'to', 'that', 'is', 'with', 'for', 'from', 'are', 'was', 'by', 'were', 'as', 'this', 'which', 'an', 'we', 'have', 'two', 'these', 'human', 'has', 'be', 'been', 'other', 'on', 'at', 'also', 'but','it']

In [4]:
#Cross valdiation
cv_datasets = []
subsets = {}
number_of_subsets = 5
number_of_instances = len(full_train["id"])
subset_size = int(number_of_instances / number_of_subsets)
classes_in_train = full_train["class"]
unique_classes = list(set(classes_in_train))
indexes_by_class = {target:[] for target in unique_classes}


all_indexes = list(range(number_of_instances))
for p in range(len(all_indexes)):
    classs = full_train["class"][p]
    indexes_by_class[classs].append(p)
remaing_index_by_class = {target:indexes_by_class[target] for target in unique_classes}

for value in range(number_of_subsets):
    subset_indexes = []
    for classc in unique_classes:
        if value == number_of_subsets -1:
            size = len(remaing_index_by_class[classc])
        else:
            size = int(len(indexes_by_class[classc])/number_of_subsets)
        current_indexes = random.sample(remaing_index_by_class[classc], size)
        subset_indexes += current_indexes
        remaing_index_by_class[classc] = list(set(remaing_index_by_class[classc]) - set(current_indexes))
    subsets[str(value+1)] = subset_indexes
subset_names = list(subsets.keys())
print(subset_names)
print("Subsets_created")
for x in range(len(subset_names)):
    cv_train_subsets = subset_names.copy()
    cv_test_subset = cv_train_subsets.pop(x)
    cv_test_index = subsets[cv_test_subset]
    cv_train_indexes =  list(set(all_indexes) - set(cv_test_index))
    cv_datasets.append((cv_train_indexes,cv_test_index))
print("Subset_divisions_complete")
cv_results = {}

['1', '2', '3', '4', '5']
Subsets_created
Subset_divisions_complete


In [5]:
#cross valdiation for standard NB
standard_cv_predictions = []
for x in range(len(cv_datasets)):
    train = cv_datasets[x][0]
    test = cv_datasets[x][1]
    classes = [full_train["class"].iloc[i] for i in train]
    conditional_probabilities = {}
    
    unique_classes = list(set(classes))
    class_index_dict = {key: [] for key in unique_classes}
    unique_words_per_class = {target:[] for target in unique_classes}
    
    for i in range(len(classes)):
        value = classes[i]
        class_index_dict[value].append(i)
        class_priors = {target:len(class_index_dict[target])/len(classes) for target in unique_classes}
    print(class_priors)
    
    full_words_in_dataset = []
    full_occurences = {}
    word_occurence_per_class = {}
    all_words_per_class = {target:0 for target in unique_classes}
    for class1 in unique_classes:
        words = []
        the_dict = {}
        for thing in class_index_dict[class1]:
            string = full_train.iloc[thing]["abstract"].split()
            all_words_per_class[class1] += len(string)
            text = []
            for word in string:
                if word in full_occurences and word not in text:
                    full_occurences[word] += 1
                elif word not in full_occurences and word not in text:
                    full_occurences[word] = 1
                text.append(word)
                if word in the_dict:
                    the_dict[word] += 1
                else:
                    the_dict[word] = 1
            words += text
            full_words_in_dataset += words
        word_occurence_per_class[class1] = the_dict
    unique_words_in_dataset = set(full_words_in_dataset)
    print("DONE")
    
    
    
    for class2 in unique_classes:
        for word in unique_words_in_dataset:
            if word not in word_occurence_per_class[class2]:
                word_occurence_per_class[class2][word] = 0
            q_dict = word_occurence_per_class[class2]
            numerator = q_dict[word]  + 1
            denominator =  all_words_per_class[class2]+ len(unique_words_in_dataset)
            probability = numerator/denominator
            conditional_probabilities[(word,class2)] = probability
    print("DONE")


    final_result_list = []
    list_of_classes = list(unique_classes)
    for i in range(len(test)):
        test_index = test[i]
        test_row = np.array(full_train.iloc[test_index])
        test_row_split = test_row[2].split()
        test_row_words = []
        for word in test_row_split:
            test_row_words.append(word)
        final_predictions = [math.log(class_priors[target]) for target in unique_classes]
        for wordz in test_row_words:
            if wordz in unique_words_in_dataset:
                for classb in unique_classes:
                    index = list_of_classes.index(classb)
                    log = math.log(conditional_probabilities[(wordz,classb)])
                    final_predictions[index] += log
        predict = list_of_classes[final_predictions.index(max(final_predictions))]
        final_result_list.append(predict)
    standard_cv_predictions.append(final_result_list)
    print("DONE")
    print()

{'B': 0.40037476577139286, 'E': 0.5359150530918176, 'A': 0.03216739537788882, 'V': 0.03154278575890069}
DONE
DONE
DONE

{'B': 0.40037476577139286, 'E': 0.5359150530918176, 'A': 0.03216739537788882, 'V': 0.03154278575890069}
DONE
DONE
DONE

{'A': 0.03216739537788882, 'E': 0.5359150530918176, 'V': 0.03154278575890069, 'B': 0.40037476577139286}
DONE
DONE
DONE

{'B': 0.40037476577139286, 'E': 0.5359150530918176, 'A': 0.03216739537788882, 'V': 0.03154278575890069}
DONE
DONE
DONE

{'B': 0.40100250626566414, 'E': 0.5363408521303258, 'A': 0.03132832080200501, 'V': 0.03132832080200501}
DONE
DONE
DONE



In [6]:
#cross valdiation for extended NB
external_cv_predictions = []
for x in range(len(cv_datasets)):
    temp_train = cv_datasets[x][0]
    temp_test = cv_datasets[x][1]
    
    classes = [full_train["class"].iloc[i] for i in temp_train]
    total_instances = len(classes)
    unique_classes = list(set(classes))
    class_index_dict = {key: [] for key in unique_classes}
    for i in range(len(classes)): ####
        value = classes[i]
        class_index_dict[value].append(i)
    class_priors = {target:len(class_index_dict[target])/len(classes) for target in unique_classes}
    unique_words_per_class = {target:[] for target in unique_classes}
    conditional_probabilities = {}
    print(class_priors)

    full_words_in_dataset = []
    full_occurences = {}
    word_occurence_per_class = {}
    for class1 in unique_classes:
        words = []
        the_dict = {}
        for thing in class_index_dict[class1]:
            string = full_train.iloc[thing]["abstract"].split()
            text = []
            for word in string:
                if word not in stop_words:
                    if word in full_occurences and word not in text:
                        full_occurences[word] += 1
                    elif word not in full_occurences and word not in text:
                        full_occurences[word] = 1
                    text.append(word)
                    if word in the_dict:
                        the_dict[word] += 1
                    else:
                        the_dict[word] = 1
            words += text
            full_words_in_dataset += words
        word_occurence_per_class[class1] = the_dict
    unique_words_in_dataset = set(full_words_in_dataset)
    print("DONE")

    for x in unique_classes:
        dictx = word_occurence_per_class[x]
        for word in unique_words_in_dataset:
            if word not in dictx:
                dictx[word] = 0
    sum_values = {}
    for claass in unique_classes:
        sumx = 0
        for wordm in unique_words_in_dataset:
            go_dict = word_occurence_per_class[claass]
            idf =  math.log(total_instances/full_occurences[wordm])
            frequency2 = go_dict[wordm]
            sumx += idf * frequency2
        sum_values[claass] = sumx
    print(sum_values)
    print("DONE")
    for class2 in unique_classes:
        for word in unique_words_in_dataset:
            q_dict = word_occurence_per_class[class2]
            numerator = q_dict[word] * math.log(total_instances/full_occurences[word]) + 1
            denominator = sum_values[class2] + len(unique_words_in_dataset)
            probability = numerator/denominator
            conditional_probabilities[(word,class2)] = probability
    print("DONE")


    final_result_list = []
    list_of_classes = list(unique_classes)
    for i in range(len(temp_test)):
        test_index = temp_test[i]
        test_row = np.array(full_train.iloc[test_index])
        test_row_split = test_row[2].split()
        test_row_words = []
        for word in test_row_split:
            if word not in stop_words:
                test_row_words.append(word)
        final_predictions = [math.log(class_priors[target]) for target in unique_classes]
        for wordz in test_row_words:
            if wordz in unique_words_in_dataset:
                for classb in unique_classes:
                    index = list_of_classes.index(classb)
                    log = math.log(conditional_probabilities[(wordz,classb)])
                    final_predictions[index] += log
        predict = list_of_classes[final_predictions.index(max(final_predictions))]
        final_result_list.append(predict)
    external_cv_predictions.append(final_result_list)
    print("DONE")
    print()

{'B': 0.40037476577139286, 'E': 0.5359150530918176, 'A': 0.03216739537788882, 'V': 0.03154278575890069}
DONE
{'B': 570374.8154865643, 'E': 750951.6783774662, 'A': 47436.857268890155, 'V': 45079.39591486489}
DONE
DONE
DONE

{'B': 0.40037476577139286, 'E': 0.5359150530918176, 'A': 0.03216739537788882, 'V': 0.03154278575890069}
DONE
{'B': 562558.0808327243, 'E': 760915.2677611103, 'A': 44235.61760573611, 'V': 46133.78084821844}
DONE
DONE
DONE

{'A': 0.03216739537788882, 'E': 0.5359150530918176, 'V': 0.03154278575890069, 'B': 0.40037476577139286}
DONE
{'A': 45270.557891655495, 'E': 769066.9976285157, 'V': 43847.90640955105, 'B': 555657.2851180708}
DONE
DONE
DONE

{'B': 0.40037476577139286, 'E': 0.5359150530918176, 'A': 0.03216739537788882, 'V': 0.03154278575890069}
DONE
{'B': 568681.4607612528, 'E': 758024.1379817338, 'A': 44261.923041670765, 'V': 42875.22526313458}
DONE
DONE
DONE

{'B': 0.40100250626566414, 'E': 0.5363408521303258, 'A': 0.03132832080200501, 'V': 0.03132832080200501}
DONE


In [7]:
average1 = 0
average2 = 0
for x in range(len(cv_datasets)):
    test = cv_datasets[x][1]
    test_classes = [full_train["class"].iloc[i] for i in test]
    standard_result = standard_cv_predictions[x]
    extended_result = external_cv_predictions[x]
    standard_count = 0
    extended_count = 0
    total = len(test_classes)
    for i in range(total):
        if test_classes[i] == standard_result[i]:
            standard_count += 1
        if test_classes[i] == extended_result[i]:
            extended_count += 1
    accuracy_s = standard_count/total
    accuracy_e = extended_count/total
    print("Standard Accuracy:", str(accuracy_s),"Extended Accuracy:",str(accuracy_e ))
    average1 += accuracy_s
    average2 += accuracy_e
print("Overall Cross_valdiation_score Standard Accuracy:", str(average1/number_of_subsets),"Extended Accuracy:",str(average2/number_of_subsets) )
        


Standard Accuracy: 0.45112781954887216 Extended Accuracy: 0.42355889724310775
Standard Accuracy: 0.4448621553884712 Extended Accuracy: 0.38095238095238093
Standard Accuracy: 0.4799498746867168 Extended Accuracy: 0.4649122807017544
Standard Accuracy: 0.5087719298245614 Extended Accuracy: 0.474937343358396
Standard Accuracy: 0.4504950495049505 Extended Accuracy: 0.4368811881188119
Overall Cross_valdiation_score Standard Accuracy: 0.4670413657907145 Extended Accuracy: 0.4362484180748901


In [28]:
#I am not sure what I have touched, but the cross valdiation results do not come out as usual. 
#The extended Naiive Bayes algorithm always performed slightly better at around 85%
#I am aware that the better-perfoming algorithm is to be used to predict values for the test set.
#However, because the results do not show as it was, I will proceed with application of the extended Naiive Bayes algorithm

In [21]:
#training/test for test set using extended NB algorithm
classes = full_train["class"]
total_instances = len(full_train["id"])
unique_classes = set(classes)
print(unique_classes)
class_index_dict = {key: [] for key in unique_classes}
for i in range(len(classes)): ####
    value = classes.iloc[i]
    class_index_dict[value].append(i)
class_priors = {target:len(class_index_dict[target])/len(classes) for target in unique_classes}
unique_words_per_class = {target:[] for target in unique_classes}
all_words_per_class = {target:[] for target in unique_classes}
conditional_probabilities = {}
print(class_priors)
full_words_in_dataset = []
full_occurences = {}
word_occurence_per_class = {}
for class1 in unique_classes:
    words = []
    the_dict = {}
    for thing in class_index_dict[class1]:
        string = full_train.iloc[thing]["abstract"].split()
        text = []
        for word in string:
            if word not in stop_words:
                if word in full_occurences and word not in text:
                    full_occurences[word] += 1
                elif word not in full_occurences and word not in text:
                    full_occurences[word] = 1
                text.append(word)
                if word in the_dict:
                    the_dict[word] += 1
                else:
                    the_dict[word] = 1
        words += text
        full_words_in_dataset += words
    word_occurence_per_class[class1] = the_dict
unique_words_in_dataset = set(full_words_in_dataset)
print("done")
for x in unique_classes:
    dictx = word_occurence_per_class[x]
    for word in unique_words_in_dataset:
        if word not in dictx:
            dictx[word] = 0
sum_values = {}
for claass in unique_classes:
    sumx = 0
    for wordm in unique_words_in_dataset:
        go_dict = word_occurence_per_class[claass]
        idf =  math.log(total_instances/full_occurences[wordm])
        frequency2 = go_dict[wordm]
        sumx += idf * frequency2
    sum_values[claass] = sumx
print(sum_values)
print("DONE")
for class2 in unique_classes:
    for word in unique_words_in_dataset:
        all_words_class = all_words_per_class[class2]
        q_dict = word_occurence_per_class[class2]
        numerator = q_dict[word] * math.log(total_instances/full_occurences[word]) + 1
        denominator = sum_values[class2] + len(unique_words_in_dataset)
        probability = numerator/denominator
        conditional_probabilities[(word,class2)] = probability
print("DONE")
#predict
list33 = []
list_of_classes = list(unique_classes)
for i in range(len(full_test["id"])):
    last = np.array(full_test.iloc[i])
    dummy = last[1].split()
    last_words = []
    for word in dummy:
        if word not in stop_words:
            last_words.append(word)
    final_predictions = [math.log(class_priors[target]) for target in unique_classes]
    for wordz in last_words:
        if wordz in unique_words_in_dataset:
            for classb in unique_classes:
                index = list_of_classes.index(classb)
                log = math.log(conditional_probabilities[(wordz,classb)])
                final_predictions[index] += log
    predict = list_of_classes[final_predictions.index(max(final_predictions))]
    list33.append(predict)
print("DONE")
#write out results
results_df = pd.DataFrame(full_test["id"])
results_df["class"] = list33
csv_text = results_df.to_csv(index=False)
f = open("results.csv", "w")
f.write(csv_text)
f.close()
print("DONE")

{'E', 'V', 'A', 'B'}
{'E': 0.536, 'V': 0.0315, 'A': 0.032, 'B': 0.4005}
done
{'E': 974412.9387839177, 'V': 56589.84207723572, 'A': 62689.6861342431, 'B': 681716.2993899308}
DONE
DONE
DONE
DONE
