# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2019 Semester 1
-----
## Project 1: Gaining Information about Naive Bayes
-----
###### Student Name(s): Akira and Callum
###### Python version: 3.7.1 from Anaconda 
###### Submission deadline: 1pm, Fri 5 Apr 2019

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [2]:
import pandas as pd
from IPython.display import display
from collections import Counter, defaultdict
import numpy as np
from math import log

########## POSSIBLE CSVs ##########
d1 =  'anneal.csv'
h1 = 'family,product-type,steel,carbon,hardness,temper_rolling,condition,formability,strength,non-ageing,surface-finish,surface-quality,enamelability,bc,bf,bt,bw-me,bl,m,chrom,phos,cbond,marvi,exptl,ferro,corr,bbvc,lustre,jurofm,s,p,shape,oil,bore,packing,class'.split(',')

d2 =  'breast-cancer.csv'
h2 = 'age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,class'.split(',')

d3 =  'car.csv'
h3 = 'buying,maint,doors,persons,lug_boot,safety,class'.split(',')

d4 =  'cmc.csv'
h4 = 'w-education,h-education,n-child,w-relation,w-work,h-occupation,standard-of-living,media-exposure,class'.split(',')

d5 =  'hepatitis.csv'
h5 = 'sex,steroid,antivirals,fatigue,malaise,anorexia,liver-big,liver-firm,spleen-palpable,spiders,ascites,varices,histology,class'.split(',')

d6 =  'hypothyroid.csv'
h6 = 'sex,on-thyroxine,query-on-thyroxine,on_antithyroid,surgery,query-hypothyroid,query-hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH,T3,TT4,T4U,FTI,TBG,class'.split(',')

d7 =  'mushroom.csv'
h7 = 'cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,class'.split(',')

d8 =  'nursery.csv'
h8 = 'parents,has_nurs,form,children,housing,finance,social,health,class'.split(',')

d9 = 'primary-tumor.csv'
h9 = 'age,sex,histologic-type,degree-of-diffe,bone,bone-marrow,lung,pleura,peritoneum,liver,brain,skin,neck,supraclavicular,axillar,mediastinum,abdominal,class'.split(',')

datasets = [d1,d2,d3,d4,d5,d6,d7,d8,d9]
dataset_headers = [h1,h2,h3,h4,h5,h6,h7,h8,h9]

dictionary = {datasets[i] : dataset_headers[i] for i in range(len(datasets))}

def set_column(filename):
    return dictionary[filename]

In [3]:
def preprocess(filename, testing_on_train = True, k = 10):
    # Add column headers, drop columns with only one unique value (all same value)
    df = pd.read_csv(filename, header = None, names = set_column(filename))
    non_unique = df.apply(pd.Series.nunique)
    df.drop(non_unique[non_unique == 1].index, axis=1, inplace=True) 
    
    """If we are not using the cross validation method"""
    # Return the whole dataset as a dataframe
    if testing_on_train:
        return df
    else:
        temp = df.copy()
        partitions = list()
        
        # k-fold Cross Validation
        divisor = k
        
        for i in range(k):
            partitions.append(temp.sample(frac=1/divisor))
            divisor -= 1
            temp.drop(partitions[-1].index, axis=0, inplace=True)
        
        del temp
        
        # Dictionary of train/test pairs
        cross_validation_pairs = defaultdict(list)
        
        for i in range(k):
            test = partitions[i]
            # Drop the test rows by index label; everything else is the training set
            train = df.drop(test.index)
            cross_validation_pairs["train"].append(train)
            cross_validation_pairs["test"].append(test)
            
        return cross_validation_pairs

In [4]:
def train(train_set):
    N = len(train_set)
    priors = {}
    posteriors = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    """Accessable using posteriors[class j][attribute x][value i]"""
    
    for label in train_set['class'].unique():
        priors[label] = len(train_set.loc[train_set['class'] == label]) / N
        for attribute in train_set.columns[:-1]:
            temp = train_set.loc[train_set['class'] == label, [attribute,'class']]
            n = len(temp)
            count = Counter(temp[attribute])
            for i in count:
                posteriors[label][attribute][i] = count[i] / n
                
    trained_model = {"priors": priors, "posteriors": posteriors}
    
    return trained_model

def cross_validation_train(cross_validation_pairs):
    trained_models = list()
    train_set = cross_validation_pairs["train"]
    test_set = cross_validation_pairs["test"]
    N = len(train_set)
    
    for i in range(N):
        trained_models.append(train(train_set[i]))
        
    return trained_models

In [5]:
def predict(trained_model, test_set):
    priors = trained_model["priors"]
    posteriors = trained_model["posteriors"]
    
    """Drop the class labels of the test set"""
    test_labels = test_set['class']
    test = test_set.drop('class', axis=1)
    # Iterate over the attribute columns only (the class column has been dropped)
    cols = test.columns
    
    """Probabilistic Smoothing with epsilon -> 0"""
    n = len(test_labels)
    epsilon = 1e-100
    
    """Model Prediction"""
    prediction = {}
    
    """The Predicted Labels to be Returned"""
    """(Key, Value) = (Test Instance Row, Predicted Label)"""
    predicted_labels = {}
    
    for i in range(n):
        instance = test.iloc[i]
        for label in priors.keys():       
            prob = log(priors[label])
            for attribute in cols:
                try:
                    # Only score the attribute if its value is not missing;
                    # missing values ('?') are simply ignored
                    if instance[attribute] != '?':
                        prob += log(posteriors[label][attribute][instance[attribute]])
                except (KeyError, ValueError):
                    # The value was never seen with this class (its stored
                    # probability is 0, so log() fails); use epsilon instead
                    prob += log(epsilon)
                    
            prediction[label] = prob
        
        """Choose the predicted class with the highest probability"""
        predicted_labels[i] = max(prediction, key=prediction.get)
        
    return predicted_labels

def cross_validation_predict(trained_models, cross_validation_pairs):
    test_set = cross_validation_pairs["test"]
    N = len(test_set)
    predictions = list()
    
    for i in range(N):
        predictions.append(predict(trained_models[i], test_set[i]))
    
    return predictions
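
The `predict` function above smooths unseen values with a tiny epsilon. As a point of comparison for the add-k regime mentioned in Question 5, here is a minimal sketch of add-k smoothing for a single attribute within a single class. The function name `add_k_likelihoods` and its `domain` argument are hypothetical; the notebook's `train()` does not currently record the full set of values an attribute can take, so that list would have to be supplied separately.

```python
from collections import Counter

def add_k_likelihoods(values, domain, k=1.0):
    """Return add-k smoothed estimates of P(value | class) for one attribute.

    values: attribute values observed among the training instances of one class
    domain: every value the attribute can take (assumed to be known up front)
    k:      pseudo-count added to the count of every value in the domain
    """
    counts = Counter(values)
    denominator = len(values) + k * len(domain)
    return {v: (counts[v] + k) / denominator for v in domain}

# Toy example: 'c' never occurs with this class but still receives non-zero mass
probs = add_k_likelihoods(['a', 'a', 'b'], domain=['a', 'b', 'c'], k=1)
print(probs)
```

With k = 1 this is Laplace smoothing. Unlike the epsilon approach, probability mass is taken from the observed values, so the estimates still sum to 1.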

In [6]:
def evaluate(predicted_labels, test_set):
    test_labels = test_set['class']
    n = len(test_labels)
    
    return [1 if predicted_labels[i] == test_labels.iloc[i] else 0 for i in range(n)]

def cross_validation_evaluate(predictions, cross_pairs):
    test = cross_pairs["test"]
    N = len(test)
    results = list()
    for i in range(N):
        results.append(evaluate(predictions[i], test[i]))
        
    return results


In [7]:
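def entropy(labels):
    # Shannon entropy (in bits) of a pandas Series of labels
    probs = labels.value_counts(normalize=True)
    return -sum(p * log(p, 2) for p in probs if p > 0)

def info_gain(dataset):
    # Standard definition: IG(attribute) = H(class) - sum over values v of P(v) * H(class | attribute = v)
    attributes = dataset.columns[:-1]
    label = dataset.columns[-1]
    gains = defaultdict(float)
    
    for attribute in attributes:
        weights = dataset[attribute].value_counts(normalize=True)
        mean_info = sum(w * entropy(dataset.loc[dataset[attribute] == v, label]) for v, w in weights.items())
        gains[attribute] = entropy(dataset[label]) - mean_info
    
    return gains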
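
For reference, the `info_gain` cell above is presumably aiming at the standard textbook definition of information gain. The notation below is ours, not from the project specification: $C$ is the class, $A$ is an attribute, and probabilities are estimated as relative frequencies in the dataset.

$$H(C) = -\sum_{c} P(C = c)\,\log_2 P(C = c)$$

$$IG(A) = H(C) - \sum_{a} P(A = a)\,H(C \mid A = a)$$

Here $H(C \mid A = a)$ is the entropy of the class labels restricted to the instances where $A = a$, so the second term is the mean information remaining after splitting on $A$.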

In [8]:
# Script to run all functions and print the results
print("*"*40)
k = int(input("Enter k value for k-Fold Cross Validation: "))
print("*"*40)
for data in datasets:
    print(f"Processing {data} ...",end=" ")
    
    """TESTING ON THE TRAIN DATA"""
    
    df = preprocess(data)
    model = train(df)
    prediction = predict(model, df)
    print("...")
    results = evaluate(prediction, df)
    print(f"Accuracy for Testing on the Training Data: {100*sum(results)/len(results):.2f}%")
    print("...",end=" ")
    
    
    """k-FOLD CROSS VALIDATION"""
    
    cross_validation_pairs = preprocess(data, testing_on_train = False, k = k)
    trained_models = cross_validation_train(cross_validation_pairs)
    print("...",end=" ")
    predictions = cross_validation_predict(trained_models, cross_validation_pairs)
    print("...")
    cross_validation_results = cross_validation_evaluate(predictions, cross_validation_pairs)
    print("Accuracy using k-Fold Cross Validation: " + " ".join([f"{100*sum(i) / len(i):.2f}%" for i in cross_validation_results]))
    print(f"Average k-Fold Cross Validation Accuracy: {sum([100*sum(i) / len(i) for i in cross_validation_results]) / len(cross_validation_results):.2f}%")
    print("*"*40)

****************************************
Enter k value for k-Fold Cross Validation: 4
****************************************
Processing anneal.csv ... ...
Accuracy for Testing on the Training Data: 99.11%
... ... ...
Accuracy using k-Fold Cross Validation: 98.66% 98.67% 99.11% 99.56%
Average k-Fold Cross Validation Accuracy: 99.00%
****************************************
Processing breast-cancer.csv ... ...
Accuracy for Testing on the Training Data: 75.52%
... ... ...
Accuracy using k-Fold Cross Validation: 63.89% 80.28% 70.83% 71.83%
Average k-Fold Cross Validation Accuracy: 71.71%
****************************************
Processing car.csv ... ...
Accuracy for Testing on the Training Data: 87.38%
... ... ...
Accuracy using k-Fold Cross Validation: 87.04% 83.10% 85.65% 87.27%
Average k-Fold Cross Validation Accuracy: 85.76%
****************************************
Processing cmc.csv ... ...
Accuracy for Testing on the Training Data: 50.58%
... ... ...
Accuracy using k-Fold Cross Va

Questions (you may respond in a cell or cells below):

1. The Naive Bayes classifiers can be seen to vary, in terms of their effectiveness on the given datasets (e.g. in terms of Accuracy). Consider the Information Gain of each attribute, relative to the class distribution — does this help to explain the classifiers’ behaviour? Identify any results that are particularly surprising, and explain why they occur.
2. The Information Gain can be seen as a kind of correlation coefficient between a pair of attributes: when the gain is low, the attribute values are uncorrelated; when the gain is high, the attribute values are correlated. In supervised ML, we typically calculate the Information Gain between a single attribute and the class, but it can be calculated for any pair of attributes. Using the pair-wise IG as a proxy for attribute interdependence, in which cases are our NB assumptions violated? Describe any evidence (or indeed, lack of evidence) that this has some effect on the effectiveness of the NB classifier.
3. Since we have gone to all of the effort of calculating Information Gain, we might as well use that as a criterion for building a “Decision Stump” (1-R classifier). How does the effectiveness of this classifier compare to Naive Bayes? Identify one or more cases where the effectiveness is notably different, and explain why.
4. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold–out or cross–validation evaluation strategy. How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
5. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the Naive Bayes classifier? Explain why, or why not.
6. Naive Bayes is said to elegantly handle missing attribute values. For the datasets with missing values, is there any evidence that the performance is different on the instances with missing values, compared to the instances where all of the values are present? Does it matter which, or how many values are missing? Would an imputation strategy have any effect on this?

Don't forget that groups of 1 student should respond to question (1), and one other question of your choosing. Groups of 2 students should respond to question (1) and question (2), and two other questions of your choosing. Your responses should be about 150-250 words each.
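
Question 2 treats Information Gain as a measure of dependence between any pair of attributes. A minimal sketch of that pair-wise calculation follows. The helper names `entropy` and `pairwise_info_gain` and the toy data are made up for illustration and are not part of the project template.

```python
from math import log2
import pandas as pd

def entropy(series):
    """Shannon entropy (in bits) of the value distribution of a pandas Series."""
    probs = series.value_counts(normalize=True)
    return -sum(p * log2(p) for p in probs if p > 0)

def pairwise_info_gain(df, target, given):
    """IG(target | given) = H(target) - sum_v P(given = v) * H(target | given = v)."""
    weights = df[given].value_counts(normalize=True)
    conditional = sum(
        w * entropy(df.loc[df[given] == v, target]) for v, w in weights.items()
    )
    return entropy(df[target]) - conditional

# Toy data: 'b' is a relabelled copy of 'a', while 'c' is independent of 'a'
toy = pd.DataFrame({
    'a': ['x', 'x', 'y', 'y'],
    'b': ['p', 'p', 'q', 'q'],
    'c': ['m', 'n', 'm', 'n'],
})
ig_ab = pairwise_info_gain(toy, 'a', 'b')  # fully dependent pair
ig_ac = pairwise_info_gain(toy, 'a', 'c')  # independent pair
print(ig_ab, ig_ac)
```

A pair with high gain relative to the entropy of the target attribute (like `a` and `b` here) is one where the Naive Bayes independence assumption is likely violated.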

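4. The estimate of effectiveness should become more trustworthy, because testing a supervised algorithm on its own training data is a form of cheating: the model has already seen the answers and will recognise them. Holding out a test set, or using cross-validation, ensures that the model is evaluated on unseen data, at the cost of fewer training instances per model. In the 4-fold run above the measured accuracy dropped only slightly, from 99.11% to 99.00% on anneal.csv, from 75.52% to 71.71% on breast-cancer.csv, and from 87.38% to 85.76% on car.csv.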
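
The trade-off between unseen test data and fewer training instances can be made concrete with a small sketch of a k-fold split over row positions. This is an alternative illustration using NumPy, not the sampling scheme used in `preprocess`; the function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Split row positions 0..n-1 into k disjoint, shuffled test folds.

    Returns a list of (train_idx, test_idx) pairs of NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    pairs = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pairs.append((train_idx, test_idx))
    return pairs

pairs = k_fold_indices(n=10, k=4)
for train_idx, test_idx in pairs:
    # Train and test never overlap, and together they cover every row
    print(len(train_idx), len(test_idx))
```

Each pair could then be applied to a DataFrame with `df.iloc[train_idx]` and `df.iloc[test_idx]`. Every model is trained on roughly (k - 1) / k of the data and tested on rows it has never seen.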

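6. Performance-wise there should be little difference, because `predict` simply skips any attribute whose value is '?', so the remaining attributes still contribute to the score of each class. There is no need to impute, and imputation could even hurt: the fact that a value is missing may itself be significant in implying something about the class.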
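
One way to look for the evidence Question 6 asks about is to split the evaluation results by whether an instance contains a missing value. The sketch below assumes missing values are encoded as '?' (as in `predict`) and that `results` is a list of 1/0 correctness flags like the one returned by `evaluate`. The function name `accuracy_by_missingness` and the toy data are made up for illustration.

```python
import pandas as pd

def accuracy_by_missingness(test_set, results, missing='?'):
    """Compare accuracy on instances with and without missing values.

    test_set: DataFrame of test instances
    results:  list of 1/0 correctness flags, one per row of test_set
    """
    has_missing = (test_set == missing).any(axis=1).to_numpy()
    flags = pd.Series(results).to_numpy()
    summary = {}
    for name, mask in (('with_missing', has_missing), ('complete', ~has_missing)):
        # None marks an empty group, where accuracy is undefined
        summary[name] = flags[mask].mean() if mask.any() else None
    return summary

# Toy data: rows 1 and 2 each contain one missing value
toy = pd.DataFrame({'f1': ['a', '?', 'b', 'a'],
                    'f2': ['x', 'y', '?', 'x'],
                    'class': ['p', 'p', 'n', 'n']})
summary = accuracy_by_missingness(toy, [1, 0, 1, 1])
print(summary)
```

A large gap between the two accuracies on a real dataset would suggest that missing values do affect the classifier, though small groups make such a comparison noisy.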