# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2019 Semester 1
-----
## Project 1: Gaining Information about Naive Bayes
-----
###### Student Name(s): Akira and Callum
###### Python version: 3.7.1 from Anaconda 
###### Submission deadline: 1pm, Fri 5 Apr 2019

Useful background reading on Naive Bayes:
https://www.hackerearth.com/blog/machine-learning/introduction-naive-bayes-algorithm-codes-python-r/
https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [2]:
import pandas as pd
from IPython.display import display, Latex
from collections import Counter, defaultdict
import numpy as np

# POSSIBLE CSVs
# hypothyroid.csv seems like an easy choice for now
d1 =  'anneal.csv'
h1 = 'family,product-type,steel,carbon,hardness,temper_rolling,condition,formability,strength,non-ageing,surface-finish,surface-quality,enamelability,bc,bf,bt,bw-me,bl,m,chrom,phos,cbond,marvi,exptl,ferro,corr,bbvc,lustre,jurofm,s,p,shape,oil,bore,packing,class'.split(',')

d2 =  'breast-cancer.csv'
h2 = 'age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,class'.split(',')

d3 =  'car.csv'
h3 = 'buying,maint,doors,persons,lug_boot,safety,class'.split(',')

d4 =  'cmc.csv'
h4 = 'w-education,h-education,n-child,w-relation,w-work,h-occupation,standard-of-living,media-exposure,class'.split(',')

d5 =  'hepatitis.csv'
h5 = 'sex,steroid,antivirals,fatigue,malaise,anorexia,liver-big,liver-firm,spleen-palpable,spiders,ascites,varices,histology,class'.split(',')

d6 =  'hypothyroid.csv'
h6 = 'sex,on-thyroxine,query-on-thyroxine,on_antithyroid,surgery,query-hypothyroid,query-hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH,T3,TT4,T4U,FTI,TBG,class'.split(',')

d7 =  'mushroom.csv'
h7 = 'cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,class'.split(',')

d8 =  'nursery.csv'
h8 = 'parents,has_nurs,form,children,housing,finance,social,health,class'.split(',')

d9 = 'primary-tumor.csv'
h9 = 'age,sex,histologic-type,degree-of-diffe,bone,bone-marrow,lung,pleura,peritoneum,liver,brain,skin,neck,supraclavicular,axillar,mediastinum,abdominal,class'.split(',')

datasets = [d1,d2,d3,d4,d5,d6,d7,d8,d9]
dataset_headers = [h1,h2,h3,h4,h5,h6,h7,h8,h9]

In [3]:
# Dictionary that holds key 'filename' and values 'attribute names'
dictionary = dict(zip(datasets, dataset_headers))

In [4]:
# Function that finds the corresponding attribute names
def set_column(filename):
    return dictionary[filename]

In [5]:
# Dataset summary function, adapted from my earlier ETL scripts
def describe(filename):
    df = pd.read_csv(filename, header = None, names = set_column(filename))
    
    print("**************************************************************************")
    print(f'NAME OF FILE: {filename}')
    print("**************************************************************************")
    
    # General description for dataset
    print(f"Number of rows: {len(df)}")
    print(f"Number of attributes/columns: {len(df.columns)}")
    print(f"Column names: [{', '.join(list(df.columns))}]")
    print(f"Column datatypes: [{', '.join(set(list(df.dtypes.astype(str))))}]")
    print(f"Shape of dataset: {df.shape}\n")
    
    print("**************************************************************************")

    # Double check for missing values ('-' is treated as the missing marker here;
    # the hypothyroid data also contains '?', which may need the same treatment)
    n_missing = df.replace('-', np.nan).isnull().sum().sum()
    if n_missing == 0:
        print("No missing values and no imputations required.\n")
    else:
        print(f"Number of missing values = {n_missing}")
        print("Missing values and we require imputations.\n")
        print("Using Average:")
        display(df.mean(numeric_only=True))
        print("\nUsing Median:")
        display(df.median(numeric_only=True))
    
    print("**************************************************************************")
        
    print("Number of unique values per attribute:")
    for i in df.columns:
        print(f"{i}: {len(df[i].unique())}")
    
    print("**************************************************************************")
    
    verify_prob = 0
    
    # MODIFIED SECTION
    print('Prior Probabilities:')
          
    priors = {}
    for i in df['class'].unique():    
        
        # Number of class labels / number of instances
        prob = len(df.loc[df['class'] == i]) / len(df)
        
        verify_prob += prob
        
        print(f"Pr({i}) = {prob}")
        
        priors[i] = prob
        
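    # Just in case we break any axioms (compare with a tolerance, since the
    # probabilities are floats and may not sum to exactly 1)
    if not np.isclose(verify_prob, 1.0):
        print('You just broke an axiom. Sum of all probs != 1')
        
        return "????"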
          
    print("**************************************************************************")
    display(df.head(10))

In [6]:
# Calculate and display the priors and conditional probabilities (not stored yet)
df = pd.read_csv('hypothyroid.csv', header = None, names = set_column('hypothyroid.csv'))

priors = {}
for label in df['class'].unique():    
    # Number of class labels / number of instances
    N = len(df)
    prob = len(df.loc[df['class'] == label]) / N
    
    # Dictionary of Prior Probabilities
    print(f"Pr({label}) = {prob}")
    priors[label] = prob
    
    for attribute in df.columns[:-1]:
        # for each attribute given the class
        df1 = df.loc[df['class'] == label, [attribute,'class']]
        
        # Number of instances corresponding to attribute given class
        n = len(df1)
        
        # Count each attribute value within this class (Counter would also work)
        attribute_prob = df1.groupby(attribute).count().rename({'class': 'count'}, axis=1)
        
        # Conditional probability Pr(attribute = value | class = label)
        attribute_prob[f'Pr({attribute}|{label})'] = attribute_prob['count'] / n

        display(attribute_prob)

Pr(hypothyroid) = 0.04773948782801138


sex,count,Pr(sex|hypothyroid)
?,2,0.013245
F,111,0.735099
M,38,0.251656

on-thyroxine,count,Pr(on-thyroxine|hypothyroid)
f,137,0.907285
t,14,0.092715

query-on-thyroxine,count,Pr(query-on-thyroxine|hypothyroid)
f,151,1.0

on_antithyroid,count,Pr(on_antithyroid|hypothyroid)
f,150,0.993377
t,1,0.006623

surgery,count,Pr(surgery|hypothyroid)
f,141,0.933775
t,10,0.066225

query-hypothyroid,count,Pr(query-hypothyroid|hypothyroid)
f,131,0.86755
t,20,0.13245

query-hyperthyroid,count,Pr(query-hyperthyroid|hypothyroid)
f,144,0.953642
t,7,0.046358

pregnant,count,Pr(pregnant|hypothyroid)
f,150,0.993377
t,1,0.006623

sick,count,Pr(sick|hypothyroid)
f,149,0.986755
t,2,0.013245

tumor,count,Pr(tumor|hypothyroid)
f,151,1.0

lithium,count,Pr(lithium|hypothyroid)
f,151,1.0

goitre,count,Pr(goitre|hypothyroid)
f,145,0.960265
t,6,0.039735

TSH,count,Pr(TSH|hypothyroid)
n,1,0.006623
y,150,0.993377

T3,count,Pr(T3|hypothyroid)
n,14,0.092715
y,137,0.907285

TT4,count,Pr(TT4|hypothyroid)
y,151,1.0

T4U,count,Pr(T4U|hypothyroid)
y,151,1.0

FTI,count,Pr(FTI|hypothyroid)
y,151,1.0

TBG,count,Pr(TBG|hypothyroid)
n,148,0.980132
y,3,0.019868


Pr(negative) = 0.9522605121719886


sex,count,Pr(sex|negative)
?,71,0.023572
F,2071,0.687583
M,870,0.288845

on-thyroxine,count,Pr(on-thyroxine|negative)
f,2565,0.851594
t,447,0.148406

query-on-thyroxine,count,Pr(query-on-thyroxine|negative)
f,2957,0.98174
t,55,0.01826

on_antithyroid,count,Pr(on_antithyroid|negative)
f,2971,0.986388
t,41,0.013612

surgery,count,Pr(surgery|negative)
f,2918,0.968792
t,94,0.031208

query-hypothyroid,count,Pr(query-hypothyroid|negative)
f,2791,0.926627
t,221,0.073373

query-hyperthyroid,count,Pr(query-hyperthyroid|negative)
f,2776,0.921647
t,236,0.078353

pregnant,count,Pr(pregnant|negative)
f,2950,0.979416
t,62,0.020584

sick,count,Pr(sick|negative)
f,2915,0.967795
t,97,0.032205

tumor,count,Pr(tumor|negative)
f,2972,0.98672
t,40,0.01328

lithium,count,Pr(lithium|negative)
f,3010,0.999336
t,2,0.000664

goitre,count,Pr(goitre|negative)
f,2919,0.969124
t,93,0.030876

TSH,count,Pr(TSH|negative)
n,467,0.155046
y,2545,0.844954

T3,count,Pr(T3|negative)
n,681,0.226096
y,2331,0.773904

TT4,count,Pr(TT4|negative)
n,249,0.082669
y,2763,0.917331

T4U,count,Pr(T4U|negative)
n,248,0.082337
y,2764,0.917663

FTI,count,Pr(FTI|negative)
n,247,0.082005
y,2765,0.917995

TBG,count,Pr(TBG|negative)
n,2755,0.914675
y,257,0.085325


In [37]:
# Same calculation, now stored in a nested dictionary: test[label][attribute][value]
df = pd.read_csv('hypothyroid.csv', header = None, names = set_column('hypothyroid.csv'))

priors = {}
test = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

for label in df['class'].unique():    
    # Number of class labels / number of instances
    N = len(df)
    prob = len(df.loc[df['class'] == label]) / N
    
    # Dictionary of Prior Probabilities
    # print(f"Pr({label}) = {prob}")
    priors[label] = prob
    
    for attribute in df.columns[:-1]:
        # for each attribute given the class
        df1 = df.loc[df['class'] == label, [attribute,'class']]
        
        # Number of instances corresponding to attribute given class
        n = len(df1)
        
        # REPLACED WITH COUNTER
        count = Counter(df1[attribute])
        ### display(count)
        for i in count:
            test[label][attribute][i] = count[i] / n
        
print(test)
        
# Let's say I wanted Pr(Sex = M | hypothyroid):
display(test['hypothyroid']['sex']['M'])

defaultdict(<function <lambda> at 0x000001BB321756A8>, {'hypothyroid': defaultdict(<function <lambda>.<locals>.<lambda> at 0x000001BB32175B70>, {'sex': defaultdict(<class 'float'>, {'M': 0.25165562913907286, 'F': 0.7350993377483444, '?': 0.013245033112582781}), 'on-thyroxine': defaultdict(<class 'float'>, {'f': 0.9072847682119205, 't': 0.09271523178807947}), 'query-on-thyroxine': defaultdict(<class 'float'>, {'f': 1.0}), 'on_antithyroid': defaultdict(<class 'float'>, {'f': 0.9933774834437086, 't': 0.006622516556291391}), 'surgery': defaultdict(<class 'float'>, {'f': 0.9337748344370861, 't': 0.06622516556291391}), 'query-hypothyroid': defaultdict(<class 'float'>, {'f': 0.8675496688741722, 't': 0.13245033112582782}), 'query-hyperthyroid': defaultdict(<class 'float'>, {'f': 0.9536423841059603, 't': 0.046357615894039736}), 'pregnant': defaultdict(<class 'float'>, {'f': 0.9933774834437086, 't': 0.006622516556291391}), 'sick': defaultdict(<class 'float'>, {'f': 0.9867549668874173, 't': 0.013

0.25165562913907286

In [None]:
# This function should open a data file in csv, and transform it into a usable format 
def preprocess(filename):
    # Read in csv and add the column header
    df = pd.read_csv(filename, header = None, names = set_column(filename))
    
    # describe() above can be called here while exploring the data;
    # comment it out or remove it for the actual submission
    
    # TODO: impute missing values (?), return a matrix?
    return df


In [None]:
# This function should build a supervised NB model
def train():
    return
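
One way the `train` stub could be filled in is sketched below, following the counting approach of the earlier cells: priors from class frequencies, conditional probabilities from per-class value counts. The name `train_sketch`, its `(rows, labels)` signature, and the toy data are assumptions made for illustration; they are not the required prototype, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_sketch(rows, labels):
    """Estimate priors and per-class conditional probabilities by counting.

    rows   -- list of dicts mapping attribute name -> categorical value
    labels -- list of class labels, aligned with rows
    (Hypothetical signature; no smoothing is applied.)
    """
    class_counts = Counter(labels)
    n_total = len(labels)
    priors = {label: k / n_total for label, k in class_counts.items()}

    # value_counts[label][attribute][value] = number of matching instances
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[label][attribute][value] += 1

    # Divide each count by the number of instances in its class
    likelihoods = {
        label: {
            attribute: {v: k / class_counts[label] for v, k in counts.items()}
            for attribute, counts in attributes.items()
        }
        for label, attributes in value_counts.items()
    }
    return priors, likelihoods


# Toy data, invented purely for illustration
rows = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}, {"sex": "M"}]
labels = ["pos", "neg", "neg", "neg"]
priors, likelihoods = train_sketch(rows, labels)
```

Returning plain dictionaries rather than nested `defaultdict`s keeps the model easy to print and avoids silently creating entries when a missing key is looked up.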

In [None]:
# This function should predict the class for an instance or a set of instances, based on a trained model 
def predict():
    return
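
A possible shape for `predict` is sketched below. It scores each class by the log of its prior plus the logs of the matching conditional probabilities, then returns the highest-scoring class. The function name `predict_sketch`, the dictionary layout of the model, and the `epsilon` fallback for unseen values are all assumptions; the probabilities are rounded from the hypothyroid output shown earlier, restricted to two attributes for brevity.

```python
import math


def predict_sketch(instance, priors, likelihoods, epsilon=1e-9):
    """Return the class with the highest log-posterior score for one instance.

    epsilon is a hypothetical stand-in probability for attribute values that
    were never seen with a class; it is not a proper smoothing scheme.
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for attribute, value in instance.items():
            score += math.log(likelihoods[label][attribute].get(value, epsilon))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Rounded probabilities taken from the hypothyroid output shown earlier
priors = {"hypothyroid": 0.0477, "negative": 0.9523}
likelihoods = {
    "hypothyroid": {"sex": {"F": 0.7351, "M": 0.2517, "?": 0.0132},
                    "on-thyroxine": {"f": 0.9073, "t": 0.0927}},
    "negative": {"sex": {"F": 0.6876, "M": 0.2888, "?": 0.0236},
                 "on-thyroxine": {"f": 0.8516, "t": 0.1484}},
}
label, scores = predict_sketch({"sex": "F", "on-thyroxine": "f"}, priors, likelihoods)
```

Summing logs instead of multiplying probabilities avoids floating-point underflow once many attributes are involved, and it does not change which class scores highest.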

In [None]:
# This function should evaluate a set of predictions, in a supervised context 
def evaluate():
    return
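
As a rough sketch of what `evaluate` might compute, the block below returns accuracy together with confusion counts keyed by `(actual, predicted)`. The name `evaluate_sketch`, its signature, and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def evaluate_sketch(predicted, actual):
    """Return accuracy and a confusion Counter keyed by (actual, predicted)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(k for (a, p), k in confusion.items() if a == p)
    accuracy = correct / len(actual) if actual else 0.0
    return accuracy, confusion


# Toy labels, invented for illustration
actual = ["negative", "negative", "hypothyroid", "negative"]
predicted = ["negative", "negative", "negative", "negative"]
accuracy, confusion = evaluate_sketch(predicted, actual)
```

The confusion counts matter for data like hypothyroid, where the prior for `negative` is about 0.95: a classifier that always predicts `negative` would score high accuracy while never identifying a `hypothyroid` instance.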

In [None]:
# This function should calculate the Information Gain of an attribute or a set of attributes, with respect to the class
def info_gain():
    return
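
For reference, Information Gain for a single categorical attribute is the class entropy minus the weighted entropy of the class within each attribute value. A minimal sketch is given below; the helper names `entropy` and `info_gain_sketch`, their signatures, and the toy data are assumptions, not the required prototype.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def info_gain_sketch(values, labels):
    """IG = H(class) - sum over v of P(v) * H(class | attribute = v)."""
    # Group the class labels by the attribute value they occur with
    partitions = defaultdict(list)
    for value, label in zip(values, labels):
        partitions[value].append(label)
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Toy data, invented for illustration
labels = ["x", "x", "y", "y"]
gain_informative = info_gain_sketch(["a", "a", "b", "b"], labels)    # attribute determines class
gain_uninformative = info_gain_sketch(["a", "b", "a", "b"], labels)  # attribute independent of class
```

Because the function only needs two aligned sequences, passing another attribute's values in place of `labels` gives the pairwise Information Gain between two attributes.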

Questions (you may respond in a cell or cells below):

1. The Naive Bayes classifiers can be seen to vary, in terms of their effectiveness on the given datasets (e.g. in terms of Accuracy). Consider the Information Gain of each attribute, relative to the class distribution — does this help to explain the classifiers’ behaviour? Identify any results that are particularly surprising, and explain why they occur.
2. The Information Gain can be seen as a kind of correlation coefficient between a pair of attributes: when the gain is low, the attribute values are uncorrelated; when the gain is high, the attribute values are correlated. In supervised ML, we typically calculate the Information Gain between a single attribute and the class, but it can be calculated for any pair of attributes. Using the pair-wise IG as a proxy for attribute interdependence, in which cases are our NB assumptions violated? Describe any evidence (or indeed, lack of evidence) that this has some effect on the effectiveness of the NB classifier.
3. Since we have gone to all of the effort of calculating Information Gain, we might as well use that as a criterion for building a “Decision Stump” (1-R classifier). How does the effectiveness of this classifier compare to Naive Bayes? Identify one or more cases where the effectiveness is notably different, and explain why.
4. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out or cross-validation evaluation strategy. How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
5. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the Naive Bayes classifier? Explain why, or why not.
6. Naive Bayes is said to elegantly handle missing attribute values. For the datasets with missing values, is there any evidence that the performance is different on the instances with missing values, compared to the instances where all of the values are present? Does it matter which, or how many values are missing? Would an imputation strategy have any effect on this?
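
For question 5, add-k smoothing adds a constant k to every value count and k times the number of possible values to the denominator, so no conditional probability is ever exactly zero. The sketch below is one possible formulation; the function name `add_k_likelihoods` and its arguments are assumptions. The example uses the `tumor` counts for class `hypothyroid` from the output earlier, where only `f` was observed (151 instances).

```python
from collections import Counter


def add_k_likelihoods(values, domain, k=1.0):
    """Add-k estimate of P(value | class) for one attribute within one class.

    values -- observed attribute values for the instances of that class
    domain -- every value the attribute can take (across all classes)
    """
    counts = Counter(values)
    denominator = len(values) + k * len(domain)
    return {v: (counts[v] + k) / denominator for v in domain}


# tumor | hypothyroid: 151 instances with 'f', none with 't'
observed = ["f"] * 151
smoothed = add_k_likelihoods(observed, domain=["f", "t"], k=1.0)
```

Without smoothing, Pr(tumor = t | hypothyroid) would be 0 and would zero out the whole product for that class; with k = 1 it becomes 1/153 instead.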

Don't forget that groups of 1 student should respond to question (1), and one other question of your choosing. Groups of 2 students should respond to question (1) and question (2), and two other questions of your choosing. Your responses should be about 150-250 words each.
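
For question 4, a hold-out strategy can be as simple as shuffling the instance indices once and reserving a fraction of them for testing. The sketch below is one way to do this; the name `holdout_split`, the default `test_fraction` of 0.2, and the fixed `seed` are assumptions chosen for illustration.

```python
import random


def holdout_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle indices once and split them into train and test partitions."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    test_rows = [rows[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_rows, train_labels, test_rows, test_labels


# Toy data, invented for illustration: the label is the parity of the row value
rows = list(range(10))
labels = [r % 2 for r in rows]
train_rows, train_labels, test_rows, test_labels = holdout_split(rows, labels)
```

Only the training partition should be used to estimate the probabilities; the test partition is reserved for evaluation.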