
### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2020 Semester 1

## Assignment 1: Naive Bayes Classifiers

###### Submission deadline: 7 pm, Monday 20 Apr 2020

**Student Name(s):**    Aneesh Chattaraj

**Student ID(s):**     826860


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [14]:
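# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing
import pandas as pd
from collections import defaultdict
import numpy as np

def preprocess(filename, classcolumn, id_code):
    # Read each line as a single field (';' is assumed not to occur in the files), then split on ','
    temp = pd.read_csv(filename, skiprows=0, header=None, delimiter=';', skip_blank_lines=True)
    df = temp[0].str.split(',', expand=True)
    # Move the class column (1-based position `classcolumn`) to the end and name it 'class'
    class1 = df[classcolumn-1]
    df.drop(labels=[classcolumn-1], axis=1, inplace=True)
    df.insert(len(df.columns), 'class', class1)
    # Drop the ID column if the dataset has one ('none' means there is no ID column)
    if id_code != 'none':
        del df[id_code]
    # Return a list of rows of the form [attr_1, ..., attr_n, class]; all values are strings
    data = df.values.tolist()
    return data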


In [2]:
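# Calculate the prior probability P(class) for each class
def prior_func(data):
    prob = {}
    # Count how many instances belong to each class (the class is the last element of a row)
    for row in data:
        if row[-1] in prob:
            prob[row[-1]] += 1
        else:
            prob[row[-1]] = 1
    # Convert the counts into relative frequencies
    for row in prob.keys():
        prob[row] = prob[row]/len(data)
    return prob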


In [15]:
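# Calculate the conditional probabilities P(attribute value | class).
# (Called "posterior" in this notebook; strictly speaking these are the likelihoods.)

def posterior_func(data):
    # prob[class][attribute index][attribute value] holds a count, normalised below
    prob = defaultdict(lambda : defaultdict(lambda : defaultdict(int)))
    
    for row in data:
        attrs = row[:-1]
        classes = row[-1]
        for attr, a in enumerate(attrs):
            prob[classes][attr][a] += 1
    # Normalise the counts of each attribute within each class so they sum to 1
    for classes in prob:
        for attr in prob[classes]:
            l = sum(prob[classes][attr].values())
            for a in prob[classes][attr]:
                prob[classes][attr][a] = prob[classes][attr][a]/l
    
    # Convert the nested defaultdicts to plain dicts so that looking up an unseen key does not create it
    prob = dict(prob)
    for classes in prob:
        prob[classes] = dict(prob[classes])
        for attr in prob[classes]:
            prob[classes][attr] = dict(prob[classes][attr])
   
    return prob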

In [16]:
# To train NB and get prior and posterior

def train(data):
    return (prior_func(data),posterior_func(data))


In [5]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)
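def predict(preprocessed_data, trained_set):
    # Unpack the priors P(class) and the conditional probabilities P(value | class)
    prior_prob, post_prob = trained_set[0], trained_set[1]
    length = len(preprocessed_data)
    predicted_class = []
    for row in preprocessed_data:
        prediction = {}
        for a, b in prior_prob.items():
            # Start from the prior and multiply in one factor per attribute
            prediction[a] = b
            for i in range(len(row)-1):
                if row[i] in post_prob[a][i]:
                    prediction[a] = prediction[a] * post_prob[a][i][row[i]]
                else:
                    # Value never seen with this class: use a small penalty
                    # factor (0.01 / number of instances) instead of zero
                    prediction[a] = prediction[a] * 0.01
                    prediction[a] = prediction[a] / length
        # Predict the class with the highest score
        max_key = max(prediction, key=prediction.get)
        predicted_class.append(max_key)
    
    return predicted_class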
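
The product of many small probabilities in `predict` can underflow to zero on datasets with many attributes. A common alternative is to sum logarithms instead. The sketch below is illustrative only: `log_score`, the `unseen` fallback value and the toy probabilities are assumptions made for this example, not part of the submitted functions.

```python
import math

def log_score(prior, likelihoods, row, unseen=0.01):
    """Return log P(class) + sum of log P(value | class) for one instance.

    `prior` is a float and `likelihoods` is a list of dicts (one per attribute)
    mapping an attribute value to its conditional probability. `unseen` is an
    assumed fallback probability for values never seen with this class.
    """
    score = math.log(prior)
    for i, value in enumerate(row):
        score += math.log(likelihoods[i].get(value, unseen))
    return score

# Toy model with two classes and two attributes (made-up numbers).
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": [{"sunny": 0.2, "rainy": 0.8}, {"hot": 0.5, "cold": 0.5}],
    "no": [{"sunny": 0.9, "rainy": 0.1}, {"hot": 0.7, "cold": 0.3}],
}
row = ["sunny", "hot"]

# Score every class in log space and pick the largest score.
scores = {c: log_score(priors[c], likelihoods[c], row) for c in priors}
best = max(scores, key=scores.get)
print(best)  # prints no
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would win with the plain product.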


In [6]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

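def evaluate(filename, classcolumn, id_code):
    # Train on the whole dataset and test on the same data
    preprocessed_data = preprocess(filename, classcolumn, id_code)
    trained_set = train(preprocessed_data)
    total = 0
    predict1 = predict(preprocessed_data, trained_set)
    count = len(preprocessed_data)
    
    # Count the instances whose predicted class matches the true class
    for a in range(count):
        if preprocessed_data[a][-1] == predict1[a]:
            total = total+1
    
    accuracy = total/count
    
    # Return the accuracy as a percentage
    return accuracy*100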


In [7]:
# Each entry is [filename, class column (1-based), ID column index or "none"]
list_datasets=[["breast-cancer-wisconsin.data",11,0],["mushroom.data",1,"none"],["lymphography.data",1,"none"],
               ["wine.data",1,"none"],["car.data",7,"none"],["nursery.data",9,"none"],["somerville.data",1,"none"],
               ["adult.data",15,"none"],["bank.data",15,"none"],["university.data",14,0],["wdbc.data",2,0]]
# Naive Bayes accuracy (in %) per dataset, kept for the comparison in Q2
list_datasets_og=[]
for a in range(len(list_datasets)):
    b=evaluate(list_datasets[a][0],list_datasets[a][1],list_datasets[a][2])
    list_datasets_og.append(b)
    print(str(list_datasets[a][0])+'      '+str(b))
    

breast-cancer-wisconsin.data      97.56795422031473
mushroom.data      99.1506646971935
lymphography.data      89.1891891891892
wine.data      100.0
car.data      87.38425925925925
nursery.data      90.30864197530865
somerville.data      67.13286713286713
adult.data      93.78704585240011
bank.data      92.33814779589038
university.data      93.10344827586206
wdbc.data      100.0


## Questions 


If you are in a group of 1, you will respond to question (1), and **one** other of your choosing (two responses in total).

If you are in a group of 2, you will respond to question (1) and question (2), and **two** others of your choosing (four responses in total). 

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions in response to the question, but your formal answer should be added to a separate file.

### Q1
Try discretising the numeric attributes in these datasets and treating them as discrete variables in the naïve Bayes classifier. You can use a discretisation method of your choice and group the numeric values into any number of levels (but around 3 to 5 levels would probably be a good starting point). Does discretizing the variables improve classification performance, compared to the Gaussian naïve Bayes approach? Why or why not?

In [17]:
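# Binning with 3 to 5 levels and discretising the numeric data
from pandas import DataFrame
import math
df=preprocess("wdbc.data",2,0)
df= DataFrame(df)

# Columns 0..29 are the numeric attributes; column 30 is the class
for i in range(30):
    # Maximum of the column, used to scale each value to a small integer level
    k=df[i].astype(float).max()
    for j in range(len(df)):
        df.loc[j, i]= math.floor(math.ceil(float(df.loc[j, i]) * 5) / k)

# to_csv also writes a header row, which preprocess() then reads back as a data row
df.to_csv(r'address of folder to save the file as wdbc_new.data', index = False)

# NOTE: the saved file has 31 columns (30 attributes + class), so column 30
# selected here is the last attribute; the class label is column 31
evaluate('wdbc_new.data',30,"none")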




100.0

In [19]:
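df=preprocess("wine.data",1,"none")
df= DataFrame(df)

# Columns 0..12 are the numeric attributes; column 13 is the class
for i in range(13):
    # Maximum of the column, used to scale each value to a small integer level
    k=df[i].astype(float).max()
    for j in range(len(df)):
        df.loc[j, i]= math.floor(math.ceil(float(df.loc[j, i]) * 3) / k)

# to_csv also writes a header row, which preprocess() then reads back as a data row
df.to_csv(r'address of folder to save the file as wine_new.data', index = False)
evaluate('wine_new.data',14,"none")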
    
    

96.64804469273743

In [20]:
# Gaussian naive Bayes (scikit-learn) on the raw numeric data, for comparison
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

df=preprocess("wine.data",1,"none")
df= DataFrame(df)
classes= df[13]
del df[13]
# Named X rather than train so that the train() function defined above is not overwritten
X=df.astype(float)
model = GaussianNB()
GNB_model=model.fit(X,classes)
pred = GNB_model.predict(X)

accuracy = accuracy_score(classes, pred)
print('Accuracy for wine.data',100*accuracy,'%')


df=preprocess("wdbc.data",2,0)
df= DataFrame(df)
classes= df[30]
del df[30]
X=df.astype(float)
model = GaussianNB()
GNB_model=model.fit(X,classes)
pred = GNB_model.predict(X)

accuracy = accuracy_score(classes, pred)
print('Accuracy for wdbc',100*accuracy,'%')

Accuracy for wine.data 98.87640449438202 %
Accuracy for wdbc 94.20035149384886 %


Answer Q1: Naive Bayes treats all features as conditionally independent given the class. It works best with categorical datasets and smaller datasets. Discretisation converts numeric attributes into a small number of levels, determined by the number of bins selected and the range of values in each column. From our results we can see that our naive Bayes has 100% accuracy on both of our numeric datasets (wine.data and wdbc.data) before discretisation, a strange observation; a likely reason is that every distinct numeric value is treated as its own category and the model is tested on its own training data. When we discretise wdbc.data (30 attributes x 569 rows) into about 5 levels it still gets 100% accuracy, which could be due to grouping the numeric data into a small number of bins producing similar results. wine.data, discretised into about 3 levels, drops to 96.65%, which could be because it has more classes to identify. Compared with Gaussian naive Bayes (98.88% on wine.data and 94.20% on wdbc.data), the discretised version is better on wdbc.data and slightly worse on wine.data, so the outcome depends on the dataset and on the number of levels or bin size taken.
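
As an illustration of what "grouping numeric values into levels" means, the following sketch performs equal-width binning on a single column. It is a different scheme from the scaling used in the cells above, and the helper name `equal_width_bins` and the sample values are hypothetical.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Map numeric values to integer levels 0..n_bins-1 using equal-width bins."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column carries no information: put everything in level 0.
        return np.zeros(len(values), dtype=int)
    # n_bins - 1 interior cut points between the column minimum and maximum.
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    # np.digitize returns, for each value, the number of cut points <= value.
    return np.digitize(values, edges)

levels = equal_width_bins([1.0, 2.0, 3.0, 4.0, 5.0, 9.0], 4)
print(levels.tolist())  # prints [0, 0, 1, 1, 2, 3]
```

With 4 bins over the range 1 to 9 the cut points are 3, 5 and 7, so the outlier 9.0 ends up alone in the top level while most values share the lower levels.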

### Q2
Implement a baseline model (e.g., random or 0R) and compare the performance of the naïve Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline performance.

In [21]:
def zero_rule_algorithm_classification(train_data, test_data):
    # 0R: always predict the most frequent class in the training data
    output_values = [row[-1] for row in train_data]
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for i in range(len(test_data))]
    return predicted
list_datasets=[["breast-cancer-wisconsin.data",11,0],["mushroom.data",1,"none"],["lymphography.data",1,"none"],
               ["wine.data",1,"none"],["car.data",7,"none"],["nursery.data",9,"none"],["somerville.data",1,"none"],
               ["adult.data",15,"none"],["bank.data",15,"none"],["university.data",14,0],["wdbc.data",2,0]]
l=[]
# Named total_diff rather than sum so that the built-in sum() used in posterior_func is not overwritten
total_diff=0
for a in range(len(list_datasets)):
    df=preprocess(list_datasets[a][0],list_datasets[a][1],list_datasets[a][2])
    total=0
    k=zero_rule_algorithm_classification(df,df)
    output_values = [row[-1] for row in df]
    for b in range(len(df)):
        if  k[b]== output_values[b]:
            total=total+1
    # 0R accuracy as a fraction; the naive Bayes accuracy in list_datasets_og is in %
    accuracy= total/len(df)
    print(str(list_datasets[a][0])+'      '+str(accuracy)+'   vs  '+str(list_datasets_og[a]))
    # Relative difference of 0R with respect to naive Bayes
    l.append(((accuracy*100)-list_datasets_og[a])/list_datasets_og[a])
    print(l[a]*100)
    total_diff+=l[a]*100
print('average performance comparison'+str(total_diff/len(list_datasets)))
    



breast-cancer-wisconsin.data      0.6552217453505007   vs  97.56795422031473
-32.84457478005865
mushroom.data      0.517971442639094   vs  99.1506646971935
-47.75915580384854
lymphography.data      0.5472972972972973   vs  89.1891891891892
-38.63636363636364
wine.data      0.398876404494382   vs  100.0
-60.1123595505618
car.data      0.7002314814814815   vs  87.38425925925925
-19.867549668874158
nursery.data      0.3333333333333333   vs  90.30864197530865
-63.08954203691046
somerville.data      0.5384615384615384   vs  67.13286713286713
-19.791666666666664
adult.data      0.7591904425539756   vs  93.78704585240011
-19.051673325037655
bank.data      0.8830151954170445   vs  92.33814779589038
-4.371571609936043
university.data      0.3620689655172414   vs  93.10344827586206
-61.111111111111114
wdbc.data      0.6274165202108963   vs  100.0
-37.258347978910365
average performance comparison-36.71762874257083


ANSWER Q2: Naive Bayes performs better than the 0R classifier on all of the datasets provided. On average, the 0R accuracy is 36.72% lower than the naive Bayes accuracy (relative difference). Since 0R always predicts the most frequent class, its accuracy varies with the class distribution of each dataset, and the improvement from naive Bayes is smallest where the baseline is already high. This can be seen in bank.data (88.30% vs 92.34%), one of the largest datasets with only two classes to predict, and to a lesser extent in adult.data (75.92% vs 93.79%). Considering the numeric datasets wine.data and wdbc.data, the baseline performs better on wdbc.data (62.74%) than on wine.data (39.89%) even though wdbc.data has more attributes and instances, maybe because wdbc.data has only 2 classes to predict while wine.data has 3. Considering the ordinal datasets, the baseline performs worse on nursery.data (33.33%), which has 5 classes to predict, than on car.data (70.02%) and somerville.data (53.85%). This again suggests that the number of classes matters most, with the size of the dataset coming second.

### Q3
Since it’s difficult to model the probabilities of ordinal data, ordinal attributes are often treated as either nominal variables or numeric variables. Compare these strategies on the ordinal datasets provided. Determine which approach gives higher classification accuracy and discuss why.

### Q4
Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold–out or cross–validation evaluation strategy (you should implement this yourself and do not simply call existing implementations from `scikit-learn`). How does your estimate of effectiveness change, compared to testing on the training data? Explain why. (The result might surprise you!)
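
A hold-out strategy can be sketched as a shuffle followed by a split. The helper below is a minimal example under assumptions: `holdout_split`, its `test_fraction` and `seed` parameters, and the toy rows are hypothetical names and data, and it is not a full answer to the question.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split it into (train_rows, test_rows)."""
    rows = list(data)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

# Toy rows in the same [attribute, ..., class] layout that preprocess() returns.
toy = [[str(i % 3), "a" if i % 2 else "b"] for i in range(10)]
train_rows, test_rows = holdout_split(toy, test_fraction=0.3, seed=1)
print(len(train_rows), len(test_rows))  # prints 7 3
```

The training rows would then be passed to `train` and the held-out rows to `predict`, so that accuracy is measured on instances the model has not seen.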

### Q5
Implement one of the advanced smoothing regimes (add-k, Good-Turing). Does changing the smoothing regime (or indeed, not smoothing at all) affect the effectiveness of the naïve Bayes classifier? Explain why, or why not.
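
For reference, add-k smoothing adds a constant k to every count before normalising, so that values never seen with a class still receive a non-zero probability. The sketch below shows the idea for one attribute within one class; `add_k_likelihoods` and the colour values are hypothetical, and it assumes `vocabulary` lists every value the attribute can take.

```python
from collections import Counter

def add_k_likelihoods(values, k=1.0, vocabulary=None):
    """Estimate P(value | class) for one attribute with add-k smoothing.

    `values` are the attribute values observed for one class. `vocabulary`
    is assumed to contain every possible value of the attribute; if it is
    omitted, only the observed values are used.
    """
    counts = Counter(values)
    vocab = sorted(set(vocabulary) if vocabulary is not None else set(counts))
    # Each of the len(vocab) values receives k extra pseudo-counts.
    denom = len(values) + k * len(vocab)
    return {v: (counts[v] + k) / denom for v in vocab}

probs = add_k_likelihoods(["red", "red", "blue"], k=1.0,
                          vocabulary=["red", "blue", "green"])
print(probs["red"])  # prints 0.5
```

Here "green" was never observed but still gets probability 1/6, and the three probabilities sum to 1.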

### Q6
The Gaussian naïve Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in these datasets? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the NB classifier’s predictions.