# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2018 Semester 1
-----
## Project 1: What is labelled data worth to Naive Bayes?
-----
###### Student Name(s): Emmanuel Macario
###### Python version: 3.6

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the seven functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [100]:
# Import useful libraries
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [101]:
# Filename constants used for easy file access
CSV1 = 'breast-cancer-dos.csv'
CSV2 = 'car-dos.csv'
CSV3 = 'hypothyroid-dos.csv'
CSV4 = 'mushroom-dos.csv'


def preprocess(filename):
    """
    Opens a data file in csv, and transforms it into
    a usable format.
    """
    
    # Read csv data into a dataframe. The files
    # have no header row, so with header=None pandas
    # labels the columns with integers 0..n-1.
    df = pd.read_csv(filename, header=None)
    
    return df



# Test preprocess function
df = preprocess(CSV2)
df.head(5)


       0      1  2  3      4     5      6
0  vhigh  vhigh  2  2  small   low  unacc
1  vhigh  vhigh  2  2  small   med  unacc
2  vhigh  vhigh  2  2  small  high  unacc
3  vhigh  vhigh  2  2    med   low  unacc
4  vhigh  vhigh  2  2    med   med  unacc


In [102]:
# This function should build a supervised NB model
def train_supervised(df):
    """
    Builds a supervised Naive Bayes model,
    given a preprocessed set of training
    data, by calculating counts from the
    training data.
    
    Inputs: a training set in a dataframe.
    
    Outputs: 2-tuple containing a class frequency 
    dictionary and also the supervised model.
    
    """
    
    # Dictionary contains the frequencies
    # of each class in the training set
    class_freqs = defaultdict(int) 
    
    # The supervised Naive Bayes model
    supervised_model = {}
    
    # For every instance in the training
    # data, update the model
    for instance in df.values:
        
        # Get the instance's class label
        class_label = instance[-1]
        
        
        # Update the class frequency dictionary
        class_freqs[class_label] += 1
        
        
        # If the class label is not already in
        # the model, create a nested set of
        # dictionaries for the class.
        if class_label not in supervised_model:
            
            # New dictionary for every new class
            supervised_model[class_label] = {}
            
            # For each attribute, initialise
            # a new default dictionary.
            for attr in df.columns[:-1]:
                supervised_model[class_label][attr] = defaultdict(int)
            
            
        # For every attribute in the instance,
        # get its corresponding value and update
        # the model.
        for attr in df.columns[:-1]:
            attr_value = instance[attr]
            supervised_model[class_label][attr][attr_value] += 1
    
    
    return class_freqs, supervised_model



# Test function
class_freqs, supervised_model = train_supervised(df)


supervised_model

{'acc': {0: defaultdict(int,
              {'high': 108, 'low': 89, 'med': 115, 'vhigh': 72}),
  1: defaultdict(int, {'high': 105, 'low': 92, 'med': 115, 'vhigh': 72}),
  2: defaultdict(int, {'2': 81, '3': 99, '4': 102, '5more': 102}),
  3: defaultdict(int, {'4': 198, 'more': 186}),
  4: defaultdict(int, {'big': 144, 'med': 135, 'small': 105}),
  5: defaultdict(int, {'high': 204, 'med': 180})},
 'good': {0: defaultdict(int, {'low': 46, 'med': 23}),
  1: defaultdict(int, {'low': 46, 'med': 23}),
  2: defaultdict(int, {'2': 15, '3': 18, '4': 18, '5more': 18}),
  3: defaultdict(int, {'4': 36, 'more': 33}),
  4: defaultdict(int, {'big': 24, 'med': 24, 'small': 21}),
  5: defaultdict(int, {'high': 30, 'med': 39})},
 'unacc': {0: defaultdict(int,
              {'high': 324, 'low': 258, 'med': 268, 'vhigh': 360}),
  1: defaultdict(int, {'high': 314, 'low': 268, 'med': 268, 'vhigh': 360}),
  2: defaultdict(int, {'2': 326, '3': 300, '4': 292, '5more': 292}),
  3: defaultdict(int, {'2': 576, '4'
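
The nested counts above become a Naive Bayes model once they are normalised. The following is a minimal sketch of that step, assuming the same layout as `class_freqs` and `supervised_model`; the helper name `counts_to_probabilities` and the toy counts are hypothetical and not part of the notebook.

```python
def counts_to_probabilities(class_freqs, model):
    """Turn raw counts into priors P(c) and likelihoods P(x_a = v | c)."""
    total = sum(class_freqs.values())
    priors = {c: n / total for c, n in class_freqs.items()}
    likelihoods = {
        c: {
            attr: {value: count / class_freqs[c]
                   for value, count in value_counts.items()}
            for attr, value_counts in attrs.items()
        }
        for c, attrs in model.items()
    }
    return priors, likelihoods


# Hypothetical toy counts in the same nested layout as supervised_model:
# one attribute (column 0), two classes.
toy_freqs = {'yes': 3, 'no': 1}
toy_model = {
    'yes': {0: {'sunny': 2, 'rainy': 1}},
    'no': {0: {'rainy': 1}},
}

priors, likelihoods = counts_to_probabilities(toy_freqs, toy_model)
print(priors)
print(likelihoods['yes'][0])
```

Within each class, the likelihoods for one attribute sum to 1, which is a quick sanity check on the counts.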

In [108]:

def predict_supervised(df, class_freqs, supervised_model):
    """
    Predicts the class for a set of
    instances, based on a trained model.
    """
    
    total_instances = sum(class_freqs.values())
    print("Total instances:", total_instances)
    
    
    # For each instance, predict the class
    # based on our model.
    
    correct = 0
    
    for instance in df.values:
        
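        # Start below any valid probability so that a class is always
        # chosen, even when every class scores zero.
        max_prob = -1.0
        predicted_class = ""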
        
        
        
        actual_class = instance[-1]
        
        
        # Score each class label and keep the most probable one
        for class_label in class_freqs:
            
            # The prior probability of a class
            prob = class_freqs[class_label]/total_instances
            
            
            # Multiply the prior by the likelihood of each attribute
            # value given the class (the naive independence assumption).
            # .get() avoids adding zero-count keys to the defaultdicts.
            for attr in supervised_model[class_label]:
                count = supervised_model[class_label][attr].get(instance[attr], 0)
                prob *= count / class_freqs[class_label]
                
            
            
            if prob > max_prob:
                max_prob = prob
                predicted_class = class_label
            
        
        
        if actual_class == predicted_class:
            correct += 1
        
        print("Actual: {:8s}  Predicted: {:8s}  Max Probability: {:.4f}"
             .format(actual_class, predicted_class, max_prob))
        
        
    
    print("Accuracy: {:d}/{:d}".format(correct, len(df)))
        
    
    return


# Test the function
predict_supervised(df, class_freqs, supervised_model)

Total instances: 1728
Actual: unacc     Predicted: unacc     Max Probability: 0.0014
Actual: unacc     Predicted: unacc     Max Probability: 0.0009
Actual: unacc     Predicted: unacc     Max Probability: 0.0007
Actual: unacc     Predicted: unacc     Max Probability: 0.0012
Actual: unacc     Predicted: unacc     Max Probability: 0.0008
Actual: unacc     Predicted: unacc     Max Probability: 0.0006
Actual: unacc     Predicted: unacc     Max Probability: 0.0012
Actual: unacc     Predicted: unacc     Max Probability: 0.0007
Actual: unacc     Predicted: unacc     Max Probability: 0.0006
Actual: unacc     Predicted: unacc     Max Probability: 0.0008
Actual: unacc     Predicted: unacc     Max Probability: 0.0005
Actual: unacc     Predicted: unacc     Max Probability: 0.0004
Actual: unacc     Predicted: unacc     Max Probability: 0.0007
Actual: unacc     Predicted: unacc     Max Probability: 0.0004
Actual: unacc     Predicted: unacc     Max Probability: 0.0003
Actual: unacc     Predicted: unac
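
The probabilities printed above are already small with only six attributes, and a product of many such factors can underflow. A common alternative is to add log-probabilities instead. The sketch below illustrates this under stated assumptions: it uses add-k smoothing (k = 1 by default) so that an unseen attribute value never produces log(0), and the names `log_score`, `predict_one` and `n_values` are hypothetical, as is the toy data.

```python
import math


def log_score(instance, class_label, class_freqs, model, n_values, k=1.0):
    """
    Log of P(c) * prod_a P(x_a | c) with add-k smoothing.
    n_values[attr] is the number of distinct values attr can take.
    """
    total = sum(class_freqs.values())
    class_count = class_freqs[class_label]
    score = math.log(class_count / total)
    for attr, counts in model[class_label].items():
        count = counts.get(instance[attr], 0)
        score += math.log((count + k) / (class_count + k * n_values[attr]))
    return score


def predict_one(instance, class_freqs, model, n_values, k=1.0):
    """Return the class label with the highest smoothed log score."""
    return max(
        class_freqs,
        key=lambda c: log_score(instance, c, class_freqs, model, n_values, k),
    )


# Hypothetical toy model: one attribute (column 0) with two possible values.
toy_freqs = {'yes': 3, 'no': 1}
toy_model = {
    'yes': {0: {'sunny': 2, 'rainy': 1}},
    'no': {0: {'rainy': 1}},
}
toy_n_values = {0: 2}

print(predict_one(['sunny'], toy_freqs, toy_model, toy_n_values))
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest smoothed probability.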

In [None]:
# This function should evaluate a set of predictions, in a supervised context 
def evaluate_supervised():
    return
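
One possible shape for this evaluation step is sketched below. It assumes the predictions and true labels are available as two equal-length sequences, which the notebook's `predict_supervised` does not currently return; the helper name `accuracy` is hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: two of the three predictions are correct.
print(accuracy(['unacc', 'acc', 'unacc'], ['unacc', 'acc', 'good']))
```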

In [None]:
# This function should build an unsupervised NB model 
def train_unsupervised():
    return
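
As a starting point for the unsupervised case, the sketch below shows one way the initialisation might look: each instance receives a random class distribution, and the counts become fractional weights. This is an assumption about the intended method, not the notebook's implementation; the names `random_class_distribution` and `fractional_counts` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def random_class_distribution(n_classes, rng):
    """Draw a random distribution over n_classes that sums to 1."""
    weights = [rng.random() for _ in range(n_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def fractional_counts(rows, distributions, n_classes):
    """
    Accumulate soft counts: each instance adds its class weight, rather
    than 1, to the class total and to every attribute value it contains.
    """
    class_weights = [0.0] * n_classes
    model = [defaultdict(lambda: defaultdict(float)) for _ in range(n_classes)]
    for row, dist in zip(rows, distributions):
        for c, weight in enumerate(dist):
            class_weights[c] += weight
            for attr, value in enumerate(row):
                model[c][attr][value] += weight
    return class_weights, model


# Hypothetical unlabelled rows with two attributes each.
rows = [['a', 'x'], ['b', 'x'], ['a', 'y']]
rng = random.Random(0)
dists = [random_class_distribution(2, rng) for _ in rows]
class_weights, soft_model = fractional_counts(rows, dists, 2)
print(class_weights)
```

The soft counts can then be normalised in the same way as the supervised counts, and the resulting model used to re-estimate each instance's class distribution on the next iteration.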

In [None]:
# This function should predict the class distribution for a set of instances, based on a trained model
def predict_unsupervised():
    return

In [None]:
# This function should evaluate a set of predictions, in an unsupervised manner
def evaluate_unsupervised():
    return

Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly surprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out (hint: check out numpy.random.shuffle()) or cross-validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.
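
For the hold-out strategy mentioned in question 3, a minimal sketch of the split is given below. To stay dependency-free it uses the standard library's `random` module rather than NumPy, and it returns row positions instead of data; the helper name `holdout_indices`, the 30% test fraction and the seed are all illustrative assumptions.

```python
import random


def holdout_indices(n_instances, test_fraction=0.2, seed=0):
    """Shuffle the positions 0..n_instances-1 and split them in two."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(n_instances * test_fraction))
    # The first n_test shuffled positions form the test set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_indices(10, test_fraction=0.3, seed=42)
print(len(train_idx), len(test_idx))
```

With the notebook's dataframe, the two lists could be applied as `df.iloc[train_idx]` and `df.iloc[test_idx]`, training on the first subset and evaluating only on the second.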

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.