# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2018 Semester 1
-----
## Project 1: What is labelled data worth to Naive Bayes?
-----
###### Student Name(s): Emmanuel Macario
###### Python version: 3.6.0

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied to the seven functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [16]:
# Import useful libraries
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [48]:
# Filename constants used for easy file access
CSV1 = 'breast-cancer-dos.csv'
CSV2 = 'car-dos.csv'
CSV3 = 'hypothyroid-dos.csv'
CSV4 = 'mushroom-dos.csv'


def preprocess(filename):
    """
    Opens a data file in csv, and transforms it into
    a usable format.
    """
    
    # Read csv data into a dataframe. With header=None,
    # pandas labels the columns with the integers 0..n-1,
    # so each attribute is identified by its position.
    df = pd.read_csv(filename, header=None)
    
    
    # Partition the data into instances and
    # instance class labels.
    instance_list = df.iloc[:,:-1]
    class_list = df.iloc[:,-1]
    data_set = instance_list, class_list
    
    # Note: the instance list is a dataframe, whereas the class
    # list is a series object
    
    # Give a description of the class distributions
    print("CLASS DISTRIBUTION")
    print(pd.Series(class_list).value_counts())
    
    print()
    print("ATTRIBUTE DISTRIBUTIONS")
    print(instance_list.describe())
    
    
    return data_set


# Test preprocess function
instance_list, class_list = preprocess(CSV4)
instance_list.head(5)

CLASS DISTRIBUTION
e    4208
p    3916
Name: 22, dtype: int64

ATTRIBUTE DISTRIBUTIONS
          0     1     2     3     4     5     6     7     8     9   ...   \
count   8124  8124  8124  8124  8124  8124  8124  8124  8124  8124  ...    
unique     6     4    10     2     9     2     2     2    12     2  ...    
top        x     y     n     f     n     f     c     b     b     t  ...    
freq    3656  3244  2284  4748  3528  7914  6812  5612  1728  4608  ...    

          12    13    14    15    16    17    18    19    20    21  
count   8124  8124  8124  8124  8124  8124  8124  8124  8124  8124  
unique     4     9     9     1     4     3     5     9     6     7  
top        s     w     w     p     w     o     p     w     v     d  
freq    4936  4464  4384  8124  7924  7488  3968  2388  4040  3148  

[4 rows x 22 columns]


   0  1  2  3  4  5  6  7  8  9 ... 12 13 14 15 16 17 18 19 20 21
0  x  s  n  t  p  f  c  n  k  e ...  s  w  w  p  w  o  p  k  s  u
1  x  s  y  t  a  f  c  b  k  e ...  s  w  w  p  w  o  p  n  n  g
2  b  s  w  t  l  f  c  b  n  e ...  s  w  w  p  w  o  p  n  n  m
3  x  y  w  t  p  f  c  n  n  e ...  s  w  w  p  w  o  p  k  s  u
4  x  s  g  f  n  f  w  b  k  t ...  s  w  w  p  w  o  e  n  a  g


In [49]:

def train_supervised(instance_list, class_list):
    """
    Builds a supervised Naive Bayes model,
    given a preprocessed set of training
    data, by calculating counts from the
    training data.
    
    Inputs: a training set of data, consisting
    of a dataframe of instances, and a series of
    class labels for those instances.
    
    Outputs: the supervised model, a nested dictionary
    of the form model[class][attribute][value] -> count.
    """
    
    # The supervised Naive Bayes model
    supervised_model = {}
    
    # For every instance in the training
    # data, update the model
    for i in range(len(instance_list)):
        
        # Get the instance
        instance = instance_list.iloc[i,:]
        
        # Get the associated class label (by position,
        # to match the positional lookup of the instance)
        class_label = class_list.iloc[i]
        
        
        # If the class label is not already in
        # the model, create a nested set of
        # dictionaries for the class.
        if class_label not in supervised_model:
            
            # New dictionary for every new class
            supervised_model[class_label] = {}
            
            # For each attribute, initialise
            # a new default dictionary.
            for attr in instance_list.columns:
                supervised_model[class_label][attr] = defaultdict(int)
                
        
        # For every attribute in the instance,
        # get its corresponding value and update
        # the model.
        for attr in instance_list.columns:
            
            attr_value = instance[attr]
            
            # If a value is missing in a training instance, it is
            # possible to simply have it not contribute to the
            # attribute-value counts.
            if attr_value == '?':
                # print("Encountered missing value in:")
                # print(instance)
                # print()
                continue
                
            supervised_model[class_label][attr][attr_value] += 1
    
    
    return supervised_model



# Test function
supervised_model = train_supervised(instance_list, class_list)

supervised_model

{'e': {0: defaultdict(int,
              {'b': 404, 'f': 1596, 'k': 228, 's': 32, 'x': 1948}),
  1: defaultdict(int, {'f': 1560, 's': 1144, 'y': 1504}),
  2: defaultdict(int,
              {'b': 48,
               'c': 32,
               'e': 624,
               'g': 1032,
               'n': 1264,
               'p': 56,
               'r': 16,
               'u': 16,
               'w': 720,
               'y': 400}),
  3: defaultdict(int, {'f': 1456, 't': 2752}),
  4: defaultdict(int, {'a': 400, 'l': 400, 'n': 3408}),
  5: defaultdict(int, {'a': 192, 'f': 4016}),
  6: defaultdict(int, {'c': 3008, 'w': 1200}),
  7: defaultdict(int, {'b': 3920, 'n': 288}),
  8: defaultdict(int,
              {'e': 96,
               'g': 248,
               'h': 204,
               'k': 344,
               'n': 936,
               'o': 64,
               'p': 852,
               'u': 444,
               'w': 956,
               'y': 64}),
  9: defaultdict(int, {'e': 1616, 't': 2592}),
  10: defaultdic
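
The same nested count structure could also be built with pandas grouping instead of an explicit row loop. The following is a minimal sketch under stated assumptions, not the notebook's method: `count_attribute_values` and the toy data are hypothetical names introduced for illustration, and `'?'` is assumed to be the missing-value marker, as in `train_supervised`.

```python
from collections import defaultdict
import pandas as pd

def count_attribute_values(instances, labels, missing='?'):
    """Hypothetical helper: build model[class][attr][value] -> count."""
    model = {}
    # Group the rows of the instance dataframe by their class label.
    for class_label, group in instances.groupby(labels.values):
        model[class_label] = {}
        for attr in instances.columns:
            counts = group[attr].value_counts()
            # Drop the missing-value marker so it does not contribute.
            counts = counts[counts.index != missing]
            model[class_label][attr] = defaultdict(int, counts.to_dict())
    return model

# Toy data (made up for illustration only).
toy_instances = pd.DataFrame({0: ['x', 'x', 'b', '?'],
                              1: ['s', 'y', 's', 's']})
toy_labels = pd.Series(['e', 'p', 'e', 'e'])
toy_model = count_attribute_values(toy_instances, toy_labels)
```

On a full dataset this should produce the same counts as the loop-based version, since both skip missing values and count per class and attribute.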

In [50]:

def get_class_freqs(supervised_model):
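    """
    Returns a dictionary containing frequency counts
    of all the class labels in the training data.
    
    The counts are recovered by summing the value counts
    of attribute 0 for each class, so training instances
    whose attribute 0 is missing ('?') are not counted.
    """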
    
    class_freqs = defaultdict(int)
    
    for class_label in supervised_model:
        class_freqs[class_label] = sum(supervised_model[class_label][0].values())
        
    return class_freqs



def predict_supervised(supervised_model, instance_list):
    """
    Predicts the class labels for a set of test
    instances, based on a supervised Naive Bayes
    model.
    """
    
    # The list of predictions for each instance
    prediction_list = []
    
    # Get the class frequencies to avoid redundant calculations
    class_freqs = get_class_freqs(supervised_model)
    
    # Calculate the total number of instances
    total_instances = sum(class_freqs.values())
    
    
    # For each instance in the test set, predict 
    # its class based on the supervised model.
    for instance in instance_list.values:
        
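        # Note: if every class scores 0.0 (for example, when an
        # attribute value such as '?' has no count under any class),
        # no class beats max_prob and the prediction stays "".
        max_prob = 0.0
        predicted_class = ""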
        
        # We obtain the probability of an instance
        # being a certain class, for each possible class.
        for class_label in class_freqs:
            
            # Firstly, get the prior probability of a class
            prob = class_freqs[class_label]/total_instances
            
            
            # Now, multiply the prior probability of the class
            # by the conditional probability (likelihood) of each
            # of the instance's attribute values, given the class.
            for attr in supervised_model[class_label]:
                
                prob *= (supervised_model[class_label][attr][instance[attr]]/class_freqs[class_label])
                
            
            # If the probability is the highest seen
            # thus far, set the predicted class to the
            # class label.
            if prob > max_prob:
                max_prob = prob
                predicted_class = class_label
            
        
        # Add the class label with the highest
        # corresponding probability to the list
        # of predictions.
        prediction_list.append(predicted_class)
        
        
    return prediction_list


# Test the function
prediction_list = predict_supervised(supervised_model, instance_list)

print(pd.Series(prediction_list).value_counts())

e    3471
     2480
p    2173
dtype: int64
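
The unlabelled row in the output above (2480 predictions) is most likely made up of instances for which every class product is exactly zero: `'?'` is skipped during training, so it has a count of zero under every class at prediction time. A minimal sketch of one way to handle this, assuming missing values are skipped at prediction time as well and that scores are summed in log space, is given below. `predict_log_space` and the toy model are hypothetical; attributes are assumed to be keyed by position (0..n-1), as they are when the data is read with `header=None`.

```python
import math

def predict_log_space(model, class_freqs, instances, missing='?'):
    """Hypothetical variant of predict_supervised using log scores."""
    total = sum(class_freqs.values())
    predictions = []
    for instance in instances:
        best_class, best_score = None, -math.inf
        for class_label, class_count in class_freqs.items():
            # Start from the log prior of the class.
            score = math.log(class_count / total)
            for attr, value in enumerate(instance):
                if value == missing:
                    continue  # a missing value contributes nothing
                # .get avoids inserting new keys into a defaultdict.
                count = model[class_label][attr].get(value, 0)
                score += math.log(count / class_count) if count else -math.inf
            # Ties (including all scores being -inf) keep the first class.
            if best_class is None or score > best_score:
                best_class, best_score = class_label, score
        predictions.append(best_class)
    return predictions

# Toy model and instances (made up for illustration only).
toy_model = {'e': {0: {'x': 2}, 1: {'s': 1, 'y': 1}},
             'p': {0: {'x': 1, 'b': 1}, 1: {'s': 2}}}
toy_freqs = {'e': 2, 'p': 2}
toy_predictions = predict_log_space(
    toy_model, toy_freqs, [['x', 'y'], ['b', '?'], ['?', '?']])
```

This sketch does not smooth zero counts, so an unseen value still rules a class out; it only ensures that every instance receives some class label.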


In [51]:

def evaluate_supervised(prediction_list, class_list):
    """
    Evaluates a set of predictions, in a supervised
    context. Uses accuracy as the primary method of
    evaluation.
    """
    
    # Validation checking
    assert(len(prediction_list) == len(class_list))
    
    # Calculate and return the accuracy of the model
    # (the fraction of predictions that match the true labels)
    return np.mean(np.asarray(prediction_list) == np.asarray(class_list))


# Test the function
evaluate_supervised(prediction_list, class_list)


0.6919005416051206

In [67]:

def train_unsupervised(instance_list, class_labels):
    """
    Builds a weak unsupervised Naive Bayes model,
    from a given set of unlabelled training data,
    and a unique set of possible class labels.
    """
    
    # The list containing random class
    # distributions for each instance.
    random_class_distributions = []
    
    
    # For each training instance, create
    # a non-uniform class distribution for
    # that instance: take the absolute values
    # of standard normal draws and normalise
    # them to sum to 1. (Alternatives tried:
    # normalised np.random.rand draws, and
    # np.random.dirichlet with unit parameters.)
    for _ in range(len(instance_list)):
        
        random_distribution = np.abs(np.random.randn(len(class_labels)))
        norm_random_distribution = random_distribution/sum(random_distribution)
        
        random_class_distributions.append(norm_random_distribution)
    
    
    # Make a new dataframe consisting of random 
    # class distributions for each instance.
    class_distributions = pd.DataFrame(random_class_distributions, 
                                       columns=class_labels)   
        
        
    
    # Dictionary contains the (random) frequencies
    # of each class in the training set
    class_freqs = defaultdict(int) 
    
    
    # Count the frequencies
    for class_label in class_labels:
        class_freqs[class_label] = class_distributions[class_label].sum()
    
    
    # Initialise the unsupervised Naive Bayes model, which
    # is just a dictionary of dictionaries of dictionaries
    unsupervised_model = {c : {a : defaultdict(float) for a in instance_list.columns} for c in class_labels}
    
    
    # Concatenate the dataframes
    # class_distributions = pd.concat([instance_list, class_distributions], axis=1)
    

    # For every instance in the training
    # data, update the unsupervised model
    for i in range(len(instance_list)):
        
        for attr in instance_list.columns:
            
            for class_label in class_labels:
                
                unsupervised_model[class_label][attr][instance_list[attr][i]] += class_distributions[class_label][i]
                
    
    return class_distributions, class_freqs, unsupervised_model



# Test the function
class_labels = class_list.unique()
cd, cf, um = train_unsupervised(instance_list, class_labels)
print([cd[class_label].sum() for class_label in cd.columns])
um

[4083.8945281489987, 4040.1054718510027]


{'e': {0: defaultdict(float,
              {'b': 214.90559649428255,
               'c': 1.592119533208989,
               'f': 1576.256779861335,
               'k': 400.7450712680926,
               's': 14.597680420386467,
               'x': 1832.0082242736937}),
  1: defaultdict(float,
              {'f': 1159.933755540562,
               'g': 2.915711942836021,
               's': 1280.8885762910375,
               'y': 1596.367428076564}),
  2: defaultdict(float,
              {'b': 84.73924510318173,
               'c': 20.98299171334446,
               'e': 749.7442576141632,
               'g': 924.2928955179578,
               'n': 1125.4223686244031,
               'p': 74.43258227724898,
               'r': 7.196751827943744,
               'u': 7.722806366249106,
               'w': 510.55918956494764,
               'y': 535.0123832415632}),
  3: defaultdict(float, {'f': 2355.691604957061, 't': 1684.4138668939372}),
  4: defaultdict(float,
              {'a': 188.3110432
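
The random initial class distributions could also be drawn in a single vectorised call rather than one instance at a time. The sketch below assumes a Dirichlet distribution with unit parameters, which is uniform over all valid probability vectors; `random_class_distributions` and its `seed` parameter are hypothetical names introduced here so that runs can be repeated.

```python
import numpy as np
import pandas as pd

def random_class_distributions(n_instances, class_labels, seed=None):
    """Hypothetical helper: one random class distribution per instance."""
    rng = np.random.RandomState(seed)
    # Each row is a draw from Dirichlet(1, ..., 1), so it is
    # non-negative and sums to 1.
    draws = rng.dirichlet(np.ones(len(class_labels)), size=n_instances)
    return pd.DataFrame(draws, columns=list(class_labels))

# Toy usage (the seed value is arbitrary).
toy_distributions = random_class_distributions(5, ['e', 'p'], seed=0)
```

Fixing the seed makes it easier to compare runs, which matters here because the outcome of the unsupervised classifier depends on the initialisation.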

In [68]:
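def get_class_freqs2(class_distributions):
    """
    Returns a dictionary containing the (fractional) frequency
    of each class, obtained by summing each column of the
    per-instance class distributions.
    """
    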
    class_freqs = defaultdict(float)
    
    for class_label in class_distributions.columns:
        class_freqs[class_label] = class_distributions[class_label].sum()
        
    return class_freqs



def predict_unsupervised(class_distributions, unsupervised_model, instance_list, iterations=5):
    """
    Predicts the class distribution for a set of
    instances, based on a trained model.
    """
        
    # Repeatedly re-estimate the class distribution of
    # every instance. Note that unsupervised_model is not
    # re-estimated between iterations; only the class
    # frequencies change.
    for iteration in range(iterations):
        
        # Calculate the frequency of the classes 
        # in the random class distribution
        class_freqs = get_class_freqs2(class_distributions)
        
        print(class_freqs)
        
        
        # Calculate the total number of instances
        total_instances = int(sum(class_freqs.values()) + 0.5)
    
        
        for i in range(len(instance_list)):

            # Get the instance
            instance = instance_list.iloc[i,:]

            # Find a new class distribution for the instance,
            # then normalise that distribution.
            for class_label in class_freqs:

                # Firstly, get the prior probability of a class
                prob = class_freqs[class_label]/total_instances

                # Now, multiply the prior probability of the class
                # by the conditional probability (likelihood) of each
                # of the instance's attribute values, given the class.
                for attr in unsupervised_model[class_label]:
                    prob *= (unsupervised_model[class_label][attr][instance[attr]]/class_freqs[class_label])


                # Update the probability distribution for the instance
                # (.loc avoids chained assignment on the dataframe)
                class_distributions.loc[i, class_label] = prob


        # Now, normalise the class distribution so that the probabilities add to 1
        class_distributions = class_distributions.div(class_distributions.sum(axis=1), axis=0)


    return class_distributions


# Test the function
class_distributions = predict_unsupervised(cd, um, instance_list)
class_distributions.head(25)

defaultdict(<class 'float'>, {'p': 4083.8945281489987, 'e': 4040.1054718510027})
defaultdict(<class 'float'>, {'p': 4080.8808041007037, 'e': 4043.1191958992918})
defaultdict(<class 'float'>, {'p': 4143.573340764418, 'e': 3980.426659235584})
defaultdict(<class 'float'>, {'p': 2875.0842974914813, 'e': 5248.915702508522})
defaultdict(<class 'float'>, {'p': 8123.978877197694, 'e': 0.021122802318584724})


               p    e
0  7.898888e-118  1.0
1  7.623208e-118  1.0
2  8.187062e-118  1.0
3  7.849270e-118  1.0
4  5.888638e-118  1.0
5  7.573978e-118  1.0
6  9.807259e-118  1.0
7  8.637089e-118  1.0
8  8.228755e-118  1.0
9  9.658849e-118  1.0
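
In the output above, every instance ends up assigned to a single class. One possible contributing factor is that the attribute-value counts in `unsupervised_model` stay fixed inside the loop while the class frequencies change. The following is a minimal sketch of an alternative, not the notebook's method: it alternates an M-step (re-estimating soft counts from the current class distributions) with an E-step (recomputing the distributions in log space). `em_naive_bayes`, the add-`alpha` smoothing, and the toy data are assumptions made for illustration, and `'?'` is treated here as an ordinary value.

```python
import numpy as np
import pandas as pd

def em_naive_bayes(instances, class_labels, iterations=5, alpha=1.0, seed=0):
    """Hypothetical EM loop for Naive Bayes on categorical data."""
    rng = np.random.RandomState(seed)
    n = len(instances)
    # Random soft class assignment, one row per instance.
    resp = rng.dirichlet(np.ones(len(class_labels)), size=n)
    # One-hot matrix per attribute: rows are instances, columns are values.
    indicators = {attr: pd.get_dummies(instances[attr]).values.astype(float)
                  for attr in instances.columns}
    for _ in range(iterations):
        # M-step: class priors from the current soft assignment.
        class_totals = resp.sum(axis=0)
        log_post = np.tile(np.log(class_totals / n), (n, 1))
        for onehot in indicators.values():
            # Smoothed soft counts, shape (values, classes).
            soft_counts = onehot.T @ resp + alpha
            cond = soft_counts / soft_counts.sum(axis=0)  # P(value | class)
            # Add the log-likelihood of each instance's observed value.
            log_post += onehot @ np.log(cond)
        # E-step: normalise in log space to avoid underflow.
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return pd.DataFrame(resp, columns=list(class_labels))

# Toy data (made up for illustration only).
toy_instances = pd.DataFrame({0: list('aaaabbbb'), 1: list('xxxxyyyy')})
toy_result = em_naive_bayes(toy_instances, ['e', 'p'])
```

Working in log space also keeps the per-class scores away from values such as 1e-118, which sit close to the limits of floating-point precision once many attributes are multiplied together.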


In [69]:
len(instance_list)

8124

In [70]:

def evaluate_unsupervised(class_distributions, class_list):
    """
    Evaluates a set of predictions, in an
    unsupervised manner.
    """
    
    # Validation checking
    assert(len(class_distributions) == len(class_list))
    
    # The list of class predictions for the instances
    prediction_list = []
        
    # Get the predictions for all instances
    for instance in class_distributions.values:
        prediction_list.append(class_distributions.columns[np.argmax(instance)])
    
    
    # Validation checking again
    assert(len(prediction_list) == len(class_list))
    
    # Construct a confusion matrix to calculate accuracy
    confusion_matrix = {predicted_class : defaultdict(int) for predicted_class in set(prediction_list)}
    
    # Fill in the confusion matrix
    for i in range(len(prediction_list)):
        predicted_class = prediction_list[i]
        actual_class = class_list[i]
        confusion_matrix[predicted_class][actual_class] += 1
    
    
    # Calculate the total number of 'correct' predictions
    correct = 0
    for predicted_class in confusion_matrix:
        correct += max(confusion_matrix[predicted_class].values())
    
    
    print(confusion_matrix)
    
    # Calculate and return the accuracy of the classifier
    return correct/len(prediction_list)


# Test the function
evaluate_unsupervised(class_distributions, class_list)

{'e': defaultdict(<class 'int'>, {'p': 3916, 'e': 4208})}


0.517971442639094
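
As a complement to the hard-assignment accuracy above, the soft measure described in question 6 (how far the probability given to the true class is from 1, averaged over instances) could be computed along the following lines. This is a sketch under assumptions: `mean_true_class_gap` is a hypothetical name, and it takes the column names of the class distributions to correspond to the true class labels, which may not hold for an unsupervised model whose clusters are arbitrarily named.

```python
import numpy as np
import pandas as pd

def mean_true_class_gap(class_distributions, true_labels):
    """Hypothetical helper: mean of (1 - P(true class)) over instances."""
    # Column position of each instance's true class.
    cols = class_distributions.columns.get_indexer(true_labels)
    rows = np.arange(len(true_labels))
    true_probs = class_distributions.values[rows, cols]
    return float(np.mean(1.0 - true_probs))

# Toy distributions (made up for illustration only).
toy_distributions = pd.DataFrame({'p': [0.2, 0.9], 'e': [0.8, 0.1]})
toy_gap = mean_true_class_gap(toy_distributions, ['e', 'p'])
```

A value near 0 means the model is confidently correct on average, and a value near 1 means it is confidently wrong.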

Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly surprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out (hint: check out numpy.random.shuffle()) or cross-validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.
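
For question 3, a hold-out split could be sketched as follows. This is one possible approach rather than a required one: `holdout_split`, `test_fraction`, and `seed` are hypothetical names, and `numpy.random.RandomState.permutation` is used in place of an in-place shuffle so that instances and labels stay aligned.

```python
import numpy as np
import pandas as pd

def holdout_split(instances, labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: random train/test partition of the data."""
    rng = np.random.RandomState(seed)
    # A random ordering of the row positions.
    order = rng.permutation(len(instances))
    n_test = int(round(test_fraction * len(instances)))
    test_idx, train_idx = order[:n_test], order[n_test:]

    def take(frame, idx):
        # Reset the index so rows are numbered 0..m-1 again.
        return frame.iloc[idx].reset_index(drop=True)

    return (take(instances, train_idx), take(labels, train_idx),
            take(instances, test_idx), take(labels, test_idx))

# Toy data (made up for illustration only).
toy_instances = pd.DataFrame({0: list("abcdefghij"), 1: list("klmnopqrst")})
toy_labels = pd.Series(list("eeeeeppppp"))
train_X, train_y, test_X, test_y = holdout_split(toy_instances, toy_labels)
```

The training portion would then be passed to `train_supervised`, and the held-out portion to `predict_supervised` and `evaluate_supervised`, so that accuracy is measured on instances the model has not seen.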