# The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2018 Semester 1
-----
## Project 1: What is labelled data worth to Naive Bayes?
-----
###### Student Name(s): Emmanuel Macario
###### Python version: 3.6.0

This iPython notebook is a template which you may use for your Project 1 submission. (You are not required to use it; in particular, there is no need to use iPython if you do not like it.)

Marking will be applied on the seven functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [None]:
"""
Results:

Dataset       Supervised   Unsupervised
---------------------------------------
breast            0.7552
car               0.8738
hypothyroid       0.9522
mushroom          0.9972
   

"""


In [78]:
# Import useful libraries
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [79]:
# Load the iris data as a quick sanity check for the supervised model
from sklearn import datasets

iris = datasets.load_iris()
instance_list = pd.DataFrame(iris['data'])

class_list = iris.target

In [94]:
# Filename constants used for easy file access
CSV1 = 'breast-cancer-dos.csv'
CSV2 = 'car-dos.csv'
CSV3 = 'hypothyroid-dos.csv'
CSV4 = 'mushroom-dos.csv'


def preprocess(filename):
    """
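    Opens a csv data file and splits it into a dataframe
    of instances (all columns but the last) and a series
    of class labels (the last column).
    """
    
    # Read csv data into a dataframe. The file is read with
    # header=None, so pandas labels each attribute column
    # with an integer (0, 1, 2, ...).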
    df = pd.read_csv(filename, header=None)
    
    
    # Partition the data into instances and
    # instance class labels.
    instance_list = df.iloc[:,:-1]
    class_list = df.iloc[:,-1]
    data_set = instance_list, class_list
    
    # Note: the instance list is a dataframe, whereas the class
    # list is a series object
    
    # Give a description of the class distributions
    print("CLASS DISTRIBUTION")
    print(pd.Series(class_list).value_counts())
    
    print()
    print("ATTRIBUTE DISTRIBUTIONS")
    print(instance_list.describe())
    
    
    return data_set


# Test preprocess function
instance_list, class_list = preprocess(CSV3)
instance_list.head(5)

CLASS DISTRIBUTION
negative       3012
hypothyroid     151
Name: 18, dtype: int64

ATTRIBUTE DISTRIBUTIONS
          0     1     2     3     4     5     6     7     8     9     10  \
count   3163  3163  3163  3163  3163  3163  3163  3163  3163  3163  3163   
unique     3     2     2     2     2     2     2     2     2     2     2   
top        F     f     f     f     f     f     f     f     f     f     f   
freq    2182  2702  3108  3121  3059  2922  2920  3100  3064  3123  3161   

          11    12    13    14    15    16    17  
count   3163  3163  3163  3163  3163  3163  3163  
unique     2     2     2     2     2     2     2  
top        f     y     y     y     y     y     n  
freq    3064  2695  2468  2914  2915  2916  2903  


    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
0   M   f   f   f   f   f   f   f   f   f   f   f   y   y   y   y   y   n
1   F   t   f   f   f   f   f   f   f   f   f   f   y   y   y   y   y   n
2   M   f   f   f   f   f   f   f   f   f   f   f   y   y   y   y   y   n
3   F   f   f   f   f   f   f   f   f   f   f   f   y   y   y   y   y   n
4   M   f   f   f   f   f   f   f   f   f   f   f   y   y   y   y   y   n


In [95]:

def train_supervised(instance_list, class_list):
    """
    Builds a supervised Naive Bayes model,
    given a preprocessed set of training
    data, by calculating counts from the
    training data.
    
    Inputs: a training set of data, consisting
    of a list of instances, and a list of class 
    labels for those instances.
    
    Outputs: the supervised model, a nested dictionary
    of the form model[class_label][attribute][value] -> count.
    Class frequencies are derived from it separately
    by get_class_freqs.
    """
    
    # The supervised Naive Bayes model
    supervised_model = {}
    
    # For every instance in the training
    # data, update the model
    for i in range(len(instance_list)):
        
        # Get the instance
        instance = instance_list.iloc[i,:]
        
        # Get the associated class label
        class_label = class_list[i]
        
        
        # If the class label is not already in
        # the model, create a nested set of
        # dictionaries for the class.
        if class_label not in supervised_model:
            
            # New dictionary for every new class
            supervised_model[class_label] = {}
            
            # For each attribute, initialise
            # a new default dictionary.
            for attr in instance_list.columns:
                supervised_model[class_label][attr] = defaultdict(int)
                
        
        # For every attribute in the instance,
        # get its corresponding value and update
        # the model.
        for attr in instance_list.columns:
            
            attr_value = instance[attr]
            
            # For missing values in a training instance, it is
            # possible to simply have them not contribute to the
            # attribute-value counts.
            if attr_value == '?':
                continue
                
            supervised_model[class_label][attr][attr_value] += 1
    
    
    return supervised_model



# Test function
supervised_model = train_supervised(instance_list, class_list)

supervised_model

{'hypothyroid': {0: defaultdict(int, {'F': 111, 'M': 38}),
  1: defaultdict(int, {'f': 137, 't': 14}),
  2: defaultdict(int, {'f': 151}),
  3: defaultdict(int, {'f': 150, 't': 1}),
  4: defaultdict(int, {'f': 141, 't': 10}),
  5: defaultdict(int, {'f': 131, 't': 20}),
  6: defaultdict(int, {'f': 144, 't': 7}),
  7: defaultdict(int, {'f': 150, 't': 1}),
  8: defaultdict(int, {'f': 149, 't': 2}),
  9: defaultdict(int, {'f': 151}),
  10: defaultdict(int, {'f': 151}),
  11: defaultdict(int, {'f': 145, 't': 6}),
  12: defaultdict(int, {'n': 1, 'y': 150}),
  13: defaultdict(int, {'n': 14, 'y': 137}),
  14: defaultdict(int, {'y': 151}),
  15: defaultdict(int, {'y': 151}),
  16: defaultdict(int, {'y': 151}),
  17: defaultdict(int, {'n': 148, 'y': 3})},
 'negative': {0: defaultdict(int, {'F': 2071, 'M': 870}),
  1: defaultdict(int, {'f': 2565, 't': 447}),
  2: defaultdict(int, {'f': 2957, 't': 55}),
  3: defaultdict(int, {'f': 2971, 't': 41}),
  4: defaultdict(int, {'f': 2918, 't': 94}),
  5: d

In [96]:

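def get_class_freqs(supervised_model):
    """
    Returns a dictionary containing frequency counts
    of all the class labels in the training data.
    
    Missing values ('?') are not counted during training, so
    the counts of a single attribute can undercount the class.
    The largest per-attribute total is used instead, which is
    exact whenever at least one attribute has no missing
    values for the class.
    """
    
    class_freqs = defaultdict(int)
    
    for class_label in supervised_model:
        class_freqs[class_label] = max(
            sum(attr_counts.values())
            for attr_counts in supervised_model[class_label].values())
        
    return class_freqs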



def predict_supervised(supervised_model, instance_list):
    """
    Predicts the class labels for a set of test
    instances, based on a supervised Naive Bayes
    model. Implements epsilon probabilistic smoothing.
    """
    
    # Epsilon value for epsilon-smoothing
    EPSILON = np.finfo(float).eps
    
    
    # The list of predictions for each instance
    prediction_list = []
    
    
    # Get the class frequencies to avoid redundant calculations
    class_freqs = get_class_freqs(supervised_model)
    
    
    # Calculate the total number of instances,
    total_instances = sum(class_freqs.values())
    
    
    # For each instance in the test set, predict 
    # its class based on the supervised model.
    for instance in instance_list.values:
        
        max_prob = 0.0
        predicted_class = ""
        
        # We obtain the probability of an instance
        # being a certain class, for each possible class.
        for class_label in class_freqs:
            
            # Firstly, get the prior probability of a class
            prob = class_freqs[class_label]/total_instances
            
            
            # Now, multiply the prior probability of the class
            # by the conditional probability (likelihood) of each
            # of the instance's attribute values, given the class.
            for attr in supervised_model[class_label]:
                
                # Get the attribute value
                attr_value = instance[attr]
                
                # If an attribute value is missing for the test
                # instance, simply have it not count towards the
                # probability estimate
                if attr_value == '?':
                    continue
                
                # Look up the count of this value for the class. If the
                # value was never seen with the class, use epsilon in
                # place of the zero count.
                post_attr = supervised_model[class_label][attr].get(attr_value, EPSILON)
                
                # Update the probability estimate
                prob *= (post_attr/class_freqs[class_label])
                
            
            # If the probability is the highest seen
            # thus far, set the predicted class to the
            # class label.
            if prob > max_prob:
                max_prob = prob
                predicted_class = class_label
            
        
        # Add the class label with the highest
        # corresponding probability to the list
        # of predictions.
        prediction_list.append(predicted_class)
        
        
    return prediction_list


# Test the function
prediction_list = predict_supervised(supervised_model, instance_list)

print(pd.Series(prediction_list).value_counts())

negative    3163
dtype: int64
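
Multiplying many small probabilities can underflow to 0.0, in which case no class beats `max_prob = 0.0` and the prediction stays an empty string. One common workaround is to sum log-probabilities instead. The sketch below is illustrative and not part of the original submission: `predict_log_space` is a hypothetical name, it takes the class frequencies as an explicit argument, and it assumes the same nested-dictionary layout that `train_supervised` produces.

```python
import math

import numpy as np


def predict_log_space(model, class_freqs, instances, epsilon=np.finfo(float).eps):
    """Hypothetical log-space variant of predict_supervised (a sketch)."""
    total = sum(class_freqs.values())
    predictions = []

    for instance in instances:
        best_class, best_score = None, -math.inf

        for class_label, freq in class_freqs.items():
            # Log prior of the class
            score = math.log(freq / total)

            for attr, counts in model[class_label].items():
                value = instance[attr]
                if value == '?':
                    continue  # skip missing values, as in the original
                # Unseen values get epsilon in place of a zero count
                score += math.log(counts.get(value, epsilon) / freq)

            if score > best_score:
                best_class, best_score = class_label, score

        predictions.append(best_class)

    return predictions


# Tiny hand-made model: one attribute (index 0), two classes
toy_model = {'a': {0: {'x': 3, 'y': 1}}, 'b': {0: {'x': 1, 'y': 3}}}
toy_freqs = {'a': 4, 'b': 4}
toy_predictions = predict_log_space(toy_model, toy_freqs, [['x'], ['y'], ['?']])
```

Because log scores are always finite here, the first class scored always replaces the initial `-inf`, so every instance receives a real class label.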


In [97]:

def evaluate_supervised(prediction_list, class_list):
    """
    Evaluates a set of predictions, in a supervised
    context. Uses accuracy as the primary method of
    evaluation.
    """
    
    # Validation checking
    assert len(prediction_list) == len(class_list)
    
    # Calculate and return the accuracy of the model. Comparing
    # arrays avoids a KeyError when no prediction is correct.
    return np.mean(np.asarray(prediction_list) == np.asarray(class_list))


# Test the function
evaluate_supervised(prediction_list, class_list)


0.9522605121719886

In [99]:

def train_unsupervised(instance_list, class_labels):
    """
    Builds a weak unsupervised Naive Bayes model,
    from a given set of unlabelled training data,
    and a unique set of possible class labels.
    """
    
    # The list containing random class
    # distributions for each instance.
    random_class_distributions = []
    
    
    # For each training instance, create
    # a non-uniform class distribution for
    # that instance.
    for instance in instance_list.values:
        
        # Alternative initialisations (left disabled):
        #   - uniform draws: np.random.rand(1, len(class_labels))[0],
        #     normalised to sum to 1
        #   - a Dirichlet draw:
        #     np.random.dirichlet(np.ones(len(class_labels)), size=1)[0]
        # The version below takes absolute values of standard normal
        # draws and normalises them to sum to 1.
        random_distribution = np.abs(np.random.randn(1, len(class_labels))[0])
        norm_random_distribution = random_distribution/sum(random_distribution)
        
        random_class_distributions.append(norm_random_distribution)
    
    
    # Make a new dataframe consisting of random 
    # class distributions for each instance.
    class_distributions = pd.DataFrame(random_class_distributions, 
                                       columns=class_labels)   
        
        
    
    # Dictionary contains the (random) frequencies
    # of each class in the training set
    class_freqs = defaultdict(int) 
    
    
    # Count the frequencies
    for class_label in class_labels:
        class_freqs[class_label] = class_distributions[class_label].sum()
    
    
    # Initialise the unsupervised Naive Bayes model, which
    # is just a dictionary of dictionaries of dictionaries
    unsupervised_model = {c : {a : defaultdict(float) for a in instance_list.columns} for c in class_labels}
    
    
    # Concatenate the dataframes
    # class_distributions = pd.concat([instance_list, class_distributions], axis=1)
    

    # For every instance in the training
    # data, update the unsupervised model
    for i in range(len(instance_list)):
        
        for class_label in class_labels:
            
            for attr in instance_list.columns:
            
                attr_value = instance_list[attr][i]
                
                # Skip missing values
                if attr_value == '?':
                    continue
                
                unsupervised_model[class_label][attr][attr_value] += class_distributions[class_label][i]
                
    
    return class_distributions, class_freqs, unsupervised_model



# Test the function
class_labels = class_list.unique()
cd, cf, um = train_unsupervised(instance_list, class_labels)
print([cd[class_label].sum() for class_label in cd.columns])
cd

[1574.4065652745892, 1588.59343472541]


   hypothyroid  negative
0     0.139154  0.860846
1     0.459950  0.540050
2     0.411403  0.588597
3     0.476302  0.523698
4     0.667230  0.332770
5     0.749564  0.250436
6     0.009613  0.990387
7     0.722046  0.277954
8     0.067793  0.932207
9     0.043748  0.956252
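
The random initialisation can also be drawn for all instances at once. The sketch below uses the Dirichlet alternative that is left disabled in `train_unsupervised`; `random_class_distributions` and `seed` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd


def random_class_distributions(n_instances, class_labels, seed=None):
    """Hypothetical helper: one Dirichlet(1, ..., 1) draw per instance."""
    rng = np.random.default_rng(seed)
    # Each row is a random probability vector over the classes
    draws = rng.dirichlet(np.ones(len(class_labels)), size=n_instances)
    return pd.DataFrame(draws, columns=list(class_labels))


toy_cd = random_class_distributions(5, ['hypothyroid', 'negative'], seed=0)
```

Each row of the result sums to 1, so no separate normalisation step is needed.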


In [102]:
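def get_class_freqs2(class_distributions):
    """
    Returns a dictionary of fractional class frequencies,
    obtained by summing each class column of the given
    per-instance class distributions.
    """
    
    class_freqs = defaultdict(float)
    
    for class_label in class_distributions.columns:
        class_freqs[class_label] = class_distributions[class_label].sum()
        
    return class_freqs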



def predict_unsupervised(class_distributions, unsupervised_model, instance_list, iterations=5):
    """
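    Predicts the class distribution for a set of
    instances, based on a trained model, by re-scoring
    every instance for a number of iterations.
    
    Note: the attribute-value counts in unsupervised_model
    are not re-estimated between iterations; only the class
    frequencies are recomputed.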
    """
        
    # For each instance in the test set, iteratively
    # update the class distribution for the instance
    for iteration in range(iterations):
        
        # Calculate the frequency of the classes 
        # in the random class distribution
        class_freqs = get_class_freqs2(class_distributions)
         
        
        # Calculate the total number of instances
        total_instances = int(sum(class_freqs.values()) + 0.5)
    
        
        for i in range(len(instance_list)):

            # Get the instance
            instance = instance_list.iloc[i,:]

            # Find a new class distribution for the instance,
            # then normalise that distribution.
            for class_label in class_freqs:

                # Firstly, get the prior probability of a class
                prob = class_freqs[class_label]/total_instances

                # Now, multiply the prior probability of the class
                # by the conditional probability (likelihood) of each
                # of the instance's attribute values, given the class.
                for attr in unsupervised_model[class_label]:
                    
                    attr_value = instance[attr]
                    
                    # Skip missing values
                    if attr_value == '?':
                        continue
                    
                    prob *= (unsupervised_model[class_label][attr][attr_value]/class_freqs[class_label])


                # Update the probability distribution for the instance.
                # Using .loc avoids chained assignment, which may write
                # to a temporary copy instead of the dataframe.
                class_distributions.loc[i, class_label] = prob


        # Now, normalise the class distribution so that the probabilities add to 1
        class_distributions = class_distributions.div(class_distributions.sum(axis=1), axis=0)
        
        print(class_distributions)
        
        
        class_freqs = get_class_freqs2(class_distributions)
        print(class_freqs)
        
        
    return class_distributions


# Test the function
class_distributions = predict_unsupervised(cd, um, instance_list)
class_distributions

      hypothyroid  negative
0        0.488291  0.511709
1        0.510356  0.489644
2        0.488291  0.511709
3        0.487142  0.512858
4        0.488291  0.511709
5        0.500330  0.499670
6        0.487930  0.512070
7        0.487142  0.512858
8        0.493583  0.506417
9        0.480403  0.519597
10       0.487142  0.512858
11       0.487142  0.512858
12       0.487142  0.512858
13       0.494334  0.505666
14       0.487142  0.512858
15       0.488291  0.511709
16       0.504399  0.495601
17       0.487142  0.512858
18       0.487142  0.512858
19       0.487142  0.512858
20       0.494733  0.505267
21       0.487142  0.512858
22       0.488291  0.511709
23       0.487142  0.512858
24       0.470984  0.529016
25       0.487142  0.512858
26       0.489079  0.510921
27       0.487142  0.512858
28       0.487142  0.512858
29       0.572681  0.427319
...           ...       ...
3133     0.487142  0.512858
3134     0.488291  0.511709
3135     0.510569  0.489431
3136     0.401386  0



      hypothyroid  negative
0             0.0       NaN
1             0.0       NaN
2             0.0       NaN
3             0.0       NaN
4             0.0       NaN
5             0.0       NaN
6             0.0       NaN
7             0.0       NaN
8             0.0       NaN
9             0.0       NaN
10            0.0       NaN
11            0.0       NaN
12            0.0       NaN
13            0.0       NaN
14            0.0       NaN
15            0.0       NaN
16            0.0       NaN
17            0.0       NaN
18            0.0       NaN
19            0.0       NaN
20            0.0       NaN
21            0.0       NaN
22            0.0       NaN
23            0.0       NaN
24            0.0       NaN
25            0.0       NaN
26            0.0       NaN
27            0.0       NaN
28            0.0       NaN
29            0.0       NaN
...           ...       ...
3133          0.0       NaN
3134          0.0       NaN
3135          0.0       NaN
3136          0.0   

   hypothyroid  negative
0          0.0       NaN
1          0.0       NaN
2          0.0       NaN
3          0.0       NaN
4          0.0       NaN
5          0.0       NaN
6          0.0       NaN
7          0.0       NaN
8          0.0       NaN
9          0.0       NaN
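
The output above degenerates to 0.0 and NaN after a few iterations. One possible contributing factor is that `predict_unsupervised` recomputes the class frequencies on every iteration but never re-estimates the attribute-value counts. The sketch below alternates both steps and works in log space. It is an assumption about the intended algorithm, not the original method: `em_naive_bayes`, the pseudo-count `alpha`, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def em_naive_bayes(instances, distributions, iterations=5, alpha=1e-9):
    """Sketch of soft EM for categorical Naive Bayes (hypothetical helper).

    instances: DataFrame of categorical attributes ('?' marks missing).
    distributions: DataFrame of per-instance class probabilities.
    alpha: small pseudo-count (an assumption) that keeps log() finite.
    """
    resp = distributions.to_numpy(dtype=float)
    # One-hot encode each attribute; dropping '?' makes missing values
    # contribute nothing to the counts or to the scores.
    one_hots = [pd.get_dummies(instances[col])
                  .drop(columns='?', errors='ignore')
                  .to_numpy(dtype=float)
                for col in instances.columns]

    for _ in range(iterations):
        # M-step: re-estimate priors and conditional probabilities
        class_weights = resp.sum(axis=0)
        log_prior = np.log(class_weights / class_weights.sum())
        log_post = np.repeat(log_prior[np.newaxis, :], len(instances), axis=0)
        for one_hot in one_hots:
            counts = one_hot.T @ resp + alpha      # shape (values, classes)
            cond = counts / counts.sum(axis=0)     # P(value | class)
            log_post += one_hot @ np.log(cond)

        # E-step: normalise in log space to avoid underflow
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return pd.DataFrame(resp, columns=distributions.columns)


toy_instances = pd.DataFrame({0: ['a', 'a', 'b', 'b'], 1: ['x', 'x', 'y', '?']})
rng = np.random.default_rng(0)
toy_init = pd.DataFrame(rng.dirichlet(np.ones(2), size=4), columns=['c1', 'c2'])
toy_result = em_naive_bayes(toy_instances, toy_init)
```

With the notebook's variables, a call such as `em_naive_bayes(instance_list, cd)` would return a dataframe in the same layout as `class_distributions`, which could then be passed to `evaluate_unsupervised`.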


In [103]:

def evaluate_unsupervised(class_distributions, class_list):
    """
    Evaluates a set of predictions, in an
    unsupervised manner.
    """
    
    # Validation checking
    assert(len(class_distributions) == len(class_list))
    
    # The list of class predictions for the instances
    prediction_list = []
        
    # Get the predictions for all instances
    for instance in class_distributions.values:
        prediction_list.append(class_distributions.columns[np.argmax(instance)])
    
    
    # Validation checking again
    assert(len(prediction_list) == len(class_list))
    
    # Construct a confusion matrix to calculate accuracy
    confusion_matrix = {predicted_class : defaultdict(int) for predicted_class in set(prediction_list)}
    
    # Fill in the confusion matrix
    for i in range(len(prediction_list)):
        predicted_class = prediction_list[i]
        actual_class = class_list[i]
        confusion_matrix[predicted_class][actual_class] += 1
    
    
    # Calculate the total number of 'correct' predictions. We
    # take the max value for each column in the confusion matrix
    correct = 0
    for predicted_class in confusion_matrix:
        correct += max(confusion_matrix[predicted_class].values())
    
    
    print(confusion_matrix)
    
    # Calculate and return the accuracy of the classifier
    return correct/len(prediction_list)


# Test the function
evaluate_unsupervised(class_distributions, class_list)

{'negative': defaultdict(<class 'int'>, {'hypothyroid': 151, 'negative': 3012})}


0.9522605121719886

Questions (you may respond in a cell or cells below):

1. Since we’re starting off with random guesses, it might be surprising that the unsupervised NB works at all. Explain what characteristics of the data cause it to work pretty well (say, within 10% Accuracy of the supervised NB) most of the time; also, explain why it utterly fails sometimes.
2. When evaluating supervised NB across the four different datasets, you will observe some variation in effectiveness (e.g. Accuracy). Explain what causes this variation. Describe and explain any particularly surprising results.
3. Evaluating the model on the same data that we use to train the model is considered to be a major mistake in Machine Learning. Implement a hold-out (hint: check out numpy.random.shuffle()) or cross-validation evaluation strategy. How does your estimate of Accuracy change, compared to testing on the training data? Explain why. (The result might surprise you!)
4. Implement one of the advanced smoothing regimes (add-k, Good-Turing). Do you notice any variation in the predictions made by either the supervised or unsupervised NB classifiers? Explain why, or why not.
5. The lecture suggests that deterministically labelling the instances in the initialisation phase of the unsupervised NB classifier “doesn’t work very well”. Confirm this for yourself, and then demonstrate why.
6. Rather than evaluating the unsupervised NB classifier by assigning a class deterministically, instead calculate how far away the probabilistic estimate of the true class is from 1 (where we would be certain of the correct class), and take the average over the instances. Does this performance estimate change, as we alter the number of iterations in the method? Explain why.
7. Explore what causes the unsupervised NB classifier to converge: what proportion of instances change their prediction from the random assignment, to the first iteration? From the first to the second? What is the latest iteration where you observe a prediction change? Make some conjecture(s) as to what is occurring here.
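
For question 3, a hold-out split could be sketched as follows. This is a minimal illustration under stated assumptions, not a required design: `holdout_split`, `test_fraction`, and `seed` are hypothetical names, and the sketch shuffles row positions with a NumPy random generator's `permutation` rather than shuffling the data in place.

```python
import numpy as np
import pandas as pd


def holdout_split(instance_list, class_list, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle row positions, then split train/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(instance_list))
    n_test = int(len(instance_list) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    def take(idx):
        # reset_index keeps label-based lookups such as class_list[i] valid
        return (instance_list.iloc[idx].reset_index(drop=True),
                class_list.iloc[idx].reset_index(drop=True))

    return take(train_idx), take(test_idx)


toy_X = pd.DataFrame({0: list('aabbaabbab'), 1: list('xyxyxyxyxy')})
toy_y = pd.Series(['p', 'p', 'n', 'n', 'p', 'p', 'n', 'n', 'p', 'n'])
(train_X, train_y), (test_X, test_y) = holdout_split(toy_X, toy_y)
```

The training part would then go to `train_supervised`, and the test part to `predict_supervised` and `evaluate_supervised`.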

Don't forget that groups of 1 student should respond to question (1), and one other question. Groups of 2 students should respond to question (1), and three other questions. Your responses should be about 100-200 words each.
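
For question 4, add-k smoothing replaces the estimate count / class_total with (count + k) / (class_total + k * V), where V is the number of distinct values the attribute can take. The sketch below only illustrates the formula; `add_k_probability` and its parameter names are hypothetical, and the numbers are taken from the hypothyroid class size shown earlier.

```python
def add_k_probability(count, class_total, n_values, k=1.0):
    """Add-k estimate of P(value | class); names are illustrative."""
    return (count + k) / (class_total + k * n_values)


# A binary attribute in a class of 151 instances where one value
# was seen 151 times and the other was never seen
p_unseen = add_k_probability(0, 151, 2)
p_seen = add_k_probability(151, 151, 2)
```

Unlike epsilon smoothing, the smoothed estimates for all values of an attribute still sum to 1.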

In [None]:
"""
QUESTION 1



"""