## This assignment may be worked individually or in pairs. 
## Enter your name/names here:
    

In [26]:
#BULBASAUR

# Assignment 1: Intro to Classification

In this assignment we'll be looking at 3 common classification algorithms -- Decision Trees, k Nearest Neighbor and Naive Bayes classifier. For this task we'll be using the Diabetic Retinopathy data set, which contains features from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not. This dataset has `1151` instances and `20` attributes (some categorical, some continuous). You can find additional details about the dataset [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set).

## Part 1: Decision Trees

For this task you'll be implementing the decision tree classifier. A few function prototypes are already given to you, please don't change those. You can add additional helper functions for your convenience. *Suggestion:* The dataset is substantially big, for the purpose of easy debugging work with a subset of the data and test your decision tree implementation on that.

Attribute Information:

0) The binary result of quality assessment. 0 = bad quality 1 = sufficient quality.

1) The binary result of pre-screening, where 1 indicates severe retinal abnormality and 0 its lack. 

2-7) The results of MA detection. Each feature value stand for the number of MAs found at the confidence levels alpha = 0.5, . . . , 1, respectively. 

8-15) contain the same information as 2-7) for exudates. However, as exudates are represented by a set of points rather than the number of pixels constructing the lesions, these features are normalized by dividing the 
number of lesions with the diameter of the ROI to compensate different image sizes. 

16) The euclidean distance of the center of the macula and the center of the optic disc to provide important information regarding the patient's condition. This feature is also normalized with the diameter of the ROI.

17) The diameter of the optic disc. 

18) The binary result of the AM/FM-based classification.

19) Class label. 1 = contains signs of Diabetic Retinopathy (Accumulative label for the Messidor classes 1, 2, 3), 0 = no signs of Diabetic Retinopathy.

In [27]:
# Standard Headers
# You are welcome to add additional headers if you wish
# EXCEPT for scikit-learn... You may NOT use scikit-learn for this assignment!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import log, pow, sqrt
from random import shuffle

In [28]:
class DataPoint:
    def __str__(self):
        return "< " + str(self.label) + ": " + str(self.features) + " >"
    def __init__(self, label, features):
        self.label = label # the classification label of this data point
        self.features = features

Q1. Read data from a CSV file. Put it into a list of `DataPoints`.

In [29]:
def get_data(filename):
    data = []
    global pd
    #store the data in temp list
    temp = pd.read_csv(filename,header=None)
    #loop through the temp data, store them in a list of DataPoint objects
    for i in range(len(temp)):
        newDataPoint = DataPoint(int(temp.iloc[i][len(temp.columns)-1]),list(temp.iloc[i][:-1]))
        data.append(newDataPoint)
    return data

In [30]:
class TreeNode:
    is_leaf = True          # boolean variable to check if the node is a leaf
    feature_idx = None      # index that identifies the feature
    thresh_val = None       # threshold value that splits the node
    prediction = None       # prediction class (only valid for leaf nodes)
    left_child = None       # left TreeNode (all values < thresh_val)
    right_child = None      # right TreeNode (all values >= thresh_val)
    
    def printTree(self):    # for debugging purposes
        if self.is_leaf:
            print ('Leaf Node:      predicts ' + str(self.prediction))
        else:
            print ('Internal Node:  splits on feature ' 
                   + str(self.feature_idx) + ' with threshold ' + str(self.thresh_val))
            self.left_child.printTree()
            self.right_child.printTree()

Q2. Implement the function `make_prediction` that takes the decision tree root and a `DataPoint` instance and returns the prediction label.

In [31]:
def make_prediction(tree_root, data_point):
    #if tree_root is a leaf, return class directly
    if tree_root.is_leaf:
        return tree_root.prediction
    #if tree_root is not a leaf, check threshhold to decide go left or right
    else:
        if data_point.features[tree_root.feature_idx] <= tree_root.thresh_val:
            return make_prediction(tree_root.left_child, data_point)
        else:
            return make_prediction(tree_root.right_child, data_point)

Q3. Implement the function `split_dataset` given an input data set, a `feature_idx` and the `threshold` for the feature. `left_split` will have all values < `threshold` and `right_split` will have all values >= `threshold`.

In [32]:
def split_dataset(data, feature_idx, threshold):
    left_split = []
    right_split = []
    #loop through the dataset, seperate them to 2 lists
    for i in range(len(data)):
        if data[i].features[feature_idx] < threshold:
            left_split.append(data[i])
        else:
            right_split.append(data[i])
    return (left_split, right_split)

Q4. Implement the function `calc_entropy` to return the entropy of the input dataset.

In [33]:
def calc_entropy(data):
    global log
    entropy = 0.0
    zero_count = 0;
    total = len(data)
    #count how many records have class value 0
    for i in range(total):
        if data[i].label == 0:
            zero_count += 1;
    #calculate how many record have clss value 1
    one_count = total - zero_count
    #possibility calculation
    p_zero = zero_count/total
    p_one = one_count/total
    #change values from 0 to 1 for log calcualtion
    if p_zero == 0:
        p_zero = 1
    if p_one == 0:
        p_one = 1
    #calculate final entropy result
    entropy = - p_one*log(p_one,2) - p_zero*log(p_zero,2)
    return entropy

Q5. Implement the function `calc_best_threshold` which returns the best information gain and the corresponding threshold value for one feature at `feature_idx`.

In [34]:
def calc_best_threshold(data, feature_idx):
    best_info_gain = 0.0
    best_thresh = None
    gain = 0.0
    #sort the data records to asending order, for threshhold finding convinience
    newData = sorted(data, key=lambda x: x.features[feature_idx], reverse=False)
    p_entropy = calc_entropy(newData)
    for i in range(len(newData)-1):
        #when finding a class split point, calculate the gain value
        if newData[i].label != newData[i+1].label:
            gain = p_entropy - len(newData[:i+1])*calc_entropy(newData[:i+1])/len(newData)-len(newData[i+1:])*calc_entropy(newData[i+1:])/len(newData)  
            #every time better gain value found, update best_info_gain and best_thresh
            if gain >= best_info_gain:
                best_info_gain = gain
                best_thresh = newData[i].features[feature_idx]
    return (best_info_gain, best_thresh)

Q6. Implement the function `identify_best_split` which returns the best feature to split on for an input dataset, and also returns the corresponding threshold value.

In [35]:
def identify_best_split(data):
    if len(data) < 2:
        return (None, None)
    best_feature = None
    best_thresh = None
    best_gain = 0
    for i in range(len(data[0].features)):
        gain,thresh = calc_best_threshold(data,i)
        if gain > best_gain:
            best_gain = gain
            best_feature = i
            best_thresh = thresh
    return (best_feature, best_thresh)

Q7. Implement the function `createLeafNode` which returns a `TreeNode` with `is_leaf=True` and `prediction` set to whichever classification occurs most in the dataset at this node.

In [36]:
def createLeafNode(data):
    zero_count = 0
    predict = 1
    for i in range(len(data)):
        if data[i].label == 0:
            zero_count += 1;
    if zero_count > len(data)/2:
        predict = 0        
    NewTreeNode = TreeNode()
    NewTreeNode.prediction = predict
    return NewTreeNode

Q8. Implement the `createDecisionTree` function. `max_levels` denotes the maximum height of the tree (for example if `max_levels = 1` then the decision tree will only contain the leaf node at the root. [Hint: this is where the recursion happens.]

In [37]:
def createDecisionTree(data, max_levels):
    if max_levels == 1:
        return createLeafNode(data) 
    else:
        #find the best feature to split
        best_feature, best_thresh = identify_best_split(data)
        if best_feature == None:
            return createLeafNode(data)
        NewTreeNode = TreeNode()
        #when split feature found, split the data set to 2 parts according to threshhold value
        leftSet, rightSet = split_dataset(data,best_feature,best_thresh)
        NewTreeNode.is_leaf = False
        #set feature index, threshhold and left, right tree node
        NewTreeNode.feature_idx = best_feature
        NewTreeNode.thresh_val = best_thresh
        NewTreeNode.left_child = createDecisionTree(leftSet, max_levels-1)
        NewTreeNode.right_child = createDecisionTree(rightSet, max_levels-1)
    return NewTreeNode

Q9. Given a test set, the function `calcAccuracy` returns the accuracy of the classifier. You'll use the `makePrediction` function for this.

In [38]:
#use global variables to store values needed for confusion matrix and precision, recall calculation
#actual positive and predicted positive
positive_positive = 0
#actual positive and predicted negative
positive_negative = 0
#actual negative and predicted postive
negative_positive = 0
#actual negative and predicted negative
negative_negative = 0
def calcAccuracy(tree_root, data):
    global positive_positive, positive_negative, negative_positive, negative_negative
    positive_positive = positive_negative = negative_positive = negative_negative = 0
    correct_count = 0
    #loop through data for accuracy calculation and set values to global variables 
    for i in range(len(data)):
        if make_prediction(tree_root, data[i]) == data[i].label == 1:
            correct_count += 1
            positive_positive += 1
        elif  make_prediction(tree_root, data[i]) == data[i].label == 0:
            correct_count += 1
            negative_negative += 1
        elif make_prediction(tree_root, data[i]) != data[i].label and data[i].label ==1:
            positive_negative += 1
        else:
            negative_positive += 1
    return correct_count/len(data)

Q10. Keeping the `max_levels` parameter as 10, use 5-fold cross validation to measure the accuracy of the model. Print the recall and precision of the model. Also display the confusion matrix.

In [39]:
# edit the code here - this is just a sample to get you started
import time

d = get_data("messidor_features.txt")
test_size = len(d)//5
accuracy_record = []
for i in range(5):
    # partition data into train_set and test_set
    train_set = d[:i*test_size]+d[(i+1)*test_size:]
    test_set = d[i*test_size:(i+1)*test_size]

    print ('Training set size:', len(train_set))
    print ('Test set size    :', len(test_set))

    # create the decision tree
    start = time.time()
    tree = createDecisionTree(train_set, 10)
    end = time.time()
    print ('Time taken:', end - start)

    # calculate the accuracy of the tree
    accuracy = calcAccuracy(tree, test_set)
    accuracy_record.append(accuracy)
    #print confusion matrix and calculate Precision, Recall values
    print('lable value 1 as positive and value 0 as negative')
    print('\n\tPredicted   Class')
    print('\t   +          -  ')
    print('--------------------------')
    print('Actual +  ',positive_positive,'      ',positive_negative)
    print('Class  -  ',negative_positive,'      ',negative_negative,'\n')
    print('Precision(+) =',100*positive_positive/(positive_positive+negative_positive),'%')
    print('Precision(-) =',100*negative_negative/(negative_negative+positive_negative),'%')
    print('Recall(+) =',100*positive_positive/(positive_positive+positive_negative),'%')
    print('Recall(+) =',100*negative_negative/(negative_negative+negative_positive),'%')    
    print ('The accuracy on the test set is ', str(accuracy * 100.0),'\n')
print('\nAverage accuracy:',100*sum(accuracy_record)/len(accuracy_record))


Training set size: 921
Test set size    : 230
Time taken: 5.492665767669678
lable value 1 as positive and value 0 as negative

	Predicted   Class
	   +          -  
--------------------------
Actual +   60        68
Class  -   22        80 

Precision(+) = 73.17073170731707 %
Precision(-) = 54.054054054054056 %
Recall(+) = 46.875 %
Recall(+) = 78.43137254901961 %
The accuracy on the test set is  60.86956521739131 

Training set size: 921
Test set size    : 230
Time taken: 4.9088380336761475
lable value 1 as positive and value 0 as negative

	Predicted   Class
	   +          -  
--------------------------
Actual +   100        23
Class  -   58        49 

Precision(+) = 63.29113924050633 %
Precision(-) = 68.05555555555556 %
Recall(+) = 81.30081300813008 %
Recall(+) = 45.794392523364486 %
The accuracy on the test set is  64.78260869565217 

Training set size: 921
Test set size    : 230
Time taken: 6.143146991729736
lable value 1 as positive and value 0 as negative

	Predicted   Class
	  

Q11. Extra Credit: Implement a pruning algorithm on the decision tree (either chi-squared, reduced error pruning, or model selection using a validation set/validation error) and see if that improves the generalization error of the decision tree (using 5-fold CV).

In [40]:
# your code goes here

## Part 2: KNN Classifier/Naive Bayes Classifier

For this task you have an option to either implement the k Nearest Neighbor classifier or the Naive Bayes classifier. You will be using the same dataset as above. The implementation details are up to you but, it is generally a good idea to divide your code into helper functions.

For your implemented model, measure the accuracy of the KNN/NB classifier using 5-fold cross validation. Compare this to the decision tree model you created above. Also print the precision and recall of the KNN/NB classifier and display the confusion matrix.

In [41]:
'''
function to calculate the distance square of two data records
'''
def calculateDistanceSquare(train_data, test_data):
    global pow
    #binary list stores binary attibute index
    binary_list = [0,1,18]
    new_distance_square = 0
    #loop through all attributes and sum up the attribute distance square
    for i in range(len(test_data.features)):
        #if binary attribute, set distance 1 or 0
        if i in binary_list:
            if test_data.features[i] == train_data.features[i]:
                new_distance_square += 0
                new_distance_square
            else:
                new_distance_square += 1
                new_distance_square
        #if not binary, calculate the distance 
        else:
            new_distance_square += pow(test_data.features[i]-train_data.features[i],2)
    return new_distance_square

'''
Function to make class prediction for a test data record, k is the chosen neighbor number
prediction is made by weighted voting result
'''
def make_KNN_prediction(train_data_set, test_data, k):
    global sqrt
    distance_list = []
    #loop through the training data set, calculate each distance and weight value
    for i in range(len(train_data_set)):
        square = calculateDistanceSquare(train_data_set[i], test_data)
        #if all attributes fit test data record, record weight as 1
        if square == 0:
            weight = 1
        #if not, calculate weight value
        else:
            weight = 1/square
        #store the new neighbor info as a list to a list of lists, [index in training set, distance, weight]
        distance_list.append([i,sqrt(square),weight])
    #for further using, sort the neighbors in asending order by distance
    distance_list = sorted(distance_list, key=lambda x: x[1], reverse=False)
    #only store the needed k neighbors to a new list
    result_list = distance_list[:k]
    #one_sum and zero_sum will store voting status
    one_sum = zero_sum = 0.0
    for i in range(len(result_list)):
        if train_data_set[result_list[i][0]].label == 1:
            one_sum += result_list[i][2]
        else:
            zero_sum += result_list[i][2]
    #predict the label
    if one_sum > zero_sum:
        return 1;
    else:
        return 0;
'''
function to calculate accuracy, basically mimicing the one in decision tree
also uses glocal variable to store the values used in cinfusion matrix and precision,recall calculating
'''    
def calculate_KNN_accuracy(train_data_set,test_data_set, k):
    global positive_positive, positive_negative, negative_positive, negative_negative
    positive_positive = positive_negative = negative_positive = negative_negative = 0
    correct_count = 0;
    for i in range(len(test_data_set)):
        if make_KNN_prediction(train_data_set,test_data_set[i],k) == test_data_set[i].label == 1:
            correct_count += 1
            positive_positive += 1
        elif make_KNN_prediction(train_data_set,test_data_set[i],k) == test_data_set[i].label == 0:
            correct_count += 1
            negative_negative += 1
        elif make_KNN_prediction(train_data_set,test_data_set[i],k) != test_data_set[i].label and test_data_set[i].label == 1:
            positive_negative += 1
        else:
            negative_positive += 1
    return correct_count/len(test_data_set)


d = get_data("messidor_features.txt")
test_size = len(d)//5
accuracy_record = []
for i in range(5):
    # partition data into train_set and test_set
    train_set = d[:i*test_size]+d[(i+1)*test_size:]
    test_set = d[i*test_size:(i+1)*test_size]

    print ('Training set size:', len(train_set))
    print ('Test set size    :', len(test_set))

    # calculate the accuracy
    start = time.time()
    accuracy = calculate_KNN_accuracy(train_set, test_set, 11)
    end = time.time()
    print ('Time taken:', end - start)
    print ('The accuracy on the test set is ', str(accuracy * 100.0),'\n')
    accuracy_record.append(accuracy)
    print('lable value 1 as positive and value 0 as negative')
    print('\n\tPredicted   Class')
    print('\t   +          -  ')
    print('--------------------------')
    print('Actual +  ',positive_positive,'      ',positive_negative)
    print('Class  -  ',negative_positive,'      ',negative_negative,'\n')
    print('Precision(+) =',100*positive_positive/(positive_positive+negative_positive),'%')
    print('Precision(-) =',100*negative_negative/(negative_negative+positive_negative),'%')
    print('Recall(+) =',100*positive_positive/(positive_positive+positive_negative),'%')
    print('Recall(+) =',100*negative_negative/(negative_negative+negative_positive),'%')    
    print ('The accuracy on the test set is ', str(accuracy * 100.0),'\n')
print('\nAverage accuracy:',100*sum(accuracy_record)/len(accuracy_record))

Training set size: 921
Test set size    : 230
Time taken: 6.797586917877197
The accuracy on the test set is  66.08695652173913 

lable value 1 as positive and value 0 as negative

	Predicted   Class
	   +          -  
--------------------------
Actual +   79        49
Class  -   29        73 

Precision(+) = 73.14814814814815 %
Precision(-) = 59.83606557377049 %
Recall(+) = 61.71875 %
Recall(+) = 71.56862745098039 %
The accuracy on the test set is  66.08695652173913 

Training set size: 921
Test set size    : 230
Time taken: 6.135332822799683
The accuracy on the test set is  62.17391304347826 

lable value 1 as positive and value 0 as negative

	Predicted   Class
	   +          -  
--------------------------
Actual +   70        53
Class  -   34        73 

Precision(+) = 67.3076923076923 %
Precision(-) = 57.93650793650794 %
Recall(+) = 56.91056910569106 %
Recall(+) = 68.22429906542057 %
The accuracy on the test set is  62.17391304347826 

Training set size: 921
Test set size    : 230
