# Assignment 2 Group no. [9]
### Project members: 
[Yuxia Wang, yuxia@kth.se, Hansika Attanayake, ghat@kth.se, Sevket Melih Zenciroglu, smzen@kth.se,] ...

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution but has been excluded for some reason. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.


### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above,
and thereby 


## Load NumPy, pandas and time

In [1]:
import numpy as np
import pandas as pd
import time


## Reused functions from Assignment 1

In [17]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

def create_normalization(dataframe, normalizationtype="minmax"):
    # Learn a per-column normalization mapping and return the normalized copy
    df = dataframe.copy()
    normalization = {}
    for col in df.columns:
        if col != "CLASS" and col != "ID":
            if normalizationtype == "minmax":
                # col_min/col_max instead of min/max, to avoid shadowing the built-ins
                col_min = df[col].min()
                col_max = df[col].max()
                df[col] = [(x-col_min)/(col_max-col_min) for x in df[col]]
                normalization[col] = ("minmax", col_min, col_max)
            elif normalizationtype == "zscore":
                mean = df[col].mean()
                std = df[col].std()
                df[col] = df[col].apply(lambda x: (x-mean)/std)
                normalization[col] = ("zscore", mean, std)

    return df, normalization

def apply_normalization(dataframe,normalization):
    
    df = dataframe.copy()
    for col in df.columns:
        if col != "CLASS" and col != "ID":
            value = list(normalization[col])
            if value[0] == "minmax":
                df[col] = [(x - value[1])/(value[2]-value[1]) for x in df[col]]
            elif value[0] == "zscore":
                df[col] = [(x - value[1])/value[2] for x in df[col]]            
            else:
                return False
            
    return df

def create_imputation(dataframe):
    
    df = dataframe.copy()
    imputation = {}
    
    for col in df.columns:
        if col != "ID" and col != "CLASS":
            if pd.api.types.is_numeric_dtype(df[col]):
                # Numeric features: impute with the column mean
                imputation[col] = df[col].mean()
            else:
                # Categorical features: impute with the most frequent value of this column
                imputation[col] = df[col].mode()[0]
            df[col] = df[col].fillna(imputation[col])
            
    return df, imputation

def apply_imputation(dataframe,imputation):
    
    df = dataframe.copy()
    for col in imputation:
        df[col] = df[col].fillna(imputation[col])
    return df


def create_bins(dataframe,nobins=10,bintype="equal-width"):
    
    df = dataframe.copy()
    binning = {}
    for col in df.columns:
        if col != "CLASS" and col != "ID" and df[col].dtype in ["float64", "float32", "int64", "int32"]:
            if bintype == "equal-width":
                df[col], bins = pd.cut(df[col],nobins,retbins=True,duplicates="drop",labels=False)
                binning[col] = bins    
            elif bintype == "equal-size":
                df[col], bins = pd.qcut(df[col],q=nobins,retbins=True,duplicates="drop",labels=False)
                binning[col] = bins
            df[col] = df[col].astype("category")
            df[col] = df[col].cat.set_categories([str(i) for i in df[col].cat.categories], rename = True)
            binning[col][0] = -np.inf
            binning[col][-1] = np.inf
        else:
            df[col] = df[col].astype('category')
    
    return df, binning

def apply_bins(dataframe,binning):
    
    df = dataframe.copy()
    bin_labels = {}
    for col in binning:  
        bins = binning[col]
        df[col] = pd.cut(df[col],bins,labels=False)
        df[col] = df[col].astype("category")
        df[col] = df[col].cat.set_categories([str(i) for i in df[col].cat.categories], rename = True)        
    df = df.astype("category")
    return df

def split(dataframe, testfraction=0.5):
    
    df = dataframe.copy()
    df_random = df.reindex(np.random.permutation(df.index))
    ntrain = int((1-testfraction)*df.shape[0])
    trainingdf = df_random[:ntrain]
    # Start the test part at ntrain (not ntrain+1) so that no row is dropped
    testdf = df_random[ntrain:]
    return trainingdf, testdf

def accuracy(dataframe, correctlabels):
    
    df = dataframe.copy()
    labels = df.idxmax(axis=1)
    truelabels = (labels == correctlabels).sum(axis=0)
    accuracy = truelabels/len(df)
    return accuracy

def create_one_hot(dataframe):
    
    df = dataframe.copy()
    df_new = df.copy()
    one_hot = {}
    for col in df.columns:
        if col != "CLASS" and col != "ID":  
            if str(df.dtypes[col]) == "category" or str(df.dtypes[col]) == "object":
                df[col] = df[col].astype("category")
                one_hot[col] = list(df[col].cat.categories)
                for i in one_hot[col]:
                    name = col + "_" + str(i)  
                    new_col = df[col]==i
                    new_col = new_col.astype("float")
                    df_new[name] = new_col 
                df_new = df_new.drop(columns = col, axis = 1) 
                
    return df_new, one_hot


def apply_one_hot(dataframe,one_hot):
    
    df = dataframe.copy()
    df_new = df.copy()
    for col in df.columns:
        if col in one_hot.keys():    
            for i in one_hot[col]:
                name = col + "_" + str(i)  # same column naming as in create_one_hot
                new_col = df[col]==i
                new_col = pd.Series(new_col.astype("float"))
                df_new[name] = new_col
            df_new = df_new.drop(columns = col, axis = 1)
            
    return df_new


def folds(dataframe,nofolds=10):
    
    df = dataframe.copy()
    # Shuffle the rows; the permutation has to be applied to df to have any effect
    df = df.reindex(np.random.permutation(df.index))
    folds = []
    for i in range(nofolds):
        folds.append(df[int(len(df)*i/nofolds) : int(len(df)*(i+1)/nofolds)])

    return folds


def brier_score(dataframe, correctlabels):
    
    df = dataframe.copy()
    correct_df = pd.get_dummies(correctlabels)
    brier_score = np.mean(np.sum((df - correct_df)**2, axis=1))
    
    return brier_score


# ROC_Henrik's way

def count_tp_fp(predictions_df, correctlabels):
    #print('predictions_df', predictions_df) # last column includes the real values

    Score = predictions_df.iloc[:, 0]
    #print('Score=', Score)

    sorted_unique_score = np.unique(Score)[::-1]
    #print('sorted_unique_score = ', sorted_unique_score)

    pos = np.zeros(len(sorted_unique_score))
    neg = np.zeros(len(sorted_unique_score))

    for s in range(len(sorted_unique_score)):
        for p in range(len(predictions_df)):
            if(sorted_unique_score[s] == predictions_df.iloc[p, 0]):
                if(predictions_df.columns[0] == correctlabels[p]):
                    pos[s] += 1
                else:
                    neg[s] += 1           
    
    #print('pos=', pos)
    #print('neg=', neg)
    
    #draw_ROC(pos, neg)
    
    return pos, neg

def draw_ROC(pos, neg):
    import matplotlib.pyplot as plt
    tpr = [cs/sum(pos) for cs in np.cumsum(pos)]
    print('tpr=', tpr)    
    fpr = [cs/sum(neg) for cs in np.cumsum(neg)]
    print('fpr=', fpr)
    plt.plot([0.0]+fpr+[1.0],[0.0]+tpr+[1.0],"-",label="1")
    plt.plot([0.0,1.0],[0.0,1.0],"--",label="Baseline")
    plt.xlabel("fpr")
    plt.ylabel("tpr")
    plt.legend()
    plt.show()

def calculate_AUC_Henrik_2(pos, neg):
    # AUC = Area under ROC curve
    AUC = 0
    Cov_tp = 0
    n_tp = len(pos)
    Tot_tp = sum(pos)
    Tot_fp = sum(neg)
    
    for i in range(n_tp):
        #print('i={}...pos[i]={}...neg[i]={}'.format(i, pos[i], neg[i]))        
        if(neg[i] == 0):
            Cov_tp += pos[i]
            #print('AUC_if = ', AUC)
        elif(pos[i] == 0):
            AUC += (Cov_tp/Tot_tp)*(neg[i]/Tot_fp)
            #print('AUC_elif = ', AUC)
        else:
            AUC += (Cov_tp/Tot_tp)*(neg[i]/Tot_fp) + (pos[i]/Tot_tp)*(neg[i]/Tot_fp)/2
            Cov_tp += pos[i]
            #print('AUC_else = ', AUC)
            
    return AUC

def auc(df, correctlabels):    
    class_frequency = dict(pd.Series(correctlabels).value_counts(normalize = True))   
    #print('class_frequency', class_frequency)
    AUC = 0
    #print(df.columns)
    #test_labels_unique = pd.Series(test_labels).value_counts(normalize = True)
    #print('correctlabels=', class_frequency.keys())
    for col in df.columns:
        if(col in class_frequency.keys()):
            #print('col=', col)
            predictions_df = pd.DataFrame(df[col], columns=[col])
            #list_reversed_tpr_fpr = get_tpr_fpr(prediction_vector, correctlabels, col)
            pos, neg = count_tp_fp(predictions_df, correctlabels)
            #area_col = calculate_AUC(list_reversed_tpr_fpr)
            area_col = calculate_AUC_Henrik_2(pos, neg)            
            #print('col={}__area_col={}'.format(col, area_col))
            AUC += class_frequency[col] * area_col        
    return AUC
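
The per-class area computed by `calculate_AUC_Henrik_2` can be cross-checked against the pairwise definition of AUC: the fraction of (positive, negative) pairs in which the positive instance gets the higher score, with ties counting as one half. The following is only a sketch for verification on small inputs; `pairwise_auc` is a hypothetical helper name that is not part of the submitted solution.

```python
import numpy as np

def pairwise_auc(scores, is_positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos_scores = scores[is_positive]
    neg_scores = scores[~is_positive]
    # Compare every positive score with every negative score via broadcasting
    diff = pos_scores[:, None] - neg_scores[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos_scores) * len(neg_scores))

# Toy example: three positives (0.9, 0.8, 0.1) and two negatives (0.8, 0.3)
scores = [0.9, 0.8, 0.8, 0.3, 0.1]
is_positive = [True, True, False, False, True]
toy_auc = pairwise_auc(scores, is_positive)
print(toy_auc)  # 3 wins + 1 tie out of 6 pairs, i.e. 3.5/6
```

For the same toy input, the counts per distinct score (in descending order) are `pos = [1, 1, 0, 1]` and `neg = [0, 1, 1, 0]`, and stepping through `calculate_AUC_Henrik_2` by hand gives 1/4 + 1/3 = 3.5/6 as well.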

## 1. Define the class kNN

In [18]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# imputation, normalization, one_hot, labels, training_labels, training_data

# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.normalization should be a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot should be a one-hot mapping (see Assignment 1; can be excluded if this function was not completed)
# self.training_labels should be a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels should be the categories of the previous series
# self.training_data should be the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
# normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns 
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# k: an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are estimated by the relative class frequencies in the set of class labels from the k nearest 
#              (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation, normalization and (possibly) one-hot
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies

class kNN:
    
    def __init__(self):
        
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        
    
    def fit(self,dataframe,normalizationtype="minmax"):
        
        df = dataframe.copy()
        df, self.imputation = create_imputation(df)
        df, self.normalization = create_normalization(df, normalizationtype)
        df, self.one_hot = create_one_hot(df)
        df["CLASS"] = df["CLASS"].astype("category")
        self.training_labels = df["CLASS"]
        self.labels = list(df["CLASS"].cat.categories)
        # Store the feature values as an ndarray, without the "ID" and "CLASS" columns
        self.training_data = df.drop(columns=["ID","CLASS"],errors='ignore').values
        
    def euclidean_distance(self, row1, row2):
        # Works for a single row as well as for a matrix of rows against one row
        return np.sqrt(np.sum(np.power(row1 - row2, 2),axis=-1))
    

    def get_NN_predictions(self, x_test, k):
        # Return the indexes of the k training instances closest to x_test
        data = np.asarray(self.training_data, dtype=float)
        
        # Distances from x_test to all training instances at once
        dists = self.euclidean_distance(data, np.asarray(x_test, dtype=float))
        
        # A stable sort keeps the original order among equally distant instances
        neighbours = np.argsort(dists, kind="stable")[:k]
            
        return list(neighbours)
    
    def get_probability(self, k_NN_indexes):
        # Relative class frequencies among the labels of the k nearest neighbours
        k = len(k_NN_indexes)
        labels = self.labels
        # Positional lookup, so that it also works when the training index is not 0..n-1
        neighbour_labels = self.training_labels.iloc[list(k_NN_indexes)]
        labels_prob = np.zeros(len(labels))
        
        for j in range(len(labels)):
            labels_prob[j] = (neighbour_labels == labels[j]).sum()
        prob = labels_prob/k

        return prob
    
    
    def predict(self, dataframe, k=5):
        
        df = dataframe.copy()
        df.drop(columns=["ID","CLASS"],inplace=True,errors="ignore")
        
        # Same order as in fit: imputation first, then normalization, then one-hot encoding
        df = apply_imputation(df,self.imputation)
        df = apply_normalization(df, self.normalization)
        df = apply_one_hot(df,self.one_hot)
        num_rows = df.shape[0]
        s = (num_rows,len(self.labels))
        predictions = np.zeros(s)
        
        for i in range(num_rows):
            values = np.array(df.iloc[i,:].values)
            neighbours_index = self.get_NN_predictions(values, k)
            prob = self.get_probability(neighbours_index)
            predictions[i] = prob
            
        predictions_df = pd.DataFrame(predictions, columns=self.labels)
        #print(predictions_df)
        
        return predictions_df
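
The neighbour search described in Hints 2 and 3 can also be written for all test rows at once with NumPy broadcasting. The block below is a self-contained sketch on toy data, not the class above: the function name `knn_class_probabilities` and the toy arrays are assumptions made for illustration, and it presumes that all features are already numeric (imputed, normalized and one-hot encoded).

```python
import numpy as np

def knn_class_probabilities(train_X, train_y, test_X, labels, k=5):
    """Class probabilities as relative label frequencies among the k nearest neighbours."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    # Pairwise squared Euclidean distances, shape (n_test, n_train)
    diff = test_X[:, None, :] - train_X[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Squared distances give the same ordering as distances; a stable sort
    # keeps the lowest training index first among equally distant instances
    nearest = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    neighbour_labels = train_y[nearest]  # shape (n_test, k)
    # One column per class label, in the order given by labels
    return np.stack([(neighbour_labels == c).mean(axis=1) for c in labels], axis=1)

train_X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
train_y = ["a", "a", "b", "b"]
test_X = [[0.05, 0.0], [1.0, 0.9]]
probs = knn_class_probabilities(train_X, train_y, test_X, labels=["a", "b"], k=3)
print(probs)  # first row: 2 of 3 neighbours are "a"; second row: 2 of 3 are "b"
```

The intermediate array has shape (n_test, n_train, n_features), so for large data sets the test rows would have to be processed in chunks.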

In [19]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)

print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
        
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])
results

Training time: 0.01 s.
Testing time (k=1): 1.24 s.
Testing time (k=3): 1.23 s.
Testing time (k=5): 1.20 s.
Testing time (k=7): 1.17 s.
Testing time (k=9): 1.20 s.


| k | Accuracy | Brier score | AUC |
|---|----------|-------------|-----|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.474019 | 0.833805 |
| 7 | 0.598131 | 0.470723 | 0.834465 |
| 9 | 0.616822 | 0.483674 | 0.828734 |
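
As a sanity check of the k=1 row: with a single neighbour every prediction is a one-hot vector, so a correctly classified instance contributes 0 to the Brier score and a misclassified one contributes (0-1)² + (1-0)² = 2. The Brier score should therefore equal 2·(1 − accuracy). The short check below only uses the two rounded numbers from the results above; the variable names are chosen for illustration.

```python
accuracy_k1 = 0.747664  # accuracy for k=1, as reported in the results
brier_k1 = 0.504673     # Brier score for k=1, as reported in the results

# Each misclassified instance contributes exactly 2, each correct one 0
expected_brier = 2 * (1 - accuracy_k1)
print(expected_brier, brier_k1)  # agree up to the rounding of the reported values
```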


In [None]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.2f}".format(brier_score(predictions,train_labels)))


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes

In [8]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.class_priors should be a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
# to the relative frequencies of the labels
# self.feature_class_value_counts should be a mapping from a feature (column name) to another mapping, which
# given a feature value and class label provides the number of training instances with this specific combination
# self.feature_class_counts should be a mapping from the feature (column name) and class label to the number of
# training instances with this specific class label and any (non-missing) value for the feature
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: feature_class_value_counts can be a dictionary, which given a feature f returns a mapping obtained 
#         by pandas groupby and size (see lecture slides), which given a feature value v and class label c 
#         returns the number of instances, e.g., using get((c,v),0)
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
# predictions with estimated class probabilities for each row in df, where the class probabilities
# are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply discretization
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors

class NaiveBayes:
    
    def __init__(self):
        
        self.binning = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None
        self.class_labels = None
        
    def fit(self, dataframe, nobins=10, bintype="equal-width"):
        df = dataframe.copy()
        df, self.binning = create_bins(df,nobins,bintype)
        df["CLASS"] = df["CLASS"].astype("category")
        self.class_labels = list(df["CLASS"].cat.categories)
        self.class_priors = dict(df["CLASS"].value_counts(normalize = True))
        
        feature_class_value_counts = {} # a mapping from a col to a dictionary {(c, v): no. of instances with this combination}
        feature_class_counts = {}  # a mapping from a col to a dictionary {c: no. of instances with a non-missing value}
        
        for col in df.columns:
            if col not in ["CLASS", "ID"]:
                df_temp = df.dropna(axis = 0, how="any", subset = ["CLASS", col])
                feature_class_counts[col] = dict(df_temp["CLASS"].value_counts())
                g = df_temp.groupby(["CLASS", col]).size()
                feature_class_value_counts[col] = dict(g)
            
        self.feature_class_counts = feature_class_counts
        self.feature_class_value_counts = feature_class_value_counts
        
    def predict(self, dataframe):
        df = dataframe.copy()
        df = apply_bins(df, self.binning)
        labels = self.class_labels
        df.drop(columns=["ID","CLASS"],inplace=True,errors="ignore")
        
        nrow, ncol, nlabel = df.shape[0], df.shape[1], len(labels)
        matrix = np.zeros([nlabel, nrow, ncol])

        for col_num in range(ncol):
            col = df.columns[col_num]
            
            for label_num in range(nlabel):
                label = labels[label_num]

                for row_num in range(nrow):
                    value = df.iloc[row_num, col_num]
                    # No. of training instances with this (class, value) combination; 0 if never seen
                    features_value_count = self.feature_class_value_counts[col].get((label, value), 0)
                    feature_count = self.feature_class_counts[col].get(label, 0)
                    if feature_count > 0:
                        relative_freq = features_value_count / feature_count
                    else:
                        relative_freq = 0
                    
                    matrix[label_num, row_num, col_num] = relative_freq
        
        product = matrix.prod(axis=2)
        
        prior = np.array([self.class_priors[labels[i]] for i in range(nlabel)])
        #print(prior)
        prior = np.tile(prior, nrow).reshape(nrow,nlabel).T  # notice the shape 
        prob =  product * prior 
        
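        sum_prob = prob.sum(axis=0)
        sum_zero = sum_prob==0.0
        sum_prob += sum_zero.astype('float')  # avoid division by zero
        
        # If all non-normalized probabilities of a row are zero, fall back to the class priors (Hint 4)
        norm_prob = np.where(sum_zero, prior, prob/sum_prob)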
        
        predictions = pd.DataFrame(norm_prob.T, columns = labels)
        
        return predictions
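
The computation in Hints 2 to 4 can be followed by hand on a very small example. The sketch below is independent of the class above: the toy data, the feature names `F1` and `F2` and the helper `posterior` are made up for illustration, and it assumes categorical features without missing values, so that the per-class counts are the same for every feature.

```python
import pandas as pd

train = pd.DataFrame({
    "F1": ["x", "x", "y", "y", "y"],
    "F2": ["p", "q", "p", "q", "q"],
    "CLASS": ["a", "a", "a", "b", "b"],
})
features = ["F1", "F2"]

priors = train["CLASS"].value_counts(normalize=True).to_dict()  # a: 3/5, b: 2/5
class_counts = train["CLASS"].value_counts().to_dict()          # a: 3, b: 2
# For each feature: {(class label, feature value): no. of training instances}
value_counts = {f: train.groupby(["CLASS", f]).size().to_dict() for f in features}

def posterior(row):
    """Naive Bayes class probabilities for one instance given as {feature: value}."""
    scores = {}
    for c in priors:
        score = priors[c]
        for f in features:
            score *= value_counts[f].get((c, row[f]), 0) / class_counts[c]
        scores[c] = score
    total = sum(scores.values())
    if total == 0:
        # No class is compatible with the observed values: use the class priors
        return dict(priors)
    return {c: s / total for c, s in scores.items()}

p_seen = posterior({"F1": "y", "F2": "q"})    # a: 0.6*(1/3)*(1/3), b: 0.4*1*1, then normalized
p_unseen = posterior({"F1": "z", "F2": "p"})  # "z" never occurs in training
print(p_seen)
print(p_unseen)
```

For the first instance the non-normalized values are 1/15 for `a` and 6/15 for `b`, which gives probabilities 1/7 and 6/7. For the second instance every product is zero, so the priors 0.6 and 0.4 are returned.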
        
                

In [9]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
  
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])
results


Training time (3, 'equal-width'): 0.08 s.
Testing time (3, 'equal-width'): 0.09 s.
Training time (3, 'equal-size'): 0.08 s.
Testing time (3, 'equal-size'): 0.10 s.
Training time (5, 'equal-width'): 0.06 s.
Testing time (5, 'equal-width'): 0.09 s.
Training time (5, 'equal-size'): 0.08 s.
Testing time (5, 'equal-size'): 0.11 s.
Training time (10, 'equal-width'): 0.07 s.
Testing time (10, 'equal-width'): 0.10 s.
Training time (10, 'equal-size'): 0.07 s.
Testing time (10, 'equal-size'): 0.10 s.


| nobins | bintype | Accuracy | Brier score | AUC |
|--------|---------|----------|-------------|-----|
| 3 | equal-width | 0.616822 | 0.621325 | 0.72356 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.626168 | 0.559282 | 0.750656 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.598131 | 0.57027 | 0.747255 |
| 10 | equal-size | 0.579439 | 0.743837 | 0.746409 |
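
The two binning types can give quite different discretizations for the same feature, which is one plausible reason why the rows above differ. A minimal sketch on made-up values with one outlier, using the same pandas calls as `create_bins` (`pd.cut` for equal-width, `pd.qcut` for equal-size):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Equal-width: the value range is split into 3 intervals of equal length
width_codes, width_edges = pd.cut(values, 3, retbins=True, labels=False)
# Equal-size: the cut points are quantiles, so each bin gets about the same no. of values
size_codes, size_edges = pd.qcut(values, q=3, retbins=True, labels=False)

print(width_codes.tolist())  # [0, 0, 0, 0, 0, 2]: the outlier leaves the middle bin empty
print(size_codes.tolist())   # [0, 0, 1, 1, 2, 2]
```

With equal-width bins a single extreme value can push almost all instances into one bin, while equal-size bins adapt to the distribution of the values.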


In [10]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.85
AUC on training set: 0.97
Brier score on training set: 0.23


### Comment on assumptions, things that do not work properly, etc.