# ID2214 Assignment 3 Group no. 5
### Project members: 
[Ceren Dikmen, cerend@kth.se] [Jakob Heyder, heyder@kth.se] [Lutfi Altin, lutfia@kth.se] [Muhammad Fasih Ullah, mufu@kth.se]

### Declaration
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).


## Load NumPy, pandas and time

In [0]:
import numpy as np
import pandas as pd
import time
import pprint


## Reused functions from Assignment 1

In [0]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

def create_normalization(df, normalizationtype = "minmax"):
    copydf = df.copy()
    normalization = {}
    
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])] # Removing ID and CLASS columns
    cols = cols.select_dtypes(include="number") # Selecting only numeric columns from the remaining (np.float_ no longer exists in NumPy 2.0)
    
    if normalizationtype == 'minmax':
        for column in cols:
            min = copydf[column].min()
            max = copydf[column].max()
            copydf[column] = [(x-min)/(max-min) for x in copydf[column]]
            normalization[column] = ("minmax", min, max)
            
    elif normalizationtype == 'zscore':
        for column in cols:
            mean = copydf[column].mean()
            std = copydf[column].std()
            copydf[column] = copydf[column].apply(lambda x: (x-mean)/std)
            normalization[column] = ("zscore", mean, std)
            
    return copydf, normalization

def apply_normalization(df, normalization):
    copydf = df.copy()
    
    for column in normalization:
        type = normalization[column][0]
        
        if type == 'minmax':
            min = normalization[column][1]
            max = normalization[column][2]
            copydf[column] = np.clip([(x-min)/(max-min) for x in copydf[column]], 0, 1) # Clipping outliers to 0-1
            
        elif type == 'zscore':
            mean = normalization[column][1]
            std = normalization[column][2]
            copydf[column] = copydf[column].apply(lambda x: (x-mean)/std)
   
    return copydf


def create_imputation(df):
    copydf = df.copy()
    imputation = {}
    
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])] # Dropping ID and CLASS columns
    numbercols = cols.select_dtypes(include="number") # Choosing numeric columns
    othercols = cols.select_dtypes(exclude="number") # Choosing all other columns
    
    for column in numbercols:
        # For numeric columns, replace missing values with the mean (0 if the whole column is NaN)
        mean = copydf[column].mean()
        fill = 0 if pd.isna(mean) else mean
        copydf[column] = copydf[column].fillna(fill)
        imputation[column] = fill
        
    for column in othercols:
        # mode() returns a Series, so iloc[0] picks the first mode; it is empty if the whole column is NaN
        modes = copydf[column].mode()
        if len(modes) > 0:
            fill = modes.iloc[0]
        elif copydf[column].dtype == 'object':
            fill = ""
        else:
            # categorical column without observed values: use its first category
            fill = copydf[column].astype('category').cat.categories[0]
        copydf[column] = copydf[column].fillna(fill)
        imputation[column] = fill
            
    return copydf, imputation

def apply_imputation(df, imputation):
    copydf = df.copy()
    
    for column in imputation:
        # assign the result back; in-place fillna on a selected column is not reliable in recent pandas
        copydf[column] = copydf[column].fillna(imputation[column])
    
    return copydf

def create_one_hot(df):
    copydf = df.copy()
    one_hot = {}
    
    # filter all columns except ID and CLASS
    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
    # filter only categorical data
    cols = cols.select_dtypes(include=["object", "category"])
    
    for col in cols:
        # convert to get categories
        if copydf[col].dtype == 'object':
            copydf[col] = copydf[col].astype('category')
            
        # convert categorical to binary (one-hot)    
        dummies = pd.get_dummies(copydf[col])

        # create binary column for each category
        one_hot[col] = {}
        for cat in dummies:
            one_hot[col][cat] = col + '-' + cat
            copydf[col + '-' + cat] = dummies[cat]
            # convert to float type (hint 4)
            copydf[col + '-' + cat] = copydf[col + '-' + cat].astype('float')
            
        # drop original columns     
        copydf.drop(col, axis=1, inplace=True)
    
    return copydf, one_hot

def apply_one_hot(df, one_hot):
    copydf = df.copy()
    
    for col in one_hot:
        # convert categorical to binary (one-hot)
        dummies = pd.get_dummies(copydf[col])
        # iterate over categories generated in one-hot
        for (cat, col_name) in one_hot[col].items():
            # for each category take dummy value
            copydf[col_name] = dummies[cat]
            # convert to float type (hint 4)
            copydf[col_name] = copydf[col_name].astype('float')
        copydf.drop(col, axis=1, inplace=True)
    
    return copydf

#################################
#                               #
# Performance measure functions #
#                               #
#################################

def accuracy(df, correctlabels):
    return sum(df.idxmax(axis=1)==correctlabels)/len(correctlabels)

def brier_score(df, correctlabels):
    score = 0
    
    # get observed probabilities (one for each correct label, otherwise zero)
    observed_probs = pd.get_dummies(correctlabels)
    # vectorized formula slides brier score (probability - observed_probability) squared
    score = (df - observed_probs) ** 2
    # sum over different labels, and sum all instances
    score = score.sum(axis=1)
    # average over instances
    score = score.mean()
        
    return score

# For one class c and one threshold, returns the true and false positive rates (tpr, fpr)
def auc_single(predictions, correctlabels, threshold, c):
   
    # array with true for correct labels for class c (by row index)
    correctlabels_class = np.array(correctlabels)==predictions.columns[c]
    
    # array with predictions for all instances that should be classified class c
    predictions_class = predictions[ predictions.columns[c] ]
    
    # array with true for all correctly predicted labels according to threshold
    predicted_labels = predictions_class[correctlabels_class] >= threshold
    pos = sum(predicted_labels)
    
    # correctly predicted instances (according to threshold) divided by total number of instances that should be class c
    tpr = pos / sum(correctlabels_class)
    
    # repeat for false positive rate (instances not in class)
    not_correctlabels_class = np.array(correctlabels)!=predictions.columns[c]
    predictions_class = predictions[ predictions.columns[c] ]
    predicted_labels = predictions_class[not_correctlabels_class] >= threshold
    neg = sum(predicted_labels)
    fpr = neg / sum(not_correctlabels_class)
    
    return tpr, fpr


def auc(predictions, correctlabels):
    thresholds = np.unique(predictions)
    total_number_of_labels = len(correctlabels)
    
    AUCs = {}
    
    # iterate over all classes and calculate the area under the ROC(tpr/fpr) curve (AUC)
    for (idx,c) in enumerate(np.unique(correctlabels)):
        single = [auc_single(predictions, correctlabels, t, idx) for t in reversed(thresholds)]
                    
        # calculate AUC as area under the curve
        AUC = 0
        tpr_last = 0
        fpr_last = 0
        
        # iterate over all thresholds
        for s in single:
            tpr, fpr = s
            
            # Case 1.) Add area under triangle        
            if tpr > tpr_last and fpr > fpr_last:
                AUC += (fpr-fpr_last)*tpr_last + (fpr-fpr_last)*(tpr-tpr_last) / 2
            
            # Case 2.) Add area under rectangle            
            elif fpr > fpr_last:
                AUC += (fpr-fpr_last)*tpr
            
            # update point coordinates (tpr, fpr) of curve
            tpr_last = tpr
            fpr_last = fpr
       
        AUCs[c] = AUC
        
                
    # take the weighted average over all classes (weighted by their frequency of occurrence)
    AUC_total = 0
    for (cName, class_auc) in AUCs.items():
        number_of_labels = np.sum(np.array(correctlabels) == cName)
        weight = number_of_labels / total_number_of_labels
        AUC_total += weight * class_auc
        
    return AUC_total  

def create_bins(df, nobins=10, bintype="equal-width"):
    copydf = df.copy()
    binning = {}

    cols = copydf.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
    numbercols = cols.select_dtypes(include="number")

    for col in numbercols:
        if bintype == "equal-width":
            copydf[col], bins = pd.cut(copydf[col], nobins, labels=False, retbins=True)
        else:
            copydf[col], bins = pd.qcut(copydf[col], nobins, labels=False, retbins=True, duplicates="drop")
        
        # hint 4-5: fixed category set 0..nobins-1
        # (astype('category', categories=...) was removed in pandas 1.0)
        copydf[col] = copydf[col].astype(pd.CategoricalDtype(categories=list(range(nobins))))

        # hint 6
        bins[0] = -np.inf
        bins[-1] = np.inf
        binning[col] = bins

    return copydf, binning


def apply_bins(df, binning):
    copydf = df.copy()

    for column in binning:
        copydf[column] = pd.cut(copydf[column], labels=False, bins=binning[column])

        # hint 3-4: n + 1 bin edges describe n bins
        nobins = len(binning[column]) - 1
        copydf[column] = copydf[column].astype(pd.CategoricalDtype(categories=list(range(nobins))))

    return copydf
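
As a quick illustration of the discretization step, the sketch below bins a toy column with `pd.cut`, opens the outer edges to ±inf as in hint 6, and reuses the stored edges on unseen values. The column values and variable names are made up for the example, and `pd.CategoricalDtype` is only one possible way to fix the category set on current pandas versions.

```python
import numpy as np
import pandas as pd

# Toy training column (hypothetical values, for illustration only)
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
nobins = 4

# Equal-width binning; retbins=True also returns the bin edges
codes, edges = pd.cut(train, nobins, labels=False, retbins=True)
edges = np.array(edges, dtype=float)  # writable copy of the edges

# Open the outer edges so that values outside the training range still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf

# n bins are described by n + 1 edges
n_from_edges = len(edges) - 1

# Fix the category set to 0..n-1, even for bins that happen to be empty
bin_dtype = pd.CategoricalDtype(categories=list(range(n_from_edges)))
train_binned = codes.astype(bin_dtype)

# Apply the stored edges to new data
test = pd.Series([-10.0, 4.0, 100.0])
test_binned = pd.cut(test, bins=edges, labels=False).astype(bin_dtype)

print(train_binned.tolist())
print(test_binned.tolist())
```

Values below or above the training range end up in the first and last bin respectively, which is the purpose of replacing the outer edges.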


## 1. Define the class DecisionTree

In [0]:
# Define the class DecisionTree with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, imputation, labels, model
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size"
# min_samples_split: no. of instances required to allow a split (default = 5)
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.labels should be the categories of the "CLASS" column of df, set to be of type "category" 
# self.model should be a decision tree (for details, see lecture slides), where the leafs return class probabilities
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: First find the available features (excluding "CLASS" and "ID"), then find the class counts, e.g., using 
#         groupby, and calculate the default class probabilities (relative frequencies of the class labels)
# Hint 2: Define a function, e.g., called divide_and_conquer, that takes the above as input together with df 
#         and min_samples_split, and also a nodeno (starting with 0) to keep track of the generated nodes in the tree
# Hint 3: You may represent the tree under construction as a list of nodes (tuples), on the form:
#         (nodeno,"leaf",class_probabilities): corresponding to a leaf node where class_probabilities is a vector
#                                              with the relative class frequencies (ordered according to self.labels)
#         (nodeno,feature,node_dict): corresponding to an internal (non-leaf) node where node_dict is a mapping from
#                                     the possible values of feature to child nodes (their nodenos)
# Hint 4: You may evaluate each feature by a function information_content, which takes the group sizes
#         for each possible value of the feature together with the class counts of each group as input
# Hint 5: The best feature found (with lowest resulting information content) will be used to split the training
#         instances, and each sub-group is used for generating a sub-tree (recursively by divide_and_conquer,
#         see lecture slides for details)
# Hint 6: The list of nodes output by divide_and_conquer may finally be converted to an array, where each nodeno in the 
#         tuples corresponds to an index of the array 
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are the relative class frequencies in the leaves of the decision tree into which the instances in
#              df fall
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation and binning
# Hint 2: Iterate over the rows calling some sub-function, e.g., make_prediction(nodeno,row), which for a test row
#         finds a leaf node from which class probabilities are obtained
# Hint 3: This sub-function may recursively traverse the tree (represented by an array), starting with the nodeno
#         that corresponds to the root

class DecisionTree():
    def __init__(self):
        self.binning = None
        self.imputation = None
        self.labels = None
        self.model = None
        
        # Additional to save all possible unique values per feature
        self.unique_values = None
        
    def predict(self, df):
        copydf = df.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
        copydf_imputation = apply_imputation(copydf, self.imputation)
        copydf_bin = apply_bins(copydf_imputation, self.binning)
        
        
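        # Make a prediction for each test row and build the frame in one go
        # (DataFrame.append was removed in pandas 2.0); dict() accepts both
        # the dict and the Series form of the leaf probabilities
        rows = [dict(self.make_prediction(row)) for _, row in copydf_bin.iterrows()]
        # labels missing from a leaf get probability 0.0
        df_probs = pd.DataFrame(rows, columns=list(self.labels)).fillna(0.0)
                        
        return df_probs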
    
    def print(self):
        pp = pprint.PrettyPrinter(indent=4)
        pp.pprint(self.model)
        
        
    def make_prediction(self, row):
        # initialize with root
        node = self.model
        
        #print("#### Prediction ####")
        #print(row)
        
        # traverse the tree
        while (node[1] != "leaf"):
            feature = node[1]
            feature_val = row[feature]
            #print("feature", feature)
            #print("feature_val", feature_val)
            #print("row", row)
            #print("node", node)
            #print("node_dict", node[2])
            node = node[2][feature_val]
                    
        # return probabilities of leaf node
        #print("return", node[2])
        return node[2]
        
        
        
    def fit(self, df, nobins=10, bintype="equal-width", min_samples_split=5):
        # Impute first, then discretize, so that the stored mappings match the
        # order in which predict applies them (imputation, then binning)
        copydf_imputation, self.imputation = create_imputation(df)
        copydf_bins, self.binning = create_bins(copydf_imputation, nobins, bintype)
        self.labels = pd.unique(copydf_bins["CLASS"].astype('category'))
        
        # available features
        copydf = copydf_bins.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
        features = copydf.columns
        self.unique_values = {f: copydf[f].unique() for f in features}
        
        # create decision tree
        self.model = self.divide_and_conquer(copydf_bins, features, self.majority_class(copydf_bins), min_samples_split, 0)
        
        
    # input instances, features, majority label, min_split number (when to stop), and initial node number
    def divide_and_conquer(self, df, features, label, min_sample_split, nodeno):
        # 1.) Exit conditions
        probabilities = {label: 0.0 for label in self.labels}
        if (len(df) == 0):
            probabilities[label] = 1.0
            return (nodeno, "leaf", probabilities) # return majority_label (prev call)
        if (len(pd.unique(df['CLASS'])) == 1):
            probabilities[df['CLASS'].iloc[0]] = 1.0
            return (nodeno, "leaf", probabilities) # return unique label of all instances
        if (len(features) == 0):
            probabilities[self.majority_class(df)] = 1.0
            return (nodeno, "leaf", probabilities) # return majority class (this call)
        if (len(df) < min_sample_split):
            return (nodeno, "leaf", self.probabilities(df)) # too few instances to split: return relative class frequencies (this call)
        
        
        # 2.) Decide which feature to split with
        ent = self.entropy(df['CLASS'].value_counts()) # total entropy for current node
        feature_info_gain = np.zeros(len(features))
        
        # Iterate over features and calculate their information gain
        for idx, feature in enumerate(features):
            feature_info_gain[idx] = ent - self.residual_information(df, feature)
            
        # Take feature with maximal information gain (i.e. lowest residual entropy)
        max_idx = np.argmax(feature_info_gain)
        best_feature = features[max_idx]
        unique_values = self.unique_values[best_feature]
        remaining_features = [e for idx,e in enumerate(features) if idx!=max_idx]  # all except taken feature
        
        #print('')
        #print(df)
        #print("Order features", features)
        #print('Info-Gains:', feature_info_gain)
        #print("Best feature", best_feature)
        #print("min index", max_idx)
        #print("features", features)
        #print("unique vals", unique_values)
        #print("remaining features", remaining_features)
        #print(" ")

        
        # 3.) Call divide and conquer recursively for all possible values of the feature (subgroups of instances)
        node_dict = {val:self.divide_and_conquer(
            df[df[best_feature] == val], # filter only rows where feature has the value 
            remaining_features,
            self.majority_class(df),
            min_sample_split,
            nodeno + i # TODO: Node number correct increment (trivial)
        ) for i,val in enumerate(unique_values)}
        node = (nodeno,best_feature,node_dict)
        
        return node
    
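    def probabilities(self, df):
        # relative class frequencies as a dict over all labels (0.0 for labels absent from df)
        label_counts = df['CLASS'].value_counts()
        probabilities = {label: label_counts.get(label, 0) / len(df) for label in self.labels}
        return probabilities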
    
    def majority_class(self, df):
        return df['CLASS'].value_counts().idxmax()  # label (not position) of the most frequent class
    
    # Input: sizes of groups 
    def residual_information(self, df, feature):
        # How often does one feature-value appear [weight]
        group_sizes = df[feature].value_counts()
        
        # The total label counts for the feature (for info-gain)
        total_label_counts = sum(df['CLASS'].value_counts())
        
        infRes = 0
        for idx, group_size in group_sizes.items():
            label_counts = df[df[feature] == idx]['CLASS'].value_counts()
            # relative frequency of feature value * Entropy for the sub-partition
            infRes += (group_size / total_label_counts) * self.entropy(label_counts)
        return infRes
    
    def entropy(self, label_counts):
        total_no = sum(label_counts)
        probabilities = label_counts / total_no
        return - sum( probabilities * np.log(probabilities))
        


# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

tree_model = DecisionTree()

test_labels = glass_test_df["CLASS"]

nobins_values = [5,10]
bintype_values = ["equal-width","equal-size"]
min_samples_split_values = [3,5,10]
parameters = [(nobins,bintype,min_samples_split) for nobins in nobins_values for bintype in bintype_values 
              for min_samples_split in min_samples_split_values]
#parameters = [(5, "equal-width", 3)]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    tree_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1],min_samples_split=parameters[i][2])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = tree_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

#fast_results = np.array([results[0], ] * 12)

#tree_model.print()


results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values,min_samples_split_values]),
                       columns=["Accuracy","Brier score","AUC"])

results



Training time (5, 'equal-width', 3): 3.04 s.




Testing time (5, 'equal-width', 3): 0.46 s.
Training time (5, 'equal-width', 5): 2.71 s.
Testing time (5, 'equal-width', 5): 0.45 s.
Training time (5, 'equal-width', 10): 1.50 s.
Testing time (5, 'equal-width', 10): 0.51 s.
Training time (5, 'equal-size', 3): 3.49 s.
Testing time (5, 'equal-size', 3): 0.47 s.
Training time (5, 'equal-size', 5): 1.85 s.
Testing time (5, 'equal-size', 5): 0.49 s.
Training time (5, 'equal-size', 10): 0.86 s.
Testing time (5, 'equal-size', 10): 0.46 s.
Training time (10, 'equal-width', 3): 5.35 s.
Testing time (10, 'equal-width', 3): 0.44 s.
Training time (10, 'equal-width', 5): 3.35 s.
Testing time (10, 'equal-width', 5): 0.44 s.
Training time (10, 'equal-width', 10): 1.86 s.
Testing time (10, 'equal-width', 10): 0.44 s.
Training time (10, 'equal-size', 3): 3.24 s.
Testing time (10, 'equal-size', 3): 0.42 s.
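
The split selection in `divide_and_conquer` relies on entropy and information gain (hints 4 and 5 above). A minimal, self-contained sketch on a made-up toy table is given below. The helper names `entropy` and `information_gain` and the data are assumptions for the example, and log base 2 is used so that the values are in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target="CLASS"):
    """Entropy of the node minus the size-weighted entropy of the groups after splitting."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    residual = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - residual

# Hypothetical toy data: "colour" separates the classes perfectly, "size" not at all
rows = [
    {"colour": "red", "size": "small", "CLASS": "a"},
    {"colour": "red", "size": "large", "CLASS": "a"},
    {"colour": "blue", "size": "small", "CLASS": "b"},
    {"colour": "blue", "size": "large", "CLASS": "b"},
]
gains = {f: information_gain(rows, f) for f in ("colour", "size")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

Maximizing the information gain is equivalent to minimizing the residual (weighted) entropy, since the entropy of the parent node is the same for every candidate feature.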


In [0]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

tree_model = DecisionTree()

test_labels = glass_test_df["CLASS"]

nobins_values = [5,10]
bintype_values = ["equal-width","equal-size"]
min_samples_split_values = [3,5,10]
parameters = [(nobins,bintype,min_samples_split) for nobins in nobins_values for bintype in bintype_values 
              for min_samples_split in min_samples_split_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    tree_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1],min_samples_split=parameters[i][2])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = tree_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values,min_samples_split_values]),
                       columns=["Accuracy","Brier score","AUC"])

results



['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']
963
107
Training time (5, 'equal-width', 3): 0.03 s.




AttributeError: ignored

In [0]:
train_labels = glass_train_df["CLASS"]
tree_model.fit(glass_train_df,min_samples_split=1)
predictions = tree_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.97
AUC on training set: 1.00
Brier score on training set: 0.03
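
To make the accuracy and Brier score figures easier to interpret, here is a small worked example on three hypothetical predictions. It uses plain NumPy arrays instead of the dataframes used in the notebook, and all labels and probabilities are invented for illustration.

```python
import numpy as np

labels = ["a", "b", "c"]  # column order of the predicted probabilities
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.5, 0.3]])
correct = ["a", "c", "b"]

# Accuracy: fraction of rows whose most probable label is the correct one
predicted = [labels[i] for i in probs.argmax(axis=1)]
acc = np.mean([p == c for p, c in zip(predicted, correct)])

# Brier score: squared distance to the one-hot vector of the correct label,
# summed over the labels and averaged over the rows
one_hot = np.array([[1.0 if l == c else 0.0 for l in labels] for c in correct])
brier = ((probs - one_hot) ** 2).sum(axis=1).mean()

print(acc, brier)
```

The second row is misclassified and contributes 0.86 to the sum, which shows how a confident wrong prediction dominates the Brier score; lower values are better.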


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class DecisionForest

In [0]:
# Define the class DecisionForest with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, imputation, labels, model
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size"
# min_samples_split: no. of instances required to allow a split (default = 5)
# random_features: no. of features to evaluate at each split (default = 2), 0 means all features (no random sampling)
# notrees: no. of trees in the forest (default = 10)
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.labels should be the categories of the "CLASS" column of df, set to be of type "category" 
# self.model should be a random forest (for details, see lecture slides)
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: Redefine divide_and_conquer to take one additional argument; random_features, and instead of
#         evaluating all features choose a random subset, e.g., by np.random.choice (without replacement)
# Hint 2: Generate each tree in the forest from a bootstrap replicate of df, e.g., by np.random.choice 
#         (with replacement) from the index values of df.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are the mean of all relative class frequencies in the leaves of the forest into which the instances in
#              df fall
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation and binning
# Hint 2: Iterate over the rows calling some sub-function, e.g., make_prediction(row), which for a test row
#         finds all leaf nodes and calculates the average of their class probabilities


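class DecisionForest:
    def __init__(self):
        self.binning = None
        self.imputation = None
        self.labels = None
        self.model = None
        self.unique_values = None  # additional: possible values per feature

    # Minimal sketch following hints 1-2 above: one tree per bootstrap replicate,
    # with a random subset of the features evaluated at each split
    def fit(self, df, nobins=10, bintype="equal-width", min_samples_split=5, random_features=2, notrees=10):
        df_copy, self.imputation = create_imputation(df)
        df_bins, self.binning = create_bins(df_copy, nobins, bintype)
        
        self.labels = pd.unique(df_bins["CLASS"].astype('category'))
        features = [c for c in df_bins.columns if c not in ('ID', 'CLASS')]
        self.unique_values = {f: df_bins[f].unique() for f in features}
        self.model = []
        for _ in range(notrees):
            rows = np.random.choice(len(df_bins), len(df_bins), replace=True)  # bootstrap replicate
            sample = df_bins.iloc[rows]
            self.model.append(self.divide_and_conquer(sample, features, self.probabilities(sample),
                                                      min_samples_split, random_features))
    
    def predict(self, df):
        copydf = df.loc[:, ~df.columns.isin(['ID', 'CLASS'])]
        copydf = apply_bins(apply_imputation(copydf, self.imputation), self.binning)
        # mean of the leaf class probabilities over all trees, per row
        rows = [np.mean([self.make_prediction(tree, row) for tree in self.model], axis=0)
                for _, row in copydf.iterrows()]
        predictions = pd.DataFrame(rows, columns=list(self.labels))
        return predictions
    
    def make_prediction(self, node, row):
        # leaves are ("leaf", probabilities); internal nodes are (feature, children, probabilities)
        while node[0] != "leaf":
            feature, children, default = node
            node = children.get(row[feature], ("leaf", default))  # unseen value: use the node's own distribution
        return node[1]
    
    def probabilities(self, df):
        counts = df['CLASS'].value_counts()
        return np.array([counts.get(label, 0) / len(df) for label in self.labels])
    
    def residual_information(self, df, feature):
        # size-weighted entropy of the groups obtained by splitting on feature
        total = 0.0
        for _, group in df.groupby(feature, observed=True):
            p = self.probabilities(group)
            p = p[p > 0]
            total -= len(group) / len(df) * np.sum(p * np.log2(p))
        return total
    
    def divide_and_conquer(self, df, features, default, min_samples_split, random_features):
        if len(df) == 0:
            return ("leaf", default)
        probs = self.probabilities(df)
        if probs.max() == 1.0 or len(features) == 0 or len(df) < min_samples_split:
            return ("leaf", probs)
        # random_features == 0 means all features (no random sampling)
        k = len(features) if random_features == 0 else min(random_features, len(features))
        candidates = [features[i] for i in np.random.choice(len(features), k, replace=False)]
        best = min(candidates, key=lambda f: self.residual_information(df, f))
        remaining = [f for f in features if f != best]
        children = {v: self.divide_and_conquer(df[df[best] == v], remaining, probs, min_samples_split, random_features)
                    for v in self.unique_values[best]}
        return (best, children, probs)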
        
        
        


In [0]:
glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

forest_model = DecisionForest()

test_labels = glass_test_df["CLASS"]

min_samples_split_values = [1,2,5]
random_features_values = [1,2,5]

parameters = [(min_samples_split,random_features) for min_samples_split in min_samples_split_values 
              for random_features in random_features_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    forest_model.fit(glass_train_df,min_samples_split=parameters[i][0],random_features=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = forest_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([min_samples_split_values,random_features_values]),
                       columns=["Accuracy","Brier score","AUC"])

results

Training time (1, 1): 7.51 s.
Testing time (1, 1): 0.09 s.
Training time (1, 2): 10.65 s.
Testing time (1, 2): 0.09 s.
Training time (1, 5): 19.78 s.
Testing time (1, 5): 0.09 s.
Training time (2, 1): 7.97 s.
Testing time (2, 1): 0.09 s.
Training time (2, 2): 12.01 s.
Testing time (2, 2): 0.10 s.
Training time (2, 5): 19.42 s.
Testing time (2, 5): 0.10 s.
Training time (5, 1): 4.57 s.
Testing time (5, 1): 0.10 s.
Training time (5, 2): 7.31 s.
Testing time (5, 2): 0.09 s.
Training time (5, 5): 9.75 s.
Testing time (5, 5): 0.09 s.


min_samples_split,random_features,Accuracy,Brier score,AUC
1,1,0.64486,0.477912,0.860736
1,2,0.71028,0.411981,0.892842
1,5,0.71028,0.421511,0.895716
2,1,0.663551,0.452421,0.872447
2,2,0.626168,0.411684,0.911137
2,5,0.654206,0.435847,0.881621
5,1,0.64486,0.457859,0.865636
5,2,0.719626,0.410616,0.903016
5,5,0.616822,0.455782,0.883897
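
The first two columns of the table are the two levels of the `MultiIndex` built with `from_product`, in the order `min_samples_split`, `random_features`. As a small sketch, passing `names=` labels the levels explicitly so that they survive an export or `reset_index`; the grid below is shortened and filled with placeholder zeros, not with measured results.

```python
import numpy as np
import pandas as pd

min_samples_split_values = [1, 2]
random_features_values = [1, 5]

# Name the index levels so the parameter columns are self-describing
index = pd.MultiIndex.from_product(
    [min_samples_split_values, random_features_values],
    names=["min_samples_split", "random_features"],
)

# Placeholder zeros only; no real measurements are shown here
grid = pd.DataFrame(np.zeros((len(index), 3)), index=index,
                    columns=["Accuracy", "Brier score", "AUC"])

flat = grid.reset_index()
print(list(flat.columns))
```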


In [0]:
train_labels = glass_train_df["CLASS"]
forest_model.fit(glass_train_df,min_samples_split=1)
predictions = forest_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.96
AUC on training set: 1.00
Brier score on training set: 0.12
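
The three ingredients of the forest described in the hints (bootstrap replicates, random feature subsets, averaged leaf probabilities) can be sketched in isolation as follows. This is only an illustration under assumptions: the helper `sample_features`, the row count and the per-tree probabilities are hypothetical, and a seeded `np.random.default_rng` is used instead of `np.random.choice` purely to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

n_rows = 8
features = ["RI", "Na", "Mg", "Al"]

# Bootstrap replicate: draw n_rows row positions with replacement
bootstrap_rows = rng.choice(n_rows, size=n_rows, replace=True)

def sample_features(features, random_features):
    """Random feature subset for one split, drawn without replacement; 0 means all features."""
    if random_features == 0 or random_features >= len(features):
        return list(features)
    picked = rng.choice(len(features), size=random_features, replace=False)
    return [features[i] for i in picked]

subset = sample_features(features, 2)

# Forest prediction for one row: mean of the per-tree class probability vectors
tree_probs = np.array([[1.0, 0.0],
                       [0.5, 0.5],
                       [0.0, 1.0],
                       [0.5, 0.5]])
forest_probs = tree_probs.mean(axis=0)

print(bootstrap_rows, subset, forest_probs)
```

Averaging probability vectors rather than taking a majority vote keeps the output usable for the Brier score and AUC computations used above.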


### Comment on assumptions, things that do not work properly, etc.