# Building a Ada Boost Classification Model (from scratch)
There are many packages that include wonderful pre-built functions to build adaboost models. I figured a good way to really learn and appreciate the algorithm though would be to build one from scratch using (mostly) base python :)

##A couple of notes:
-there is likely some redundant/inefficient code in here! I really tried to only use native python so there is going to be some silly work arounds in here. Also expect a lot of if/for/while statements
-

## 1 Reading in data
this data set is a pretty simple data set of patient data collected to try and predict two outcomes: inflammation and nephiritis. For this project we will just focus on inflammation as the primary outcome. 

In [237]:
import pandas as pd
import numpy as np
df = pd.read_csv("diagnosis.csv")
df.head()

Unnamed: 0,temp,nausea,lumbar_pain,urine_pushing,micturition,burning,inflammation,nephritis
0,35.5,no,yes,no,no,no,no,no
1,35.9,no,no,yes,yes,yes,yes,no
2,35.9,no,yes,no,no,no,no,no
3,36.0,no,no,yes,yes,yes,yes,no
4,36.0,no,yes,no,no,no,no,no


## 2 cleaning data and adding initial sample weights 
'temp' is the only continuous variable in this dataset, the rest can be convirted to binary dummy variables (rather than characters. 

At first, all samples will be equally weighted (1/n)


In [344]:
def clean_data(obs):
    if obs == 'yes':
        new_obs = 1
    elif obs == 'no':
        new_obs = 0
    else:
        new_obs = obs
    return new_obs

df_nonclean = pd.DataFrame.copy(df)
for col in df.columns:
    df[col] = df[col].apply(lambda x: clean_data(x))
df['weight'] = 1/len(df)
print(df.head())

df1 = pd.DataFrame.copy(df) #making a copy so that original can stay as is just in case

   temp  nausea  lumbar_pain  urine_pushing  micturition  burning  \
0  35.5       0            1              0            0        0   
1  35.9       0            0              1            1        1   
2  35.9       0            1              0            0        0   
3  36.0       0            0              1            1        1   
4  36.0       0            1              0            0        0   

   inflammation  nephritis    weight  
0             0          0  0.008333  
1             1          0  0.008333  
2             0          0  0.008333  
3             1          0  0.008333  
4             0          0  0.008333  


## 3 building function to collect gini impurity scores for each feature
Because an Adaboost model is an ensemble of weighted stumps, the first step we should take is to define a function to determint best feature for the stump at whatever round we are in.

Special care will be needed to distinguish continuous and categorical (binary) variables. 

In [259]:
#stump function for binary variables
def gini(data, outcome):
        n = len(data)
        features = list(data.columns)
        features.remove(outcome)
        features.remove('weight')
        pred = pd.DataFrame(columns = features)
        gini_index = pd.DataFrame(columns = ['feature', 
                                             'continuous',
                                             'gini_score',
                                             'best_cutoff',
                                             'prob_yes0',
                                             'prob_yes1'
                                            ])
        gini_index['feature']=features 
        for feature in features:
            if (max(data[feature]) == 1) & (min(data[feature])==0):   #rough way of sorting binary from cont. variables
                prob_yes1 = sum(data[outcome][data[feature]==1])/len(data[data[feature]==1])
                prob_no1 = 1 - prob_yes1
                gini_impur1 = 1 - prob_yes1**2 - prob_no1**2
                
                prob_yes0 = sum(data[outcome][data[feature]==0])/len(data[data[feature]==0])
                prob_no0 = 1 - prob_yes0
                gini_impur0 = 1 - prob_yes0**2 - prob_no0**2
                
                weight_sum1 = sum(data['weight'][data[feature]==1])
                weight_sum0 = sum(data['weight'][data[feature]==0])
                
                gini_index['gini_score'][gini_index['feature']==feature] = (weight_sum1*gini_impur1 + 
                                                                            weight_sum0*gini_impur0)
                gini_index['continuous'][gini_index['feature']==feature] = 0
                gini_index['best_cutoff'][gini_index['feature']==feature] = None
                gini_index['prob_yes0'][gini_index['feature']==feature] = prob_yes0
                gini_index['prob_yes1'][gini_index['feature']==feature] = prob_yes1
                
                
            else: 
                sorted_feat = sorted(list(data[feature]))                                     
                avg = []
                for i in (range(0,len(sorted_feat)-1)):
                    avg.append((sorted_feat[i]+sorted_feat[i+1])/2)
                avg = sorted(list(set((avg))))         #making sure only unique figures are in list 
                lowest_gini = 1   #creating gini variable to keep track of best gini/cutoff
                best_cutoff = 0
                for i in (range(0,len(avg))):
                    if min(data[feature]) == avg[i]:  # to prevent zero in denom
                        prob_yes1 = sum(data[outcome][data[feature]> avg[i]])/len(data[data[feature]> avg[i]])
                        prob_no1 = 1 - prob_yes1 
                    
                        prob_yes0 = sum(data[outcome][data[feature]<= avg[i]])/len(data[data[feature]<= avg[i]])
                        prob_no0 = 1 - prob_yes0
                        
                        weight_sum1 = sum(data['weight'][data[feature] > avg[i]])
                        weight_sum0 = sum(data['weight'][data[feature] <= avg[i]])

                    else:
                        prob_yes1 = sum(data[outcome][data[feature]>= avg[i]])/len(data[data[feature]>= avg[i]])
                        prob_no1 = 1 - prob_yes1 
                    
                        prob_yes0 = sum(data[outcome][data[feature]< avg[i]])/len(data[data[feature]< avg[i]])
                        prob_no0 = 1 - prob_yes0
                        
                        weight_sum1 = sum(data['weight'][data[feature] >= avg[i]])
                        weight_sum0 = sum(data['weight'][data[feature] < avg[i]])
                    
                    gini_impur1 = 1 - prob_yes1**2 - prob_no1**2
                    gini_impur0 = 1 - prob_yes0**2 - prob_no0**2
                    gini_score = (weight_sum1*gini_impur1 + weight_sum0*gini_impur0)
                    
                    if  gini_score < lowest_gini:
                        lowest_gini = gini_score
                        best_cutoff = avg[i]
                        best_prob_yes0 = prob_yes0
                        best_prob_yes1 = prob_yes1 
                        
                gini_index['gini_score'][gini_index['feature']==feature] = lowest_gini
                gini_index['continuous'][gini_index['feature']==feature] = 1
                gini_index['best_cutoff'][gini_index['feature']==feature] = best_cutoff
                gini_index['prob_yes0'][gini_index['feature']==feature] = best_prob_yes0
                gini_index['prob_yes1'][gini_index['feature']==feature] = best_prob_yes1
                
        return gini_index
                                       
#gini_index = gini(data = df1, outcome = 'inflammation')
gini_index = gini(data = df1, outcome = 'nephritis')


print(gini_index)
print(gini_index['best_cutoff'][0])

                
                

         feature continuous gini_score best_cutoff prob_yes0 prob_yes1
0           temp          1   0.138889       37.95         0  0.833333
1         nausea          0   0.269231        None  0.230769         1
2    lumbar_pain          0   0.238095        None         0  0.714286
3  urine_pushing          0   0.458333        None      0.25       0.5
4    micturition          0   0.475271        None  0.344262  0.491525
5        burning          0   0.438095        None  0.285714       0.6
6   inflammation          0   0.468788        None  0.508197  0.322034
37.95


The _gini index_ table looks good! Now we need to use the data from this table to build our stump!

## 4 Building stump function

In [263]:
def stump_cont(feature, greater_than, cutoff):  #need to create a function for continuous and binary features
    pred = []
    if min(feature) == cutoff:
        if greater_than == True:
            for feat in feature:
                if feat <= cutoff:
                    pred.append(0)
                else:
                    pred.append(1)
        else:
            for feat in feature:
                if feat <= cutoff:
                    pred.append(1)
                else:
                    pred.append(0)
    else:
        if greater_than == True:
            for feat in feature:
                if feat < cutoff:
                    pred.append(0)
                else:
                    pred.append(1)
        else:
            for feat in feature:
                if feat < cutoff:
                    pred.append(1)
                else:
                    pred.append(0)
    return pred

def stump_binary(feature, yes_value):    #yes_value is 1 or 0 (whatever value yields yes prediction)
    pred = [] 
    if yes_value == 1:
        for feat in feature:
            if feat == 1:       
                pred.append(1)
            else:
                pred.append(0)
    else:
        for feat in feature:
            if feat == 1:
                pred.append(0)
            else:
                pred.append(1)
    return(pred)
            
            

##  5 making training function
the way we're 'training' the model is to store in a table the features/parameters that 'define' the stumps of the final model as result of the training algorithm

In [328]:
df1 = pd.DataFrame.copy(df) 
def train(data,outcome, stop_error = 0, max_rounds = 50):
    total_error = 1
    prev_total_error = 100
    cycle_num = 0 
    train_feat_list = [] #list for storing feature of each round
    train_cont = [] #keeping track if above feature is continuous or not
    train_greater_than = [] #for continuous features if greater than or less than cutoff
    train_cutoff = []  #keeping track of cutoff for continuous variable trees
    train_yes_value = [] #keeping track of wheter 0 or 1 predicts outcome for binary variables
    train_amnt_of_say = [] #to keep track of amount of say for each 'stump'
    train_total_error = []
    
    while (total_error > stop_error and cycle_num < max_rounds and prev_total_error > total_error):
        gini_index = gini(data = data, outcome = outcome)
        best_feature = list(gini_index['feature'][gini_index['gini_score'] == min(gini_index['gini_score'])])[0]
        train_feat_list.append(best_feature)
        predictions = []
        new_weight = [0]*len(data)
        for i in range(0,len(gini_index)):   #grabbing row number of best feature 
            if gini_index['feature'][i] == best_feature:
                best = i
                
            
        if gini_index['continuous'][best] == 0:
            train_cont.append(0)
            train_cutoff.append(None)
            train_greater_than.append(None)
            if gini_index['prob_yes0'][best] > gini_index['prob_yes1'][best]:
                train_yes_value.append(0)
                predictions = stump_binary(feature = data[best_feature], yes_value = 0)
           
            else:
                train_yes_value.append(1)
                predictions = stump_binary(feature = data[best_feature], yes_value = 1)
        else:
            train_cont.append(1)
            train_cutoff.append(gini_index['best_cutoff'][best])
            train_yes_value.append(None)
            if gini_index['prob_yes0'][best] > gini_index['prob_yes1'][best]:
                train_greater_than.append(False)
                predictions = stump_cont(feature = data[best_feature], 
                                         greater_than = False, 
                                         cutoff = gini_index['best_cutoff'][best])
            
            else:
                train_greater_than.append(True)
                predictions = stump_cont(feature = data[best_feature], 
                                         greater_than = True, 
                                         cutoff = gini_index['best_cutoff'][best])
                        
        data['prediction'] = predictions
        prev_total_error = total_error
        total_error = sum(data['weight'][data['prediction']!= data[outcome]])
        amount_of_say = (1/2)*np.log((1-total_error)/(total_error))
        train_amnt_of_say.append(amount_of_say)
        train_total_error.append(total_error)
    
    #calculate new weights:
        for i in range(0,len(data)):
            if data[outcome][i] != data['prediction'][i]:
                new_weight[i] = data['weight'][i]*np.e**(amount_of_say)
            else:
                new_weight[i] = data['weight'][i]*np.e**(-amount_of_say)
        data['weight'] = new_weight
        cycle_num = cycle_num + 1
        #return data, amount_of_say
    
    train_results = pd.DataFrame({
        'feature':train_feat_list,
        'continuous':train_cont,
        'greater_than':train_greater_than,
        'cutoff':train_cutoff,
        'yes_value':train_yes_value,
        'amount_of_say': train_amnt_of_say,
        'total_error': train_total_error
    })
    return train_results

results = train(data = df1 ,outcome = 'inflammation',stop_error = 0, max_rounds = 50)
#results = train(data = df1 ,outcome = 'nephritis',stop_error = 0, max_rounds = 50)
print(results)


     temp  nausea  lumbar_pain  urine_pushing  micturition  burning  \
0    35.5       0            1              0            0        0   
1    35.9       0            0              1            1        1   
2    35.9       0            1              0            0        0   
3    36.0       0            0              1            1        1   
4    36.0       0            1              0            0        0   
..    ...     ...          ...            ...          ...      ...   
115  41.4       0            1              1            0        1   
116  41.5       0            0              0            0        0   
117  41.5       1            1              0            1        0   
118  41.5       0            1              1            0        1   
119  41.5       0            1              1            0        1   

     inflammation  nephritis    weight  prediction  
0               0          0  0.000388           0  
1               1          0  0.000388   

In [294]:
def predict(train_results, data, outcome):
    pred_after_say = [0] * len(data)
    final_prediction = []
    for i in range(0,len(train_results)):
        feature = train_results['feature'][i]
        if train_results['continuous'][i] == 1:
            prediction = stump_cont(feature=data[feature],
                                    greater_than = train_results['greater_than'][i],
                                    cutoff = train_results['cutoff'][i])
        else:
            prediction = stump_binary(feature = data[feature], 
                                      yes_value=train_results['yes_value'][i])
            
        for k in range(0,len(prediction)):
            if prediction[k] == 1:
                pred_after_say[k] = pred_after_say[k] + train_results['amount_of_say'][i]
            else:
                pred_after_say[k] = pred_after_say[k] - train_results['amount_of_say'][i]
    for pred in pred_after_say:
        if pred >= 0:
            final_prediction.append(1)
        else:
            final_prediction.append(0)
    data['prediction'] = final_prediction
    
    correct = []
    for z in range(0,len(data)):
        if data[outcome][z] == data['prediction'][z]:
            correct.append(1)
        else:
            correct.append(0)
    data['correct'] = correct
    error_rate = sum(correct)/len(data)
    print('complete! percent correct = ',error_rate)
    return data, error_rate

df1, error_rate = predict(data= df1, train_results = results, outcome = 'inflammation')
                

complete! percent correct =  0.825


Great! It looks like we have the model training and predicting programs down! Let's see how well the model works when we seperate the data into a training and testing dataset

## testing the model by splitting the dataset 

In [364]:
df1 = pd.DataFrame.copy(df)
import random

#creating function to randomly assign observations to train/test datasets
def split(data):
    data['assignment'] = list(range(0,len(data)))
    random.shuffle(data['assignment'])
    #80 percent going to train dataset
    data_train = pd.DataFrame.copy(data[data['assignment']<= 0.8*len(data)])
    #20 percent going to test data set
    data_test = pd.DataFrame.copy(data[data['assignment'] > 0.2*len(data)])
    data_train.reset_index(drop=True, inplace = True)
    data_test.reset_index(drop=True, inplace = True)
   
    return data_train, data_test


data_train, data_test = split(df1)

del data_train['assignment']
del data_test['assignment']

for col in data_train.columns:
    data_train[col] = data_train[col].apply(lambda x: clean_data(x))
data_train['weight'] = 1/len(data_train)
for col in data_test.columns:
    data_test[col] = data_test[col].apply(lambda x: clean_data(x))
data_test['weight'] = 1/len(data_test)

#training model with training dataset 
train_results = train(data = data_train, outcome = 'inflammation')

data_final, error_rate = predict(train_results = train_results, data = data_test, outcome = 'inflammation')



complete! percent correct =  0.8526315789473684


Looks good! 