# Chua Ki Min Clement 7083543
## CSCI 316 Individual Assignment 1 Task 2
### Dataset: Census Income
##### Source: https://archive.ics.uci.edu/ml/datasets/Adult
### Objective: The objective of this task is to implement a Decision Tree classification method from scratch to predict whether income exceeds 50K/yr based on census data. Thus, this is a binary classification problem. The training and test sets are pre-defined in the data set (i.e., in “adult.data” and “adult.test”).
### Requirement:
##### (1) Implement two DT models by choosing any two (2) split criteria from Information Gain, Gain Ratio, Gini Index and Variance. Note that you can use either binary-split or multiple-split. 
##### (2) Use (approximately) 2/3 records in “adult.data” for training, and 1/3 records in “adult.data” for postpruning. 
##### (3) Report the accuracy of each model.
##### (4) All DT models must be self-implemented. You CANNOT use any machine learning library in this task. 
##### (5) It is recommended that your implementation includes a “tree induction function”, a “classification function” and a “post-pruning function”.
##### (6) You can (but not must) use any suitable pre-processing method. You also can (but not must) use any reasonable early stopping criteria (pre-pruned parameters such as number of splits, minimum data set size, and split threshold) to improve the training speed. If you do so, explain your reasons. 
##### (7) Present clear and accurate explanation of your implementation and results (in the Markdown format).

In [1]:
#importing libraries needed for this task
import pandas as pd
import numpy as np
from pprint import pprint


### Import the adult.data file (which I converted to a CSV file) as a dataframe and split it into 2/3 of the records for training and 1/3 for validation (post-pruning). Afterwards I reset the index of both parts so that it starts from 0, which prevents index-related errors later on, for example when testing against the testing dataset.

In [2]:
#importing our dataset adult file
columns = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship",
           "race","sex","capital-gain","capital-loss","hours-per-week","native-country","income"]
df=pd.read_csv('adult.data',names=columns)

In [3]:
#duplicate df dataframe, split it into training and validation for post-pruning
dfdupl=df.copy()
dftrain=dfdupl.sample(frac=0.67,random_state=1)
dfpostp=dfdupl.drop(dftrain.index)

In [4]:
#relabel the index starting from 0
dftrain=dftrain.reset_index(drop=True)
dfpostp=dfpostp.reset_index(drop=True)

## Feature engineering
### Although there are no NaN values in our dataset, some columns contain the special character '?'. I replace it with NaN and drop every row that contains a NaN value.
### Some features/columns have too many distinct values, such as 'fnlwgt', 'age', 'capital-gain', 'capital-loss' and 'native-country', which may cause inefficiencies when building our decision tree model, so I drop them. 'education' and 'education-num' carry the same information, so I drop 'education-num'.
### All categorical features/columns are changed into numerical data using the map function.
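
The same cleaning steps are applied to the training, validation and test data, so they could be collected in one helper. The sketch below is only an illustration under assumptions: the function name `clean_and_encode`, the toy dataframe and the shortened mappings are hypothetical and are not part of this notebook.

```python
import numpy as np
import pandas as pd

def clean_and_encode(frame, drop_columns, mappings):
    """Return a cleaned copy: rows with '?' removed, columns dropped, categories mapped to ints."""
    out = frame.replace('?', np.nan).dropna(how='any')
    out = out.drop(columns=drop_columns).reset_index(drop=True)
    for column, mapping in mappings.items():
        out[column] = out[column].map(mapping).astype(int)
    return out

# hypothetical toy data with the same kind of values as the census file
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['Private', '?', 'State-gov'],
    'sex': ['Male', 'Female', 'Male'],
    'income': ['<=50K', '>50K', '<=50K'],
})
toy_mappings = {
    'workclass': {'State-gov': 1, 'Private': 5},
    'sex': {'Male': 0, 'Female': 1},
    'income': {'<=50K': 0, '>50K': 1},
}
toy_clean = clean_and_encode(toy, ['age'], toy_mappings)
print(toy_clean)
```

Calling one helper on each dataframe would keep the three copies of the mapping code from drifting apart.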

In [5]:
# Finding the '?' in dataframe 
dftrain.isin(['?']).sum(axis=0)

age                  0
workclass         1243
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1247
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     396
income               0
dtype: int64

In [6]:
#replace '?' with nan (the rows containing nan are dropped in a later cell)
dftrain[dftrain=='?']=np.nan

In [7]:
#checking whether '?' has been replaced with null
dftrain.isnull().sum()

age                  0
workclass         1243
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1247
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     396
income               0
dtype: int64

In [8]:
#dropping the null values
dftrain.dropna(how='any',inplace=True)

In [9]:
#finding value counts for each feature/column
for c in dftrain[columns]:
    print ("%s :" % c)
    print(dftrain[c].value_counts())

age :
31    593
33    562
34    562
25    561
36    561
     ... 
77      6
85      3
83      2
84      2
88      1
Name: age, Length: 71, dtype: int64
workclass :
Private             14925
Self-emp-not-inc     1652
Local-gov            1387
State-gov             862
Self-emp-inc          708
Federal-gov           648
Without-pay             9
Name: workclass, dtype: int64
fnlwgt :
113364    10
203488     9
148995     9
164190     9
111483     9
          ..
233428     1
182237     1
97030      1
73091      1
93169      1
Name: fnlwgt, Length: 15111, dtype: int64
education :
HS-grad         6589
Some-college    4453
Bachelors       3374
Masters         1078
Assoc-voc        852
11th             720
Assoc-acdm       676
10th             556
7th-8th          379
Prof-school      357
9th              306
Doctorate        263
12th             260
5th-6th          197
1st-4th          103
Preschool         28
Name: education, dtype: int64
education-num :
9     6589
10    4453
13    3374
14 

In [10]:
#finding the number of unique values in each feature/column
for i in dftrain:
    print ("%s :" % i)
    print (dftrain[i].nunique())

age :
71
workclass :
7
fnlwgt :
15111
education :
16
education-num :
16
marital-status :
7
occupation :
14
relationship :
6
race :
5
sex :
2
capital-gain :
112
capital-loss :
83
hours-per-week :
89
native-country :
41
income :
2


In [11]:
#dropping these features/columns based on the uniqueness of the data
dftrain.drop(['age', 'fnlwgt', 'capital-gain','capital-loss', 'native-country','education-num'], axis=1, inplace=True)

## Finding all unique values in each feature/column
### 'workclass' : 'Self-emp-inc': 0, 'State-gov': 1,'Federal-gov': 2, 'Without-pay': 3, 'Local-gov': 4,'Private': 5, 'Self-emp-not-inc': 6
### 'education' : 'Some-college': 0, 'Preschool': 1, '5th-6th': 2, 'HS-grad': 3, 'Masters': 4, '12th': 5, '7th-8th': 6, 'Prof-school': 7,'1st-4th': 8, 'Assoc-acdm': 9, 'Doctorate': 10, '11th': 11,'Bachelors': 12, '10th': 13,'Assoc-voc': 14,'9th': 15
### 'marital-status' : 'Married-spouse-absent': 0,'Married-civ-spouse': 1, 'Married-AF-spouse': 2, 'Widowed': 3, 'Separated': 4, 'Divorced': 5,'Never-married': 6
### 'occupation' : 'Farming-fishing': 0, 'Tech-support': 1, 'Adm-clerical': 2, 'Handlers-cleaners': 3, 'Prof-specialty': 4,'Machine-op-inspct': 5, 'Exec-managerial': 6,'Priv-house-serv': 7,'Craft-repair': 8,'Sales': 9, 'Transport-moving': 10, 'Armed-Forces': 11, 'Other-service': 12,'Protective-serv':13
### 'relationship': 'Not-in-family': 0, 'Wife': 1, 'Other-relative': 2, 'Unmarried': 3,'Husband': 4,'Own-child': 5
### 'race' : 'Black': 0, 'Asian-Pac-Islander': 1,'Other': 2,'Amer-Indian-Eskimo': 3, 'White': 4
### 'sex' : 'Male': 0, 'Female': 1
### 'income' : '<=50K': 0, '>50K': 1

In [12]:
#columns that remain in dftrain after the drop
dftrain_column = ["workclass","education","marital-status","occupation","relationship","race","sex",
                  "hours-per-week","income"]
for c in dftrain[dftrain_column]:
    print ("%s :" % c)
    print(dftrain[c].unique())

workclass :
['Self-emp-not-inc' 'Private' 'Local-gov' 'Federal-gov' 'State-gov'
 'Self-emp-inc' 'Without-pay']
education :
['7th-8th' '11th' 'Bachelors' 'HS-grad' 'Some-college' 'Masters'
 'Assoc-voc' '5th-6th' '10th' '12th' 'Doctorate' 'Assoc-acdm'
 'Prof-school' '9th' '1st-4th' 'Preschool']
marital-status :
['Widowed' 'Never-married' 'Married-civ-spouse' 'Divorced' 'Separated'
 'Married-spouse-absent' 'Married-AF-spouse']
occupation :
['Other-service' 'Farming-fishing' 'Prof-specialty' 'Machine-op-inspct'
 'Handlers-cleaners' 'Exec-managerial' 'Adm-clerical' 'Craft-repair'
 'Sales' 'Transport-moving' 'Tech-support' 'Priv-house-serv'
 'Protective-serv' 'Armed-Forces']
relationship :
['Not-in-family' 'Other-relative' 'Own-child' 'Husband' 'Unmarried' 'Wife']
race :
['White' 'Black' 'Asian-Pac-Islander' 'Other' 'Amer-Indian-Eskimo']
sex :
['Female' 'Male']
hours-per-week :
[66 25 50 40 64 45 30 60 36 35 48 20 42 80 65 43 58 37 70 44 28 16 15 39
 51 75 10 55 90 24 38 62 13  9 52 14 61 54

In [13]:
# convert categorical data in the dataset to numerical data
dftrain['workclass']= dftrain['workclass'].map({'Self-emp-inc': 0, 'State-gov': 1,'Federal-gov': 2, 'Without-pay': 3, 
                                                'Local-gov': 4,'Private': 5, 'Self-emp-not-inc': 6}).astype(int)
dftrain['education']= dftrain['education'].map({'Some-college': 0, 'Preschool': 1, '5th-6th': 2, 'HS-grad': 3, 
                                                'Masters': 4, '12th': 5, '7th-8th': 6, 'Prof-school': 7,'1st-4th': 8, 
                                                'Assoc-acdm': 9, 'Doctorate': 10, '11th': 11,'Bachelors': 12, 
                                                '10th': 13,'Assoc-voc': 14,'9th': 15}).astype(int)
dftrain['marital-status'] = dftrain['marital-status'].map({'Married-spouse-absent': 0,'Married-civ-spouse': 1, 
                                                           'Married-AF-spouse': 2, 'Widowed': 3, 'Separated': 4, 
                                                           'Divorced': 5,'Never-married': 6}).astype(int)
dftrain['occupation'] = dftrain['occupation'].map({ 'Farming-fishing': 0, 'Tech-support': 1, 'Adm-clerical': 2, 
                                                   'Handlers-cleaners': 3, 'Prof-specialty': 4,'Machine-op-inspct': 5, 
                                                   'Exec-managerial': 6,'Priv-house-serv': 7,'Craft-repair': 8,'Sales': 9, 
                                                   'Transport-moving': 10, 'Armed-Forces': 11, 'Other-service': 12,
                                                   'Protective-serv':13}).astype(int)
dftrain['relationship'] = dftrain['relationship'].map({'Not-in-family': 0, 'Wife': 1, 'Other-relative': 2, 'Unmarried': 3,
                                                       'Husband': 4,'Own-child': 5}).astype(int)
dftrain['race'] = dftrain['race'].map({'Black': 0, 'Asian-Pac-Islander': 1,'Other': 2,'Amer-Indian-Eskimo': 3, 
                                       'White': 4}).astype(int)
dftrain['sex'] = dftrain['sex'].map({'Male': 0, 'Female': 1}).astype(int)
dftrain['income']=dftrain['income'].map({'<=50K': 0, '>50K': 1}).astype(int)


## The first decision tree model that I implement uses the information gain algorithm
### We first calculate the entropy of our target feature, 'income', over the dataset, and the weighted entropy after splitting on each feature.
### We then compute the information gain of every feature; the feature with the highest information gain becomes the splitting node, and the decision tree is built recursively on that basis.
### For our decision tree model we define the stopping criteria; if one of them is satisfied, we return a leaf node. If all records share the same target value, return that value. If the dataset is empty, return the mode target feature value of the original dataset. If the feature space is empty, return the mode target feature value of the direct parent node.
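
As a small worked example of the two quantities described above, the sketch below computes entropy and information gain on four toy records. The names `toy_entropy`, `toy_info_gain` and the toy arrays are assumptions made for illustration; the functions used by the notebook are defined in the following cells.

```python
import numpy as np

def toy_entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def toy_info_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of records that fall into this branch
        weighted += mask.mean() * toy_entropy(labels[mask])
    return toy_entropy(labels) - weighted

income = np.array([0, 0, 1, 1])
sex = np.array([0, 0, 1, 1])      # separates the two classes perfectly
race = np.array([0, 1, 0, 1])     # carries no information about income

print(toy_entropy(income))          # entropy of a 50/50 target
print(toy_info_gain(sex, income))   # gain of a perfect split
print(toy_info_gain(race, income))  # gain of an uninformative split
```

A 50/50 target has an entropy of 1 bit, so the perfect split gains the full bit while the uninformative split gains nothing, which is why 'sex' would be chosen as the splitting node in this toy case.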

In [14]:
#Calculate the entropy of a column; the col parameter specifies the target column
def entropy(col):
    elements,counts = np.unique(col,return_counts = True)
    probabilities = counts/np.sum(counts)
    #H = -sum(p * log2(p)) over the distinct values of the column
    return np.sum(-probabilities*np.log2(probabilities))

### Calculate the information gain of a feature. This function has three arguments: the first is the dataset on which the information gain is calculated, the second is the name of the feature whose information gain is calculated, and the third is the name of the target feature, which is 'income'.

In [15]:
#calculate entropy,weighted entropy,values and counts for the split attribute and information gain
def InfoGain(data,split_attribute_name,target_name="income"):
    total_entropy = entropy(data[target_name])
    vals,counts= np.unique(data[split_attribute_name],return_counts=True)
    Weighted_Entropy = np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[split_attribute_name]==vals[i]).
                                                                  dropna()[target_name]) for i in range(len(vals))])
    information_Gain = total_entropy - Weighted_Entropy
    return information_Gain

## This function builds the tree and has five arguments. The first is the dataset used to build the (sub)tree. The second is the original dataset, which is used when the dataset delivered by the first argument is empty. The third is the list of features of the dataset, the fourth is the target attribute, and the last is the mode target feature value of the parent node of a specific node.


In [16]:
def information_gain_tree_model(data,originaldata,features,target_attribute_name="income",parent_node_class = None):

    
    
    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]
    
    elif len(data)==0:
        return np.unique(originaldata[target_attribute_name])[np.argmax(np.unique(originaldata[target_attribute_name],
                                                                                  return_counts=True)[1])]
    
    elif len(features) ==0:
        return parent_node_class
    
    
    else:
        parent_node_class = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name],
                                                                                       return_counts=True)[1])]
        
        #Select the feature which best splits the dataset
        #Return the information gain values for the features in the dataset
        item_values = [InfoGain(data,feature,target_attribute_name) for feature in features] 
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]
        
        #The root gets the name of the feature (best_feature) with the maximum information gain in the first run
        tree = {best_feature:{}}
        
        
        #Remove the feature with the best information gain from the feature space
        features = [i for i in features if i != best_feature]
        
        #Grow a branch under the root node for each possible value of the root node feature
        
        for value in np.unique(data[best_feature]):
            #Split the dataset along the value of the feature with the largest information gain
            sub_data = data.where(data[best_feature] == value).dropna()
            
            #Call the information gain algorithm for each of those sub_datasets with the new parameters
            #(pass originaldata on instead of the global raw dataframe df, whose labels are not encoded)
            subtree = information_gain_tree_model(sub_data,originaldata,features,target_attribute_name,parent_node_class)
            
            tree[best_feature][value] = subtree
            
        return(tree)    

### The prediction function takes two main arguments: the first is an instance of new, unseen data in the form of a dictionary, and the second is the tree that has been built. For each feature in the query, we check whether its name equals the name of the root node of the tree. If it does, we follow the outgoing branch of the root node whose value equals the query's value for that feature. If the branch ends in a leaf node (anything that is not a dict object), we return that value, which is the prediction. If it ends in another node, we repeat the lookup on that subtree with the same query. If the query value has no matching branch, so that no classification is possible, a default value is returned.
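
To make the traversal concrete, here is a minimal sketch on a hand-written tree. It is an iterative variant written for illustration, not the notebook's own function; `toy_predict` and `toy_tree` are hypothetical names, and the sketch assumes every node is a dict with exactly one feature key.

```python
def toy_predict(query, tree, default=1):
    """Walk a nested-dict tree; return a leaf value, or default if no branch matches."""
    while isinstance(tree, dict):
        feature = next(iter(tree))      # assumed: each node dict holds a single feature key
        branches = tree[feature]
        value = query.get(feature)
        if value not in branches:
            return default              # unseen value or missing feature
        tree = branches[value]          # descend into the matching branch
    return tree

# hypothetical tree: split on 'relationship', then on 'education'
toy_tree = {'relationship': {0: 0,
                             4: {'education': {3: 0, 12: 1}}}}

print(toy_predict({'relationship': 4, 'education': 12}, toy_tree))  # follows 4, then 12
print(toy_predict({'relationship': 0, 'education': 3}, toy_tree))   # leaf directly under 0
print(toy_predict({'relationship': 2, 'education': 3}, toy_tree))   # no branch for 2
```

The third query has no branch for its 'relationship' value, so the default is returned instead of raising an error.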

In [17]:
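# for prediction of new data 
def predict(query,tree,default = 1):

    for key in list(query.keys()):
        if key in list(tree.keys()):

            try:
                result = tree[key][query[key]] 
            except KeyError:
                # the tree has no branch for this feature value
                return default

            if isinstance(result,dict):
                # internal node: keep descending and pass the default on
                return predict(query,result,default)

            else:
                return result

    # no feature of the query matches the node of this (sub)tree
    return default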


### Check the accuracy of our tree
### Create the query instances by removing the target feature column from the dataset and converting the remaining columns into a list of dictionaries, then compare each prediction with the true 'income' value.
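
The accuracy calculation itself can be sketched without pandas. In the example below, `accuracy_percent`, `toy_records` and the one-rule classifier `toy_classify` are hypothetical stand-ins for the notebook's dataframe, tree and predict function.

```python
def accuracy_percent(records, classify, target='income'):
    """Percentage of records whose target equals classify(record without the target)."""
    correct = 0
    for record in records:
        # build the query by removing the target feature from the record
        query = {k: v for k, v in record.items() if k != target}
        if classify(query) == record[target]:
            correct += 1
    return 100.0 * correct / len(records)

toy_records = [
    {'sex': 0, 'income': 1},
    {'sex': 0, 'income': 0},
    {'sex': 1, 'income': 0},
    {'sex': 1, 'income': 0},
]

def toy_classify(query):
    # hypothetical one-rule classifier standing in for predict(query, tree)
    return 1 if query['sex'] == 0 else 0

print(accuracy_percent(toy_records, toy_classify))
```

Three of the four toy records are classified correctly, so the function reports 75.0.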

In [18]:
def accuracy_test(data,tree):

    queries = data.iloc[:,:-1].to_dict(orient = "records")
    

    predicted = pd.DataFrame(columns=["predicted"]) 

    for i in range(len(data)):
        predicted.loc[i,"predicted"] = predict(queries[i],tree,1.0) 
    print('The prediction accuracy is: ',(np.sum(predicted["predicted"] == data["income"])/len(data))*100,'%')

    
    

In [19]:
# build tree of our training set using the model of the information gain algorithm
tree = information_gain_tree_model(dftrain,dftrain,dftrain.columns[:-1])

### Import the adult.test file as a dataframe. I clean all the special characters in the dataframe, as the testing dataset also contains '?', and then reset the index so that it starts from 0, which prevents index-related errors later on when testing on the testing dataset. As with our training set, I remove all irrelevant features.

In [20]:
#importing test adult dataset 
columns = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race",
           "sex","capital-gain","capital-loss","hours-per-week","native-country","income"]
df2=pd.read_csv('adult.test',names=columns)

In [21]:
# Finding the '?' in dataframe 
df2.isin(['?']).sum(axis=0)

age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64

In [22]:
#replace '?' with nan (the rows containing nan are dropped in the next cell)
df2[df2=='?']=np.nan

In [23]:
#dropping the null values
df2.dropna(how='any',inplace=True)

In [24]:
#relabel the index starting from 0
df2=df2.reset_index(drop=True)

In [25]:
#dropping these features/columns, same as for our training data
df2.drop(['age', 'fnlwgt', 'capital-gain','capital-loss', 'native-country','education-num'], axis=1, inplace=True)

In [26]:
# convert all categorical data in the dataset to numerical data
df2['workclass']= df2['workclass'].map({'Self-emp-inc': 0, 'State-gov': 1,'Federal-gov': 2, 'Without-pay': 3, 
                                        'Local-gov': 4,'Private': 5, 'Self-emp-not-inc': 6}).astype(int)
df2['education']= df2['education'].map({'Some-college': 0, 'Preschool': 1, '5th-6th': 2, 'HS-grad': 3, 'Masters': 4, 
                                        '12th': 5, '7th-8th': 6, 'Prof-school': 7,'1st-4th': 8, 'Assoc-acdm': 9, 
                                        'Doctorate': 10, '11th': 11,'Bachelors': 12, '10th': 13,
                                        'Assoc-voc': 14,'9th': 15}).astype(int)
df2['marital-status'] = df2['marital-status'].map({'Married-spouse-absent': 0,'Married-civ-spouse': 1, 
                                                   'Married-AF-spouse': 2, 'Widowed': 3, 'Separated': 4, 
                                                   'Divorced': 5,'Never-married': 6}).astype(int)
df2['occupation'] = df2['occupation'].map({ 'Farming-fishing': 0, 'Tech-support': 1, 'Adm-clerical': 2, 
                                           'Handlers-cleaners': 3, 'Prof-specialty': 4,'Machine-op-inspct': 5, 
                                           'Exec-managerial': 6,'Priv-house-serv': 7,'Craft-repair': 8,'Sales': 9, 
                                           'Transport-moving': 10, 'Armed-Forces': 11, 'Other-service': 12,
                                           'Protective-serv':13}).astype(int)
df2['relationship'] = df2['relationship'].map({'Not-in-family': 0, 'Wife': 1, 'Other-relative': 2, 'Unmarried': 3,
                                               'Husband': 4,'Own-child': 5}).astype(int)
df2['race'] = df2['race'].map({'Black': 0, 'Asian-Pac-Islander': 1,'Other': 2,'Amer-Indian-Eskimo': 3, 
                               'White': 4}).astype(int)
df2['sex'] = df2['sex'].map({'Male': 0, 'Female': 1}).astype(int)
df2['income']=df2['income'].map({'<=50K': 0, '>50K': 1}).astype(int)


In [27]:
# testing the accuracy on our testing set with the tree that is built from the training set
accuracy_test(df2,tree)

The prediction accuracy is:  75.77689243027889 %


## The second decision tree model that I implement uses the variance algorithm
### Variance measures the homogeneity of a node: if a node is entirely homogeneous, the variance is zero.
### For each split, individually calculate the variance of each child node, then calculate the variance of the split as the weighted average variance of its child nodes.
### Select the split with the lowest weighted variance and repeat recursively until completely homogeneous nodes are achieved.
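
The three steps above can be sketched on a toy node with six records. This is a simplified illustration that uses the textbook variance p(1-p) of a 0/1 target; the names `binary_variance` and `weighted_split_variance` are assumptions, and the notebook's own functions in the following cells are written differently.

```python
def binary_variance(labels):
    """Variance p*(1-p) of a list of 0/1 labels; it is 0 for a pure node."""
    p = sum(labels) / len(labels)
    return p * (1 - p)

def weighted_split_variance(groups):
    """Weighted average variance of the child nodes produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * binary_variance(g) for g in groups)

parent = [0, 0, 0, 1, 1, 1]
split_a = [[0, 0, 0], [1, 1, 1]]    # both children are pure
split_b = [[0, 0, 1], [0, 1, 1]]    # both children are still mixed

print(binary_variance(parent))            # variance before splitting
print(weighted_split_variance(split_a))   # lowest possible value
print(weighted_split_variance(split_b))   # only slightly below the parent
```

Split A drives the weighted variance to zero, so it would be selected over split B.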

In [28]:
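# finding the distinct values of a sequence and, optionally, the counts of the 0 and 1 labels
def unique(seq, return_counts=False, id=None):
   
    found = set()
    if id is None:
        for x in seq:
            found.add(x)
           
    else:
        for x in seq:
            x = id(x)
            if x not in found:
                found.add(x)
    found = list(found)           
    # counts are hard-coded for a binary 0/1 target: counts[0] is the number of 0s, counts[1] the number of 1s
    counts = [seq.count(0),seq.count(1)]
    if return_counts:
        return found,counts
    else:
        return found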
     

In [29]:
# total of the values in an iterable (note: this shadows Python's built-in sum)
def sum(data):
    sum = 0
    for i in data:
        sum = sum + i
    return sum

### Calculate the variance impurity of the target feature/column. This function has one argument: the values of the target feature for which the variance is calculated.

In [30]:
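def calculate_variance(target_values):
    values = list(target_values)
    elements,counts = unique(values,True)
    variance_impurity = 0
    sum_counts = sum(counts)
    # accumulates -sum(p_i^2), i.e. the Gini impurity 1 - sum(p_i^2) shifted by -1
    # (for a 0/1 target the Gini impurity is twice the variance p*(1-p));
    # the constant cancels in variance_impurity_gain, so the ranking of splits is unaffected
    for i in elements:
        variance_impurity += (-counts[i]/sum_counts*(counts[i]/sum_counts))
    return variance_impurity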

### Decide the optimal split for our algorithm from a root node

In [31]:
def variance_impurity_gain(data, split_attribute_name, target_attribute_name):
    data_split = data.groupby(split_attribute_name)
    aggregated_data = data_split.agg({target_attribute_name : [calculate_variance, lambda x: len(x)/(len(data.index) * 1.0)] 
                                     })[target_attribute_name]
    aggregated_data.columns = ['Variance', 'Observations']
    weighted_variance_impurity = sum( aggregated_data['Variance'] * aggregated_data['Observations'] )
    total_variance_impurity = calculate_variance(data[target_attribute_name])
    variance_impurity_gain = total_variance_impurity - weighted_variance_impurity
    return variance_impurity_gain

## This function builds the tree and has four arguments: the first is the dataset used to build the tree, the second is the target attribute of the dataset, the third is the list of features/columns, and the last is the default class, which is returned when the dataset is empty or no features remain (it is None at the root).

In [32]:
def build_tree_using_variance_impurity(data, target_attribute_name, attribute_names, default_class=None):
    global node_number_variance
    from collections import Counter
    count_set = Counter(x for x in data[target_attribute_name])
    if len(count_set) == 1:
        return list(count_set.keys())[0]

    elif data.empty or (not attribute_names):
        return default_class 
    
    else:
        index_of_max = list(count_set.values()).index(max(count_set.values())) 
        default_class = list(count_set.keys())[index_of_max]
        variance_gain = [variance_impurity_gain(data, attr, target_attribute_name) for attr in attribute_names]
        index_of_max = variance_gain.index(max(variance_gain)) 
        best_attr = attribute_names[index_of_max]
         
        #The root gets the name of the feature (best_feature)
        tree = {best_attr:{}}
        #count both classes at this node to record its majority class
        positiveCount = data[target_attribute_name].value_counts()[1]
        negativeCount = data[target_attribute_name].value_counts()[0]
        if positiveCount>negativeCount :
            best_class = 1
        elif positiveCount<negativeCount:
            best_class = 0
        else:
            best_class = 'none'
        tree[best_attr]['best_class'] = best_class
        node_number_variance = node_number_variance + 1

        remaining_attribute_names = [i for i in attribute_names if i != best_attr]

        for attr_val, data_subset in data.groupby(best_attr):
            subtree = build_tree_using_variance_impurity(data_subset,
                        target_attribute_name,
                        remaining_attribute_names,
                        default_class)
            tree[best_attr][attr_val] = subtree
        return tree

In [33]:
node_number_variance = 0
labelValues = list(dftrain.columns.values)
labelValues.remove('income')

In [34]:
#building our second tree with the features and attributes
tree2 = build_tree_using_variance_impurity(dftrain,'income',labelValues)

In [35]:
# testing the accuracy of tree2
accuracy_test(df2,tree2)

The prediction accuracy is:  75.28552456839309 %


### For the validation dataset that is used for post-pruning, I apply the same steps as for the previous datasets: change all special characters '?' to null values and drop those rows. I also drop the same inefficient features/columns as for the training dataset, plus one extra column, hours-per-week, which I feel is not needed because it has too many different values.
### I then build my trees using the validation data and test their accuracy to see whether there is any difference.

In [36]:
#replace '?' with nan (the rows containing nan are dropped in the next cell)
dfpostp[dfpostp=='?']=np.nan

In [37]:
#dropping the null values
dfpostp.dropna(how='any',inplace=True)

In [38]:
# drop the unused columns and convert all categorical data in the dataset to numerical data for our validation set
dfpostp.drop(['age', 'fnlwgt', 'capital-gain','capital-loss', 'native-country','education-num','hours-per-week'],
             axis=1, inplace=True)
dfpostp['workclass']= dfpostp['workclass'].map({'Self-emp-inc': 0, 'State-gov': 1,'Federal-gov': 2, 'Without-pay': 3, 
                                                'Local-gov': 4,'Private': 5, 'Self-emp-not-inc': 6}).astype(int)
dfpostp['education']= dfpostp['education'].map({'Some-college': 0, 'Preschool': 1, '5th-6th': 2, 'HS-grad': 3, 
                                                'Masters': 4, '12th': 5, '7th-8th': 6, 'Prof-school': 7,'1st-4th': 8, 
                                                'Assoc-acdm': 9, 'Doctorate': 10, '11th': 11,'Bachelors': 12, '10th': 
                                                13,'Assoc-voc': 14,'9th': 15}).astype(int)
dfpostp['marital-status'] = dfpostp['marital-status'].map({'Married-spouse-absent': 0,'Married-civ-spouse': 1, 
                                                           'Married-AF-spouse': 2, 'Widowed': 3, 'Separated': 4, 
                                                           'Divorced': 5,'Never-married': 6}).astype(int)
dfpostp['occupation'] = dfpostp['occupation'].map({ 'Farming-fishing': 0, 'Tech-support': 1, 'Adm-clerical': 2, 
                                                   'Handlers-cleaners': 3, 'Prof-specialty': 4,'Machine-op-inspct': 5,
                                                   'Exec-managerial': 6,'Priv-house-serv': 7,'Craft-repair': 8,'Sales': 
                                                   9, 'Transport-moving': 10, 'Armed-Forces': 11, 'Other-service': 12,
                                                   'Protective-serv':13}).astype(int)
dfpostp['relationship'] = dfpostp['relationship'].map({'Not-in-family': 0, 'Wife': 1, 'Other-relative': 2, 'Unmarried': 3,
                                                       'Husband': 4,'Own-child': 5}).astype(int)
dfpostp['race'] = dfpostp['race'].map({'Black': 0, 'Asian-Pac-Islander': 1,'Other': 2,'Amer-Indian-Eskimo': 3, 
                                       'White': 4}).astype(int)
dfpostp['sex'] = dfpostp['sex'].map({'Male': 0, 'Female': 1}).astype(int)
dfpostp['income']=dfpostp['income'].map({'<=50K': 0, '>50K': 1}).astype(int)

In [39]:
# I also drop hours-per-week from my testing dataset so that it has the same features as my validation dataset
df2.drop(['hours-per-week'], axis=1, inplace=True)

In [40]:
#build the information gain tree on the validation set (hours-per-week removed)
tree3 = information_gain_tree_model(dfpostp,dfpostp,dfpostp.columns[:-1])

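### Comparing the accuracy of tree3 (information gain algorithm), 78.77%, and tree4 (variance algorithm), 78.59%, obtained after removing the unnecessary column for post-pruning, with that of tree (information gain algorithm), 75.78%, and tree2 (variance algorithm), 75.29%, we can see the difference after doing post-pruning. Note that tree3 and tree4 are rebuilt on the validation set with the hours-per-week feature removed, rather than being pruned versions of tree and tree2.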
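
For comparison, a more conventional post-pruning step is reduced-error pruning, which keeps the tree built on the training set and uses the validation set only to decide which subtrees to collapse. The sketch below is a swapped-in technique shown under assumptions, not what this notebook does: `classify`, `reduced_error_prune` and the toy data are hypothetical, the tree is assumed to be a nested dict with one feature key per node, and the leaf label is taken as the majority class of the validation records reaching the node.

```python
from collections import Counter

def classify(query, tree, default=0):
    """Walk a nested-dict tree and return the leaf value, or default if no branch matches."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        branches = tree[feature]
        if query.get(feature) not in branches:
            return default
        tree = branches[query[feature]]
    return tree

def reduced_error_prune(tree, records, target='income', default=0):
    """Bottom-up pruning: replace a subtree with a majority-class leaf when the leaf
    is at least as accurate on the validation records that reach the subtree."""
    if not isinstance(tree, dict) or not records:
        return tree                      # leaf, or no validation evidence: keep as is
    feature = next(iter(tree))
    pruned = {feature: {}}
    for value, subtree in tree[feature].items():
        subset = [r for r in records if r.get(feature) == value]
        pruned[feature][value] = reduced_error_prune(subtree, subset, target, default)
    majority = Counter(r[target] for r in records).most_common(1)[0][0]
    leaf_correct = sum(r[target] == majority for r in records)
    tree_correct = sum(classify(r, pruned, default) == r[target] for r in records)
    return majority if leaf_correct >= tree_correct else pruned

# hypothetical tree and validation records
toy_tree = {'sex': {0: {'race': {0: 1, 1: 0}}, 1: 0}}
toy_validation = [
    {'sex': 0, 'race': 0, 'income': 1},
    {'sex': 0, 'race': 1, 'income': 1},
    {'sex': 0, 'race': 1, 'income': 1},
    {'sex': 1, 'race': 0, 'income': 0},
]
pruned_tree = reduced_error_prune(toy_tree, toy_validation)
print(pruned_tree)
```

In this toy case the 'race' subtree misclassifies two of the three validation records that reach it, so it is collapsed into a leaf, while the root split is kept because it classifies all four records correctly.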

In [41]:
# testing the accuracy of our tree
accuracy_test(df2,tree3)

The prediction accuracy is:  78.77158034528553 %


In [42]:
labelValues = list(dfpostp.columns.values)
labelValues.remove('income')
tree4 = build_tree_using_variance_impurity(dfpostp,'income',labelValues)

In [43]:
# testing the accuracy of our tree
accuracy_test(df2,tree4)

The prediction accuracy is:  78.59229747675963 %


## Predicting with one of our trees using the predict function: the query is a dictionary whose keys are our features and whose values are some of the values found in our dataset

In [44]:
query = {'workclass':5,'education':11,'marital-status':5,'occupation':5,'relationship':5,'race':0,'sex':0}

In [45]:
query = pd.Series(query)

In [46]:
predict(query,tree4)

0