In [None]:
'''
The code applies a Decision Tree classifier to the Adult data set. A decision tree is a tree-like structure. The topmost
node is the root node. Below the root are the internal nodes (branches), each of which holds a decision rule that splits the
data further. At the bottom are the leaf nodes, each of which assigns a class label. The tree uses an impurity measure, either
entropy or Gini, to choose the split at the root and at every internal node: the split that reduces impurity the most (for
entropy, the split with the highest information gain) is selected.
scikit-learn's decision tree only works on numerical data, so every categorical column is one-hot encoded.
The categorical data is transformed into 0's and 1's using scikit-learn's OneHotEncoder. OneHotEncoder turns every unique
value of a column into a feature of its own. A row gets the value 1 in the new column that matches the value it had in the
original column, and 0 in the other new columns.
Thus each categorical column is replaced by as many new columns as it has unique values.

E.g. suppose a data frame has 2 columns. After OneHotEncoder is applied to the categorical column (Color), the data frame is
transformed as shown below.
    Color   Price
    Red     2000
    Green   5000
    Yellow  9000

    Color_Red   Color_Green   Color_Yellow   Price
        1           0             0          2000
        0           1             0          5000
        0           0             1          9000
    
    
The code implements both a train-test split and cross-validation to see whether the accuracy score changes.

The code also uses scikit-learn's GridSearchCV class. GridSearchCV takes one estimator (here a Pipeline) and a parameter grid.
The grid can swap the classifier inside the pipeline, so several classifiers can be searched, each with its own parameters and
ranges of values. GridSearchCV fits a model for every combination of parameter values using cross-validation. In the end,
the model with the highest mean cross-validated accuracy is reported along with its parameter values.


Details of the algorithm are given below:-

1. Data is read into a pandas data frame. Any '?' value in a column is replaced with the mode of that column. The data is then
split into X (input) and Y (output/class).

2. The X data is separated into a continuous and a categorical data frame.

3. The categorical data is transformed into numeric format using OneHotEncoder. The OneHotEncoder object is fit on the
   categorical X data and transforms it. The output is an array in which every unique value of each column becomes
   a feature. The original values are replaced with 0's and 1's.

4. Concatenate the transformed categorical data and the continuous data. Train-test split this data into training and test data.

5. Create a Decision Tree object with all default parameters except criterion='entropy'. Fit it on the training X and Y data to
build a model.

6. Run the test data through the model created in point 5 to predict the class of every row.

7. Find the accuracy score by comparing the predicted data with the actual Y data of the test set.

Decision Tree with cross_val_score --

The decision tree is executed with cross-validation using the scikit method cross_val_score.

Steps for the execution of the code:-

1. Points 1-4 from above remain the same.
2. Call cross_val_score with the decision tree object, the training data, the Y data, cv=4 and scoring='accuracy'.
The method runs with 4 folds and returns one accuracy score per fold; the code reports their average.


Decision Tree and K-nearest neighbors classifier with GridSearchCV --

1. Points 1-4 from above remain the same.

2. Create a Pipeline object with a single classifier step.

3. Create a param_grid list with one dictionary per classifier, each holding that classifier's parameters and ranges of values.

4. Create a GridSearchCV object with the pipeline object, the param_grid and cv=3 (cross-validation).

5. Fit this object on the training data.

6. The attributes of GridSearchCV are printed to give the best classifier with its parameter values along with its accuracy
score.

In the end, you will get accuracy scores to compare the train-test and cross-validation approaches.
You can also compare the classifiers and choose the best one for the Adult data set.

The code also has a function that renders the Decision Tree with Graphviz and saves the image as a pdf file.

'''
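The Color/Price example from the description above can be reproduced in a few lines of plain Python. This is a minimal sketch of the idea only, not how OneHotEncoder is implemented. The helper name `one_hot` is hypothetical, and unlike OneHotEncoder (which sorts the categories) it keeps the values in order of first appearance so that the result matches the table.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) for a one-hot encoding of `values`.

    Hypothetical helper for illustration; categories keep their order of first appearance.
    """
    categories = list(dict.fromkeys(values))  # unique values, order preserved
    names = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


colors = ["Red", "Green", "Yellow"]
prices = [2000, 5000, 9000]

names, encoded = one_hot(colors, "Color")
header = names + ["Price"]
table = [row + [price] for row, price in zip(encoded, prices)]

print(header)
for row in table:
    print(row)
```

Each row contains exactly one 1 among the Color_* columns, and the continuous Price column is carried over unchanged.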

In [2]:
import pandas as pd
import numpy as np
import math
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
from sklearn.model_selection import cross_val_score
import pydotplus
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

import os 
import graphviz
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'



In [3]:
def read_data():
    '''
    Read the Adult data set with the pandas read_csv function and return it as a dataframe.
    The ' ?' values in each column are replaced with the mode value of that column.
    
    Return :- 
    data :- Dataframe of adult data set
    
    '''
    data=pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data',names=['Age','Workclass','FNLWGT','Education','Education-Num','Marital Status','Occupation','Relationship','Race','Sex','Capital Gain','Caplital Loss','Hrs-Per-Week','Native-Country','Sal'])
    
      
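    for col in data.columns:
        # Replace the missing-value marker ' ?' with the most frequent value of the column.
        # Assigning the result back avoids calling an in-place method on a column selection.
        data[col] = data[col].replace(' ?', data[col].mode()[0])

    # Remove the '&' from this country name; this only needs to run once, outside the loop.
    data['Native-Country'] = data['Native-Country'].replace(' Trinadad&Tobago', 'TrinadadTobago')

    return data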
    
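The mode replacement in read_data() can be tried on a small frame first. This is a sketch on made-up rows, assuming the same ' ?' marker (with its leading space) that appears in the Adult data; the frame name `toy` and its values are not part of the original code.

```python
import pandas as pd

# Toy data with the same ' ?' missing-value marker (note the leading space) as the Adult data set.
toy = pd.DataFrame({
    "Workclass": [" Private", " ?", " Private", " State-gov"],
    "Age": [39, 50, 38, 53],
})

for col in toy.columns:
    most_frequent = toy[col].mode()[0]   # mode() returns a Series; take its first entry
    toy[col] = toy[col].replace(" ?", most_frequent)

print(toy["Workclass"].tolist())
```

Numeric columns such as Age contain no ' ?' entries, so the replacement leaves them unchanged.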

In [4]:
def preprocess_data(df):
    
    '''
    Split the data into the input (X) and output (Y) data sets. The last column is the output.
    
    Argument :-
    df :- Adult data set
    
    Return :-
    x :- X data
    y:- Y data
    
    
    '''
    
    
    x=df.iloc[:,:-1]
    y=df.iloc[:,-1]
    
    
    return x,y

In [5]:
def labelencode(y_data):
    
    '''
    Label-encode the Y data, i.e. map each class label to an integer. This is done for the graphical display of the
    Decision Tree: the raw labels may not be in a suitable format, so they are label-encoded.
    
    Argument :- 
    y_data :- Y data
    
    Return :- 
    y_data :- Labelencoded Y data
    le    :- Object of LabelEncoder
    
    '''
    
    le=LabelEncoder()
    
    y_data=le.fit_transform(y_data)
    
    return y_data,le
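
What labelencode() returns can be seen on a handful of labels. The sketch below uses a made-up sample of class labels written the way they appear in the Sal column; LabelEncoder numbers the classes in sorted order and can map the integers back again.

```python
from sklearn.preprocessing import LabelEncoder

labels = [" <=50K", " >50K", " <=50K", " >50K", " >50K"]  # made-up sample of the Sal column

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))                    # classes in sorted order
print(encoded.tolist())                     # integer code of each label
print(list(le.inverse_transform([0, 1])))   # map codes back to the original labels
```

Returning the fitted encoder alongside the encoded data, as labelencode() does, keeps this reverse mapping available later.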

In [6]:
def cat_cont_data_split(xdata):
    
    '''
    The function is to split the X data into categorical and continuous data.
    
    
    
    Argument :- 
    xdata :- X data
    
    Return :- 
    x_cat_data :- Categorical data frame.
    x_cont_data :- Continuous data frame
    
      
    '''
    
    
    #split xdata data into categorical data

    x_cat_data=xdata[['Workclass','Education','Marital Status','Occupation','Relationship','Race','Sex','Native-Country']]
       
    #split xdata into continuous data

    x_cont_data=xdata[['Age','FNLWGT','Education-Num','Capital Gain','Caplital Loss','Hrs-Per-Week']]
    
    
    return x_cat_data,x_cont_data
    
    

In [7]:
def onehotencode(datatofit):
    
    '''
    Convert the categorical data into numeric format using OneHotEncoder. The OneHotEncoder object is fit on the
    categorical X data and transforms it. The output is an array in which every unique value of each column
    becomes a feature. The original values are replaced with 0's and 1's.
    
    Argument :- 
    datatofit :- X categorical data
    
    Return :- 
    x_data_oe :- One-hot encoded categorical data.
    oe        :- Object of OneHotEncoder
    
    '''
    
    oe=OneHotEncoder()
    x_data_oe=oe.fit_transform(datatofit).toarray()
    
    return x_data_oe,oe
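
The array returned by onehotencode() has no column names, but the fitted encoder remembers which category each output column stands for. The sketch below shows one way to recover readable names from the `categories_` attribute. The two toy columns and the `Column_value` naming scheme are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

input_names = ["Color", "Sex"]   # made-up column names for the toy data
data = np.array([["Red", "Male"], ["Green", "Female"], ["Yellow", "Male"]])

oe = OneHotEncoder()
encoded = oe.fit_transform(data).toarray()   # fit_transform returns a sparse matrix by default

# Output columns follow the input columns in order, with the categories of each column sorted.
column_names = [
    f"{name}_{category}"
    for name, categories in zip(input_names, oe.categories_)
    for category in categories
]

print(column_names)
print(encoded.tolist())
```

The first row (Red, Male) gets a 1 under Color_Red and Sex_Male and 0 elsewhere.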

In [8]:
def concat_cat_cont(xcat_ohe,xcont_data):
    
    '''
    Concatenate the transformed categorical data and the continuous data.
    
    
    Argument :- 
    xcat_ohe   :- One-hot encoded data
    xcont_data :- Continuous data.
    
    Return :- 
    concat_df :- Dataframe with both categorical and continuous data.
             
    '''
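    # Reuse the index of the continuous data so the rows line up during concatenation.
    xcat_ohe=pd.DataFrame(xcat_ohe,index=xcont_data.index)
    concat_df=pd.concat([xcont_data,xcat_ohe],axis=1)
    # The one-hot columns are numbered 0..n-1; recent scikit-learn releases reject mixed string and integer column names.
    concat_df.columns=concat_df.columns.astype(str)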
    
    return concat_df

In [9]:
def train_test(concat_cat_cont_df,ydata_lb):
    
    '''
    Train-test split the dataframe that holds the transformed categorical data and the continuous data.
    
    Argument :- 
    concat_cat_cont_df  :- Concatenated categorical and continuous data.
    ydata_lb            :- LabelEncoded Y data.
    
    Return :- 
    x_train :- Training data
    x_test  :- Test data
    y_train :- Label data of training data
    y_test  :- Label data of test data
    
    '''
    
    
    x_train,x_test,y_train,y_test=train_test_split(concat_cat_cont_df,ydata_lb,test_size=0.3,random_state=0)
    
    return x_train,x_test,y_train,y_test
    
    
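The effect of test_size=0.3 and random_state=0 can be checked on a tiny made-up data set. The ten rows below stand in for the concatenated data frame and are not taken from the Adult data.

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]   # ten made-up rows with one feature each
y = [0, 1] * 5

# Same arguments as in train_test(): 30% of the rows go to the test set,
# and a fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_again, X_test_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)

print(len(X_train), len(X_test))
print(X_test == X_test_again)
```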

In [10]:
def decision_tree(xtrain,ydata):
    '''
    Create an object of Decision Tree and fit the training X and Y data.
    
    Argument :- 
    xtrain :- Training data
    ydata  :- Y data of training.
    
    Return :- object of DT
    
    '''
    
    dt=DecisionTreeClassifier(criterion="entropy")
    dt.fit(xtrain,ydata)
    
    return dt
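
With criterion="entropy" the tree scores candidate splits by how much they reduce entropy. The sketch below works through that arithmetic in plain Python on a made-up node. The function names `entropy` and `information_gain` are illustrative and are not scikit-learn APIs.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Made-up node with 4 samples of each class, split two different ways.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect_split = [[0, 0, 0, 0], [1, 1, 1, 1]]
useless_split = [[0, 0, 1, 1], [0, 0, 1, 1]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers the full 1 bit of entropy of the parent, while a split that leaves both children as mixed as the parent gains nothing.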

In [11]:
def pred_test(dt_obj,xtest):
    
    '''
    The function is to run the test data over the model to get the prediction of every row.
    
    Argument :- 
    dt_obj :- Decision Tree object 
    xtest  :- test data.
    
    Return :- 
    ypred :- Array of predicted class for every row.
    
    
    '''
    
    ypred=dt_obj.predict(xtest)
    
    return ypred

In [12]:
def accuracy_score(ytest,y_pred):
    
    '''
    Print the accuracy score obtained by comparing the predicted classes with the actual Y data of the test set.
    
    Argument :- 
    ytest  :- Actual Y data of the test set
    y_pred :- Predicted array of classes
            
    '''

    print(metrics.accuracy_score(ytest,y_pred))

In [13]:
def cross_val_split(xtrain,ytrain):
    
    '''
    Run cross-validation with the cross_val_score method. The method takes the classifier object, the training data,
    the Y data, cv=number of folds and scoring='accuracy'. It returns one score per fold; the function prints their average.
    
    Argument :- 
    xtrain :- Training data.
    ytrain :- Y data.
   
    '''
    
    scores=cross_val_score(DecisionTreeClassifier(criterion="entropy"),xtrain,ytrain,cv=4,scoring='accuracy')
    dt_score=scores.mean()
    print(dt_score)
    
    
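cross_val_score returns one score per fold, and the average is taken afterwards. The sketch below shows this on synthetic data from make_classification, which stands in for the training data here and is an assumption of this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the Adult data set).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X, y, cv=4, scoring="accuracy",
)

print(scores)          # one accuracy value per fold
print(scores.mean())   # the kind of single number that cross_val_split() prints
```

Looking at the individual fold scores as well as the mean gives a rough idea of how much the accuracy varies between folds.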

In [14]:
def grid_search_dt():
    
    '''
    Run multiple classifiers using GridSearchCV and find the best one. A Pipeline object with a single 'classifier' step is
    created. param_grid is a list with one dictionary per classifier; each dictionary sets the classifier for that step and
    gives its parameters with a range of values. A GridSearchCV object is created with the pipeline object, the param_grid
    and cv=3 (cross-validation) and is fit on the training data (the global xtrain and ytrain). The attributes of
    GridSearchCV are printed to give the best classifier with its parameter values along with its accuracy score.
    
    '''
    k_range = list(range(3,9,2))
    m_depth=list(range(10,12))
    m_split=list(range(30,35))
    
    
    pipe = Pipeline([('classifier', KNeighborsClassifier())])


    param_grid = [{
        'classifier':[KNeighborsClassifier()],
        'classifier__n_neighbors':k_range,
        },
        {
        'classifier':[DecisionTreeClassifier()],
        'classifier__max_depth': m_depth,
        'classifier__min_samples_split':m_split,
        'classifier__criterion':['entropy','gini']
       }] 


    clf=GridSearchCV(pipe,param_grid,cv=3)
    clf.fit(xtrain,ytrain)
    print(clf.best_estimator_)
    print(clf.best_score_)

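The size of the search in grid_search_dt() can be worked out before running it. The sketch below uses scikit-learn's ParameterGrid to enumerate the combinations for the same value ranges. The strings "knn" and "tree" are placeholders for the classifier objects, an assumption made to keep the example light.

```python
from sklearn.model_selection import ParameterGrid

# Same value ranges as in grid_search_dt(); strings stand in for the classifier objects.
param_grid = [
    {"classifier": ["knn"], "classifier__n_neighbors": list(range(3, 9, 2))},
    {
        "classifier": ["tree"],
        "classifier__max_depth": list(range(10, 12)),
        "classifier__min_samples_split": list(range(30, 35)),
        "classifier__criterion": ["entropy", "gini"],
    },
]

candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
n_fits = n_candidates * 3   # cv=3 fits per candidate, not counting the final refit on all training data

print(n_candidates, n_fits)
```

The K-nearest neighbors dictionary contributes 3 candidates and the decision tree dictionary 2 x 5 x 2 = 20, so widening any range multiplies the running time accordingly.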

In [15]:
def graph_display(dt_obj,xtrain,oe_obj):
    
    '''
    Use Graphviz to render the Decision Tree and store it in pdf format.
    
    Argument :-
    dt_obj  :- Decision tree object
    xtrain  :- Training data
    oe_obj  :- Object of OneHotEncoder (not used by this function)
    
    '''
    label=list(dt_obj.classes_.astype('str'))
    
    dot_data = tree.export_graphviz(dt_obj, out_file=None, feature_names=list(xtrain.columns), class_names=label,filled=True, rounded=True,special_characters=True)  
      
    graph = pydotplus.graph_from_dot_data(dot_data)  
    graph.write_pdf("Adult_data_new.pdf")
   
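export_graphviz only produces DOT source text; converting it to a PDF is what needs the Graphviz binaries. The sketch below, on synthetic data with made-up feature names f0..f3, prints the start of the DOT text and also uses export_text, which gives a plain-text view of a tree without any external tools.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Synthetic stand-in data; the feature names f0..f3 are made up for this example.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)

# DOT source as a string; turning it into a PDF still requires Graphviz.
dot_data = export_graphviz(clf, out_file=None, feature_names=feature_names, filled=True)
print(dot_data.splitlines()[0])

# Plain-text rendering of the same tree, no external tools needed.
text_view = export_text(clf, feature_names=feature_names)
print(text_view)
```

A small max_depth keeps the output readable; the fully grown tree built on the Adult data would be far larger.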

In [16]:
#Read the data using pandas and convert it into a data frame
df=read_data()

In [17]:
#Split data into X and Y data.
xdata,ydata=preprocess_data(df)

In [18]:
#LabelEncode Y data for graphically displaying the Decision Tree.
ydata_lb,le_obj=labelencode(ydata)

In [19]:
#Split the X data into categorical and continuous data.
xcat_data,xcont_data=cat_cont_data_split(xdata)

In [20]:
#Convert the categorical data into numeric format using OneHotEncoder.
xcat_ohe,oe_obj=onehotencode(xcat_data)

In [21]:
#Concatenate transformed categorical data with continuous data.
concat_cat_cont_df=concat_cat_cont(xcat_ohe,xcont_data)

In [22]:
#Train-test split the concatenated data.
xtrain,xtest,ytrain,ytest=train_test(concat_cat_cont_df,ydata_lb)

In [23]:
#Create object of Decision Tree
dt_obj=decision_tree(xtrain,ytrain)

In [24]:
#Run the test data over the model to predict the data.
y_pred=pred_test(dt_obj,xtest)

In [36]:
#Find the accuracy score of predicted and actual Y data.
accuracy_score(ytest,y_pred)

0.8172791483263384


In [37]:
#Run Decision Tree with cross-validation.
cross_val_split(xtrain,ytrain)

0.8157246665787077


In [25]:
#Run the K-nearest neighbors and Decision Tree classifiers with their respective parameters and ranges of values using GridSearchCV.
grid_search_dt()

Pipeline(memory=None,
     steps=[('classifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=11,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=33,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
0.8549491049491049


In [39]:
#Graphically display Decision tree and save the image in pdf format.
graph_display(dt_obj,xtrain,oe_obj)