In [None]:
'''
This notebook applies a RandomForest classifier to the Adult data set. RandomForest is an ensemble method: it builds
multiple decision trees (the number is set by n_estimators) and combines their predictions. Each tree on its own is a
relatively weak model, but aggregating the predictions of all the trees through a voting mechanism usually gives a more
accurate and more stable prediction than any single tree. Each tree chooses its splits using either the gini index or entropy.

In scikit-learn, decision trees only work on numerical data, so all categorical columns are one-hot encoded.
The scikit-learn OneHotEncoder turns every unique value of a categorical column into its own feature (column) holding
0's and 1's. A row has 1 in the new column if it had that value in the original column, and 0 otherwise.
The encoded data frame therefore has one column per unique value of each categorical column in the original data set.

E.g. suppose a data frame has 2 columns. After OneHotEncoder is applied to the categorical column (Color), the data frame
is transformed as shown below.
    Color   Price
    Red     2000
    Green   5000
    Yellow  9000

    Color_Red    Color_Green   Color_Yellow   Price
        1            0              0         2000
        0            1              0         5000
        0            0              1         9000
    
    
The code implements both a train-test split and cross-validation (through GridSearchCV).

The code also uses scikit-learn's GridSearchCV. GridSearchCV takes a single estimator, here a Pipeline whose classifier
step is swapped between several classifier objects through the parameter grid. For each classifier you can pass multiple
parameters, each with a range of values. GridSearchCV fits a model for every combination of those parameter values and
scores each one with cross-validation. In the end, the model with the highest mean cross-validated accuracy is reported
along with its parameter values.


Details of the algorithm are given below:-

1. Data is read into a pandas data frame. The '?' values are replaced with the mode of their column. The data is then
split into X (input) and Y (output/class).

2. The X data is separated into a continuous and a categorical data frame.

3. The categorical data is transformed into numeric format using OneHotEncoder. The encoder is fit on the categorical
   X data and used to transform it. The output is an array where every unique value of each column becomes a feature,
   and the original values are replaced with 0's and 1's.

4. Concatenate the transformed categorical data with the continuous data.

5. Train-test split the data obtained from point 4 into training and test data.

6. Create a RandomForest object with criterion='gini', n_estimators=100, min_samples_leaf=30 and min_samples_split=40,
leaving the other parameters at their defaults. Fit it on the training X and Y data to build a model.

7. Predict the test data with the model created in point 6.

8. Find the accuracy score of the predicted data against the actual Y data.



RandomForest and K-nearest neighbors classifiers with GridSearch --

1. Points 1-5 from the above remain the same.

2. Create a Pipeline object with a classifier step.

3. Create a param_grid with one dictionary per classifier, listing its parameters and a range of values for each.

4. Create an object of GridSearchCV with the pipeline object, the param_grid and cv=3 (3-fold cross-validation).

5. Fit the training data on this object.

6. The attributes of GridSearchCV are printed to give the best classifier with its optimal parameter values along with
its accuracy score.

In the end, you will have accuracy scores to compare the train-test and cross-validation approaches.
You can also compare the classifiers and choose the best one for the Adult data set.

'''
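
The two split criteria mentioned in the description, gini index and entropy, can be illustrated with a small worked example. This is only a sketch of the standard textbook formulas; the helper names `gini` and `entropy` are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def gini(class_probs):
    """Gini impurity of a node: 1 - sum(p_k ** 2)."""
    p = np.asarray(class_probs, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(class_probs):
    """Shannon entropy of a node in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]  # log2(0) is undefined, and empty classes contribute nothing
    return float(-np.sum(p * np.log2(p)))

balanced = [0.5, 0.5]  # node with an even class mix (most impure for two classes)
skewed = [0.9, 0.1]    # node dominated by one class

print("balanced:", gini(balanced), entropy(balanced))
print("skewed:  ", gini(skewed), entropy(skewed))
```

Both measures are highest for the balanced node and fall to zero for a pure node, which is why a tree prefers splits that reduce them.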

In [42]:
import pandas as pd
import numpy as np
import math
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


In [12]:
def read_data():
    '''
    This function is to read data using Pandas read_csv function and convert into dataframe.
    The '?' values of the columns are replaced with the mode value of those columns.
    
    Return :- 
    data :- Dataframe of adult data set
    
    '''
    data=pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data',names=['Age','Workclass','FNLWGT','Education','Education-Num','Marital Status','Occupation','Relationship','Race','Sex','Capital Gain','Caplital Loss','Hrs-Per-Week','Native-Country','Sal'])
    
      
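    # Replace the ' ?' placeholder in each column with that column's most frequent value.
    # Assigning the result back avoids relying on in-place modification of a column selection.
    for col in data.columns:
        data[col] = data[col].replace(' ?', data[col].mode()[0])

    # Drop the '&' from this country name; this only needs to run once, outside the loop.
    data['Native-Country'] = data['Native-Country'].replace(' Trinadad&Tobago', 'TrinadadTobago')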
        
    
    
        
    return data
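
The mode replacement done in read_data can be tried on a tiny frame first. This is a minimal sketch under the assumption that missing values appear as ' ?' (with a leading space), as in the Adult file; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the Adult data with one missing Workclass value.
toy = pd.DataFrame({
    'Workclass': [' Private', ' ?', ' Private', ' State-gov'],
    'Age': [39, 50, 38, 53],
})

for col in toy.columns:
    most_common = toy[col].mode()[0]              # most frequent value in the column
    toy[col] = toy[col].replace(' ?', most_common)  # no-op for columns without ' ?'

print(toy)
```

Columns that contain no ' ?' come back unchanged, so looping over every column is safe here.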
    

In [14]:
def preprocess_data(df):
    
    '''
    The function is to split data first into the Input (X) and Output (Y) dataset
    
    Argument :-
    df :- Adult data set
    
    Return :-
    x :- X data
    y:- Y data
    
    
    '''
    
    
    x=df.iloc[:,:-1]
    y=df.iloc[:,-1]
    
    
    return x,y

In [15]:
def labelencode(y_data):
    
    '''
    The function label-encodes the Y data, converting the class strings into integers (0 and 1) so that the target
    is in a consistent numeric format.
    
    Argument :- 
    y_data :- Y data
    
    Return :- 
    y_data :- Labelencoded Y data
    le    :- Object of LabelEncoder
    
    '''
    
    le=LabelEncoder()
    
    y_data=le.fit_transform(y_data)
    
    return y_data,le
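
What labelencode does to the target can be seen on a few values. This is a sketch with made-up labels in the same form as the 'Sal' column (leading space included); it is not run on the real data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the target column.
labels = [' <=50K', ' >50K', ' <=50K', ' <=50K']

le = LabelEncoder()
encoded = le.fit_transform(labels)  # classes are sorted, then numbered from 0

print(list(le.classes_))                       # the original class strings, in sorted order
print(encoded.tolist())                        # integer code for every row
print(le.inverse_transform([1, 0]).tolist())   # map integer codes back to the class strings
```

Keeping the fitted encoder around, as labelencode does by returning it, is what makes it possible to map predictions back to the original class strings.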

In [16]:
def cat_cont_data_split(xdata):
    
    '''
    The function is to split the X data into categorical and continuous data.
            
    Argument :- 
    xdata :- X data
    
    Return :- 
    x_cat_data :- Categorical data frame.
    x_cont_data :- Continuous data frame
    
      
    '''
        
    #split xdata data into categorical data

    x_cat_data=xdata[['Workclass','Education','Marital Status','Occupation','Relationship','Race','Sex','Native-Country']]
        
    #split xdata into continuous data

    x_cont_data=xdata[['Age','FNLWGT','Education-Num','Capital Gain','Caplital Loss','Hrs-Per-Week']]
        
    
    return x_cat_data,x_cont_data
    

In [17]:
def onehotencode(datatofit):
    
    '''
    The function is to convert the categorical data into numeric format using the OneHotEncoder library. The encoder
    is fit on the categorical X data and used to transform it. The output is an array where every unique value of
    each column is transformed into a feature. The original values are replaced with 0's and 1's.
    
    Argument :- 
    datatofit :- X categorical data
    
    Return :- 
    x_data_oe :- One-hot encoded categorical data.
    oe        :- Object of OneHotEncoder
    
    '''
    
    oe=OneHotEncoder()
    x_data_oe=oe.fit_transform(datatofit).toarray()
    
    return x_data_oe,oe
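
The Color/Price example from the description at the top can be reproduced with the same OneHotEncoder call. This is a minimal sketch; the `Color_` column prefix is added by hand here for readability and is an assumption of this example, not something onehotencode does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Color': ['Red', 'Green', 'Yellow'],
                    'Price': [2000, 5000, 9000]})

oe = OneHotEncoder()
encoded = oe.fit_transform(toy[['Color']]).toarray()  # sparse matrix -> dense array

# One output column per unique value; categories_ holds them in sorted order.
columns = ['Color_' + str(c) for c in oe.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=toy.index).astype(int)

result = pd.concat([encoded_df, toy[['Price']]], axis=1)
print(result)
```

Note that the encoder sorts the categories, so the columns come out as Color_Green, Color_Red, Color_Yellow rather than in order of first appearance.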

In [18]:
def concat_cat_cont(xcat_ohe,xcont_data):
    
    '''
    The function is to concatenate transformed categorical and continuous data. 
    
    
    Argument :- 
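    xcat_ohe   :- One-hot encoded categorical data (array).
    xcont_data :- Continuous data.
    
    Return :- 
    concat_df :- Dataframe with both categorical and continuous data.
             
    '''
    # Reuse the index of the continuous data so the rows line up during concatenation.
    xcat_ohe=pd.DataFrame(xcat_ohe,index=xcont_data.index)
    concat_df=pd.concat([xcont_data,xcat_ohe],axis=1)
    # Recent scikit-learn versions reject data frames that mix string and integer column names.
    concat_df.columns=concat_df.columns.astype(str)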
    
    return concat_df

In [19]:
def train_test(concat_cat_cont_df,ydata_lb):
    
    '''
    The function is to train-test split the dataframe with transformed categorical data and continuous data.
    
    Argument :- 
    concat_cat_cont_df  :- Concatenated cat and cont data,
    ydata_lb               :- LabelEncoded Y data.
    
    Return :- 
    x_train:-training data
    x_test :- Validation data
    y_train :- Label data of training data
    y_test :- label data of validation data
    
    '''
    
    x_train,x_test,y_train,y_test=train_test_split(concat_cat_cont_df,ydata_lb,test_size=0.3,random_state=0)
    
    return x_train,x_test,y_train,y_test

In [20]:
def random_forest(xtrain,ydata):
    '''
    Create an object of RandomForest and fit the training X and Y data.
    
    Argument :- xtrain and ydata.
    
    Return :- fitted RandomForest object
    
    '''
    # Only the non-default settings are passed. Arguments such as max_features='auto' and
    # min_impurity_split were removed from recent scikit-learn versions and would raise an error.
    clf=RandomForestClassifier(criterion='gini',n_estimators=100,
            min_samples_leaf=30,min_samples_split=40)
    
    
    clf.fit(xtrain,ydata)
    
    
    
    return clf
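
The voting mechanism described at the top can be checked directly on a fitted forest. This is a sketch on synthetic data from make_classification, not the Adult data. In scikit-learn the forest averages the class probabilities of its trees (soft voting) rather than counting one hard vote per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The class with the highest averaged probability is the ensemble's prediction.
manual_pred = forest.classes_[averaged.argmax(axis=1)]

print(np.allclose(averaged, forest.predict_proba(X)))
print((manual_pred == forest.predict(X)).all())
```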

In [34]:
def pred_test(rf_obj,x_test):
    
    '''
    The function is to run the test data over the model to get the prediction of every row.
    
    Argument :- 
    rf_obj :- Fitted RandomForest object 
    x_test  :- test data.
    
    Return :- 
    ypred :- Array of predicted class for every row.
    
    
    '''
    
    y_pred=rf_obj.predict(x_test)
    
    return y_pred

In [37]:
def acccuracy_score(ypred,y_test):
    
    '''
    The function is to calculate the accuracy score of the predicted classes against the actual Y data of the test data set.
    
    Argument :- 
    ypred :- Predicted array of classes
    y_test :- Actual Y data of the test data set
            
    '''
    

    print("Accuracy:",metrics.accuracy_score(y_test, ypred))
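
The accuracy score is simply the fraction of rows where the predicted class equals the actual class. A short sketch with made-up labels shows that a manual computation agrees with metrics.accuracy_score:

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

manual = (y_true == y_hat).mean()                # fraction of matching rows
library = metrics.accuracy_score(y_true, y_hat)  # same quantity from scikit-learn

print(manual, library)
```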

In [23]:
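def grid_search_rf():
    # Grid search over RandomForest parameters only; reads the global xtrain and ytrain.
    # The grid below is large (thousands of parameter combinations, each fit 3 times), so it is slow to run.
    #knn_grid = {'n_neighbors': list(range(3,10,2))}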
    
    parameters={'n_estimators':[10,100],'max_depth': range(1,10,1),'min_samples_leaf': range(1,15,5),'min_samples_split':range(2,50,2),'min_impurity_decrease':np.arange(0.0,0.1,0.01)}
    #parameters={'min_samples_split':range(4,50,2)}
    clf_tree=RandomForestClassifier(n_estimators=100,criterion="entropy")
    clf=GridSearchCV(clf_tree,parameters,cv=3)
    clf.fit(xtrain,ytrain)
    print(clf.best_estimator_)
    print(clf.best_score_)
    best_obj=clf.best_estimator_
    
    return best_obj

In [45]:
def grid_search_dt():
    
    '''
    The function runs multiple classifiers through GridSearchCV to find the best one. A Pipeline with a single
    classifier step is created, and param_grid holds one dictionary per classifier, each listing that classifier's
    parameters and a range of values. GridSearchCV is created with the pipeline object, the param_grid and cv=3
    (3-fold cross-validation) and fit on the training data. The best classifier, with its optimal parameter values,
    is printed along with its mean cross-validated accuracy score.
    
    '''
    k_range = list(range(3,9,2))
    m_depth=list(range(10,12))
    m_split=list(range(30,35))
    est_range = list(range(50,100,10))
    
    pipe = Pipeline([('classifier', KNeighborsClassifier())])


    param_grid = [{
        'classifier':[KNeighborsClassifier()],
        'classifier__n_neighbors':k_range,
        },
        {
        'classifier':[RandomForestClassifier()],
        'classifier__max_depth': m_depth,
        'classifier__n_estimators':est_range,
        'classifier__min_samples_split':m_split,
        'classifier__criterion':['entropy','gini']
       }] 


    clf=GridSearchCV(pipe,param_grid,cv=3)
    clf.fit(xtrain,ytrain)
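    print(clf.best_estimator_)
    print(clf.best_score_)

    # Return the best estimator so that the caller's assignment receives it instead of None.
    return clf.best_estimator_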


In [24]:
#Read data using pandas and convert in data frame
df=read_data()

In [25]:
#Split data into X and Y data.
xdata,ydata=preprocess_data(df)

In [26]:
#LabelEncode Y data so that the class labels are integers.
ydata_lb,le_obj=labelencode(ydata)

In [28]:
#Split the X data into categorical and continuous data.
xcat_data,xcont_data=cat_cont_data_split(xdata)

In [29]:
#Convert the categorical data into numeric format using OneHotEncoder.
xcat_ohe,oe_obj=onehotencode(xcat_data)

In [30]:
#Concatenate transformed categorical data with continuous data.
concat_cat_cont_df=concat_cat_cont(xcat_ohe,xcont_data)

In [31]:
#Train-test split the concatenated data.
xtrain,xtest,ytrain,ytest=train_test(concat_cat_cont_df,ydata_lb)

In [32]:
#Create and fit the RandomForest classifier.
rf_obj=random_forest(xtrain,ytrain)

In [35]:
#Run the test data over the model to predict the data.
y_pred=pred_test(rf_obj,xtest)

In [39]:
#Find the accuracy score of predicted and actual Y data.
acccuracy_score(y_pred,ytest)

Accuracy: 0.8554611526256526


In [46]:
best_est_obj=grid_search_dt()

Pipeline(memory=None,
     steps=[('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=11, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=32,
            min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])
0.8578448578448579
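
The pipeline-based grid search used above can be sketched on a small synthetic problem, which runs in seconds and makes the mechanics easier to inspect. The data from make_classification and the tiny parameter ranges are assumptions of this example, not the settings used on the Adult data. The last line also scores the best model on the held-out test split, which gives a number directly comparable to the train-test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The 'classifier' step is a placeholder that each grid entry replaces.
pipe = Pipeline([('classifier', KNeighborsClassifier())])
param_grid = [
    {'classifier': [KNeighborsClassifier()],
     'classifier__n_neighbors': [3, 5]},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__n_estimators': [20, 40],
     'classifier__max_depth': [3, 5]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print(len(search.cv_results_['params']))  # 2 KNN + 4 RandomForest candidates
print(type(search.best_estimator_.named_steps['classifier']).__name__)
print(search.best_score_)                 # mean 3-fold cross-validated accuracy
print(search.score(X_test, y_test))       # accuracy of the refit best model on held-out data
```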
