# HW5: Ensemble methods and cross validation with college data

Use the college statistics data to compare the cross-validated performance of different ensemble methods.


In [0]:
# import the few things we will use in this exercise
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier

In [0]:
#1 Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
#2. Get the file
#make sure you upload all your data files to your Google drive and change share->Advanced->change->anyone with the link can view
downloaded = drive.CreateFile({'id':'185gv_hzqVvSUI_TBDINYNU6Leawu5tMo'}) # replace the id with id of file you want to access
downloaded.GetContentFile('college.csv')

In [0]:
# read in the data and save it in a dataframe
col = pd.read_csv('college.csv')

In [0]:
# Take a look and see what's in it
col.head(5)

Unnamed: 0.1,Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


---
## EDA: 
##### You may perform more EDA steps than the ones suggested below.

In [0]:
# Check the data type and the number of non-null values in each column
col.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
Unnamed: 0     777 non-null object
Private        777 non-null object
Apps           777 non-null int64
Accept         777 non-null int64
Enroll         777 non-null int64
Top10perc      777 non-null int64
Top25perc      777 non-null int64
F.Undergrad    777 non-null int64
P.Undergrad    777 non-null int64
Outstate       777 non-null int64
Room.Board     777 non-null int64
Books          777 non-null int64
Personal       777 non-null int64
PhD            777 non-null int64
Terminal       777 non-null int64
S.F.Ratio      777 non-null float64
perc.alumni    777 non-null int64
Expend         777 non-null int64
Grad.Rate      777 non-null int64
dtypes: float64(1), int64(16), object(2)
memory usage: 115.4+ KB


In [0]:
col.shape ## There are 777 rows and 19 columns in the dataset

(777, 19)

In [0]:
col.dtypes

Unnamed: 0      object
Private         object
Apps             int64
Accept           int64
Enroll           int64
Top10perc        int64
Top25perc        int64
F.Undergrad      int64
P.Undergrad      int64
Outstate         int64
Room.Board       int64
Books            int64
Personal         int64
PhD              int64
Terminal         int64
S.F.Ratio      float64
perc.alumni      int64
Expend           int64
Grad.Rate        int64
dtype: object

---

#### Rename the first column name to University so it will be easier to relate.

In [0]:
#After renaming, check to see if the column names got changed.

col.rename(columns = {'Unnamed: 0' : 'University'},inplace = True)
col.head(2)

Unnamed: 0,University,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56


#### Change yes/no to 1/0 in column 'Private' to get it ready for model training.

In [0]:
# First of all, find out how many yes and no values there are in the 'Private' column.
# (col.loc[:, 'Private'].value_counts() is an equivalent way to write this.)
col.Private.value_counts()

Yes    565
No     212
Name: Private, dtype: int64

In [0]:
#change yes to 1 and no to 0 in the "Private" column and view the changes after your action.
# Convert the string categories into dummy columns (Private_No, Private_Yes)
Private_dummies = pd.get_dummies(col.Private, prefix='Private')

# Keep only Private_Yes as an integer Series: 1 for Yes, 0 for No
# (astype(int) matters on newer pandas versions, where get_dummies returns booleans)
private_yes = Private_dummies['Private_Yes'].astype(int)

# Replace the original string column with the 0/1 values at the same position
col.drop(['Private'], axis=1, inplace=True)
col.insert(loc=1, column='Private', value=private_yes)

col.loc[:, 'Private'].value_counts()  # Count the number of 1's and 0's


1    565
0    212
Name: Private, dtype: int64

In [0]:
# check that the number of 1s/0s equals the number of yes/no values you checked earlier.
col.loc[:,'Private'].value_counts() ## Count the number of 1's and 0's


1    565
0    212
Name: Private, dtype: int64

The original column had 565 Yes and 212 No values.

After encoding there are 565 ones and 212 zeros, so the value counts are unchanged.
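The same encoding and check can also be sketched with an explicit mapping. This is an alternative to the `get_dummies` approach used above, not the method the notebook ran; the toy Series below is a hypothetical stand-in for `col['Private']`.

```python
import pandas as pd

# Toy stand-in for col['Private'] (assumption: the real column holds only 'Yes'/'No').
private = pd.Series(['Yes', 'No', 'Yes', 'Yes', 'No'], name='Private')

# Explicit mapping: any value other than 'Yes'/'No' becomes NaN and is caught below.
encoded = private.map({'Yes': 1, 'No': 0})

# Sanity checks: nothing was left unmapped, and the class counts are unchanged.
assert encoded.notna().all()
assert (encoded == 1).sum() == (private == 'Yes').sum()
assert (encoded == 0).sum() == (private == 'No').sum()

print(encoded.value_counts())
```

Writing the check as assertions means a mismatch stops the notebook instead of relying on a visual comparison of two outputs.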

___
### Compare classification performance of different models with cross validation

In [0]:
coll = col

In [0]:
# Assign the Private column as target y and 
#assign all columns, except University and Private, in the dataframe as X

y = coll.loc[:,'Private']

X = coll.iloc[:,2:19]
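Selecting the features by position works as long as the column order never changes. A minimal sketch of selecting by name instead is shown here; the toy frame and its university labels are hypothetical, and only the column names come from the data above.

```python
import pandas as pd

# Toy frame with the same leading columns as the college data (values are illustrative).
toy = pd.DataFrame({
    'University': ['A', 'B', 'C'],
    'Private': [1, 0, 1],
    'Apps': [1660, 2186, 1428],
    'Accept': [1232, 1924, 1097],
})

# Target column by name.
y_toy = toy['Private']

# Every column except the identifier and the target, regardless of column order.
X_toy = toy.drop(columns=['University', 'Private'])

print(list(X_toy.columns))
```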

#### Create a function that takes a model and the 'number of splits' parameter in StratifiedKFold cross validation

In [0]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier,GradientBoostingClassifier
  
dt = DecisionTreeClassifier()

bc = BaggingClassifier()

et = ExtraTreesClassifier()

rf = RandomForestClassifier()

In [0]:
def cross_score(model, model_name, n_splits, X, y):
    """
    Print the average and standard deviation of the cross validation scores
    for a given model and number of splits.

    Args:
        model: machine learning model used, e.g., DecisionTreeClassifier()
        model_name: name of the model without the word 'Classifier', e.g., 'Decision Tree'
                    (the printout appends 'Classifier' itself)
        n_splits: number of folds used in StratifiedKFold
        X: feature data used for the model
        y: target data used for the model
    Returns:
        None
    Printout:
        the average and the standard deviation of the cross validation scores,
        rounded to 3 decimal places, like the following.

        Decision Tree Classifier with 2 splits has an average score of 0.916 ± 0.038

    """
    # shuffle=True without a random_state: the folds, and so the scores, change on every call
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True)

    # one accuracy score per fold, computed in parallel on all available cores
    s = cross_val_score(model, X, y, cv=cv, n_jobs=-1)

    print("%s Classifier with %s splits has an average score of %s ± %s" % (model_name, n_splits, s.mean().round(3), s.std().round(3)))

  



In [0]:
#cross_score(DecisionTreeClassifier(), 'Decision Tree', 3,X,y)

cross_score(RandomForestClassifier(), 'Random Forest ', 2,X,y)

#cross_score(ExtraTreesClassifier(), 'Extra Trees Classifier', 5,X,y)

#cross_score(BaggingClassifier(), 'Bagging Classifier', 5,X,y)

#cross_score(RandomForestClassifier(), 'Random Forest', 5,X,y)

### Testing the function on Decision Tree algorithm

In [0]:
for i in range(2,21):
  
  cross_score(dt, 'Decision Tree ', i,X,y)

Decision Tree  Classifier with 2 splits has an average score of 0.898 ± 0.009
Decision Tree  Classifier with 3 splits has an average score of 0.906 ± 0.01
Decision Tree  Classifier with 4 splits has an average score of 0.907 ± 0.011
Decision Tree  Classifier with 5 splits has an average score of 0.919 ± 0.013
Decision Tree  Classifier with 6 splits has an average score of 0.896 ± 0.015
Decision Tree  Classifier with 7 splits has an average score of 0.902 ± 0.035
Decision Tree  Classifier with 8 splits has an average score of 0.925 ± 0.018
Decision Tree  Classifier with 9 splits has an average score of 0.898 ± 0.023
Decision Tree  Classifier with 10 splits has an average score of 0.905 ± 0.028
Decision Tree  Classifier with 11 splits has an average score of 0.907 ± 0.018
Decision Tree  Classifier with 12 splits has an average score of 0.915 ± 0.031
Decision Tree  Classifier with 13 splits has an average score of 0.906 ± 0.042
Decision Tree  Classifier with 14 splits has an average score

#### By examining the result above, which split(s) give you the best performance?



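Answer: In general, the best split is the one with the highest mean score and a low standard deviation. When I ran the cell above, 5 splits gave the highest mean together with a low standard deviation compared with the other splits, so 5 splits could give the best performance. The values change on every rerun, because the folds are reshuffled each time (`shuffle=True` with no fixed `random_state`): in the output shown above, for example, 8 splits has the highest mean among the listed rows (0.925 ± 0.018), while 5 splits has 0.919 ± 0.013. The differences between splits therefore appear to be within run-to-run noise.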
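Because the scores move from run to run, one way to make the comparison repeatable is to fix `random_state` and, optionally, average over repeated splits. The sketch below illustrates this on synthetic data from `make_classification` (an assumption standing in for `X` and `y`; `cv_scores` is a hypothetical helper, not part of the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (RepeatedStratifiedKFold, StratifiedKFold,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y: 777 rows, 17 features, imbalanced classes.
X_demo, y_demo = make_classification(n_samples=777, n_features=17, n_informative=5,
                                     weights=[0.27, 0.73], random_state=0)


def cv_scores(n_splits, seed):
    """Return the per-fold accuracy of a decision tree with fixed seeds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=cv)


# With fixed seeds, two runs produce identical fold scores.
first = cv_scores(5, seed=42)
second = cv_scores(5, seed=42)
assert np.array_equal(first, second)

# Repeating the 5-fold split 10 times gives 50 scores and a steadier average.
rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
repeated = cross_val_score(DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=rcv)
print("5 splits x 10 repeats: %.3f ± %.3f" % (repeated.mean(), repeated.std()))
```

Fixing the seed makes a single run reproducible; repeating the split is what reduces the dependence on any one particular shuffle.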



___
### Comparing the mean and sd of cross validation scores subject to decision tree and ensemble methods

Use the function you created above to print out  
the mean and sd of cross validation scores for  
decision trees, bagging, random forest, extra trees ,
AdaBoost, and Gradient Boosting classifiers   
with 3 folds cross validation.

In [0]:
cross_score(DecisionTreeClassifier(), 'Decision Tree', 3,X,y)

cross_score(RandomForestClassifier(), 'Random Forest', 3,X,y)

cross_score(ExtraTreesClassifier(), 'Extra Trees', 3,X,y)

cross_score(BaggingClassifier(), 'Bagging', 3,X,y)

cross_score(AdaBoostClassifier(), 'AdaBoost', 3,X,y)

cross_score(GradientBoostingClassifier(), 'Gradient Boosting', 3,X,y)


Decision Tree Classifier with 3 splits has an average score of 0.916 ± 0.019
Random Forest Classifier with 3 splits has an average score of 0.93 ± 0.019
Extra Trees Classifier with 3 splits has an average score of 0.937 ± 0.006
Bagging Classifier with 3 splits has an average score of 0.934 ± 0.014
AdaBoost Classifier with 3 splits has an average score of 0.918 ± 0.011
Gradient Boosting Classifier with 3 splits has an average score of 0.936 ± 0.005


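#### Comment on the mean and the standard deviation of the scores and see if they are consistent with the pros and cons of ensemble methods.   
You may want to run the cell a few times for a more general comparison.

Across several runs of the cell, the means and standard deviations of the ensemble methods stayed consistent, with only small variations from run to run.

The improvement over a single decision tree is modest but present. In the run shown above the decision tree scored 0.916 ± 0.019, while the ensembles scored between 0.918 and 0.937, and Extra Trees (± 0.006) and Gradient Boosting (± 0.005) also had noticeably lower standard deviations. This is consistent with what ensembles are expected to do: averaging many trees (bagging, random forest, extra trees) mainly reduces variance, while boosting mainly reduces bias.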
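Each call to `cross_score` reshuffles the data, so the models above are scored on different folds. A hedged sketch of a fairer comparison is to reuse one seeded splitter for every model and collect the results in a table. The data here is synthetic (an assumption standing in for `X` and `y`), and the `models` dictionary and `summary` frame are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the college features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=17,
                                     n_informative=5, random_state=0)

# One seeded splitter shared by all models, so every model sees the same 3 folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Bagging': BaggingClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Extra Trees': ExtraTreesClassifier(random_state=0),
    'AdaBoost': AdaBoostClassifier(random_state=0),
    'Gradient Boosting': GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    rows.append({'model': name, 'mean': scores.mean(), 'std': scores.std()})

# Rank the models by mean accuracy, best first.
summary = pd.DataFrame(rows).set_index('model').sort_values('mean', ascending=False)
print(summary.round(3))
```

With shared folds, differences between rows reflect the models rather than the luck of the shuffle.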

---

### Print out the "feature importances" of this best model.

The model has an attribute called `.feature_importances_` which tells us which features were more important than others. Each importance lies between 0 and 1, with larger values meaning more important, and the importances of all features sum to 1.

An easy way to think about feature importance is how much that particular variable was used to make decisions. More precisely, it also takes into account how much each split on that feature helped separate the classes (or reduce the variance, for regression).

A feature with higher feature importance reduced the criterion (impurity) more than the other features.

See 
<a href="https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e">here </a> for more detail and coding sample.
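As a small illustration of the attribute described above, the sketch below fits a random forest on synthetic data (an assumption; the feature names `f0` to `f5` are hypothetical) and ranks the importances as a labelled Series.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 6 features, 3 of which carry signal.
X_arr, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f%d' % i for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Label each importance with its feature name and sort from most to least important.
importances = pd.Series(forest.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)

print(importances.round(3))
print("sum of importances:", round(importances.sum(), 6))
```

Because the importances are normalized, a value is only meaningful relative to the other features in the same fitted model.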

#### Show, for RandomForestClassifier, the feature importances for each variable predicting private vs. not, sorted by most important feature to least.

In [0]:
# show the ranked list of feature importances here

# Fit the random forest on the full data, then rank the features by importance
rf.fit(X, y)

feature_importances = pd.DataFrame(
    rf.feature_importances_,
    index=X.columns,
    columns=['Random_Forest_Important_Features'],
).sort_values('Random_Forest_Important_Features', ascending=False)


In [0]:
features = feature_importances
feature_importances

Unnamed: 0,Random_Forest_Important_Features
F.Undergrad,0.218535
Outstate,0.203657
Enroll,0.102236
P.Undergrad,0.085526
Expend,0.069796
Accept,0.056063
Room.Board,0.045354
Apps,0.039114
S.F.Ratio,0.0276
perc.alumni,0.025828


#### Show a ranked list of features in col dataframe that are correlated to the target column 'Private'

In [0]:
# ranked list of features in col dataframe that are correlated to the target column 'Private'.

# Absolute correlation of every numeric column with 'Private', strongest first.
# University holds names (strings), so it is dropped before calling corr().
data = col.drop(columns=['University']).corr().abs().Private.sort_values(ascending=False)

# Keep the feature importances under a second name for the comparison below.
data1 = feature_importances



In [0]:
# Put the correlations and the feature importances side by side, aligned on feature name
dataa = pd.concat([data, data1], axis=1)

dataa.sort_values(by=['Private'], ascending=False)

Unnamed: 0,Private,Random_Forest_Important_Features
Private,1.0,
F.Undergrad,0.615561,0.218535
Enroll,0.567908,0.102236
Outstate,0.55265,0.203657
Accept,0.475252,0.056063
S.F.Ratio,0.472205,0.0276
P.Undergrad,0.452088,0.085526
Apps,0.432095,0.039114
perc.alumni,0.414775,0.025828
Room.Board,0.340532,0.045354


#### Compare and comment on the feature importances and corr() lists you have above.

The dataset has 17 predictor features (19 columns minus University and the target Private).

According to the random forest, F.Undergrad, Outstate, and Enroll are the top 3 features: splits on them reduce node impurity the most, so they have the highest predictive power in this model.

The correlation table gives a slightly different order. Enroll has the second-highest absolute correlation with the target Private, but the random forest ranks it third, behind Outstate. S.F.Ratio is fifth by correlation (0.472) yet has one of the lowest importances shown (0.028).

The major takeaway from comparing the two lists is that a feature that is highly correlated with the target is not necessarily one of the most important features for the model. Correlation measures each feature's linear relationship with the target on its own, whereas feature importance reflects how useful the feature is for splitting alongside all the other features, so features that carry overlapping information can share importance between them.
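To make the comparison concrete, the two lists can be turned into ranks and subtracted. The sketch below uses only the nine rows visible in the combined table above, with the numbers copied from it; the column names `abs_corr`, `importance`, and `rank_shift` are illustrative.

```python
import pandas as pd

# Values copied from the combined table above (only the rows shown there).
table = pd.DataFrame(
    {
        'abs_corr': [0.615561, 0.567908, 0.55265, 0.475252, 0.472205,
                     0.452088, 0.432095, 0.414775, 0.340532],
        'importance': [0.218535, 0.102236, 0.203657, 0.056063, 0.0276,
                       0.085526, 0.039114, 0.025828, 0.045354],
    },
    index=['F.Undergrad', 'Enroll', 'Outstate', 'Accept', 'S.F.Ratio',
           'P.Undergrad', 'Apps', 'perc.alumni', 'Room.Board'],
)

# Rank 1 = strongest correlation / highest importance among these rows.
table['corr_rank'] = table['abs_corr'].rank(ascending=False).astype(int)
table['importance_rank'] = table['importance'].rank(ascending=False).astype(int)

# Positive shift: the forest ranks the feature lower than its correlation suggests.
table['rank_shift'] = table['importance_rank'] - table['corr_rank']

print(table.sort_values('corr_rank'))
```

Among these rows, S.F.Ratio drops three places and Room.Board gains three, while F.Undergrad stays first in both lists.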


##### Note: The values may change each time the cells are rerun. The values shown are the ones I got when I executed the notebook on my machine.
### Top correlated Features:


In [0]:
data = col.corr().abs().Private.sort_values(ascending= False)
data

AttributeError: ignored