In [None]:
# ------------------------------------------------------------------------------
# Authors: Andreas Nilsson, Anouka Ranby, Erik Rosvall (All part of ADS program)
# Date: 23 Jan 2022
# Description: Decision Trees - Applied Machine Learning, DIT867
# ------------------------------------------------------------------------------

## Task 3: Feature importances in random forest classifiers
Decision trees and random forests are trained by computing importance scores for individual features in different ways: information gain, Gini impurity, variance reduction, etc.

As a way to make our classifiers more interpretable, we can print the importance scores. In scikit-learn, decision trees and ensemble classifiers such as random forests all define an attribute called feature_importances_ (note the final underscore in this name). This is a NumPy array that stores the importance scores for each feature column in the training data matrix. For random forests and other tree ensembles, these importance scores are computed by averaging the scores when training all the different trees in the ensemble.
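As a small illustration of the attribute described above, the sketch below trains a forest on synthetic data (the dataset and all variable names are assumptions made for this example, not part of the assignment) and compares `feature_importances_` with the plain average of the per-tree scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): 5 features, 2 of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One score per feature column; the scores are normalized to sum to 1.
importances = forest.feature_importances_

# Average of the per-tree scores, for comparison with the ensemble attribute.
per_tree_mean = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)

print(importances)
print(per_tree_mean)
```

Both printed arrays should agree, which matches the description of the ensemble score as an average over the individual trees.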

To make these importance scores easier to understand, we can use the attribute feature_names_ (note the underscore again) in the DictVectorizer.
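To show what these names look like, here is a minimal sketch with two made-up records (the records and the names `dv_toy` and `X_toy` are assumptions for illustration). Numeric values keep their key as the column name, while string values are one-hot encoded as `key=value`.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical rows in the same dict format as the training data.
records = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Private"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)

# Column names, in the same order as the columns of the encoded matrix.
print(dv_toy.feature_names_)
# → ['age', 'workclass=Private', 'workclass=State-gov']

print(X_toy.toarray())
```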

Sort the features by importance scores in reverse order (so that the most important feature comes first), inspect the first few of these features, and try to reason about why you got this result.

Hint. If you used a Pipeline, you can access the parts of the sequence via the list pipeline.steps. For instance, pipeline.steps[0][1] will be the first step, pipeline.steps[1][1] will be the second step, etc.
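The hint above can be sketched on a tiny, made-up pipeline (the records, labels and the name `toy_pipeline` are assumptions for illustration). It also shows `named_steps`, which reaches the same objects by name instead of by position.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data in the same dict format as the assignment data.
records = [{"age": 25, "sex": "Male"}, {"age": 52, "sex": "Female"},
           {"age": 31, "sex": "Female"}, {"age": 60, "sex": "Male"}]
labels = ["<=50K", ">50K", "<=50K", ">50K"]

toy_pipeline = make_pipeline(DictVectorizer(), DecisionTreeClassifier(random_state=0))
toy_pipeline.fit(records, labels)

# Each entry of pipeline.steps is a (name, estimator) tuple.
vectorizer = toy_pipeline.steps[0][1]    # first step
classifier = toy_pipeline.steps[-1][1]   # last step

# make_pipeline names each step after its lowercased class name.
same_vectorizer = toy_pipeline.named_steps["dictvectorizer"]

# Pair each feature name with its importance score.
for name, score in zip(vectorizer.feature_names_, classifier.feature_importances_):
    print(name, score)
```

Pairing names and scores this way is only valid when no step between the vectorizer and the classifier removes or reorders columns.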

Hint. This way of computing feature importance scores just tells us whether a feature is good for discriminating between the classes: it does not tell us what the relationship between the feature and an output class is: whether the feature makes it more or less likely that the person is a high earner.

For your report, please also mention an alternative way to compute some sort of importance score of individual features. (You don't need to implement it.) Here, you can either use your common sense, or optionally read the discussion by Parr et al. (2018) that gives some criticism of decision tree-based feature importance scores and discusses some alternatives.

In [None]:
# Import required libs
import pandas as pd
import numpy as np
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler

In [None]:
# Import csv file for Training data
train = pd.read_csv('adult_train.csv')

# Import csv file for Test data
test = pd.read_csv('adult_test.csv')

# Split into input part X and output part Y.
# Convert the X data to a list of dicts, where each dict
# holds the information for one row (person).
Xtrain_dicts = train.drop('target', axis=1).to_dict('records')
Xtest_dicts = test.drop('target', axis=1).to_dict('records')
Ytrain = train['target']
Ytest = test['target']

In [None]:
# create a DictVectorizer
dv = DictVectorizer()

# one-hot encode the list with all dicts
Xtrain_encoded = dv.fit_transform(Xtrain_dicts)
Xtest_encoded = dv.transform(Xtest_dicts)

In [None]:
## Find good hyperparameters for Random Forest classifier
rfc = RandomForestClassifier()

# specify grid
hparams_grid = {'max_depth':  [1,3,6,9,12,15,18,21,51],
                'n_estimators': [1,3,6,9,12,15,21,26,51,76,101,126,151,176,201,226,251] }


In [None]:
# Use randomized search since it is faster: it samples n_iter combinations
# instead of evaluating the full grid
random_search = RandomizedSearchCV(rfc, hparams_grid, n_iter=10)

# train and time it
start = time.time()
random_search.fit(Xtrain_encoded, Ytrain)
end = time.time()
time_grid_train = end - start
print('Train time: {}'.format(time_grid_train))


Train time: 100.30883741378784


In [None]:
# store best params
r_best_m_d = random_search.best_params_['max_depth']
r_best_n_e = random_search.best_params_['n_estimators']

# print result
print("Best Score: {}".format(random_search.best_score_))
print("Best params: {}".format(random_search.best_params_))

Best Score: 0.8614908012363103
Best params: {'n_estimators': 51, 'max_depth': 18}


In [None]:
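# MAKE PIPELINE (first attempt, with StandardScaler)
# Note: DictVectorizer returns a sparse matrix and StandardScaler cannot center
# sparse input, so fitting this version raises the ValueError shown further down.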
pipeline = make_pipeline(DictVectorizer(), # one-hot encode dicts
                        StandardScaler(),
                        SelectKBest(f_classif, k=100), # default f_classif, top 100 features
                        # use best params found from above grid search
                        RandomForestClassifier(max_depth=r_best_m_d, n_estimators=r_best_n_e, random_state=0, n_jobs=-1) )

In [None]:
# MAKE PIPELINE (without StandardScaler; tree-based models do not need feature scaling)
pipeline = make_pipeline(DictVectorizer(), # one-hot encode dicts
                        SelectKBest(f_classif, k=100), # default f_classif, top 100 features
                        # use best params found from above grid search
                        RandomForestClassifier(max_depth=r_best_m_d, n_estimators=r_best_n_e, random_state=0, n_jobs=-1) )

In [None]:
# train pipeline
pipeline.fit(Xtrain_dicts, Ytrain)

ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.

In [None]:
# get accuracy on unseen test data
accuracy_score(Ytest, pipeline.predict(Xtest_dicts))

0.8638290031324857

Here we show the feature importance scores of the model (rfc), computed as impurity-based feature importances (the mean decrease in impurity). This is the most common way to compute feature importance and is therefore built into scikit-learn's random forest classifier. Below are the 10 features with the highest scores, i.e. those the model found most useful for predicting earnings (the y target):

In [None]:
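# Get the names of all one-hot encoded features
feature_names = pipeline.named_steps["dictvectorizer"].get_feature_names_out()

# The forest only sees the k columns kept by SelectKBest, so apply the same
# mask to the names; otherwise names and scores would be misaligned
selected = pipeline.named_steps["selectkbest"].get_support()
feature_names = feature_names[selected]

# Get the impurity-based feature importances
rfc_feature_score = pipeline.named_steps["randomforestclassifier"].feature_importances_

# print top 10 result
for s, f in sorted(zip(rfc_feature_score, feature_names), reverse=True)[:10]:
    print(f, s)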

capital-gain 0.1516155230915853
marital-status=Married-civ-spouse 0.10718817182326537
age 0.10698287365700311
education-num 0.09045906901351733
hours-per-week 0.07044134769451368
occupation=Tech-support 0.06782105127450488
capital-loss 0.0425116455988528
marital-status=Never-married 0.03717284106381609
native-country=United-States 0.02689571115649695
occupation=Craft-repair 0.02074903708273899


##### Comments:

With both mutual_info_classif and f_classif as the SelectKBest scoring function, the top features by mean decrease in impurity were the ones listed below. The importance scores indicate that these features are the most predictive of earnings:
1. Capital gain
2. Marital status (Married-civ-spouse)
3. Age

Capital gain helps describe a person's economic situation and is likely to vary with income. Being married ('Marital status') can be a strong indicator that a person has a close partner or family that they prioritise and value highly. For 'Age', we can generally argue that salary grows with seniority in one's role and career. However, as stated above, these scores do not indicate whether a feature makes an income above or below 50k more likely.

Besides scikit-learn's built-in feature importance score tested above, one can use the permutation-based importance score described by Parr et al. (2018).

The permutation-based importance score of a feature is the difference between the baseline accuracy and the accuracy obtained after randomly permuting the values in that feature's column. It is heavier to compute but more reliable, since it is less biased.
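Although the task does not require an implementation, the idea can be sketched with scikit-learn's `permutation_importance` function. The example below is only an illustration on synthetic data (the dataset, the model settings and all variable names are assumptions, not the pipeline used in this notebook); scores are computed on held-out data so that they reflect how much the model relies on each feature when generalizing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 6 features, 2 informative.
X_syn, y_syn = make_classification(n_samples=500, n_features=6, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# For each column: shuffle its values n_repeats times and record how much
# the validation accuracy drops compared to the unshuffled baseline.
result = permutation_importance(model, X_val, y_val, scoring="accuracy",
                                n_repeats=10, random_state=0)

# Print the features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Each feature needs `n_repeats` extra rounds of prediction, which is why this approach costs more than reading `feature_importances_` from the trained forest.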