## Task 3: Feature importances in random forest classifiers
Decision trees and random forests are trained by computing importance scores for individual features in different ways: information gain, Gini impurity, variance reduction, etc.

As a way to make our classifiers more interpretable, we can print the importance scores. In scikit-learn, decision trees and ensemble classifiers such as random forests all define an attribute called feature_importances_ (note the final underscore in this name). This is a NumPy array that stores the importance scores for each feature column in the training data matrix. For random forests and other tree ensembles, these importance scores are computed by averaging the scores when training all the different trees in the ensemble.
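
As a rough illustration of the averaging described above, the following sketch fits a small forest on synthetic data and recomputes the forest-level scores from the individual trees. The dataset and variable names are invented for the example, and skipping trees without splits and renormalising is an assumption about how the averaging is carried out internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (not the Adult data)
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree importance scores, skipping trees that never split
per_tree = [tree.feature_importances_ for tree in forest.estimators_
            if tree.tree_.node_count > 1]
manual = np.mean(per_tree, axis=0)
manual = manual / manual.sum()  # renormalise so the scores sum to 1

# Compare the manual average with the attribute exposed by the forest
print(np.allclose(manual, forest.feature_importances_))
```

Because each tree's scores already sum to 1, the forest-level scores do too, which is why they can be read as relative shares of importance.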

To make these importance scores easier to understand, we can use the attribute feature_names_ (note the underscore again) in the DictVectorizer.

Sort the features by importance scores in reverse order (so that the most important feature comes first), inspect the first few of these features, and try to reason about why you got this result.

Hint. If you used a Pipeline, you can access the parts of the sequence via the list pipeline.steps. For instance, pipeline.steps[0][1] will be the first step, pipeline.steps[1][1] will be the second step, etc.
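
The steps described above (fit a pipeline, pull out the vectorizer and the classifier, sort the features by score) could look roughly like the sketch below. The toy records, the labels, and the step names `dv` and `rfc` are made up for the illustration; with the real data you would use your own fitted pipeline instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy records, not the Adult data
records = [
    {"age": 25, "education-num": 9, "sex": "Male"},
    {"age": 52, "education-num": 14, "sex": "Female"},
    {"age": 38, "education-num": 13, "sex": "Male"},
    {"age": 19, "education-num": 8, "sex": "Female"},
    {"age": 45, "education-num": 16, "sex": "Male"},
    {"age": 23, "education-num": 10, "sex": "Female"},
    {"age": 60, "education-num": 15, "sex": "Female"},
    {"age": 30, "education-num": 9, "sex": "Male"},
]
labels = ["<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"]

pipeline = Pipeline(steps=[
    ("dv", DictVectorizer()),
    ("rfc", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(records, labels)

# Access the fitted steps by position, as in the hint
vectorizer = pipeline.steps[0][1]
classifier = pipeline.steps[1][1]

feature_names = vectorizer.feature_names_
importances = classifier.feature_importances_

# Indices of the features, most important first
order = np.argsort(importances)[::-1]
ranked = [(feature_names[i], importances[i]) for i in order]

for name, score in ranked[:3]:
    print(name, round(score, 3))
```

Sorting the indices with `np.argsort` keeps names and scores aligned without relying on tuple comparison, which matters when two features have exactly the same score.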

Hint. This way of computing feature importance scores just tells us whether a feature is good for discriminating between the classes: it does not tell us what the relationship between the feature and an output class is: whether the feature makes it more or less likely that the person is a high earner.

For your report, please also mention an alternative way to compute some sort of importance score of individual features. (You don't need to implement it.) Here, you can either use your common sense, or optionally read the discussion by Parr et al. (2018) that gives some criticism of decision tree-based feature importance scores and discusses some alternatives.

In [None]:
# Import required libs
import pandas as pd
import numpy as np
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

In [None]:
# Import csv file for Training data
train = pd.read_csv('adult_train.csv')

# Import csv file for Test data
test = pd.read_csv('adult_test.csv')

# Split into input part X and output part Y.
# Convert the X data to a list of dicts, where each row's (person's)
# information is gathered in one dict.
Xtrain_dicts = train.drop('target', axis=1).to_dict('records')
Xtest_dicts = test.drop('target', axis=1).to_dict('records')
Ytrain = train['target']
Ytest = test['target']

In [None]:
# MAKE PIPELINE
pipeline = Pipeline(steps=[('dv', DictVectorizer()), ('rfc', RandomForestClassifier())])

# Specify params to search
hparams_grid = {'rfc__max_depth':  [1,3,6,9,12,15,18,21,51],
                'rfc__n_estimators': [1,3,6,9,12,15,21,26,51,76,101,126,151,176,201,226,251] }

# Random search since faster
random_search = RandomizedSearchCV(pipeline, hparams_grid, n_iter=5)

In [None]:
# check available params
random_search.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__verbose', 'estimator__dictvectorizer', 'estimator__randomforestclassifier', 'estimator__dictvectorizer__dtype', 'estimator__dictvectorizer__separator', 'estimator__dictvectorizer__sort', 'estimator__dictvectorizer__sparse', 'estimator__randomforestclassifier__bootstrap', 'estimator__randomforestclassifier__ccp_alpha', 'estimator__randomforestclassifier__class_weight', 'estimator__randomforestclassifier__criterion', 'estimator__randomforestclassifier__max_depth', 'estimator__randomforestclassifier__max_features', 'estimator__randomforestclassifier__max_leaf_nodes', 'estimator__randomforestclassifier__max_samples', 'estimator__randomforestclassifier__min_impurity_decrease', 'estimator__randomforestclassifier__min_samples_leaf', 'estimator__randomforestclassifier__min_samples_split', 'estimator__randomforestclassifier__min_weight_fraction_leaf', 'estimator__randomforestclassifier__n_estimators', 'estimato

In [None]:
# time it
start = time.time()
# train
random_search.fit(Xtrain_dicts, Ytrain)
end = time.time()
time_grid_train = end - start
print('Train time: {}'.format(time_grid_train))

Train time: 62.00542140007019


In [None]:
# store best params
r_best_m_d = random_search.best_params_['rfc__max_depth']
r_best_n_e = random_search.best_params_['rfc__n_estimators']

# print result
print("Best Score: {}".format(random_search.best_score_))
print("Best params: {}".format(random_search.best_params_))

Best Score: 0.8625349612625062
Best params: {'rfc__n_estimators': 51, 'rfc__max_depth': 21}


In [None]:
# get accuracy on unseen test data
accuracy_score(Ytest, random_search.predict(Xtest_dicts))

0.8615564154535962

In [None]:
# RandomizedSearchCV has no named_steps; the refitted pipeline is stored in best_estimator_
random_search.best_estimator_.named_steps["dv"].get_feature_names_out()

In [None]:
# The search fits clones of the pipeline, so use the refitted best_estimator_
best_pipeline = random_search.best_estimator_

# Get the names of each feature
feature_names = best_pipeline.named_steps["dv"].get_feature_names_out()

# Get the impurity-based feature importances (the forest step is named 'rfc')
rfc_feature_score = best_pipeline.named_steps["rfc"].feature_importances_

# print top 10 result
for s, f in sorted(zip(rfc_feature_score, feature_names), reverse=True)[:10]:
    print(f, s)

In [None]:
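# The pipeline above has no SelectKBest step, so fit a separate one (SelectKBest uses f_classif by default)
skb_pipeline = make_pipeline(DictVectorizer(), SelectKBest(f_classif))
skb_pipeline.fit(Xtrain_dicts, Ytrain)

# Get the scores of the features and the matching feature names
scores = skb_pipeline.named_steps["selectkbest"].scores_
feature_names = skb_pipeline.named_steps["dictvectorizer"].get_feature_names_out()

for s, f in sorted(zip(scores, feature_names), reverse=True)[:10]:
    print(f, s)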

marital-status=Married-civ-spouse 8025.8420615949835
relationship=Husband 6240.018276214241
education-num 4120.095779707474
marital-status=Never-married 3674.2001465697413
age 1886.7073137161203
hours-per-week 1813.3862822161334
relationship=Own-child 1794.1574893573925
capital-gain 1709.150063743795
sex=Female 1593.1079074467164
sex=Male 1593.1079074467073


In [None]:
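# Encode the dicts once so the scoring functions can be applied directly
dv = DictVectorizer()
Xtrain_encoded = dv.fit_transform(Xtrain_dicts)

# feature selection with f_classif
feature_scores1 = f_classif(Xtrain_encoded, Ytrain)[0]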

# print result
for score, fname in sorted(zip(feature_scores1, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

marital-status=Married-civ-spouse 8025.8420615949835
relationship=Husband 6240.018276214241
education-num 4120.095779707474
marital-status=Never-married 3674.2001465697413
age 1886.7073137161203
hours-per-week 1813.3862822161334
relationship=Own-child 1794.1574893573925
capital-gain 1709.150063743795
sex=Female 1593.1079074467164
sex=Male 1593.1079074467073


In [None]:
# feature selection with mutual info classif
feature_scores2 = mutual_info_classif(Xtrain_encoded, Ytrain)

# print result
for score, fname in sorted(zip(feature_scores2, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

marital-status=Married-civ-spouse 0.10543223425355985
capital-gain 0.08338237212343601
relationship=Husband 0.08087684110742101
age 0.0687725396789363
education-num 0.064872227626807
marital-status=Never-married 0.06195072410418583
hours-per-week 0.0422833222022355
relationship=Own-child 0.03821610420273137
capital-loss 0.03698048451035268
sex=Male 0.025765242400373284


##### Comments:
**Reason about why you got these first few features as top features:**

The features below ranked highest under mutual_info_classif; f_classif placed largely the same features in its top ten, in a different order. Note that both are univariate scores (mutual information and the ANOVA F-statistic), not the random forest's mean decrease in impurity. The scores indicate that these features are the most predictive of earnings:
1. Marital status (= Married-civ-spouse)
2. Capital gain
3. Relationship (= Husband)
4. Age

Marital status and relationship may be strong indicators of earnings, perhaps because family circumstances influence how people prioritise work and time with the family. For age, we can argue that salary generally grows with experience as a person advances in their role and career. However, as stated in the hint, these scores do not indicate whether a feature makes an income above 50k more or less likely.

**For your report, please also mention an alternative way to compute some sort of importance score of individual features. (You don't need to implement it.) Here, you can either use your common sense, or optionally read the discussion by Parr et al. (2018) that gives some criticism of decision tree-based feature importance scores and discusses some alternatives.**

Besides the scikit-learn feature scores tested above, one can use the permutation-based importance score described by Parr et al. (2018).

The importance of a feature is the difference between the baseline accuracy and the accuracy obtained after randomly permuting that feature's column values. It is heavier to compute, but more reliable because it is less biased.
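
As a minimal sketch of this idea (not required for the assignment), the loop below permutes one column at a time on a held-out set and records the drop in accuracy. The synthetic data, where only column 0 determines the label, and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: only column 0 determines the label, the other columns are noise
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Baseline accuracy on held-out data
baseline = accuracy_score(y_val, model.predict(X_val))

importances = np.zeros(X.shape[1])
for col in range(X.shape[1]):
    X_perm = X_val.copy()
    # Shuffle a single column to break its relationship with the label
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    importances[col] = baseline - accuracy_score(y_val, model.predict(X_perm))

print("baseline accuracy:", baseline)
print("accuracy drop per column:", importances)
```

The model is trained once and only the validation data is permuted, so the cost is one extra prediction pass per feature. scikit-learn also ships a ready-made version as `sklearn.inspection.permutation_importance`, which repeats the shuffling several times and reports the mean and standard deviation.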

---
The most common mechanism to compute feature importances, and the one used in scikit-learn's RandomForestClassifier and RandomForestRegressor, is the mean decrease in impurity (or Gini importance) mechanism.