# Overview
The purpose of this notebook is to train and evaluate several configurations of the same model type.

## Experiment
The current experiment is intended to demonstrate the efficacy of different methods for determining feature importance.  First we will train and evaluate a baseline Random Forest classifier with all available features.  Then we will use multiple methods to determine the most important features for this model.  Finally we will train a series of experimental Random Forest classifiers, each ignoring some of the available features.  The ignored features will be the ones previously determined to be the most important.

### Hypothesis
Given a baseline classifier and full feature set, the performance (Precision-Recall metrics) of a similar model will be worse if the most important features are excluded.  Furthermore, one of the feature importance algorithms will do a better job than other algorithms at selecting the features for which model performance will be most degraded upon their removal.
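The comparison described in the hypothesis could be sketched as follows. This is a minimal illustration, not the notebook's actual experiment: the data comes from `make_classification` as a stand-in, the helper `fit_and_score` and the choice of dropping the top 3 features are assumptions made for the example, and macro-averaged F1 is used only as one convenient single-number summary of the precision-recall metrics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the notebook loads its own data set).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)


def fit_and_score(columns):
    """Hypothetical helper: train on `columns`, return (model, macro F1 on test)."""
    clf = RandomForestClassifier(random_state=100, class_weight="balanced")
    clf.fit(X_train[columns], y_train)
    score = f1_score(y_test, clf.predict(X_test[columns]), average="macro")
    return clf, score


all_columns = list(X.columns)
baseline_clf, baseline_f1 = fit_and_score(all_columns)

# Rank features by the baseline model's impurity-based importances.
ranking = pd.Series(baseline_clf.feature_importances_, index=all_columns)
top_k = ranking.sort_values(ascending=False).index[:3].tolist()

# Retrain without the top-ranked features and compare against the baseline.
remaining = [c for c in all_columns if c not in top_k]
_, ablated_f1 = fit_and_score(remaining)

print(f"baseline macro F1: {baseline_f1:.3f}")
print(f"macro F1 without {top_k}: {ablated_f1:.3f}")
```

Repeating the last step with the ranking produced by each importance method would give one ablated score per method, which is the quantity the second half of the hypothesis compares.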

# Bootstrap by using helper notebook to load modeling data set

In [1]:
%run 02_modeling_data_preparation.ipynb

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    object 
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    float64
 3   fuel-type          205 non-null    float64
 4   aspiration         205 non-null    float64
 5   num-of-doors       203 non-null    float64
 6   body-style         205 non-null    float64
 7   drive-wheels       205 non-null    float64
 8   engine-location    205 non-null    float64
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    float64
 14  engine-type        205 non-null    float64
 15  num-of-cylinders   205 non-null    float64
 16  engine-size        205 non

## Show all symbols in scope after execution of bootstrap notebook

In [2]:
for symbol in dir():
    print(symbol)

DEFAULT_FIGSIZE
In
Out
X_test
X_train
_
__
___
__builtin__
__builtins__
__doc__
__loader__
__name__
__package__
__spec__
_dh
_i
_i1
_i2
_ih
_ii
_iii
_oh
all_cols
col
data_values
exit
feature_cols
fetch_openml
full_data
full_df
get_ipython
ignore_feature_cols
min_examples_count
modeling_df
np
pd
quit
response_col
response_value_counts
train_test_split
valid_response_values
y_test
y_train


# Train and Evaluate the Baseline Classifier

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import precision_recall_fscore_support

In [4]:
clf = RandomForestClassifier(random_state=100, class_weight='balanced')
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)

In [5]:
def evaluate_RF(clf, X_test, y_test):
    """Print per-class precision, recall, F1 and support for clf on the test set."""
    y_pred = clf.predict(X_test)

    # List the classes present in y_test, most frequent first.
    class_labels = y_test.value_counts().index.tolist()
    precision, recall, f1_score, support = precision_recall_fscore_support(
        y_test, y_pred, labels=class_labels)
    pr_df = pd.DataFrame({
        'class': class_labels,
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score,
        'support': support}).set_index('class')
    print(pr_df)

In [6]:
evaluate_RF(clf, X_test, y_test)

       precision    recall  f1_score  support
class                                        
19.0    0.909091  1.000000  0.952381       10
12.0    1.000000  1.000000  1.000000        6
5.0     1.000000  1.000000  1.000000        4
20.0    1.000000  1.000000  1.000000        4
13.0    1.000000  0.750000  0.857143        4
21.0    0.800000  1.000000  0.888889        4
11.0    0.750000  0.750000  0.750000        4
8.0     1.000000  0.750000  0.857143        4
18.0    1.000000  1.000000  1.000000        4
4.0     0.666667  0.666667  0.666667        3
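A per-class table like the one above is hard to compare across several models. One possible way to condense it, shown here as a sketch rather than as part of the original notebook, is to ask `precision_recall_fscore_support` for macro averages. The helper name `summarize_macro` and the toy labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def summarize_macro(y_true, y_pred):
    """Hypothetical helper: unweighted mean of per-class precision, recall and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Toy labels: class 1 is missed once and predicted as class 2 instead.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

summary = summarize_macro(y_true, y_pred)
print(summary)
```

In the notebook this could be called as `summarize_macro(y_test, clf.predict(X_test))`. Macro averaging weights every class equally, which seems a reasonable match for a model trained with `class_weight='balanced'`, though a weighted average is an equally valid choice.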


In [7]:
def get_default_feature_importance(clf, X_train):
    """Print the impurity-based feature importances of clf, largest first."""
    feature_importance_df = pd.DataFrame({
        'feature_name': X_train.columns,
        'importance': clf.feature_importances_})
    print(feature_importance_df.sort_values('importance', ascending=False))

In [8]:
get_default_feature_importance(clf, X_train)

         feature_name  importance
11             stroke    0.111848
7               width    0.086463
18           peak-rpm    0.080046
2              height    0.079216
9          wheel-base    0.074221
13        curb-weight    0.072357
16               bore    0.070489
3              length    0.069574
17        engine-size    0.059706
15        engine-type    0.059583
6         fuel-system    0.054054
0   compression-ratio    0.042041
21         horsepower    0.041399
20        highway-mpg    0.025147
12           city-mpg    0.024443
8        drive-wheels    0.019452
19       num-of-doors    0.009727
10         body-style    0.009712
14         aspiration    0.004058
1    num-of-cylinders    0.003849
4           fuel-type    0.002615
5     engine-location    0.000000


In [10]:
def get_permutation_feature_importance(clf, X_test, y_test):
    """Print the mean permutation importance of each feature on the test set, largest first."""
    result = permutation_importance(clf, X_test, y_test, random_state=100)
    feature_importance_df = pd.DataFrame({
        'feature_name': X_test.columns,
        'importance': result['importances_mean']})
    print(feature_importance_df.sort_values('importance', ascending=False))

In [11]:
get_permutation_feature_importance(clf, X_test, y_test)

         feature_name  importance
6         fuel-system    0.021277
7               width    0.021277
9          wheel-base    0.017021
11             stroke    0.017021
21         horsepower    0.008511
5     engine-location    0.000000
4           fuel-type    0.000000
20        highway-mpg    0.000000
19       num-of-doors    0.000000
17        engine-size    0.000000
16               bore    0.000000
15        engine-type    0.000000
14         aspiration    0.000000
13        curb-weight    0.000000
12           city-mpg    0.000000
1    num-of-cylinders    0.000000
10         body-style    0.000000
8        drive-wheels    0.000000
3              length    0.000000
0   compression-ratio    0.000000
2              height   -0.004255
18           peak-rpm   -0.008511
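The values above all appear to be multiples of 1/235. That is consistent with 47 test rows (the sum of the support column in the evaluation table), the default `n_repeats=5`, and accuracy as the default score, so a single flipped prediction in a single repeat moves the mean by about 0.004. Small positive or negative values may therefore be noise. A sketch of one way to check this, using synthetic stand-in data rather than the notebook's data set, is to raise `n_repeats` and report `importances_std` next to the mean. The value 30 is an arbitrary choice for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the notebook's data set).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=100).fit(X_train, y_train)

# More repeats give a steadier mean and a usable spread estimate per feature.
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=30, random_state=100)

importance_df = pd.DataFrame({
    "feature_name": X_test.columns,
    "importance_mean": result["importances_mean"],
    "importance_std": result["importances_std"],
}).sort_values("importance_mean", ascending=False)
print(importance_df)
```

A feature whose mean importance is smaller than its standard deviation is hard to distinguish from zero, which is worth keeping in mind before treating the ordering of the lower rows as meaningful.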
