<a href="https://colab.research.google.com/github/cbsobral/ml-fies/blob/main/Module01_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Module 02c - Random Forest fine-tuning

In this module, we perform the following steps:

1. Load the data from Mod_00 and execute pipelines;
2. Perform cross-validation for hyperparameter tuning and model selection;
3. Evaluate performance on oversampled and undersampled data;
4. Save the best model(s);
5. Evaluate feature importance.



Note: Because of the size of our data set and the computing resources available, cross-validation for the random forest model took considerable time. For example, a randomized search with 3 sampled parameter combinations and 4 folds took up to 7 hours. Results generally improve as more parameter combinations are sampled, but given the time constraints we could only cross-validate a reduced set of hyperparameter candidates. Some configurations were tested individually when cross-validation was not feasible. Nevertheless, the final model improved on the initial one.
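As a rough illustration of where the time goes, a cross-validated search fits one model per candidate per fold (plus, by default, one final refit on the full training set). The sketch below only does the arithmetic. `n_fits` is a hypothetical helper, the grid sizes are those of the randomized-search grid defined in Section 3, and dividing the 7 hours evenly across fits is a crude assumption that ignores parallel workers and the fact that larger forests take longer.

```python
def n_fits(n_candidates, n_folds, refit=True):
    """Number of model fits in a cross-validated search (hypothetical helper)."""
    return n_candidates * n_folds + (1 if refit else 0)

# Randomized search reported above: 3 sampled candidates, 4 folds
cv_fits = n_fits(3, 4, refit=False)
hours_per_fit = 7 / cv_fits  # crude average, assumes equal cost per fit

# Size of the full grid the candidates are sampled from:
# n_estimators (10) x max_features (2) x max_depth (12) x min_samples_split (3)
# x min_samples_leaf (3) x bootstrap (2) x criterion (2)
grid_size = 10 * 2 * 12 * 3 * 3 * 2 * 2

print(cv_fits, round(hours_per_fit, 2), grid_size)
```

Sampling 3 out of several thousand combinations covers only a very small part of the search space, which is why individual configurations were also tried by hand.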

### 1 - Load Data

Here, we import the training and testing sets created in Module00_Data. 


In [None]:
import pandas as pd

url_train = "https://drive.google.com/file/d/1IP7jyXkLgD_Ouy5cL6fJk4VUA5qRB2PK/view?usp=sharing"
path_train = "https://drive.google.com/uc?export=download&id="+url_train.split("/")[-2]
train = pd.read_csv(path_train)
train.shape

(351001, 31)

In [None]:
url_test = "https://drive.google.com/file/d/1v4FqKwt7NzG5RM6d9f1y7CLIdKq69jSS/view?usp=sharing"
path_test = "https://drive.google.com/uc?export=download&id="+url_test.split("/")[-2]
test = pd.read_csv(path_test)
test.shape

(87751, 31)

In [None]:
train_set = train.drop("default", axis=1) # drop targets for training set
train_target = train["default"].copy()

In [None]:
test_set = test.drop("default", axis=1) # drop targets for test set
test_target = test["default"].copy()

### 2 - Pipeline

The pipeline contains the transformers used to prepare the dataset. Numeric attributes are imputed with the median and standardized by the StandardScaler. Ordinal attributes are imputed with the most frequent value and encoded by the OrdinalEncoder. Categorical attributes are encoded by the OneHotEncoder.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([("num_imputer", SimpleImputer(strategy="median")),
                         ("std_scaler", StandardScaler()), 
                         ])

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

ord_pipeline = Pipeline([("ord_imputer", SimpleImputer(strategy="most_frequent")),
                         ("ord_encoder", OrdinalEncoder()),
                         ])

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline(steps=[('one_hot', OneHotEncoder())])

In [None]:
ord_attribs = ["igc","date_contract"] # 2 attributes

num_attribs = ["family_income",   #17
               "personal_income",
               "high_school_endyear",
               "n_sem_course",
               "n_completed_sem",
               "sem_funded",
               "fam_size",
               "income_pc",
               "tuition_current",
               "inc_prop",
               "perc_requested",
               "loan_value_sem",
               "student_resource",
               "loan_value",
               "loan_limit",
               "total_debt",
               "age"]
  

cat_attribs = ["semester_enroll",  #9
               "gender",
               "occupation", 
               "marital_status",
               "ethnicity", 
               "public_hs", 
               "state_course", 
               "degree", 
               "contract_phase"]

In [None]:
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
        ("ord", ord_pipeline, ord_attribs)
        ])

In [None]:
train_prepared = full_pipeline.fit_transform(train_set)
train_prepared[:1]

<1x94 sparse matrix of type '<class 'numpy.float64'>'
	with 28 stored elements in Compressed Sparse Row format>

In [None]:
# Reuse the transformers already fitted on the training set (no refitting on test data)
test_prepared = full_pipeline.transform(test_set)
test_prepared[:1]

<1x94 sparse matrix of type '<class 'numpy.float64'>'
	with 28 stored elements in Compressed Sparse Row format>
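A toy sketch of the fit-on-train, transform-on-test pattern may help here. The data frames and column names below are made up for illustration and are not the FIES data. Calling `fit_transform` on the training set learns the scaling statistics and the category lists; calling `transform` on the test set reuses them, so both matrices share the same columns. Refitting on the test set can silently change the encoding, as the last lines show.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: the toy test set happens to contain only one gender value
toy_train = pd.DataFrame({"income": [1.0, 2.0, 3.0], "gender": ["F", "M", "F"]})
toy_test = pd.DataFrame({"income": [2.0, 4.0], "gender": ["M", "M"]})

prep = ColumnTransformer([
    ("num", StandardScaler(), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])

train_X = prep.fit_transform(toy_train)  # learn mean/std and categories from train
test_X = prep.transform(toy_test)        # reuse them on test

# Refitting an unfitted copy on the test set learns a different encoding
refit_X = clone(prep).fit_transform(toy_test)

print(train_X.shape, test_X.shape, refit_X.shape)
```

In the refitted copy the one-hot block has a single column, so its columns no longer line up with what a model trained on `train_X` expects.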

### 3 - Hyperparameter Tuning


Randomized search over hyperparameters:

In [None]:
# Time to run: +- 7h

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
# ('auto' meant sqrt(n_features) for classifiers; it was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Criterion
criterion = ['gini', 'entropy']

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}

rf = RandomForestClassifier(random_state=42)

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 3, cv = 4, verbose=2, random_state=42, n_jobs = -1)

rf_random.fit(train_prepared, train_target)

Best parameters:
{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': 50,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 5,
 'n_estimators': 200}
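Because no `scoring` argument is passed, the search above ranks candidates by the estimator's default score, which is accuracy for classifiers, while the results in this module are reported as AUC. A minimal sketch of a search that ranks by AUC instead is shown below. It runs on small synthetic data from `make_classification`, and the tiny grid is an assumption chosen only so the example finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared training data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5, None]},
    n_iter=3,
    cv=3,
    scoring="roc_auc",  # rank candidates by cross-validated AUC
    random_state=42,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

With `scoring="roc_auc"`, `best_score_` is the mean cross-validated AUC of the selected candidate, which is directly comparable to the test AUC values reported later.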

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'bootstrap': [False],
    'max_depth': [40, 60],
    'max_features': ["auto"],
    'min_samples_leaf': [5],
    'min_samples_split': [8],
    'n_estimators': [150, 300]}


rf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(rf,param_grid=param_grid,cv=3)
grid_search.fit(train_prepared, train_target)
grid_search.cv_results_['params']
grid_search.cv_results_['mean_test_score']

Best parameters
{'bootstrap': False,
 'max_depth': 60,
 'max_features': 'auto',
 'min_samples_leaf': 5,
 'min_samples_split': 8,
 'n_estimators': 300}
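In a notebook cell only the last expression is displayed, so the parameter list and the mean scores are easier to read side by side in a single table. The sketch below shows one way to do that with `cv_results_`. It uses synthetic data and a small illustrative grid, not the grid above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 4], "n_estimators": [10, 20]},
    cv=3,
)
grid.fit(X, y)

# One row per candidate, best-ranked first
results = (
    pd.DataFrame(grid.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
```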


In [None]:
from sklearn.ensemble import RandomForestClassifier


rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 50, max_features = 'auto', min_samples_leaf= 4, 
                            min_samples_split = 5, n_estimators = 200, random_state=42)
rf.fit(train_prepared, train_target)

KeyboardInterrupt: ignored

Result 0.790198064967537 AUC

In [None]:
rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 60, max_features = 'auto', min_samples_leaf= 5, 
                            min_samples_split = 8, n_estimators = 300, random_state=42)
rf.fit(train_prepared, train_target)

Result 0.7896261537831428 AUC

In [None]:
rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 50, max_features = 'auto', min_samples_leaf= 4, 
                            min_samples_split = 5, n_estimators = 250, random_state=42)
rf.fit(train_prepared, train_target)

Result 0.7905031207466537 AUC

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 50, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)
rf.fit(train_prepared, train_target)

RandomForestClassifier(bootstrap=False, criterion='entropy', max_depth=50,
                       min_samples_leaf=4, min_samples_split=5,
                       n_estimators=300, random_state=42)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.7905609772221769

In [None]:
loss_rf

0.18172957094162634

 Result 0.7905609772221769 AUC
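For readers less familiar with the two metrics, the following sketch computes them by hand on four made-up predictions and checks the result against scikit-learn. The Brier score is the mean squared difference between the predicted probability and the 0/1 outcome (lower is better). The AUC is the fraction of (positive, negative) pairs in which the positive case receives the higher score (higher is better). The pairwise formula below ignores ties, which do not occur in this toy example.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# Made-up labels and predicted probabilities of default
y_true = np.array([0, 0, 1, 1])
p_hat = np.array([0.1, 0.4, 0.35, 0.8])

# Brier score: mean squared error of the predicted probabilities
brier_manual = np.mean((p_hat - y_true) ** 2)

# AUC: share of positive/negative pairs ranked in the right order
pos, neg = p_hat[y_true == 1], p_hat[y_true == 0]
auc_manual = np.mean(pos[:, None] > neg[None, :])

print(brier_manual, brier_score_loss(y_true, p_hat))
print(auc_manual, roc_auc_score(y_true, p_hat))
```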

Testing if ExtraTreesClassifier() looks promising:

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

rf_extra = ExtraTreesClassifier()
rf_extra.fit(train_prepared, train_target)

Result: 0.7464396156070683 

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = True, criterion = 'entropy', max_depth = 50, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42, class_weight='balanced')
rf.fit(train_prepared, train_target)



RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='entropy', max_depth=50, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf


0.7862866965536888

In [None]:
loss_rf

0.18599798529566028
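The `class_weight='balanced'` option used above reweights each class by `n_samples / (n_classes * count_of_that_class)`, so errors on the rarer class cost more. A small sketch with made-up labels (8 negatives, 2 positives) shows the resulting weights and compares them with scikit-learn's own helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 8 non-defaults, 2 defaults
y = np.array([0] * 8 + [1] * 2)
classes = np.array([0, 1])

# Balanced weights: n_samples / (n_classes * class counts)
manual = len(y) / (len(classes) * np.bincount(y))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual, from_sklearn)
```

Here the minority class gets four times the weight of the majority class, mirroring the 8:2 imbalance.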

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(criterion = 'entropy', max_depth = 50, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)
brf.fit(train_prepared, train_target)

BalancedRandomForestClassifier(criterion='entropy', max_depth=50,
                               min_samples_leaf=4, min_samples_split=5,
                               n_estimators=300, random_state=42)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_brf = brf.predict_proba(test_prepared)

auc_brf = roc_auc_score(test_target, pred_brf[:,1])
loss_brf = brier_score_loss(test_target, pred_brf[:,1])

auc_brf

0.7852046606840751

In [None]:
loss_brf

0.1921610807966289

In [None]:
#Best model for the original data:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 55, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)
rf.fit(train_prepared, train_target)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=55, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.790575936333656

In [None]:
loss_rf

0.1817306326853902

Saving the best model for the original data:

In [None]:
import joblib

filename = 'BestRF.sav'
joblib.dump(rf, filename, compress=3)


['BestRF.sav']
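A quick way to check that a saved model can be restored is a round trip: dump, load, and compare predictions. The sketch below does this with a tiny forest on synthetic data and a temporary directory. The file name `toy_rf.joblib` is arbitrary and chosen for this example only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_rf.joblib")
    joblib.dump(model, path, compress=3)  # same compression level as above
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict_proba(X), restored.predict_proba(X))
print(same)
```

Models saved this way should be loaded with a scikit-learn version compatible with the one used for training.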

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 60, max_features = 'auto', min_samples_leaf= 5, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)
rf.fit(train_prepared, train_target)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=60, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.7896261537831428


In [None]:
#Testing Bootstrap:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = True, criterion = 'entropy', max_depth = 55, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)
rf.fit(train_prepared, train_target)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=55, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.7867739634598249

In [None]:
#Testing min_samples_leaf = 5:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 55, max_features = 'auto', min_samples_leaf= 5, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)
rf.fit(train_prepared, train_target)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=55, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.7896865466877714

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

rus = RandomUnderSampler(random_state=42)
data_under, target_under = rus.fit_resample(train_prepared, train_target)

ros = RandomOverSampler(random_state=42)
data_over, target_over = ros.fit_resample(train_prepared, train_target)
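Conceptually, with their default settings and a binary target, random undersampling drops majority-class rows until both classes have the minority count, and random oversampling duplicates minority-class rows until both have the majority count. The NumPy-only sketch below imitates that idea on made-up labels. It is an illustration of the technique, not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up imbalanced target: 90 non-defaults, 10 defaults
y = np.array([0] * 90 + [1] * 10)
idx_major = np.flatnonzero(y == 0)
idx_minor = np.flatnonzero(y == 1)

# Undersampling: keep a random subset of the majority class (no replacement)
under_idx = np.concatenate(
    [rng.choice(idx_major, size=len(idx_minor), replace=False), idx_minor]
)

# Oversampling: draw minority rows with replacement until the classes match
over_idx = np.concatenate(
    [idx_major, rng.choice(idx_minor, size=len(idx_major), replace=True)]
)

print(np.bincount(y[under_idx]), np.bincount(y[over_idx]))
```

Only the training data is resampled; the test set keeps its original class distribution so that the reported metrics reflect the real imbalance.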



In [None]:
#Testing Best Model with Undersampling:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 55, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)

rf.fit(data_under, target_under)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=55, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.7894275007426411

In [None]:


from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Criterion
criterion = ['gini', 'entropy']

random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'criterion': criterion}

rf = RandomForestClassifier(random_state=42)

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 3, cv = 3, verbose=2, random_state=42, n_jobs = -1)

rf_random.fit(data_under, target_under)

rf_random.best_params_

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed: 326.1min finished


{'criterion': 'entropy',
 'max_depth': 30,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 1000}

In [None]:
#Testing Best Model with Undersampling (after Random search): it took 9 hours to run because of high n_estimators

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 30, max_features = 'auto', min_samples_leaf= 2, 
                      min_samples_split = 2, n_estimators = 1000, random_state=42)

rf.fit(data_under, target_under)

from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf



0.7916932434176123

In [None]:
#Best Model with Undersampling with reasonable n_estimators that allows it to be trained in less than 3 hours

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 30, max_features = 'auto', min_samples_leaf= 2, 
                      min_samples_split = 2, n_estimators = 300, random_state=42)

rf.fit(data_under, target_under)

from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf_under = roc_auc_score(test_target, pred_rf[:,1])
loss_rf_under = brier_score_loss(test_target, pred_rf[:,1])

print(auc_rf_under, loss_rf_under)

0.7901069885052406 0.18942277854762332


In [None]:
print(auc_rf, loss_rf)

0.7916932434176123 0.18894090843081712


In [None]:
#Testing Best Model with Oversampling 

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 55, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)

rf.fit(data_over, target_over)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=55, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = rf.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])

auc_rf

0.7911611214789033

Saving the best model for oversampling:

In [None]:
import joblib
joblib.dump(rf, "RF_over.joblib", compress=3)

['RF_over.joblib']

### 4 - Evaluation of best models 

In [None]:
import joblib

RF_over = joblib.load('RF_over.joblib')
RF = joblib.load('BestRF.sav')

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf = RF.predict_proba(test_prepared)

auc_rf = roc_auc_score(test_target, pred_rf[:,1])
loss_rf = brier_score_loss(test_target, pred_rf[:,1])


In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss

pred_rf_over = RF_over.predict_proba(test_prepared)

auc_rf_over = roc_auc_score(test_target, pred_rf_over[:,1])
loss_rf_over = brier_score_loss(test_target, pred_rf_over[:,1])

In [None]:
auc_rf = 0.790576
auc_rf_over = 0.791161
auc_rf_under = 0.790106

loss_rf = 0.181731
loss_rf_over = 0.182321
loss_rf_under = 0.189422

# List with AUC scores
auc_list = [auc_rf, auc_rf_over, auc_rf_under]

# List with Brier Scores
bs_list = [loss_rf, loss_rf_over, loss_rf_under]

# List with model names
names_list = ['Original','Oversampling', 'Undersampling']

# Dataframe 
auc_df = pd.DataFrame({"Model": names_list, "AUC": auc_list, "BS": bs_list})
auc_df.sort_values(by = "AUC", ascending=False)

| | Model | AUC | BS |
|---|---|---|---|
| 1 | Oversampling | 0.791161 | 0.182321 |
| 0 | Original | 0.790576 | 0.181731 |
| 2 | Undersampling | 0.790106 | 0.189422 |


### 5 - Feature Importance

The following helper, which extracts the feature names from the fitted pipeline, was developed by Johannes Haupt:

https://johaupt.github.io/

In [None]:
def get_feature_names(column_transformer):
    """Get feature names from all transformers.
    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    # Remove the internal helper function
    #check_is_fitted(column_transformer)
    
    # Turn lookup into function for better handling with pipeline later
    def get_names(trans):
        # >> Original get_feature_names() method
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            return []
        if trans == 'passthrough':
            if hasattr(column_transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    return column
                else:
                    return column_transformer._df_columns[column]
            else:
                indices = np.arange(column_transformer._n_features)
                return ['x%d' % i for i in indices[column]]
        if not hasattr(trans, 'get_feature_names'):
        # >>> Change: Return input column names if no method available
            # Turn error into a warning
            warnings.warn("Transformer %s (type %s) does not "
                                 "provide get_feature_names. "
                                 "Will return input column names if available"
                                 % (str(name), type(trans).__name__))
            # For transformers without a get_feature_names method, use the input
            # names to the column transformer
            if column is None:
                return []
            else:
                return [name + "__" + f for f in column]

        return [name + "__" + f for f in trans.get_feature_names()]
    
    ### Start of processing
    feature_names = []
    
    # Allow transformers to be pipelines. Pipeline steps are named differently, so preprocessing is needed
    if type(column_transformer) == sklearn.pipeline.Pipeline:
        l_transformers = [(name, trans, None, None) for step, name, trans in column_transformer._iter()]
    else:
        # For column transformers, follow the original method
        l_transformers = list(column_transformer._iter(fitted=True))
    
    
    for name, trans, column, _ in l_transformers: 
        if type(trans) == sklearn.pipeline.Pipeline:
            # Recursive call on pipeline
            _names = get_feature_names(trans)
            # if pipeline has no transformer that returns names
            if len(_names)==0:
                _names = [name + "__" + f for f in column]
            feature_names.extend(_names)
        else:
            feature_names.extend(get_names(trans))
    
    return feature_names

In [None]:
import warnings
import sklearn
import pandas as pd
feature_names =get_feature_names(full_pipeline)



In [None]:
%pip install eli5
import eli5
from eli5.sklearn import PermutationImportance


# convert to dense array
test_prep_dense = test_prepared.toarray()

perm = PermutationImportance(RF, random_state=42).fit(test_prep_dense, test_target)

eli5.show_weights(perm, feature_names = feature_names)

In [None]:
eli5.explain_weights(perm, top=100, feature_names = feature_names)

| Weight | Feature |
|---|---|
| 0.0704 ± 0.0009 | num__total_debt |
| 0.0144 ± 0.0008 | num__loan_value |
| 0.0132 ± 0.0014 | num__loan_limit |
| 0.0121 ± 0.0007 | num__n_sem_course |
| 0.0120 ± 0.0017 | num__sem_funded |
| 0.0087 ± 0.0011 | ord__date_contract |
| 0.0087 ± 0.0006 | num__n_completed_sem |
| 0.0067 ± 0.0017 | num__family_income |
| 0.0063 ± 0.0003 | num__student_resource |
| 0.0059 ± 0.0012 | num__income_pc |
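Permutation importance measures how much a score drops when the values of one feature are shuffled, which breaks its relationship with the target. The weights above come from eli5. As an alternative sketch of the same idea, scikit-learn ships `sklearn.inspection.permutation_importance`, shown here on synthetic data in which only the first two columns are informative. The choice of `scoring="roc_auc"` and `n_repeats=5` is ours and is not taken from the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in columns 0 and 1
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on held-out data and record the drop in AUC
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=5, random_state=42)

ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"x{i}: {result.importances_mean[i]:.4f} ± {result.importances_std[i]:.4f}")
```

As in the table above, each row reports the mean drop in score and its variation across repeated shuffles.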


### 6 - Conclusion:

The best random forest model, trained on the oversampled data set, is the following:

rf = RandomForestClassifier( bootstrap = False, criterion = 'entropy', max_depth = 55, max_features = 'auto', min_samples_leaf= 4, 
                      min_samples_split = 5, n_estimators = 300, random_state=42)