# Churn Tree Classification Template

## Notebook for Decision Tree/Random Forest/Gradient Boost

**=================================================================================================================**

## Project Description

In this challenge, we tackle churn prediction for the subscribers of a video streaming service.

Imagine that you are a new data scientist at this video streaming company, tasked with building a model that predicts which existing subscribers will continue their subscriptions for another month. The provided dataset is a sample of subscriptions that were initiated in 2021, each snapshotted at a particular date before the subscription was cancelled. Subscriptions are cancelled for many reasons, including:
* the customer completes all the content they were interested in and no longer needs the subscription
* the customer finds themselves to be too busy and cancels their subscription until a later time
* the customer determines that the streaming service is not the best fit for them, so they cancel and look for something better suited


## Data Tasks

### 1) Understand the shape of the data (Histograms, box plots, etc.)

### 2) Data Cleaning 

### 3) Data Exploration

### 4) Feature Engineering 

### 5) Data Preprocessing for Model

### 6) Basic Model Building 

### 7) Model Tuning 

### 8) Ensemble Model Building 

### 9) Results 

**=================================================================================================================**

## Import Libraries

In [None]:
import numpy as np
#from numpy import count_nonzero, median, mean
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

import datetime
from datetime import datetime, timedelta, date

import scipy
from scipy import stats
from scipy.stats import zscore
from collections import Counter

import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures, RobustScaler, Binarizer, OrdinalEncoder

from sklearn.compose import make_column_transformer, ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import set_config

set_config(transform_output="pandas")

from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score

# from sklearn.feature_selection import f_classif, chi2, RFE, RFECV
# from sklearn.feature_selection import mutual_info_classif
# from sklearn.feature_selection import VarianceThreshold, GenericUnivariateSelect
# from sklearn.feature_selection import SelectFromModel, SelectKBest, SelectPercentile

from sklearn.inspection import permutation_importance

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay 
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, accuracy_score

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier


import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance

import feature_engine

from feature_engine.selection import (DropConstantFeatures, DropDuplicateFeatures, 
                                      DropCorrelatedFeatures, SmartCorrelatedSelection)
from feature_engine.selection import SelectBySingleFeaturePerformance, SelectByShuffling, RecursiveFeatureElimination
from feature_engine.selection import RecursiveFeatureAddition

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# sns.set() resets the style, so set the style and font scale in one call
sns.set_theme(style='dark', font_scale=1.2)

plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

font = {'family' : 'monospace',
          'weight' : 'bold',
          'size'   : '20'}
plt.rc('font' , **font)

import warnings
warnings.filterwarnings('ignore')

# import pickle
# from pickle import dump, load

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',100)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

# Ensure results are reproducible
random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

In [None]:
data_descriptions = pd.read_csv('data_descriptions.csv')
data_descriptions

**=================================================================================================================**

## Data Quick Glance

In [None]:
df_train = pd.read_csv("dftrain.csv")

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.dtypes.value_counts()

In [None]:
# Descriptive Statistical Analysis
df_train.describe(include="all")

In [None]:
# Descriptive Statistical Analysis
df_train.describe(include=["int", "float"])

In [None]:
# Descriptive Statistical Analysis
df_train.describe(include="object")

In [None]:
df_train.shape

In [None]:
df_train.columns

In [None]:
# Check class balance
df_train['churn'].value_counts().to_frame()

In [None]:
df_train['churn'].value_counts(normalize=True).to_frame()

In [None]:
df_train.isnull().sum()

In [None]:
df_train.duplicated().sum()

**=================================================================================================================**

## Load Test Set

In [None]:
df_test = pd.read_csv("dftestnolabel.csv")

In [None]:
df_test.head()

In [None]:
#preprocessor.transform(df_test)

In [None]:
#testdata = preprocessor.transform(df_test)

**==================================================================================================================**

In [None]:
# Restrict to numeric columns; object columns raise an error in pandas 2.x
df_train.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(16,9))
sns.heatmap(df_train.corr(numeric_only=True),cmap="coolwarm",annot=True,fmt='.2f',linewidths=2)
plt.title("Correlation Heatmap", fontsize=20)
plt.show()

**==================================================================================================================**

## Create a small dataset for model training

In [None]:
#df = df.sample(frac=0.15, random_state=0)

In [None]:
#df.reset_index(drop=True, inplace=True)

In [None]:
#df.head()

In [None]:
#df.shape

**=================================================================================================================**

## Train Test Split

In [None]:
df_train.shape

In [None]:
df_train.head(1)

In [None]:
X = df_train.iloc[:,0:19]
y = df_train.iloc[:,19]

In [None]:
X.values, y.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
Counter(y_train),Counter(y_test)

In [None]:
counter = Counter(y_train)
print(counter[0])
print(counter[1])

In [None]:
# estimate scale_pos_weight value
estimate = counter[0] / counter[1]
estimate

In [None]:
scale_pos_weight = 4.54

**=================================================================================================================**

# Model Training

## Data Pipelines

Data pipelines simplify the steps of processing the data. We use scikit-learn's `Pipeline` class to create one.
`Pipeline` chains multiple steps together: every intermediate step is a transformer with `fit` and `transform` methods, and the final step is an estimator that only needs `fit`.

### Combine multiple processing steps into a `Pipeline`

A pipeline contains a series of steps, where each step is a tuple `("name of step", estimator)`. The name string identifies the step and serves as the prefix when you set that step's parameters, for example `decisiontree__max_depth`.
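As a minimal sketch of how step names work, the example below builds a two-step pipeline on synthetic data. The `X_demo`, `y_demo`, and `demo_pipeline` names are illustrative assumptions and are not part of this notebook's churn data.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("minmax", MinMaxScaler()),                                # transformer: fit + transform
    ("decisiontree", DecisionTreeClassifier(random_state=0)),  # final estimator: fit
])

# "<step name>__<parameter>" addresses a parameter of one step
demo_pipeline.set_params(decisiontree__max_depth=3)
demo_pipeline.fit(X_demo, y_demo)

# Fitted steps are reachable by name
fitted_depth = demo_pipeline.named_steps["decisiontree"].get_depth()
demo_predictions = demo_pipeline.predict(X_demo)
print(fitted_depth, demo_predictions[:5])
```

The same `step__parameter` naming is what the hyperparameter grids later in this notebook rely on.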

In [None]:
# Declare preprocessing functions

#imp = SimpleImputer(missing_values=np.nan, strategy='mean')
#ohe = OneHotEncoder()
#oe = OrdinalEncoder()
#ss = StandardScaler()
#mm = MinMaxScaler()

In [None]:
list(df_train.select_dtypes(include=["int64","float64"]))

In [None]:
list(df_train.select_dtypes(include=["bool","object"]))

In [None]:
dropcols = []

In [None]:
numcols = ['accountage', 'monthlycharges', 'totalcharges', 'viewinghoursperweek', 'averageviewingduration',
 'contentdownloadspermonth', 'userrating', 'supportticketspermonth', 'watchlistsize']

In [None]:
catcols = ['subscriptiontype', 'paymentmethod', 'paperlessbilling', 'contenttype', 'multideviceaccess', 'deviceregistered',
 'genrepreference', 'gender', 'parentalcontrol', 'subtitlesenabled']

In [None]:
# We create the preprocessing pipelines for both
# numerical and categorical data


# transformers must be a list of (name, transformer, columns) tuples
drop_transformer = ColumnTransformer(transformers=[
                                    ("dropcolumns", "drop", dropcols)
                                    ])

numeric_transformer = Pipeline(steps=[
                             # ("scalar", StandardScaler()),
                              ("minmax", MinMaxScaler())
])

categorical_transformer = Pipeline(steps=[
                                  ("onehot", OneHotEncoder(sparse_output=False, drop='if_binary')),
    
                                  #("ordinal", OrdinalEncoder(categories='auto', handle_unknown="error"))
   
])

In [None]:
preprocessor = ColumnTransformer(
               transformers=[
                           ("dropcolumns", "drop", dropcols),
                           ("numerical", numeric_transformer, numcols),
                           ("categorical", categorical_transformer, catcols),
                   
                            ],
               remainder="passthrough",
               verbose_feature_names_out=False)

In [None]:
preprocessor = ColumnTransformer(
               transformers=[
                           ("dropcolumns", "drop", ["id"]),
                           ("numerical", numeric_transformer, numcols),
                           #("categorical", categorical_transformer, catcols),
                   
                            ],
               remainder="drop",
               verbose_feature_names_out=False)

In [None]:
# Check features transformation (Train Set)

preprocessor.fit_transform(X_train)

In [None]:
# Check features transformation (Test Set)

preprocessor.transform(X_test)

In [None]:
train = preprocessor.fit_transform(X_train)
train.describe()

**=================================================================================================================**

## Decision Tree Model (Baseline)

The `DecisionTreeClassifier` has many arguments (model hyperparameters) that can be tuned to change the fitted tree. Three of the most commonly tuned are:
- criterion: `gini` or `entropy`, the impurity measure used to choose the split at each tree node
- max_depth: the maximum depth of the tree. A deeper tree normally means a more complex model
- min_samples_leaf: the minimum number of samples required in a leaf node. Larger values tend to produce simpler trees
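
The effect of these hyperparameters can be sketched on synthetic data. The dataset and the specific values (`max_depth=3`, `min_samples_leaf=20`) are assumptions chosen for illustration, not tuned settings for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 rows, 8 numeric features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# Default settings: the tree keeps splitting until leaves are pure or cannot be split further
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Constrained tree: entropy criterion, at most 3 levels, at least 20 samples per leaf
constrained = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, min_samples_leaf=20, random_state=0
).fit(X_demo, y_demo)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())

# Smallest number of training samples in any leaf of the constrained tree
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = int(constrained.tree_.n_node_samples[is_leaf].min())
print("smallest leaf:", smallest_leaf)
```

Comparing the two printed lines shows how the constraints cap the depth and the number of leaves.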


In [None]:
dtpipeline = Pipeline(steps=[
                        ("preprocessor", preprocessor),
                        ("decisiontree", DecisionTreeClassifier(random_state=0))
                    
])

In [None]:
dtpipeline.fit(X_train, y_train)

In [None]:
dtpred = dtpipeline.predict(X_test)

In [None]:
dtpred[0:5]

In [None]:
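# predict() returns 1-D class labels; probabilities come from predict_proba()
# Extract the second column of probabilities (class 1) and rename it
dt_predicted_probability = dtpipeline.predict_proba(X_test)[:, 1]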

In [None]:
dt_predicted_probability
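
The difference between hard labels and class probabilities matters for the metric cells in this notebook. The sketch below illustrates it on synthetic, imbalanced data; the dataset, the `clf` model, and the 0.5 threshold are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, ytr)

hard_labels = clf.predict(Xte)       # shape (n_samples,), values taken from clf.classes_
proba = clf.predict_proba(Xte)       # shape (n_samples, n_classes), columns ordered like clf.classes_
positive_proba = proba[:, 1]         # probability of class 1

# Label-based metrics (accuracy, precision, recall, F1) need labels, so threshold the probability
labels_from_proba = (positive_proba > 0.5).astype(int)

# ROC-AUC is a ranking metric, so it is usually computed from the probabilities
auc_from_proba = roc_auc_score(yte, positive_proba)
auc_from_labels = roc_auc_score(yte, hard_labels)
print(round(auc_from_proba, 3), round(auc_from_labels, 3))
```

Passing hard labels to `roc_auc_score` evaluates a single operating point, which is why the two printed values generally differ.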

In [None]:
print("Decision Tree Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, dtpred))
print('Precision:', '%.3f' % precision_score(y_test, dtpred))
print('Recall:', '%.3f' % recall_score(y_test, dtpred))
print('F1 Score:', '%.3f' % f1_score(y_test, dtpred))
print('AUC score:', '%.3f' % roc_auc_score(y_test, dtpred))

In [None]:
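# Label-based metrics need class labels, so threshold the class-1 probability at 0.5
dt_pred_from_proba = (dt_predicted_probability >= 0.5).astype(int)

print("Decision Tree Classifier (from predicted probabilities)\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, dt_pred_from_proba))
print('Precision:', '%.3f' % precision_score(y_test, dt_pred_from_proba))
print('Recall:', '%.3f' % recall_score(y_test, dt_pred_from_proba))
print('F1 Score:', '%.3f' % f1_score(y_test, dt_pred_from_proba))
# ROC-AUC is computed from the probabilities themselves
print('AUC score:', '%.3f' % roc_auc_score(y_test, dt_predicted_probability))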

**=================================================================================================================**

## Decision Tree Model Evaluation

In [None]:
print(classification_report(y_test, dtpred))

In [None]:
dtcm = confusion_matrix(y_test, dtpred)
dtcm

In [None]:
plt.rcParams.update({'font.size': 15})

fig, ax = plt.subplots(figsize=(15,5))

ConfusionMatrixDisplay.from_estimator(estimator=dtpipeline, X=X_test, y=y_test, ax=ax, 
                                      labels=dtpipeline.classes_, cmap="viridis")

ax.set_title('Confusion matrix of the classifier', size=20)
ax.grid(False)

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

RocCurveDisplay.from_estimator(estimator=dtpipeline, X=X_test, y=y_test, ax=ax)
ax.set_title('ROC Curve of the classifier', size=20)
ax.grid(False)

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

PrecisionRecallDisplay.from_estimator(estimator=dtpipeline, X=X_test, y=y_test, ax=ax)
ax.set_title('Precision/Recall of the classifier', size=20)

plt.show()

**=================================================================================================================**

## K-Fold Validation (DT)

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
dtcv = cross_validate(estimator=dtpipeline, X=X_train, y=y_train, scoring="roc_auc", cv=skf, n_jobs=2, return_train_score=True)
dtcv

In [None]:
dtcv["train_score"].mean()

In [None]:
dtcv["test_score"].mean()

## Cross Validation Score

In [None]:
cv2 = cross_val_score(estimator=dtpipeline, X=X_train, y=y_train, scoring="roc_auc", cv=skf, n_jobs=2)

In [None]:
cv2

**=================================================================================================================**

In [None]:
### Plot Tree

In [None]:
dtpipeline.named_steps.decisiontree

In [None]:
plt.figure(figsize=(20,12))
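# The tree was fitted on the preprocessed columns, so take the feature names from the fitted preprocessor
tree_feature_names = list(dtpipeline.named_steps.preprocessor.get_feature_names_out())
plot_tree(dtpipeline.named_steps.decisiontree, max_depth=2, feature_names=tree_feature_names, class_names=['0','1'], fontsize=14, filled=True)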
plt.show()

**=================================================================================================================**

In [None]:
#preprocessor.fit_transform(X_train).columns

In [None]:
importances = dtpipeline.named_steps.decisiontree.feature_importances_

# Read the feature names from the fitted preprocessor instead of refitting it
feature_importances = pd.Series(importances, index=dtpipeline.named_steps.preprocessor.get_feature_names_out())

fig, ax = plt.subplots()

feature_importances.plot.barh(ax=ax, figsize=(12,8))
ax.set_title("Decision Tree Feature Importances")
ax.tick_params('x', rotation=0)

plt.show()

In [None]:
# importances = dtpipeline.named_steps.decisiontree.feature_importances_

# feature_importances = pd.Series(importances, index=preprocessor.fit_transform(X_train).columns)

# fig, ax = plt.subplots()

# feature_importances.plot.barh(ax=ax, figsize=(12,8))
# ax.set_title("Decision Tree Feature Importances")
# ax.tick_params('x', rotation=0)

# fig.show()

#fig.savefig("tree.png")

In [None]:
feature_importances_df = pd.DataFrame(feature_importances, columns=["importances"])
feature_importances_df = feature_importances_df.sort_values(by='importances')
feature_importances_df

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.barplot(data=feature_importances_df, x=feature_importances_df.importances, y=feature_importances_df.index, orient='h')

ax.set_title("Decision Tree: Feature Importances", fontsize=20)

ax.set_xlabel("Importance")
ax.set_ylabel("Feature")

plt.show()

**=================================================================================================================**

## Random Forest Model

In [None]:
rfpipeline = Pipeline(steps=[
                        ("preprocessor", preprocessor),
                        ("randomforest", RandomForestClassifier(random_state=0))
                    
])

In [None]:
rfpipeline.fit(X_train, y_train)

In [None]:
rfpred = rfpipeline.predict(X_test)

In [None]:
rfpred[0:5]

In [None]:
print("Random Forest Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, rfpred))
print('Precision:', '%.3f' % precision_score(y_test, rfpred))
print('Recall:', '%.3f' % recall_score(y_test, rfpred))
print('F1 Score:', '%.3f' % f1_score(y_test, rfpred))
print('AUC score:', '%.3f' % roc_auc_score(y_test, rfpred))

## Random Forest Model Evaluation

In [None]:
rfcm = confusion_matrix(y_test,rfpred)
rfcm

In [None]:
print(classification_report(y_test,rfpred))

In [None]:
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(figsize=(10,5))

ConfusionMatrixDisplay.from_estimator(estimator=rfpipeline, X=X_test, y=y_test, ax=ax, display_labels=rfpipeline.classes_)
ax.set_title('Confusion matrix of the classifier', size=15)
ax.grid(False)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

RocCurveDisplay.from_estimator(estimator=rfpipeline, X=X_test, y=y_test, ax=ax)
ax.set_title('ROC Curve of the classifier', size=15)

plt.show()

## K-Fold Validation (RF)

In [None]:
#kf = KFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
# return_train_score=True is required for the "train_score" key used below
rfcv = cross_validate(estimator=rfpipeline, X=X_train, y=y_train, scoring="roc_auc", cv=skf, n_jobs=2, return_train_score=True)
rfcv

In [None]:
rfcv["train_score"].mean()

In [None]:
rfcv["test_score"].mean()

## Cross Validation Score

In [None]:
rfcv2 = cross_val_score(estimator=rfpipeline, X=X_train, y=y_train, scoring="roc_auc", cv=skf, n_jobs=2)

In [None]:
rfcv2

## GridSearchCV (RF)

In [None]:
rfpipeline.named_steps.randomforest.get_params()

In [None]:
parameters = {'randomforest__criterion': ['gini', 'entropy', 'log_loss'],
              'randomforest__ccp_alpha': [0.0, 0.01, 0.02, 0.03, 0.04]
             }

In [None]:
# Assign a list of scoring metrics to capture (a list, tuple, or dict; not a set)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

In [None]:
rfgs = GridSearchCV(estimator=rfpipeline, param_grid=parameters, scoring=scoring, n_jobs=-1, cv=5, refit='roc_auc')

In [None]:
%%time
rfgs.fit(X_train, y_train)

In [None]:
rfgs.best_params_

In [None]:
rfgs.best_score_
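
With several scoring metrics, `cv_results_` stores one column group per metric (`mean_test_<metric>`) rather than a single `mean_test_score`. The sketch below shows this on synthetic data; the dataset, the tiny grid, and all `demo_` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("minmax", MinMaxScaler()),
    ("randomforest", RandomForestClassifier(n_estimators=20, random_state=0)),
])

demo_grid = {"randomforest__max_depth": [2, 4]}
demo_scoring = ["accuracy", "f1", "roc_auc"]

# refit names the metric used to pick the best candidate
demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid,
                           scoring=demo_scoring, cv=3, refit="roc_auc")
demo_search.fit(X_demo, y_demo)

demo_results = pd.DataFrame(demo_search.cv_results_)
metric_columns = ["mean_test_accuracy", "mean_test_f1", "mean_test_roc_auc"]
print(demo_results[["param_randomforest__max_depth"] + metric_columns])
print(demo_search.best_params_, demo_search.best_score_)
```

Here `best_score_` is the mean cross-validated value of the `refit` metric for the selected candidate.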

## RandomizedSearchCV (RF)

In [None]:
parameters = {'randomforest__n_estimators': stats.randint(50, 200),
              'randomforest__max_depth' : stats.randint(2,10),
              'randomforest__min_samples_split': stats.randint(2,5),
              'randomforest__min_samples_leaf' : stats.randint(1,5),
              'randomforest__ccp_alpha': stats.uniform(0,0.05)
             }

In [None]:
# Assign a list of scoring metrics to capture (a list, tuple, or dict; not a set)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

In [None]:
# random_state makes the sampled parameter combinations reproducible
rf_randm = RandomizedSearchCV(estimator=rfpipeline, param_distributions = parameters, cv = 5, n_iter = 10, 
                           n_jobs=2, scoring="roc_auc", refit=True, random_state=0)

In [None]:
%%time
rf_randm.fit(X_train, y_train)

In [None]:
rf_randm.best_estimator_

In [None]:
bestpipeline = rf_randm.best_estimator_

In [None]:
rf_randm.best_score_

In [None]:
rf_randm.best_params_

In [None]:
# we also find the data for all models evaluated

results = pd.DataFrame(rf_randm.cv_results_)

print(results.shape)

results.head()

In [None]:
results.columns

In [None]:
# we can order the different models based on their performance
results.sort_values(by='mean_test_score', ascending=False, inplace=True)

results.reset_index(drop=True, inplace=True)

results[['param_randomforest__max_depth', 'param_randomforest__min_samples_leaf', 
         'param_randomforest__min_samples_split', 'param_randomforest__n_estimators',
         'mean_test_score', 'rank_test_score']].T

In [None]:
def make_results(model_name, model_object):
    '''
    Accepts as arguments a model name (your choice - string) and
    a fitted search object (GridSearchCV or RandomizedSearchCV) that was
    run with a single scoring metric (ROC-AUC in this notebook).
  
    Returns a pandas df with the mean test score (ROC-AUC) of the
    best parameter combination across all validation folds.  
    '''

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean test score)
    best_estimator_results = cv_results.iloc[cv_results['mean_test_score'].idxmax(), :]

    # Extract the score from that row (the other metrics need a multi-metric search)
#     f1 = best_estimator_results.mean_test_f1
#     recall = best_estimator_results.mean_test_recall
#     precision = best_estimator_results.mean_test_precision
#     accuracy = best_estimator_results.mean_test_accuracy
    rocauc = best_estimator_results.mean_test_score
  
    # Create table of results (DataFrame.append was removed in pandas 2.0)
    table = pd.DataFrame([{'Model': model_name,
#                         'F1': f1,
#                         'Recall': recall,
#                         'Precision': precision,
#                         'Accuracy': accuracy,
                        'ROC-AUC' : rocauc  
                        }])
  
    return table

In [None]:
# Call the function on our model
rf_result_table = make_results("Random Forest RCV", rf_randm)

rf_result_table

**=================================================================================================================**

## Feature importance (or Gini) graph

In [None]:
importances = rfpipeline.named_steps.randomforest.feature_importances_

feature_importances = pd.Series(importances, index=preprocessor.fit_transform(X_train).columns)

fig, ax = plt.subplots()
feature_importances.sort_values(ascending=False).plot.barh(ax=ax, figsize=(12,8))

ax.set_title("Random Forest Feature Importances", size=20)
ax.tick_params('x', rotation=0)

fig.show()

In [None]:
# feature_importances_df = pd.DataFrame(feature_importances, columns=["importances"])
# feature_importances_df = feature_importances_df.sort_values(by='importances')
# feature_importances_df

In [None]:
# fig, ax = plt.subplots(figsize=(12,8))

# sns.barplot(data=feature_importances_df, x=feature_importances_df.importances, y=feature_importances_df.index, orient='h')

# ax.set_title("Decision Tree: Feature Importances for Employee Leaving", fontsize=15)

# ax.set_xlabel("Importance")
# ax.set_ylabel("Feature")

# plt.show()

**=================================================================================================================**

## Permutation Importance

Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature.

`permutation_importance(estimator, X, y, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)`

We need to pass the model and the validation set to the permutation_importance function.

The n_repeats parameter specifies the number of times the feature values are shuffled. More repetitions will give more accurate results, but will take longer to compute.

The random_state parameter is used to set the random seed for reproducibility.
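
A minimal sketch of the procedure on synthetic data follows. The dataset, the `model` object, and the `feature_<i>` labels are assumptions for illustration; `shuffle=False` in `make_classification` places the two informative features in the first two columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): columns 0 and 1 are informative, the rest are noise
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Shuffle each column of the validation set n_repeats times and record the drop in ROC-AUC
result = permutation_importance(model, Xva, yva, scoring="roc_auc",
                                n_repeats=5, random_state=0)

# Rank features from most to least important by mean score drop
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

The returned object holds `importances` (one row per feature, one column per repeat) along with `importances_mean` and `importances_std`.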

In [None]:
pm = permutation_importance(estimator=rfpipeline, X=X_test, y=y_test, n_jobs=-1, 
                            scoring="roc_auc", random_state=0, n_repeats=10)

pm

In [None]:
pm2 = permutation_importance(estimator=rf_randm.best_estimator_, X=X_test, y=y_test, n_jobs=-1, 
                            scoring="roc_auc", random_state=0, n_repeats=10)

pm2

In [None]:
sorted_idx = pm2.importances_mean.argsort()
fig = plt.figure(figsize=(12, 8))
plt.barh(range(len(sorted_idx)), pm2.importances_mean[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), np.array(X_test.columns)[sorted_idx])
plt.title('Random Forest Permutation Importance')
plt.show()

**=================================================================================================================**

## Gradient Boosting Model

In [None]:
gbcpipeline = Pipeline(steps=[
                        ("preprocessor", preprocessor),
                        ("graboost", GradientBoostingClassifier(random_state=0))
                    
])

In [None]:
gbcpipeline.fit(X_train, y_train)

In [None]:
gbcpred = gbcpipeline.predict(X_test)

In [None]:
gbcpred[0:5]

In [None]:
print("Gradient Boost Tree Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, gbcpred))
print('Precision:', '%.3f' % precision_score(y_test, gbcpred))
print('Recall:', '%.3f' % recall_score(y_test, gbcpred))
print('F1 Score:', '%.3f' % f1_score(y_test, gbcpred))
print('AUC score:', '%.3f' % roc_auc_score(y_test, gbcpred))

## K-Fold Validation GBC

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
# return_train_score=True is required for the "train_score" key used below
gbccv = cross_validate(estimator=gbcpipeline, X=X_train, y=y_train, scoring="roc_auc", cv=kf, n_jobs=2, 
                    return_train_score=True)
gbccv

In [None]:
gbccv["train_score"].mean()

In [None]:
gbccv["test_score"].mean()

**=================================================================================================================**

## RandomizedSearchCV (GBC)

In [None]:
gbcpipeline.named_steps.graboost.get_params()

In [None]:
parameters = {'graboost__learning_rate': stats.uniform(0,1),
              'graboost__n_estimators': stats.randint(50,250),
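              # A float is read as a fraction of the training samples; a min_samples_leaf
              # fraction above 0.5 leaves no valid split, so those candidates cannot grow trees
              'graboost__min_samples_split' : stats.uniform(0,1),
              'graboost__min_samples_leaf' : stats.uniform(0,1),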
              'graboost__max_depth': stats.randint(2,10)
              
             }

In [None]:
# Assign a list of scoring metrics to capture (a list, tuple, or dict; not a set)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

In [None]:
# random_state makes the sampled parameter combinations reproducible
gbc_randm = RandomizedSearchCV(estimator=gbcpipeline, param_distributions = parameters, cv = 5, n_iter = 10, 
                           n_jobs=-1, scoring='roc_auc', refit=True, random_state=0)

In [None]:
%%time
gbc_randm.fit(X_train, y_train)

In [None]:
gbc_randm.best_estimator_

In [None]:
gbc_randm.best_score_

In [None]:
gbc_randm.best_params_

In [None]:
# we also find the data for all models evaluated

results = pd.DataFrame(gbc_randm.cv_results_)

results.head().T

In [None]:
# we can order the different models based on their performance
results.sort_values(by='mean_test_score', ascending=False, inplace=True)

results.reset_index(drop=True, inplace=True)

results[['param_graboost__learning_rate','param_graboost__max_depth', 'param_graboost__min_samples_leaf',
         'param_graboost__min_samples_split','param_graboost__n_estimators']].head()

In [None]:
def make_results(model_name, model_object):
    '''
    Accepts as arguments a model name (your choice - string) and
    a fitted search object (GridSearchCV or RandomizedSearchCV) that was
    run with a single scoring metric (ROC-AUC in this notebook).
  
    Returns a pandas df with the mean test score (ROC-AUC) of the
    best parameter combination across all validation folds.  
    '''

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean test score)
    best_estimator_results = cv_results.iloc[cv_results['mean_test_score'].idxmax(), :]

    # Extract the score from that row (the other metrics need a multi-metric search)
    #f1 = best_estimator_results.mean_test_f1
    #recall = best_estimator_results.mean_test_recall
    #precision = best_estimator_results.mean_test_precision
    #accuracy = best_estimator_results.mean_test_accuracy
    rocauc = best_estimator_results.mean_test_score
  
    # Create table of results (DataFrame.append was removed in pandas 2.0)
    table = pd.DataFrame([{'Model': model_name,
#                         'F1': f1,
#                         'Recall': recall,
#                         'Precision': precision,
#                         'Accuracy': accuracy,
                        'ROC-AUC' : rocauc  
                        }])
  
    return table

In [None]:
# Call the function on our model
gbc_result_table = make_results("Gradient Boosting RCV", gbc_randm)

In [None]:
gbc_result_table

**=================================================================================================================**

## HistGradientBoostingClassifier

In [None]:
hgbcpipeline = Pipeline(steps=[
                        ("preprocessor", preprocessor),
                        ("graboost", HistGradientBoostingClassifier(scoring="roc_auc", random_state=0))
                    
])

### HistGradientBoosting Model

In [None]:
hgbcpipeline.fit(X_train, y_train)

In [None]:
hgbc_pred = hgbcpipeline.predict(X_test)

In [None]:
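# predict() returns 1-D class labels; probabilities come from predict_proba()
# Extract the second column of probabilities (class 1) and rename it
hgbc_predicted_probability = hgbcpipeline.predict_proba(X_test)[:, 1]

# Round to 0/1 labels (equivalent to a 0.5 threshold)
hgbcprob = hgbc_predicted_probability.round(0)

hgbcprob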

In [None]:
print("HistGradientBoosting Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, hgbc_pred))
print('Precision:', '%.3f' % precision_score(y_test, hgbc_pred))
print('Recall:', '%.3f' % recall_score(y_test, hgbc_pred))
print('F1 Score:', '%.3f' % f1_score(y_test, hgbc_pred))
print('AUC score:', '%.3f' % roc_auc_score(y_test, hgbc_pred))

In [None]:
print("HistGradientBoosting Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, hgbcprob))
print('Precision:', '%.3f' % precision_score(y_test, hgbcprob))
print('Recall:', '%.3f' % recall_score(y_test, hgbcprob))
print('F1 Score:', '%.3f' % f1_score(y_test, hgbcprob))
# ROC-AUC uses the unrounded probabilities
print('AUC score:', '%.3f' % roc_auc_score(y_test, hgbc_predicted_probability))

### K-Fold Validation

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
hgcv = cross_validate(estimator=hgbcpipeline, X=X_train, y=y_train, scoring="roc_auc", cv=skf, n_jobs=2)
hgcv

In [None]:
hgcv["test_score"].mean()

### Using RandomizedSearchCV

In [None]:
hgbcpipeline.get_params()

In [None]:
parameters = {'graboost__l2_regularization' : stats.uniform(0,1),
              'graboost__learning_rate': stats.uniform(0,1),
              'graboost__max_iter' : stats.randint(20,100),
              'graboost__max_bins': stats.randint(10,100),
              'graboost__max_depth': stats.randint(2,10),
              'graboost__max_leaf_nodes' : stats.randint(2,15),
              'graboost__min_samples_leaf': stats.randint(1,20)
             }

In [None]:
# Assign a list of scoring metric names to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

In [None]:
hgbc_randm = RandomizedSearchCV(estimator=hgbcpipeline, param_distributions = parameters, cv = 5, n_iter = 15, 
                           n_jobs=2, scoring='roc_auc', refit=True, return_train_score=True)

In [None]:
%%time
hgbc_randm.fit(X_train, y_train)

In [None]:
hgbc_randm.best_estimator_

In [None]:
hgbc_randm.best_score_

In [None]:
hgbc_randm.best_params_

In [None]:
# cv_results_ holds the scores and parameters of every candidate evaluated

results = pd.DataFrame(hgbc_randm.cv_results_)

print(results.shape)

results.head()

In [None]:
results.columns

In [None]:
# we can order the different models based on their performance
results.sort_values(by='mean_test_score', ascending=False, inplace=True)

results.reset_index(drop=True, inplace=True)

results[['rank_test_score', 'param_graboost__l2_regularization', 'param_graboost__learning_rate',
         'param_graboost__max_bins', 'param_graboost__max_depth', 'param_graboost__max_leaf_nodes',
         'param_graboost__min_samples_leaf', 'mean_test_score']].head()

## HGBM Model Evaluation

In [None]:
hgbcm = confusion_matrix(y_test, hgbc_pred)
hgbcm

In [None]:
print(classification_report(y_test, hgbc_pred))

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

ConfusionMatrixDisplay.from_estimator(estimator=hgbcpipeline, X=X_test, y=y_test, ax=ax, display_labels=hgbcpipeline.classes_)
ax.set_title('Confusion matrix of the classifier', size=15)
ax.grid(False)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

RocCurveDisplay.from_estimator(estimator=hgbcpipeline, X=X_test, y=y_test, ax=ax)
ax.set_title('ROC Curve of the classifier', size=20)

plt.show()

### Build a feature importance graph

In [None]:
permimpt = permutation_importance(estimator=hgbcpipeline, X=X_train, y=y_train, scoring="roc_auc", n_repeats=15,
                       n_jobs=-1, random_state=0)

permimpt

In [None]:
hgbc_importances = pd.Series(data=permimpt["importances_mean"], index=X_train.columns)
hgbc_sorted = hgbc_importances.sort_values(ascending=False)

hgbc_sorted

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(y=hgbc_sorted.index, x=hgbc_sorted.values, orient="h")

ax.set_title("Hist Gradient Boosting Features Importance", size=20)



plt.show()

In [None]:
pm = permutation_importance(estimator=rfpipeline, X=X_test, y=y_test, n_jobs=-1, 
                            scoring="roc_auc", random_state=0, n_repeats=10)

pm

In [None]:
pm2 = permutation_importance(estimator=hgbc_randm.best_estimator_, X=X_test, y=y_test, n_jobs=-1, 
                            scoring="roc_auc", random_state=0, n_repeats=10)

pm2

In [None]:
sorted_idx = pm2.importances_mean.argsort()
fig = plt.figure(figsize=(12, 8))
plt.barh(range(len(sorted_idx)), pm2.importances_mean[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), np.array(X_test.columns)[sorted_idx])
plt.title('Tuned HistGradientBoosting Permutation Importance')
plt.show()

**=================================================================================================================**

## XGBoost (Scikit-Learn)

In [None]:
xgbcpipeline = Pipeline(steps=[
                        ("preprocessor", preprocessor),
                        ("xgbc", XGBClassifier(random_state=0, 
                                               booster = 'gbtree',
                                               objective='binary:logistic', 
                                               eval_metric="auc",
                                               scale_pos_weight=4.54,
                                               tree_method = "hist"))
                    
])

In [None]:
xgbcpipeline.fit(X_train, y_train)

In [None]:
xgbcpred = xgbcpipeline.predict(X_test)

In [None]:
print("XGB Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, xgbcpred))
print('Precision:', '%.3f' % precision_score(y_test, xgbcpred))
print('Recall:', '%.3f' % recall_score(y_test, xgbcpred))
print('F1 Score:', '%.3f' % f1_score(y_test, xgbcpred))
print('AUC score:', '%.3f' % roc_auc_score(y_test, xgbcpred))

## XGB Model Evaluation

In [None]:
xgbcm = confusion_matrix(y_test, xgbcpred)
xgbcm

In [None]:
print(classification_report(y_test, xgbcpred))

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

ConfusionMatrixDisplay.from_estimator(estimator=xgbcpipeline, X=X_test, y=y_test, ax=ax, display_labels=xgbcpipeline.classes_)
ax.set_title('Confusion matrix of the classifier', size=20)
ax.grid(False)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

RocCurveDisplay.from_estimator(estimator=xgbcpipeline, X=X_test, y=y_test, ax=ax)
ax.set_title('ROC Curve of the classifier', size=20)
ax.grid(False)
plt.show()

## XGB K-Fold Validation

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}

In [None]:
xgbcv = cross_validate(estimator=xgbcpipeline, X=X_train, y=y_train, scoring=scoring, cv=skf, n_jobs=2)
xgbcv

In [None]:
xgbcv["test_roc_auc"].mean()

## RandomizedSearchCV

In [None]:
xgbcpipeline.named_steps.xgbc.get_params()

In [None]:
parameters = {"xgbc__n_estimators": stats.randint(50,350),
              "xgbc__max_depth" : stats.randint(3,10),
              "xgbc__min_child_weight": stats.randint(0,5),
              "xgbc__colsample_bytree" : stats.uniform(0,1),
              "xgbc__subsample" : stats.uniform(0,1),
              "xgbc__eta" : stats.uniform(0,1),
              "xgbc__gamma" : stats.randint(0,10),
              "xgbc__reg_alpha": stats.randint(0,10),
              "xgbc__reg_lambda": stats.randint(0,10),
              "xgbc__max_bin": stats.randint(2,256),  # avoid degenerate bin counts (0 or 1)
              "xgbc__max_delta_step" : stats.randint(1,10),
              "xgbc__scale_pos_weight" : stats.uniform(1,5)
              
             }
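
The ranges in the search space above depend on how SciPy parameterises its distributions, which is easy to misread. The short check below is a sketch using two of the distributions from the dictionary; `weight_dist` and `depth_dist` are names introduced here for illustration.

```python
from scipy import stats

# uniform(loc, scale) spans [loc, loc + scale], not [loc, scale]
weight_dist = stats.uniform(1, 5)
# randint(low, high) draws integers from low up to high - 1
depth_dist = stats.randint(3, 10)

weight_bounds = weight_dist.support()
depth_bounds = depth_dist.support()

# Draw samples to confirm they stay inside the reported bounds
weight_draws = weight_dist.rvs(size=1000, random_state=0)
depth_draws = depth_dist.rvs(size=1000, random_state=0)
print(weight_bounds, depth_bounds)
```

So `stats.uniform(1,5)` samples `scale_pos_weight` between 1 and 6, and `stats.randint(3,10)` samples `max_depth` from 3 to 9 inclusive.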

In [None]:
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}

In [None]:
xgbrandm = RandomizedSearchCV(estimator=xgbcpipeline, param_distributions = parameters, cv = 5, n_iter = 40, 
                           n_jobs=2, scoring=scoring, refit='roc_auc')

In [None]:
xgbrandm.fit(X_train,y_train)

In [None]:
xgbrandm.best_estimator_

In [None]:
bestxgb = xgbrandm.best_estimator_

In [None]:
bestxgb

In [None]:
xgbrandm.best_score_

In [None]:
xgbrandm.best_params_

## Tuned XGB Model Evaluation

In [None]:
tunedxgbcpred = bestxgb.predict(X_test)

In [None]:
tunedxgbcpred[0:5]

In [None]:
print("XGB Classifier\n")
print('Accuracy:', '%.3f' % accuracy_score(y_test, tunedxgbcpred))
print('Precision:', '%.3f' % precision_score(y_test, tunedxgbcpred))
print('Recall:', '%.3f' % recall_score(y_test, tunedxgbcpred))
print('F1 Score:', '%.3f' % f1_score(y_test, tunedxgbcpred))
print('AUC score:', '%.3f' % roc_auc_score(y_test, tunedxgbcpred))

In [None]:
tunedxgbcm = confusion_matrix(y_test, tunedxgbcpred)
tunedxgbcm

In [None]:
print(classification_report(y_test, tunedxgbcpred))

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

ConfusionMatrixDisplay.from_estimator(estimator=bestxgb, X=X_test, y=y_test, ax=ax, display_labels=bestxgb.classes_)
ax.set_title('Confusion matrix of the classifier', size=20)
ax.grid(False)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

RocCurveDisplay.from_estimator(estimator=bestxgb, X=X_test, y=y_test, ax=ax)
ax.set_title('ROC Curve of the classifier', size=20)
ax.grid(False)
plt.show()

## XGBoost Feature importance

The XGBoost library has a function called `plot_importance`, which we imported at the beginning of this notebook. It lets us check which features the model found most predictive. We create the plot by calling this function and passing it the fitted `XGBClassifier` step of a pipeline, either the baseline pipeline or the best estimator from our randomized search.

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
plot_importance(xgbcpipeline.named_steps.xgbc, ax=ax, title="XGB Classifier Feature Importance",
                importance_type = 'weight')

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
plot_importance(bestxgb.named_steps.xgbc, ax=ax, title="Tuned XGB Classifier Feature Importance",
                importance_type = 'weight')

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
plot_importance(xgbcpipeline.named_steps.xgbc, ax=ax, title="XGB Classifier Feature Importance",
                importance_type = 'gain')

plt.show()

In [None]:
xgbcpipeline.named_steps.xgbc.feature_importances_

In [None]:
xgbcpipeline.named_steps.preprocessor.get_feature_names_out()

In [None]:
feat_importances = pd.Series(xgbcpipeline.named_steps.xgbc.feature_importances_, 
                             index=xgbcpipeline.named_steps.preprocessor.get_feature_names_out())

In [None]:
feat_importances

In [None]:
feat_importances.nlargest(10).plot(kind='barh', figsize=(10,8))
plt.title('Feature Importances')
plt.show()

### The permutation based importance

In [None]:
pm = permutation_importance(estimator=xgbcpipeline, X=X_test, y=y_test, n_jobs=-1, 
                            scoring="roc_auc", random_state=0, n_repeats=30)

pm

In [None]:
sorted_idx = pm.importances_mean.argsort()
fig = plt.figure(figsize=(12, 8))
plt.barh(range(len(sorted_idx)), pm.importances_mean[sorted_idx], align='center')
plt.yticks(range(len(sorted_idx)), np.array(X_test.columns)[sorted_idx])
plt.title('XGB Permutation Importance')
plt.show()

**=================================================================================================================**

## Predict Test Data Set

In [None]:
predicted_probability = hgbcpipeline.predict_proba(testset)

In [None]:
predicted_probability

In [None]:
# Keep the class-1 probability for each customer (no rounding, so the ranking is preserved)
predicted_probability = predicted_probability[:, 1]
predicted_probability

In [None]:
test_df = pd.read_csv("testoriginal.csv")

In [None]:
test_df.head()

In [None]:
# Combine predictions with label column into a dataframe
prediction_df = pd.DataFrame({'CustomerID': test_df['CustomerID'].values,
                             'predicted_probability': predicted_probability})

In [None]:
prediction_df

In [None]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

# Writing to csv for autograding purposes
prediction_df.to_csv("prediction_submission.csv", index=False)
submission = pd.read_csv("prediction_submission.csv")

assert isinstance(submission, pd.DataFrame), 'You should have a dataframe named prediction_df.'

In [None]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.columns[0] == 'CustomerID', 'The first column name should be CustomerID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'

In [None]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.shape[0] == 104480, 'The dataframe prediction_df should have 104480 rows.'

In [None]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.shape[1] == 2, 'The dataframe prediction_df should have 2 columns.'

**=================================================================================================================**

# Filter Methods (Other Methods)

## Univariate Performance with Feature-engine

This procedure works as follows:

- Train one ML model per feature
- Measure each model's performance
- Keep the features whose model performance is above a chosen threshold

The C value in Logistic Regression is a user-adjustable parameter that controls regularisation. In simple terms, higher values of C instruct the model to fit the training set as closely as possible, while lower values of C favour simpler models with coefficients closer to zero.
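
The procedure above can be sketched by hand to show what happens per feature. This is a minimal illustration on synthetic data, not Feature-engine's implementation; `X_demo`, `y_demo`, `performance`, and the 0.7 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): one feature carries
# signal about the target, the other is pure noise.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
X_demo = pd.DataFrame({
    "signal": y_demo + rng.normal(0, 0.5, size=400),
    "noise": rng.normal(0, 1, size=400),
})

# Train and score one model per feature
performance = {}
for col in X_demo.columns:
    scores = cross_val_score(LogisticRegression(), X_demo[[col]], y_demo,
                             scoring="roc_auc", cv=5)
    performance[col] = scores.mean()

# Keep the features whose mean score exceeds the threshold
threshold = 0.7
selected = [col for col, score in performance.items() if score > threshold]
print(performance, selected)
```

A feature that is unrelated to the target should score close to 0.5, which is why a threshold at or slightly above 0.5 is a common starting point for ROC AUC.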


In [None]:
xgbft = XGBClassifier(random_state=0, booster = 'gbtree', objective='binary:logistic', eval_metric="auc")

In [None]:
# set up the selector
sel = SelectBySingleFeaturePerformance(
    variables=None,
    estimator=xgbft,
    scoring="roc_auc",
    cv=5,
    threshold=0.5
)

In [None]:
# find predictive features
sel.fit(X_train, y_train)

In [None]:
# the transformer stores a dictionary of feature:metric pairs
# here the metric is the mean cross-validated ROC AUC, which lies between 0 and 1
# features that do not reach the threshold are listed in features_to_drop_

# In general, an AUC of 0.5 suggests no discrimination
# (i.e., no ability to separate the two classes, here subscribers who churn from those who stay),
# 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding

sel.feature_performance_

In [None]:
pd.Series(sel.feature_performance_).sort_values(ascending=False).plot.barh(figsize=(20, 8))
plt.title('Performance of ML models trained with individual features', size=20)
plt.xticks(rotation=45)
plt.xlabel('ROC AUC score')
plt.show()

In [None]:
# the features that will be removed

sel.features_to_drop_

In [None]:
# select features in the dataframes

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train.columns

**=================================================================================================================**

# Step forward feature selection

Step forward feature selection starts by training a machine learning model for each feature in the dataset and selecting, as the starting feature, the one that returns the best performing model according to the evaluation criteria we choose.

In the second step, it creates machine learning models for all combinations of the feature selected in the previous step and a second feature. It selects the pair that produces the best performing algorithm.

It continues by adding one feature at a time to the features selected in previous steps, until a stopping criterion is reached.

In principle, each added feature gives the model more information, so performance on the training data tends to keep improving. The algorithm therefore needs a stopping rule: for example, it can stop when the performance gain falls below a certain threshold or, as we show in this notebook, when a certain number of features has been selected.

The model performance metric can be the roc_auc for classification and the r squared for regression, for example, and it is determined by the user. 

Step forward feature selection is called a greedy procedure because at each step it commits to the single best addition and never revisits earlier choices. It still has to train one model per candidate feature in every round, so it is computationally expensive and, if the feature space is big enough, can become infeasible.

Scikit-learn provides various stopping criteria to stop the search:

* when a certain number of features is reached (like MLXtend) (arbitrary)
* if the performance does not increase beyond a certain threshold (ideal but expensive)
* selects half of the features (arbitrary)
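
The greedy search described above can be sketched by hand. This is a simplified illustration on synthetic data, stopping at a fixed number of features; it is not scikit-learn's `SequentialFeatureSelector` code, and `forward_select`, `X_demo`, and `y_demo` are names introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative assumption): only columns 0 and 2
# influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

def forward_select(X, y, n_features_to_select, cv=5):
    """Greedy forward selection: add the single best feature each round."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        # Score every candidate set formed by adding one remaining feature
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                               scoring="r2", cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(X_demo, y_demo, n_features_to_select=2)
print(chosen)
```

Each round fixes the features chosen so far and only compares single additions, which is what makes the search greedy rather than exhaustive.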

### Step Forward Feature Selection Regression

In [None]:
df = pd.read_csv("carpricemod.csv")

In [None]:
df.shape

In [None]:
df.head(1)

In [None]:
X = df.iloc[:,0:14]
y = df.iloc[:,14]

In [None]:
X.values, y.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# within the SFS we indicate:

# 1) the algorithm we want to use, in this case LinearRegression

# 2) the stopping criterion: see the sklearn documentation for more details

# 3) whether to perform step forward or step backward

# 4) the evaluation metric: in this case r2
# 5) the cross-validation

# this is going to take a while, do not despair

sfs = SFS(estimator=LinearRegression(), 
          n_features_to_select=6,
          direction='forward',
          scoring='r2',
          cv=5,
          n_jobs=-1)

In [None]:
sfs = sfs.fit(X_train, y_train)

In [None]:
sfs.get_feature_names_out()

In [None]:
sfs.n_features_to_select

### Step Forward Feature Selection Classification

In [None]:
#df = pd.read_csv(".csv")

In [None]:
df.shape

In [None]:
X = df.iloc[:,0:8]
y = df.iloc[:,8]

In [None]:
X.values, y.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# within the SFS we indicate:

# 1) the algorithm we want to use, in this case LogisticRegression

# 2) the stopping criterion: see the sklearn documentation for more details

# 3) whether to perform step forward or step backward

# 4) the evaluation metric: in this case f1
# 5) the cross-validation

# this is going to take a while, do not despair

sfs = SFS(estimator=LogisticRegression(random_state=0), 
          n_features_to_select=4,
          direction='forward',
          scoring='f1',
          cv=5,
          n_jobs=-1)

In [None]:
sfs = sfs.fit(X_train, y_train)

In [None]:
sfs.get_feature_names_out()

# Step backward feature selection

Step Backward Feature Selection starts by fitting a model using all the features in the data set and determining its performance. 

Then, it trains a model on every subset obtained by leaving out one feature, and permanently removes the feature whose absence hurts performance the least (that is, it keeps the best-scoring subset).

In the third step, it repeats the procedure on the features remaining from step 2: it trains a model for each subset with one more feature left out and again removes the feature whose absence hurts performance the least.

The algorithm stops when a criterion determined by the user is met. This criterion could be that the model performance does not decrease beyond a certain threshold or, as we show in this notebook, that a certain number of selected features is reached.

The evaluation metric can be the roc_auc for classification or the r squared for regression, for example, and is determined by the user.

Step Backward Feature Selection is called greedy because at each step it commits to the single best removal and never revisits earlier decisions. Even so, it trains one model per remaining feature in every round, starting from the full feature set, so it is computationally expensive and, if the feature space is big enough, can become infeasible.

Scikit-learn provides various stopping criteria to stop the search:

* when a certain number of features is reached (like MLXtend) (arbitrary)
* if the performance does not decrease beyond a certain threshold (ideal but expensive)
* selects half of the features (arbitrary)
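
To put the computational cost in numbers, the sketch below counts how many candidate feature sets a sequential search scores. It assumes each round scores every remaining candidate once, with each score costing `cv` model fits; the figures 14, 6, and 5 mirror the regression cells in this notebook, and `sequential_candidate_sets` is a helper introduced for illustration.

```python
from math import comb

def sequential_candidate_sets(n_features, n_to_select, direction):
    """Count the feature subsets scored by a greedy sequential search."""
    if direction == "forward":
        rounds = n_to_select               # one feature added per round
    else:
        rounds = n_features - n_to_select  # one feature removed per round
    # In round i there are n_features - i features left to add or remove
    return sum(n_features - i for i in range(rounds))

n_features, n_to_select, cv = 14, 6, 5
forward_sets = sequential_candidate_sets(n_features, n_to_select, "forward")
backward_sets = sequential_candidate_sets(n_features, n_to_select, "backward")
exhaustive_sets = comb(n_features, n_to_select)

# Multiply by cv to get the number of model fits
print(forward_sets * cv, backward_sets * cv, exhaustive_sets * cv)
```

Under these assumptions, forward selection scores 69 subsets (345 fits), backward selection 84 subsets (420 fits), and an exhaustive search over all 6-feature subsets 3003 subsets (15015 fits). Backward selection is the costlier of the two sequential directions here because more than half of the features are removed, and its models are trained on larger feature sets.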

### Step Backward Feature Selection Regression

In [None]:
df = pd.read_csv(".csv")

In [None]:
df.shape

In [None]:
df.head(1)

In [None]:
X = df.iloc[:,0:14]
y = df.iloc[:,14]

In [None]:
X.values, y.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# within the SFS we indicate:

# 1) the algorithm we want to use, in this case LinearRegression

# 2) the stopping criterion: see the sklearn documentation for more details

# 3) whether to perform step forward or step backward

# 4) the evaluation metric: in this case r2
# 5) the cross-validation

# this is going to take a while, do not despair

sfs = SFS(estimator=LinearRegression(), 
          n_features_to_select=6,
          direction='backward',
          scoring='r2',
          cv=5,
          n_jobs=-1)

In [None]:
sfs = sfs.fit(X_train, y_train)

In [None]:
sfs.get_feature_names_out()

In [None]:
sfs.n_features_to_select

### Step Backward Feature Selection Classification

In [None]:
#df = pd.read_csv(".csv")

In [None]:
df.shape

In [None]:
X = df.iloc[:,0:8]
y = df.iloc[:,8]

In [None]:
X.values, y.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# within the SFS we indicate:

# 1) the algorithm we want to use, in this case LogisticRegression

# 2) the stopping criterion: see the sklearn documentation for more details

# 3) whether to perform step forward or step backward

# 4) the evaluation metric: in this case f1
# 5) the cross-validation

# this is going to take a while, do not despair

sfs = SFS(estimator=LogisticRegression(random_state=0), 
          n_features_to_select=4,
          direction='backward',
          scoring='f1',
          cv=5,
          n_jobs=-1)

In [None]:
sfs = sfs.fit(X_train, y_train)

In [None]:
sfs.get_feature_names_out()

**=================================================================================================================**

# Pickle  

When models take a long time to fit, you don’t want to have to fit them more than once. If your kernel disconnects or you shut down the notebook and lose the cell’s output, you’ll have to refit the model, which can be frustrating and time-consuming. 

`pickle` is a tool that saves the fitted model object to a file at a specified location, so it can be read back in quickly later. It also allows you to use models that were fit somewhere else, without having to train them yourself.

### Save the Model
This step will ***W***rite (i.e., save) the model, in ***B***inary (hence, `wb`), to the current working directory. In this case, the name of the file we're writing is `model.sav`.

In [None]:
filename = 'model.sav'
with open(filename, 'wb') as f:  # 'wb' = write binary; the file is closed on exit
    dump(xgbnew, f)

Once we save the model, we'll never have to re-fit it when we run this notebook. Ideally, we could open the notebook, select "Run all," and the cells would run successfully all the way to the end without any model retraining. 

For this to happen, we'll need to return to the cell where we fit our randomized search and comment out the line that fits the model. Otherwise, when we re-run the notebook, it would refit the model.

Similarly, we'll also need to go back to where we saved the model as a pickle and comment out those lines.  

Next, we'll add a new cell that reads in the saved model from the same file. For this, we'll use `rb` (read binary) and assign the result to a variable, here `loaded_model`.

### Load the Model

In [None]:
with open(filename, 'rb') as f:  # 'rb' = read binary
    loaded_model = load(f)

In [None]:
loaded_model
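
The save and load steps above can be tried end to end without a trained model. The sketch below is a minimal round trip using the standard-library `pickle` module, with a plain dictionary standing in for a fitted estimator; `demo_model`, `demo_path`, and `restored_model` are hypothetical names, and the file is written to a temporary folder.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (hypothetical); any picklable object behaves the same
demo_model = {"name": "demo_model", "params": {"max_depth": 3}}

demo_path = os.path.join(tempfile.mkdtemp(), "model.sav")

# 'wb' = write binary; the with-block closes the file even if dump() fails
with open(demo_path, "wb") as f:
    pickle.dump(demo_model, f)

# 'rb' = read binary
with open(demo_path, "rb") as f:
    restored_model = pickle.load(f)

print(restored_model == demo_model)
```

The restored object is an equal copy rather than the same object in memory. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.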

**=================================================================================================================**

#### Python code done by Dennis Lam