<a href="https://colab.research.google.com/github/claudio1975/SDS2020/blob/master/notebooks/laboratory/exercises/2c_Ex_An_Experimental_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **An Experimental Exploratory Data Analysis for a Classification Task step 2**

### ***From Visualization to Statistical Analysis***

### ***From Feature Engineering to Feature Selection***

### ***From the Best Model Selection to Interpretability***



To start the exploration, set up the environment with the libraries, load the data set (it's stored in a GitHub repository) and split it into the target variable and the feature variables. No further setup is required when using Google Colab. See the guidelines: https://colab.research.google.com/notebooks/welcome.ipynb

#### **Contents**

The goal of this challenge, launched by CrowdAnalytix, is to develop a model that predicts whether a mortgage will be funded or not, based on certain factors in a customer's application data.
The evaluation metric used is the F1 score.
The data set is made up of 45,642 observations with 21 predictor variables (features) and the target variable. It's a classification task whose goal is to predict the 'Result' target variable (Funded, Not Funded) for every row. See the competition: https://www.crowdanalytix.com/contests/propensity-to-fund-mortgages


### **Exploratory Data Analysis (EDA) Pipeline**

![](http://www.theleader.info/wp-content/uploads/2017/07/Mortgage-rates.jpg)

# Prepare Workspace

#####- Upload libraries

In [None]:
# Upload libraries

# to handle data set
import pandas as pd
import numpy as np

# to plot
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
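# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions provide ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import plot_confusion_matrix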

# statistics
import statistics
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import logit
from scipy.stats import chi2_contingency
from scipy.stats import kurtosis 
from scipy.stats import skew
from statistics import stdev 

# to split data set 
from sklearn.model_selection import train_test_split

# standardization
from sklearn.preprocessing import StandardScaler

# to build models
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# to evaluate models
from sklearn.metrics import f1_score

# to handle imbalanced data set
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

# feature engineering
!pip install feature-engine
import feature_engine
from sklearn.preprocessing import KBinsDiscretizer

# feature importance
from sklearn.tree import DecisionTreeClassifier
!pip install eli5 
import eli5
from eli5.sklearn import PermutationImportance
!pip install shap
import shap

import warnings
warnings.filterwarnings('ignore')

#####- Upload data set

In [None]:
# Upload dataset
url = 'https://raw.githubusercontent.com/claudio1975/SDS2020/master/data/CAX_train_small.csv'
df = pd.read_csv(url)

#####- Split data set

In [None]:
# Split data set between target and features
X_full = df
y = X_full.RESULT
X_full = X_full.drop(['RESULT'], axis=1)


# Summarize Data

In [None]:
# Look at dimension of data set and types of each attribute
df.info()

In [None]:
# Summarize attribute distributions of the data frame
df.describe(include='all')

In [None]:
# Take a peek at the first rows of the data
df.head(10)

Explanatory variables are grouped into categorical variables and numerical variables, and each group gets a graphical and a non-graphical analysis. Before this split, let's run some data preparation activities.

# Formatting Features

If necessary, it's good practice to format the data after having taken a peek at it. Missing values in numeric features are marked by "-1", while in categorical features they are marked by "Unknown"; let's replace both markers with "NA".

In [None]:
# Replaced both '-1' and 'Unknown' values with NA's
X_full[X_full== -1] = np.nan
X_full[X_full=="Unknown"] = np.nan

In [None]:
# Format data into float and object types and split mixed variables
X_full['PROPERTY VALUE'] = X_full['PROPERTY VALUE'].astype(float)
X_full['MORTGAGE PAYMENT'] = X_full['MORTGAGE PAYMENT'].astype(float)
X_full['AMORTIZATION'] = X_full['AMORTIZATION'].astype(float)
X_full['TERM'] = X_full['TERM'].astype(float)
X_full['INCOME'] = X_full['INCOME'].astype(float)
X_full['INCOME TYPE'] = X_full['INCOME TYPE'].astype(object)
X_full['CREDIT SCORE'] = X_full['CREDIT SCORE'].astype(float)
X_full['FSA_num'] = X_full['FSA'].str.extract(r'(\d+)') # extract numerical part (raw string avoids an invalid-escape warning)
X_full['FSA_let'] = X_full['FSA'].str[0] # extract the first letter

In [None]:
# Rename some features for a practical use
X_full = X_full.rename(columns={"MORTGAGE PURPOSE":"MORTGAGE_PURPOSE","PAYMENT FREQUENCY":"PAYMENT_FREQUENCY","PROPERTY TYPE":"PROPERTY_TYPE","AGE RANGE":"AGE_RANGE","PROPERTY VALUE": "PROPERTY_VALUE",
                                "MORTGAGE PAYMENT": "MORTGAGE_PAYMENT", "MORTGAGE AMOUNT":"MORTGAGE_AMOUNT","INCOME TYPE":"INCOME_TYPE","CREDIT SCORE":"CREDIT_SCORE"})

# Handling Missing Values

Two categorical features have missing values, in both cases below 40% of the rows. The approach followed here is to fill the missing values by random sampling from the observed values. When the percentage of missing values is large (>=15%), it's also suggested to add a "missing indicator": a boolean variable equal to 1/true where the value was missing and 0/false where it was observed.
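
As a minimal sketch of the random-sampling idea described above, the snippet below fills the gaps of a small, hypothetical `toy` data frame (it is not the mortgage data) and also builds a missing indicator. The column names `GENDER_missing` and `GENDER_imputed` and the `random_state` value are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a categorical column with missing values
toy = pd.DataFrame({"GENDER": ["Male", np.nan, "Female", "Male", np.nan, "Female"]})

# Missing indicator: 1 where the original value was missing, 0 otherwise
toy["GENDER_missing"] = toy["GENDER"].isnull().astype(int)

# Copy the column, then draw as many observed values as there are gaps
toy["GENDER_imputed"] = toy["GENDER"].copy()
n_missing = int(toy["GENDER"].isnull().sum())
random_sample = toy["GENDER"].dropna().sample(n_missing, replace=True, random_state=0)

# Align the sample's index with the rows to fill, so values match row by row
random_sample.index = toy[toy["GENDER"].isnull()].index
toy.loc[toy["GENDER"].isnull(), "GENDER_imputed"] = random_sample

print(toy)
```

Sampling from the observed values keeps the imputed column's category frequencies close to the original ones, which is the main reason to prefer it over filling every gap with the most frequent label.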

In [None]:
# Check missing values both to numeric features and categorical features
X_full.isnull().sum()/X_full.shape[0]*100

In [None]:
# create the new variable where NA will be imputed
X_full['GENDER_imputed'] = 

In [None]:
# extract the random sample to fill the na
random_sample_X_full =

In [None]:
# pandas needs to have the same index in order to merge datasets
random_sample_X_full.index = 

In [None]:
# input missing values in the newly created variable
X_full.loc[X_full['GENDER'].isnull(), 'GENDER_imputed'] = random_sample_X_full

In [None]:
# create the new variable where NA will be imputed
X_full['INCOME_TYPE_imputed'] = 

In [None]:
# extract the random sample to fill the na
random_sample_X_full = 

In [None]:
# pandas needs to have the same index in order to merge datasets
random_sample_X_full.index = 

In [None]:
# input the missing values in the newly created variable
X_full.loc[X_full['INCOME_TYPE'].isnull(), 'INCOME_TYPE_imputed'] = random_sample_X_full

In [None]:
# drop the original features
X_full = X_full.drop(['GENDER', 'INCOME_TYPE'],axis=1) 

In [None]:
# final check
X_full.isnull().sum()/X_full.shape[0]*100

# Handling Categorical Features


In [None]:
# let's have a look at how many labels for categorical features
for col in X_full.columns:
  if X_full[col].dtype =="object":
    print(col, ': ', len(X_full[col].unique()), ' labels')

In [None]:
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_full.columns if
                    X_full[....].nunique() <= 15 and 
                    X_full[....].dtype == "object"]

In [None]:
# Subset with categorical features
cat = X_full[categorical_cols]
cat.columns


In [None]:
cat.info()

#####- Feature Engineering on categorical features: target encoding

Let's transform the categorical features into numerical variables with target encoding: each label is replaced by the mean of the target for that label, which gives machine learning models a numeric representation they can use directly.
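
A minimal sketch of target encoding on a small, hypothetical `toy` data frame (not the mortgage data) may help to fix the idea. The labels, the target values and the column name `PROPERTY_TYPE_encoded` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target (1 = funded)
toy = pd.DataFrame({
    "PROPERTY_TYPE": ["House", "Condo", "House", "Condo", "House", "Duplex"],
    "RESULT": [1, 0, 1, 1, 0, 0],
})

# Mean of the target per category, stored as a {label: mean} dictionary
labels = toy.groupby("PROPERTY_TYPE")["RESULT"].mean().to_dict()

# Replace each label with its mean target value
toy["PROPERTY_TYPE_encoded"] = toy["PROPERTY_TYPE"].map(labels)

print(labels)
print(toy)
```

Because the encoding is computed from the target, computing it on rows that are later used for evaluation can make scores look better than they are; this sketch ignores that point.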

In [None]:
# Merge categorical covariates with dependent variable
cat2 = pd.concat([cat,y],axis=1)
cat2['RESULT'] = np.where(cat2['RESULT']=='FUNDED',1,0)

In [None]:
# calculate the mean target value per category for each feature and capture the result in a dictionary 
MORTGAGE_PURPOSE_LABELS = cat2.groupby(['MORTGAGE_PURPOSE'])['RESULT'].mean().to_dict()
PAYMENT_FREQUENCY_LABELS = cat2.groupby(['PAYMENT_FREQUENCY'])['RESULT'].mean().to_dict()
PROPERTY_TYPE_LABELS = cat2.groupby(['PROPERTY_TYPE'])['RESULT'].mean().to_dict()
AGE_RANGE_LABELS = cat2.groupby(['AGE_RANGE'])['RESULT'].mean().to_dict()
GENDER_LABELS = cat2.groupby(['GENDER_imputed'])['RESULT'].mean().to_dict()
FSA_num_LABELS = cat2.groupby(['FSA_num'])['RESULT'].mean().to_dict()

In [None]:
# replace for each feature the labels with the mean target values
cat2['MORTGAGE_PURPOSE'] = cat2['MORTGAGE_PURPOSE'].map(....)
cat2['PAYMENT_FREQUENCY'] = cat2['PAYMENT_FREQUENCY'].map(....)
cat2['PROPERTY_TYPE'] = cat2['PROPERTY_TYPE'].map(......)
cat2['AGE_RANGE'] = cat2['AGE_RANGE'].map(.....)
cat2['GENDER_imputed'] = cat2['GENDER_imputed'].map(.....)
cat2['FSA_num'] = cat2['FSA_num'].map(.....)

In [None]:
# Look at the new subset
target_cat = cat2.drop(['RESULT'], axis=1)
target_cat.shape

In [None]:
target_cat.head()

# Numerical Features

In [None]:
# Select numerical columns
numerical_cols = [cname for cname in X_full.columns if 
                X_full[cname].dtype in ['float64']]

In [None]:
# Subset with numerical features
num = X_full[numerical_cols]
num.columns

In [None]:
num.info()

In [None]:
# Gather the encoded categorical features and the numerical features into one data frame
X_all = pd.concat([target_cat, num], axis=1, join='inner')

# Split data set

To analyze the performance of a model, it is good practice to split the data set into a training set and a test set. Here it has been decided to split it into three parts (training set, validation set and test set) for a better understanding of the models. The training set is the sample of data used to fit the model. The validation set is a sample of data used to provide an unbiased evaluation of the model fitted on the training set and to tune the model hyperparameters (not in this explorative phase). The test set is a sample of data used to provide an unbiased evaluation of the model applied to data never seen before.
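
As a rough sketch of how two successive 80/20 splits produce three sets, the snippet below uses hypothetical toy arrays (1000 rows, not the mortgage data). The names `X_rest` and `y_rest` are assumptions made for illustration. The result is about 64% training, 16% validation and 20% test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 3 features, binary target
X = np.arange(3000).reshape(1000, 3)
y = np.arange(1000) % 2

# First cut: 80% kept for training + validation, 20% held out for testing
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second cut: 80% of the remainder for training, 20% for validation
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.2, random_state=0)

print(len(X_train), len(X_valid), len(X_test))
```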

In [None]:
# Break off validation and test set from training data
X_train, X_test, y_train, y_test = train_test_split(X_all, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Standardization

Since the features are on different scales, which may negatively impact the skill of some models, the models are evaluated on a standardized copy of the data set. This means the data are transformed so that each feature has a mean value of 0 and a standard deviation of 1.
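
A minimal sketch of standardization on hypothetical toy arrays (not the mortgage data): the scaler learns the mean and standard deviation from the training data, and the same statistics are then applied to any other data. The array values and the name `X_new` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_new = np.array([[2.0, 700.0]])

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)  # learn mean and std on the training data only
X_new_sc = sc.transform(X_new)          # reuse the training statistics on other data

# After scaling, each training column has mean 0 and standard deviation 1
print(X_train_sc.mean(axis=0), X_train_sc.std(axis=0))
print(X_new_sc)
```

Calling `transform` rather than `fit_transform` on held-out data keeps every set on the same scale and avoids using information from the held-out rows.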

In [None]:
# Standardization of data
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_valid_sc = sc.transform(X_valid)  # reuse the scaler fitted on the training set
X_test_sc = sc.transform(X_test)

# Modeling Part

The traditional data exploration is extended by looking at the behaviour of several baseline models and at which features can be relevant for the prediction. This exploration is split into two parts: without handling the imbalanced target variable (scaled baseline models) and handling it (scaled baseline models).

- Evaluation Metric and Confusion Matrix

The confusion matrix is a summary table of the prediction results for a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class. The diagonal elements represent the number of points for which the predicted label is equal to the true label, while the off-diagonal elements are those that are mislabeled by the classifier. Good predictions therefore show up as high diagonal values in the confusion matrix. For this imbalanced classification task, the Accuracy metric is not used; the F1 score is more appropriate because it combines precision and recall as their harmonic mean. Precision indicates how many of the predicted positives are correct, and recall indicates how many of the actual positives the classifier finds. An F1 score equal to 0.00 indicates a poor model, while an F1 score equal to 1.00 indicates a perfect model.
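
The relation between the confusion matrix, precision, recall and the F1 score can be sketched on a handful of hypothetical labels (not predictions from the mortgage data). The lists `y_true` and `y_pred` are assumptions made for illustration. The last line uses `average='macro'`, which averages the per-class F1 scores with equal weight.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for an imbalanced problem (1 = funded, 0 = not funded)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# F1 for the positive class as the harmonic mean of precision and recall
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

# Macro F1: unweighted mean of the F1 scores of the two classes
f1_macro = f1_score(y_true, y_pred, average="macro")

print(cm)
print(precision, recall, f1_manual, f1_macro)
```

On data like this, the macro average is lower than the positive-class F1 because the minority class is predicted less well and counts as much as the majority class.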


#  Modeling Part I: without handling imbalanced data set

The analysis is based on six baseline models: Logistic Regression, the simplest model, which also serves as the benchmark, and five other models: Bagging, Random Forest, AdaBoost, Gradient Boosting Machine and Neural Networks (MLP).

#####- Baseline Models

In [None]:
# Spot Check Algorithms
models = []
models.append(('LogisticRegression', LogisticRegression(random_state=0)))
models.append(('Bagging', BaggingClassifier(random_state=0)))
models.append(('RandomForest', RandomForestClassifier(random_state=0)))
models.append(('AdaBoost', AdaBoostClassifier(random_state=0)))
models.append(('GBM', GradientBoostingClassifier(random_state=0)))
models.append(('NN', MLPClassifier(random_state=0)))
results_tr = []
results_v = []
results_t = []
names = []
score = []
skf = StratifiedKFold(n_splits=5)
for (name, model) in models:
    param_grid = {}
    my_model = GridSearchCV(model,param_grid,cv=skf)
    my_model.fit(.....)
    predictions_tr = my_model.predict(X_train_sc) 
    predictions_v = my_model.predict(X_valid_sc)
    predictions_t = my_model.predict(X_test_sc)
    f1_train = f1_score(y_train, predictions_tr, average='macro') 
    f1_valid = f1_score(y_valid, predictions_v,average='macro') 
    f1_test = f1_score(y_test, predictions_t,average='macro') 
    results_tr.append(f1_train)
    results_v.append(f1_valid)
    results_t.append(f1_test)
    
    names.append(name)
    f_dict = {
        'model': name,
        'f1_train': f1_train,
        'f1_valid': f1_valid,
        'f1_test': f1_test
    }
    score.append(f_dict)
    # Computing Confusion matrix for the above algorithms
    sns.set( rc = {'figure.figsize': (5, 5)})
    plt.figure()
    plot_confusion_matrix(my_model,X_test_sc, y_test,values_format= '.2f', cmap='Blues')
    plt.title(name)
    plt.show()   
score = pd.DataFrame(score, columns = ['model',......])

In [None]:
# Look at the F1 score for each model and for each data set
print(score)

In [None]:
# Plot results for a graphical comparison
print("Spot Check Algorithms")
sns.set( rc = {'figure.figsize': (15, 5)})
plt.figure()
plt.subplot(1,3,1)  
sns.stripplot(x="model", y="f1_train",data=score,size=15)
plt.xticks(rotation=90)
plt.title('Train results')
axes = plt.gca()
axes.set_ylim([0,1.1])
plt.subplot(1,3,2)
sns.stripplot(x="model", y="f1_valid",data=score,size=15)
plt.xticks(rotation=90)
plt.title('Validation results')
axes = plt.gca()
axes.set_ylim([0,1.1])
plt.subplot(1,3,3)
sns.stripplot(x="model", y="f1_test",data=score,size=15)
plt.xticks(rotation=90)
plt.title('Test results')
axes = plt.gca()
axes.set_ylim([0,1.1])
plt.show()