<a href="https://colab.research.google.com/github/aaron-ruhl/Hackathon/blob/main/greatLearningHackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing libraries, data, and dictionary

## Libraries & Options

**Libraries**
****

In [None]:
!pip install feature_engine

In [None]:
#importing standard python libraries for working with numbers
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#in case I decide to make diagnostic plots of skewed distributions
import scipy.stats as stats

#sklearn libraries for data pre-processing
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

#sklearn libraries for model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier

#library for make_pipeline
from sklearn.pipeline import make_pipeline

#sklearn library used for hypertuning
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV

#imblearn library for under/over sampling
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTENC

#feature_engine libraries for refined feature engineering steps
from feature_engine.imputation import (
    AddMissingIndicator,
    RandomSampleImputer
)
from feature_engine.outliers import ArbitraryOutlierCapper
from feature_engine.encoding import OrdinalEncoder,OneHotEncoder
import feature_engine.transformation as vt


**Options**
****

In [None]:
#Setting pandas display options
pd.options.display.max_columns = 50
pd.options.display.max_rows = 100

#google colab display options
import warnings
warnings.filterwarnings('ignore')

## Dataset & Dictionary

**Dataset**
****

In [None]:
from google.colab import files
import io

uploaded=files.upload()
data=pd.read_csv(io.BytesIO(uploaded["Train_set.csv"]))

**Dictionary**
****

In [None]:
uploaded2=files.upload()
dictionary=pd.read_csv(io.BytesIO(uploaded2["Data_Dictionary.csv"]))

# Data Overview

## Preliminary Analysis

In [None]:
dictionary

In [None]:
data.head()

In [None]:
data.tail()

*Observations:*
- `ID` is unique to each applicant, so it is not useful as a predictor
- Categorical Variables
  - Ordinal
    - `loan_grade`, `loan_subgrade`
  - Nominal
    - `loan_term`, `home_ownership`, `income_verification_status`, `loan_purpose`, `state_code`, `application_type`, `job_experience`, & `default`, which is the target class
  
- Numerical Variables
  - Discrete
    - `loan_amnt`, `delinq_2yrs`, `public_records`, `revolving_balance`, `total_acc`, `last_week_pay`
  - Continuous
    - `interest_rate`, `annual_income`, `debt_to_income`, `interest_receive`, `total_current_balance`, `total_revolving_limit`

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.iloc[:,1:].describe(include=np.number).T

*Observations:*
- `total_current_balance` and `total_revolving_limit` are missing in around 5000 rows.
- `interest_rate`, `interest_receive`, `revolving_balance`, & `debt_to_income` have pretty high maximums, as do `total_current_balance` & `total_revolving_limit`.
  - So far these look like right-skewed, roughly lognormal univariate distributions.
- Most applicants did not have any legal cases open (`public_records`) or prior 30+ day delinquencies within the last 2 yrs (`delinq_2yrs`).
- `total_acc` & `last_week_pay` show a similar skew toward high maximums, but these are discrete variables.

In [None]:
high_maximums=['interest_rate', 'interest_receive', 'revolving_balance', 'debt_to_income','total_current_balance', 'total_revolving_limit']

In [None]:
data.iloc[:,1:].describe(include=np.object_).T

*Observations:*
- There are 7 `loan_grade` levels and 35 `loan_subgrade` levels, which should give me an excellent metric for fine-tuning the prediction model. Neither has missing values, which makes sense considering every application gets scored.
- `state_code` had 50 different possibilities, which is great, business was booming. Already I can see that 'CA' was present the most, with around 13.5 thousand entries recorded.
  - The high cardinality of `loan_grade`/`loan_subgrade` & `state_code` may need to be reduced: ordinal values that retain the precise ordering for the former, and fewer categories that retain the overall hierarchy for the latter.
- The rest of the variables are relatively well behaved and can be discussed further in EDA.

In [None]:
data.isna().sum()

*Observations:*
- `job_experience`, `total_current_balance`, `total_revolving_limit`, `delinq_2yrs`, `annual_income`, & `total_acc` look like fields an applicant might have left blank, either at random or for some reason that is not random. I should impute them and add an indicator marking each row that contained a missing value. I might need to get creative, and EDA will help with that.
  - I wonder if the 2 missing observations for `public_records`, `delinq_2yrs`, & `total_acc` occurred with the same `ID`.
- `last_week_pay`: the missing values might just represent zeros, because the applicant did not choose to pay off some of the EMI early. The number missing is smaller here and seems to have been a rare occurrence. Adding an indicator here might be a good idea; I will know more after EDA.
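
The question of whether the missing `public_records`, `delinq_2yrs`, and `total_acc` values fall on the same rows can be checked directly. Below is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; only the column names come from the notebook
toy = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105],
    "public_records": [0.0, np.nan, 1.0, np.nan, 0.0],
    "delinq_2yrs": [0.0, np.nan, 0.0, np.nan, 2.0],
    "total_acc": [12.0, np.nan, 30.0, np.nan, 8.0],
})

cols = ["public_records", "delinq_2yrs", "total_acc"]

# True where a row is missing in every one of the three columns
all_missing = toy[cols].isna().all(axis=1)
# True where a row is missing in at least one of them
any_missing = toy[cols].isna().any(axis=1)

shared_ids = toy.loc[all_missing, "ID"].tolist()
# If both masks agree, the gaps always occur together
same_rows = bool((all_missing == any_missing).all())
print(shared_ids, same_rows)
```

If `same_rows` is `True` on the real data, a single missing indicator would cover all three columns.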

In [None]:
data.default.value_counts(1)

*Observation:*
- With almost one hundred thousand observations, the classes are imbalanced, but not severely.
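
Because accuracy is the score used in the model-building section, the class shares also set a floor: a model that always predicts the majority class already reaches the majority share. Here is a small sketch with made-up labels; `y_toy` is hypothetical and does not reflect the real class ratio.

```python
import pandas as pd

# Hypothetical target column with a 3:1 class ratio (not the real data)
y_toy = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="default")

class_share = y_toy.value_counts(normalize=True)

# Accuracy obtained by always predicting the most common class
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()
print(majority_class, baseline_accuracy)
```

Any model accuracy is probably best read relative to this baseline rather than relative to 50%.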

# Splitting & Isolating Data \*run before proceeding\*

### Splitting data

In [None]:
X=data.drop('default',axis=1)
y=data['default'].astype(float)

**Importing X_test**

In [None]:
uploadedTest = files.upload()
testData=pd.read_csv(io.BytesIO(uploadedTest['Test_set.csv']))

**making the split of train data**

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=10,test_size=0.25,stratify=y)
X_train.shape, X_val.shape

X_test = testData.copy() #The actual 'y_test' is not included and is checked on leaderboard for hackathon

In [None]:
X_train.shape

### Building data_EDA with ID's

In [None]:
# I can use the ID's for EDA to avoid data leakage by only looking at X_train in EDA.
print('Same amount of unique rows as the entire X_train dataset proof:',X_train.shape,'\n',
      X_train.ID.nunique(),sep='')

In [None]:
#Lets check to see that it worked
X_train_IDs = list(X_train.ID.unique())
len(X_train_IDs)

In [None]:
X_train_IDs[:30]

In [None]:
data_EDA = data.set_index('ID')
data_EDA = data_EDA.loc[X_train_IDs]
data_EDA.shape

In [None]:
#just a couple sanity checks
(data_EDA.index == X_train_IDs).all()

In [None]:
data_EDA.head()

In [None]:
X_train.head(50)

*Observations:*
- Data appears to have remained intact. Note, I already split it up and the steps above only required that I get the ID's from X_train.
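
The isolation of the EDA frame can also be confirmed explicitly. This is a minimal sketch with made-up IDs; `toy`, `train_ids`, and `val_ids` are hypothetical stand-ins for `data`, the IDs in `X_train`, and the IDs in `X_val`.

```python
import pandas as pd

# Hypothetical frame standing in for `data`
toy = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6], "default": [0, 1, 0, 0, 1, 0]})

# Pretend these rows were assigned to the training and validation splits
train_ids = [4, 1, 6, 2]
val_ids = [3, 5]

# Same selection pattern as above: index by ID, then pick the training IDs
eda = toy.set_index("ID").loc[train_ids]

# The EDA frame keeps the training order and shares no ID with validation
order_kept = list(eda.index) == train_ids
no_overlap = set(eda.index).isdisjoint(val_ids)
print(order_kept, no_overlap)
```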

# Exploratory Data Analysis

### --- Establishing EDA Standards

1) Categorical
****

In [None]:
#I am setting the columns as a list of like-terms for easier graphing. Also it helps enhance reproducibility during model building, which can easily start to get frustrating when multiple changes need to be made.

ordinal = ['loan_grade','loan_subgrade']
nominal = ['loan_term','home_ownership','income_verification_status','loan_purpose', 'state_code','application_type','job_experience']

target='default'

2) Numerical
****

In [None]:
discreet = ['loan_amnt', 'delinq_2yrs', 'public_records', 'revolving_balance', 'total_acc', 'last_week_pay']
continuous = ['interest_rate', 'annual_income', 'debt_to_income', 'interest_receive','total_current_balance', 'total_revolving_limit']

## **EDA --- Categorical**

## **Univariate**

### --- labeled_barplot
***

In [None]:
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

### **Ordinal Only**

In [None]:
for feature in ordinal:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  labeled_barplot(data_EDA,feature) #counts are shown; the third positional argument is `perc`, not the target

*Observations:*
- Most of the loan applications were given a `loan_grade` of "B" or "C", followed by "A" or "D", then "E", "F", and "G".
- `loan_subgrade` contains a reasonable sample of all types of subgrade.

### **Nominal Only**

In [None]:
for feature in nominal:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  labeled_barplot(data_EDA,feature) #counts are shown; the third positional argument is `perc`, not the target

*Observations:*

`loan_term`  
  - Around 2x "3 years"(48970) compared to "5 years"(20910)

`home_ownership`
  - "Mortgage" and "Rent" (34738 and 28175) make up the vast majority in a roughly even split; "OWN" accounts for the rest (6945).

`income_verification_status`
  - Relatively even split of "Source Verified", "Verified", and "Not Verified".

`loan_purpose`
  - "Debt consolidation" listed most often as the reason for getting the loan.
  - "home_improvement" and "credit_card" were separated from "other".

`state_code`
  - "Idaho" only has one observation & this could cause dimensionality issues if I just convert it into OHE like this.
  - Consider how `loan_grade` and `loan_subgrade` are separated into A,B,C,D and a1,a2,a3,...,g5. The latter allows much more precision, while the former offers reduced complexity/variance. In fact, I later decide to remove the main grading because it adds too much bias.
    - The states might be organized into some reasonable grouping that retains the overall demographic differences. I should try to keep as much precision as possible while reducing the overall complexity here.

`application_type`
  - Only 45 observations for "JOINT" (technically 90 people), far fewer than the almost 70k for "INDIVIDUAL". I will consider removing this column to avoid biasing the "INDIVIDUAL" applicants.

`job_experience`
  - I might be able to use random imputation with missing indicator variables to great effect here.

## **Target**
****



### --- stacked_barplot
***

In [None]:
def stacked_barplot(data, feature, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    feature: independent variable
    target: target variable
    """
    count = data[feature].nunique()
    sorter = data[target].value_counts().index[-1]

    tab1 = pd.crosstab(data[feature], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    sorter2 = data[target].value_counts(1).index[-1]
    tab2 = pd.crosstab(data[feature], data[target], margins=True, normalize='index').sort_values(
        by=sorter2, ascending=False
    )
    print("-" * 120)
    print(tab1)
    print("-" * 120)
    print(tab2)

    tab = pd.crosstab(data[feature], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left",
        frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

### **Ordinal Only**

In [None]:
for feature in ordinal:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  stacked_barplot(data_EDA,feature,target)

*Observations:*
- `loan_grade` & `loan_subgrade` can be reduced to just one or the other because both carry the A,B,C,D distinction.
  - I just need to add an ordinal value for the subgrade and indicators for the overall grade so the classification model may "see" this interaction.
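
One way to build an ordinal value for the subgrade plus indicators for the grade is sketched below. It assumes the labels are a grade letter followed by a digit from 1 to 5 (for example "B3"), which is a guess at the format, and it uses a fixed alphabetical order rather than an order learned from the target. The names `toy`, `subgrade_rank`, and `grade_flags` are hypothetical.

```python
import pandas as pd

# Hypothetical sub-grade labels; the real column may use a different format
toy = pd.DataFrame({"loan_subgrade": ["A1", "B3", "G5", "C2"]})

grade_letter = toy["loan_subgrade"].str[0].str.upper()
step = toy["loan_subgrade"].str[1:].astype(int)

# A1 -> 1, A2 -> 2, ..., B1 -> 6, ..., G5 -> 35
grade_index = grade_letter.map({g: i for i, g in enumerate("ABCDEFG")})
toy["subgrade_rank"] = grade_index * 5 + step

# One indicator column per grade letter present in the data
grade_flags = pd.get_dummies(grade_letter, prefix="grade")
print(toy["subgrade_rank"].tolist())
```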

### **Nominal Only**

In [None]:
for feature in nominal:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  stacked_barplot(data_EDA,feature,target)

*Observations:*
- `state_code` might be further separated into south, east, north, and west, plus others.

## **EDA --- Numerical**

## **Univariate**

### --- histogram_boxplot
***

In [None]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False,bins=None,hue=None,color=None,palette=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_box3, ax_hist2) = plt.subplots(
        nrows=3,  # Number of rows of the subplot grid = 3
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.20, 0.20, 0.60)},
        figsize=figsize,
    )  # creating the 3 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color='RebeccaPurple'
    )  # boxplot will be created and a marker will indicate the mean value of the column
    if hue is None:
      sns.violinplot(
          data=data, x=feature, ax=ax_box3, palette=palette
      )  # overall distribution when no grouping variable is given
    else:
      sns.boxplot(
          data=data, x=feature, y=data[hue], ax=ax_box3, showmeans=True, palette=palette, orient="h"
      )  # one boxplot per level of `hue`
    sns.histplot(
        data=data, x=feature, kde=kde, hue=hue, ax=ax_hist2, bins=bins if bins else "auto", palette=palette
    )  # For histogram; `hue` is now applied whether or not `bins` is given
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

### **Discrete**
****

In [None]:
for feature in discreet:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  histogram_boxplot(data_EDA,feature)

*Observations:*
- Need to check values of `delinq_2yrs` & `public_records` greater than 2-5
- Should investigate `revolving_balance` > 60k
- Should check `total_acc` > 60
- `last_week_pay` > 250

In [None]:
data_EDA[data_EDA['delinq_2yrs']>17]

In [None]:
data_EDA[data_EDA['public_records']>12]

In [None]:
data_EDA[data_EDA['revolving_balance']>900000]

In [None]:
data_EDA[data_EDA['total_acc']>100]

In [None]:
data_EDA[data_EDA['last_week_pay']>269]

### **Continuous**
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  histogram_boxplot(data_EDA,feature)

In [None]:
data_EDA[data_EDA['annual_income']>2000000]

In [None]:
data_EDA[data_EDA['debt_to_income']>55.0]

In [None]:
data_EDA[data_EDA['total_current_balance']>3000000]

In [None]:
data_EDA[data_EDA['total_revolving_limit']>1000000]

## **Target**

### **Discrete**
****

In [None]:
for feature in discreet:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  histogram_boxplot(data_EDA,feature,hue=target)

### Continuous
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  histogram_boxplot(data_EDA,feature,hue=target)

## **EDA --- Target Only**

In [None]:
plt.figure(figsize=(10,7))
sns.histplot(data_EDA,x=target);

*Observation:*
- Imbalanced, but the minority class is still reasonably well represented.

## **EDA --- Correlation Matrix**

In [None]:
plt.figure(figsize=(20,14))
sns.heatmap(data_EDA.corr(numeric_only=True),annot=True,cmap="Spectral"); #numeric_only avoids errors from the text columns in recent pandas

# Data Preprocessing

## Establishing preprocessing standards \*Run before proceeding\*

Grouping `state_code` based on the well-established geographic divisions maintained by the US Census Bureau. The nine divisions used here roll up into four broader regions, which would reduce cardinality even more.
****

In [None]:
newEngland = ['CT','ME','MA','NH','RI','VT']
middleAtlantic = ['NJ','NY','PA']
eastNorthCentral = ['IN','IL','MI','OH','WI']
westNorthCentral = ['IA','KS','MN','MO','NE','ND','SD']
southAtlantic = ['DE','DC','FL','GA','MD','NC','SC','VA','WV']
eastSouthCentral = ['AL','KY','MS','TN']
westSouthCentral = ['AR','LA','OK','TX']
mountain = ['AZ','CO','ID','NM','MT','UT','NV','WY']
pacific = ['AK','CA','HI','OR','WA']

myList= [newEngland,middleAtlantic,eastNorthCentral,westNorthCentral,southAtlantic,eastSouthCentral,westSouthCentral,mountain,pacific]
myNames= ['newEngland','middleAtlantic','eastNorthCentral','westNorthCentral','southAtlantic','eastsouthCentral','westSouthCentral','mountain','pacific']

In [None]:
def state_code_filter(data):
  # Replaces each state code with its Census division name; modifies `data` in place
  for name, states in zip(myNames, myList):
    data.loc[data['state_code'].isin(states),'state_code'] = name
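
The same regrouping can also be expressed as a lookup table. The sketch below repeats only two of the nine division lists so that it runs on its own; `divisions`, `state_to_division`, and `toy` are hypothetical names, and it returns a new column instead of editing the frame in place.

```python
import pandas as pd

# Two of the nine division lists, repeated here so the sketch runs on its own
divisions = {
    "newEngland": ["CT", "ME", "MA", "NH", "RI", "VT"],
    "pacific": ["AK", "CA", "HI", "OR", "WA"],
}

# Invert to a state -> division lookup
state_to_division = {
    state: name for name, states in divisions.items() for state in states
}

toy = pd.DataFrame({"state_code": ["CA", "VT", "WA", "TX"]})

# Codes missing from the lookup (here "TX") are left unchanged
toy["state_code"] = toy["state_code"].map(state_to_division).fillna(toy["state_code"])
print(toy["state_code"].tolist())
```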

Defining the Outlier Capping
****

In [None]:
capper = ArbitraryOutlierCapper(max_capping_dict={
    'delinq_2yrs': 18, 'public_records': 13, 'revolving_balance': 1000000,'total_acc': 100,'last_week_pay': 270,'annual_income': 2000000,'debt_to_income': 55.0,'total_current_balance': 3000000,'total_revolving_limit': 1000000
    },
                                min_capping_dict=None)
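
To make the effect of maximum capping concrete, here is a rough pandas equivalent on made-up values. It illustrates the idea and is not the library's internal code; `toy`, `max_caps`, and `capped` are hypothetical names, and the two caps are taken from the dictionary above.

```python
import pandas as pd

# Hypothetical values; the caps are the ones passed to the capper above
toy = pd.DataFrame({
    "delinq_2yrs": [0, 3, 25],
    "debt_to_income": [12.5, 80.0, 54.9],
})
max_caps = {"delinq_2yrs": 18, "debt_to_income": 55.0}

# Values above a cap are set to the cap; everything else is untouched
capped = toy.copy()
for column, cap in max_caps.items():
    capped[column] = capped[column].clip(upper=cap)
print(capped)
```

Rows are kept rather than dropped, so the sample size and the target distribution stay the same.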

Missing
****

In [None]:
missing=['job_experience','annual_income','delinq_2yrs','public_records','total_acc','last_week_pay','total_current_balance','total_revolving_limit','interest_receive','debt_to_income']
#data_EDA.isna().sum()>0

Numerical Columns + `loan_subgrade`
****

In [None]:
numericalColumns = discreet+continuous
numericalColumns.append('loan_subgrade')
numericalColumns

## Experimentation with data_EDA

### --- diagnostic_plots
****

In [None]:
def diagnostic_plots(df, feature, bins=28):
    # The function takes a dataframe (df) and
    # the feature of interest as arguments.

    # Define figure size.
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[feature], bins=bins)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[feature], dist="norm", plot=plt)
    plt.ylabel('Ordered values')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[feature])
    plt.title('Boxplot')

    plt.show()

### Filtering Categories with data_EDA

In [None]:
state_code_filter(data_EDA)

In [None]:
data_EDA.state_code.value_counts()

*Observations:*
- This is a huge improvement! The 50 state codes are now grouped into at most 9 division labels.

### Imputing Missing Values with data_EDA

In [None]:
data_EDA.replace({'interest_receive':0},value=np.nan,inplace=True)

In [None]:
missingIndicator = AddMissingIndicator()
data_EDA = missingIndicator.fit_transform(data_EDA)

randomImputer = RandomSampleImputer(random_state = 1, variables=missing)
data_EDA = randomImputer.fit_transform(data_EDA)

print(data_EDA.isna().sum())

data_EDA
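
The two steps above, flagging and then filling, can be sketched in plain pandas and NumPy. This illustrates the technique on a made-up column rather than the `feature_engine` implementation; `toy`, `rng`, and the `_na` suffix on the indicator column are assumptions of the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical column with gaps
toy = pd.DataFrame({"annual_income": [40000.0, np.nan, 65000.0, np.nan, 52000.0]})

is_missing = toy["annual_income"].isna()

# Flag the rows that were missing before anything is filled in
toy["annual_income_na"] = is_missing.astype(int)

# Fill each gap with a value drawn at random from the observed ones
observed = toy.loc[~is_missing, "annual_income"].to_numpy()
toy.loc[is_missing, "annual_income"] = rng.choice(observed, size=int(is_missing.sum()))
print(toy)
```

Drawing from observed values keeps the shape of the column's distribution, which a mean or median fill would not.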

In [None]:
data_EDA.describe(include=np.number).T

### Capping Outliers with data_EDA

In [None]:
# outlier detection using boxplot
numeric_columns = discreet + continuous


plt.figure(figsize=(15, 12))

for i, feature in enumerate(numeric_columns):
    plt.subplot(6, 4, i + 1)
    plt.boxplot(data_EDA[feature], whis=1.5)
    plt.tight_layout()
    plt.title(feature)

plt.show()

In [None]:
data_EDA = capper.fit_transform(data_EDA)

In [None]:
'''
Some checks to make sure it is working as expected
'''

#data_EDA[data_EDA['delinq_2yrs'] >= 18]
#data_EDA[data_EDA['public_records'] >= 13]
#data_EDA[data_EDA['revolving_balance'] >= 1000000]
#data_EDA[data_EDA['total_acc'] >= 100]
#data_EDA[data_EDA['last_week_pay'] >= 270]
#data_EDA[data_EDA['annual_income'] >= 2000000]
#data_EDA[data_EDA['debt_to_income'] >= 55.0]
#data_EDA[data_EDA['total_current_balance'] >= 3000000]
#data_EDA[data_EDA['total_revolving_limit'] >= 1000000]
#data_EDA.info()
data_EDA.describe(include=np.number).T

In [None]:
data.describe(include=np.number).T

**Massive improvement**

### Checking transforms with data_EDA


**Discrete**
****

In [None]:
for feature in discreet:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  diagnostic_plots(data_EDA,feature)

---
log transform
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))

  transform=pd.DataFrame(np.log(data_EDA[feature]))

  diagnostic_plots(transform,feature)

---
sqrt transform
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))

  transform=pd.DataFrame(np.sqrt(data_EDA[feature])) #`**1/2` halves the values because `**` binds tighter than `/`

  diagnostic_plots(transform,feature)

---
yeojohnson
****

In [None]:
from scipy.stats import yeojohnson

for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))

  transform,_= stats.yeojohnson(data_EDA[feature])
  boxcox_EDA = pd.DataFrame(transform,columns=[feature])
  boxcox_EDA

  diagnostic_plots(boxcox_EDA,feature)

**Continuous**
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any():
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))
  diagnostic_plots(data_EDA,feature)

---
log transform
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))

  transform=pd.DataFrame(np.log(data_EDA[feature]))

  diagnostic_plots(transform,feature)

---
sqrt transform
****

In [None]:
for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))

  transform=pd.DataFrame(np.sqrt(data_EDA[feature])) #`**1/2` halves the values because `**` binds tighter than `/`

  diagnostic_plots(transform,feature)

---
yeojohnson
****

In [None]:
from scipy.stats import yeojohnson

for feature in continuous:
  if (data_EDA[feature].isna()).any() == True:
    print("*"*80,"\n",
          "missing detected for {}, and total missing: {}".format(feature,data_EDA[feature].isna().sum()))

  transform,_= stats.yeojohnson(data_EDA[feature])
  boxcox_EDA = pd.DataFrame(transform,columns=[feature])
  boxcox_EDA

  diagnostic_plots(boxcox_EDA,feature)
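
Alongside the plots, the transforms can be compared numerically through sample skewness. The sketch below uses a simulated right-skewed feature, not the loan data; `income_like` and `skewness` are hypothetical names.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive feature (simulated, not the loan data)
income_like = pd.Series(rng.lognormal(mean=0.0, sigma=0.6, size=5000))

# yeojohnson returns the transformed values and the fitted lambda
yeo_values, _ = stats.yeojohnson(income_like)

skewness = pd.Series({
    "raw": income_like.skew(),
    "log": np.log(income_like).skew(),
    "sqrt": np.sqrt(income_like).skew(),
    "yeojohnson": pd.Series(yeo_values).skew(),
})

# Values closer to 0 indicate a more symmetric distribution
print(skewness.round(2))
```

On real columns the log transform needs strictly positive values, so zeros would have to be handled first.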

### StandardScaler with data_EDA

In [None]:
continuous

In [None]:
#scaling the numerical columns
scaler = StandardScaler()

data_EDA[continuous] = scaler.fit_transform(
    data_EDA[continuous])

In [None]:
scaler.get_feature_names_out()

In [None]:
data_EDA[continuous].describe().T

### Trying Ordinal Discretization with data_EDA

In [None]:
data_EDA.public_records.value_counts()

In [None]:
data_EDA.replace({'public_records': {
    0: "Zero", 1.0: "One", 2.0: "Two", 3.0: "Three", 4.0: "Four", 5.0: "Five", 6.0: "Six",
    7.0: "Seven_to_Thirteen", 8.0: "Seven_to_Thirteen", 9.0: "Seven_to_Thirteen", 10.0: "Seven_to_Thirteen", 11.0: "Seven_to_Thirteen", 12.0: "Seven_to_Thirteen",
    13.0: "Seven_to_Thirteen"}},inplace=True)

In [None]:
labeled_barplot(data_EDA,'public_records')

In [None]:
stacked_barplot(data_EDA,'public_records','default')

In [None]:
stacked_barplot(data_EDA,'loan_subgrade','default')

In [None]:
data_EDA_Xtrain = data_EDA.drop('default',axis=1)
data_EDA_ytrain = data_EDA['default']

enc = OrdinalEncoder(encoding_method = 'ordered',variables=['loan_subgrade','public_records'])

data_EDA_Xtrain = enc.fit_transform(data_EDA_Xtrain, data_EDA_ytrain)

data_EDA_Xtrain.loan_subgrade.unique()

In [None]:
enc.encoder_dict_

In [None]:
data_EDA_Xtrain.info()
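
With `encoding_method='ordered'`, the integer codes follow the ranking of the categories by the mean of the target, which is why the fit step needs `y`. The idea can be sketched in plain pandas on made-up rows. This illustrates the technique and is not the library's internal code; `toy`, `target_mean`, and `ordered_codes` are hypothetical names.

```python
import pandas as pd

# Hypothetical training rows
toy = pd.DataFrame({
    "loan_subgrade": ["A1", "A1", "B2", "B2", "C3", "C3"],
    "default": [0, 0, 1, 0, 1, 1],
})

# Mean of the target per category, sorted from lowest to highest
target_mean = toy.groupby("loan_subgrade")["default"].mean().sort_values()

# Categories are numbered in that order: lowest default rate -> 0
ordered_codes = {category: code for code, category in enumerate(target_mean.index)}
encoded = toy["loan_subgrade"].map(ordered_codes)
print(ordered_codes)
```

Since the mapping is learned from the target, it should be fitted on training rows only and then reused for validation and test rows.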

## Preparing Data \*Can skip if going to "Final Pipeline"\*

**Final Steps for pipeline**
****
These are the final steps I decided on based on experimentation. They can be skipped if just running the final pipeline, but should be run in order to use the "Model Building" section.
****
\*\*NOTE: **ONLY** data_EDA has been altered and manipulated until now. This has ensured X_train,X_val, & X_test were isolated from all experimentation done. Now, for the first time I will access and manipulate these splits.\*\*

In [None]:
#Removing `loan_grade` because `loan_subgrade` also includes markers for loan_grade.
X_train.drop('loan_grade',axis=1,inplace=True)
X_val.drop('loan_grade',axis=1,inplace=True)
X_test.drop('loan_grade',axis=1,inplace=True)

#Removing ID because it is unique and I decided it may not be worth the hassle of grouping in any meaningful way
X_train.drop('ID',axis=1,inplace=True)
X_val.drop('ID',axis=1,inplace=True)
X_test.drop('ID',axis=1,inplace=True)

#Fixing one of the values in `job_experience`. It had a '<' character that caused an issue with sklearn
X_train.replace({'job_experience': '<5 Years'},value='under5yrs',inplace=True)
X_val.replace({'job_experience': '<5 Years'},value='under5yrs',inplace=True)
X_test.replace({'job_experience': '<5 Years'},value='under5yrs',inplace=True)

#Replacing these zeros with missing values
X_train.replace({'interest_receive':0,'total_revolving_limit':0},value=np.nan,inplace=True)
X_val.replace({'interest_receive':0,'total_revolving_limit':0},value=np.nan,inplace=True)
X_test.replace({'interest_receive':0,'total_revolving_limit':0},value=np.nan,inplace=True)

#Running the state_code filter
state_code_filter(X_train)
state_code_filter(X_val)
state_code_filter(X_test)
'''
A check to run if interested
'''
#X_train.state_code.value_counts()


In [None]:
#Adding missing indicators
missingIndicator = AddMissingIndicator(variables=['job_experience','last_week_pay','total_current_balance','total_revolving_limit'])
X_train = missingIndicator.fit_transform(X_train)
X_val = missingIndicator.transform(X_val)

#Random Sample Imputation
randomImputer = RandomSampleImputer(random_state = 1, variables=missing)
X_train = randomImputer.fit_transform(X_train)
X_val = randomImputer.transform(X_val)

In [None]:
#Capping the outliers with capper defined in preprocessing standards
X_train = capper.fit_transform(X_train)
X_val = capper.transform(X_val)

In [None]:
#Ordinal Encoding of 'loan_subgrade'
enc = OrdinalEncoder(encoding_method = 'ordered',variables=['loan_subgrade'])

X_train = enc.fit_transform(X_train, y_train)
X_val = enc.transform(X_val)

In [None]:
'''
Can run this to get the dictionary from the transformer fit
'''
#enc.encoder_dict_

In [None]:
#Setting the standard for which columns I want to be converted to OHE
oneHotCols = ['loan_term','home_ownership','income_verification_status',
              'loan_purpose','state_code','application_type','job_experience'] #state_code should be filtered BEFORE going into OHE. Also notice that 'loan_subgrade' is not included because I was able to make this an ordinal variable.

In [None]:
#Making OHE columns to prepare categorical data for sklearn
X_train=pd.get_dummies(X_train, columns=oneHotCols)
X_val=pd.get_dummies(X_val, columns=oneHotCols)
X_test=pd.get_dummies(X_test, columns=oneHotCols)
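
Calling `pd.get_dummies` on each split separately can produce different column sets when a category is absent from one split. One possible safeguard, sketched on made-up frames, is to reindex the other splits to the training columns; `train`, `val`, `train_ohe`, and `val_ohe` are hypothetical names.

```python
import pandas as pd

# Hypothetical splits: the validation split lacks "OWN" and has an unseen "OTHER"
train = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE"]})
val = pd.DataFrame({"home_ownership": ["RENT", "OTHER"]})

train_ohe = pd.get_dummies(train, columns=["home_ownership"])
val_ohe = pd.get_dummies(val, columns=["home_ownership"])

# Force the validation frame onto the training columns:
# missing columns are added as 0, columns unseen in training are dropped
val_ohe = val_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(list(val_ohe.columns) == list(train_ohe.columns))
```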

In [None]:
'''
Final check to see what this ended up producing
'''

#X_train.describe(include=np.number).T

In [None]:
'''
Final check to see what this ended up producing
'''

#X_train.info()

# Model Building

## Defining Scorer
****

In [None]:
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.accuracy_score)

## Defining Confusion Matrix
****

In [None]:
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,X_test=X_val,labels=[1, 0]): #I exposed X_test in the function definition so I can switch it to X_val if needed
    '''
    model : classifier to predict values of X
    y_actual : ground truth

    '''
    y_predict = model.predict(X_test)
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
                  columns = [i for i in ['Predicted - No','Predicted - Yes']])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
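
For reference, the layout of the matrix behind the heatmap can be checked on a tiny made-up example. With `labels=[0, 1]`, rows are the actual classes and columns are the predicted classes; `y_true` and `y_pred` below are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# With labels=[0, 1]: rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Share of all observations in each cell, as in the heatmap annotations
cell_share = cm / cm.sum()
print(cm)
```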

## Baseline

### --- Defining repeatedly used variables
****

In [None]:
# Setting default data length, number of splits equal to 5, and building a list used repeatedly
dataLength = len(X_train)
n_splits=5
fold_columns=["fold1","fold2","fold3","fold4","fold5"]

### Model Building with Original data

In [None]:
'''
Setting up the models list
'''

models = []  # Empty list to store all the models

# Appending models into the list

models.append(("dtree", DecisionTreeClassifier(random_state=7)))
models.append(("logit", LogisticRegression(random_state=7)))
models.append(("bagging", BaggingClassifier(random_state=7)))
models.append(("random_forest", RandomForestClassifier(random_state=7)))
models.append(("adaboost", AdaBoostClassifier(random_state=7)))
models.append(("gradient", GradientBoostingClassifier(random_state=7)))
models.append(("xgboost", XGBClassifier(random_state=7)))

In [None]:
results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

print("\n",
      "running cross-validation on training dataset, {} splits & length of {}...\n".format(n_splits,len(X_train))
      )

# loop through all models to get the mean cross validated score
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=1
    )
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
#making a DataFrame of the results for graphing
results_plot=(pd.DataFrame(results,columns=fold_columns,index=names)).T

print(results_plot,'\n\nMean cross-validation scores...\n\n',
      results_plot.mean(),sep='')

print("\n" "checking performance against `X_val` dataset...")


scores = []

# loop through all models to get the validation data score
for name, model in models:
    model.fit(X_train, y_train)
    score = metrics.accuracy_score(y_val, model.predict(X_val))
    scores.append(score)

results_val_plot = pd.DataFrame(scores,index=names,columns=["Accuracy"])
results_val_plot #making a DataFrame of the results for graphing
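
To make the idea behind `StratifiedKFold` concrete, here is a rough pure-Python sketch of stratified splitting. It is only an illustration under my own assumptions (the `stratified_folds` helper and the toy labels are made up, and sklearn's actual algorithm differs in its details): indices are shuffled within each class and dealt round-robin into the folds, so every fold keeps roughly the same class ratio as the full dataset.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=1):
    """Deal the indices of each class round-robin into n_splits folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    return folds

# Toy data: 80 negatives and 20 positives (an 80/20 imbalance)
toy_labels = [0] * 80 + [1] * 20
folds = stratified_folds(toy_labels)
fold_sizes = [len(f) for f in folds]
positives_per_fold = [sum(toy_labels[i] for i in f) for f in folds]
print(fold_sizes, positives_per_fold)
```

Each of the five folds ends up with 20 rows, 4 of them positive, which is the same 20% positive rate as the toy dataset.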

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(data=results_plot,showmeans=True)
plt.title("Cross-validation accuracy by model (original data)")
plt.ylabel("Accuracy");

In [None]:
sns.barplot(data=results_val_plot,x=results_val_plot["Accuracy"],
            y=results_val_plot.index)
plt.yticks(rotation=45)
plt.title("Test of models against `X_val` dataset");

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train,y_train)
  make_confusion_matrix(model,y_train,X_train)

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train,y_train)
  make_confusion_matrix(model,y_val)

### Model Building with Undersampled data

In [None]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print('Original y_train:\n {}\nNew y_train_un: \n{}'.format(y_train.value_counts(1),y_train_un.value_counts(1)),'\n',sep='')
X_train_un.shape,y_train_un.value_counts()
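
For intuition, random undersampling with `sampling_strategy=1` keeps every minority row and a random subset of the majority rows until the two classes are the same size. The sketch below is a simplified stand-in written with the standard library only; `undersample_indices` and the toy labels are hypothetical, and this is not how imblearn implements it internally.

```python
import random
from collections import Counter

def undersample_indices(labels, ratio=1.0, seed=1):
    """Keep all minority rows and just enough random majority rows so that
    n_minority / n_majority_kept is about `ratio` (binary labels only)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles binary labels only")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    n_keep = min(n_major, int(n_minor / ratio))
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    return sorted(rng.sample(major_idx, n_keep) + minor_idx)

# Toy data: 90 majority rows and 10 minority rows
toy_labels = [0] * 90 + [1] * 10
kept = undersample_indices(toy_labels, ratio=1.0)
balanced = Counter(toy_labels[i] for i in kept)
print(balanced)
```

With `ratio=0.5` the same helper would keep 20 majority rows for the 10 minority rows, which is the setting used in the combined under/oversampling section.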

In [None]:
results_un = []  # Empty list to store all model's CV scores
names_un = []

dataLengthUn = len(X_train_un)

print("\n",
      "running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLengthUn,np.round(dataLengthUn/5,0),np.round(dataLengthUn-(dataLengthUn/5),0))
      )

# loop through all models to get the mean cross validated score
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=1
    )
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results_un.append(cv_result)
    names_un.append(name)

#making a DataFrame of the results for graphing
results_plot_un=(pd.DataFrame(results_un,columns=fold_columns,index=names_un)).T

print(results_plot_un,'\n\nMean cross-validation scores...\n',
      results_plot_un.mean(),sep='')

print("\n" "checking performance against `X_val` dataset...")

scores=[]

# loop through all models to get the validation data score
for name, model in models:
    model.fit(X_train_un, y_train_un)
    score = metrics.accuracy_score(y_val, model.predict(X_val))
    scores.append(score)

results_val_plot_un = pd.DataFrame(scores,index=names_un,columns=["Accuracy"])
results_val_plot_un #making a DataFrame of the results for graphing

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(data=results_plot_un,showmeans=True)
plt.title("Cross-validation accuracy by model (undersampled data)")
plt.ylabel("Accuracy");

In [None]:
sns.barplot(data=results_val_plot_un,x=results_val_plot_un["Accuracy"],
            y=results_val_plot_un.index)
plt.yticks(rotation=45)
plt.title("Test of models against `X_val` dataset");

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train_un,y_train_un)
  make_confusion_matrix(model,y_train_un,X_train_un)

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train_un,y_train_un)
  make_confusion_matrix(model,y_val)

### Model Building with Oversampled data


In [None]:
# Synthetic Minority Over Sampling Technique
smnc = SMOTENC(sampling_strategy=1, k_neighbors=5, random_state=1, categorical_features=np.arange(13,45))

X_train_over, y_train_over = smnc.fit_resample(X_train, y_train)

print('Original y_train:\n{}\nNew y_train_over: \n{}'.format(y_train.value_counts(1),y_train_over.value_counts(1)),'\n',sep='')
X_train_over.shape,y_train_over.value_counts()

In [None]:
results_over = []  # Empty list to store all model's CV scores
names_over = []

dataLengthOver = len(X_train_over)

print("\n",
      "running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLengthOver,np.round(dataLengthOver/5,0),np.round(dataLengthOver-(dataLengthOver/5),0))
      )

# loop through all models to get the mean cross validated score
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=1
    )
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results_over.append(cv_result)
    names_over.append(name)

#making a DataFrame of the results for graphing
results_plot_over=(pd.DataFrame(results_over,columns=fold_columns,index=names_over)).T

print(results_plot_over,'\n\nMean cross-validation scores...\n\n',
      results_plot_over.mean(),sep='')

print("\n" "checking performance against `X_val` dataset...")

scores=[]

# loop through all models to get the validation data score
for name, model in models:
    model.fit(X_train_over, y_train_over)
    score = metrics.accuracy_score(y_val, model.predict(X_val))
    scores.append(score)

results_val_plot_over = pd.DataFrame(scores,index=names_over,columns=["Accuracy"])
results_val_plot_over #making a DataFrame of the results for graphing

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(data=results_plot_over,showmeans=True)
plt.title("Cross-validation accuracy by model (oversampled data)")
plt.ylabel("Accuracy");

In [None]:
sns.barplot(data=results_val_plot_over,x=results_val_plot_over["Accuracy"],
            y=results_val_plot_over.index)
plt.yticks(rotation=45)
plt.title("Test of models against `X_val` dataset");

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train_over,y_train_over)
  make_confusion_matrix(model,y_train_over,X_train_over)

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train_over,y_train_over)
  make_confusion_matrix(model,y_val,X_val)

### Model Building with Under/Oversampled data

In [None]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=.5)
X_train_un2, y_train_un2 = rus.fit_resample(X_train, y_train)
print('Original y_train:\n {}\nNew y_train_un2: \n{}'.format(y_train.value_counts(1),y_train_un2.value_counts(1)),'\n',sep='')
X_train_un2.shape,y_train_un2.value_counts()

In [None]:
# Synthetic Minority Over Sampling Technique
smnc = SMOTENC(sampling_strategy=1, k_neighbors=5, random_state=1, categorical_features=np.arange(13,45))

X_train_over2, y_train_over2 = smnc.fit_resample(X_train_un2, y_train_un2)

print('Original y_train:\n{}\nNew y_train_over2: \n{}'.format(y_train.value_counts(1),y_train_over2.value_counts(1)),'\n',sep='')
X_train_over2.shape,y_train_over2.value_counts()
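
The class counts after the two-step resampling can be worked out by hand. The sketch below does that arithmetic under the assumption that a float `sampling_strategy` means "minority count divided by majority count after resampling"; the helper name and the 9000/1000 starting counts are invented for illustration and are not the real dataset sizes.

```python
def combined_resample_counts(n_major, n_minor, under_ratio=0.5, over_ratio=1.0):
    """Expected class counts after undersampling the majority class to
    `under_ratio`, then oversampling the minority class to `over_ratio`."""
    # Step 1: drop majority rows until n_minor / n_major_kept == under_ratio
    n_major_kept = min(n_major, int(n_minor / under_ratio))
    # Step 2: synthesize minority rows until n_minor_final / n_major_kept == over_ratio
    n_minor_final = max(n_minor, int(n_major_kept * over_ratio))
    return n_major_kept, n_minor_final

# Hypothetical starting point: 9000 majority rows, 1000 minority rows
counts_after = combined_resample_counts(9000, 1000)
print(counts_after)
```

In this toy case the majority class shrinks from 9000 to 2000 rows and SMOTENC only has to synthesize 1000 new minority rows, instead of the 8000 it would need without the undersampling step.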

In [None]:
results_over2 = []  # Empty list to store all model's CV scores
names_over2 = []

n_splits=5# Setting number of splits equal to 5
fold_columns=["fold1","fold2","fold3","fold4","fold5"]
dataLengthOverUn = len(X_train_over2)

print("\n",
      "running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLengthOverUn,np.round(dataLengthOverUn/5,0),np.round(dataLengthOverUn-(dataLengthOverUn/5),0))
      )

# loop through all models to get the mean cross validated score
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=1
    )
    cv_result = cross_val_score(
        estimator=model, X=X_train_over2, y=y_train_over2, scoring=scorer, cv=kfold
    )
    results_over2.append(cv_result)
    names_over2.append(name)

#making a DataFrame of the results for graphing
results_plot_over2=(pd.DataFrame(results_over2,columns=fold_columns,index=names_over2)).T

print(results_plot_over2,'\n\nMean cross-validation scores...\n\n',
      results_plot_over2.mean(),sep='')

print("\n" "checking performance against `X_val` dataset...")

scores=[]

# loop through all models to get the validation data score
for name, model in models:
    model.fit(X_train_over2, y_train_over2)
    score = metrics.accuracy_score(y_val, model.predict(X_val))
    scores.append(score)

results_val_plot_over2 = pd.DataFrame(scores,index=names_over2,columns=["Accuracy"])
results_val_plot_over2 #making a DataFrame of the results for graphing

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(data=results_plot_over2,showmeans=True)
plt.title("Cross-validation accuracy by model (under/oversampled data)")
plt.ylabel("Accuracy");

In [None]:
sns.barplot(data=results_val_plot_over2,x=results_val_plot_over2["Accuracy"],
            y=results_val_plot_over2.index)
plt.yticks(rotation=45)
plt.title("Test of models against `X_val` dataset");

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train_over2,y_train_over2)
  make_confusion_matrix(model,y_train_over2,X_train_over2)

In [None]:
for model in models:
  print(model[0],"*"*50)
  model=model[1].fit(X_train_over2,y_train_over2)
  make_confusion_matrix(model,y_val,X_val)

## Hypertuning

### Tuned with Original data

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))

In [None]:
models_tuned = []

`Decision Tree Tuning`
****

**Pre-Pruning**

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,10),
              'min_samples_leaf': [1, 4, 7, 10],
              'max_leaf_nodes' : [10,12,16,18],
              'min_impurity_decrease': [0.0001,0.001,.01] }
print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=30, n_jobs = -1, scoring=scorer, cv=n_splits, random_state=7,verbose=2)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
best_params=randomized_cv.best_params_
models_tuned.append(("dtree_tuned_original", randomized_cv.best_estimator_))

#Printing the results
print('Computed final model from CV of various random arrangements created in random grid search....\n',
      '\n','Best parameters are {}'.format(best_params),
      '\n','---& with CV score(Accuracy)={}'.format(randomized_cv.best_score_),
      '\n','Feature Importances:{}'.format(pd.DataFrame(randomized_cv.best_estimator_.feature_importances_,
                                                     index=[X_train.columns],columns=['feature_importances']).sort_values(
                                                         by='feature_importances',ascending=False).head(10)
                                                     ),sep='')
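
To show what the randomized search buys here: the decision tree grid above has 8 x 4 x 4 x 3 = 384 combinations, and `n_iter=30` evaluates only 30 of them. The sketch below imitates that sampling with the standard library. `sample_candidates` is a made-up helper rather than sklearn's own sampler, so the combinations it picks will not match the ones sklearn draws.

```python
import itertools
import random

# Same grid as the decision tree cell, written with plain lists
param_grid = {
    "max_depth": list(range(2, 10)),
    "min_samples_leaf": [1, 4, 7, 10],
    "max_leaf_nodes": [10, 12, 16, 18],
    "min_impurity_decrease": [0.0001, 0.001, 0.01],
}

def sample_candidates(grid, n_iter, seed=7):
    """Draw n_iter distinct parameter combinations from the full grid."""
    keys = sorted(grid)
    all_combos = list(itertools.product(*(grid[k] for k in keys)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, min(n_iter, len(all_combos)))
    return [dict(zip(keys, combo)) for combo in chosen], len(all_combos)

candidates, total = sample_candidates(param_grid, n_iter=30)
print(f"{len(candidates)} of {total} combinations sampled")
print(candidates[0])
```

With 5-fold CV that is 150 fits instead of 1920 for the exhaustive grid.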


`Bagging Classifier Tuning`
****

**RANDOM GRID SEARCH**

In [None]:
# defining model
Model = BaggingClassifier(random_state=7)

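# Parameter grid to pass in RandomizedSearchCV
# Floats are fractions of the data; an int 1 would mean a single sample/feature, so 1.0 is used for "all"
param_grid = {'max_samples': [0.6,0.9,1.0],
              'max_features': [0.3,0.6,0.9,1.0],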
              'n_estimators' : np.arange(50,90,3)
}
print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=30, n_jobs = -1, scoring=scorer, cv=n_splits, random_state=7,verbose=2)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
best_params=randomized_cv.best_params_
models_tuned.append(("bagging_tuned_original", randomized_cv.best_estimator_))

#Printing the results
print("Computed final model from CV of various random arrangements created in random grid search....\n\nBest parameters are {}\n ----& with CV score(Accuracy)={}\n".format(best_params,randomized_cv.best_score_))
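
One thing worth spelling out about `max_samples` and `max_features` in `BaggingClassifier`: the sklearn documentation treats an int as an absolute count and a float as a fraction. So `1` means a single row (or feature) per base estimator, while `1.0` means all of them. The helper below is a simplified illustration of that documented rule, not sklearn's internal code, and `rows_drawn` is a name I made up.

```python
def rows_drawn(max_samples, n_rows):
    """Rows each base estimator would draw, following the documented
    int-is-a-count / float-is-a-fraction convention."""
    if isinstance(max_samples, int):
        return max_samples            # absolute number of rows
    return int(max_samples * n_rows)  # fraction of the training set

n_rows = 1000  # hypothetical training set size
for setting in (0.6, 1, 1.0):
    print(repr(setting), "->", rows_drawn(setting, n_rows))
```

On a 1000-row training set, `0.6` draws 600 rows, `1.0` draws all 1000, and the int `1` draws just one.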

**GRID SEARCH**

In [None]:
# defining model
Model = BaggingClassifier(random_state=7)

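# Parameter grid to pass in GridSearchCV
# Floats are fractions of the data; an int 1 would mean a single sample/feature, so 1.0 is used for "all"
param_grid = {'max_samples': [0.6,0.9,1.0],
              'max_features': [0.3,0.6,0.9,1.0],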
              'n_estimators' : np.arange(50,90,3)
}
print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, n_jobs = -1, scoring=scorer, cv=n_splits,verbose=2)

#Fitting parameters in GridSearchCV
grid_cv=grid_cv.fit(X_train,y_train)
best_params=grid_cv.best_params_
models_tuned.append(("bagging_tuned_original_grid", grid_cv.best_estimator_))

#Printing the results
print("Computed final model from CV of every parameter combination in the grid search....\n\nBest parameters are {}\n ----& with CV score(Accuracy)={}\n".format(best_params,grid_cv.best_score_))

`Random Forest Tuning`
****

**RANDOM GRID SEARCH**

In [None]:
# defining model
Model = RandomForestClassifier(random_state=1)

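# Parameter grid to pass in RandomizedSearchCV
# max_features candidates must be scalars ('sqrt', 'log2', None, or a float fraction), not a nested list
param_grid = { "n_estimators": [100,150,200,250],
              "min_samples_leaf": np.arange(1,7),
              "max_features": ['sqrt','log2',None,0.2,0.3,0.5],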
              "max_samples": np.arange(0.5, 0.8, 0.1)}

print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=30, n_jobs = -1, scoring=scorer, cv=n_splits, random_state=7,verbose=2)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
best_params=randomized_cv.best_params_
models_tuned.append(("random_forest_tuned_original", randomized_cv.best_estimator_))

#Printing the results
print('Computed final model from CV of various random arrangements created in random grid search....\n',
      '\n','Best parameters are {}'.format(best_params),
      '\n','---& with CV score(Accuracy)={}'.format(randomized_cv.best_score_),
      '\n','Feature Importances:{}'.format(pd.DataFrame(randomized_cv.best_estimator_.feature_importances_,
                                                     index=[X_train.columns],columns=['feature_importances']).sort_values(
                                                         by='feature_importances',ascending=False).head(10)
                                                     ),sep='')

**GRID SEARCH**

In [None]:
# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV
# max_features candidates must be scalars ('sqrt' or a float fraction), not a nested list
param_grid = { "n_estimators": [100,150,250],
              "min_samples_leaf": np.arange(1,7),
              "max_features": [0.2,0.3,0.5,'sqrt'],
              "max_samples": np.arange(0.5, 0.8, 0.1)}

print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling GridSearchCV
grid_cv = GridSearchCV(Model, param_grid, n_jobs = -1, scoring=scorer, cv=n_splits,verbose=2)

#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
best_params=grid_cv.best_params_

#Printing the results
print('Computed final model from CV of every parameter combination in the grid search....\n',
      '\n','Best parameters are {}'.format(best_params),
      '\n','---& with CV score(Accuracy)={}'.format(grid_cv.best_score_),
      '\n','Feature Importances:{}'.format(pd.DataFrame(grid_cv.best_estimator_.feature_importances_,
                                                     index=[X_train.columns],columns=['feature_importances']).sort_values(
                                                         by='feature_importances',ascending=False).head(10)
                                                     ),sep='')

`XGBoost Tuning`
****

**RANDOM GRID SEARCH**

In [None]:
# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid={ 'n_estimators': [150,200, 250, 300],
            'scale_pos_weight': [5,10,3],
            'learning_rate': [0.1,0.2,0.05],
            'gamma': [0,3,5],
            'subsample': [0.7,0.8,0.9] }

print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=30, n_jobs = -1, scoring=scorer, cv=n_splits, random_state=7,verbose=2)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
best_params=randomized_cv.best_params_
models_tuned.append(("xgboost_tuned_original", randomized_cv.best_estimator_))

#Printing the results
print('Computed final model from CV of various random arrangements created in random grid search....\n',
      '\n','Best parameters are {}'.format(best_params),
      '\n','---& with CV score(Accuracy)={}'.format(randomized_cv.best_score_),
      '\n','Feature Importances:{}'.format(pd.DataFrame(randomized_cv.best_estimator_.feature_importances_,
                                                     index=[X_train.columns],columns=['feature_importances']).sort_values(
                                                         by='feature_importances',ascending=False).head(10)
                                                     ),sep='')

**GRID SEARCH**

In [None]:
# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV
param_grid={ 'n_estimators': [150,200,250,300],
            'scale_pos_weight': [3,5,10],
            'learning_rate': [0.1,0.2,0.05],
            'gamma': [0,3,5],
            'subsample': [0.7,0.8,0.9] }

print("Running cross-validation on training dataset, {} splits & length of {}...which is a test size of {} and train size of {}\n".format(n_splits,dataLength,np.round(dataLength/5,0),np.round(dataLength-(dataLength/5),0))
      )

#Calling GridSearchCV
grid_cv = GridSearchCV(Model, param_grid,  n_jobs = -1, scoring=scorer, cv=n_splits,verbose=2)

#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
best_params=grid_cv.best_params_

#Printing the results
print('Computed final model from CV of every parameter combination in the grid search....\n',
      '\n','Best parameters are {}'.format(best_params),
      '\n','---& with CV score(Accuracy)={}'.format(grid_cv.best_score_),
      '\n','Feature Importances:{}'.format(pd.DataFrame(grid_cv.best_estimator_.feature_importances_,
                                                     index=[X_train.columns],columns=['feature_importances']).sort_values(
                                                         by='feature_importances',ascending=False).head(10)
                                                     ),sep='')

## Model Performance Comparison

### Original data tuning results

In [None]:
#Save point so I do not have to run all the above tuning results in one long mega-session
'''
models_tuned = [
 ('dtree_tuned',
  DecisionTreeClassifier(min_samples_leaf= 10, min_impurity_decrease= 0.0001,
                            max_leaf_nodes=18, max_depth =9, random_state=7)),
 ('logit_tuned', LogisticRegression(C=0.1, random_state=7)),
 ('random_forest_tuned',
  RandomForestClassifier(n_estimators= 150, min_samples_leaf= 4, max_samples= 0.7, max_features= None,random_state=7)),
 ('adaboost_tuned',
  AdaBoostClassifier(base_estimator= DecisionTreeClassifier(max_depth=3,
                                                            random_state=7),
                     n_estimators= 250, learning_rate= 0.2)),
 ('xgboost_tuned',
  XGBClassifier(subsample= 0.9, scale_pos_weight= 3,
                      n_estimators = 300, learning_rate = 0.2, gamma = 5,random_state=7)),
 ('bagging_tuned',
  BaggingClassifier(max_features=0.9, max_samples=0.8,
                    n_estimators=70,random_state=7))]
'''

In [None]:
models_tuned

In [None]:
results_tuned = []  # Empty list to store all model's CV scores
names_tuned = []

n_splits=5 # Setting number of splits equal to 5
fold_columns=["fold1","fold2","fold3","fold4","fold5"]

# loop through all models to get the mean cross validated score
for name, model in models_tuned:
    kfold = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=7
    )
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results_tuned.append(cv_result)
    names_tuned.append(name)

#making a DataFrame of the results for graphing
results_plot_tuned=(pd.DataFrame(results_tuned,columns=fold_columns,index=names_tuned)).T

print(results_plot_tuned,'\n\nMean cross-validation scores...\n',
      results_plot_tuned.mean(),sep='')

print("\n" "checking performance against `X_val` dataset..." "\n")

scores=[]

# loop through all models to get the validation data score
for name, model in models_tuned:
    model.fit(X_train, y_train)
    score = metrics.accuracy_score(y_val, model.predict(X_val))
    scores.append(score)

results_val_plot_tuned = pd.DataFrame(scores,index=names_tuned,columns=["Accuracy"])
results_val_plot_tuned#making a DataFrame of the results for graphing

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(data=results_plot_tuned,showmeans=True)
plt.xticks(rotation=45)
plt.ylabel("Accuracy")
plt.title("Cross Validation Performance of tuned models with original dataset");

In [None]:
sns.barplot(data=results_val_plot_tuned,x=results_val_plot_tuned["Accuracy"],
            y=results_val_plot_tuned.index)
plt.yticks(rotation=45)
plt.xlim(.7,1)
plt.title("Test of models against `X_val` dataset, tuned with original dataset");

In [None]:
for model in models_tuned:
  print(model[0],"*"*50)
  model=model[1].fit(X_train,y_train)
  make_confusion_matrix(model,y_val)

In [None]:
from sklearn.ensemble import StackingClassifier

final_model = StackingClassifier(
    estimators=
    [
    ('xgboost_tuned',
     XGBClassifier(subsample= 0.9, scale_pos_weight= 3,
                   n_estimators = 300, learning_rate = 0.2, gamma = 3,random_state=7)),
     ('bagging_tuned',
      BaggingClassifier(max_features=0.9, max_samples=0.8,
                        n_estimators=70,random_state=7)),
    ])

final_model.fit(X_train, y_train)
score = metrics.accuracy_score(y_val, final_model.predict(X_val))

print(score)

In [None]:
print("Stacking Classifier results:\n")
make_confusion_matrix(final_model,y_val)

# Final Pipeline & Final Results

## Building Final Pipeline

### Establishing Data Filtering

\* I could probably have wrapped these filtering steps in a custom transformer class for the pipeline, but this works and I prefer keeping each step explicit.\*

In [None]:
#Removing `loan_grade` because `loan_subgrade` also includes markers for loan_grade.
X_train.drop('loan_grade',axis=1,inplace=True)
X_val.drop('loan_grade',axis=1,inplace=True)
X_test.drop('loan_grade',axis=1,inplace=True)

#Removing ID because it is unique and I decided it may not be worth the hassle of grouping in any meaningful way
X_train.drop('ID',axis=1,inplace=True)
X_val.drop('ID',axis=1,inplace=True)

#saving the IDlist so that I can make a csv file with the IDs and results later
iDlist = X_test['ID']
#then also dropping it as I did with all others
X_test.drop('ID',axis=1,inplace=True)

#Fixing one of the values in `job_experience`. It had a '<' character that caused an issue with sklearn
X_train.replace({'job_experience': '<5 Years'},value='under5yrs',inplace=True)
X_val.replace({'job_experience': '<5 Years'},value='under5yrs',inplace=True)
X_test.replace({'job_experience': '<5 Years'},value='under5yrs',inplace=True)

#Replacing these zeros with missing values
X_train.replace({'interest_receive':0,'total_revolving_limit':0},value=np.nan,inplace=True)
X_val.replace({'interest_receive':0,'total_revolving_limit':0},value=np.nan,inplace=True)
X_test.replace({'interest_receive':0,'total_revolving_limit':0},value=np.nan,inplace=True)

#Running the state_code filter
state_code_filter(X_train)
state_code_filter(X_val)
state_code_filter(X_test)

In [None]:
#Setting the standard for which columns I want to be converted to OHE 1's or 0s
oneHotCols = ['loan_term','home_ownership','income_verification_status',
              'loan_purpose','state_code','application_type','job_experience'] #note state_code should already be filtered by this point

max_capping_dictionary={
        'delinq_2yrs': 18, 'public_records': 13,
        'revolving_balance': 1000000,'total_acc': 100,'last_week_pay': 270,
        'annual_income': 2000000,'debt_to_income': 55.0,
        'total_current_balance': 3000000,'total_revolving_limit': 1000000}
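
For reference, capping at a fixed maximum just replaces any value above the cap with the cap itself and leaves everything else alone. The sketch below shows that on a single made-up record with plain Python. `cap_maximums` and the example row are hypothetical, and the real work in the pipeline is done by `ArbitraryOutlierCapper`, which operates on whole DataFrame columns.

```python
def cap_maximums(row, max_caps):
    """Return a copy of `row` with capped columns clipped to their maximum."""
    capped = {}
    for column, value in row.items():
        if column in max_caps and value is not None:
            capped[column] = min(value, max_caps[column])
        else:
            capped[column] = value  # untouched: no cap defined or value missing
    return capped

# Two of the caps from the dictionary above
caps = {"delinq_2yrs": 18, "total_acc": 100}

# Made-up record: one value above its cap, one below, one uncapped column
example_row = {"delinq_2yrs": 25, "total_acc": 40, "loan_term": "3 years"}
capped_row = cap_maximums(example_row, caps)
print(capped_row)
```

Only `delinq_2yrs` changes (25 becomes 18); no rows are dropped, which is the point of capping rather than filtering outliers.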

### Defining Pipeline

In [None]:
#Building a final pipeline whose steps are named automatically using make_pipeline from sklearn.pipeline
finalPipe=make_pipeline(
    AddMissingIndicator(variables=
                        ['job_experience','last_week_pay',
                         'total_current_balance','total_revolving_limit']),
    RandomSampleImputer(random_state = 1, variables=missing),
    ArbitraryOutlierCapper(max_capping_dict=max_capping_dictionary,min_capping_dict=None),
    OrdinalEncoder(encoding_method = 'ordered',
                   variables=['loan_subgrade']),
    OneHotEncoder(variables=oneHotCols),
    StackingClassifier(estimators=[
        ('xgboost_tuned',
     XGBClassifier(subsample= 0.9, scale_pos_weight= 3,
                   n_estimators = 300, learning_rate = 0.2, gamma = 3,random_state=7)),
        ('bagging_tuned',
     BaggingClassifier(max_features= 0.9, max_samples= 0.9, n_estimators= 59))
        ],n_jobs=-1)
    )

In [None]:
finalPipe.steps

### Fitting the Final Pipeline

In [None]:
finalPipe.fit(X_train,y_train)

## Final pipeline: CV results

In [None]:
#Running a CV test with final model
n_splits=5
fold_columns=["fold1","fold2","fold3","fold4","fold5"]

kfold = StratifiedKFold(
    n_splits=n_splits, shuffle=True, random_state=7
)
cv_result = cross_val_score(
    estimator=finalPipe, X=X_train, y=y_train, scoring=scorer, cv=kfold
)

#Printing out the results from CV test
print('spread of CV scores: {}\nMean CV Score: {}'.format(cv_result,cv_result.mean()))

## Final pipeline: X_val results



Original run

In [None]:
#Printing out different metrics of X_val test results (sklearn expects y_true first, then y_pred)
y_val_pred = finalPipe.predict(X_val)
print('Accuracy: {}\nRecall: {}\nPrecision: {}\nF1: {}'.format(metrics.accuracy_score(y_val,y_val_pred),
      metrics.recall_score(y_val,y_val_pred),
      metrics.precision_score(y_val,y_val_pred),
      metrics.f1_score(y_val,y_val_pred)))
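
A note on argument order: sklearn's metric functions are documented as `metric(y_true, y_pred)`. Accuracy and F1 come out the same either way for a binary target, but precision and recall trade places when the two arguments are swapped. The toy labels and the `precision_recall` helper below are made up to show this with plain Python, and they are not the validation results from this notebook.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: 4 actual positives, 3 predicted positives, 2 of them correct
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

correct_order = precision_recall(y_true, y_pred)
swapped_order = precision_recall(y_pred, y_true)
print("correct order (precision, recall):", correct_order)
print("swapped order (precision, recall):", swapped_order)
```

Here the correct order gives a precision of 2/3 and a recall of 1/2; with the arguments swapped, the two numbers are reported under each other's names.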

In [None]:
#Making a confusion matrix of these results from X_val predictions
print("Stacking Classifier Results:\n")
make_confusion_matrix(finalPipe,y_val,X_test=X_val)

Second run at a later date, with similar results

In [None]:
#Printing out different metrics of X_val test results (sklearn expects y_true first, then y_pred)
y_val_pred = finalPipe.predict(X_val)
print('Accuracy: {}\nRecall: {}\nPrecision: {}\nF1: {}'.format(metrics.accuracy_score(y_val,y_val_pred),
      metrics.recall_score(y_val,y_val_pred),
      metrics.precision_score(y_val,y_val_pred),
      metrics.f1_score(y_val,y_val_pred)))

In [None]:
#Making a confusion matrix of these results from X_val predictions
print("Stacking Classifier Results:\n")
make_confusion_matrix(finalPipe,y_val,X_test=X_val)

## Final pipeline: X_test results

In [None]:
#making results list
results = finalPipe.predict(X_test)
results

In [None]:
#checking IDlist
iDlist

**Building csv file**

In [None]:
#building DataFrame of "results" with IDs as index and default as the only column name
finalResults=pd.DataFrame(results,index=iDlist,columns=["default"])
finalResults

In [None]:
#saving this result locally to submit
finalResults.to_csv('final_results.csv')

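#downloading the file through the browser (only works inside Google Colab)
from google.colab import files
files.download("final_results.csv")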