# Project - Supervised Classification - Loan Modelling


## Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

We need to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

## Objective

* Predict whether a liability customer will buy a personal loan.
* Identify which variables are most significant.
* Identify which segment of customers should be targeted more.

## Key Questions

1. What are the key factors influencing whether a liability customer will buy a personal loan or not?
2. Is there a good predictive model so that we can increase the conversion rate of the campaign? 
3. What does the performance assessment look like for such a model?


## Data Dictionary
* ID: Customer ID
* Age: Customer’s age in completed years
* Experience: Number of years of professional experience
* Income: Annual income of the customer (in thousand dollars)
* ZIP Code: Home Address ZIP code.
* Family: the Family size of the customer
* CCAvg: Average spending on credit cards per month (in thousand dollars)
* Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
* Mortgage: Value of house mortgage if any. (in thousand dollars)
* Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
* Securities_Account: Does the customer have securities account with the bank?
* CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
* Online: Do customers use internet banking facilities?
* CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)?

## Import necessary libraries and load data

In [None]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Removes the limit from the number of displayed columns and rows.
# This is so that we can see the entire dataframe when we print it
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 200)
# We are setting the random seed via np.random.seed so that
# we get the same random results every time
np.random.seed(1)


In [None]:
# Load the loan modelling CSV file
data = pd.read_csv('Loan_Modelling.csv')

In [None]:
print(f'Number of rows: {data.shape[0]} and number of columns: {data.shape[1]}')

In [None]:
# Check the first 5 rows of data
data.head()

In [None]:
# Check the last 5 rows of data
data.tail()

In [None]:
# checking column names, datatypes and number of non-null values
data.info()

In [None]:
# Checking for missing values.
data.isnull().sum()

In [None]:
# Checking for duplicate rows.
data.duplicated().sum()

**Observations:**
* ID column can be dropped as it is a serial number and does not contain any useful information
* There are no null or missing values in the data. 
* There are no duplicate records.
* All column types are numeric.

In [None]:
# Not interested in the ID column so we are going to drop the ID column
data.drop(['ID'], axis=1, inplace=True)

In [None]:
# Check the Summary of the data
data.describe().T

**Observations:**

* *Age*: Varies from 23 to 67 years with mean and median around 45 years. This data range looks good.
* *Experience*: Ranges from -3 to 43 years. Negative value for professional experience does not make sense. It could be data input errors. We shall inspect the values and take corrective steps.
* *Income*: Minimum is 8K and max is 224K. Mean is around 73K and median 64K. This data range looks good and no correction is required.
* *ZIPCode*: All zip codes are from 90005 to 96651. Hence data appears to be from state of California. We need to figure out a strategy to convert this data to a set of categorical values. 
* *Family*: Customer family size with min 1 and maximum 4 and mean/median around 2. This data range looks good.
* *CCAvg*: Average spending on credit cards per month varies from 0 to 10K with mean of 2k and median of 1.5K. This data range also looks good. 
* *Education*: Education column is a category column values are 1(Undergraduate), 2(Graduate) or 3(Advanced/Professional).
* *Mortgage*: Value of house mortgage also seems normal and varies from 0 to 635K. The median mortgage is 0, since most of the customers do not have a mortgage.
* *Personal_Loan*: This is our target variable and is of type binary 0 or 1(accepted the personal loan offer). 
* *Securities_Account*, *CD_Account*, *Online*, *CreditCard*: All these columns are again binary type categorical variables with 0 and 1 values. Data summary shows no abnormality in the data values.

Other than Age, Experience, Income, CCAvg and Mortgage, all other features are categorical based on the data definitions. Let's check the range of values for these categorical variables to make sure they are categorical, and the number of unique values in each.

In [None]:
cat_cols = ['ZIPCode', 'Family', 'Education', 'Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
for col in cat_cols:
    print('>> Domain of: ', col)
    cat = data[col].value_counts(dropna=False).sort_values(ascending=False)
    print(cat)
    print('-------------------------------\n')

**Observations:**
* ZIP codes have 467 unique values. We are going to convert ZIP codes to a more manageable categorical type.
* Other categorical values are as described in the data definition hence we are good with them.

### Fixing negative Experience Values

In [None]:
# Check the rows where we have negative Professional Experience 
len(data[data.Experience < 0])

In [None]:
# Let's check all the rows, since there are 52
data[data.Experience < 0]

> **Observations**:
* Where Experience is negative, the customers are between 23 and 29 years old, i.e. younger customers with little professional experience. The negative sign therefore looks like a data-entry error, so we convert these values to positive; no other transformation is necessary.
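
A minimal sketch of the sign fix on a toy frame. The values are made up for illustration and are not taken from the dataset; `Series.abs()` is used as a vectorized way to flip negative values.

```python
import pandas as pd

# Toy data (hypothetical values): three customers with negative Experience.
toy = pd.DataFrame({"Age": [24, 25, 29, 45], "Experience": [-1, -2, -3, 20]})

# Count the negative entries before the fix.
n_negative = int((toy["Experience"] < 0).sum())

# Flip the sign of negative entries; non-negative values are unchanged.
toy["Experience"] = toy["Experience"].abs()

print(n_negative, toy["Experience"].tolist())
```

Counting the negative rows before and after the fix is a cheap check that nothing else was altered.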

In [None]:
# Convert the negative values to positive (abs() leaves non-negative values unchanged)
data['Experience'] = data['Experience'].abs()

In [None]:
# Making sure there are no negative values
len(data[data.Experience < 0])

In [None]:
# Sample of rows having Experience > 40 also looks good as Age of these customers are over 60.
data[data.Experience > 40].sample(10)

### Treatment of ZIPCode column

We will use [uszipcode](https://pypi.org/project/uszipcode/) to convert each ZIP code to its major city, county and state, and check the range of values in each category. For this purpose we write a function to get the major city, county or state from a ZIP code.

In [None]:
#!pip install uszipcode
from uszipcode import SearchEngine
zip_search = SearchEngine(simple_zipcode=True)
def zip_to_name(code, name):
    """
    Convert zip code to another attribute, such as City or County
    code: ZIP code, name: major_city, county or state
    Returns name of the attribute. If major_city then it will return the Major City of the given  zip code.
    """
    zipcode = zip_search.by_zipcode(code)
    return zipcode.to_dict()[name]

In [None]:
# Test function with sample ZIP code
print('Major city for zip: 22102: ', zip_to_name(22102, 'major_city'))
print('County for zip: 22102: ', zip_to_name(22102, 'county'))
print('State for zip: 22102: ', zip_to_name(22102, 'state'))

In [None]:
# Add three columns Major_City, County and State from Zip code.
data['Major_City'] = data['ZIPCode'].map(lambda zcode: zip_to_name(zcode, 'major_city'))
data['County'] = data['ZIPCode'].map(lambda zcode: zip_to_name(zcode, 'county'))
data['State'] = data['ZIPCode'].map(lambda zcode: zip_to_name(zcode, 'state'))

In [None]:
# Take a look of random sample of 10 rows
data[['ZIPCode', 'Major_City', 'County', 'State']].sample(10)

Check the value ranges of the newly added columns. We keep NA values in the counts (`dropna=False`) because invalid ZIP codes will have no city, county or state.

In [None]:
# Check the value ranges of the newly added columns, 
for col in ['Major_City', 'County', 'State']:
    print('>> Domain of: ', col)
    cat = data[col].value_counts(dropna=False).sort_values(ascending=False)
    print(cat)
    print('\n')

**Observations**:
* It seems 34 rows have invalid ZIP codes, as they have NaN in the State and County columns.
* Major_City: Converting ZIP code to Major_City still leaves 245 unique values, which is very high and would make the model very complex.
* County: There are also more than 30 unique counties. However, we can group counties by the median income of their customers to segregate customers into a smaller number of groups, which is a better fit for our machine learning models.
* State: Converting ZIP code to state yields only one or two values, so we may lose the significance of the ZIP column altogether.

Hence we go with the County column and merge counties into a smaller number of groups based on the median income of the counties.
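
The grouping idea can be sketched on toy data as follows. The county names, incomes and bin edges here are hypothetical and chosen only to illustrate the two steps: attach the per-county median income to each row, then bin that median with `pd.cut()`.

```python
import pandas as pd

# Toy data (hypothetical counties and incomes, in thousand dollars).
toy = pd.DataFrame(
    {
        "County": ["A", "A", "A", "B", "B", "C"],
        "Income": [30, 50, 70, 80, 90, 45],
    }
)

# Attach each county's median income to every row of that county.
toy["Income_Median"] = toy.groupby("County")["Income"].transform("median")

# Bin the medians; with the default right=True each bin is (low, high].
bins = [20, 50, 70, 90]
labels = ["MI_<" + str(edge) for edge in bins[1:]]
toy["County_MI"] = pd.cut(toy["Income_Median"], bins=bins, labels=labels)

print(toy)
```

Note that with the default `right=True` a median exactly on a bin edge falls into the lower bin, so county A (median 50) lands in `MI_<50`; passing `right=False` would move it to the next bin.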

In [None]:
# Take a look at a sample of 10 rows where the county is missing.
data[data['County'].isnull()].sample(10)

In [None]:
# check invalid zipcodes based on the County = None
data[data['County'].isnull()]['ZIPCode'].unique()

Checking with [USPS](https://tools.usps.com/zip-code-lookup.htm?citybyzipcode), it seems the above ZIP codes are invalid and do not exist. We will use "Other" in place of the missing counties.

In [None]:
# https://tools.usps.com/zip-code-lookup.htm?citybyzipcode
data['County'] = data['County'].fillna('Other')

In [None]:
# Adding a column that will have median income group by County
data['Income_Median'] = data.groupby('County')['Income'].transform(lambda x: x.median())

In [None]:
# Checking sample of the data
data[['County', 'Income', 'Income_Median']].sample(10)

In [None]:
# Sort and print the unique median income values, so that we can use pd.cut() to bin the median incomes
# to manageable ranges
print(sorted(data['Income_Median'].unique()))

In [None]:
# Median income grouped by County varies from 25K to 86K, but there are no values between 25K and 44K,
# hence we bin the median income based on the ranges below.
bins = [20, 40, 50, 60, 70, 80, 90]

# There is one label fewer than there are bin edges. Each label names the upper edge of its bin: MI_<40 for
# median income up to 40K, MI_<50 for 40K to 50K, MI_<60 for 50K to 60K, etc.
labels = ['MI_<'+ str(i) for i in bins[1:]]
print(bins)
print(labels)

In [None]:
# Add a new column 'County_MI' county median income to collect the binned data using pd.cut()
data['County_MI'] = pd.cut(data['Income_Median'], bins=bins, labels=labels)

In [None]:
# Check random sample of the data
data[['ZIPCode', 'County', 'Income_Median', 'County_MI']].sample(10)

In [None]:
# Now we can see all counties are categorized into six median income groups, we are going to use this new column 
# in place of ZIPCode
data['County_MI'].value_counts(ascending=False, dropna=False)

In [None]:
# We have added few columns, now lets check all the columns and we are going to drop the columns that are not necessary.
data.columns

In [None]:
# Keep a backup of the 'ZIPCode', 'County', 'Income_Median' and 'County_MI' data
zip_county_mi = data[['ZIPCode','County','Income_Median', 'County_MI']].copy()

# Drop 'ZIPCode','Major_City', 'County', 'State', 'Income_Median' and only keep County_MI column
data.drop(['ZIPCode','Major_City', 'County', 'State', 'Income_Median'], axis=1, inplace=True)

In [None]:
# Check the first 5 rows of the modified data set
data.head()

In [None]:
data.info()

Data preparation and feature engineering are complete, and we end up with 13 columns. ZIPCode has been replaced with County_MI. If we find more issues during the univariate analysis, we should revisit and treat the affected feature accordingly.

## Univariate Analysis

To make the univariate analysis easier, we define the functions below. They plot non-categorical columns such as Age and Income as box plots and histograms, and categorical columns such as Family, Education and Online as count plots with percentages.

In [None]:
# function to plot a boxplot and a histogram along the same scale.
def box_hist_plot(data, feature, figsize=(12, 7), bins=10):
    """
    This will show a box and hist plot in a column alignment, For hist plot kde is set to True

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    """
    # creating the 2 subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  
    # boxplot will be created and a star will indicate the mean value of the column
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")  
    # For hist plot
    sns.histplot(data=data, x=feature, kde=True, bins=bins, ax=ax_hist2) 
    # Add mean to the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--") 
    # Add median to the histogram
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  

In [None]:
# function to create bar plot with percent labels
def bar_perc_plot(data, feature):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    """

    # calculate the fig width dynamically
    total = len(data[feature]) 
    count = data[feature].nunique()
    plt.figure(figsize=(count + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        # Sort the bars from high to low
        order=data[feature].value_counts().sort_values(ascending=False).index,
    )

    for bar in ax.patches:
        # percentage of each class of the category
        label = "{:.2f}%".format(100 * bar.get_height() / total)  
        x = bar.get_x() + bar.get_width() / 2  # width of the plot
        y = bar.get_height()  # height of the plot

        # annotate the percentage
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        ) 

    plt.show();  # show the plot

### Observations on Age

In [None]:
#Age
box_hist_plot(data, 'Age')

> Age is normally distributed, with no outliers.

### Observations on Experience

In [None]:
#Experience
box_hist_plot(data, 'Experience')

> Professional experience of customers is normally distributed and there are no outliers.

### Observations on Income

In [None]:
#Income
box_hist_plot(data, 'Income')

> Income is right skewed, with the mean and median separated by around 10K. However, this is not an issue for Logistic Regression or Decision Tree models. The text below is from the logistic regression literature:

> "Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level."

> Hence we do not need to apply a log or similar transformation to this column. We follow the same approach for the other columns as well.
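
As a rough illustration of how right skew shows up in the numbers, the sketch below compares mean and median and computes the sample skewness with `Series.skew()`. The income values are hypothetical and are not drawn from the dataset.

```python
import pandas as pd

# Toy right-skewed sample (hypothetical incomes in thousand dollars).
income = pd.Series([20, 25, 30, 35, 40, 45, 60, 90, 150, 220])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # sample skewness; > 0 indicates a longer right tail

print(f"mean={mean_income:.1f}, median={median_income:.1f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skewness is the numeric counterpart of the long right tail seen in the histogram.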

### Observations on CCAvg

In [None]:
#CCAvg
box_hist_plot(data, 'CCAvg')

> CCAvg is right skewed and has many outliers. Again, outliers are not a big issue for Logistic Regression or Decision Tree models.
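
The points drawn as outliers in a box plot are, by the usual convention, those beyond 1.5 times the interquartile range from the quartiles. A minimal sketch of counting them is shown here; `count_iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR] (the usual boxplot whiskers)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Toy monthly card spend (hypothetical, in thousand dollars).
toy_ccavg = pd.Series([0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.5, 9.0])
print(count_iqr_outliers(toy_ccavg))
```

Applied to a real column, a count like this puts a number on "many outliers" instead of relying on the plot alone.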

### Observations on Mortgage

In [None]:
#Mortgage
box_hist_plot(data, 'Mortgage')

> Mortgage is highly right skewed and has many outliers. This is because most customers do not have a mortgage, as suggested by the median of 0. Only a few customers have a very high mortgage in the range of 500K to 600K, but that is not abnormal.

In [None]:
# Customers having higher mortgage have higher income and are older customers. This shows data is not misleading.
data[data.Mortgage > 500].sample(10)

### Observations on Family

In [None]:
bar_perc_plot(data, 'Family')

> Most of the families have a single member, followed by 2, 4 and 3 members.

### Observations on Education

In [None]:
bar_perc_plot(data, 'Education')

> Most customers are Undergraduate, followed by Advanced/Professional and Graduate.

### Observations on Personal_Loan

In [None]:
bar_perc_plot(data, 'Personal_Loan')

> * 90.4% of customers have not responded or signed up for Personal Loan
> * Only 9.6% of customers have responded to the Personal Loan campaign.
> * Conversion rate of the campaign is below 10%. 

### Observations on Securities_Account

In [None]:
bar_perc_plot(data, 'Securities_Account')

> A little over 10% of customers have a Securities account with the bank.

### Observations on CD_Account

In [None]:
bar_perc_plot(data, 'CD_Account')

> Only around 6% of customers have a CD account with the bank.

### Observations on Online

In [None]:
bar_perc_plot(data, 'Online')

> Around 40% of customers use Online banking, but the majority of customers do not.

### Observations on CreditCard

In [None]:
bar_perc_plot(data, 'CreditCard')

> Around 29% of customers use credit cards issued by banks other than AllLife Bank.

### Observations on Customer Locations

In [None]:
bar_perc_plot(data, 'County_MI')

> * More than 70% of customers are from counties where the median income is between 60K and 70K.
> * Only a few customers (less than 1%) are from very high income or very low income counties.

## Multivariate Analysis

In [None]:
# Check the correlation between features
plt.figure(figsize=(15,8))
# numeric_only=True (pandas >= 1.5) skips the categorical County_MI column
sns.heatmap(data.corr(numeric_only=True), annot=True);

> * Age and Experience are very strongly correlated. We will remove 'Experience' feature during our model building. 
> * Income and monthly credit card average are correlated but it is not very high.
> * Also there is a little correlation between Income and Personal Loan
> * Otherwise most of the features are not very correlated to each other
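
Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses toy data in which Experience is constructed to track Age; the 0.9 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): Experience is constructed to track Age closely.
toy = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40, 45, 50, 55, 60],
        "Experience": [1, 5, 10, 16, 20, 25, 31, 35],
        "Income": [40, 120, 35, 90, 60, 150, 30, 80],
    }
)

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then flatten.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds an (arbitrary) 0.9 cut-off.
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```

Only one feature of each strongly correlated pair needs to be kept for a linear model such as logistic regression.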

In [None]:
# Draw Pair plot excluding the binary columns, we need to check the binary columns separately as they don't produce good result in pairplots
sns.pairplot(
    data[['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Family', 'Education', 'Personal_Loan']], 
    hue='Personal_Loan'
);

> **Observations**:
> * Here again we see Age and Experience are very strongly correlated. 
> * It is interesting to see that customers who have responded to the Personal Loan campaign occupy a different space in the pair plots.
> * Personal Loan customers sit towards higher Income, CCAvg, Mortgage, Family size and Education.
> * Personal Loan customers belong to all Age and Experience in general.

### Distribution of Personal Loan with Non-Categorical features

Ignoring the outliers let's check the distribution of Personal Loan takers based on the non-categorical features

In [None]:
cols = data[
    [
        "Age",
        "Experience",
        "Income",
        "CCAvg",
        "Mortgage"
    ]
].columns.tolist()
plt.figure(figsize=(12, 10))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(data = data, x="Personal_Loan", y=variable, showfliers=False)
    plt.tight_layout()
    plt.title(variable)
plt.show()

> **Observations**:
> * For Age, Personal Loan customers have a higher lower bound and a lower upper bound than customers without a Personal Loan.
> * There is not much difference in Experience between customers with Personal Loan 1 and 0.
> * The Income and CCAvg charts clearly show that Personal Loan customers have higher Income and CCAvg in general.
> * Although Personal Loan customers are spread across all Mortgage ranges, customers with a higher mortgage are more likely to be Personal Loan takers in general.

### Distribution of Personal Loan with Categorical features

In [None]:
cols = data[
    [
        "Family",
        "Education",
        "Securities_Account",
        "CD_Account",
        "Online",
        "CreditCard"
    ]
].columns.tolist()
plt.figure(figsize=(12, 10))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.countplot(data = data, x='Personal_Loan', hue=variable)
    plt.tight_layout()
    plt.title(variable)
plt.show()

> **Observations**:
> * Personal Loan customers spread across all family sizes but percentage is higher with higher family sizes.
> * Personal Loan customers spread across all Education qualifications but percentage is higher with higher educated customers.
> * Higher percentage of Personal Loan customers have no Security or CD account in the bank.
> * Only around 50% of Personal Loan customers have credit cards with other banks.
> * Higher percentage of Personal Loan customers use Online or internet banking features of the bank.
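
Count plots show absolute counts, so statements about percentages are easier to verify with a row-normalised cross-tabulation. A minimal sketch on toy data (hypothetical values, not results from this dataset):

```python
import pandas as pd

# Toy data (hypothetical): 10 customers with Education level and loan outcome.
toy = pd.DataFrame(
    {
        "Education": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        "Personal_Loan": [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    }
)

# Row-normalised crosstab: each row sums to 1, so column 1 is the acceptance rate.
rates = pd.crosstab(toy["Education"], toy["Personal_Loan"], normalize="index")
acceptance_rate = rates[1]
print(acceptance_rate)
```

The same call with `data['Family']`, `data['Online']` and so on would give the acceptance rate per category directly.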

In [None]:
# Finally check the Personal Loan customer distribution across counties
sns.histplot(data = data, x='County_MI', hue='Personal_Loan', multiple='stack');

> * Most of the Personal Loan customers are from counties where the median income is between 60K and 70K. But this may be misleading, as most customers are also from these counties.
> * There are no Personal Loan customers from very high income or very low income counties.

## Model Building for Classification

We will now build two types of classification models to address the key questions: first Logistic Regression, followed by Decision Tree. In both cases the model evaluation criterion remains the same. At the end we'll compare the performance of both types of models and make recommendations for the Personal Loan campaign of AllLife Bank's marketing department.

### Model evaluation criterion:

### Model can make wrong predictions as:
1. Predicting that a person took a personal loan when in reality he/she did not (False Positive).
2. Predicting that a person did not take a personal loan when in reality he/she did (False Negative).

### Which case is more important? 

* If we predict that a person took a personal loan but in reality he/she did not, the cost of campaigning to that customer is wasted. Campaign cost per customer is generally not significant.

* If we predict that a person did not take a personal loan but in reality he/she did, it is a loss of opportunity for the company. This is precisely what we want to avoid, since we want to increase the personal loan customer base.


### How to reduce this loss, i.e. how to reduce False Negatives?
*  `recall` should be maximized: the greater the `recall`, the higher the chances of identifying customers who would like to buy a personal loan product.
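
To make the trade-off concrete, recall and precision can be computed directly from confusion-matrix counts. The counts in this sketch are hypothetical and are not results from this notebook.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual loan takers that the model flags: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Share of flagged customers who actually take the loan: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical counts for illustration only (not results from this notebook).
tp, fp, fn = 80, 60, 20

print(f"recall={recall(tp, fn):.2f}, precision={precision(tp, fp):.2f}")
```

Reducing false negatives raises recall; the usual price is more false positives and therefore lower precision, which is acceptable here because campaign cost per customer is low.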

## Model Building with Logistic Regression

In [None]:
# Lets take a backup of the prepared dataset
data_bkp = data.copy()

# Drop the Experience column as it is very strongly correlated with Age
data.drop(['Experience'], axis=1, inplace=True)

In [None]:
# Take a look at first few rows
data.head()

In [None]:
# To build model for prediction
from sklearn.linear_model import LogisticRegression

# Library to split data
from sklearn.model_selection import train_test_split

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    # plot_confusion_matrix was removed in scikit-learn 1.2 and is not used in this section
    precision_recall_curve,
    roc_curve,
)

#### Create Training and Test sets

We'll be using the same training and test sets for both classification algorithms.

In [None]:
X = data.drop(['Personal_Loan'], axis=1)
y = data['Personal_Loan']

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets, with stratify so that data is split in a stratified fashion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)

In [None]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))

To calculate the different metrics and the confusion matrix, and to promote code reuse, we write two utility functions:
* The show_model_perf_with_threshold function will be used to check model performance.
* The show_confusion_matrix_with_threshold function will be used to plot the confusion matrix.

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn

def show_model_perf_with_threshold(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics such as Accuracy, Recall, Precision and F1, based on the threshold specified

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # predicted probability of class 1, converted to a 0/1 label using the threshold
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = (pred_prob > threshold).astype(int)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf


In [None]:
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def show_confusion_matrix_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = (pred_prob > threshold).astype(int)

    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

### Logistic Regression

In [None]:
# There are different solvers available in sklearn's logistic regression;
# here we use the newton-cg solver

# Since the Personal Loan values are unbalanced and only 9.6 percent are 1s,
# we give more weight to class 1, as this is the class we want to identify
lg = LogisticRegression(solver="newton-cg", random_state=1, class_weight={0: 0.1, 1: 0.9})
model = lg.fit(X_train, y_train)

In [None]:
# let us check the coefficients and intercept of the model
coef_df = pd.DataFrame(
    np.append(lg.coef_, lg.intercept_),
    index=X_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df.sort_values(by='Coefficients', ascending=False)

### Coefficient interpretations

The logistic regression model equation is not linear in the probability, so the coefficients do not have a direct linear/proportional effect on it. However, a positive coefficient has some degree of positive influence, and vice versa.

* The coefficients of Age, Income, Family, CCAvg, Education, Mortgage and CD_Account are positive, so an increase in these appears to increase the chances of a person taking a Personal Loan.
* The coefficients of Securities_Account, Online and CreditCard are negative, so an increase in these decreases the chances of a person taking a Personal Loan.
* Based on the coefficient signs, people from counties with a median income of 60K to 70K or 70K to 80K have higher chances of taking a Personal Loan than people from other counties.
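
The non-proportional effect of a coefficient can be seen by pushing log-odds through the logistic function. In this sketch `beta` is a hypothetical coefficient, not one fitted in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5  # hypothetical coefficient, not one fitted in this notebook

# The same one-unit increase (+beta in log-odds) moves the probability by
# different amounts depending on the starting log-odds.
for start in (-4.0, 0.0, 4.0):
    delta = sigmoid(start + beta) - sigmoid(start)
    print(f"log-odds {start:+.1f}: probability changes by {delta:.4f}")
```

The change is largest near a probability of 0.5 and shrinks towards the tails, which is why coefficient signs are interpreted here rather than magnitudes on the probability scale.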

###  Converting coefficients to odds

In [None]:
# converting coefficients to odds
odds = np.exp(lg.coef_[0])

# finding the percentage change
perc_change_odds = (np.exp(lg.coef_[0]) - 1) * 100

# adding the odds to a dataframe
odd_df = pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train.columns)
odd_df.sort_values(by='Change_odd%', ascending=False)

### Odds interpretations

* `CD_Account`: Holding all other features constant, a 1 unit change in CD_Account multiplies the odds of a person taking a Personal Loan by 12.27, i.e. an increase of more than 1000% in the odds.
* `Income`: Holding all other features constant, a 1 unit change in Income multiplies the odds of a person taking a Personal Loan by 1.057, i.e. a 5.67% increase in the odds.
* `Family`: Holding all other features constant, a 1 unit change in Family multiplies the odds of a person taking a Personal Loan by 1.73, i.e. a 73.40% increase in the odds.
* `CreditCard`: Holding all other features constant, a 1 unit change in CreditCard multiplies the odds of a person taking a Personal Loan by 0.44, i.e. a 55.55% decrease in the odds.

`Interpretation for other attributes can be done similarly.`
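
The conversion from a coefficient to an odds ratio and a percentage change can be sketched with plain Python. The coefficients used here are hypothetical, chosen so that the resulting odds ratios are exactly 2.0 and 0.5.

```python
import math

def odds_ratio(coef: float) -> float:
    """exp(coef): factor by which the odds are multiplied per one-unit increase."""
    return math.exp(coef)

def pct_change_in_odds(coef: float) -> float:
    """Percentage change in the odds per one-unit increase."""
    return (math.exp(coef) - 1.0) * 100.0

# Hypothetical coefficients chosen so the odds ratios are 2.0 and 0.5.
for name, coef in [("doubling", math.log(2.0)), ("halving", math.log(0.5))]:
    print(f"{name}: odds x{odds_ratio(coef):.2f}, change {pct_change_in_odds(coef):+.1f}%")
```

An odds ratio below 1 corresponds to a decrease: a ratio of 0.5 halves the odds, i.e. a 50% decrease.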

#### Checking model performance on Training set

In [None]:
# creating confusion matrix
show_confusion_matrix_with_threshold(lg, X_train, y_train)

In [None]:
log_reg_model_train_perf = show_model_perf_with_threshold(lg, X_train, y_train)
print("Training performance:")
log_reg_model_train_perf

#### Checking model performance on Test set

In [None]:
# creating confusion matrix
show_confusion_matrix_with_threshold(lg, X_test, y_test)

In [None]:
log_reg_model_test_perf = show_model_perf_with_threshold(lg, X_test, y_test)
print("Test set performance:")
log_reg_model_test_perf

> **Observations**:
* Both Training and Test sets have very high accuracy, which is expected as the target variable is very unbalanced.
* Training set **Recall** is 0.9 and test set recall is 0.85. This suggests the model is not overfitted, but there may be room for improvement.
* Precision is about 0.5 for both sets. This should be fine as we are looking for a higher recall score.
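
To see why accuracy alone says little on unbalanced data, consider the baseline of always predicting the majority class. The sketch uses the 9.6% acceptance rate observed earlier; `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of always predicting the more frequent class."""
    return max(positive_rate, 1.0 - positive_rate)

# 9.6% of customers accepted the loan (from the Personal_Loan count plot).
baseline = majority_baseline_accuracy(0.096)
print(f"baseline accuracy = {baseline:.3f}, baseline recall = 0.0")
```

A model that never predicts a loan taker already reaches about 90% accuracy while finding none of the customers of interest, so recall is the more informative score here.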

Let's compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

#### ROC-AUC on Training set

In [None]:
## ROC-AUC on the Training Set
logit_roc_auc_train = roc_auc_score(y_train, lg.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

#### ROC-AUC on Test set

In [None]:
# ROC-AUC on the test set
logit_roc_auc_test = roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

### Model Performance Improvement

Let's see if the recall score can be improved further by changing the model threshold using the AUC-ROC curve.

In [None]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
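
The cut-off rule used above, maximizing `tpr - fpr`, is commonly known as Youden's J statistic. The sketch below applies the same rule in plain Python to toy data, so the mechanics are visible without scikit-learn; the labels, scores and the helper name `youden_threshold` are assumptions made for illustration.

```python
def youden_threshold(y_true, y_score):
    """Return the threshold maximising TPR - FPR (Youden's J) and the J value."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best_threshold, best_j = None, float("-inf")
    # Every distinct score is a candidate cut-off; predict 1 when score >= cut-off.
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy labels and scores, made up for illustration.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.45, 0.50, 0.60, 0.70, 0.90]
threshold, j_stat = youden_threshold(y_true, y_score)
print(threshold, round(j_stat, 3))
```

Note that this rule weighs true positives and false positives equally, so it does not necessarily favor Recall.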

### Optimal threshold using AUC-ROC curve

#### Model Performance on Training set with Optimal Threshold

In [None]:
# creating confusion matrix
show_confusion_matrix_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)

In [None]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = show_model_perf_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc

#### Model Performance on Test set with Optimal Threshold

In [None]:
# Test set
# creating confusion matrix
show_confusion_matrix_with_threshold(lg, X_test, y_test, threshold=optimal_threshold_auc_roc)

In [None]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = show_model_perf_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc

> **Observations**:
* Since the optimal threshold (~0.51) is close to the default threshold of 0.5, the scores did not change much.
* Training set **Recall** is 0.9 and test set recall is around 0.85, so there is not much improvement.
* Precision and F1 scores are also similar to those with the default threshold.

### Let's use Precision-Recall curve and see if we can find a better threshold

In [None]:
y_scores = lg.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

> At a threshold of around 0.81 precision and recall are equal, but stepping back to a value of around 0.40 gives a higher recall with an acceptable precision.
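
Reading a threshold off the plot can also be done programmatically. The sketch below, in plain Python on toy data, picks the highest threshold that still reaches a required recall; the helper names, the toy scores and the `min_recall` value are assumptions for illustration, not values from this notebook.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
    # With no positive predictions, precision is taken as 1.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def highest_threshold_with_recall(y_true, y_score, min_recall):
    """Largest candidate threshold whose recall is at least min_recall."""
    for threshold in sorted(set(y_score), reverse=True):
        _, recall = precision_recall_at(y_true, y_score, threshold)
        if recall >= min_recall:
            return threshold
    return None


# Toy labels and scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
y_score = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
chosen = highest_threshold_with_recall(y_true, y_score, min_recall=0.75)
print(chosen, precision_recall_at(y_true, y_score, chosen))
```

Because recall can only fall as the threshold rises, the first threshold (scanning from the top) that meets the recall target is also the one with the fewest extra false positives.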

In [None]:
# setting the threshold
optimal_threshold_curve = 0.40

#### Model Performance on Training set

In [None]:
# Training
# creating confusion matrix
show_confusion_matrix_with_threshold(lg, X_train, y_train, threshold=optimal_threshold_curve)

In [None]:
log_reg_model_train_perf_threshold_curve = show_model_perf_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve

#### Model Performance on Test set

In [None]:
# creating confusion matrix
show_confusion_matrix_with_threshold(lg, X_test, y_test, threshold=optimal_threshold_curve)

In [None]:
log_reg_model_test_perf_threshold_curve = show_model_perf_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_curve
)
print("Test set performance:")
log_reg_model_test_perf_threshold_curve

> **Observations**:
* With a threshold of 0.40 we get a slightly higher Recall, but we have to sacrifice some Precision.
* Training set **Recall** is 0.92 and test set recall is around 0.875, a small improvement.
* Precision and F1 scores have gone down with this new threshold.

### Model Performance Summary

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.51 Threshold",
    "Logistic Regression-0.40 Threshold"
]
print("Training performance comparison:")
models_train_comp_df.T.sort_values(by='Recall', ascending=False)

In [None]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T
        ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.51 Threshold",
    "Logistic Regression-0.40 Threshold"
]
print("Test set performance comparison:")
models_test_comp_df.T.sort_values(by='Recall', ascending=False)

### Logistic Regression with Sequential Feature Selector

- Reduces dimensionality.
- Discards deceptive features (Deceptive features appear to aid learning on the training set, but impair generalization).
- Speeds training/testing.

Let's use SFS to see which features are most important and whether we can get a better recall score.
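
The idea behind forward selection can be sketched in a few lines of plain Python: start with no features and repeatedly add the one that improves the score the most. This is only a conceptual sketch, not mlxtend's implementation; `forward_select`, `toy_score` and the per-feature gains are hypothetical stand-ins for a cross-validated recall score.

```python
def forward_select(features, score_fn, k):
    """Greedy forward selection: add the feature that most improves score_fn."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-feature gains standing in for a cross-validated recall score.
toy_gain = {"Income": 0.40, "Family": 0.20, "Education": 0.15, "Age": 0.01}


def toy_score(subset):
    """Toy scoring function: the sum of the gains of the chosen features."""
    return sum(toy_gain[f] for f in subset)


print(forward_select(toy_gain, toy_score, k=3))
```

In the real selector the score of a subset comes from fitting the model with cross-validation, so each added feature costs one model fit per remaining candidate.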

In [None]:
# Sequential feature selector is provided by the mlxtend library
# Uncomment the next line to install it if needed
# !pip install mlxtend

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# to plot the performance with addition of each feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

In [None]:
# from sklearn.linear_model import LogisticRegression

# Fit the model on train
model = LogisticRegression(solver="newton-cg", n_jobs=-1, random_state=1, max_iter=100)

In [None]:
# we will first build the model with all variables
sfs = SFS(
    model,
    k_features=15,  # same as the number of columns, X_train.shape[1]
    forward=True,
    floating=False,
    scoring="recall",
    verbose=2,
    cv=3,
    n_jobs=-1,
)

sfs = sfs.fit(X_train, y_train)

In [None]:
fig1 = plot_sfs(sfs.get_metric_dict(), kind="std_dev", figsize=(12, 5))
#plt.ylim([0.8, 1])
plt.title("Sequential Forward Selection (w. StdDev)")
plt.xticks(rotation=90)
plt.show()

* We can see that performance increases until the 10th feature and then stays constant.
* So we'll use only 10 features to build our model, but the choice of k_features depends on the business context and use case of the model.

In [None]:
sfs1 = SFS(
    model,
    k_features=10,
    forward=True,
    floating=False,
    scoring="recall",
    verbose=2,
    cv=3,
    n_jobs=-1,
)

sfs1 = sfs1.fit(X_train, y_train)

fig1 = plot_sfs(sfs1.get_metric_dict(), kind="std_dev", figsize=(10, 5))

#plt.ylim([0.8, 1])
plt.title("Sequential Forward Selection (w. StdDev)")
plt.grid()
plt.show()

**Finding which features are important**

In [None]:
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)

**Let's look at the best 10 variables**

In [None]:
X_train.columns[feat_cols]

In [None]:
X_train_final = X_train[X_train.columns[feat_cols]]

# Creating new x_test with the same variables that we selected for x_train
X_test_final = X_test[X_train_final.columns]

In [None]:
lg_sfs = LogisticRegression(solver="newton-cg", random_state=1, class_weight={0: 0.1, 1: 0.9})
lg_sfs.fit(X_train_final, y_train)

### Let's Look at model performance with SFS

#### Training set performance

In [None]:
show_confusion_matrix_with_threshold(lg_sfs, X_train_final, y_train)

In [None]:
log_reg_model_train_perf_SFS = show_model_perf_with_threshold(lg_sfs, X_train_final, y_train)
print("Training performance:")
log_reg_model_train_perf_SFS

#### Test set performance

In [None]:
show_confusion_matrix_with_threshold(lg_sfs, X_test_final, y_test)

In [None]:
log_reg_model_test_perf_SFS = show_model_perf_with_threshold(lg_sfs, X_test_final, y_test)
print("Test set performance:")
log_reg_model_test_perf_SFS

> **Observations:**
* The model gives comparable performance on the training and test sets, i.e. it generalizes well.
* With fewer features, the model performance is comparable to the initial logistic regression model.

### Model Performance Summary with Logistic Regression

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
        log_reg_model_train_perf_SFS.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.51 Threshold",
    "Logistic Regression-0.40 Threshold",
    "Logistic Regression - SFS",
]
print("Training performance comparison:")
models_train_comp_df.T.sort_values(by='Recall', ascending=False)

In [None]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
        log_reg_model_test_perf_SFS.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.51 Threshold",
    "Logistic Regression-0.40 Threshold",
    "Logistic Regression - SFS",
]
print("Test set performance comparison:")
models_test_comp_df.T.sort_values(by='Recall', ascending=False)

## Build Decision Tree Model

* We will build a model using the DecisionTreeClassifier class, with the default 'gini' criterion for splitting.
* If the frequency of class A (Personal Loan: 1) is 9% and the frequency of class B (Personal Loan: 0) is 91%, then class B becomes the dominant class and the decision tree becomes biased toward the dominant class.

* In this case, we can pass a dictionary {0:0.1,1:0.9} to the model to specify the weight of each class, so that the decision tree gives more weight to class 1.

* class_weight is a hyperparameter of the decision tree classifier.
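
The effect of these weights can be checked with simple arithmetic. The sketch below assumes that class weights act as per-sample weights in the impurity calculation and shows how the 9% / 91% class mix looks to the tree before and after weighting; the helper names are hypothetical.

```python
def weighted_class_shares(class_counts, class_weight):
    """Share of the total sample weight held by each class after weighting."""
    weighted = {c: n * class_weight[c] for c, n in class_counts.items()}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}


def gini(shares):
    """Gini impurity of a node given its class shares."""
    return 1.0 - sum(p * p for p in shares)


# Class mix from the text: 9% loan takers (1) and 91% non-takers (0), per 100 rows.
counts = {0: 91, 1: 9}
unweighted = weighted_class_shares(counts, {0: 1.0, 1: 1.0})
weighted = weighted_class_shares(counts, {0: 0.1, 1: 0.9})

print(unweighted, gini(unweighted.values()))
print(weighted, gini(weighted.values()))
```

With the weights applied, the two classes carry roughly equal total weight at the root, so splits that isolate the minority class reduce impurity much more than they would without weighting.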

In [None]:
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

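# To get different metric scores
# (recall_score and confusion_matrix are used by the helper functions below)
from sklearn.metrics import confusion_matrix, make_scorer, recall_score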

In [None]:
model = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.10, 1: 0.90}, random_state=1
)

In [None]:
model.fit(X_train, y_train)

#### Create functions to calculate different metrics and the confusion matrix so that we don't have to repeat the same code for each model.
* The get_recall_score function will be used to check the model performance of decision tree models.
* The confusion_matrix_sklearn function will be used to plot the confusion matrix.

In [None]:
##  Function to calculate recall score
def get_recall_score(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    
    """
    prediction = model.predict(predictors)
    return recall_score(target, prediction)

In [None]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

#### Checking model performance on Training set

In [None]:
confusion_matrix_sklearn(model, X_train, y_train)

In [None]:
decision_tree_perf_train = get_recall_score(model, X_train, y_train)
print("Recall Score:", decision_tree_perf_train)

> **Observations**:
* As expected with default parameters, the model perfectly classifies all the data points in the training set.
* There are 0 errors on the training set; each sample has been classified correctly.
* A decision tree will continue to grow and classify each data point correctly if no restrictions are applied, as the tree learns all the patterns in the training set.
* This generally leads to overfitting: the Decision Tree performs well on the training set but fails to replicate that performance on the test set.

#### Checking model performance on Test set

In [None]:
confusion_matrix_sklearn(model, X_test, y_test)

In [None]:
decision_tree_perf_test = get_recall_score(model, X_test, y_test)
print("Recall Score:", decision_tree_perf_test)

> **Observations**:
* The recall score for the test set is not bad, but there may be opportunities for improvement.

## Visualizing the Decision Tree

In [None]:
## creating a list of column names
feature_names = X_train.columns.to_list()

In [None]:
# Write a function to plot tree with some custom arguments so that this code can be reused later
def plot_tree(model, feature_names, figsize=(15,10), fontsize=12):
    """
    Plots a tree from a Decision Tree model and features. 
    """
    plt.figure(figsize=figsize)
    out = tree.plot_tree(
        model,
        feature_names=feature_names,
        filled=True,
        fontsize=fontsize,
        node_ids=False,
        rounded=True,
        class_names=None,
    )
    # below code will add arrows to the decision tree split if they are missing
    for o in out:
        arrow = o.arrow_patch
        if arrow is not None:
            arrow.set_edgecolor("black")
            arrow.set_linewidth(1)
    plt.show()    

In [None]:
plot_tree(model, feature_names, figsize=(20,30), fontsize=9)

In [None]:
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))

In [None]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

imp_df1 = pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
imp_df1

In [None]:
# Let's see a plot with relative importance where importance is > 0
plt.figure(figsize=(12, 10))
data = imp_df1[imp_df1.Imp > 0]
sns.barplot(x=data.Imp, y=data.index, label="Relative Importance", palette="magma");

> **Observations**:
* The Decision Tree model gives very high importance to the Income feature.
* Family size, Education and CCAvg also carry high importance.
* Age, Mortgage, CD_Account and Counties with median Income between 60K and 70K have some importance.
* Other features have very low importance.
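
For reference, the importance values above are the normalized total impurity reduction contributed by each feature. The sketch below computes that quantity for a hypothetical two-split tree; the node sizes and impurities are made up for illustration, and `gini_importance` is a hypothetical helper rather than a scikit-learn function.

```python
def gini_importance(splits, n_total):
    """Normalised total impurity decrease per feature.

    Each split is (feature, n_node, impurity, n_left, imp_left, n_right, imp_right).
    """
    raw = {}
    for feature, n, imp, n_l, imp_l, n_r, imp_r in splits:
        # Impurity decrease of this split, weighted by the share of samples involved.
        decrease = (n * imp - n_l * imp_l - n_r * imp_r) / n_total
        raw[feature] = raw.get(feature, 0.0) + decrease
    total = sum(raw.values())
    return {f: v / total for f, v in raw.items()}


# Hypothetical two-split tree on 100 samples, for illustration only.
toy_splits = [
    ("Income", 100, 0.50, 60, 0.20, 40, 0.45),
    ("Family", 40, 0.45, 25, 0.10, 15, 0.30),
]
importances = gini_importance(toy_splits, n_total=100)
print(importances)
```

A feature that is never used for a split receives no entry here, which corresponds to an importance of 0 in `feature_importances_`.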

### Finding the best possible model with Pre Pruning

Although the Decision Tree with default parameters gave a better recall score on the test set than Logistic Regression, we would like to see if the model can be generalized further to get a better recall score on the test dataset.

#### Using GridSearch for Hyperparameter tuning of our tree model

* Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a
  hyperparameter value will reduce the loss of the model, so we usually resort to experimentation, i.e. we'll use grid search.
* Grid search is a tuning technique that attempts to find the optimum values of hyperparameters.
* It is an exhaustive search performed over the specified parameter values of a model.
* The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.
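
Conceptually, an exhaustive grid search just scores every combination of parameter values and keeps the best one. The plain-Python sketch below shows that loop; `grid_search` and `toy_score` are hypothetical, and the toy scoring function stands in for the cross-validated recall that a real search would compute.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score standing in for a cross-validated recall."""
    depth_penalty = abs(params["max_depth"] - 5) * 0.01
    return 0.95 - depth_penalty - params["min_impurity_decrease"]


toy_grid = {"max_depth": [3, 5, 10], "min_impurity_decrease": [0.0001, 0.01]}
best_params, best_score = grid_search(toy_grid, toy_score)
print(best_params, round(best_score, 4))
```

The number of combinations is the product of the list lengths in the grid, and each combination is fitted once per cross-validation fold, so the cost grows quickly as parameters are added.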

In [None]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.10, 1: 0.90})

# Grid of parameters to choose from
parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}

# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

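# Set the estimator to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best estimator on the training data
# (GridSearchCV already refits it by default, so this step is redundant but harmless)
estimator.fit(X_train, y_train)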

#### Checking performance on Training set

In [None]:
confusion_matrix_sklearn(estimator, X_train, y_train)

In [None]:
decision_tree_tune_perf_train = get_recall_score(estimator, X_train, y_train)
print("Recall Score:", decision_tree_tune_perf_train)

#### Checking performance on Test set

In [None]:
confusion_matrix_sklearn(estimator, X_test, y_test)

In [None]:
decision_tree_tune_perf_test = get_recall_score(estimator, X_test, y_test)
print("Recall Score:", decision_tree_tune_perf_test)

> **Observations**:
* The model is giving a generalized result now.
* Test recall score is around 0.986

## Visualizing the GridSearchCV Optimized Decision Tree

In [None]:
plot_tree(estimator, feature_names)

In [None]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))

In [None]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

imp_df2 =  pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)

imp_df2

In [None]:
# Let's see a plot with relative importance where importance is > 0
plt.figure(figsize=(12, 4))
data = imp_df2[imp_df2.Imp > 0]
sns.barplot(x=data.Imp, y=data.index, label="Relative Importance", palette="magma");

> **Observations:**

* The pre-pruned Decision Tree has only five levels, including the root.
* It considers only four features: Income, Family, Education and CCAvg, in order of importance.
* Other features do not impact the decision making of the pruned tree.

`So it seems customers with higher income, bigger family size, higher education and higher monthly credit card usage are more likely to buy a Personal Loan product.`

## Cost Complexity Pruning - Post Pruning

Although we found a very good recall score with the earlier pre-pruned tree, we need to see how the recall score changes with post-pruning techniques.


The `DecisionTreeClassifier` provides parameters such as
``min_samples_leaf`` and ``max_depth`` to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
`DecisionTreeClassifier`, this pruning technique is parameterized by the
cost complexity parameter, ``ccp_alpha``. Greater values of ``ccp_alpha``
increase the number of nodes pruned. Here we only show the effect of
``ccp_alpha`` on regularizing the trees and how to choose a ``ccp_alpha``
based on validation scores.

### Total impurity of leaves vs effective alphas of pruned tree
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ``ccp_alpha`` could be appropriate, scikit-learn provides
`DecisionTreeClassifier.cost_complexity_pruning_path` that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
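
To make the "weakest link" idea concrete, the effective alpha of an internal node t can be written as (R(t) - R(T_t)) / (|T_t| - 1), where R(t) is the node's impurity if it were collapsed into a leaf, R(T_t) is the total impurity of the leaves below it, and |T_t| is the number of those leaves. The sketch below evaluates this for two hypothetical nodes; the impurity values are made up, and `effective_alpha` is an illustrative helper, not a scikit-learn function.

```python
def effective_alpha(node_impurity, subtree_leaf_impurities):
    """Effective alpha of an internal node: (R(t) - R(T_t)) / (|T_t| - 1).

    node_impurity: sample-weighted impurity R(t) if the node were a leaf.
    subtree_leaf_impurities: sample-weighted impurities of the leaves below it.
    """
    n_leaves = len(subtree_leaf_impurities)
    if n_leaves < 2:
        raise ValueError("an internal node has at least two leaves below it")
    return (node_impurity - sum(subtree_leaf_impurities)) / (n_leaves - 1)


# Hypothetical internal nodes of a small tree, for illustration only.
toy_nodes = {
    "node_a": (0.20, [0.05, 0.05, 0.04]),
    "node_b": (0.08, [0.03, 0.04]),
}
alphas = {name: effective_alpha(r, leaves) for name, (r, leaves) in toy_nodes.items()}
weakest = min(alphas, key=alphas.get)
print(alphas, "-> prune first:", weakest)
```

The node with the smallest effective alpha buys the least impurity reduction per extra leaf, so it is the first to be collapsed as `ccp_alpha` grows.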

In [None]:
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.10, 1: 0.90})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [None]:
pd.DataFrame(path)

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value
in ``ccp_alphas`` is the alpha value that prunes the whole tree,
leaving the tree, ``clfs[-1]``, with one node.

In [None]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.10, 1: 0.90}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)


For the remainder, we remove the last element in
``clfs`` and ``ccp_alphas``, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decrease as alpha
increases.

In [None]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

In [None]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

In [None]:
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

In [None]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

In [None]:
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)

In [None]:
best_model.fit(X_train, y_train)

#### Check performance on Training set

In [None]:
confusion_matrix_sklearn(best_model, X_train, y_train)

In [None]:
print("Recall Score:", get_recall_score(best_model, X_train, y_train))

#### Check performance on Test set

In [None]:
confusion_matrix_sklearn(best_model, X_test, y_test)

In [None]:
print("Recall Score:", get_recall_score(best_model, X_test, y_test))

In [None]:
plot_tree(best_model, feature_names)

In [None]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

imp_df3 = pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
imp_df3

In [None]:
plt.figure(figsize=(12, 4))
data = imp_df3[imp_df3.Imp > 0]
sns.barplot(x=data.Imp, y=data.index, label="Relative Importance", palette="magma");

> **Observations**:

* The post-pruned tree with optimized CCP alpha gives a recall score of 0.986.
* It also gives very high importance to Income, followed by Family size, higher Education and higher CCAvg.
* The Recall score is also similar to that of the pre-pruned tree with 5 levels.

Although we have achieved a very good recall score that matches the pre-pruned tree, the training recall score is 1.0, which is a sign of slight overfitting. For completeness, we will check a higher alpha to generalize the tree a bit more while making sure the Recall score does not degrade.

From the **Recall vs alpha for training and testing sets** chart we see that the recall score does not drop until about alpha = 0.025, so we can choose a more generous alpha value, which will generalize the tree even more.
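
The same choice can be made programmatically instead of by reading the chart: take the largest alpha whose recall is still within a tolerance of the best recall. The sketch below uses hypothetical alpha and recall values for illustration; in the notebook the inputs would be `ccp_alphas` and the recall list computed earlier.

```python
def largest_alpha_within(alphas, recalls, tolerance=0.0):
    """Largest alpha whose recall is within `tolerance` of the best recall."""
    best = max(recalls)
    candidates = [a for a, r in zip(alphas, recalls) if r >= best - tolerance]
    return max(candidates)


# Hypothetical alphas and recalls, for illustration only.
toy_alphas = [0.000, 0.004, 0.010, 0.025, 0.050]
toy_recalls = [0.90, 0.98, 0.98, 0.98, 0.85]
print(largest_alpha_within(toy_alphas, toy_recalls))
```

A larger alpha yields a smaller tree, so among equally good alphas the largest one gives the simplest model.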

**Creating model with 0.025 ccp_alpha**

In [None]:
best_model2 = DecisionTreeClassifier(
    ccp_alpha=0.025, class_weight={0: 0.10, 1: 0.90}, random_state=1
)
best_model2.fit(X_train, y_train)

#### Check performance on the Training set

In [None]:
confusion_matrix_sklearn(best_model2, X_train, y_train)

In [None]:
decision_tree_postpruned_perf_train = get_recall_score(best_model2, X_train, y_train)
print("Recall Score:", decision_tree_postpruned_perf_train)

#### Check performance on the Test set

In [None]:
confusion_matrix_sklearn(best_model2, X_test, y_test)

In [None]:
decision_tree_postpruned_perf_test = get_recall_score(best_model2, X_test, y_test)
print("Recall Score:", decision_tree_postpruned_perf_test)

In [None]:
plot_tree(best_model2, feature_names)

In [None]:
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2, feature_names=feature_names, show_weights=True))

In [None]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
imp_df4 = pd.DataFrame(
        best_model2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)

imp_df4 

In [None]:
plt.figure(figsize=(12, 4))
data = imp_df4[imp_df4.Imp > 0]
sns.barplot(x=data.Imp, y=data.index, label="Relative Importance", palette="magma");

> **Observations**:

* With ccp alpha 0.025 we get the same recall score as with ccp alpha 0.004, but the tree depth is smaller with alpha 0.025.
* Other observations related to feature importance remain the same.
* With ccp alpha 0.025, the Decision Tree is the same as the earlier pre-pruned tree with max depth = 5.

### Comparing all the Decision Tree models

In [None]:
# training performance comparison

models_train_comp_df = pd.DataFrame(
    [
        decision_tree_perf_train,
        decision_tree_tune_perf_train,
        decision_tree_postpruned_perf_train,
    ],
    index=['decision_tree_perf_train', 'decision_tree_prepruned_perf_train', 'decision_tree_postpruned_perf_train'],
    columns=["Recall on training set"]
)

print("Training performance comparison:")
models_train_comp_df

In [None]:
# testing performance comparison

models_test_comp_df = pd.DataFrame(
    [
        decision_tree_perf_test,
        decision_tree_tune_perf_test,
        decision_tree_postpruned_perf_test,
    ],
    index=['decision_tree_perf_test', 'decision_tree_prepruned_perf_test', 'decision_tree_postpruned_perf_test'],
    columns=["Recall on test set"]
)

print("Test performance comparison:")
models_test_comp_df

> **Observations**:
* The pre-pruned Decision Tree has comparable train and test performance.
* The post-pruned Decision Tree also has comparable train and test performance.
* The pre-pruned and post-pruned Decision Trees have the same Recall score, tree depth and number of nodes.

### Conclusions

- We analyzed the data of AllLife Bank's Personal Loan campaign using both Logistic Regression models and Decision Tree classifiers, building various predictive models to better understand the features that may have influenced the campaign result.

- Based on the business case, we also decided that Recall should be the performance criterion for our analysis, since we want to reduce False Negatives, whose opportunity cost is higher. We are not very worried about Precision, since the cost of a campaign such as direct mail marketing or email is not significant. Based on this analysis, the following conclusions can be drawn:

> **Logistic Regression**

- Logistic Regression does not completely ignore features. Almost all features influence the outcome, but the degree of influence varies greatly.
- Of all the features, a customer with a CD_Account, higher Education, bigger Family, higher monthly CCAvg, higher Income, Age or Mortgage is more likely to opt for a Personal Loan.
- On the other hand, if a customer has credit cards with other banks, a Securities account or uses Internet banking facilities, the probability of taking a Personal Loan decreases.
- Based on customer location, we can conclude that customers from counties with a median income between 60K and 80K have a higher probability of buying a personal loan than customers from other counties.

> **Decision Tree**

- Although there is some overlap in the conclusions, the Decision Tree gives a different perspective and ignores many features while predicting a Personal Loan customer.
- It gives the highest importance to Income. It clearly concludes that if a customer has a higher income, he/she is more likely to go for a Personal Loan.
- It also gives weight to bigger Family size, higher Education and higher monthly average credit card usage. Customers matching these criteria also have a higher probability of buying personal loan products.
- Interestingly, the Decision Tree does not give any importance to features other than Income, Family, Education and CCAvg.

### Recommendations

Based on the combined prediction results and conclusions of both the Logistic Regression and Decision Tree classification models, we offer the following recommendations for the AllLife Bank marketing department's Personal Loan campaign.

- Internet research shows people take personal loans to cover the cost of emergency expenses, weddings, vacations, appliance purchases, home remodelling etc. Considering the repayment capability of a customer, we should target high-income customers first, as this feature has come up in both types of models and, from the bank's perspective, these will be safer investments.

- Customers having CD accounts with the bank could also be likely candidates for Personal Loans, as they would also have some surety for paying back.

- Customers with a bigger family size, such as 3 or 4 members, higher Education, such as graduate or advanced degrees, and higher monthly credit card usage can be targeted next.

- Customers with higher mortgages, age and professional experience can also be targeted for personal loans, but the conversion rate may be low here.

- To increase the conversion rate further, a personal loan social media campaign can be targeted at ZIP codes in counties where the median income is between 60K and 80K per year, as the analysis has shown that people from these counties have a higher probability of buying personal loans.

- If customers have credit cards from other banks and lower monthly average credit card usage, they have a lower probability of buying personal loans, as they already enjoy high borrowing power. Such customers should be targeted less unless other features recommend otherwise.

With the above recommendations, derived from both the Logistic Regression and Decision Tree models, we believe the conversion rate of the Personal Loan campaign will increase from the current rate of 9%, and AllLife Bank will be better equipped to convert its liability customers to personal loan customers (while retaining them as depositors).

