# Title - Credit Risk Classification: Analyzing and Predicting Customer Creditworthiness

## Introduction

I have chosen [this Kaggle data set](https://www.kaggle.com/datasets/ppb00x/credit-risk-customers) for my analysis.

### Brief overview of the data

The data set consists of 21 attributes for 1000 anonymized bank customers; monetary amounts are given in Deutsche Mark.

#### Attribute description
1. Existing checking account balance/status, in Deutsche Mark.
2. Duration in months
3. Credit history (credits taken, paid back duly, delays, critical accounts)
4. Purpose of the credit (car, television,...)
5. Credit amount
6. Status of savings account/bonds, in Deutsche Mark.
7. Present employment, in number of years.
8. Installment rate in percentage of disposable income
9. Personal status (married, single,...) and sex
10. Other debtors / guarantors
11. Present residence since X years
12. Property (e.g. real estate)
13. Age in years
14. Other installment plans (banks, stores)
15. Housing (rent, own,...)
16. Number of existing credits at this bank
17. Job
18. Number of people being liable to provide maintenance for
19. Telephone (yes,no)
20. Foreign worker (yes,no)
21. Class (good, bad)


## Goal 

The goal of this analysis is to predict the creditworthiness of customers using the machine learning techniques discussed so far in class.

* The attribute "class" from the list of 21 attributes is used as the target variable to classify customers by their creditworthiness.
* The remaining 20 attributes are used as predictors. 

# Section 1: 

**Load and explore the data. Identify any necessary data transformation and cleaning.**

In [1]:
# importing all the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

# set random seed to ensure that results are repeatable
np.random.seed(86089106)

In [2]:
df = pd.read_csv("credit_customers.csv")
df.head(5)

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,<0,6.0,critical/other existing credit,radio/tv,1169.0,no known savings,>=7,4.0,male single,none,...,real estate,67.0,none,own,2.0,skilled,1.0,yes,yes,good
1,0<=X<200,48.0,existing paid,radio/tv,5951.0,<100,1<=X<4,2.0,female div/dep/mar,none,...,real estate,22.0,none,own,1.0,skilled,1.0,none,yes,bad
2,no checking,12.0,critical/other existing credit,education,2096.0,<100,4<=X<7,2.0,male single,none,...,real estate,49.0,none,own,1.0,unskilled resident,2.0,none,yes,good
3,<0,42.0,existing paid,furniture/equipment,7882.0,<100,4<=X<7,2.0,male single,guarantor,...,life insurance,45.0,none,for free,1.0,skilled,2.0,none,yes,good
4,<0,24.0,delayed previously,new car,4870.0,<100,1<=X<4,3.0,male single,none,...,no known property,53.0,none,for free,2.0,skilled,2.0,none,yes,bad


In [3]:
# conducting an initial exploration of the data
df.describe()

Unnamed: 0,duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.903,3271.258,2.973,2.845,35.546,1.407,1.155
std,12.058814,2822.736876,1.118715,1.103718,11.375469,0.577654,0.362086
min,4.0,250.0,1.0,1.0,19.0,1.0,1.0
25%,12.0,1365.5,2.0,2.0,27.0,1.0,1.0
50%,18.0,2319.5,3.0,3.0,33.0,1.0,1.0
75%,24.0,3972.25,4.0,4.0,42.0,2.0,1.0
max,72.0,18424.0,4.0,4.0,75.0,4.0,2.0


In [4]:
# Check the missing values by summing the total na's for each variable
df.isna().sum()

checking_status           0
duration                  0
credit_history            0
purpose                   0
credit_amount             0
savings_status            0
employment                0
installment_commitment    0
personal_status           0
other_parties             0
residence_since           0
property_magnitude        0
age                       0
other_payment_plans       0
housing                   0
existing_credits          0
job                       0
num_dependents            0
own_telephone             0
foreign_worker            0
class                     0
dtype: int64

We can see there are no missing values in the data.

In [5]:
# Strip any leading/trailing whitespace from the column names before analyzing
df.columns = [s.strip() for s in df.columns]
df.columns

Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       'foreign_worker', 'class'],
      dtype='object')

In [6]:
# Finding the type of data in columns
df.dtypes

checking_status            object
duration                  float64
credit_history             object
purpose                    object
credit_amount             float64
savings_status             object
employment                 object
installment_commitment    float64
personal_status            object
other_parties              object
residence_since           float64
property_magnitude         object
age                       float64
other_payment_plans        object
housing                    object
existing_credits          float64
job                        object
num_dependents            float64
own_telephone              object
foreign_worker             object
class                      object
dtype: object

We can see there are 14 categorical (`object`) columns, including the target `class`; the remaining 7 columns are numerical (`float64`).
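The same split into categorical and numerical columns can also be obtained programmatically. The sketch below uses a small toy frame with hypothetical values (not the credit data itself) and assumes that every non-numeric column is categorical, which matches the dtypes listed above.

```python
import pandas as pd

# Toy frame standing in for the credit data (hypothetical values)
toy = pd.DataFrame({
    "duration": [6.0, 48.0, 12.0],
    "housing": ["own", "own", "for free"],
    "class": ["good", "bad", "good"],
})

# Treat every non-numeric column as categorical, the rest as numerical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```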

In [7]:
# Print the unique values of every categorical column to prepare for encoding
for col in df.select_dtypes(exclude='number').columns:
    print(df[col].unique())

['<0' '0<=X<200' 'no checking' '>=200']
['critical/other existing credit' 'existing paid' 'delayed previously'
 'no credits/all paid' 'all paid']
['radio/tv' 'education' 'furniture/equipment' 'new car' 'used car'
 'business' 'domestic appliance' 'repairs' 'other' 'retraining']
['no known savings' '<100' '500<=X<1000' '>=1000' '100<=X<500']
['>=7' '1<=X<4' '4<=X<7' 'unemployed' '<1']
['male single' 'female div/dep/mar' 'male div/sep' 'male mar/wid']
['none' 'guarantor' 'co applicant']
['real estate' 'life insurance' 'no known property' 'car']
['none' 'bank' 'stores']
['own' 'for free' 'rent']
['skilled' 'unskilled resident' 'high qualif/self emp/mgmt'
 'unemp/unskilled non res']
['yes' 'none']
['yes' 'no']
['good' 'bad']


# Section 2:

**Apply any necessary data transformations, cleaning, and transformation (such as standardization).** 

NOTE: You must conduct a data split in this section, as some transformations/cleaning require pre and post-split processing.

#### Using one-hot encoding to replace categorical variables

In [8]:

# The categorical predictors already contain string labels (see the unique values printed above),
# so no integer-to-label mapping is needed before one-hot encoding.

# Encode the target variable: good -> 1, bad -> 0
df['class'] = df['class'].map({'good': 1, 'bad': 0})

In [9]:
# One-hot encode each categorical predictor and append the dummy columns to df
categorical_cols = ['checking_status', 'credit_history', 'purpose', 'savings_status',
                    'employment', 'personal_status', 'other_parties', 'property_magnitude',
                    'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker']

for col in categorical_cols:
    # The column name is used as the prefix, e.g. housing_own, housing_rent
    dummies = pd.get_dummies(df[col], prefix=col, drop_first=False)
    df = df.join(dummies)
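To make the effect of the encoding step concrete, here is a minimal sketch on a toy `housing` column with hypothetical rows. It only illustrates what `pd.get_dummies` produces for one column; the column name and values are borrowed from the data above.

```python
import pandas as pd

# Toy column with the three housing categories seen in the data
toy = pd.DataFrame({"housing": ["own", "for free", "rent", "own"]})

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy["housing"], prefix="housing", drop_first=False)

# Append the indicators and drop the original string column
encoded = toy.join(dummies).drop(columns=["housing"])

print(encoded.columns.tolist())
print(encoded.astype(int))
```

Each row has exactly one indicator set, so the original category can always be recovered from the dummy columns.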


In [10]:
#Inspecting the data after the above changes
df.head(5)

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,housing_own,housing_rent,job_high qualif/self emp/mgmt,job_skilled,job_unemp/unskilled non res,job_unskilled resident,own_telephone_none,own_telephone_yes,foreign_worker_no,foreign_worker_yes
0,<0,6.0,critical/other existing credit,radio/tv,1169.0,no known savings,>=7,4.0,male single,none,...,1,0,0,1,0,0,0,1,0,1
1,0<=X<200,48.0,existing paid,radio/tv,5951.0,<100,1<=X<4,2.0,female div/dep/mar,none,...,1,0,0,1,0,0,1,0,0,1
2,no checking,12.0,critical/other existing credit,education,2096.0,<100,4<=X<7,2.0,male single,none,...,1,0,0,0,0,1,1,0,0,1
3,<0,42.0,existing paid,furniture/equipment,7882.0,<100,4<=X<7,2.0,male single,guarantor,...,0,0,0,1,0,0,1,0,0,1
4,<0,24.0,delayed previously,new car,4870.0,<100,1<=X<4,3.0,male single,none,...,0,0,0,1,0,0,1,0,0,1


In [11]:
# Drop the original categorical columns now that they are one-hot encoded
df = df.drop(['checking_status', 'credit_history', 'purpose', 'savings_status',
              'employment', 'personal_status', 'other_parties', 'property_magnitude',
              'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker'], axis=1)

In [12]:
# generate a basic summary of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 62 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   duration                                        1000 non-null   float64
 1   credit_amount                                   1000 non-null   float64
 2   installment_commitment                          1000 non-null   float64
 3   residence_since                                 1000 non-null   float64
 4   age                                             1000 non-null   float64
 5   existing_credits                                1000 non-null   float64
 6   num_dependents                                  1000 non-null   float64
 7   class                                           1000 non-null   int64  
 8   checking_status_0<=X<200                        1000 non-null   uint8  
 9   checking_status_<0                        

In [13]:
# Inspect the first rows of the fully encoded data
df.head(5)

Unnamed: 0,duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents,class,checking_status_0<=X<200,checking_status_<0,...,housing_own,housing_rent,job_high qualif/self emp/mgmt,job_skilled,job_unemp/unskilled non res,job_unskilled resident,own_telephone_none,own_telephone_yes,foreign_worker_no,foreign_worker_yes
0,6.0,1169.0,4.0,4.0,67.0,2.0,1.0,1,0,1,...,1,0,0,1,0,0,0,1,0,1
1,48.0,5951.0,2.0,2.0,22.0,1.0,1.0,0,1,0,...,1,0,0,1,0,0,1,0,0,1
2,12.0,2096.0,2.0,3.0,49.0,1.0,2.0,1,0,0,...,1,0,0,0,0,1,1,0,0,1
3,42.0,7882.0,2.0,4.0,45.0,1.0,2.0,1,0,1,...,0,0,0,1,0,0,1,0,0,1
4,24.0,4870.0,3.0,4.0,53.0,2.0,2.0,0,0,1,...,0,0,0,1,0,0,1,0,0,1


#### Splitting the data for training and testing (70/30 data partitioning)

In [14]:
train_df, test_df = train_test_split(df, test_size=0.3)
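The split above relies on the global NumPy seed and does not stratify. As an optional variant (not what this notebook does), `train_test_split` also accepts `random_state` and `stratify`, which pin the split explicitly and keep the class proportions equal in both partitions. The sketch below uses toy data with a hypothetical class ratio.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a hypothetical 70/30 class ratio
toy = pd.DataFrame({
    "feature": np.arange(100, dtype=float),
    "class": [1] * 70 + [0] * 30,
})

# Same random_state -> identical partitions on every call
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42, stratify=toy["class"])
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42, stratify=toy["class"])

print(train_a.index.equals(train_b.index))   # the two splits are identical
print(test_a["class"].mean())                # share of class 1 in the test partition
```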

#### Creating a standard scaler, fitting it on the training set, and applying it to the numerical predictors of both sets

In [15]:
scaler = preprocessing.StandardScaler()
cols_to_stdize = ['duration', 'credit_amount', 'installment_commitment',
                  'residence_since', 'age', 'existing_credits', 'num_dependents']

# Fit the scaler on the training set only, then apply the same transformation to the test set
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])
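The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the test rows must not influence the learned mean and standard deviation. A minimal sketch on synthetic numbers (hypothetical values, not the credit data) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numerical predictor (hypothetical values)
rng = np.random.default_rng(0)
train = rng.normal(loc=35.0, scale=11.0, size=(70, 1))
test = rng.normal(loc=35.0, scale=11.0, size=(30, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean/std from the training rows only
test_scaled = scaler.transform(test)         # reuses the training mean/std

# The training column is exactly standardized; the test column is only approximately so
print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The test column is scaled with the training statistics, so its mean and standard deviation are close to, but generally not exactly, 0 and 1.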


#### Separating the predictors and target variables

In [16]:
target = 'class'
predictors = list(df.columns)
predictors.remove(target)

#### Saving the datasets for testing and training

In [17]:
X_train = train_df[predictors]
y_train = train_df[target]
X_test = test_df[predictors]
y_test = test_df[target]

# Section 3:

**Utilize all the relevant models we have covered. If it's a classifier you are building, then use all the classification models (i.e., Logistic, Decision Tree, RandomForest, etc.), and if regression, use all other regression models (i.e., linear regression, decision tree regressor, random forest regressor, etc.).**

Since this is a binary classification problem that predicts whether a customer belongs to the good or the bad credit class, classification models are the relevant ones.

In [18]:
# Creating a dataframe to store the results of the models
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

#### Fit and test with the logistic regression model (no regularization)

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
log_reg_model = LogisticRegression(penalty=None, max_iter=3000)
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [21]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"logistic regression default", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103


#### Logistic Regression with L2 Regularization

In [22]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [23]:
model_preds = log_reg_L2_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243


#### Logistic Regression with L1 Regularization

In [24]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [25]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic Regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409


#### Fit and test with Decision Tree Classifier (using default parameters)

In [26]:
dtree=DecisionTreeClassifier()

In [27]:
_ = dtree.fit(X_train, y_train)

In [28]:
y_pred = dtree.predict(X_test)

In [29]:
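# NOTE: average='macro' averages each metric over both classes, so this row is not directly
# comparable with rows whose metrics are computed for the positive class (1 = good) only.
performance = pd.concat([performance, pd.DataFrame({'model':"Decision tree Default",
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])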

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
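The decision tree row is computed with `average='macro'`, while the logistic regression rows use the positive-class formulas `TP/(TP+FP)` and `TP/(TP+FN)`. The two conventions can give quite different numbers for the same predictions. The sketch below uses a tiny set of hypothetical labels to show the gap; it is an illustration, not a re-computation of the results above.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = good, 0 = bad
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

# Default (binary) averaging scores the positive class only
binary_precision = precision_score(y_true, y_pred)
binary_recall = recall_score(y_true, y_pred)

# Macro averaging takes the unweighted mean of the per-class scores
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

print(binary_precision, macro_precision)
print(binary_recall, macro_recall)
```

Here the positive class scores 0.75 on both metrics and the negative class 0.5, so the macro average is 0.625.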


#### Decision Tree Classifier with RandomizedSearchCV

In [30]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': [2, 10, 50, 100, 200],
    'min_samples_leaf': [1, 5, 10, 20, 50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10, 25, 50, 100, 200],
    'max_depth': [5, 10, 20, 30],
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best recall score is 0.9082663368377654
... with parameters: {'min_samples_split': 2, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.005, 'max_leaf_nodes': 25, 'max_depth': 30, 'criterion': 'gini'}


In [31]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

print(c_matrix)

print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

performance = pd.concat([performance, pd.DataFrame({'model':"Decision tree with RandomSearch", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

[[ 29  62]
 [ 20 189]]
Accuracy=0.7266667 Precision=0.7529880 Recall=0.9043062 F1=0.8217391


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
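The four metrics are derived from the confusion matrix in the same way in several cells. A small helper, sketched below under the assumption that class 1 (good) is the positive class and that rows are true labels and columns are predictions (the layout returned by `confusion_matrix`), reproduces the numbers printed above. The function name `metrics_from_confusion` is hypothetical.

```python
import numpy as np

def metrics_from_confusion(c_matrix):
    """Return accuracy, precision, recall and F1 for the positive class (label 1).

    Expects a 2x2 matrix laid out as [[TN, FP], [FN, TP]].
    """
    c_matrix = np.asarray(c_matrix)
    tn, fp = c_matrix[0]
    fn, tp = c_matrix[1]
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Confusion matrix of the tuned decision tree shown above
scores = metrics_from_confusion([[29, 62], [20, 189]])
print({name: round(float(value), 7) for name, value in scores.items()})
```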


#### Decision Tree Classifier with GridSearchCV

In [32]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': [2, 10, 50, 100, 200],
    'min_samples_leaf': [1, 5, 10, 20, 50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10, 25, 50, 100, 200],
    'max_depth': [5, 10, 20, 30],
    'criterion': ['entropy', 'gini'],
}

# Start from a fresh, unfitted estimator so this cell does not depend on earlier ones
dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 5000 candidates, totalling 25000 fits
The best recall score is 0.9123479694908265
... with parameters: {'criterion': 'gini', 'max_depth': 10, 'max_leaf_nodes': 25, 'min_impurity_decrease': 0.005, 'min_samples_leaf': 1, 'min_samples_split': 10}
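The fit counts reported by the two searches follow directly from the size of the parameter grid. As a quick check, `ParameterGrid` from scikit-learn can enumerate the grid; the sketch below recomputes the 5000 candidates and 25000 fits of the grid search, and the 5000 fits of the randomized search with `n_iter=1000`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'min_samples_split': [2, 10, 50, 100, 200],
    'min_samples_leaf': [1, 5, 10, 20, 50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10, 25, 50, 100, 200],
    'max_depth': [5, 10, 20, 30],
    'criterion': ['entropy', 'gini'],
}
kfolds = 5
n_iter = 1000

# Grid search tries every combination: 5 * 5 * 5 * 5 * 4 * 2
n_candidates = len(ParameterGrid(param_grid))
grid_fits = n_candidates * kfolds

# Randomized search samples n_iter combinations from the same grid
random_fits = n_iter * kfolds

print(n_candidates, grid_fits, random_fits)
```

In this run the randomized search evaluated a fifth of the grid and produced the same test-set confusion matrix as the exhaustive search.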


In [33]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
#print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.4f} Precision={TP/(TP+FP):.4f} Recall={TP/(TP+FN):.4f} F1={2*TP/(2*TP+FP+FN):.4f}")
print(c_matrix)

performance = pd.concat([performance, pd.DataFrame({'model':"Decision tree with GridSearch", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

[[ 29  62]
 [ 20 189]]


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
0,Decision tree with GridSearch,0.726667,0.752988,0.904306,0.821739


#### Fit and test with Random Forest Classifier (using default parameters)

In [34]:
rforest = RandomForestClassifier()

In [35]:
_ = rforest.fit(X_train, y_train.values.ravel())

In [36]:
y_pred = rforest.predict(X_test)

In [37]:
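# NOTE: average='macro' averages each metric over both classes, so this row is not directly
# comparable with rows whose metrics are computed for the positive class (1 = good) only.
performance = pd.concat([performance, pd.DataFrame({'model':"RandomForest Default",
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])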

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
0,Decision tree with GridSearch,0.726667,0.752988,0.904306,0.821739
0,RandomForest Default,0.756667,0.719701,0.657842,0.671127


#### Random Forest Classifier with RandomizedSearchCV

In [38]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': [2, 10, 50, 100, 200],
    'min_samples_leaf': [1, 5, 10, 20, 50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10, 25, 50, 100, 200],
    'max_depth': [5, 10, 20, 30],
    'criterion': ['entropy', 'gini'],
}

# Start from a fresh, unfitted estimator so this cell does not depend on earlier ones
rforest = RandomForestClassifier()
rand_search = RandomizedSearchCV(estimator = rforest, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train.values.ravel())

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best recall score is 1.0
... with parameters: {'min_samples_split': 10, 'min_samples_leaf': 20, 'min_impurity_decrease': 0.005, 'max_leaf_nodes': 50, 'max_depth': 5, 'criterion': 'gini'}


In [39]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
#print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

print(c_matrix)

performance = pd.concat([performance, pd.DataFrame({'model':"Random Forest with Random search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

[[  0  91]
 [  0 209]]


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
0,Decision tree with GridSearch,0.726667,0.752988,0.904306,0.821739
0,RandomForest Default,0.756667,0.719701,0.657842,0.671127
0,Random Forest with Random search,0.696667,0.696667,1.0,0.821218
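A recall of 1.0 together with the confusion matrix `[[0 91], [0 209]]` means the tuned forest predicts "good" for every test customer: no bad customer is identified. Optimizing recall alone rewards exactly this behaviour. The sketch below rebuilds the test labels from the confusion matrix above and scores an always-"good" prediction; `balanced_accuracy_score` is added here only as an illustrative extra metric, not one used in this notebook.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score

# Test labels implied by the confusion matrix above: 91 bad (0) and 209 good (1)
y_true = np.array([0] * 91 + [1] * 209)

# A classifier that labels every customer as good
y_pred = np.ones_like(y_true)

c_matrix = confusion_matrix(y_true, y_pred)
recall_good = recall_score(y_true, y_pred)               # recall for class 1
recall_bad = recall_score(y_true, y_pred, pos_label=0)   # recall for class 0
balanced = balanced_accuracy_score(y_true, y_pred)       # mean of the two recalls

print(c_matrix)
print(recall_good, recall_bad, balanced)
```

The always-"good" prediction reproduces the matrix exactly, with perfect recall on good customers, zero recall on bad customers, and a balanced accuracy of 0.5.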


#### Random Forest Classifier with GridSearchCV

In [40]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': [2,10,50,100,200],  
    'min_samples_leaf': [1,5,10,20,50],
    'min_impurity_decrease': [0.0001, 0.0005, 0.0010, 0.0020, 0.0050],
    'max_leaf_nodes': [10,25,50,100,200], 
    'max_depth': [5,10,20,30], 
    'criterion': ['entropy', 'gini'],
}

# Start from a fresh, unfitted estimator so this cell does not depend on earlier ones
rforest = RandomForestClassifier()
grid_search = GridSearchCV(estimator = rforest, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train.values.ravel())

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 5000 candidates, totalling 25000 fits
The best recall score is 1.0
... with parameters: {'criterion': 'entropy', 'max_depth': 5, 'max_leaf_nodes': 10, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1, 'min_samples_split': 200}


In [41]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
#print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")
print(c_matrix)

performance = pd.concat([performance, pd.DataFrame({'model':"Random Forest with Grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

[[  0  91]
 [  0 209]]


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
0,Decision tree with GridSearch,0.726667,0.752988,0.904306,0.821739
0,RandomForest Default,0.756667,0.719701,0.657842,0.671127
0,Random Forest with Random search,0.696667,0.696667,1.0,0.821218
0,Random Forest with Grid search,0.696667,0.696667,1.0,0.821218


#### AdaBoost Classifier (using default parameters)

In [42]:
aboost = AdaBoostClassifier()

In [43]:
_ = aboost.fit(X_train, y_train)

In [44]:
y_pred = aboost.predict(X_test)

In [45]:

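# NOTE: average='macro' averages each metric over both classes, so this row is not directly
# comparable with rows whose metrics are computed for the positive class (1 = good) only.
performance = pd.concat([performance, pd.DataFrame({'model':"AdaBoost Default",
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])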

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
0,Decision tree with GridSearch,0.726667,0.752988,0.904306,0.821739
0,RandomForest Default,0.756667,0.719701,0.657842,0.671127
0,Random Forest with Random search,0.696667,0.696667,1.0,0.821218
0,Random Forest with Grid search,0.696667,0.696667,1.0,0.821218
0,AdaBoost Default,0.736667,0.684346,0.662101,0.669801
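
The AdaBoost row is computed with `average = 'macro'`, while the grid-search random forest row is computed from TP, FP, and FN for label 1 only. The sketch below shows how far the two conventions can differ on the same confusion matrix. It is a pure-Python illustration, not the scikit-learn implementation; the helper `per_class_scores` is hypothetical, and reporting an undefined ratio (0/0) as 0.0 is an assumption of this sketch.

```python
def per_class_scores(c_matrix, positive):
    """Precision and recall when `positive` (0 or 1) is treated as the positive label."""
    negative = 1 - positive
    tp = c_matrix[positive][positive]
    fn = c_matrix[positive][negative]
    fp = c_matrix[negative][positive]
    # Assumption of this sketch: an undefined ratio (0/0) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Confusion matrix of the grid-search random forest (rows: true, columns: predicted).
c_matrix = [[0, 91],
            [0, 209]]

# "Binary" convention: metrics for label 1 only.
binary_precision, binary_recall = per_class_scores(c_matrix, positive=1)

# "Macro" convention: unweighted mean of the per-class metrics.
other_precision, other_recall = per_class_scores(c_matrix, positive=0)
macro_precision = (binary_precision + other_precision) / 2
macro_recall = (binary_recall + other_recall) / 2

print(f"binary: precision={binary_precision:.6f} recall={binary_recall:.6f}")
print(f"macro:  precision={macro_precision:.6f} recall={macro_recall:.6f}")
```

On this matrix the positive-class recall is 1.0 while the macro-averaged recall is 0.5, which is why rows built with different conventions should be compared with care.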


#### Gradient Boosting Classifier

In [46]:
gboost = GradientBoostingClassifier()

In [47]:
_ = gboost.fit(X_train, y_train)

In [48]:
y_pred = gboost.predict(X_test)

In [49]:
# Note: as for AdaBoost, average='macro' averages each metric over both classes.
performance = pd.concat([performance, pd.DataFrame({'model':"GradientBoost Default",
                                                    'Accuracy': accuracy_score(y_test, y_pred), 
                                                    'Precision': precision_score(y_test, y_pred, average = 'macro'), 
                                                    'Recall': recall_score(y_test, y_pred, average = 'macro'), 
                                                    'F1': f1_score(y_test, y_pred, average = 'macro')
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,logistic regression default,0.753333,0.808219,0.84689,0.827103
0,L2 logistic regression,0.746667,0.803653,0.842105,0.82243
0,L1 logistic Regression,0.756667,0.803571,0.861244,0.831409
0,Decision tree Default,0.68,0.615124,0.609023,0.611462
0,Decision tree with RandomSearch,0.726667,0.752988,0.904306,0.821739
0,Decision tree with GridSearch,0.726667,0.752988,0.904306,0.821739
0,RandomForest Default,0.756667,0.719701,0.657842,0.671127
0,Random Forest with Random search,0.696667,0.696667,1.0,0.821218
0,Random Forest with Grid search,0.696667,0.696667,1.0,0.821218
0,AdaBoost Default,0.736667,0.684346,0.662101,0.669801
0,GradientBoost Default,0.753333,0.706898,0.680267,0.689528


## Section 4: 

**Summary and discussion of results. Provide metrics to support your discussion. Identify the 'best' model from all you've tried, and explain why you've chosen this 'best model.'**

#### Conclusion

Above, I implemented several models to predict the creditworthiness of customers of the Deutsche Mark bank, which can be useful for its customer acquisition strategy.

Here, the target variable "class" is predicted to determine whether a customer falls into the good or bad credit category for the bank.

In the above analysis, I have optimized the decision tree models and ensemble models for "Recall" score. 

Recall, or true positive rate, measures the proportion of actual positive cases that the model correctly identifies. If the positive class represents high-risk customers, a high recall means that the model captures a larger share of the truly high-risk customers and produces fewer false negatives. By optimizing recall, I aimed to identify as many high-risk customers as possible, minimizing the chance of overlooking individuals who may default. This would help the bank manage risk, for example by offering lower credit limits or adding verification steps for customers identified as high-risk.

This interpretation depends on how "class" is encoded. In the test set the positive label (1) covers 209 of 300 customers (about 70%), so it is the majority class. If that label stands for "good", the recall values reported here measure the share of good customers correctly identified, not the share of high-risk customers.

However, there is a trade-off between recall and other metrics such as precision and overall accuracy. Maximizing recall may increase the number of false positives, that is, customers incorrectly assigned to the positive class. Therefore, finding the right balance between recall, precision, and other relevant metrics is crucial in credit risk analysis.
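
The trade-off described above can be illustrated with a small sketch. The scores and labels below are hypothetical values made up for this example, not output from any model in this notebook, and `precision_recall_at` is a helper introduced only here. Lowering the decision threshold turns more customers into predicted positives, which raises recall and tends to lower precision.

```python
# Hypothetical predicted probabilities for the positive class, with true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]


def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    # Assumption of this sketch: an undefined ratio (0/0) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.75, 0.50, 0.05):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

With these made-up values, moving the threshold from 0.75 to 0.05 raises recall from 0.5 to 1.0 while precision falls from 1.0 to 0.6. The lowest threshold predicts every case as positive, the same pattern seen in the tuned random forests.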

Based on the above results, here is my brief analysis of the performance of each model.

1. Logistic Regression Default:
   - Accuracy: 0.753333
   - Precision: 0.808219
   - Recall: 0.846890
   - F1 Score: 0.827103

The logistic regression model performs well overall, with a decent accuracy and balanced precision and recall. It seems to be a good baseline model for predicting credit risk.

2. L2 Logistic Regression:
   - Accuracy: 0.746667
   - Precision: 0.803653
   - Recall: 0.842105
   - F1 Score: 0.822430

The L2 logistic regression model has similar performance to the default logistic regression, indicating that the regularization parameter (L2 penalty) did not have a significant impact on the results.

3. L1 Logistic Regression:
   - Accuracy: 0.756667
   - Precision: 0.803571
   - Recall: 0.861244
   - F1 Score: 0.831409

The L1 logistic regression model performs slightly better than the other logistic regression models, with a higher recall and F1 score. It shows a good ability to identify customers at risk of credit default.

4. Decision Tree Default:
   - Accuracy: 0.680000
   - Precision: 0.615124
   - Recall: 0.609023
   - F1 Score: 0.611462

The decision tree model with default settings does not perform as well as the logistic regression models. It has lower accuracy, precision, and recall, suggesting that it might not be the best choice for predicting credit risk.

5. Decision Tree with RandomSearch:
   - Accuracy: 0.726667
   - Precision: 0.752988
   - Recall: 0.904306
   - F1 Score: 0.821739

The decision tree model with random search performs better than the default decision tree, with a much higher recall (0.904306). However, its precision (0.752988) is lower than that of the logistic regression models (about 0.80), indicating more false positives.

6. Decision Tree with GridSearch:
   - Accuracy: 0.726667
   - Precision: 0.752988
   - Recall: 0.904306
   - F1 Score: 0.821739

The decision tree model with grid search yields the same performance as the random search version. It also has a high recall but lower precision.

7. RandomForest Default:
   - Accuracy: 0.756667
   - Precision: 0.719701
   - Recall: 0.657842
   - F1 Score: 0.671127

The random forest model with default settings shows decent accuracy but relatively low recall and precision. It might not be the best model for credit risk prediction based on these results.

8. Random Forest with Random search:
   - Accuracy: 0.696667
   - Precision: 0.696667
   - Recall: 1.000000
   - F1 Score: 0.821218

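The random forest model with random search has a perfect recall, but this is less informative than it looks. Its accuracy and precision both equal 0.696667, the share of positive cases in the test set (209 of 300), which is exactly what a model that labels every test customer as positive would score. Its precision is therefore relatively low, and every negative case becomes a false positive.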

9. Random Forest with Grid search:
   - Accuracy: 0.696667
   - Precision: 0.696667
   - Recall: 1.000000
   - F1 Score: 0.821218

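The random forest model with grid search performs the same as the random search version. Its confusion matrix shows that it predicts every test customer as positive (0 true negatives, 91 false positives), so its perfect recall comes with lower precision.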

10. AdaBoost Default:
    - Accuracy: 0.736667
    - Precision: 0.684346
    - Recall: 0.662101
    - F1 Score: 0.669801

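The AdaBoost model with default settings has an accuracy (0.736667) that falls between the tuned decision trees and the logistic regression models. Its precision, recall, and F1 score are lower than theirs, but these values are macro-averaged over both classes, so they are not directly comparable with rows computed for the positive class only.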

11. GradientBoost Default:
    - Accuracy: 0.753333
    - Precision: 0.706898
    - Recall: 0.680267
    - F1 Score: 0.689528

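The GradientBoost model scores slightly higher than the AdaBoost model on every metric, and its accuracy (0.753333) matches the default logistic regression. Its macro-averaged precision and recall are lower than the logistic regression values, although the two sets of numbers are not directly comparable.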

Based on these results, and in the context of predicting customer credit risk, I consider the L1 Logistic Regression the best model. It has the highest accuracy in the table (0.756667, tied with the default random forest) and a high recall of 0.861244, so it produces relatively few false negatives. It also has the highest F1 score (0.831409), indicating a good balance between precision and recall. Unlike the tuned random forests, it achieves this without labelling every customer as positive. I would therefore choose the L1 Logistic Regression model for predicting the credit risk associated with customers.
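
The choice above can be reproduced with a short sketch that ranks the models by a chosen metric. The numbers are copied from the performance table; the list name `results` and the column order are conventions of this sketch only. Some rows use macro-averaged metrics, so the ranking is indicative and not a strict comparison.

```python
# (model, accuracy, precision, recall, f1) copied from the performance table.
results = [
    ("logistic regression default", 0.753333, 0.808219, 0.846890, 0.827103),
    ("L2 logistic regression", 0.746667, 0.803653, 0.842105, 0.822430),
    ("L1 logistic Regression", 0.756667, 0.803571, 0.861244, 0.831409),
    ("Decision tree Default", 0.680000, 0.615124, 0.609023, 0.611462),
    ("Decision tree with RandomSearch", 0.726667, 0.752988, 0.904306, 0.821739),
    ("Decision tree with GridSearch", 0.726667, 0.752988, 0.904306, 0.821739),
    ("RandomForest Default", 0.756667, 0.719701, 0.657842, 0.671127),
    ("Random Forest with Random search", 0.696667, 0.696667, 1.000000, 0.821218),
    ("Random Forest with Grid search", 0.696667, 0.696667, 1.000000, 0.821218),
    ("AdaBoost Default", 0.736667, 0.684346, 0.662101, 0.669801),
    ("GradientBoost Default", 0.753333, 0.706898, 0.680267, 0.689528),
]

F1_COLUMN = 4
RECALL_COLUMN = 3

# Rank all models from highest to lowest F1 score.
ranked_by_f1 = sorted(results, key=lambda row: row[F1_COLUMN], reverse=True)
best_by_f1 = ranked_by_f1[0]

# For contrast: the model that would win if recall alone decided.
best_by_recall = max(results, key=lambda row: row[RECALL_COLUMN])

print("Best by F1:    ", best_by_f1[0], best_by_f1[F1_COLUMN])
print("Best by recall:", best_by_recall[0], best_by_recall[RECALL_COLUMN])
```

Ranking by recall alone would select a tuned random forest, which predicts every customer as positive, whereas ranking by F1 selects the L1 Logistic Regression.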