# Technical notebook - Exploratory Data Analysis

### About Lending Club

Lending Club is the world leader in p2p lending, having issued over $9 billion in loans since it began in 2007. It is growing at a rate in excess of 150% a year.

source: https://www.lendacademy.com/lending-club-review/

### How it Works

Peer to peer lending at Lending Club is a very simple process. It begins with the borrower. They apply for a loan and if they meet certain criteria (such as a minimum 660 FICO score) their loan is added to Lending Club’s online platform. Investors can browse the loans on the platform and build a portfolio of loans. The minimum investment an investor can make is just $25 per loan. Each portion of a loan is called a note and smart investors build a portfolio of notes to spread their risk among many borrowers.

Lending Club will perform some level of verification on every borrower. As this verification process is happening investors can be funding portions of the loans. If the borrower passes verification the loan is approved for investors and will be issued to the borrower if fully funded. If the borrower fails verification the loan will not be issued. It will be deleted from the platform and all money that had been invested will be returned to the respective investors.

A loan can stay on the platform for up to 14 days. Most loans are funded much quicker than that and once funded the loan will be deleted from the platform. Approved borrowers will receive their money (less an origination fee) in just a couple of business days once funding is complete and then begin making payments within 30 days. These payments will be for principal plus interest on a standard amortization schedule.
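To make the "standard amortization schedule" concrete, the sketch below computes the fixed monthly payment and the interest/principal split of each payment. The loan figures (10,000 dollars at 12% a year over 36 months) and the function names are illustrative assumptions, not Lending Club data, and the origination fee is ignored.

```python
def monthly_payment(principal, annual_rate, n_months):
    """Fixed payment of a fully amortizing loan: P * r / (1 - (1 + r) ** -n)."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)


def amortization_schedule(principal, annual_rate, n_months):
    """Return a list of (month, interest, principal_paid, remaining_balance)."""
    payment = monthly_payment(principal, annual_rate, n_months)
    balance = principal
    rows = []
    for month in range(1, n_months + 1):
        interest = balance * annual_rate / 12   # interest accrues on the open balance
        principal_paid = payment - interest     # the rest of the payment reduces the balance
        balance -= principal_paid
        rows.append((month, interest, principal_paid, balance))
    return rows


# Illustrative numbers only (assumed, not taken from the dataset)
payment = monthly_payment(10_000, 0.12, 36)
schedule = amortization_schedule(10_000, 0.12, 36)
print(f"monthly payment: {payment:.2f}")
print(f"interest in month 1: {schedule[0][1]:.2f}")
```

Early payments are mostly interest and later payments mostly principal, which is why a default late in the term costs the investor less than an early one.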

### Explanation of loan grades

Lending Club categorizes borrowers into seven different loan grades: A through G. Within each loan grade there are five sub-grades meaning there are 35 total loan grades for borrowers from A1 down to G5. Where a borrower is graded depends on many factors the most important of which is the data held in the borrower’s credit report. The better credit history a borrower has the better their loan grade with the very best borrowers receiving an A1 grade, which carries the lowest interest rate.
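The grading scheme can be enumerated in a few lines. The ordinal encoding at the end is only an illustration of how the 35 sub-grades could be turned into a numeric risk rank; it is an assumption and is not used in this analysis.

```python
# Seven grades (A-G), five sub-grades each
grades = "ABCDEFG"
sub_grades = [f"{g}{n}" for g in grades for n in range(1, 6)]

# Hypothetical ordinal encoding: A1 -> 0 (best) ... G5 -> 34 (worst)
sub_grade_rank = {sg: i for i, sg in enumerate(sub_grades)}

print(len(sub_grades), sub_grades[0], sub_grades[-1])
```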

Lending Club pulls the latest credit report for every borrower and combines the data held in that report with other factors, such as loan amount and loan term, to determine the interest rate. Lending Club provides more information on the "Interest Rates and How We Set Them" page on its site.

### What are the Risks?

Every investor should consider the risks of an investment before committing their money. Investing with p2p lending has a number of risks:

**Borrower defaults** – the loans are unsecured so an investor has little recourse if the borrower decides not to pay. The annual default rate across all grades at Lending Club is around 6 or 7% with higher risk borrowers having a higher default rate.

**Lending Club bankruptcy** – This is a much smaller risk today than it was several years ago because Lending Club is making money and has had an influx of cash with the recent IPO. But the risk will always be there. In the unlikely event of a bankruptcy, there is a backup loan servicer who will take over servicing the loans but there would likely be some disruption and investors could lose some principal.

**Interest rate risk** – the loan terms are three or five years so during this time interest rates could increase substantially. If an FDIC insured investment is paying 6% it makes investing in a Lending Club loan at 7% not the best investment.

**Poor loan diversification** – many new investors get caught in this trap. They do not take advantage of the \\$25 minimum investment. If you invest in 20 loans at \\$250 you are running a much higher risk than if you invest in 200 loans at \\$25. If you only have 20 loans one default could wipe out most of your gains.
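The diversification argument can be checked with rough arithmetic. The sketch below compares how much of one year's interest a single default removes in the two portfolios from the paragraph above. The 7% rate, the simple one-year interest, and the full loss of the defaulted note are simplifying assumptions.

```python
def loss_share_of_gains(n_loans, note_size, annual_rate=0.07):
    """Share of one year's interest that a single fully lost note removes."""
    portfolio = n_loans * note_size
    expected_interest = portfolio * annual_rate   # simple interest, no compounding
    loss_from_one_default = note_size             # worst case: the whole note is lost
    return loss_from_one_default / expected_interest


concentrated = loss_share_of_gains(n_loans=20, note_size=250)
diversified = loss_share_of_gains(n_loans=200, note_size=25)

print(f"20 loans:  one default removes {concentrated:.0%} of the yearly interest")
print(f"200 loans: one default removes {diversified:.0%} of the yearly interest")
```

Both portfolios hold the same total amount, but a single default costs ten times more in the concentrated one.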

**Liquidity risk** – There is a secondary market on Lending Club where loans can be sold but if you need to liquidate your entire investment you will likely lose some principal in the process.

**Market-wide event or recession** – P2p lending has only been around since the 2008 recession, so the asset class remains untested in a downturn during which platforms originate significant volumes. In a recession, defaults will increase and investor returns will decrease accordingly.


**This analysis focuses on borrower defaults and tries to identify a better classification**

### Data source:

https://www.lendingclub.com/info/download-data.action


The data covers the consumer loans issued in a given year and includes the information that was available at the time of the application as well as payment information.

They also provide a short description of the columns.

This analysis is based on **LendingClub loan data from 2014.**

### Importing libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")


import scipy.stats as stats

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_predict

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve


import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# from loan_helper import data_cleaning functions
from loan_helper import data_converting

#SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler



### Importing Dataset

In [None]:
!ls -lath LendingClub

In [None]:
#reading excel
description = pd.read_excel('LendingClub/LCDataDictionary.xlsx')

Reading from csv file:

In [None]:
# reading Lending Club loan data from 2014
# data_lc = pd.read_csv('LendingClub/LoanStats3c_securev1.csv', low_memory=False, header=1)
# data_lc.shape

Reading from zip file:

In [None]:
#read in Lending Club loan data from zip file
from zipfile import ZipFile
zip_file = ZipFile('LendingClub/LoanStats3c_securev1.csv.zip')
data_lc = pd.read_csv(zip_file.open('LoanStats3c_securev1.csv'), low_memory=False, header=1)
data_lc.shape

In [None]:
data_lc.loan_amnt.isna().sum() #there are rows without loan amount

In [None]:
#remove the two rows that contain only NaN values
data_lc = data_lc.loc[data_lc.loan_amnt.notnull()]
data_lc.shape

## I. Feature selection and feature engineering

### 1. Understanding the columns

In order to understand the columns, we created a dataframe with the column names, two example values, the data type, the number of missing values, and the long description of each column. The dataframe was exported to Excel to decide which columns to keep. The result is stored in col_selection_2014.xlsx.
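The helper `column_description()` lives in `loan_helper` and is not shown in this notebook. The sketch below is a hypothetical stand-in that builds a table of the kind described above. The function name `describe_columns` and the dictionary column names `LoanStatNew` and `Description` are assumptions, so adjust them to the actual data dictionary file.

```python
import pandas as pd


def describe_columns(df, dictionary, name_col="LoanStatNew", desc_col="Description"):
    """Hypothetical stand-in for column_description(): one summary row per column of df."""
    def example(col, pos):
        values = df[col].dropna()
        return values.iloc[pos] if len(values) > pos else None

    summary = pd.DataFrame({
        "col_name": df.columns,
        "example_1": [example(c, 0) for c in df.columns],
        "example_2": [example(c, 1) for c in df.columns],
        "dtype": [str(df[c].dtype) for c in df.columns],
        "nan_counts": [int(df[c].isna().sum()) for c in df.columns],
    })
    lookup = dictionary[[name_col, desc_col]].rename(
        columns={name_col: "col_name", desc_col: "description"})
    # left join keeps columns that have no entry in the data dictionary
    return summary.merge(lookup, on="col_name", how="left")


# Tiny made-up frame to show the output shape
demo = pd.DataFrame({"loan_amnt": [1000.0, None, 2500.0],
                     "term": [" 36 months", " 60 months", " 36 months"]})
demo_dict = pd.DataFrame({"LoanStatNew": ["loan_amnt"],
                          "Description": ["The listed amount of the loan."]})
desc_demo = describe_columns(demo, demo_dict)
print(desc_demo)
```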

In [None]:
from loan_helper import column_description
desc = column_description(data_lc, description) #column_description() is my own function to create this table

In [None]:
pd.set_option('display.max_colwidth', None) #show the full description instead of truncating it at 50 characters (-1 is no longer accepted by recent pandas)
desc.head(4)

In [None]:
desc.to_excel('col_desc_2014.xlsx')

### 2. First round feature selection

Originally the dataset contained **150 columns/features**. Features were removed for the reasons listed below:

- Discarded columns which were holding **irrelevant** information:
    - id, member_id, url, policy_code, application_type, annual_inc_joint, dti_joint, verification_status_joint, acc_now_delinque
- Discarded columns which contained **redundant** information:
    - funded_amnt, funded_amnt_inv (the difference is funded by LC itself), int_rate, sub_grade, title, fico_range_high, desc (given as purpose)
- Discarded columns that contained **payment or collection** information (34 columns)
- Discarded columns that contained information that was not available at the time of the credit application, **future info**
    - issue_d, pymnt_plan, initial_list_status, last_credit_pull_d, last_fico_range_high, last_fico_range_low
- Discarded features that require **too much data processing**
    - emp_title (due to the free format), zip_code (there is a state instead)
- Discarded columns with **only missing values** (30 columns)
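As a minimal illustration of two of the rules above, the sketch below drops a list of named columns and then every column that contains only missing values. The tiny frame is made up. In the notebook itself the decision is read from the spreadsheet instead.

```python
import numpy as np
import pandas as pd

# Made-up frame: one useful column, one irrelevant column, one all-NaN column
demo = pd.DataFrame({
    "loan_amnt": [1000.0, 2500.0],
    "url": ["u1", "u2"],
    "member_id": [np.nan, np.nan],
})

irrelevant = ["url"]                        # columns discarded by name
reduced = (demo.drop(columns=irrelevant)
               .dropna(axis=1, how="all"))  # drop columns with only missing values
kept_columns = reduced.columns.to_list()
print(kept_columns)
```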

In [None]:
!ls

In [None]:
col_selection = pd.read_excel('col_selection_2014.xlsx')
col_selection.head()

In [None]:
col_selection.Cause.value_counts() #these are the different causes to discard columns

#### Evidence

In [None]:
data_lc.application_type.unique() #there are only individual applications, no joint applications

In [None]:
#the difference between the low and high FICO scores is a fixed number, so it is enough to keep only one of them
(data_lc.fico_range_high - data_lc.fico_range_low).value_counts() 

In [None]:
len(data_lc.emp_title.unique()) #number of different values given as employment title

In [None]:
len(data_lc.zip_code.unique()) #number of different values given to the zip_code (categorical value)

In [None]:
sum(data_lc.isna().sum() == len(data_lc)) #number of columns with only missing values, 1 of them is overlapping

In [None]:
#col_selection.loc[col_selection.nan_counts == 235629, ['col_name', 'nan_counts']] #columns with missing values

#### Remove unwanted columns

In [None]:
selected_col = col_selection.loc[col_selection.Decision == 'keep', 'col_name'].to_list()
len(selected_col)

In [None]:
dataset = data_lc.loc[:, selected_col]

In [None]:
dataset.shape

### 3. Exploratory data analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# sns.set_palette(sns.light_palette((360, 90, 50)))
%matplotlib inline

In [None]:
dataset.columns

In [None]:
sns.set_style('white')
fig = plt.figure()
sns.histplot(dataset.loan_amnt, bins=7) #histplot replaces the deprecated distplot(kde=False)
plt.xlabel('loan amount');

In [None]:
# sns.barplot(dataset.term)
# term is categorical, so a box plot is used instead of a regression plot
sns.boxplot(x='term', y='loan_amnt', hue='loan_status', data=dataset)

In [None]:
sns.scatterplot(x='dti', y='loan_amnt', hue='loan_status', data=dataset) #dti is continuous, so a scatter plot is used instead of one box per value

### 4. Converting data types

- **emp_length** column was converted to numeric
- **earliest_cr_line** date was converted to numeric (years)
- **revol_util** (revolving utilization) was converted to numeric
- created **regions from state**; the state column was removed
- reduced the categories of loan purpose by aggregating similar categories
- filled NaN values with 0 in four columns (mths_since_recent_bc_dlq, mths_since_recent_revol_delinq, emp_length, mo_sin_old_il_acct)

The above transformations do not cause data leakage and can be done before the train-test split.
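The conversions are implemented in `data_converting()` from `loan_helper`, which is not shown here. The sketch below illustrates how two of them could look, assuming the raw values follow patterns such as `'10+ years'`, `'< 1 year'` and `'45.3%'`. The function names are hypothetical, and unlike the helper described above, the sketch leaves missing values as NaN.

```python
import numpy as np
import pandas as pd


def emp_length_to_years(s):
    """'10+ years' -> 10.0, '< 1 year' -> 0.0, '3 years' -> 3.0; missing stays NaN."""
    return (s.replace("< 1 year", "0 years")
             .str.extract(r"(\d+)", expand=False)
             .astype(float))


def percent_to_float(s):
    """'45.3%' -> 45.3; missing stays NaN."""
    return s.str.rstrip("%").astype(float)


# Made-up values in the assumed raw format
demo = pd.DataFrame({
    "emp_length": ["10+ years", "< 1 year", "3 years", np.nan],
    "revol_util": ["45.3%", "0%", np.nan, "101.2%"],
})
demo["emp_length"] = emp_length_to_years(demo["emp_length"])
demo["revol_util"] = percent_to_float(demo["revol_util"])
print(demo)
```

Both functions look at one row at a time and learn nothing from the rest of the data, which is why they are safe to apply before the split.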

In [None]:
dataset = data_converting(dataset)

In [None]:
dataset.info()

### Determining the target feature

In [None]:
df_term = pd.DataFrame(dataset.loan_status.value_counts())
df_term

#### Meaning of the categories
<b>Fully paid:</b> Loan has been fully repaid, either at the expiration of the 3- or 5-year term or as a result of a prepayment.

<b>Current:</b> Loan is up to date on all outstanding payments.

<b>In Grace Period:</b> Loan is past due but within the 15-day grace period.

<b>Late (16-30):</b> Loan has not been current for 16 to 30 days.

<b>Late (31-120):</b> Loan has not been current for 31 to 120 days.

<b>Default:</b> Loan has not been current for an extended period of time.

<b>Charged Off:</b> Loan for which there is no longer a reasonable expectation of further payments. Upon Charge Off, the remaining principal balance of the Note is deducted from the account balance.

Source: https://help.lendingclub.com/hc/en-us/articles/215488038-What-do-the-different-Note-statuses-mean-

In [None]:
pd.crosstab(columns=dataset['loan_status'], index=dataset['term'],)

We decided to ignore the 'gray' categories, where the loan might still be recovered. The 'Current' category contains the 60-month term loans; removing them would penalize the long-term loans by increasing the default rate within this category.

In [None]:
#Select default categories:
dataset.loc[dataset.loan_status == 'Fully Paid', 'default'] = 0
dataset.loc[dataset.loan_status == 'Charged Off', 'default'] = 1
dataset.loc[dataset.loan_status == 'Current', 'default'] = 0
#dropped the rows where the loan status was different
dataset = dataset.loc[dataset.default.notnull()]

In [None]:
#remove loan_status; the default column replaces it
dataset = dataset.drop(columns='loan_status')

In [None]:
dataset = dataset.reset_index(drop=True)

In [None]:
my_palette = ["#8c8c8c", "#e54632"]

fig1 = plt.figure(figsize=(5,5))
sns.countplot(x = 'default', data=dataset, palette=my_palette, )
ax = plt.gca()
plt.title('Target categories\n', fontsize=16)
ax.set_xticklabels(['Non-default', 'Default'])
ax.tick_params(axis='both', labelsize=12)
ax.set_xlabel('')
ax.set_ylabel('')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
fig1.savefig('Target_var');

In [None]:
# dataset.head()

In [None]:
x_feats = dataset.columns.to_list()
x_feats.remove('default')

### Multicollinearity examination

In [None]:
corr = dataset.select_dtypes(include='number').corr() #correlation is only defined for the numeric columns
plt.figure(figsize=(16,12))
sns.heatmap(corr);
# corr

In [None]:
corr_df = []
for i in range(corr.shape[1]):
    for j in range(i+1, corr.shape[1]):
        if corr.iat[i,j]>0.8:
            corr_df.append([corr.index[i], corr.columns[j], corr.iat[i,j]])
corr_df = pd.DataFrame(corr_df, columns = ['col_name1','col_name2', 'r']) #r is the Pearson correlation coefficient, not R squared

In [None]:
corr_df
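
The nested loop above can also be written without explicit indexing. The sketch below is one possible vectorized variant, shown on a made-up frame. It keeps the upper triangle of the correlation matrix and, unlike the loop, also reports strongly negative correlations by filtering on the absolute value. The function name `high_corr_pairs` is an assumption.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    # keep each pair once: strictly upper triangle, diagonal excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["col_name1", "col_name2", "r"]
    return pairs[pairs["r"].abs() > threshold].reset_index(drop=True)


# Made-up data: b is a multiple of a, c is uncorrelated with both
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, -1, 1, -1, 1],
})
pairs_demo = high_corr_pairs(demo)
print(pairs_demo)
```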

In [None]:
remove_col = ['installment', 'mo_sin_old_rev_tl_op', 'num_sats', 'revol_bal', 'bc_util', 'num_bc_tl',
             'avg_cur_bal', 'total_bc_limit', 'num_actv_bc_tl', 'num_rev_tl_bal_gt_0', 'total_bal_ex_mort']

In [None]:
corr_columns = list(set(corr_df.col_name1.unique()) | set(corr_df.col_name2.unique()))
corr_columns.sort()
corr2 = dataset.loc[:,corr_columns].corr()
fig2 = plt.figure(figsize=(10,8))
ax = plt.gca()
ax.tick_params(axis='both', labelsize=14)
plt.title('Feature correlation heatmap\n', fontsize = 24)
sns.heatmap(corr2)
fig2.savefig('Feature_corr2');

In [None]:
x_feats = [col for col in x_feats if col not in remove_col]

In [None]:
len(x_feats)

In [None]:
# x_feats

## Preparing dataset for modeling

In [None]:
X = pd.get_dummies(dataset[x_feats], drop_first=True)

In [None]:
y = dataset.default

In [None]:
y.value_counts()

In [None]:
X.head()

In [None]:
X.shape

In [None]:
# X.columns

### Train-Test Split

In [None]:
#stratify=y keeps the same ratio of the target classes in the train and test sets as in the original dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y) #default test size: 25%

In [None]:
y_train.mean()

In [None]:
y_test.mean()

### Further data cleaning separately for train and test


- annual_inc: annual income contains extreme values; among the many possible strategies, we chose to truncate them at the 99th percentile (computed on the training set)
- missing values are imputed with the median (fitted on the training set)

In [None]:
imp = SimpleImputer(strategy='median', copy=True, fill_value=None)
imp.fit(X_train)  

X_train_imp = imp.transform(X_train)      
X_test_imp = imp.transform(X_test)

In [None]:
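income_trunc = X_train.annual_inc.quantile(q=0.99)
income_trunc

# truncate only the annual_inc column; comparing the whole array would cap every feature
inc_idx = X_train.columns.get_loc('annual_inc')
X_train_imp[:, inc_idx] = np.minimum(X_train_imp[:, inc_idx], income_trunc)
X_test_imp[:, inc_idx] = np.minimum(X_test_imp[:, inc_idx], income_trunc)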

In [None]:
plt.hist(X_train_imp[:, X_train.columns.get_loc('annual_inc')]) #look up the column position instead of hard-coding it
plt.title('Annual income');

### Scaling

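Standardization (zero mean and unit variance per feature) is the most common scaling method. The scaler is fitted on the training set only.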

In [None]:
scaler = StandardScaler()
scaler.fit(X_train_imp)
X_train_scaled = scaler.transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)

## Objective: Increasing the predictability of actual loan defaults

### Maximize the F1 score

In [None]:
def get_metric(y_train, y_probability):
    '''
    inputs: true y values and the default probabilities predicted by the trained model
    output: confusion matrix at the cut-off that maximizes the F1 score
            (recall, precision, F1 score and cut-off are printed)
    '''
    best_f1 = 0
    cut_off = 0
    # start from the all-negative prediction so the metrics below are defined
    # even if no cut-off reaches a positive F1 score
    y_best = np.zeros(len(y_train), dtype=int)
    for cutoff in np.linspace(0,1,101):
        y_hat = (y_probability > cutoff) * 1
        f1 = f1_score(y_train, y_hat)
        if f1 > best_f1:
            best_f1 = f1
            cut_off = cutoff
            y_best = y_hat
    
    print('Recall:', recall_score(y_train, y_best))
    print('Precision:', precision_score(y_train, y_best))
    print('F1_score:', best_f1)
    print('Cut_off:', cut_off)
    
    conf_matrix = pd.DataFrame(confusion_matrix(y_train, y_best),
                                    index=['actual 0', 'actual 1'],
                                    columns=['predicted 0', 'predicted 1'])
    return conf_matrix
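
As an alternative to scanning a fixed grid of 101 cut-offs, the F1-maximizing threshold can be read off the precision-recall curve. The sketch below is one way to do that. The function name `best_f1_threshold` is an assumption, and note that `precision_recall_curve` classifies with `score >= threshold` whereas `get_metric` uses `>`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) for the threshold with the highest F1 score."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # the last precision/recall pair has no associated threshold
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]


# Made-up labels and scores
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])
thr_demo, f1_demo = best_f1_threshold(y_true_demo, y_score_demo)
print(thr_demo, f1_demo)
```

This evaluates every distinct score as a candidate threshold, so it cannot miss an optimum that falls between two grid points.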

## Model Selection - Logistic regression

## I. Modeling without imbalance strategy

### A. Baseline: vanilla logistic regression (w/o imbalance strategy)

In [None]:
logreg_vanilla = LogisticRegression(C=1e9, solver='liblinear', max_iter=200)

model_vanilla = logreg_vanilla.fit(X_train_scaled, y_train)
y_probability = model_vanilla.predict_proba(X_train_scaled)[:,1]
get_metric(y_train, y_probability)

In [None]:
p, r, t = precision_recall_curve(y_train, model_vanilla.decision_function(X_train_scaled))

In [None]:
from inspect import signature
step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})
plt.step(r, p, color='b', alpha=0.4, where='post')
plt.fill_between(r, p, color='b', alpha=0.4, **step_kwargs)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1])
plt.title('Precision-Recall curve');

In [None]:
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_train, y_probability)
plt.plot(fpr_rt_lm, tpr_rt_lm)
plt.show()

### B. Lasso regression with different C values (w/o imbalance strategy)

In [None]:
C_values = [0.01, 0.1, 1, 10, 100, 1000, 10000]  # low value means high l1 penalty on coefficients

for C in C_values:
    logreg_l1 = LogisticRegression(C=C, penalty='l1',
                                   solver='liblinear',
                                   max_iter=200)
    print('-'*40,f'\nLasso regression with C = {C}')
    model_l1 = logreg_l1.fit(X_train_scaled, y_train)
    y_probability = model_l1.predict_proba(X_train_scaled)[:,1]
    print(get_metric(y_train, y_probability))


### C. Ridge regression with different C values (w/o imbalance strategy)

In [None]:
C_values = [0.01, 0.1, 1, 10, 100, 1000, 10000]  # low value means high l2 penalty on coefficients

for C in C_values:
    logreg_l2 = LogisticRegression(C=C, penalty='l2',
                                   solver='newton-cg',
                                   max_iter=200)
    
    print('-'*40,f'\nRidge regression with C = {C}')
    model_l2 = logreg_l2.fit(X_train_scaled, y_train)
    y_probability = model_l2.predict_proba(X_train_scaled)[:,1]
    print(get_metric(y_train, y_probability))

### D. Cross-Validation (w/o imbalance strategy)

In [None]:
cv = StratifiedKFold(n_splits= 5, random_state=1000, shuffle=True)

#### Vanilla

In [None]:
lr_vanilla = LogisticRegression(C=1e9,
                                solver='newton-cg',
                                max_iter=200)


cv_vanilla = cross_validate(estimator=lr_vanilla,
                            X=X_train_scaled, y=y_train,
                            cv=cv,
                            n_jobs=-1,
                            return_train_score=True)

In [None]:
y_probability = cross_val_predict(lr_vanilla, X_train_scaled, y_train, cv=cv, method='predict_proba')[:,1]
get_metric(y_train, y_probability)

#### Ridge

In [None]:
l2_reg = LogisticRegression(C=1,
                            solver='newton-cg',
                            penalty='l2',
                            max_iter=200)

cv_l2 = cross_validate(estimator=l2_reg, X=X_train_scaled, y=y_train,
                       cv=cv,
                       n_jobs=-1,
                       return_estimator=True,
                       return_train_score=True)

In [None]:
y_probability = cross_val_predict(l2_reg, X_train_scaled, y_train, cv=cv, method='predict_proba')[:,1]
get_metric(y_train, y_probability)

#### Lasso

In [None]:
l1_reg = LogisticRegression(C=1,
                            solver='saga',
                            penalty='l1',
                            max_iter=200)
cv_l1 = cross_validate(estimator=l1_reg, X=X_train_scaled, y=y_train,
                       cv=cv,
                       n_jobs=-1,
                       return_estimator=True,
                       return_train_score=True)

In [None]:
y_probability = cross_val_predict(l1_reg, X_train_scaled, y_train, cv=cv, method='predict_proba')[:,1]
get_metric(y_train, y_probability)

## II. Imbalance Strategy: Random undersampling

In [None]:
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train_scaled, y_train) 

In [None]:
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train_scaled, y_train) 

In [None]:
pd.Series(y_train_rus).value_counts()

In [None]:
pd.Series(y_train_ros).value_counts()
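
For intuition, random undersampling keeps all minority-class rows and draws an equally sized random subset of the majority class. The sketch below reproduces the idea with NumPy only. It is not the imblearn implementation, and the function name `random_undersample` is an assumption.

```python
import numpy as np


def random_undersample(X, y, seed=42):
    """Keep n_min randomly chosen rows per class, n_min being the smallest class size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Made-up imbalanced data: 8 negatives, 2 positives
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)
X_small, y_small = random_undersample(X_demo, y_demo)
print(np.bincount(y_small))
```

The price of the balanced classes is that most majority-class rows are thrown away, which matters less here because the majority class is large.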

### A) Baseline Vanilla (imbalance strategy: undersampling)

In [None]:
logreg_vanilla = LogisticRegression(C=1e9, solver='liblinear', max_iter=200)

model_vanilla = logreg_vanilla.fit(X_train_rus, y_train_rus)
y_probability = model_vanilla.predict_proba(X_train_scaled)[:,1]
get_metric(y_train, y_probability)

### D. Cross-validation (imbalance strategy: undersampling)

#### Vanilla

In [None]:
cv_vanilla = cross_validate(estimator=lr_vanilla,
                            X=X_train_rus, y=y_train_rus,
                            cv=cv,
                            n_jobs=-1,
                            return_train_score=True)
y_probability = cross_val_predict(lr_vanilla, X_train_rus, y_train_rus, cv=cv, method='predict_proba')[:,1]
get_metric(y_train_rus, y_probability)

#### Lasso

In [None]:
cv_l1 = cross_validate(estimator=l1_reg, X=X_train_rus, y=y_train_rus,
                       cv=cv,
                       n_jobs=-1,
                       return_estimator=True,
                       return_train_score=True)
y_probability = cross_val_predict(l1_reg, X_train_rus, y_train_rus, cv=cv, method='predict_proba')[:,1]
get_metric(y_train_rus, y_probability)

#### Ridge

In [None]:
cv_l2 = cross_validate(estimator=l2_reg, X=X_train_rus, y=y_train_rus,
                       cv=cv,
                       n_jobs=-1,
                       return_estimator=True,
                       return_train_score=True)
y_probability = cross_val_predict(l2_reg, X_train_rus, y_train_rus, cv=cv, method='predict_proba')[:,1]
get_metric(y_train_rus, y_probability)

In [None]:
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_train_rus, y_probability)
plt.plot(fpr_rt_lm, tpr_rt_lm)
plt.show()

## III. Imbalance Strategy: SMOTE

In [None]:
print(y_train.value_counts()) #original class distribution
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train) #fit_sample was renamed to fit_resample in imblearn
print(pd.Series(y_train_smote).value_counts()) #class distribution after adding synthetic samples

### A) Baseline Vanilla (SMOTE)

In [None]:
# Vanilla regression
logreg_vanilla = LogisticRegression(C=1e9, solver='liblinear', max_iter=200)

model_vanilla = logreg_vanilla.fit(X_train_smote, y_train_smote)

In [None]:
y_probability = model_vanilla.predict_proba(X_train_scaled)[:,1]

get_metric(y_train, y_probability)

#### This is our selected model; now we evaluate it on the test set

In [None]:
y_probability = model_vanilla.predict_proba(X_test_scaled)[:,1] #model_vanilla is the model fitted on the SMOTE data above

get_metric(y_test, y_probability)

- Visualization

In [None]:
p_balance, r_balance, t_balance = precision_recall_curve(y_train, model_vanilla.decision_function(X_train_scaled))

In [None]:
fig, axs = plt.subplots(1,2, figsize=(13,6))

step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})


axs[0].fill_between(r, p, color='#8c8c8c', alpha=0.4, **step_kwargs)
axs[0].set(title='Imbalance Precision-Recall Curve', xlabel='Recall', ylabel='Precision', xlim=(0.0, 1), ylim=(0.0, 1.05))

axs[1].fill_between(r_balance, p_balance, color='r', alpha=0.4, **step_kwargs)
axs[1].set(title='Balanced Precision-Recall Curve', xlabel='Recall', ylabel='Precision', xlim=(0.0, 1), ylim=(0.0, 1.05))
# fig.savefig('Precision-recall curve')

plt.show()

#### B) Lasso regression with different C values

In [None]:
C_values = [0.01]  # low value means high l1 penalty on coefficients

for C in C_values:
    logreg_l1 = LogisticRegression(C=C, penalty='l1',
                                   solver='liblinear',
                                   max_iter=200)
    print('-'*40,f'\nLasso regression with C = {C}')
    model_l1 = logreg_l1.fit(X_train_smote, y_train_smote)
    y_probability = model_l1.predict_proba(X_train_scaled)[:,1]
    get_metric(y_train, y_probability)


#### C) Ridge regression with different C values 

In [None]:
C_values = [0.01]  # low value means high l2 penalty on coefficients

for C in C_values:
    logreg_l2 = LogisticRegression(C=C, penalty='l2',
                                   solver='newton-cg',
                                   max_iter=200)
    
    print('-'*40,f'\nRidge regression with C = {C}')
    model_l2 = logreg_l2.fit(X_train_smote, y_train_smote)
    y_probability = model_l2.predict_proba(X_train_scaled)[:,1]
    get_metric(y_train, y_probability)

#### D) Cross-Validation

In [None]:
cv = StratifiedKFold(n_splits= 5, random_state=1000, shuffle=True)


#### Vanilla

In [None]:
lr_vanilla = LogisticRegression(C=1e9,
                                solver='newton-cg',
                                max_iter=200)


cv_vanilla = cross_validate(estimator=lr_vanilla,
                            X=X_train_smote, y=y_train_smote,
                            cv=cv,
                            n_jobs=-1,
                            return_train_score=True)

In [None]:
y_probability = cross_val_predict(lr_vanilla, X_train_scaled, y_train, cv=cv, method='predict_proba')[:,1]
get_metric(y_train, y_probability)

#### Ridge

In [None]:
l2_reg = LogisticRegression(C=1,
                            solver='newton-cg',
                            penalty='l2',
                            max_iter=200)

cv_l2 = cross_validate(estimator=l2_reg, X=X_train_smote, y=y_train_smote,
                       cv=cv,
                       n_jobs=-1,
                       return_estimator=True,
                       return_train_score=True)

In [None]:
y_probability = cross_val_predict(l2_reg, X_train_scaled, y_train, cv=cv, method='predict_proba')[:,1]
get_metric(y_train, y_probability)

#### Lasso

In [None]:
l1_reg = LogisticRegression(C=1,
                            solver='saga',
                            penalty='l1',
                            max_iter=200)
cv_l1 = cross_validate(estimator=l1_reg, X=X_train_smote, y=y_train_smote,
                       cv=cv,
                       n_jobs=-1,
                       return_estimator=True,
                       return_train_score=True)

In [None]:
y_probability = cross_val_predict(l1_reg, X_train_scaled, y_train, cv=cv, method='predict_proba')[:,1]
get_metric(y_train, y_probability)
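
In the cross-validation cells of this section, `cross_validate` runs on data that was resampled before the folds were created, while `cross_val_predict` runs on the original training data, so its predictions do not reflect SMOTE. A commonly recommended alternative is to resample inside each fold, on the training part only, and to predict on the untouched validation part. The sketch below does this by hand with scikit-learn only. The function names and the simple random oversampler are assumptions used to keep the example self-contained. `SMOTE().fit_resample` has the same `(X, y) -> (X_res, y_res)` call shape and could be passed instead, and `imblearn.pipeline.Pipeline` offers the same behavior in a packaged form.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def cv_predict_with_resampling(estimator, X, y, resample_fn, cv):
    """Out-of-fold default probabilities; resampling touches the training folds only."""
    X, y = np.asarray(X), np.asarray(y)
    proba = np.zeros(len(y), dtype=float)
    for train_idx, val_idx in cv.split(X, y):
        X_res, y_res = resample_fn(X[train_idx], y[train_idx])
        model = clone(estimator).fit(X_res, y_res)
        proba[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    return proba


def random_oversample(X, y, seed=0):
    """Stand-in resampler: duplicates class-1 rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    n_extra = int((y == 0).sum() - minority.size)
    extra = rng.choice(minority, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]


# Made-up imbalanced data: 160 negatives, 40 positives with shifted features
rng = np.random.default_rng(1)
y_demo = np.array([0] * 160 + [1] * 40)
X_demo = rng.normal(size=(200, 3)) + 1.5 * y_demo[:, None]

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
proba_demo = cv_predict_with_resampling(
    LogisticRegression(max_iter=200), X_demo, y_demo, random_oversample, cv_demo)
print(proba_demo[:5])
```

The resulting probabilities can be passed to `get_metric(y_train, proba)` in the same way as the output of `cross_val_predict`.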