As before, we load our data and perform the necessary data manipulations.

In [1]:
import pandas as pd
credit_df = pd.read_csv("MyCreditData.csv")
credit_df.head()

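| | checking_account | duration | credit_history | purpose | amount | savings_account | employment_duration | installment_rate | other_debtors | present_residence | ... | age | other_installment_plans | housing | number_credits | job | people_liable | telephone | foreign_worker | gender | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 18 | 0 | 2 | 1049 | 4 | 2 | 2 | 2 | 3 | ... | 21 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | female | 242 |
| 1 | 3 | 9 | 0 | 5 | 2799 | 4 | 0 | 1 | 2 | 0 | ... | 36 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | male | 596 |
| 2 | 0 | 12 | 4 | 8 | 841 | 0 | 1 | 1 | 2 | 3 | ... | 23 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | female | 25 |
| 3 | 3 | 12 | 0 | 5 | 2122 | 4 | 0 | 0 | 2 | 0 | ... | 39 | 1 | 0 | 1 | 3 | 1 | 0 | 1 | male | 568 |
| 4 | 3 | 12 | 0 | 5 | 2171 | 4 | 0 | 2 | 2 | 3 | ... | 38 | 0 | 2 | 1 | 3 | 0 | 0 | 1 | male | 782 |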


In [12]:
# Ensure pandas treats the categorical variables as categorical
non_categorical_columns = ['duration', 'amount', 'age', 'profit']
for column in credit_df.columns:
    if column not in non_categorical_columns:
        credit_df[column] = pd.Categorical(credit_df[column])

We now create a binary dependent variable, *is_profitable*, indicating whether *profit* is positive:

In [13]:
import numpy as np
credit_df["is_profitable"] = np.where(credit_df['profit'] > 0, 1, 0)

As before, we create dummy variables, standardize the numeric features, and split the data for modeling:

In [14]:
y = credit_df['is_profitable']
X = credit_df.iloc[:, :-2] # All columns but the last two, profit and is_profitable

# Use dummy variables for categorical variables
X = pd.get_dummies(X, drop_first=True)

# Standardize our non-dummy variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X[['duration', 'amount', 'age']] = scaler.fit_transform(X[['duration', 'amount', 'age']])

# split into 70% training 30% validation
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)
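
To see what `drop_first=True` does in the dummy-variable step, here is a minimal sketch on a toy column. The data frame `toy` and its values are made up for illustration and are not taken from the credit data. A categorical variable with three levels yields two dummy columns, with the first level serving as the baseline.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with levels 0, 1, 2
toy = pd.DataFrame({"housing": pd.Categorical([0, 1, 2, 0])})

# drop_first=True omits the dummy for the first level (0), which becomes the baseline
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(list(toy_dummies.columns))  # ['housing_1', 'housing_2']
```

Dropping one level per variable avoids perfect collinearity between the dummies and the constant term in the unregularized model.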

Our goal is to predict *is_profitable* using all available information. We fit a full logistic regression with two libraries. We start with statsmodels (sm), because its *summary* method reports p-values and other metrics:

In [19]:
import statsmodels.api as sm
log_reg_sm = sm.Logit(y_train, sm.add_constant(X_train.astype(float))).fit()
log_reg_sm.summary()

Optimization terminated successfully.
         Current function value: 0.431731
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | is_profitable | No. Observations: | 700.0 |
| Model: | Logit | Df Residuals: | 646.0 |
| Method: | MLE | Df Model: | 53.0 |
| Date: | Wed, 27 Dec 2023 | Pseudo R-squ.: | 0.2845 |
| Time: | 20:15:42 | Log-Likelihood: | -302.21 |
| converged: | True | LL-Null: | -422.4 |
| Covariance Type: | nonrobust | LLR p-value: | 1.113e-25 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.7762 | 1.165 | 0.667 | 0.505 | -1.506 | 3.059 |
| duration | -0.3038 | 0.143 | -2.129 | 0.033 | -0.584 | -0.024 |
| amount | -0.3546 | 0.155 | -2.287 | 0.022 | -0.659 | -0.051 |
| age | 0.1443 | 0.131 | 1.104 | 0.270 | -0.112 | 0.401 |
| checking_account_1 | 1.3976 | 0.281 | 4.969 | 0.000 | 0.846 | 1.949 |
| checking_account_2 | 1.1047 | 0.528 | 2.092 | 0.036 | 0.069 | 2.140 |
| checking_account_3 | -0.3648 | 0.270 | -1.350 | 0.177 | -0.894 | 0.165 |
| credit_history_1 | -1.3090 | 0.549 | -2.384 | 0.017 | -2.385 | -0.233 |
| credit_history_2 | -1.3582 | 0.520 | -2.611 | 0.009 | -2.378 | -0.339 |
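
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. As a minimal sketch, the snippet below does this for the *checking_account_1* row, with the numbers copied by hand from the summary output; the variable names are illustrative only.

```python
import numpy as np

# Coefficient and 95% confidence bounds for checking_account_1, copied from the summary output
coef, ci_low, ci_high = 1.3976, 0.846, 1.949

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio = np.exp(coef)
odds_ratio_ci = (np.exp(ci_low), np.exp(ci_high))
print(round(odds_ratio, 2), tuple(round(b, 2) for b in odds_ratio_ci))
```

Under this reading, and holding the other features fixed, the odds of a profitable loan for *checking_account_1* are roughly four times those of the baseline checking-account level.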


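However, statsmodels has no built-in accuracy score, so we compute it manually by rounding the predicted probabilities to 0 or 1 and comparing them with the validation labels: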

In [21]:
y_pred = np.round(log_reg_sm.predict(sm.add_constant(X_val.astype('float'))))
score = sum(y_pred == y_val) / len(y_val)
score

0.7666666666666667
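
The manual calculation amounts to thresholding the predicted probabilities at 0.5 and taking the share of correct predictions. Here is a minimal sketch with made-up numbers; `probs` and `labels` are illustrative and are not model output.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
probs = np.array([0.91, 0.35, 0.62, 0.08, 0.77])
labels = np.array([1, 0, 0, 0, 1])

# Classify as 1 when the predicted probability is above 0.5
preds = (probs > 0.5).astype(int)

# Accuracy is the fraction of predictions that match the labels
accuracy = np.mean(preds == labels)
print(accuracy)  # 0.8
```

Note that `np.round` rounds halves to the nearest even number, so a probability of exactly 0.5 is rounded down to 0; for probabilities between 0 and 1 it therefore agrees with the `> 0.5` rule used here.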

On the other hand, scikit-learn (sklearn) does not provide p-values, but it makes predictive-quality metrics such as the accuracy score readily available:

In [22]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# penalty=None fits an unregularized model; scikit-learn versions before 1.2 used penalty='none'
log_reg_sklearn = LogisticRegression(penalty=None, solver='lbfgs',
                                     max_iter=200)
log_reg_sklearn.fit(X_train, y_train)
log_reg_sklearn.score(X_val, y_val)



0.7666666666666667

We now run Lasso regression (setting the penalty to $l1$), using LogisticRegressionCV to find the best regularization constant by cross-validation:

In [23]:
C_list = np.arange(0.01,2,0.01)
lasso_reg_cv = LogisticRegressionCV(Cs=C_list, penalty='l1', solver='liblinear', 
                                    random_state=0)
lasso_reg_cv.fit(X_train, y_train)
lasso_reg_cv.C_

array([1.11])
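
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values penalize the coefficients more heavily. The sketch below illustrates this on synthetic data generated with `make_classification`. The data set and the two `C` values are assumptions chosen for illustration and are not part of the credit analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 300 rows, 20 features, 5 of them informative
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_redundant=0, random_state=0)

# Count the nonzero coefficients of an L1-penalized fit for a strong and a weak penalty
nonzero_counts = {}
for C in [0.01, 10.0]:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C,
                               random_state=0).fit(X_toy, y_toy)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

With the $l1$ penalty, a smaller `C` typically drives more coefficients to exactly zero, which is why Lasso is often used for feature selection.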

In [2]:
lasso_reg = LogisticRegression(penalty='l1', solver='liblinear', C=lasso_reg_cv.C_[0],
                               random_state=0).fit(X_train, y_train)
lasso_reg.score(X_val, y_val)


We then do the same for Ridge regression (setting the penalty to $l2$):

In [3]:
C_list = np.arange(0.01,2,0.01)
ridge_reg_cv = LogisticRegressionCV(Cs=C_list, penalty='l2', solver='lbfgs', max_iter=100,
                                    random_state=0).fit(X_train, y_train)
ridge_reg_cv.C_


In [4]:
ridge_reg = LogisticRegression(penalty='l2', solver='lbfgs', C=ridge_reg_cv.C_[0],
                               max_iter=1000, random_state=0).fit(X_train, y_train)
ridge_reg.score(X_val, y_val)


We now compare the coefficients between these three models:

In [27]:
# Collect the intercept and coefficients of each model in one flat array per model
full_coef = np.asarray(log_reg_sm.params)
lasso_coef = np.append(lasso_reg.intercept_, lasso_reg.coef_)
ridge_coef = np.append(ridge_reg.intercept_, ridge_reg.coef_)
coef_names = ['const'] + list(X.columns)

coef_table = pd.concat([pd.DataFrame(coef_names), pd.DataFrame(full_coef), pd.DataFrame(lasso_coef), 
                        pd.DataFrame(ridge_coef)], axis = 1)
coef_table.columns = ['', 'Full', 'Lasso', 'Ridge']
coef_table.set_index('', inplace=True)
coef_table

| | Full | Lasso | Ridge |
|---|---|---|---|
| const | 0.776244 | 0.0 | 0.296984 |
| duration | -0.30384 | -0.304285 | -0.300139 |
| amount | -0.354637 | -0.282732 | -0.224082 |
| age | 0.144298 | 0.128115 | 0.149899 |
| checking_account_1 | 1.397572 | 1.307906 | 0.889809 |
| checking_account_2 | 1.104656 | 0.874662 | 0.333973 |
| checking_account_3 | -0.364776 | -0.348048 | -0.427901 |
| credit_history_1 | -1.308991 | -0.864284 | -0.351019 |
| credit_history_2 | -1.358166 | -0.930499 | -0.405847 |


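We see that some coefficients are similar across the three models (for example, *duration*), while for many other features Lasso and Ridge shrink the coefficient toward zero. For instance, *credit_history_1* moves from about -1.31 in the full model to -0.86 with Lasso and -0.35 with Ridge.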
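
To quantify the shrinkage, one can compare absolute coefficient values against the full model. The sketch below rebuilds a few rows of the coefficient table by hand (values copied from the output; the names `coefs` and `shrinkage` are illustrative) and computes the ratio of each regularized coefficient to its unregularized counterpart. Values below 1 indicate shrinkage toward zero.

```python
import pandas as pd

# A few rows copied from the coefficient table
coefs = pd.DataFrame(
    {"Full": [-0.303840, -0.354637, 1.104656, -0.364776, -1.308991],
     "Lasso": [-0.304285, -0.282732, 0.874662, -0.348048, -0.864284],
     "Ridge": [-0.300139, -0.224082, 0.333973, -0.427901, -0.351019]},
    index=["duration", "amount", "checking_account_2",
           "checking_account_3", "credit_history_1"])

# Ratio of |regularized coefficient| to |full-model coefficient|, per feature
shrinkage = coefs[["Lasso", "Ridge"]].abs().div(coefs["Full"].abs(), axis=0)
print(shrinkage.round(2))
```

In these rows most ratios are below 1, but not all: Ridge gives *checking_account_3* a slightly larger absolute coefficient than the full model (about 0.43 versus 0.36).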