# Give Me Some Credit

<p style="font-size:15.5px;">Imagine you need a loan of $100,000. How should a lender decide whether to approve it? Machine learning offers one answer. By analyzing historical data on borrowers, including their credit history, income, employment status, and repayment behavior, a model can estimate the likelihood that a borrower will repay the loan. Such models can pick up patterns and correlations that are hard to capture with hand-written rules, which can lead to a more accurate risk assessment. Lenders can use these estimates to make better-informed decisions and reduce the risk of default. Machine learning thus becomes a useful tool for assessing a borrower's reliability and making prudent lending decisions.</p>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/GiveMeSomeCredit/Data Dictionary.xls
/kaggle/input/GiveMeSomeCredit/cs-training.csv
/kaggle/input/GiveMeSomeCredit/sampleEntry.csv
/kaggle/input/GiveMeSomeCredit/cs-test.csv


In [2]:
train=pd.read_csv("/kaggle/input/GiveMeSomeCredit/cs-training.csv")
se=pd.read_csv("/kaggle/input/GiveMeSomeCredit/sampleEntry.csv")
test=pd.read_csv("/kaggle/input/GiveMeSomeCredit/cs-test.csv")

In [3]:
se.head()

Unnamed: 0,Id,Probability
0,1,0.080807
1,2,0.040719
2,3,0.011968
3,4,0.06764
4,5,0.108264


In [4]:
train.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [5]:
pip install xlrd

Collecting xlrd
  Downloading xlrd-2.0.1-py2.py3-none-any.whl.metadata (3.4 kB)
Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96.5/96.5 kB 2.4 MB/s eta 0:00:00
Installing collected packages: xlrd
Successfully installed xlrd-2.0.1
Note: you may need to restart the kernel to use updated packages.


In [6]:
explanations=pd.read_excel("/kaggle/input/GiveMeSomeCredit/Data Dictionary.xls")

In [7]:
explanations

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,Variable Name,Description,Type
1,SeriousDlqin2yrs,Person experienced 90 days past due delinquenc...,Y/N
2,RevolvingUtilizationOfUnsecuredLines,Total balance on credit cards and personal lin...,percentage
3,age,Age of borrower in years,integer
4,NumberOfTime30-59DaysPastDueNotWorse,Number of times borrower has been 30-59 days p...,integer
5,DebtRatio,"Monthly debt payments, alimony,living costs di...",percentage
6,MonthlyIncome,Monthly income,real
7,NumberOfOpenCreditLinesAndLoans,Number of Open loans (installment like car loa...,integer
8,NumberOfTimes90DaysLate,Number of times borrower has been 90 days or m...,integer
9,NumberRealEstateLoansOrLines,Number of mortgage and real estate loans inclu...,integer


In [8]:
test.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,,0.885519,43,0,0.177513,5700.0,4,0,0,0,0.0
1,2,,0.463295,57,0,0.527237,9141.0,15,0,4,0,2.0
2,3,,0.043275,59,0,0.687648,5083.0,12,0,1,0,2.0
3,4,,0.280308,38,1,0.925961,3200.0,7,0,2,0,0.0
4,5,,1.0,27,0,0.019917,3865.0,4,0,0,0,1.0


In [9]:
train["RevolvingUtilizationOfUnsecuredLines"]=train["RevolvingUtilizationOfUnsecuredLines"]*100
test["RevolvingUtilizationOfUnsecuredLines"]=test["RevolvingUtilizationOfUnsecuredLines"]*100
train["DebtRatio"]=train["DebtRatio"]*100
test["DebtRatio"]=test["DebtRatio"]*100

In [10]:
train.drop("Unnamed: 0",axis=1,inplace=True)
test.drop("Unnamed: 0",axis=1,inplace=True)

In [11]:
train.isnull().sum()

SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64

In [12]:
train.shape

(150000, 11)

In [13]:
test.isnull().sum()

SeriousDlqin2yrs                        101503
RevolvingUtilizationOfUnsecuredLines         0
age                                          0
NumberOfTime30-59DaysPastDueNotWorse         0
DebtRatio                                    0
MonthlyIncome                            20103
NumberOfOpenCreditLinesAndLoans              0
NumberOfTimes90DaysLate                      0
NumberRealEstateLoansOrLines                 0
NumberOfTime60-89DaysPastDueNotWorse         0
NumberOfDependents                        2626
dtype: int64

In [14]:
import warnings
warnings.filterwarnings('ignore')
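# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame.
train["MonthlyIncome"]=train["MonthlyIncome"].fillna(train["MonthlyIncome"].mean())
test["MonthlyIncome"]=test["MonthlyIncome"].fillna(test["MonthlyIncome"].mean())
train["NumberOfDependents"]=train["NumberOfDependents"].fillna(train["NumberOfDependents"].mean())
test["NumberOfDependents"]=test["NumberOfDependents"].fillna(test["NumberOfDependents"].mean())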
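The cell above fills gaps in `test` with statistics computed on `test` itself. A common alternative is to compute the fill value on the training data only and reuse it for the test data, optionally keeping a flag for rows that were missing. The sketch below illustrates that idea on made-up toy frames; the helper `impute_with_train_median`, the choice of the median, and the `_missing` flag columns are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the notebook's train and test (values are made up).
toy_train = pd.DataFrame({"MonthlyIncome": [3000.0, np.nan, 5000.0, 100000.0],
                          "NumberOfDependents": [0.0, 2.0, np.nan, 1.0]})
toy_test = pd.DataFrame({"MonthlyIncome": [np.nan, 4200.0],
                         "NumberOfDependents": [np.nan, 3.0]})

def impute_with_train_median(train_df, test_df, columns):
    """Fill NaNs in both frames using medians computed on train_df only (hypothetical helper)."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        fill_value = train_df[col].median()  # statistic comes from the training data only
        # Keep a 0/1 flag so a model can still see which rows were originally missing.
        train_out[col + "_missing"] = train_df[col].isna().astype(int)
        test_out[col + "_missing"] = test_df[col].isna().astype(int)
        train_out[col] = train_df[col].fillna(fill_value)
        test_out[col] = test_df[col].fillna(fill_value)
    return train_out, test_out

toy_train_filled, toy_test_filled = impute_with_train_median(
    toy_train, toy_test, ["MonthlyIncome", "NumberOfDependents"])
print(toy_test_filled)
```

The median is used here because a single very large income (100000 in the toy data) pulls the mean upward; whether that matters for the real data is something to check rather than assume.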

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Features and target
x=train.drop("SeriousDlqin2yrs",axis=1)
y=train["SeriousDlqin2yrs"]
# Hold out 20% of the labeled data for evaluation
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
# Fit three baseline classifiers with default settings
dtc=DecisionTreeClassifier()
lr=LogisticRegression()
rf=RandomForestClassifier()
dtc.fit(x_train,y_train)
lr.fit(x_train,y_train)
rf.fit(x_train,y_train)
y_pred_dtc=dtc.predict(x_test)
y_pred_lr=lr.predict(x_test)
y_pred_rf=rf.predict(x_test)
# Accuracy on the held-out split: decision tree, logistic regression, random forest
print(accuracy_score(y_test,y_pred_dtc))
print(accuracy_score(y_test,y_pred_lr))
print(accuracy_score(y_test,y_pred_rf))

0.8972
0.9348
0.9373333333333334
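Accuracy can look high on this kind of data even when a model separates the classes poorly, because most borrowers are not delinquent. Since `sampleEntry.csv` asks for a `Probability` column, a ranking metric such as ROC AUC computed on predicted scores is probably a more informative check. The sketch below uses synthetic labels with an assumed positive rate of about 7% (not measured from the dataset) to show how a model that never flags anyone still scores well on accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic labels with roughly 7% positives (an assumed rate, for illustration only).
y_true = (rng.random(10_000) < 0.07).astype(int)

# A "model" that never flags anyone: hard label 0 and the same score for every row.
always_negative_labels = np.zeros_like(y_true)
constant_scores = np.full(y_true.shape, 0.07)

baseline_accuracy = accuracy_score(y_true, always_negative_labels)
baseline_auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy: {baseline_accuracy:.3f}, ROC AUC: {baseline_auc:.3f}")
```

For the classifiers in the notebook, the same comparison would use `predict_proba(x_test)[:, 1]` as the score passed to `roc_auc_score`, rather than the hard labels returned by `predict`.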


In [16]:
feature_importance = rf.feature_importances_
columns=x.columns
feature_importance_sorting = sorted(zip(columns, feature_importance), key=lambda x: x[1], reverse=True)
for f, i in feature_importance_sorting:
    print(f"{f}: {i}")

RevolvingUtilizationOfUnsecuredLines: 0.1923711001152444
DebtRatio: 0.17776574350081925
MonthlyIncome: 0.1458017611485042
age: 0.12755176584998376
NumberOfTimes90DaysLate: 0.09308745716471614
NumberOfOpenCreditLinesAndLoans: 0.088854238913678
NumberOfTime30-59DaysPastDueNotWorse: 0.050891894380006575
NumberOfTime60-89DaysPastDueNotWorse: 0.045272435949719025
NumberOfDependents: 0.0438407387038383
NumberRealEstateLoansOrLines: 0.034562864273490465


In [17]:
# Drop every feature whose importance is below 0.12 from both frames.
for f, i in feature_importance_sorting:
    if i < 0.12:
        train.drop(f, axis=1, inplace=True)
        test.drop(f, axis=1, inplace=True)
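The loop above modifies `train` and `test` in place, so the dropped columns are gone for every later cell. A non-destructive variant builds a list of columns to keep and selects them into new frames. The sketch below uses the importances printed by cell 16, rounded to four decimals; the names `selected_features`, `toy_frame`, and `toy_reduced` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Importances copied from the output of cell 16, rounded to four decimals.
importances = {
    "RevolvingUtilizationOfUnsecuredLines": 0.1924,
    "DebtRatio": 0.1778,
    "MonthlyIncome": 0.1458,
    "age": 0.1276,
    "NumberOfTimes90DaysLate": 0.0931,
    "NumberOfOpenCreditLinesAndLoans": 0.0889,
    "NumberOfTime30-59DaysPastDueNotWorse": 0.0509,
    "NumberOfTime60-89DaysPastDueNotWorse": 0.0453,
    "NumberOfDependents": 0.0438,
    "NumberRealEstateLoansOrLines": 0.0346,
}
threshold = 0.12  # same cut-off as the loop above

# Keep the names at or above the threshold instead of dropping the rest in place.
selected_features = [name for name, score in importances.items() if score >= threshold]

# Toy frame with one made-up row, only to show the column selection.
toy_frame = pd.DataFrame([dict.fromkeys(importances, 0.0)])
toy_reduced = toy_frame[selected_features]  # toy_frame itself is left unchanged
print(selected_features)
```

With this threshold only four of the ten features survive, including none of the past-due counters, so it may be worth comparing a few thresholds before committing to one.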


In [18]:
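# x_train and x_test were split before the columns were dropped, so this refit
# still sees all ten features; the split is only rebuilt in cell 21.
rf.fit(x_train,y_train)
y_pred_rf=rf.predict(x_test)
print(accuracy_score(y_test,y_pred_rf))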

0.9369


In [19]:
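from sklearn.preprocessing import scale
# Standardize each column to zero mean and unit variance.
# Note: cell 21 rebuilds x_train and x_test from train, so these scaled arrays
# are not the ones the network is trained on.
x_train=scale(x_train)
x_test=scale(x_test)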
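`scale` standardizes each array with that array's own mean and standard deviation, so the training split, the validation split, and later the test set are each shifted and rescaled differently. A common alternative is to fit a `StandardScaler` on the training split and reuse it everywhere else. The sketch below shows this on random toy arrays; the `toy_` names are placeholders, and in the notebook the scaling would also need to happen after the final train/validation split so that the network receives the scaled arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy feature matrices standing in for the training and validation splits.
toy_x_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_x_valid = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
toy_x_train_scaled = scaler.fit_transform(toy_x_train)  # learn mean and std from train
toy_x_valid_scaled = scaler.transform(toy_x_valid)      # reuse the same mean and std

print(toy_x_train_scaled.mean(axis=0).round(3), toy_x_valid_scaled.mean(axis=0).round(3))
```

The validation means will generally not be exactly zero, which is expected: the validation data is expressed in the training data's units rather than re-centered on itself.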

In [20]:
from keras.models import Sequential
from keras.layers import Dense



In [21]:
x=train.drop("SeriousDlqin2yrs",axis=1)
y=train["SeriousDlqin2yrs"]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [22]:
model=Sequential()
model.add(Dense(128,activation="relu"))
model.add(Dense(64,activation="relu"))
model.add(Dense(32,activation="relu"))
model.add(Dense(1,activation="sigmoid"))
model.compile(optimizer="adam",loss="binary_crossentropy",metrics=["accuracy"])
model.fit(x_train,y_train,epochs=50,validation_data=(x_test,y_test))

Epoch 1/50
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 8s 2ms/step - accuracy: 0.8818 - loss: 69.3163 - val_accuracy: 0.9347 - val_loss: 0.5773
Epoch 2/50
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9252 - loss: 0.6795 - val_accuracy: 0.9321 - val_loss: 0.3326
Epoch 3/50
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9308 - loss: 0.2883 - val_accuracy: 0.9348 - val_loss: 0.2410
Epoch 4/50
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9343 - loss: 0.2426 - val_accuracy: 0.9348 - val_loss: 0.2412
Epoch 5/50
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9334 - loss: 0.2448 - val_accuracy: 0.9348 - val_loss: 0.2410
Epoch 6/50
3750/3750 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.9327 - loss: 0.2490 - val_accuracy: 0.9348 - val_loss: 0.2410
Epoch 7/50
...

<keras.src.callbacks.history.History at 0x78935b806ad0>
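In the log above, `val_loss` settles at 0.2410 while `val_accuracy` stays at 0.9348. One way to read these numbers is to compare them with a model that outputs the same probability for every borrower. The sketch below computes the binary cross-entropy of such a constant prediction with NumPy. It assumes that 0.9348 is the accuracy of predicting "no delinquency" for everyone, which would put the positive rate of the validation split at 1 - 0.9348; that rate is inferred from the log, not measured from the data.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Assumption: 0.9348 validation accuracy corresponds to predicting "0" for everyone.
assumed_positive_rate = 1 - 0.9348
n = 30_000  # 20% of the 150,000 training rows
n_pos = round(n * assumed_positive_rate)
y_valid_toy = np.array([1] * n_pos + [0] * (n - n_pos))

# A constant prediction equal to the base rate of the toy labels.
constant_pred = np.full(n, y_valid_toy.mean())
constant_bce = binary_cross_entropy(y_valid_toy, constant_pred)
print(f"{constant_bce:.4f}")
```

Under that assumption the constant predictor's loss comes out at about 0.241, very close to the reported `val_loss`. This suggests, without proving it, that the network may have learned little beyond the base rate, which would be consistent with it being trained on unscaled inputs.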

In [23]:
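test.drop("SeriousDlqin2yrs",axis=1,inplace=True)
# Note: the network above was fit on the unscaled split from cell 21, while the test
# features are standardized here, so the two inputs are on different scales.
test=scale(test)
pred=model.predict(test)

3172/3172 ━━━━━━━━━━━━━━━━━━━━ 3s 1ms/step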


In [26]:
se["Probability"]=pred.ravel()  # flatten the (n, 1) prediction array into a single column
se.head()

Unnamed: 0,Id,Probability
0,1,0.032505
1,2,0.02707
2,3,0.026126
3,4,0.040926
4,5,0.059988


In [27]:
# index=False keeps the file to the Id and Probability columns of sampleEntry.csv.
se.to_csv("submission.csv",index=False)
model.save("model.h5")

The resulting metrics are not very good. Metrics alone, however, are not enough to understand how the models perform, so I will develop a Streamlit application.