<h3>Problem Statement</h3>
<p>Kowope Mart is a Nigeria-based retail company whose vision is to provide quality goods, education and automobile services to its customers at affordable prices, to reduce if not eliminate charges on card payments, and to increase customer satisfaction with credit rewards that can be used within the mall. To achieve this, the company has partnered with DSBank on a co-branded credit card with additional functionality: customers can request a loan, pay for goods even with a zero balance, and then pay back within an agreed period. This strategy has increased sales for the company. However, there have been recent cases of credit default, and Kowope Mart would like a system that profiles customers so that the card goes only to those with minimal, if not zero, risk of defaulting.</p>

<h4>Objective</h4> To predict whether a customer is likely to default.
<h4>Evaluation Metric</h4> Area under the ROC curve (AUC)
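
As a quick illustration of the metric, AUC can be read as the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score. The sketch below uses made-up labels and scores (not competition data) and compares scikit-learn's `roc_auc_score` with that pairwise definition.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented purely for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC as computed by scikit-learn.
toy_auc = roc_auc_score(y_true, y_score)

# The same quantity from the pairwise definition: count positive/negative
# pairs where the positive is ranked higher (ties count as half).
pos = [s for y, s in zip(y_true, y_score) if y == 1]
neg = [s for y, s in zip(y_true, y_score) if y == 0]
pairwise_auc = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(toy_auc, pairwise_auc)
```

Because AUC depends only on how the scores rank the applicants, the submission should contain predicted probabilities rather than hard 0/1 labels.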

<h3>Variable Definitions</h3>
<ul>
    <li>Applicant_ID: Unique customer application identification number</li>

<strong>Risk Scoring features:</strong>
    <li>form_field1: Customer Creditworthiness score based on historical data</li>
    <li>form_field2: A score that measures the number and riskiness of credit enquiries made by a borrower.</li>
    
<strong>Severity features:</strong> Severity is a function of amount, time since default, and number of defaults
    <li>form_field3: Severity of default by the borrower on any loan(s).</li>
    <li>form_field4: Severity of default by the borrower on auto loan(s).</li>
    <li>form_field5: Severity of default by the borrower on education loan(s).</li>

<strong>Credit features</strong>

   <li>form_field6: Minimum of credit available on all credit cards that is automatically renewed as debts are paid off on the
    customer’s cards (in NGN)</li>
    <li>form_field7: Maximum of credit available on customer’s active credit lines (in NGN)</li>
    <li>form_field8: Maximum of credit available on all active credit cards that is automatically renewed as debts are paid off
    on the customer’s cards (in NGN)</li>
    <li>form_field9: Sum of available credit on credit cards that the borrower has missed 1 payment (in NGN)</li>
    <li>form_field10: Total amount of credit available on accepted credit lines (in NGN)</li>
    <li>form_field11: The amount of dues collected post-default where the due amount was more than 500 (in NGN)</li>
    <li>form_field12: Sum of the amount due on active credit cards (in NGN)</li>
    <li>form_field13: Annual amount paid towards all credit cards during the previous year (in NGN)</li>
    <li>form_field14: Annual income (in NGN)</li>
    <li>form_field15: The estimated market value of a property owned/used by the borrower (in NGN)</li>

<strong>Credit card features:</strong>
    <li>form_field16: Number of active credit cards that are automatically renewed as debts are paid off and on which the full credit
    limit is utilized by the borrower</li>
    <li>form_field17: Number of active credit cards on which full credit limit is utilized by the borrower</li>
    <li>form_field18: Number of active credit lines on which full credit limit is utilized by the borrower</li>
    <li>form_field19: Number of active credit cards on which at least 75% credit limit is utilized by the borrower</li>
    <li>form_field20: Number of active credit lines on which at least 75% credit limit is utilized by the borrower</li>
    <li>form_field21: Average utilization of active revolving credit card loans (%)</li>
    <li>form_field22: Average utilization of line on all active credit lines activated in last 2 years (%)</li>
    <li>form_field23: Average utilization of line on all active credit cards activated in last 1 year (%)</li>
    <li>form_field24: Average utilization of line on credit cards on which the borrower has missed 1 payment during the last 6 months (%)</li>
    <li>form_field25: Average tenure of active revolving credit cards (in days)</li>
    <li>form_field26: Tenure of oldest credit card among all active credit cards (in days)</li>
    <li>form_field27: Tenure of oldest revolving credit card among all active revolving credit cards (in days)</li>

<strong>Time-related features</strong>
    <li>form_field28: Number of days since last missed payment on any credit line</li>
    <li>form_field29: Tenure of the oldest credit line (in days)</li>
    <li>form_field30: Maximum tenure on all auto loans (in days)</li>
    <li>form_field31: Maximum tenure on all education loans (in days)</li>
    <li>form_field32: Sum of tenures (in months) of active credit cards</li>
    <li>form_field33: Sum of tenures (in months) of active credit cards</li>
    <li>form_field34: Number of active credit lines over the last 6 months on which the borrower has missed 1 payment</li>
    <li>form_field35: Number of revolving credit cards over the last 2 years on which the borrower has missed 1 payment</li>
    <li>form_field36: Number of active credit lines</li>
    <li>form_field37: Number of credit cards with an active tenure of at least 2 years</li>
    <li>form_field38: Number of credit lines activated in the last 2 years</li>
    <li>form_field39: Number of credit lines on which the borrower has current delinquency</li>

<strong>Percentage related features</strong>
    <li>form_field40: Utilization of line on active education loans (%)</li>
    <li>form_field41: Utilization of line on active auto loans (%)</li>

<strong>Other features</strong>
    <li>form_field42: Financial stress index of the borrower. This index is a function of collection trades, bankruptcies filed, tax liens invoked, etc.</li>
    <li>form_field43: Number of credit lines on which the borrower has never missed a payment in the last 2 years, yet considered as high-risk loans based on the market prediction of the economic scenario</li>
    <li>form_field44: Ratio of the maximum amount due on all active credit lines and the sum of amounts due on all active credit lines</li>
    <li>form_field45: Number of mortgage loans on which the borrower has missed 2 payments</li>
    <li>form_field46: Number of auto loans on which the borrower has missed 2 payments</li>
    <li>form_field47: Type of product that the applicant applied for. (C = Charge; L = Lending)</li>
    <li>form_field48: Undefined Variable</li>
    <li>form_field49: Undefined Variable</li>
    <li>form_field50: Ratio of the minimum amount due on all active credit lines and the sum of amounts due on all active credit lines</li>

<strong>Target features</strong>
    <li>default_status: defaulted or not. (yes:1, no: 0)</li>
</ul>

In [176]:
import os, warnings, gc
warnings.filterwarnings('ignore')

In [177]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import missingno as msno

In [178]:
%matplotlib inline
pd.set_option('display.max_columns', 100)
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 10

In [179]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance
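# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# ConfusionMatrixDisplay and RocCurveDisplay are their replacements.
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, roc_auc_score, auc, classification_report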

In [206]:
import random
seed = 4092
# seed = 2046
random.seed(seed)
np.random.seed(seed)

In [181]:
# os.chdir("/home/is_ifeanyichukwu/Documents/DSN Competition/DSN_Bootcamp_QualificationChallenge")

In [182]:
path = "~/Documents/DSN Competition/DSN_Bootcamp_QualificationChallenge"

In [183]:
#importing files
train  = pd.read_csv(path+'/Train.csv')
test = pd.read_csv(path+'/Test.csv')
subfile = pd.read_csv(path+'/SampleSubmission.csv')

In [184]:
# train.head()

In [185]:
# test.head()

In [186]:
#shape of data
train.shape, test.shape

((56000, 52), (24000, 51))

In [187]:
# train.describe()

In [188]:
# test.describe()

#### Checking the number of unique values

In [189]:
# train.nunique().sort_values()

#### Checking for missing values

In [190]:
# ax = train.isna().sum().transform(lambda x:x/train.shape[0]).sort_values().plot.barh(
#     figsize=(11,11), title='Percentage missing values (Train)', color='indigo', alpha=0.5)
# for i in ax.patches:
#     ax.text(i.get_width()+0.005, i.get_y(), str(int(round(i.get_width(),2)*100))+'%', fontsize=10, color='b', alpha=0.7);

The bar chart shows that many columns in the train data have a large share of missing values.

Let's take a look at the test data.

In [191]:
# ax1 = test.isna().sum().transform(lambda x:x/test.shape[0]).sort_values().plot.barh(
#     figsize=(11,11), title='Percentage missing values (Test)', color='coral', alpha=0.6)
# for i in ax1.patches:
#     ax1.text(i.get_width()+0.005, i.get_y(), 
#              str(int(round(i.get_width(),2)*100))+'%', fontsize=10, color='blue');

Both the train and test data have many missing values; they are handled before modelling.

form_field40 has the most, with almost 80% missing, followed by form_field31 and then form_field41.
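
The ranking above can also be read off numerically rather than from the chart. A minimal sketch, using a small made-up frame as a stand-in for `train` (the values are invented; only the column names are borrowed from the data dictionary):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; values are invented.
toy = pd.DataFrame({
    "form_field1": [1.0, 2.0, 3.0, 4.0],
    "form_field31": [np.nan, np.nan, 5.0, 6.0],
    "form_field40": [np.nan, np.nan, np.nan, 7.0],
})

# isna().mean() gives the fraction of missing rows per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

Applying the same two lines to `train` and `test` gives the per-column percentages that the bar charts display.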

#### Target Distribution

In [192]:
# ax = sns.countplot(x='default_status', data=train, color='coral')
# plt.xticks(range(2), ['No', 'Yes'], fontsize=13)
# plt.title('Target Distribution')

# for i in ax.patches:
#     ax.text(i.get_x()+ 0.28, i.get_height()+ 3, 
#             str(round(i.get_height()/train.shape[0] * 100,2))+'%', 
#             fontsize=13);

As we can see, the data is imbalanced: about 74% of the applicants did not default (No), while roughly a quarter did (Yes). The imbalance may need to be addressed during modelling.
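
One common way to account for such an imbalance, shown here only as a sketch and not as something this notebook applies, is to reweight the classes inversely to their frequency. The labels below are synthetic (75% / 25%), and `weighted_logreg` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 75% / 25% split, loosely mimicking the target.
y_toy = np.array([0] * 75 + [1] * 25)
class_share = np.bincount(y_toy) / len(y_toy)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(class_share, weights)

# The same weighting can be requested directly from the estimator.
weighted_logreg = LogisticRegression(class_weight="balanced")
```

Since AUC depends only on the ranking of predicted probabilities, reweighting may change it very little; it tends to matter more for threshold-based metrics such as recall.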

In [193]:
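# Label-encode the target. Assigning the result back avoids relying on an
# in-place replace through attribute access, which newer pandas may not apply.
train['default_status'] = train['default_status'].replace({'no': 0, 'yes': 1})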

In [194]:
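def model_auc(model):
    """Print train and validation AUC; relies on the global x_train, y_train, x_val, y_val split."""
    # Column 1 of predict_proba holds the predicted probability of default (class 1).
    train_auc = roc_auc_score(y_train, model.predict_proba(x_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(x_val)[:, 1])
    print(f'Train AUC: {train_auc}, Val Auc: {val_auc}')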

In [195]:
# def plot_figure(model):
#     # Needs ConfusionMatrixDisplay and RocCurveDisplay from sklearn.metrics (scikit-learn >= 1.0).
#     fig, ax = plt.subplots(1, 2, figsize=(10, 5))
#     ConfusionMatrixDisplay.from_estimator(model, x_val, y_val, ax=ax[0], display_labels=['No', 'Yes'])
#     RocCurveDisplay.from_estimator(model, x_val, y_val, ax=ax[1])
#     ax[0].set(title='Confusion Matrix')
#     ax[1].set(title='ROC Curve')
#     ax[0].grid(False)
#     ax[1].grid(False)
#     ax[1].legend(loc='center right')
#     plt.tight_layout(pad=1.0)
#### Filling Missing Values with zeros

In [196]:
train.fillna(0, inplace=True)
test.fillna(0, inplace=True)
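
Filling with zeros makes a missing value indistinguishable from a genuine zero. One possible variant, sketched here under the assumption that missingness itself may be informative, adds a 0/1 flag column per affected feature before filling. The helper `fill_with_flags` and the `_missing` suffix are hypothetical names, and the frame is a toy stand-in.

```python
import numpy as np
import pandas as pd

def fill_with_flags(df: pd.DataFrame, fill_value: float = 0) -> pd.DataFrame:
    """Return a copy of df with NaNs filled and a <col>_missing flag per affected column."""
    out = df.copy()
    # Only columns that actually contain NaNs get a flag column.
    for col in df.columns[df.isna().any()]:
        out[col + "_missing"] = df[col].isna().astype(int)
    return out.fillna(fill_value)

# Toy stand-in for one sparse column; values are invented.
toy = pd.DataFrame({"form_field40": [np.nan, 12.5, np.nan, 0.0]})
filled = fill_with_flags(toy)
print(filled)
```

If this variant were used, the same function would have to be applied to both `train` and `test` so that the two frames keep identical columns.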

In [148]:
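# Restrict to numeric columns; non-numeric ones such as form_field47 cannot be correlated.
corr = train.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (and diagonal) in the heatmap.
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True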

In [197]:
# plt.figure(figsize=(15,13))
# sns.heatmap(corr, mask=mask)
# plt.title('Correlation Matrix');
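
With around fifty features a heatmap can be hard to read, so listing the most strongly correlated pairs may be a useful complement. A minimal sketch on synthetic columns (`a`, `b`, `c` are invented; `b` is built to track `a` closely):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a linear copy of a
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().abs().sort_values(ascending=False)
print(pairs.head())
```

On the real data the same steps would start from `train.corr(numeric_only=True)`.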

#### Baseline Model

In [198]:
test_id = test.Applicant_ID

In [199]:
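features = train.columns.drop(['Applicant_ID', 'default_status'])
# One-hot encode the categorical form_field47; numeric columns pass through unchanged.
X = pd.get_dummies(train[features])
y = train.default_status
# Align test to the training columns so both frames share the same dummy columns in the same order.
test = pd.get_dummies(test[features]).reindex(columns=X.columns, fill_value=0)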

In [217]:
# test_size=0.1 matches the recorded split shapes below (50400 train rows, 5600 validation rows).
x_train, x_val, y_train, y_val = train_test_split(X, y, random_state=seed, test_size=0.1, stratify=y)

In [218]:
x_train.shape, x_val.shape

((50400, 51), (5600, 51))

In [227]:
scaler = StandardScaler()

In [229]:
logreg = LogisticRegression(random_state=seed)
logreg_pl = Pipeline([('scaler', scaler), ('logreg', logreg)])

In [230]:
logreg_pl.fit(x_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('logreg', LogisticRegression(random_state=4092))])

In [231]:
model_auc(logreg_pl)

Train AUC: 0.8278967297025593, Val Auc: 0.8265753517746504
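
A single train/validation split gives one AUC estimate. A stratified k-fold run is one way to see how much that estimate varies; the sketch below uses synthetic data from `make_classification` rather than the competition files, so its scores say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for (X, y).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=4092
)

# Scaling sits inside the pipeline so it is refit on each training fold only.
pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=4092)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```

Passing `logreg_pl`, `X` and `y` in place of the toy objects would give the cross-validated AUC for the baseline.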


In [133]:
# plot_figure(logreg_pl)

In [132]:
# print(classification_report(y_val, logreg_pl.predict(x_val)))

#### Submission

In [128]:
test_predictions = logreg_pl.predict_proba(test)[:, 1]

In [129]:
submission = pd.DataFrame({'Applicant_ID' : test_id, 'default_status' : test_predictions})

In [130]:
submission.to_csv(path+'/Logreg_baseline.csv', index=False)
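
Before uploading, it may be worth checking the submission against the sample file. The sketch below assumes the sample has the columns `Applicant_ID` and `default_status`; `check_submission` is a hypothetical helper and the frames are toy stand-ins for `subfile` and `submission`.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if submission does not match the layout of sample."""
    assert list(submission.columns) == list(sample.columns), "column names or order differ"
    assert len(submission) == len(sample), "row count differs"
    assert submission["Applicant_ID"].is_unique, "duplicate applicant IDs"
    # between() is False for NaN, so missing predictions are caught as well.
    assert submission["default_status"].between(0, 1).all(), "probabilities outside [0, 1]"

# Toy stand-ins; IDs and probabilities are invented.
sample = pd.DataFrame({"Applicant_ID": ["A1", "A2", "A3"], "default_status": [0, 0, 0]})
toy_submission = pd.DataFrame({"Applicant_ID": ["A1", "A2", "A3"], "default_status": [0.12, 0.80, 0.45]})

check_submission(toy_submission, sample)
print("submission layout looks consistent")
```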