# Credit Score Prediction

## Introduction

A credit score is a numerical expression based on a level analysis of a person's credit files, to represent the creditworthiness of an individual. A credit score is primarily based on a credit report information typically sourced from credit bureaus.

Lenders, such as banks and credit card companies, use credit scores to evaluate the potential risk posed by lending money to consumers and to mitigate losses due to bad debt. Lenders use credit scores to determine who qualifies for a loan, at what interest rate, and what credit limits. Lenders also use credit scores to determine which customers are likely to bring in the most revenue. The use of credit or identity scoring prior to authorizing access or granting credit is an implementation of a trusted system.

Credit scoring is not limited to banks. Other organizations, such as mobile phone companies, insurance companies, landlords, and government departments employ the same techniques. Digital finance companies such as online lenders also use alternative data sources to calculate the creditworthiness of borrowers. Credit scoring also has much overlap with data mining, which uses many similar techniques. These techniques combine thousands of factors but are similar or identical.

## Credit Score prediction using Machine Learning
We are using some standard machine learning approaches and algorithms in order to predict the credit score of a customer. 

Let us load all the libraries and other custom written functionalities to our notebook.

In [604]:
import pandas as pd
import datetime
from collections import Counter, OrderedDict
import copy
from scipy import stats
import numpy as np
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from iv import WOE

Let us have all the given datasets on board!!

In [None]:
train_raw_df = pd.read_csv('references/test_data/raw_data_70_new.csv', low_memory=False)
train_account_df = pd.read_csv('references/test_data/raw_account_70_new.csv', low_memory=False)
train_enquiry_df = pd.read_csv('references/test_data/raw_enquiry_70_new.csv', low_memory=False)
test_raw_df = pd.read_csv('references/test_data/raw_data_30_new.csv', low_memory=False)
test_account_df = pd.read_csv('references/test_data/raw_account_30_new.csv', low_memory=False)
test_enquiry_df = pd.read_csv('references/test_data/raw_enquiry_30_new.csv', low_memory=False)

# Datasets custom prepared
dss_train_df = pd.read_csv('references/test_data/dss_data_70_new.csv', low_memory=False)
dss_test_df = pd.read_csv('references/test_data/dss_data_30_new.csv', low_memory=False)

Below is a custom function created in order to understand the dataset faster and better

In [None]:
def describe(df, shape=True, columns=False, missing_vals=False):
    # dimensions of the dataset
    print('Data dimensions: ', df.shape, '\n') if shape else None

    # column names
    print('Column names: ', df.columns, '\n') if columns else None

    # Number of missing values in each column
    print('Number of missing values in each column \n', df.isnull().sum(axis=0)) if missing_vals else None

Below function is used to check the dataset's NaN values

In [None]:
get_nans = lambda df: df[df.isnull().any(axis=1)]

Below function returns the attribute columns in a dataframe by removing the customer ID and the target label

In [None]:
def get_attr_cols(cols):
    """
    Return a list of attribute columns
    """
    non_attr_cols = ['customer_no','Bad_label']
    return [col for col in cols if col not in non_attr_cols]

Lets have a validity function which can validate the value for None and NaN

In [None]:
def validity(field):
    """
    Returns True for a valid field else False
    """
    validity = True if field is not None and field == field else False
    return validity

### 1. Data Description

Now, let's use the functionality we created to understand the datasets we have

In [None]:
# Describe the Raw data
describe(train_raw_df, columns=True)

# Event rate
print('Event rate: ',sum(train_raw_df.loc[:, 'Bad_label'])/train_raw_df.shape[0], '\n')

# Describe the accounts data
describe(train_account_df)

# Describe the enquiry data
describe(train_enquiry_df)

# Describe the dataset loaded by selecting the best features from raw_df
describe(dss_train_df, columns=True)

Lets have a function to plot the event rate in the dataframe!!

In [None]:
def plot_event_rate(df):
    """
    Plots the event rate in a bar plot
    """
    pd.value_counts(df['Bad_label']).plot.bar()
    plt.title('Distribution of customer Bad_label in the balanced dataset')
    plt.xlabel('Bad_label')
    plt.ylabel('Frequency')
    plt.show()

In [None]:
plot_event_rate(train_raw_df)

### 2. Data Processing

Now is the time we have to analyze the datasets and understand.

But lets do things bit hacky as there is shortage of time:P As we already got some clues of analysis we know the important variables which can avoid a lot of frustrations and disappointments during our course of building the model.

####  2.1 Feature Generation
Lets first create a set of functionalities required for our processing of the derived features. All the methods below have a comment string stating their purpose.

Here the intention is not performance/efficiency but the readability. Hence there are many places where we can see some repeated operations which are trivial to refactor.

In [None]:
# The total duration between last payment date and account opened date of all accounts
def diff_last_pmt_dt_opnd_dt(last_pmt_dt, opened_dt):
    """
    The difference between last payment date and account opened date in days
    """
    num_days = 0
    if opened_dt is not None and opened_dt == opened_dt:
        opened_date = datetime.datetime.strptime(opened_dt, "%d-%b-%y")
        if last_pmt_dt is not None and last_pmt_dt == last_pmt_dt:
            last_payment_date = datetime.datetime.strptime(last_pmt_dt, "%d-%b-%y")
            diff = last_payment_date - opened_date
            num_days = diff.days
    return num_days

In [None]:
# The mean count of accounts that is in 0-29 dpd bucket throughout the payment history
def count_30_dpd(paymenthistory1, paymenthistory2):
    """
    The mean count of accounts that is in 0-29 dpd bucket throughout the payment history
    """
    count = 0
    paymenthistory1 = paymenthistory1.strip("\"") if paymenthistory1 == paymenthistory1 else ""
    paymenthistory2 = paymenthistory2.strip("\"") if paymenthistory2 == paymenthistory2 else ""
    paymenthistory = paymenthistory1 + paymenthistory2
    for i in range(0, len(paymenthistory), 3):
        x = paymenthistory[i:i+3]
        if x.isdigit():
            if int(x) < 30:
                count += 1
    return count

In [None]:
# The smallest number of months passed before first 30+ dpd appeared for each account.
def min_months_30_dpd(paymenthistory1, paymenthistory2):
    """
    The minimum number of months happened before first 30plus dpd has happened
    """
    months = 0
    paymenthistory1 = paymenthistory1.strip("\"") if paymenthistory1 == paymenthistory1 else ""
    paymenthistory2 = paymenthistory2.strip("\"") if paymenthistory2 == paymenthistory2 else ""
    paymenthistory = paymenthistory1 + paymenthistory2
    for i in range(0, len(paymenthistory), 3):
        x = paymenthistory[i:i+3]
        if x.isdigit():
            if int(x) < 30:
                months += 1
            else:
                break
        else:
            months += 1
    return months

In [None]:
# Number of enquiries made in past 365 days
def count_past_365_days(opened_dt_series, enquiry_dt_series):
    """
    Return 1 if the date difference is under 365 days else 0
    """
    num_of_enq = 0
    for opened_dt, enquiry_dt in zip(opened_dt_series, enquiry_dt_series):
        if opened_dt is not None and opened_dt == opened_dt and enquiry_dt is not None and enquiry_dt == enquiry_dt:
            opened_date = datetime.datetime.strptime(opened_dt, "%d-%b-%y")
            enquiry_date = datetime.datetime.strptime(enquiry_dt, "%d-%b-%y")
            diff = enquiry_date - opened_date
            if diff.days <= 365:
                num_of_enq += 1
    return num_of_enq

In [None]:
# Ratio of total current balance amount to total credit limit
def ratio_cur_bal_credit(cur_bal_series, credit_limit_series):
    """
    Calculating total current_balance_amout / total credit_limit
    """
    ratio = cur_bal_series.sum() / credit_limit_series.sum()
    if ratio != ratio:
        return 0
    return ratio

In [None]:
# The mean duration between last payment date and account opened date of all accounts
def mean_last_pmt_acc_opnd(last_pmt_dt_series, acc_opnd_dt_series):
    """
    Calculating the mean duration between last payment and account opened date for all accounts of a customer
    """
    mean_vals = []
    for last_pmt_dt, acc_opnd_dt in zip(last_pmt_dt_series, acc_opnd_dt_series):
        if last_pmt_dt is not None and last_pmt_dt == last_pmt_dt and acc_opnd_dt is not None and acc_opnd_dt == acc_opnd_dt:
            last_pmt_date = datetime.datetime.strptime(last_pmt_dt, "%d-%b-%y")
            acc_opnd_date = datetime.datetime.strptime(acc_opnd_dt, "%d-%b-%y")
            difference = last_pmt_date - acc_opnd_date
            mean_vals.append(difference.days)
    return int(sum(mean_vals) / len(mean_vals)) if len(mean_vals) > 0 else 0

In [None]:
# The average difference between enquiry dt_opened date and enquiry date
def diff_enq_dt_opened_enq_dt(dt_opened, enq_date):
    """
    The days difference between date_opened in enquiry and the actual enquiry_date
    """
    num_days = 0
    if dt_opened is not None and dt_opened == dt_opened:
        opened_date = datetime.datetime.strptime(dt_opened, "%d-%b-%y")
        if enq_date is not None and enq_date == enq_date:
            enquiry_date = datetime.datetime.strptime(enq_date, "%d-%b-%y")
            diff = opened_date - enquiry_date
            num_days = diff.days
    return num_days

In [None]:
# Average length of payment history variable
def avg_length_payment_history(paymenthistory1, paymenthistory2):
    """
    Average length of payment history variable
    """
    length = 0
    paymenthistory1 = paymenthistory1.strip("\"") if paymenthistory1 == paymenthistory1 else ""
    paymenthistory2 = paymenthistory2.strip("\"") if paymenthistory2 == paymenthistory2 else ""
    paymenthistory = paymenthistory1 + paymenthistory2
    for i in range(0, len(paymenthistory), 3):
        length += 1
    return length

In [None]:
# Most frequent enquiry purpose
def mode_enq_purpose(enq_purpose):
    """
    Return the mode of the enquiry purpose for a customer
    """
    count = Counter(enq_purpose)
    return max(count, key=count.get)

In [None]:
# Number of enquiry made in past 90 days
def count_past_90_days(opened_dt_series, enquiry_dt_series):
    """
    Return 1 if the date difference is under 90 days else 0
    """
    num_of_enq = 0
    for opened_dt, enquiry_dt in zip(opened_dt_series, enquiry_dt_series):
        if opened_dt is not None and opened_dt == opened_dt and enquiry_dt is not None and enquiry_dt == enquiry_dt:
            opened_date = datetime.datetime.strptime(opened_dt, "%d-%b-%y")
            enquiry_date = datetime.datetime.strptime(enquiry_dt, "%d-%b-%y")
            if (int(enquiry_date.timestamp()) - int(opened_date.timestamp())) <= 90:
                num_of_enq += 1
    return num_of_enq

In [None]:
# Calculate the utilization trend
def utilization_trend(cur_balance_amt_series, credit_limit_series, cash_limit_series):
    """
    Calculat the utilization trend of a customer
    """
    total_cur_balance_amt = cur_balance_amt_series.sum()
    total_credit_limit = credit_limit_series.sum()
    mean_cur_balance_amt = cur_balance_amt_series.mean()
    mean_credit_limit = credit_limit_series.mean()
    mean_cash_limit = cash_limit_series.mean()
    return (total_cur_balance_amt / total_credit_limit) / (mean_cur_balance_amt / (mean_credit_limit + mean_cash_limit))

Let's create a functionality which can manage all the feature generation and creates and final dataset needed.

Just to make our life easier on the test dataset:P

In [None]:
def preprocess(raw_df: pd.DataFrame, account_df: pd.DataFrame, enquiry_df: pd.DataFrame):
    """
    This is the main method for the pre-processing of the data.
    This method returns a single Pandas DataFrame after the pre processing
    """
    
    # Add only the customer_no and Bad_label to the new_df
    new_df = pd.DataFrame(data=raw_df[['customer_no','Bad_label']], columns=['customer_no','Bad_label'])
    
    account_df['diff_last_pmt_dt_opnd_dt'] = account_df.apply(lambda x: diff_last_pmt_dt_opnd_dt(x['last_paymt_dt'], x['opened_dt']), axis=1)
    total_diff_last_pmt_dt_opnd_dt_df = account_df[['diff_last_pmt_dt_opnd_dt','customer_no']].groupby('customer_no').sum().reset_index()
    total_diff_last_pmt_dt_opnd_dt_df.columns = ['customer_no','diff_last_pmt_dt_opnd_dt']
    new_df = new_df.merge(right=total_diff_last_pmt_dt_opnd_dt_df, on='customer_no',how='inner')
    
    account_df['months_before_first_30dpd'] = account_df.apply(lambda x: count_30_dpd(x['paymenthistory1'], x['paymenthistory2']), axis=1)
    min_months_30_dpd_df = account_df[['months_before_first_30dpd','customer_no']].groupby('customer_no').min().reset_index()
    min_months_30_dpd_df.columns = ['customer_no','min_months_before_first_30dpd']
    new_df = new_df.merge(right=min_months_30_dpd_df, on='customer_no',how='inner')
    
    enquiry_recency_365_df = enquiry_df.groupby('customer_no').apply(lambda x: count_past_365_days(x['dt_opened'],x['enquiry_dt'])).to_frame().reset_index()
    enquiry_recency_365_df.columns = ['customer_no','enquiry_recency_365']
    new_df = new_df.merge(right=enquiry_recency_365_df, on='customer_no',how='inner')
    
    ratio_cur_bal_credit_df = account_df.groupby('customer_no').apply(lambda x: ratio_cur_bal_credit(x['cur_balance_amt'], x['creditlimit'])).to_frame().reset_index()
    ratio_cur_bal_credit_df.columns = ['customer_no','ratio_cur_bal_credit']
    new_df = new_df.merge(right=ratio_cur_bal_credit_df, on='customer_no',how='inner')
    
    mean_last_pmt_acc_opnd_df = account_df.groupby('customer_no').apply(lambda x: mean_last_pmt_acc_opnd(x['last_paymt_dt'], x['opened_dt'])).to_frame().reset_index()
    mean_last_pmt_acc_opnd_df.columns = ['customer_no','mean_last_pmt_acc_opnd']
    new_df = new_df.merge(right=mean_last_pmt_acc_opnd_df, on='customer_no',how='inner')

    enquiry_df['diff_opened_dt_enq_dt'] = enquiry_df.apply(lambda x: diff_enq_dt_opened_enq_dt(x['dt_opened'], x['enquiry_dt']), axis=1)
    avg_diff_enq_dt_opened_enq_dt_df = enquiry_df[['customer_no','diff_opened_dt_enq_dt']].groupby('customer_no').mean().reset_index()
    avg_diff_enq_dt_opened_enq_dt_df.columns = ['customer_no','mean_diff_opened_dt_enq_dt']
    new_df = new_df.merge(right=avg_diff_enq_dt_opened_enq_dt_df, on='customer_no',how='inner')

    account_df['pmt_history_len'] = account_df.apply(lambda x: avg_length_payment_history(x['paymenthistory1'], x['paymenthistory2']), axis=1)
    pmt_history_len_df = account_df[['customer_no','pmt_history_len']].groupby('customer_no').mean().reset_index()
    pmt_history_len_df.columns = ['customer_no','mean_pmt_history_len']
    new_df = new_df.merge(right=pmt_history_len_df, on='customer_no',how='inner')
    
    mode_enq_purpose_df = enquiry_df.groupby('customer_no').apply(lambda x: mode_enq_purpose(x['enq_purpose'])).to_frame().reset_index()
    mode_enq_purpose_df.columns = ['customer_no','mode_enq_purpose']
    new_df = new_df.merge(right=mode_enq_purpose_df, on='customer_no',how='inner')
    
    enquiry_recency_90_df = enquiry_df.groupby('customer_no').apply(lambda x: count_past_365_days(x['dt_opened'],x['enquiry_dt'])).to_frame().reset_index()
    enquiry_recency_90_df.columns = ['customer_no','enquiry_recency_90']
    new_df = new_df.merge(right=enquiry_recency_90_df, on='customer_no',how='inner')
    
    utilization_trend_df = account_df.groupby('customer_no').apply(lambda x: utilization_trend(x['cur_balance_amt'],x['creditlimit'],x['cashlimit'])).to_frame().reset_index()
    utilization_trend_df.columns = ['customer_no','utilization_trend']
    new_df = new_df.merge(right=utilization_trend_df, on='customer_no', how='inner')
    
    return new_df

Now let's run the processing step and get the new features!! Please remain calm as I already stated that this is not performace specific

In [None]:
train_df_1 = preprocess(raw_df=train_raw_df, account_df=train_account_df, enquiry_df=train_enquiry_df)
train_df_1.to_csv('train_df_1.csv', index=False)
train_df.head()

Now we have the generated features in the dataset.

Let's create a single train dataset by combining the features generated dataset and the analyzed and selected dataset. 

In [None]:
# Concatenate both train_dfs to one train_df
# Also removing Bad_label as we already have in the train_df_1
train_df = train_df_1.merge(dss_train_df.drop(['Bad_label'], axis=1), on='customer_no', how='inner')
train_df.head()

In [None]:
# Save data as csv
train_df.to_csv('train_df.csv', index=False)

In [None]:
train_df = pd.read_csv('train_df.csv')

#### 2.2 Data Encoding

Let's create a functionality to encode the categorical columns in the dataset

In [None]:
def encode_columns(df, cols_to_encode):
    """
    Encodes all the columns and returns the encoded df
    """
    all_columns = df.columns.tolist()
    for col in cols_to_encode:
        if col not in all_columns:
            print("Given column {0} to encode is not in the dataframe's columns {1}".format(col, all_columns))
            continue
        df = pd.concat([df, pd.get_dummies(df.loc[:,col], prefix=col)], axis=1)
        df.drop([col], axis=1, inplace=True)
    return df

In [None]:
train_df = encode_columns(df=train_df, cols_to_encode=['feature_1','feature_27','feature_32','feature_58','feature_60'])
train_df.to_csv('train_df_encoded.csv', index=False)
train_df.head()

#### 2.3 NaN and Outlier treatment

In [None]:
describe(train_df, missing_vals=True)
get_nans(train_df)

In [None]:
# Impute the mode value of the column to the missing values. Here the code 10.0 is of the high frequency of 19623. 
# So imputing this value will not make much of a difference to the data

train_df['mode_enq_purpose'].fillna(train_df.mode(axis=0).iloc[0].loc['mode_enq_purpose'], inplace=True)

# Impute mean value for utilization_trend column
train_df['utilization_trend'].fillna(train_df.loc[:,'utilization_trend'].mean(), inplace=True)

# Verifying the missing values
describe(train_df, columns=True, missing_vals=True)

In [None]:
# Drop records with outliers 
train_df_otlr_treat = train_df[(np.abs(stats.zscore(train_df)) < 3).all(axis=1)]

# event rate
print('Event rate: ',sum(train_df_otlr_treat.loc[:, 'Bad_label'])/train_df_otlr_treat.shape[0], '\n')

We will neglect the outlier removal as removing outliers is resulting in the 0.0 event rate.

#### 2.4 Data Normalization

Let's have a functionality to normalize the dataset's attributes

In [None]:
# Normalize the data
def normalize(df, cols):
    """
    This function normalizes the given cols of the data in the dataset.
    """
    for col in cols:
        if col not in df.columns.tolist():
            raise Exception("The column {0} is not in the DataFrame to normalize".format(col))
        df['norm_' + col] = StandardScaler().fit_transform(df[col].values.reshape(-1, 1))
        df.drop([col], axis=1, inplace=True)

In [None]:
cols_to_norm = ['diff_last_pmt_dt_opnd_dt',
       'min_months_before_first_30dpd', 'enquiry_recency_365',
       'ratio_cur_bal_credit', 'mean_last_pmt_acc_opnd',
       'mean_diff_opened_dt_enq_dt', 'mean_pmt_history_len',
       'mode_enq_purpose', 'enquiry_recency_90', 'utilization_trend',
       'feature_3', 'feature_7', 'feature_29', 'feature_34', 'feature_44',
       'feature_64', 'feature_65']
normalize(train_df, cols=cols_to_norm)
train_df.head()

#### 2.5 Calculate Information Value score

Let's have a functionality to get the features in the order of their IV scores

In [None]:
def get_features_ordered(scores, features):
    """
    This method returns the top n features based on the IV scores.
    Returns the list of features in the order of their IV scores. [a, ..., z] a is highest IV and z is lowest IV.
    """
    if not len(features) == len(scores):
        print("Given number of scores doesn't match with number of features")
        return None
    attr_to_score = dict(zip(features, scores))
#     return sorted(attr_to_score, key=attr_to_score.get, reverse=True)
    features_ordered =  sorted(attr_to_score, key=attr_to_score.get, reverse=True)
    features_rank = OrderedDict()
    for feature in features_ordered:
        features_rank[feature] = attr_to_score[feature]
    return features_rank

In [None]:
iv = WOE()
iv_scores = iv.woe(X=train_df.iloc[:,2:].values, y=train_df.iloc[:,1].values)
get_top_n_features(scores=iv_scores, features=train_df.columns.tolist()[2:])

#### 2.6 Data Balancing

In [None]:
print('Event rate: ',(sum(train_df.loc[:, 'Bad_label'])/train_df.shape[0]) * 100, '\n')
plot_event_rate(train_df)

In [None]:
# Using ADASYN for the data balancing
X_resampled, y_resampled = ADASYN().fit_sample(train_df.iloc[:,2:], train_df.loc[:,'Bad_label'])
print('Event rate after data balancing: ', round(100*sum(y_resampled)/y_resampled.shape[0], 1))
col_names = train_df.columns
train_df_blncd = pd.DataFrame(X_resampled)
train_df_blncd.columns = col_names[2:]
train_df_blncd.loc[:,'Bad_label'] = y_resampled
train_df_blncd.loc[:,'customer_no'] = [i+1 for i in train_df_blncd.index]
plot_event_rate(train_df_blncd)

Let's check the IV scores on the balanced dataset

In [None]:
iv_scores = iv.woe(X=train_df_blncd.iloc[:,:-2].values, y=train_df_blncd.iloc[:,-2].values)
get_top_n_features(scores=iv_scores, features=train_df_blncd.columns.tolist()[:-2])

In [None]:
# Save balanced data as csv
train_df_blncd.to_csv('train_df_blncd.csv', index=False)

In [573]:
train_df_blncd = pd.read_csv('train_df_blncd.csv')

### 3. Model Building

Let's do some model fitting to the data we prepared till now and see how can we use machine learning for out credit score prediction!!

#### 3.1 Logistic Regression

As we all know one of the trivial solution for the binary classiffication is the Logistic Regression. Let's see how does it perform on our data.

In [None]:
# Instantiate the logreg model
logreg_classiffier = LogisticRegression(penalty='l2', dual=False, tol=0.01, C=10, fit_intercept=True, 
                        intercept_scaling=1, class_weight=None, random_state=None, 
                        solver='liblinear', max_iter=100, multi_class='ovr', 
                        verbose=0, warm_start=False, n_jobs=1)
# Fit the model
logreg_classiffier.fit(train_df.iloc[:,2:], train_df.iloc[:,1])

Now that we have fit a logreg model for our data now is the time that we actually check how good has been our model fit. 

Lets cleanup the test data and follow the data processing steps followed for training data.

In [None]:
# Pre process the test data
test_df_1 = preprocess(raw_df = test_raw_df, account_df = test_account_df, enquiry_df = test_enquiry_df)
test_df_2 = dss_test_df.drop(['Bad_label'], axis=1)
test_df = test_df_1.merge(test_df_2, on='customer_no', how='inner')
test_df.shape

In [None]:
# Save the test_df as csv
test_df.to_csv('test_df.csv', index=False)

In [None]:
test_df = pd.read_csv('test_df.csv')

In [None]:
# Encode the categorical columns in test data
test_df = encode_columns(df=test_df, cols_to_encode=['feature_1','feature_27','feature_32','feature_58','feature_60'])

# Impute the mode value of the column to the missing values.
test_df['mode_enq_purpose'].fillna(test_df.mode(axis=0).iloc[0].loc['mode_enq_purpose'], inplace=True)
#Impute the mean value for utilization_trend
test_df['utilization_trend'].fillna(test_df.loc[:,'utilization_trend'].mean(), inplace=True)

# Normalize the test data
normalize(test_df, cols_to_norm)

test_df.head()

In [None]:
tr_cols = train_df.columns.tolist()
te_cols = test_df.columns.tolist()
col_not_in_tr = [col for col in te_cols if col not in tr_cols]
col_not_in_te = [col for col in tr_cols if col not in te_cols]
print("Col not in train: {0} and Col not in test: {1}".format(col_not_in_tr, col_not_in_te))

As we can see that the feature_27_Architect is there in the train data but not in the test data and simillarly the feature_27_Lawyer is vice-versa. 

As a simple step let's remove the feature_27_Architect from out training data and train the model again.

In [None]:
train_df.drop('feature_27_Architect', axis=1, inplace=True)
test_df.drop('feature_27_Lawyer', axis=1, inplace=True)

Now lets fit our Logreg model again!!

In [570]:
# Instantiate the logreg model
logreg_classiffier_1 = LogisticRegression(penalty='l2', dual=False, tol=0.01, C=10, fit_intercept=True, 
                        intercept_scaling=1, class_weight=None, random_state=None, 
                        solver='liblinear', max_iter=100, multi_class='ovr', 
                        verbose=0, warm_start=False, n_jobs=1)
# Fit the model
logreg_classiffier_1.fit(train_df.iloc[:,2:], train_df.iloc[:,1])

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.01,
          verbose=0, warm_start=False)

Let's test the model's fit on the test data and check the quality of our fit

In [571]:
# Predictions
predictions_lr_1 = logreg_classiffier_1.predict(test_df.iloc[:,2:])

In [572]:
# Get classification report
target_names=['Good Customer', 'Bad Customer']
print(classification_report(test_df.loc[:,'Bad_label'], predictions_lr_1, target_names=target_names))

               precision    recall  f1-score   support

Good Customer       0.95      1.00      0.98      9778
 Bad Customer       0.00      0.00      0.00       462

  avg / total       0.91      0.95      0.93     10240



  'precision', 'predicted', average, warn_for)


Let's create a functionality for calculating the AUC score and Gini coefficient

In [None]:
def get_auc_gini(actual, prediction):
    """
    Prints the AUC and Gini values for the classiffication
    """
    auc_score = roc_auc_score(actual, prediction)
    gini = 2 * auc_score - 1
    print("AUC score: {0}, Gini: {1}".format(auc_score, gini))

In [None]:
get_auc_gini(actual=test_df.loc[:,'Bad_label'].tolist(), prediction=list(predictions_lr))

Oops!! Our firsl model went really bad :( :( :(

Thats okay its highly unreasonable to get the best fit in the first model.

Let's try to understand the problem better.

So first let's see how does our model behave on the balanced data set.

In [574]:
# Remove the feature_27_Architect from the balanced dataset
train_df_blncd.drop('feature_27_Architect', axis=1, inplace=True)
train_df_blncd = shuffle(train_df_blncd)

# Instantiate the logreg model
logreg_classiffier_2 = LogisticRegression(penalty='l2', dual=False, tol=0.01, C=10, fit_intercept=True, 
                        intercept_scaling=1, class_weight=None, random_state=None, 
                        solver='liblinear', max_iter=100, multi_class='ovr', 
                        verbose=0, warm_start=False, n_jobs=1)
# Fit the model
logreg_classiffier_2.fit(train_df_blncd.iloc[:,:-2], train_df_blncd.iloc[:,-2])

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.01,
          verbose=0, warm_start=False)

In [575]:
# Predictions
predictions_lr_2 = logreg_classiffier_2.predict(test_df.iloc[:,2:])

In [576]:
# Get classification report
target_names=['Good Customer', 'Bad Customer']
print(classification_report(test_df.loc[:,'Bad_label'], predictions_lr_2, target_names=target_names))

               precision    recall  f1-score   support

Good Customer       0.96      0.74      0.83      9778
 Bad Customer       0.05      0.28      0.08       462

  avg / total       0.91      0.72      0.80     10240



In [577]:
get_auc_gini(actual=test_df.loc[:,'Bad_label'].tolist(), prediction=list(predictions_lr))

AUC score: 0.5090110850491296, Gini: 0.018022170098259238


Cool!! We could find some benefit using the oversampling technique. But thats not quite enough

Let's see if some other classiffies can improve the performace

In [579]:
# Instantiate a Random Forest classiffier
rf_clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, 
                                min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
                                max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, 
                                min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=-1, 
                                random_state=42, verbose=0, warm_start=False, class_weight=None)

# Fit the model
rf_clf.fit(train_df_blncd.iloc[:,:-2], train_df_blncd.iloc[:,-2])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [580]:
# predictions
predictions_rf = rf_clf.predict(test_df.iloc[:,:-2])

In [581]:
# Get classification report
target_names=['Good Customer', 'Bad Customer']
print(classification_report(test_df.loc[:,'Bad_label'], predictions_rf, target_names=target_names))

               precision    recall  f1-score   support

Good Customer       0.96      0.52      0.68      9778
 Bad Customer       0.06      0.59      0.10       462

  avg / total       0.92      0.53      0.65     10240



In [583]:
get_auc_gini(actual=test_df.loc[:,'Bad_label'].tolist(), prediction=list(predictions_rf))

AUC score: 0.5584001632784615, Gini: 0.11680032655692307


Thats pretty much a good news to what we were seeing earlier.

So far we have used a logistic model and a tree based model. Logistic regression is a parametric model while random forest is a non-parametric model. Next we will train a boosted model.

In [633]:
# Set the parameters
n_estimators = 150
learning_rate = 1.

# train the adaboost algorithm
ada_real = AdaBoostClassifier(
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    algorithm="SAMME.R")

ada_real.fit(train_df_blncd.iloc[:,:-2], train_df_blncd.iloc[:,-2])

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=150, random_state=None)

In [634]:
# Predictions
predictions_ab = ada_real.predict(test_df.iloc[:,:-2])

In [635]:
# Get classification report
target_names=['Good Customer', 'Bad Customer']
print(classification_report(test_df.loc[:,'Bad_label'], predictions_ab, target_names=target_names))

               precision    recall  f1-score   support

Good Customer       0.97      0.48      0.64      9778
 Bad Customer       0.06      0.71      0.11       462

  avg / total       0.93      0.49      0.62     10240



In [636]:
get_auc_gini(actual=test_df.loc[:,'Bad_label'].tolist(), prediction=list(predictions_ab))

AUC score: 0.5969158168483184, Gini: 0.1938316336966368


Thats pretty cool!!

Let's wrap up everything with a last good fit model

Now let's try with an ensemble model and see if it helps in anyway to improve.

In [637]:
# Instantiate an ensemble classiffier
e_classiffier = VotingClassifier(estimators=[('lr', logreg_classiffier_2), ('rf', rf_clf), ('ab', ada_real)])

# Fit the train data
e_classiffier.fit(train_df_blncd.iloc[:,:-2], train_df_blncd.iloc[:,-2])

VotingClassifier(estimators=[('lr', LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.01,
          verbose=0, warm_start=False)), ('rf', RandomFore...='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=150, random_state=None))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [638]:
# Predictions
predictions_ec = e_classiffier.predict(test_df.iloc[:,:-2])

In [639]:
# Get classification report
target_names=['Good Customer', 'Bad Customer']
print(classification_report(test_df.loc[:,'Bad_label'], predictions_ec, target_names=target_names))

               precision    recall  f1-score   support

Good Customer       0.97      0.34      0.50      9778
 Bad Customer       0.05      0.77      0.10       462

  avg / total       0.93      0.36      0.48     10240



In [640]:
get_auc_gini(actual=test_df.loc[:,'Bad_label'].tolist(), prediction=list(predictions_ec))

AUC score: 0.5562859108573979, Gini: 0.11257182171479574


:( :( This ensembling didn't help though

Anyway the Adaboost is so far giving a real good result. Let's try a small variation on the Adaboost w.r.t data.

Instead of taking all the attributes let's only take the top 10 attributes from IV and see if that helps!

In [642]:
iv_scores = iv.woe(X=train_df_blncd.iloc[:,:-2].values, y=train_df_blncd.iloc[:,-2].values)
ordered_dict = get_top_n_features(scores=iv_scores, features=train_df_blncd.columns.tolist()[:-2])
cols_to_select = []
for k, v in ordered_dict.items():
    if v > 0.009:
        cols_to_select.append(k)
cols_to_select

['feature_32_Self',
 'feature_1_Platinum Maxima',
 'feature_1_Titanium Deligh',
 'feature_27_Graduate',
 'feature_1_Platinum Deligh',
 'norm_mode_enq_purpose',
 'norm_feature_65',
 'norm_feature_29',
 'norm_feature_64',
 'norm_feature_44',
 'norm_utilization_trend',
 'feature_32_Rente',
 'feature_32_Paren',
 'norm_mean_pmt_history_len',
 'norm_ratio_cur_bal_credit',
 'norm_mean_diff_opened_dt_enq_dt',
 'norm_diff_last_pmt_dt_opnd_dt',
 'norm_feature_3',
 'norm_mean_last_pmt_acc_opnd',
 'norm_enquiry_recency_365',
 'norm_enquiry_recency_90',
 'norm_feature_34',
 'feature_58_N',
 'norm_feature_7',
 'norm_min_months_before_first_30dpd']

Now, lets select only these columns from our training set and train the model

In [643]:
cols_to_select += ['Bad_label', 'customer_no']
train_df_blncd_new = train_df_blncd[cols_to_select]
train_df_blncd_new.head()

Unnamed: 0,feature_32_Self,feature_1_Platinum Maxima,feature_1_Titanium Deligh,feature_27_Graduate,feature_1_Platinum Deligh,norm_mode_enq_purpose,norm_feature_65,norm_feature_29,norm_feature_64,norm_feature_44,...,norm_feature_3,norm_mean_last_pmt_acc_opnd,norm_enquiry_recency_365,norm_enquiry_recency_90,norm_feature_34,feature_58_N,norm_feature_7,norm_min_months_before_first_30dpd,Bad_label,customer_no
28346,0.0,0.0,1.0,1.0,0.0,-2.044563,-0.00544,0.108638,0.014463,0.135921,...,1.212834,2.539525,-0.598901,-0.598901,-0.555239,1.0,-0.530242,2.184268,1,28347
24850,0.696036,0.0,1.0,0.696036,0.0,-0.761931,-0.248382,-0.652147,-0.385947,-0.692244,...,1.538943,-0.029341,0.92218,0.92218,1.063801,1.0,-0.10604,-0.363198,1,24851
2301,0.0,0.0,1.0,0.0,0.0,-2.3808,-0.746978,1.146315,0.674171,1.133739,...,0.082505,-0.202718,5.226114,5.226114,1.770848,1.0,-1.149356,-0.577087,0,2302
42220,0.692261,0.0,0.307739,0.307739,0.692261,0.258746,-0.162771,-0.675146,0.326701,-0.662817,...,-0.22924,-0.605799,-0.351662,-0.351662,-0.555239,1.0,-0.055496,-0.52295,1,42221
15621,0.0,0.0,0.0,1.0,1.0,0.258746,-0.275564,-0.675038,1.935202,-0.639324,...,-0.859407,-0.692453,-0.739956,-0.739956,-0.555239,1.0,-0.775151,0.47841,0,15622


In [645]:
# Set the parameters
n_estimators = 150
learning_rate = 1.

# train the adaboost algorithm
ada_real_2 = AdaBoostClassifier(
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    algorithm="SAMME.R")

ada_real_2.fit(train_df_blncd_new.iloc[:,:-2], train_df_blncd_new.iloc[:,-2])

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=150, random_state=None)

In [646]:
test_df_new = test_df[cols_to_select]

# Predictions
predictions_ab_2 = ada_real_2.predict(test_df_new.iloc[:,:-2])

In [647]:
# Get classification report
target_names=['Good Customer', 'Bad Customer']
print(classification_report(test_df_new.loc[:,'Bad_label'], predictions_ab_2, target_names=target_names))

               precision    recall  f1-score   support

Good Customer       0.96      0.95      0.95      9778
 Bad Customer       0.08      0.10      0.09       462

  avg / total       0.92      0.91      0.91     10240



In [648]:
get_auc_gini(actual=test_df_new.loc[:,'Bad_label'].tolist(), prediction=list(predictions_ab_2))

AUC score: 0.5224773521971313, Gini: 0.044954704394262635


Mmmmmm sad that it didn't go with my intuition:( :(

Anyway that was a real fun I had doing this project.

A Gini of close to 20 is not bad though with the time that I've taken to finish this.

## Thank you!! Had a great recollecting time doing this project:) :)