# Bank credit scoring

The dataset contains borrowers' personal data together with a flag indicating whether each borrower defaulted.

## Field descriptions:

- `client_id` - client identifier
- `education` - level of education
- `sex` - borrower's gender
- `age` - borrower's age
- `car` - flag of the presence of a car
- `car_type` - flag of a foreign car
- `decline_app_cnt` - number of declined past applications
- `good_work` - flag of having "good" work
- `bki_request_cnt` - number of requests to the BKI (credit history bureau)
- `home_address` - home address categorizer
- `work_address` - work address categorizer
- `income` - borrower's income
- `foreign_passport` - availability of a foreign passport
- `sna` - relationship between the borrower and the bank's clients
- `first_time` - how old the information about the borrower was
- `score_bki` - scoring according to data from the BKI
- `region_rating` - region rating
- `app_date` - date of application submission
- `default` - credit default flag

In [1]:
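import importlib
import os
import subprocess
import sys


def import_extra_package(package):
    """Import a package, installing it with pip first if it is missing."""
    try:
        importlib.import_module(package)
    except ImportError:
        # Install into the interpreter that runs this notebook
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        importlib.import_module(package)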

In [2]:
import pandas as pd
import_extra_package('pandas_profiling')
from pandas_profiling import ProfileReport
import numpy as np

import datetime

import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

import_extra_package('mlxtend')
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from math import log
import_extra_package('nameof')

## Helpers

Functions to combine all operations.

In [3]:
def profile_report(df):
    if df is None:
        # `nameof` is never bound in this namespace, so use a plain message
        raise TypeError('df must not be None')

    profile = df.profile_report(
        title='Bank credit scoring',
        dark_mode=True,
        progress_bar=False,
        correlations={
            'pearson': {'calculate': True},
            'spearman': {'calculate': False},
            'kendall': {'calculate': False},
            'phi_k': {'calculate': False},
            'cramers': {'calculate': False},
        },
        interactions={
            'continuous': False,
            'targets': []
        },
        missing_diagrams={
            'heatmap': True,
            'dendrogram': False,
            'matrix': False
        },
        vars={
            'cat': {
                'characters': False,
                'words': False,
                'n_obs': 10
            }
        }
    )

    return profile


class ProcessingService():
    def __init__(self):
        self.label_encoder = LabelEncoder()
        self.standard_scaler = StandardScaler()

    def binary_categories_to_numbers(self, df, column):
        df[column] = self.label_encoder.fit_transform(df[column])

    def get_dummies(self, df, column):
        return pd.get_dummies(df, columns=[column])
    
    def standardizing_numeric_variables(self, df, column):
        # StandardScaler expects 2D input, so select the column as a DataFrame
        df[column] = self.standard_scaler.fit_transform(df[[column]]).ravel()

## Load the train and test datasets:

In [4]:
path = ''

In [5]:
for dirname, _, filenames in os.walk('kaggle'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        path = dirname

kaggle\input\sf-dst-scoring\sample_submission.csv
kaggle\input\sf-dst-scoring\test.csv.zip
kaggle\input\sf-dst-scoring\train.csv.zip


In [6]:
if path == '':
    raise Exception('The input path is empty!')
else:
    print(path)

kaggle\input\sf-dst-scoring


In [7]:
train_df = pd.read_csv(os.path.join(path, 'train.csv.zip'))
test_df = pd.read_csv(os.path.join(path, 'test.csv.zip'))

## General data inspection:

In [8]:
train_df_profile = profile_report(train_df)
test_df_profile = profile_report(test_df)

In [9]:
# profile.to_widgets()
train_df_profile.to_file('train_df_output.html')
test_df_profile.to_file('test_df_output.html')

The reports are available here: **[the training dataset profile report](./train_df_output.html)** and **[the test dataset profile report](./test_df_output.html)**.

## Cleaning and Preparing the Data

Let's create the processing service object:

In [10]:
processing_service = ProcessingService()

Prepare lists for binary, categorical and numeric variables:

In [11]:
# binary columns
bin_cols = []

# categorical columns
cat_cols = []

# numeric columns
num_cols = []

### Client identifier: [client_id]

The `client_id` feature is a unique identifier of a client. Identifier features like this carry no predictive information, so let's remove it from the training dataset.

In [12]:
train_df.drop(columns=['client_id'], inplace=True)

### Level of education: [education]

The `education` feature is categorical and stores the client's level of education. It has missing values: approximately 0.4% in the training dataset and about 0.5% in the test dataset. The possible values are SCH, GRD, UGR, PGR, and ACD. The most frequent value is SCH, which accounts for more than 50% of the records, so let's fill the missing values with SCH.
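
The choice of fill value can be checked directly. The following is a minimal sketch on a toy series (the values are illustrative, not taken from the dataset) that measures the share of missing values, finds the most frequent category, and fills the gaps with it:

```python
import pandas as pd

# Toy stand-in for the `education` column (illustrative values only)
education = pd.Series(['SCH', 'GRD', 'SCH', None, 'UGR', 'SCH', None])

# Share of missing values in the column
missing_share = education.isna().mean()

# Most frequent category; mode() ignores missing values by default
most_frequent = education.mode().iloc[0]

# Fill the gaps with the most frequent category
education_filled = education.fillna(most_frequent)

print(f'missing: {missing_share:.1%}, most frequent: {most_frequent}')
```

On the real data the same calls would be applied to `train_df['education']` and `test_df['education']`.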

In [13]:
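# assign the result back; in-place fillna on a selected column is unreliable in recent pandas
train_df['education'] = train_df['education'].fillna('SCH')
test_df['education'] = test_df['education'].fillna('SCH')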

The feature can be encoded as dummy variables. Let's do it later with all categorical features.
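
As a rough illustration of what dummy encoding produces, here is a sketch on a toy frame (the rows are made up for the example). It uses `pd.get_dummies`, the same call that `ProcessingService.get_dummies` wraps:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values)
toy = pd.DataFrame({
    'education': ['SCH', 'GRD', 'UGR', 'SCH'],
    'age': [30, 45, 27, 52],
})

# Each category becomes its own indicator column; the original column is dropped
encoded = pd.get_dummies(toy, columns=['education'])

print(encoded.columns.tolist())
```

Untouched columns keep their position, and the new indicator columns are named `<column>_<category>`.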

In [14]:
# add column to categorical list
cat_cols.append('education')

### Borrower's gender: [sex]

The `sex` feature has only two values, M and F. It can be encoded as a binary variable: Male = 1, Female = 0. Let's do it later with all binary features.
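
A small sketch of how this mapping comes out of `LabelEncoder`, using a toy series rather than the real column. The encoder sorts the classes alphabetically, which is why F becomes 0 and M becomes 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the `sex` column (illustrative values only)
sex = pd.Series(['M', 'F', 'F', 'M'])

encoder = LabelEncoder()
sex_encoded = encoder.fit_transform(sex)

# classes_ lists the original labels in the order of their integer codes
print(list(encoder.classes_), sex_encoded.tolist())
```

The same alphabetical rule applies to any two-valued string column, so it is worth checking `classes_` before relying on a particular 0/1 assignment.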

In [15]:
# add column to binary list
bin_cols.append('sex')

### Borrower's age: [age]

The `age` feature is a numeric variable. The histogram shows that the distribution has a long right tail, so we can try working with the logarithm of this variable. Let's do it later with all numeric features.
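
To show the effect of a log transform on a right-tailed variable, here is a sketch on a toy series (the numbers are illustrative). It uses `np.log1p`, i.e. log(1 + x), which is an assumption on my part rather than the notebook's stated choice; it is convenient because it is also defined for zero counts:

```python
import numpy as np
import pandas as pd

# Toy right-tailed values with one large outlier (illustrative only)
values = pd.Series([10000, 18000, 30000, 30000, 1000000])

# log1p computes log(1 + x), so zeros map to zero instead of -inf
values_log = np.log1p(values)

# Sample skewness before and after the transform
skew_before = values.skew()
skew_after = values_log.skew()

print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```

The transform is invertible with `np.expm1`, so the original scale can be recovered when needed.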

In [16]:
# add column to numeric list
num_cols.append('age')

### Flag of the presence of a car: [car]

The `car` feature has only two values, Y and N. It can be encoded as a binary variable: Y = 1, N = 0. Let's do it later with all binary features.

In [17]:
# add column to binary list
bin_cols.append('car')

### Flag of a foreign car: [car_type]

The `car_type` feature has only two values, Y and N. It can be encoded as a binary variable: Y = 1, N = 0. Let's do it later with all binary features.

In [18]:
# add column to binary list
bin_cols.append('car_type')

### Number of declined past applications: [decline_app_cnt]

The `decline_app_cnt` feature is a numeric variable. The histogram shows that the distribution has a long right tail, so we can try working with the logarithm of this variable. Let's do it later with all numeric features.

In [19]:
# add column to numeric list
num_cols.append('decline_app_cnt')

### Flag of having "good" work: [good_work]

The `good_work` feature has only two values 1 and 0. Let's leave it as it is.

In [20]:
# add column to binary list
bin_cols.append('good_work')

### Number of requests to the BKI: [bki_request_cnt]

The `bki_request_cnt` feature is a numeric variable. The histogram shows that the distribution has a long right tail, so we can try working with the logarithm of this variable. Let's do it later with all numeric features.

In [21]:
# add column to numeric list
num_cols.append('bki_request_cnt')

### Home address categorizer: [home_address]

The `home_address` is a categorical feature. The feature has three possible values: 1, 2, 3. The feature can be modified as a dummy variable. Let's do it later with all categorical features.

In [22]:
# add column to categorical list
cat_cols.append('home_address')

### Work address categorizer: [work_address]

The `work_address` is a categorical feature. The feature has three possible values: 1, 2, 3. The feature can be modified as a dummy variable. Let's do it later with all categorical features.

In [23]:
# add column to categorical list
cat_cols.append('work_address')

### Borrower's income: [income]

The `income` feature is a numeric variable. The histogram shows that the distribution has a long right tail, so we can try working with the logarithm of this variable. Let's do it later with all numeric features.

In [None]:
# add column to numeric list
num_cols.append('income')

### Availability of a foreign passport: [foreign_passport]

The `foreign_passport` feature has only two values, Y and N. It can be encoded as a binary variable: Y = 1, N = 0. Let's do it later with all binary features.

In [24]:
# add column to binary list
bin_cols.append('foreign_passport')

### Relationship between the borrower and the bank's clients: [sna]

The `sna` is a numerical feature. We can try to work with the logarithm of this variable. Let's do it later with all numeric features.

In [25]:
# add column to numeric list
num_cols.append('sna')

### How old the information about the borrower was: [first_time]

The `first_time` is a numerical feature. We can try to work with the logarithm of this variable. Let's do it later with all numeric features.

In [27]:
# add column to numeric list
num_cols.append('first_time')

### Scoring according to data from the BKI: [score_bki]

The `score_bki` is a numeric variable. 

- `region_rating` - region rating
- `app_date` - date of application submission
- `default` - credit default flag

In [26]:
train_df.head()

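| | app_date | education | sex | age | car | car_type | decline_app_cnt | good_work | score_bki | bki_request_cnt | region_rating | home_address | work_address | income | sna | first_time | foreign_passport | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01FEB2014 | SCH | M | 62 | Y | Y | 0 | 0 | -2.008753 | 1 | 50 | 1 | 2 | 18000 | 4 | 1 | N | 0 |
| 1 | 12MAR2014 | SCH | F | 59 | N | N | 0 | 0 | -1.532276 | 3 | 50 | 2 | 3 | 19000 | 4 | 1 | N | 0 |
| 2 | 01FEB2014 | SCH | M | 25 | Y | N | 2 | 0 | -1.408142 | 1 | 80 | 1 | 2 | 30000 | 1 | 4 | Y | 0 |
| 3 | 23JAN2014 | SCH | F | 53 | N | N | 0 | 0 | -2.057471 | 2 | 50 | 2 | 3 | 10000 | 1 | 3 | N | 0 |
| 4 | 18APR2014 | GRD | M | 48 | N | N | 0 | 1 | -1.244723 | 1 | 60 | 2 | 3 | 30000 | 1 | 4 | Y | 0 |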


### Date of application submission: [app_date]

Convert the values to dates and experiment with binning them into categories: less than one year ago, less than two years ago, less than three, four, or five years ago.
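
One possible way to do this is sketched below. It assumes the `01FEB2014` string format seen in the sample rows and an English locale for month abbreviations. The `reference_date`, the bucket labels, and the last two input dates are hypothetical and only serve the illustration:

```python
import pandas as pd

# First two values come from the sample rows; the last two are made up
# so that the example covers more than one bucket
app_date = pd.Series(['01FEB2014', '12MAR2014', '15JUN2012', '30NOV2010'])

# %d = day, %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(app_date, format='%d%b%Y')

# Hypothetical reference point for measuring how long ago the application was
reference_date = pd.Timestamp('2015-01-01')
years_ago = (reference_date - parsed).dt.days / 365.25

# Left-closed bins: [0, 1) -> '<1y', [1, 2) -> '<2y', and so on
age_bucket = pd.cut(
    years_ago,
    bins=[0, 1, 2, 3, 4, 5],
    right=False,
    labels=['<1y', '<2y', '<3y', '<4y', '<5y'],
)

print(age_bucket.tolist())
```

Applications older than five years fall outside the bins and become missing values, so a real pipeline would need an extra open-ended bucket or a different upper bound.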

The `app_date` feature is the date of an operation. It stores values as a string in the format `01FEB2014`. Let's 