# Bank credit scoring

You are given borrowers' personal data together with a flag showing whether each borrower defaulted on a loan.

## Field descriptions:

- `client_id` - client identifier
- `education` - level of education
- `sex` - borrower's gender
- `age` - borrower's age
- `car` - flag indicating whether the borrower owns a car
- `car_type` - flag indicating a foreign-made car
- `decline_app_cnt` - number of previously declined applications
- `good_work` - flag indicating a "good" job
- `bki_request_cnt` - number of requests to the credit bureau (BKI)
- `home_address` - home address category
- `work_address` - work address category
- `income` - borrower's income
- `foreign_passport` - flag indicating whether the borrower holds a foreign passport
- `sna` - connection between the borrower and the bank's clients
- `first_time` - how old the information about the borrower is
- `score_bki` - score based on data from the credit bureau (BKI)
- `region_rating` - region rating
- `app_date` - date of application submission
- `default` - credit default flag
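
For later preprocessing it can help to group these fields by kind. The grouping below is an assumption based only on the descriptions above (flags treated as binary, `education` and the address and rating codes as categorical, counts and amounts as numeric), not something the dataset documents; the list names are illustrative.

```python
# Hypothetical grouping of the columns listed above; adjust after inspecting the data.
BINARY_COLS = ['sex', 'car', 'car_type', 'good_work', 'foreign_passport']
CATEGORICAL_COLS = ['education', 'home_address', 'work_address',
                    'sna', 'first_time', 'region_rating']
NUMERIC_COLS = ['age', 'decline_app_cnt', 'bki_request_cnt', 'income', 'score_bki']
OTHER_COLS = ['client_id', 'app_date']  # identifier and raw date string
TARGET_COL = 'default'

ALL_FIELDS = BINARY_COLS + CATEGORICAL_COLS + NUMERIC_COLS + OTHER_COLS + [TARGET_COL]

# Sanity check: every documented field appears exactly once.
assert len(ALL_FIELDS) == len(set(ALL_FIELDS)) == 19
```

Keeping the groups in named lists makes it easy to apply a different encoder or scaler to each one later.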

In [1]:
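def import_extra_package(package):
    """Import a package, installing it with pip first if it is missing."""
    try:
        return __import__(package)
    except ImportError:
        # Braces make IPython substitute the variable's value; a bare
        # `package` would try to install a project literally named "package".
        !pip install {package}
        return __import__(package)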


def dataset_info(df):
    df.info()
    display('Nan objects:', df.isna().sum())
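
The `!pip` shell escape only works inside IPython or Jupyter. As a minimal sketch of the same install-on-demand idea in plain Python, the helper below uses `importlib` and `subprocess`; the names `ensure_package` and `pip_name` are illustrative and not part of the notebook.

```python
import importlib
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import a module, installing it with pip first if it is missing.

    `pip_name` covers the case where the PyPI name differs from the module name.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install into the interpreter that is running this code.
        subprocess.check_call(
            [sys.executable, '-m', 'pip', 'install', pip_name or module_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# `json` ships with Python, so nothing is installed here.
json_module = ensure_package('json')
print(json_module.__name__)
```

Calling pip through `sys.executable` avoids installing into a different Python environment than the one the kernel is using.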

In [2]:
import pandas as pd
import_extra_package('pandas_profiling')
import numpy as np

import datetime

import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
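try:
    from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2
except ImportError:
    plot_confusion_matrix = None  # use ConfusionMatrixDisplay on newer versions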
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

import_extra_package('mlxtend')
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from math import log
import os

## Load the train and test datasets:

In [3]:
path = ''

In [4]:
for dirname, _, filenames in os.walk('kaggle'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        path = dirname

kaggle\input\sf-dst-scoring\sample_submission.csv
kaggle\input\sf-dst-scoring\test.csv.zip
kaggle\input\sf-dst-scoring\train.csv.zip


In [5]:
if path == '':
    raise Exception('The input path is empty!')
else:
    print(path)

kaggle\input\sf-dst-scoring


In [6]:
train_df = pd.read_csv(os.path.join(path, 'train.csv.zip'))
test_df = pd.read_csv(os.path.join(path, 'test.csv.zip'))
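
`pd.read_csv` infers the compression from the `.zip` extension, provided the archive contains a single CSV file. A small self-contained sketch with a temporary archive shows the behaviour; the file name and rows are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny one-file zip archive to stand in for train.csv.zip (made-up rows).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'demo.csv.zip')
with zipfile.ZipFile(zip_path, 'w') as archive:
    archive.writestr('demo.csv', 'client_id,age,default\n1,35,0\n2,52,1\n')

# compression='infer' is the default; it is spelled out here for clarity.
demo_df = pd.read_csv(zip_path, compression='infer')
print(demo_df.shape)
```

If an archive held more than one file, it would need to be extracted first, since pandas would not know which member to read.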

## General data inspection: 

In [7]:
dataset_info(train_df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73799 entries, 0 to 73798
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   client_id         73799 non-null  int64  
 1   app_date          73799 non-null  object 
 2   education         73492 non-null  object 
 3   sex               73799 non-null  object 
 4   age               73799 non-null  int64  
 5   car               73799 non-null  object 
 6   car_type          73799 non-null  object 
 7   decline_app_cnt   73799 non-null  int64  
 8   good_work         73799 non-null  int64  
 9   score_bki         73799 non-null  float64
 10  bki_request_cnt   73799 non-null  int64  
 11  region_rating     73799 non-null  int64  
 12  home_address      73799 non-null  int64  
 13  work_address      73799 non-null  int64  
 14  income            73799 non-null  int64  
 15  sna               73799 non-null  int64  
 16  first_time        73799 non-null  int64 

'Nan objects:'

client_id             0
app_date              0
education           307
sex                   0
age                   0
car                   0
car_type              0
decline_app_cnt       0
good_work             0
score_bki             0
bki_request_cnt       0
region_rating         0
home_address          0
work_address          0
income                0
sna                   0
first_time            0
foreign_passport      0
default               0
dtype: int64
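
The counts above are easier to judge as shares: 307 missing `education` values out of 73,799 rows is about 0.42%. Below is a minimal sketch of a helper that reports both the count and the percentage; the name `missing_summary` is hypothetical and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(2),
    })
    return summary[summary['missing'] > 0]


# Made-up frame for illustration only.
demo = pd.DataFrame({
    'education': ['school', np.nan, 'graduate', 'school'],
    'age': [30, 41, 25, 58],
})
demo_summary = missing_summary(demo)
print(demo_summary)

# Share of missing `education` values reported for the training set above.
train_share = round(307 / 73799 * 100, 2)
print(train_share)
```

With the notebook's data, `missing_summary(train_df)` would list only `education`, since every other column reports zero missing values.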

In [8]:
dataset_info(test_df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36349 entries, 0 to 36348
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   client_id         36349 non-null  int64  
 1   app_date          36349 non-null  object 
 2   education         36178 non-null  object 
 3   sex               36349 non-null  object 
 4   age               36349 non-null  int64  
 5   car               36349 non-null  object 
 6   car_type          36349 non-null  object 
 7   decline_app_cnt   36349 non-null  int64  
 8   good_work         36349 non-null  int64  
 9   score_bki         36349 non-null  float64
 10  bki_request_cnt   36349 non-null  int64  
 11  region_rating     36349 non-null  int64  
 12  home_address      36349 non-null  int64  
 13  work_address      36349 non-null  int64  
 14  income            36349 non-null  int64  
 15  sna               36349 non-null  int64  
 16  first_time        36349 non-null  int64 

'Nan objects:'

client_id             0
app_date              0
education           171
sex                   0
age                   0
car                   0
car_type              0
decline_app_cnt       0
good_work             0
score_bki             0
bki_request_cnt       0
region_rating         0
home_address          0
work_address          0
income                0
sna                   0
first_time            0
foreign_passport      0
dtype: int64
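
The test set has 18 columns against 19 in the training set, and the missing-value listings show that the extra training column is the target `default`. A sketch of a quick schema check on stand-in frames follows; `compare_columns` is a hypothetical helper, and with the notebook's data one would pass `train_df` and `test_df` instead.

```python
import pandas as pd


def compare_columns(train, test):
    """Return the columns that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)


# Stand-in frames with made-up rows.
demo_train = pd.DataFrame({'client_id': [1, 2], 'age': [30, 41], 'default': [0, 1]})
demo_test = pd.DataFrame({'client_id': [3], 'age': [25]})

only_in_train, only_in_test = compare_columns(demo_train, demo_test)
print(only_in_train, only_in_test)
```

Running such a check before preprocessing helps catch columns that would otherwise be encoded for one set but not the other.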