# Credit scoring

There is a [tutorial competition on Kaggle](https://www.kaggle.com/competitions/bank-issues-042022/leaderboard) for this task.

## Downloading data

Download the data and import the required libraries

In [1]:
! pip install wldhx.yadisk-direct
! curl -L $(yadisk-direct https://disk.yandex.com/d/sknuSa3xoNBsDw) -o bank-issues-data.zip

Collecting wldhx.yadisk-direct
  Downloading wldhx.yadisk_direct-0.0.6-py3-none-any.whl (4.5 kB)
Installing collected packages: wldhx.yadisk-direct
Successfully installed wldhx.yadisk-direct-0.0.6
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 3473k  100 3473k    0     0   547k      0  0:00:06  0:00:06 --:--:--  732k


In [2]:
# ! unzip -qq bank-issues-data.zip

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Training data processing

In [4]:
train_data = pd.read_csv('./bank-issues-data/train.csv')
test_data = pd.read_csv('./bank-issues-data/test.csv')

In [5]:
train_data.head()

Unnamed: 0,client_id,gender,age,marital_status,job_position,credit_sum,credit_month,tariff_id,score_shk,education,living_region,monthly_income,credit_count,overdue_credit_count,open_account_flg
0,1,M,25,UNM,SPC,26389.0,10,1.32,0.584105,SCH,ОБЛ КУРСКАЯ,35000.0,2.0,0.0,1
1,2,F,37,MAR,SPC,19588.0,12,1.43,0.718935,SCH,РЕСПУБЛИКА ТАТАРСТАН,15000.0,0.0,0.0,1
2,3,F,28,UNM,SPC,53669.0,18,1.1,0.586015,GRD,МОСКВА Г,70000.0,4.0,0.0,1
3,4,M,34,MAR,SPC,26349.0,10,1.43,0.655703,SCH,СВЕРДЛОВСКАЯ ОБЛАСТЬ,42500.0,4.0,0.0,0
4,5,F,43,MAR,UMN,11589.0,10,1.1,0.271893,GRD,РЯЗАНСКАЯ ОБЛАСТЬ,20000.0,3.0,0.0,0


Data Fields:
- **client_id** - Unique identifier of the client
- **gender** - Gender
- **age** - Age (in years)
- **marital_status** - Marital status.
    Possible values:
    - UNM : Single (unmarried)
    - DIV : Divorced
    - MAR : Married
    - WID : Widower, widow
    - CIV : Civil Marriage
- **job_position** - Job.
    Possible values:
    - SPC : Non-managerial employee - professional
    - DIR : Manager of an organization
    - HSK : Housewife
    - WOI : Working for a sole proprietor
    - WRK : Non-managerial employee - worker
    - ATP : Non-managerial employee - service personnel
    - WRP : Retired worker
    - UMN : Unit Manager
    - NOR : Not working
    - NS : Retired
    - BIS : Own business
    - INP : Individual entrepreneur
- **credit_sum** - Loan amount
- **credit_month** - Credit period in months
- **tariff_id** - Number of tariff offered
- **education** - Type of education.
    Possible values:
    - SCH : Primary, Secondary
    - PGR : Secondary
    - GRD : Graduate
    - UGR : Undergraduate
    - ACD : Advanced degree
- **living_region** - Region of residence
- **monthly_income** - Salary per month
- **credit_count** - Number of loans a client has.
- **overdue_credit_count** - Number of overdue loans of the client
- **open_account_flg** - Target variable: whether the client will choose our bank or not.

Let's separate the target variable from the features:

In [6]:
y_train = train_data['open_account_flg']
train_data = train_data.drop(columns=['open_account_flg'])

In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119518 entries, 0 to 119517
Data columns (total 14 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   client_id             119518 non-null  int64  
 1   gender                119518 non-null  object 
 2   age                   119518 non-null  int64  
 3   marital_status        119518 non-null  object 
 4   job_position          119518 non-null  object 
 5   credit_sum            119518 non-null  float64
 6   credit_month          119518 non-null  int64  
 7   tariff_id             119518 non-null  float64
 8   score_shk             119518 non-null  float64
 9   education             119518 non-null  object 
 10  living_region         119385 non-null  object 
 11  monthly_income        119518 non-null  float64
 12  credit_count          113032 non-null  float64
 13  overdue_credit_count  113032 non-null  float64
dtypes: float64(6), int64(3), object(5)
memory usage: 12.

### Filling in missing values

Once again, let's display the training data for easy reference:

In [8]:
train_data.head()

Unnamed: 0,client_id,gender,age,marital_status,job_position,credit_sum,credit_month,tariff_id,score_shk,education,living_region,monthly_income,credit_count,overdue_credit_count
0,1,M,25,UNM,SPC,26389.0,10,1.32,0.584105,SCH,ОБЛ КУРСКАЯ,35000.0,2.0,0.0
1,2,F,37,MAR,SPC,19588.0,12,1.43,0.718935,SCH,РЕСПУБЛИКА ТАТАРСТАН,15000.0,0.0,0.0
2,3,F,28,UNM,SPC,53669.0,18,1.1,0.586015,GRD,МОСКВА Г,70000.0,4.0,0.0
3,4,M,34,MAR,SPC,26349.0,10,1.43,0.655703,SCH,СВЕРДЛОВСКАЯ ОБЛАСТЬ,42500.0,4.0,0.0
4,5,F,43,MAR,UMN,11589.0,10,1.1,0.271893,GRD,РЯЗАНСКАЯ ОБЛАСТЬ,20000.0,3.0,0.0


Let's see which columns have missing values in the training data:

In [9]:
train_data.isna().any()

client_id               False
gender                  False
age                     False
marital_status          False
job_position            False
credit_sum              False
credit_month            False
tariff_id               False
score_shk               False
education               False
living_region            True
monthly_income          False
credit_count             True
overdue_credit_count     True
dtype: bool

Let's also see which columns have missing values in the test data:

In [10]:
test_data.isna().any()

client_id               False
gender                  False
age                     False
marital_status          False
job_position            False
credit_sum              False
credit_month            False
tariff_id               False
score_shk               False
education               False
living_region            True
monthly_income           True
credit_count             True
overdue_credit_count     True
dtype: bool

We see that the test data has gaps in the monthly_income column, while the training data has none in this column. We will need to decide how to fill in the gaps in this column.

**WARNING**: it is important to remember that sometimes there are gaps in the data that pandas does not catch. For example, a missing value in the categorical column "job_position" may be stored as an empty string '' or a single space ' ' instead of NaN. Pandas will not report such values as missing.
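To make the warning concrete, here is a minimal sketch of how such hidden gaps could be surfaced. The toy values below are made up for illustration and are not taken from the competition data; the idea is simply to convert empty or whitespace-only strings to NaN before counting.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): two "hidden" gaps and one real one.
toy = pd.DataFrame({"job_position": ["SPC", "", " ", None, "UMN"]})

# isna() only sees the real None/NaN value.
visible_missing = int(toy["job_position"].isna().sum())

# Treat empty or whitespace-only strings as missing as well.
cleaned = toy["job_position"].replace(r"^\s*$", np.nan, regex=True)
total_missing = int(cleaned.isna().sum())

print("missing according to isna():", visible_missing)
print("missing after cleaning:     ", total_missing)
```

The same check can be run on every object column of the real data before choosing an imputation strategy.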

**How can you fill in the gaps in the columns?**



#### Options for filling in gaps

- Fill in the blanks with some statistics on the data (mean/median/0/...). [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) from Sklearn can help;

    **Question**: why is the median preferred over the mean?

- Use more sophisticated strategies. For example, [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) or [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html);

- Add a separate "missing" indicator in place of the missing values. The [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html#sklearn.impute.MissingIndicator) from Sklearn will help;
- Sometimes it is better to remove rows with many missing values from the training data. The logic is as follows: if little is known about an object, it does not provide useful information for the model. Please note that you cannot delete any rows from the test data!
- Remove the feature with missing values altogether. This may make sense if too many of its values are missing.
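As an illustration of the first and third options, the sketch below compares mean and median imputation on a toy column with one extreme value, and uses the `add_indicator=True` parameter of SimpleImputer to append a missing-value flag. The numbers are invented for the example; this is not the notebook's own data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy income column (hypothetical numbers): one outlier and one gap.
income = np.array([[20000.0], [25000.0], [30000.0], [1000000.0], [np.nan]])

mean_imputer = SimpleImputer(strategy="mean")
# add_indicator=True appends a 0/1 column marking the rows that were missing.
median_imputer = SimpleImputer(strategy="median", add_indicator=True)

mean_filled = mean_imputer.fit_transform(income)
median_filled = median_imputer.fit_transform(income)

# The mean is dragged up by the outlier, the median is not.
print("filled with mean:  ", mean_filled[4, 0])    # 268750.0
print("filled with median:", median_filled[4, 0])  # 27500.0
print("indicator column:  ", median_filled[:, 1])
```

This also hints at the answer to the question above: a single extreme value shifts the mean far from typical values, while the median stays close to them.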

#### Filling in missing values in our data

Let's see how many missing values there are in the credit_count and overdue_credit_count columns:

In [11]:
train_data['credit_count'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 119518 entries, 0 to 119517
Series name: credit_count
Non-Null Count   Dtype  
--------------   -----  
113032 non-null  float64
dtypes: float64(1)
memory usage: 933.9 KB


In [12]:
np.unique(train_data['credit_count'], return_counts=True)

(array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
        13., 14., 15., 16., 17., 18., 19., nan]),
 array([18219, 31714, 25799, 16707,  9826,  5285,  2760,  1345,   707,
          326,   141,    90,    56,    25,    12,    10,     2,     4,
            1,     3,  6486]))

Let's fill in the gaps in credit_count with the median:

In [13]:
# manual filling of missing values with a statistic
# (fillna returns a new Series, so the result has to be assigned back)

# median_value_credit_count = np.median(train_data['credit_count'].dropna())
# train_data['credit_count'] = train_data['credit_count'].fillna(median_value_credit_count)

In [14]:
from sklearn.impute import SimpleImputer

simple_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
simple_imputer.fit(train_data[['credit_count']])
train_data['credit_count'] = simple_imputer.transform(train_data[['credit_count']])

In [15]:
train_data['credit_count'].isna().any()

False

In [16]:
train_data['job_position'].dtype

dtype('O')

In [17]:
train_data['age'].dtype

dtype('int64')

In [18]:
numeric_columns = [column for column in train_data.columns if train_data[column].dtype != 'O']
numeric_columns

['client_id',
 'age',
 'credit_sum',
 'credit_month',
 'tariff_id',
 'score_shk',
 'monthly_income',
 'credit_count',
 'overdue_credit_count']

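Let's fill in the gaps in overdue_credit_count using KNNImputer:

In [19]:
from sklearn.impute import KNNImputer

# KNNImputer returns a NumPy array whose columns follow the order of numeric_columns,
# so look up the position of overdue_credit_count instead of hard-coding it
overdue_idx = numeric_columns.index('overdue_credit_count')

knn_imputer = KNNImputer(n_neighbors=2)
knn_imputer.fit(train_data[numeric_columns])
transformed_overdue_credit_count = knn_imputer.transform(train_data[numeric_columns])[:, overdue_idx]
train_data['overdue_credit_count'] = transformed_overdue_credit_count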

In [20]:
train_data['overdue_credit_count'].isna().any()

False

### Processing categorical features

In [21]:
train_data.head()

Unnamed: 0,client_id,gender,age,marital_status,job_position,credit_sum,credit_month,tariff_id,score_shk,education,living_region,monthly_income,credit_count,overdue_credit_count
0,1,M,25,UNM,SPC,26389.0,10,1.32,0.584105,SCH,ОБЛ КУРСКАЯ,35000.0,2.0,0.0
1,2,F,37,MAR,SPC,19588.0,12,1.43,0.718935,SCH,РЕСПУБЛИКА ТАТАРСТАН,15000.0,0.0,0.0
2,3,F,28,UNM,SPC,53669.0,18,1.1,0.586015,GRD,МОСКВА Г,70000.0,4.0,0.0
3,4,M,34,MAR,SPC,26349.0,10,1.43,0.655703,SCH,СВЕРДЛОВСКАЯ ОБЛАСТЬ,42500.0,4.0,0.0
4,5,F,43,MAR,UMN,11589.0,10,1.1,0.271893,GRD,РЯЗАНСКАЯ ОБЛАСТЬ,20000.0,3.0,0.0


In [22]:
categorical_columns = [column for column in train_data.columns if train_data[column].dtype == 'O']
categorical_columns

['gender', 'marital_status', 'job_position', 'education', 'living_region']

In [23]:
categorical_columns = [x for x in categorical_columns if x != 'living_region']

#### One-hot encoding

In [24]:
# dummy_features = pd.get_dummies(train_data[categorical_columns])
# dummy_features.head()

One-hot encoding with OneHotEncoder from Sklearn

In [25]:
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses sparse_output; older versions used sparse=False (removed in 1.4)
ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
ohe.fit(train_data[categorical_columns])
new_category_columns = ohe.transform(train_data[categorical_columns])

In [26]:
new_category_columns

array([[1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

In [27]:
ohe.get_feature_names_out()

array(['gender_M', 'marital_status_CIV', 'marital_status_DIV',
       'marital_status_MAR', 'marital_status_UNM', 'marital_status_WID',
       'job_position_ATP', 'job_position_BIS', 'job_position_BIU',
       'job_position_DIR', 'job_position_HSK', 'job_position_INP',
       'job_position_NOR', 'job_position_PNA', 'job_position_PNI',
       'job_position_PNS', 'job_position_PNV', 'job_position_SPC',
       'job_position_UMN', 'job_position_WOI', 'job_position_WRK',
       'job_position_WRP', 'education_ACD', 'education_GRD',
       'education_PGR', 'education_SCH', 'education_UGR'], dtype=object)

In [28]:
new_train_columns = pd.DataFrame(new_category_columns, columns=ohe.get_feature_names_out())
train_data = train_data.drop(columns=categorical_columns)
train_data = pd.concat([train_data, new_train_columns], axis=1)
train_data.head()

Unnamed: 0,client_id,age,credit_sum,credit_month,tariff_id,score_shk,living_region,monthly_income,credit_count,overdue_credit_count,...,job_position_SPC,job_position_UMN,job_position_WOI,job_position_WRK,job_position_WRP,education_ACD,education_GRD,education_PGR,education_SCH,education_UGR
0,1,25,26389.0,10,1.32,0.584105,ОБЛ КУРСКАЯ,35000.0,2.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,37,19588.0,12,1.43,0.718935,РЕСПУБЛИКА ТАТАРСТАН,15000.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3,28,53669.0,18,1.1,0.586015,МОСКВА Г,70000.0,4.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,4,34,26349.0,10,1.43,0.655703,СВЕРДЛОВСКАЯ ОБЛАСТЬ,42500.0,4.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,43,11589.0,10,1.1,0.271893,РЯЗАНСКАЯ ОБЛАСТЬ,20000.0,3.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


#### Processing living_region

In [29]:
np.unique(train_data['living_region'].dropna(), return_counts=True)

(array(['74', '98', 'АДЫГЕЯ РЕСП', 'АЛТАЙСКИЙ', 'АЛТАЙСКИЙ КРАЙ',
        'АМУРСКАЯ ОБЛ', 'АМУРСКАЯ ОБЛАСТЬ', 'АО НЕНЕЦКИЙ',
        'АО ХАНТЫ-МАНСИЙСКИЙ АВТОНОМНЫЙ ОКРУГ - Ю', 'АО ЯМАЛО-НЕНЕЦКИЙ',
        'АОБЛ ЕВРЕЙСКАЯ', 'АРХАНГЕЛЬСКАЯ', 'АРХАНГЕЛЬСКАЯ ОБЛ',
        'АРХАНГЕЛЬСКАЯ ОБЛАСТЬ', 'АСТРАХАНСКАЯ', 'АСТРАХАНСКАЯ ОБЛ',
        'АСТРАХАНСКАЯ ОБЛАСТЬ', 'БАШКОРТОСТАН', 'БАШКОРТОСТАН РЕСП',
        'БЕЛГОРОДСКАЯ ОБЛ', 'БЕЛГОРОДСКАЯ ОБЛАСТЬ', 'БРЯНСКАЯ ОБЛ',
        'БРЯНСКАЯ ОБЛАСТЬ', 'БРЯНСКИЙ', 'БУРЯТИЯ', 'БУРЯТИЯ РЕСП',
        'ВЛАДИМИРСКАЯ ОБЛ', 'ВЛАДИМИРСКАЯ ОБЛАСТЬ', 'ВОЛГОГРАДСКАЯ ОБЛ',
        'ВОЛГОГРАДСКАЯ ОБЛАСТЬ', 'ВОЛОГОДСКАЯ', 'ВОЛОГОДСКАЯ ОБЛ',
        'ВОЛОГОДСКАЯ ОБЛ.', 'ВОЛОГОДСКАЯ ОБЛАСТЬ', 'ВОРОНЕЖСКАЯ ОБЛ',
        'ВОРОНЕЖСКАЯ ОБЛАСТЬ', 'Г МОСКВА', 'Г. МОСКВА',
        'Г. САНКТ-ПЕТЕРБУРГ', 'Г.МОСКВА', 'ГОРЬКОВСКАЯ ОБЛ',
        'ГУСЬ-ХРУСТАЛЬНЫЙ Р-Н', 'ДАГЕСТАН РЕСП', 'ЕВРЕЙСКАЯ АВТОНОМНАЯ',
        'ЕВРЕЙСКАЯ АОБЛ', 'ЗАБАЙКАЛЬСКИЙ КРАЙ', 'ИВАНОВСКАЯ ОБЛ',

Problems that are present, or could arise, here:

- The same region is written differently;
- There are regions that are very rare;
- Two types of situations are possible:
    - Some region is in the training data, but it is not in the test data;
    - A region is in the test data but not in the training data.
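One possible way to address these problems is sketched below: strip the tokens that only describe the region type, then merge regions that are rare in the training data (and regions unseen in training) into a single bucket. The helper names `normalize_region` and `group_rare`, the token list, and the `min_count` threshold are assumptions made for this illustration, and the rule is deliberately crude; real data will need extra cases.

```python
import re
import pandas as pd

# Tokens that only describe the region type (an assumed, incomplete list).
REGION_TYPE_TOKENS = {"ОБЛ", "ОБЛАСТЬ", "РЕСП", "РЕСПУБЛИКА", "Г", "КРАЙ", "АО", "АОБЛ"}

def normalize_region(name):
    """Map different spellings of one region to a single key (hypothetical helper)."""
    if pd.isna(name):
        return "UNKNOWN"
    tokens = re.split(r"[\s.]+", str(name).upper())
    kept = [t for t in tokens if t and t not in REGION_TYPE_TOKENS]
    return " ".join(kept) if kept else "UNKNOWN"

def group_rare(train_col, other_col, min_count):
    """Replace regions seen fewer than min_count times in train with 'OTHER'."""
    counts = train_col.value_counts()
    frequent = set(counts[counts >= min_count].index)
    return (train_col.where(train_col.isin(frequent), "OTHER"),
            other_col.where(other_col.isin(frequent), "OTHER"))

train_regions = pd.Series(["Г. МОСКВА", "МОСКВА Г", "Г.МОСКВА", "ОБЛ КУРСКАЯ", "74", None])
test_regions = pd.Series(["Г МОСКВА", "ПЕНЗЕНСКАЯ ОБЛ"])

train_norm = train_regions.map(normalize_region)
test_norm = test_regions.map(normalize_region)
train_grouped, test_grouped = group_rare(train_norm, test_norm, min_count=2)

print(train_grouped.tolist())
print(test_grouped.tolist())
```

Because the set of frequent regions is computed from the training column only, a region that appears only in the test data falls into the same "OTHER" bucket, so a downstream encoder never sees an unknown category.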

### Creating new features

#### Manual creation of new features

In [30]:
train_data.head()

Unnamed: 0,client_id,age,credit_sum,credit_month,tariff_id,score_shk,living_region,monthly_income,credit_count,overdue_credit_count,...,job_position_SPC,job_position_UMN,job_position_WOI,job_position_WRK,job_position_WRP,education_ACD,education_GRD,education_PGR,education_SCH,education_UGR
0,1,25,26389.0,10,1.32,0.584105,ОБЛ КУРСКАЯ,35000.0,2.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,37,19588.0,12,1.43,0.718935,РЕСПУБЛИКА ТАТАРСТАН,15000.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3,28,53669.0,18,1.1,0.586015,МОСКВА Г,70000.0,4.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,4,34,26349.0,10,1.43,0.655703,СВЕРДЛОВСКАЯ ОБЛАСТЬ,42500.0,4.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,43,11589.0,10,1.1,0.271893,РЯЗАНСКАЯ ОБЛАСТЬ,20000.0,3.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
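As a sketch of what manual feature creation could look like here, the example below derives two ratio features from existing columns. The feature names `monthly_payment` and `payment_to_income` are hypothetical and ignore interest; whether they help the model has to be checked on validation data.

```python
import pandas as pd

# Two rows copied from the head() output above, restricted to the columns used.
toy = pd.DataFrame({
    "credit_sum": [26389.0, 19588.0],
    "credit_month": [10, 12],
    "monthly_income": [35000.0, 15000.0],
})

# Approximate payment per month, ignoring interest (assumed feature).
toy["monthly_payment"] = toy["credit_sum"] / toy["credit_month"]
# Share of monthly income that this payment would take (assumed feature).
toy["payment_to_income"] = toy["monthly_payment"] / toy["monthly_income"]

print(toy)
```

The same transformations must then be applied to the test data, and monthly_income has to be imputed there first, since it contains gaps.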


#### Automatic creation of new features

The [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) class from sklearn will help.

### Feature filtering

The [SequentialFeatureSelector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector), [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) and other classes from the [feature_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection) sklearn module will help.
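As an illustration, here is a sketch of SelectFromModel on synthetic data. The random forest estimator and the `threshold="median"` setting are arbitrary choices for the example, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep the features whose importance is at least the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)
mask = selector.get_support()  # boolean mask over the original columns

print("kept features:", mask.nonzero()[0])
print("shape before/after:", X.shape, X_selected.shape)
```

As with imputers and encoders, the selector should be fitted on the training data only and then applied to the test data with `transform`.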

## Test data processing

In [31]:
test_data.head()

Unnamed: 0,client_id,gender,age,marital_status,job_position,credit_sum,credit_month,tariff_id,score_shk,education,living_region,monthly_income,credit_count,overdue_credit_count
0,119519,F,24,MAR,SPC,28429.0,18,1.1,0.593836,GRD,ОБЛ ИРКУТСКАЯ,33000.0,3.0,0.0
1,119520,M,25,UNM,SPC,15997.0,10,1.6,0.615015,SCH,УЛЬЯНОВСКАЯ ОБЛ,35000.0,2.0,0.0
2,119521,M,25,UNM,SPC,11043.0,10,1.16,0.666758,SCH,РЕСП БАШКОРТОСТАН,25000.0,3.0,0.0
3,119522,F,34,MAR,SPC,14617.0,10,1.4,0.447745,GRD,ПЕНЗЕНСКАЯ ОБЛ,15000.0,2.0,0.0
4,119523,M,33,MAR,SPC,38147.0,12,1.6,0.706974,UGR,ОБЛ МОСКОВСКАЯ,55000.0,1.0,0.0


Filling in the gaps in the data:

In [32]:
test_data['credit_count'] = simple_imputer.transform(test_data[['credit_count']])
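# select the imputed overdue_credit_count column by name rather than by a hard-coded position
transformed_test_data = knn_imputer.transform(test_data[numeric_columns])[:, numeric_columns.index('overdue_credit_count')]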
test_data['overdue_credit_count'] = transformed_test_data

Recall now that the test data also has missing values in the monthly_income column.

**Task**: fill in the missing values in monthly_income in the test data
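One possible approach, sketched on toy data: even though the training column has no gaps, an imputer can still be fitted on it, so that the test gaps are filled with a statistic computed from the training data only. The frames `toy_train` and `toy_test` and the name `income_imputer` are stand-ins invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-ins for the real frames (income values taken from the head() outputs above, plus one gap).
toy_train = pd.DataFrame({"monthly_income": [35000.0, 15000.0, 70000.0, 42500.0, 20000.0]})
toy_test = pd.DataFrame({"monthly_income": [33000.0, np.nan, 25000.0]})

# Fit on train (no gaps there), then apply to test.
income_imputer = SimpleImputer(strategy="median")
income_imputer.fit(toy_train[["monthly_income"]])
toy_test["monthly_income"] = income_imputer.transform(toy_test[["monthly_income"]]).ravel()

print(toy_test)
```

Other strategies from the list of options above (for example, KNNImputer over the numeric columns) are equally valid candidates; which one works better is an empirical question.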

Processing categorical features:

In [33]:
new_category_columns_test = ohe.transform(test_data[categorical_columns])
new_test_columns = pd.DataFrame(new_category_columns_test, columns=ohe.get_feature_names_out())
test_data = test_data.drop(columns=categorical_columns)
test_data = pd.concat([test_data, new_test_columns], axis=1)
test_data.head()

Unnamed: 0,client_id,age,credit_sum,credit_month,tariff_id,score_shk,living_region,monthly_income,credit_count,overdue_credit_count,...,job_position_SPC,job_position_UMN,job_position_WOI,job_position_WRK,job_position_WRP,education_ACD,education_GRD,education_PGR,education_SCH,education_UGR
0,119519,24,28429.0,18,1.1,0.593836,ОБЛ ИРКУТСКАЯ,33000.0,3.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,119520,25,15997.0,10,1.6,0.615015,УЛЬЯНОВСКАЯ ОБЛ,35000.0,2.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,119521,25,11043.0,10,1.16,0.666758,РЕСП БАШКОРТОСТАН,25000.0,3.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,119522,34,14617.0,10,1.4,0.447745,ПЕНЗЕНСКАЯ ОБЛ,15000.0,2.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,119523,33,38147.0,12,1.6,0.706974,ОБЛ МОСКОВСКАЯ,55000.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


And there is still a separate living_region feature.

**Task**: preprocess the living_region feature

## Model training and hyperparameter tuning

**Task**: train a machine learning model on the training data and select its hyperparameters.

**Question**: when should the data be split into train/val: before or after preprocessing the features of the training dataset?

Note that the quality metric in the [competition on Kaggle](https://www.kaggle.com/competitions/bank-issues-042022/leaderboard) for this task is ROC AUC.
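The sketch below shows one way to organise this step, under stated assumptions: synthetic data stands in for the processed features, logistic regression is an arbitrary model choice, and the grid over `C` is just an example. Preprocessing lives inside a Pipeline, so that within each cross-validation fold the imputer and scaler are fitted on the training part only, which is one common way to avoid leaking validation information into preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix, with about 5% of values missing.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Hold out a validation part before anything is fitted.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Select the regularisation strength by cross-validated ROC AUC.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=3)
search.fit(X_tr, y_tr)

# ROC AUC needs scores or probabilities, not hard class labels.
val_auc = roc_auc_score(y_val, search.predict_proba(X_val)[:, 1])
print("best params:", search.best_params_)
print("validation ROC AUC:", round(val_auc, 3))
```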

## Obtaining predictions on test data

**Task**: get model predictions on test_data. Submit your predictions to the [Kaggle competition](https://www.kaggle.com/competitions/bank-issues-042022/leaderboard).
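A sketch of assembling a submission file. The column names and the file name here are assumptions; the exact format expected by the competition should be checked against its sample submission. Since the metric is ROC AUC, predicted probabilities of the positive class are written rather than hard 0/1 labels.

```python
import numpy as np
import pandas as pd

# Stand-ins: in the notebook these would come from test_data['client_id']
# and from model.predict_proba(X_test)[:, 1] of a fitted model.
client_ids = pd.Series([119519, 119520, 119521], name="client_id")
predicted_proba = np.array([0.12, 0.80, 0.35])

# Assumed submission layout: one id column and one score column.
submission = pd.DataFrame({
    "client_id": client_ids,
    "open_account_flg": predicted_proba,
})

csv_text = submission.to_csv(index=False)
print(csv_text)
# To write a file instead: submission.to_csv("submission.csv", index=False)
```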