# Несбалансированные выборки. Практическая работа

## Цель практической работы

Научиться обрабатывать несбалансированные данные и обучать модели машинного обучения на таких данных.

## Что входит в работу


1. Загрузить данные и провести разведочный анализ.
2. Разделить данные на обучающую и тестовую выборки.
3. Подготовить данные для моделирования.
4. Сбалансировать данные методом SMOTE и обучить модель машинного обучения.
5. Обучить модель машинного обучения с использованием весов классов и кросс-валидации.
6. Сравнить метрики качества четырёх моделей.




## Что оценивается 

- Выполнены все этапы работы.
- Не допущена утечка данных при разделении выборок и подготовке данных.
- Данные корректным образом сбалансированы.
- Модели не переобучены.


## Как отправить работу на проверку

Скачайте файл с заданиями в материалах, откройте его через Jupyter Notebook и выполните задания. Сохраните изменения при помощи опции Save and Checkpoint из вкладки меню File или кнопки Save and Checkpoint на панели инструментов. Отправьте через форму ниже итоговый файл Jupyter Notebook (в формате .ipynb) или ссылку на него.


# Задача

Пусть у нас имеется некоторый набор данных `german_credit_data.csv` о заёмщиках банка:

* Age — возраст заёмщика.
* Sex — пол заёмщика.
* Job — тип работы заёмщика.
* Housing — тип жилья заёмщика.
* Saving accounts — объём средств на сберегательных счетах заёмщика.
* Checking account — объём средств на основном счёте заёмщика.
* Credit amount — размер кредита. 
* Duration — срок кредита (в месяцах).
* Purpose — цель кредита.
* Risk — таргет, допустил ли заёмщик просрочку платежей по кредиту.

Решите задачу классификации заёмщиков, чтобы банк умел заранее предсказывать просрочку платежей по кредиту. 



# Задание 1

Загрузите датасет и выполните разведочный анализ данных. Сделайте выводы о найденных закономерностях, особенностях и других свойствах данных, которые вы обнаружите.

In [199]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

In [160]:
df = pd.read_csv('german_credit_data.csv')
df.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad


In [161]:
df['Checking account'].value_counts()

Checking account
little      274
moderate    269
rich         63
Name: count, dtype: int64

In [162]:
df.Risk.value_counts()

Risk
good    700
bad     300
Name: count, dtype: int64

In [163]:
df.Sex.value_counts()

Sex
male      690
female    310
Name: count, dtype: int64

In [164]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 78.3+ KB


In [165]:
df.describe()

Unnamed: 0,Age,Job,Credit amount,Duration
count,1000.0,1000.0,1000.0,1000.0
mean,35.546,1.904,3271.258,20.903
std,11.375469,0.653614,2822.736876,12.058814
min,19.0,0.0,250.0,4.0
25%,27.0,2.0,1365.5,12.0
50%,33.0,2.0,2319.5,18.0
75%,42.0,2.0,3972.25,24.0
max,75.0,3.0,18424.0,72.0


In [166]:
df['Checking account'].isna()

0      False
1      False
2       True
3      False
4      False
       ...  
995     True
996    False
997     True
998    False
999    False
Name: Checking account, Length: 1000, dtype: bool

In [167]:
df['Checking account'].value_counts()

Checking account
little      274
moderate    269
rich         63
Name: count, dtype: int64

# Задание 2

Разделите датасет на обучающую и тестовую выборки в пропорции 80:20. Разделение должно быть стратифицированным по таргету Risk.

В этом и следующих заданиях используйте random_state = 1.

In [168]:
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.Risk, random_state=1)

# Задание 3

Проведите этап очистки и подготовки данных (data preparation) и подготовьте данные к моделированию.

In [169]:
df_train['Saving accounts'] = df_train['Saving accounts'].fillna(value=df_train['Saving accounts'].mode()[0])
df_test['Saving accounts'] = df_test['Saving accounts'].fillna(value=df_test['Saving accounts'].mode()[0])

df_train['Checking account'] = df_train['Checking account'].fillna(value=df_train['Checking account'].mode()[0])
df_test['Checking account'] = df_test['Checking account'].fillna(value=df_test['Checking account'].mode()[0])

In [170]:
df_test['Checking account'].isna().value_counts()

Checking account
False    200
Name: count, dtype: int64

In [171]:
le = LabelEncoder()

In [172]:
df_train.Risk, df_train.Sex, df_train.Housing, df_train['Checking account'] = le.fit_transform(df_train.Risk), le.fit_transform(df_train.Sex), le.fit_transform(df_train.Housing), le.fit_transform(df_train['Checking account'])
df_test.Risk, df_test.Sex, df_test.Housing, df_test['Checking account'] = le.fit_transform(df_test.Risk), le.fit_transform(df_test.Sex), le.fit_transform(df_test.Housing), le.fit_transform(df_test['Checking account'])

In [173]:
sds = StandardScaler()

df_train[['Age', 'Credit amount', 'Duration']] = sds.fit_transform(df_train[['Age', 'Credit amount', 'Duration']])

In [174]:
ohe = OneHotEncoder(sparse_output=False)

df_train[ohe.get_feature_names_out()] = ohe.fit_transform(df_train[['Saving accounts']])
df_train[ohe.get_feature_names_out()] = ohe.fit_transform(df_train[['Purpose']])
df_train = df_train.drop(['Saving accounts', 'Purpose'], axis = 1)

df_test[ohe.get_feature_names_out()] = ohe.fit_transform(df_test[['Saving accounts']])
df_test[ohe.get_feature_names_out()] = ohe.fit_transform(df_test[['Purpose']])
df_test = df_test.drop(['Saving accounts', 'Purpose'], axis = 1)

In [175]:
df_train.head()

Unnamed: 0,Age,Sex,Job,Housing,Checking account,Credit amount,Duration,Risk,Saving accounts_little,Saving accounts_moderate,Saving accounts_quite rich,Saving accounts_rich,Purpose_business,Purpose_car,Purpose_domestic appliances,Purpose_education,Purpose_furniture/equipment,Purpose_radio/TV,Purpose_repairs,Purpose_vacation/others
561,-1.034287,1,1,2,0,-0.607482,0.236012,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
613,-1.207679,0,2,2,0,0.128437,0.236012,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
352,0.179461,1,3,1,0,-0.013738,-0.255253,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
568,0.43955,1,2,1,1,0.250854,2.201073,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
295,-0.860894,0,2,1,1,2.360888,2.201073,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


# Задание 4

Сбалансируйте обучающую выборку методом SMOTE и обучите модель RandomForestClassifier. Подберите оптимальные гиперпараметры с помощью GridSearch. Посчитайте метрику ROC-AUC на тестовой выборке.

In [191]:
os = SMOTE(random_state=1, k_neighbors=2)

features = list(df_train.columns.drop('Risk'))
target = 'Risk'

X_train, y_train = os.fit_resample(df_train[features], df_train[target])

In [200]:
param_grid = {
    'n_estimators': list(range(100,500,100)),
    'criterion' : ['gini', 'entropy', 'log_loss'],
    'max_depth': list(range(10,100,10)),
    'max_features':['sqrt', 'log2'],
    'min_samples_split': [2],
    'random_state': [1]
    }
grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='f1',
    verbose=1,
    n_jobs=-1)

grid_search_rf.fit(X_train, y_train)

best_params = grid_search_rf.best_params_
best_params

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


{'criterion': 'entropy',
 'max_depth': 40,
 'max_features': 'sqrt',
 'min_samples_split': 2,
 'n_estimators': 300,
 'random_state': 1}

In [203]:
rf = RandomForestClassifier(criterion='entropy', max_depth=40, max_features='sqrt', min_samples_split=2, n_estimators= 300, random_state=1)
rf.fit(X_train, y_train)

In [206]:
X_test, y_test = df_test[features], df_test[target]
score = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(score)

0.5494047619047618


# Задание 5

Обучите модель RandomForestClassifier с использованием весов классов. Подберите оптимальные веса и гиперпараметры с помощью GridSearch. Посчитайте метрику ROC-AUC на тестовой выборке.

In [232]:
X_train, y_train = df_train.drop('Risk', axis=1), df_train.Risk
X_test, y_test = df_test.drop('Risk', axis = 1), df_test.Risk

class_weights = {
    0:1,
    1:(df_train[df_train['Risk'] == 0].shape[0] / df_train[df_train['Risk'] == 1].shape[0])**3
}

In [236]:
param_grid = {
    'n_estimators': list(range(100,500,100)),
    'criterion' : ['gini', 'entropy', 'log_loss'],
    'max_depth': list(range(10,100,10)),
    'max_features':['sqrt', 'log2'],
    'min_samples_split': [2],
    'random_state': [1],
    'class_weight': [class_weights]
    }
grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='f1_weighted', # Учитывает дисбаланс классов
    verbose=1,
    n_jobs=-1)

grid_search_rf.fit(X_train, y_train)

best_params = grid_search_rf.best_params_
best_params

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


  _data = np.array(data, dtype=dtype, copy=copy,


{'class_weight': {0: 1, 1: 0.07871720116618075},
 'criterion': 'entropy',
 'max_depth': 30,
 'max_features': 'sqrt',
 'min_samples_split': 2,
 'n_estimators': 100,
 'random_state': 1}

In [237]:
rf = RandomForestClassifier(criterion='entropy', max_depth=30, max_features='sqrt', min_samples_split=2, n_estimators= 100, random_state=1, class_weight=class_weights)
rf.fit(X_train, y_train)

In [238]:
X_test, y_test = df_test[features], df_test[target]
score = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(score)

0.5545238095238094


In [245]:
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

metrics = []
models = []

X, y = df_train[features], df_train.Risk

for train_index, test_index in kf.split(X, y):
  X_train, y_train = X.values[train_index], y.values[train_index]
  X_test, y_test = X.values[test_index], y.values[test_index]

  model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=1, class_weight='balanced')

  model.fit(X_train, y_train)
  score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
  print(score)

  metrics.append(score)
  models.append(model)  

0.6451785714285714
0.6402380952380953
0.6792857142857144
0.6173809523809524


In [246]:
sum(metrics) / len(metrics)

0.6455208333333333

In [249]:
param_grid = {
    'n_estimators': list(range(100,500,100)),
    'criterion' : ['gini', 'entropy', 'log_loss'],
    'max_depth': list(range(10,100,10)),
    'max_features':['sqrt', 'log2'],
    'min_samples_split': [2],
    'random_state': [1],
    }
grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    scoring='roc_auc', # roc_auc_score
    verbose=1,
    n_jobs=-1)

grid_search_rf.fit(X_train, y_train)

best_params = grid_search_rf.best_params_
best_params

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


{'criterion': 'gini',
 'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_split': 2,
 'n_estimators': 100,
 'random_state': 1}

In [250]:
rf = RandomForestClassifier(criterion='gini', max_depth=10, max_features='sqrt', min_samples_split=2, n_estimators= 100, random_state=1)
rf.fit(X_train, y_train)

In [251]:
X_test, y_test = df_test[features], df_test[target]
score = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(score)

0.5654761904761905


