<h1 align='center'><b>Machine Learning in Business</b></h1>

<h1 align='left'>Lesson 6. The lookalike problem (Positive Unlabeled Learning)</h1>

<h2 align='center'>Homework</h2>

1. Take any dataset for binary classification (you can download one of the sample datasets from https://archive.ics.uci.edu/ml/datasets.php)
2. Do feature engineering
3. Train any classifier (whichever you like)
4. Then split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples, not all of them
5. Apply random negative sampling to build a classifier under the new conditions
6. Compare the quality with the solution from step 4 (build a report: a table of metrics)
7. Experiment with the share of P in step 5 (how the model quality changes as the size of P decreases/increases)

<b>Bonus question:</b>

Which method do you think is preferable in practice: random negative sampling or the 2-step approach?

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score, precision_recall_curve


In [2]:
import warnings
warnings.filterwarnings('ignore')

### 1. Take any dataset for binary classification (you can download one of the sample datasets from https://archive.ics.uci.edu/ml/datasets.php)

We will use the dataset from here: https://www.kaggle.com/davinwijaya/customer-retention

In [3]:
df = pd.read_csv('./data/churn_data.csv')
df.head(3)

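|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |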


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


### 2. Do feature engineering

The data contains both categorical and continuous features. The CustomerId field needs to be dropped.  
We encode the categorical features with one-hot encoding.  
We scale the continuous features with StandardScaler.  
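
The same plan can also be expressed with scikit-learn's `ColumnTransformer`, fitting the encoder and the scaler on the training rows only. This is a minimal sketch on a made-up toy frame: the values and the names `toy` and `preprocess` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data that mimics a few columns of the churn dataset.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France',
                  'Spain', 'Germany', 'France', 'Spain'],
    'Gender': ['Female', 'Male', 'Male', 'Female',
               'Male', 'Female', 'Male', 'Female'],
    'Age': [42, 41, 39, 50, 27, 44, 31, 36],
    'Balance': [0.0, 80000.0, 160000.0, 0.0, 125000.0, 110000.0, 0.0, 95000.0],
    'Exited': [1, 0, 1, 0, 0, 1, 0, 1],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

preprocess = ColumnTransformer([
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros.
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Geography', 'Gender']),
    ('num', StandardScaler(), ['Age', 'Balance']),
])

# Fit on the training rows only, then reuse the fitted transformer on the test rows.
toy_train_t = preprocess.fit_transform(toy_train.drop(columns='Exited'))
toy_test_t = preprocess.transform(toy_test.drop(columns='Exited'))
print(toy_train_t.shape, toy_test_t.shape)
```

Fitting on the training split keeps test-set statistics out of the scaler, which is the main reason to prefer this layout over transforming the whole frame at once.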

In [5]:
df = df.drop(['CustomerId'], axis=1)

Define the feature lists

In [6]:
categorical_columns = ['Geography', 'Gender', 'Tenure', 'HasCrCard', 'IsActiveMember', 'Surname']
continuous_columns = ['CreditScore', 'Age', 'Balance', 'NumOfProducts', 'EstimatedSalary']

In [7]:
one_hot = pd.get_dummies(df[categorical_columns])
df = df.drop(categorical_columns, axis=1)
df = df.join(one_hot)

In [8]:
scaler = StandardScaler()

# scale the continuous features in place (the scaler is fitted on the whole dataset here)
df[continuous_columns] = scaler.fit_transform(df[continuous_columns])

In [9]:
df.head()

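|   | RowNumber | CreditScore | Age | Balance | NumOfProducts | EstimatedSalary | Exited | Tenure | HasCrCard | IsActiveMember | ... | Surname_Zinachukwudi | Surname_Zito | Surname_Zotov | Surname_Zotova | Surname_Zox | Surname_Zubarev | Surname_Zubareva | Surname_Zuev | Surname_Zuyev | Surname_Zuyeva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.326221 | 0.293517 | -1.225848 | -0.911583 | 0.021886 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | -0.440036 | 0.198164 | 0.11735 | -0.911583 | 0.216534 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | -1.536794 | 0.293517 | 1.333053 | 2.527057 | 0.240687 | 1 | 8 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0.501521 | 0.007457 | -1.225848 | 0.807737 | -0.108918 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 2.063884 | 0.388871 | 0.785728 | -0.911583 | -0.365276 | 0 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |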


### 3. Train any classifier (whichever you like)

In [10]:
x_data = df.drop(['Exited'], axis=1)
y_data = df['Exited']

In [11]:
# split the data into train/test
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, random_state=0)

We use __RandomForestClassifier__ as the classifier

In [12]:
model_rf = RandomForestClassifier(random_state=42)

model_rf.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

Check the quality

In [13]:
# our predictions for the test set
preds_rf = model_rf.predict_proba(X_test)[:, 1]
preds_rf[:10]

array([0.24, 0.23, 0.13, 0.01, 0.08, 0.49, 0.03, 0.06, 0.2 , 0.53])

In [14]:
b = 1
precision, recall, thresholds = precision_recall_curve(y_test, preds_rf)
roc_auc = roc_auc_score(y_true=y_test, y_score=preds_rf)
fscore = (1 + b**2) * (precision * recall) / (b**2 * precision + recall)
# locate the index of the largest F-score
ix = np.argmax(fscore)
print('Best Threshold=%f, Precision=%.3f, Recall=%.3f, Roc_Auc=%.3f, F-Score=%.3f' % (thresholds[ix],
                                                                                      precision[ix],
                                                                                      recall[ix],
                                                                                      roc_auc,
                                                                                      fscore[ix]))
# keep the baseline metrics for the summary table
# (no trailing commas, so these stay scalars instead of one-element tuples)
threshold_rf = thresholds[ix]
precision_rf = precision[ix]
recall_rf = recall[ix]
roc_auc_rf = roc_auc
fscore_rf = fscore[ix]

Best Threshold=0.310000, Precision=0.664, Recall=0.609, Roc_Auc=0.852, F-Score=0.635
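
The same threshold search is repeated for every model in this notebook, so it can be factored into one helper. The sketch below is an assumption about how that might look: the name `best_f_threshold` and the dictionary layout are hypothetical, and the demo labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score


def best_f_threshold(y_true, y_score, beta=1.0):
    """Return the metrics at the threshold that maximizes F-beta."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more element than thresholds; drop the final
    # (precision=1, recall=0) point, which has no threshold attached.
    precision, recall = precision[:-1], recall[:-1]
    denom = beta**2 * precision + recall
    # Guard against 0/0 where both precision and recall are zero.
    fscore = np.divide((1 + beta**2) * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    ix = int(np.argmax(fscore))
    return {
        'threshold': float(thresholds[ix]),
        'precision': float(precision[ix]),
        'recall': float(recall[ix]),
        'roc_auc': float(roc_auc_score(y_true, y_score)),
        'fscore': float(fscore[ix]),
    }


# Tiny made-up example: labels and scores are illustrative only.
y_demo = np.array([0, 0, 1, 1])
s_demo = np.array([0.1, 0.4, 0.35, 0.8])
demo_metrics = best_f_threshold(y_demo, s_demo)
print(demo_metrics)
```

Returning plain floats in a dictionary makes it straightforward to collect one row per model and pass the list to `pd.DataFrame` for the final report.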


### 4. Then split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples, not all of them. We will experiment with the share of P, taking 0.1, 0.5 and 0.9

In [15]:
df.head()

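|   | RowNumber | CreditScore | Age | Balance | NumOfProducts | EstimatedSalary | Exited | Tenure | HasCrCard | IsActiveMember | ... | Surname_Zinachukwudi | Surname_Zito | Surname_Zotov | Surname_Zotova | Surname_Zox | Surname_Zubarev | Surname_Zubareva | Surname_Zuev | Surname_Zuyev | Surname_Zuyeva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.326221 | 0.293517 | -1.225848 | -0.911583 | 0.021886 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | -0.440036 | 0.198164 | 0.11735 | -0.911583 | 0.216534 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | -1.536794 | 0.293517 | 1.333053 | 2.527057 | 0.240687 | 1 | 8 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0.501521 | 0.007457 | -1.225848 | 0.807737 | -0.108918 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 2.063884 | 0.388871 | 0.785728 | -0.911583 | -0.365276 | 0 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |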


### perc = 0.1

In [16]:
mod_data = X_train.copy()
mod_data['lable'] = y_train
mod_data = mod_data.reset_index(drop=True)

# indices of the positive samples
pos_ind = np.where(mod_data['lable'].values == 1)[0]
# shuffle them
np.random.shuffle(pos_ind)
# keep only a fraction `perc` of the positives labeled
perc = 0.1
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and leaving the rest unlabeled')
pos_sample = pos_ind[:pos_sample_len]

Using 153/1528 as positives and leaving the rest unlabeled
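
Because `np.random.shuffle` is called without a seed, each run hides a different subset of positives. A seeded variant of the labeling step could look like the sketch below; the function name `make_pu_labels` and its parameters are hypothetical, and the toy labels are made up.

```python
import numpy as np


def make_pu_labels(y, labeled_frac, seed=0):
    """Keep `labeled_frac` of the positives labeled (1); mark everything else unlabeled (-1)."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos_ind = np.flatnonzero(y == 1)
    n_labeled = int(np.ceil(labeled_frac * len(pos_ind)))
    # Draw the labeled positives without replacement.
    labeled = rng.choice(pos_ind, size=n_labeled, replace=False)
    pu = np.full(len(y), -1)
    pu[labeled] = 1
    return pu


y_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])  # 6 positives, made-up labels
pu_toy = make_pu_labels(y_toy, labeled_frac=0.5, seed=42)
print((pu_toy == 1).sum(), (pu_toy == -1).sum())  # 3 labeled, 7 unlabeled
```

With a fixed `seed`, the three experiments (0.1, 0.5, 0.9) become repeatable, which makes the comparison between them easier to trust.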


Create a column for the new target variable with two classes: P (1) and U (-1)

In [17]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    7347
 1     153
Name: class_test, dtype: int64


In [18]:
mod_data.head(10)

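|   | RowNumber | CreditScore | Age | Balance | NumOfProducts | EstimatedSalary | Tenure | HasCrCard | IsActiveMember | Geography_France | ... | Surname_Zotov | Surname_Zotova | Surname_Zox | Surname_Zubarev | Surname_Zubareva | Surname_Zuev | Surname_Zuyev | Surname_Zuyeva | lable | class_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2968 | -0.740092 | 0.007457 | 0.662679 | 2.527057 | -1.639074 | 5 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 |
| 1 | 701 | 1.029206 | -0.660018 | -1.225848 | 0.807737 | -0.077881 | 5 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 2 | 3482 | 0.811924 | -0.469311 | -0.371603 | 0.807737 | -0.995247 | 9 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 3 | 1622 | 0.398053 | -0.087897 | -0.02261 | -0.911583 | -1.590021 | 5 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 |
| 4 | 801 | -0.471076 | 1.247053 | -1.225848 | 0.807737 | 1.284391 | 7 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 5 | 8943 | 0.170424 | -0.183251 | -0.075311 | 0.807737 | -0.562629 | 9 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 6 | 3981 | 0.232504 | 2.48665 | -1.225848 | -0.911583 | -0.249652 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 7 | 3664 | 0.76019 | -0.755372 | -1.225848 | -0.911583 | 0.605132 | 5 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 |
| 8 | 7879 | 0.832617 | -0.087897 | 0.756894 | 0.807737 | 1.238974 | 10 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 |
| 9 | 2331 | -1.723036 | 0.007457 | -1.225848 | 0.807737 | 1.306503 | 5 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |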


### 5. Apply random negative sampling to build a classifier under the new conditions

In [19]:
mod_data = mod_data.sample(frac=1)

data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(153, 2948) (153, 2948)
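
The sampling step repeats unchanged for each share of P, so it can be wrapped in a function. This is a sketch under assumptions: `random_negative_sample` is a hypothetical name, the column name `class_test` follows the notebook, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def random_negative_sample(data, pu_col='class_test', seed=0):
    """Build a balanced training set: all labeled positives plus an equally
    sized random draw from the unlabeled rows, relabeled as negatives (0).

    Assumes there are at least as many unlabeled rows as labeled positives.
    """
    positives = data[data[pu_col] == 1]
    unlabeled = data[data[pu_col] == -1]
    # Randomly chosen unlabeled rows act as pseudo-negatives.
    pseudo_neg = unlabeled.sample(n=len(positives), random_state=seed)
    train = pd.concat([positives, pseudo_neg]).sample(frac=1, random_state=seed)
    # Map P (1) -> 1 and U (-1) -> 0 so the classifier sees a 0/1 target.
    return train.assign(**{pu_col: (train[pu_col] == 1).astype(int)})


# Made-up frame: 3 labeled positives and 7 unlabeled rows.
toy_pu = pd.DataFrame({
    'x': np.arange(10, dtype=float),
    'class_test': [1, -1, -1, 1, -1, -1, -1, 1, -1, -1],
})
pu_train = random_negative_sample(toy_pu, seed=42)
print(pu_train['class_test'].value_counts().to_dict())
```

Note that some pseudo-negatives are in fact hidden positives; the smaller the labeled share, the more of them end up in the negative class.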


In [20]:
sample_train

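|   | RowNumber | CreditScore | Age | Balance | NumOfProducts | EstimatedSalary | Tenure | HasCrCard | IsActiveMember | Geography_France | ... | Surname_Zotov | Surname_Zotova | Surname_Zox | Surname_Zubarev | Surname_Zubareva | Surname_Zuev | Surname_Zuyev | Surname_Zuyeva | lable | class_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 585 | 6897 | -0.564197 | 0.293517 | -1.225848 | 0.807737 | 0.373483 | 7 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 6782 | 4998 | 0.739496 | 0.388871 | -1.225848 | -0.911583 | -0.965612 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 77 | 9769 | 0.656722 | 1.437761 | 0.985678 | 0.807737 | -0.848265 | 4 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1884 | 5482 | -0.760786 | 0.484225 | 1.211654 | -0.911583 | -0.962842 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2371 | 2189 | 0.594642 | -0.660018 | 1.779037 | -0.911583 | 0.348219 | 8 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6027 | 7660 | 0.304932 | -0.755372 | 0.589898 | -0.911583 | 0.392113 | 10 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 4179 | 3053 | -0.450383 | -0.373958 | -1.225848 | 0.807737 | 0.245186 | 2 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 5905 | 7163 | -0.450383 | -0.469311 | 0.896782 | -0.911583 | 1.350386 | 9 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| 1363 | 5915 | 1.070593 | -1.136786 | 0.658593 | 0.807737 | -0.211837 | 7 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |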


In [21]:
model_rf_mod = RandomForestClassifier(random_state=42)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model_rf_mod.fit(sample_train.drop(columns=['class_test', 'lable']),
                sample_train['class_test'])

preds_rf_mod = model_rf_mod.predict_proba(X_test)[:, 1]

b=1
precision, recall, thresholds = precision_recall_curve(y_test, preds_rf_mod)
roc_auc = roc_auc_score(y_true=y_test, y_score=preds_rf_mod)
fscore = (1+b**2)*(precision * recall) / (b**2*precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%f, Precision=%.3f, Recall=%.3f, Roc_Auc=%.3f, F-Score=%.3f' % (thresholds[ix],
                                                                                      precision[ix],
                                                                                      recall[ix],
                                                                                      roc_auc,
                                                                                      fscore[ix]))

# store scalars (a trailing comma would turn each value into a one-element tuple)
threshold_mod_1 = thresholds[ix]
precision_mod_1 = precision[ix]
recall_mod_1 = recall[ix]
roc_auc_mod_1 = roc_auc
fscore_mod_1 = fscore[ix]

Best Threshold=0.550000, Precision=0.454, Recall=0.556, Roc_Auc=0.781, F-Score=0.500


### perc = 0.5

In [22]:
mod_data = X_train.copy()
mod_data['lable'] = y_train
mod_data = mod_data.reset_index(drop=True)

# indices of the positive samples
pos_ind = np.where(mod_data['lable'].values == 1)[0]
# shuffle them
np.random.shuffle(pos_ind)
# keep only a fraction `perc` of the positives labeled
perc = 0.5
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and leaving the rest unlabeled')
pos_sample = pos_ind[:pos_sample_len]

Using 764/1528 as positives and leaving the rest unlabeled


In [23]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    6736
 1     764
Name: class_test, dtype: int64


In [24]:
mod_data = mod_data.sample(frac=1)

data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(764, 2948) (764, 2948)


In [25]:
model_rf_mod = RandomForestClassifier(random_state=42)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model_rf_mod.fit(sample_train.drop(columns=['class_test', 'lable']),
                sample_train['class_test'])

preds_rf_mod = model_rf_mod.predict_proba(X_test)[:, 1]

b=1
precision, recall, thresholds = precision_recall_curve(y_test, preds_rf_mod)
roc_auc = roc_auc_score(y_true=y_test, y_score=preds_rf_mod)
fscore = (1+b**2)*(precision * recall) / (b**2*precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%f, Precision=%.3f, Recall=%.3f, Roc_Auc=%.3f, F-Score=%.3f' % (thresholds[ix],
                                                                                      precision[ix],
                                                                                      recall[ix],
                                                                                      roc_auc,
                                                                                      fscore[ix]))

# store scalars (a trailing comma would turn each value into a one-element tuple)
threshold_mod_2 = thresholds[ix]
precision_mod_2 = precision[ix]
recall_mod_2 = recall[ix]
roc_auc_mod_2 = roc_auc
fscore_mod_2 = fscore[ix]

Best Threshold=0.590000, Precision=0.543, Recall=0.644, Roc_Auc=0.831, F-Score=0.589


### perc = 0.9

In [26]:
mod_data = X_train.copy()
mod_data['lable'] = y_train
mod_data = mod_data.reset_index(drop=True)

# indices of the positive samples
pos_ind = np.where(mod_data['lable'].values == 1)[0]
# shuffle them
np.random.shuffle(pos_ind)
# keep only a fraction `perc` of the positives labeled
perc = 0.9
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and leaving the rest unlabeled')
pos_sample = pos_ind[:pos_sample_len]

Using 1376/1528 as positives and leaving the rest unlabeled


In [27]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    6124
 1    1376
Name: class_test, dtype: int64


In [28]:
mod_data = mod_data.sample(frac=1)

data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(1376, 2948) (1376, 2948)


In [29]:
model_rf_mod = RandomForestClassifier(random_state=42)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model_rf_mod.fit(sample_train.drop(columns=['class_test', 'lable']),
                sample_train['class_test'])

preds_rf_mod = model_rf_mod.predict_proba(X_test)[:, 1]

b=1
precision, recall, thresholds = precision_recall_curve(y_test, preds_rf_mod)
roc_auc = roc_auc_score(y_true=y_test, y_score=preds_rf_mod)
fscore = (1+b**2)*(precision * recall) / (b**2*precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%f, Precision=%.3f, Recall=%.3f, Roc_Auc=%.3f, F-Score=%.3f' % (thresholds[ix],
                                                                                      precision[ix],
                                                                                      recall[ix],
                                                                                      roc_auc,
                                                                                      fscore[ix]))

# store scalars (a trailing comma would turn each value into a one-element tuple)
threshold_mod_3 = thresholds[ix]
precision_mod_3 = precision[ix]
recall_mod_3 = recall[ix]
roc_auc_mod_3 = roc_auc
fscore_mod_3 = fscore[ix]

Best Threshold=0.670000, Precision=0.684, Recall=0.544, Roc_Auc=0.850, F-Score=0.606


### 6. Compare the quality with the solution from step 4 (build a report: a table of metrics)

In [30]:
summary_table = pd.DataFrame({
    'Version': ['RandomForestClassifier', 'RandomForestClassifier + PU(0.1)', 'RandomForestClassifier + PU(0.5)', 'RandomForestClassifier + PU(0.9)'],
    'Threshold': [threshold_rf, threshold_mod_1, threshold_mod_2, threshold_mod_3],
    'Precision': [precision_rf, precision_mod_1, precision_mod_2, precision_mod_3],
    'Recall': [recall_rf, recall_mod_1, recall_mod_2, recall_mod_3],
    'ROC-AUC': [roc_auc_rf, roc_auc_mod_1, roc_auc_mod_2, roc_auc_mod_3],
    'F-Score': [fscore_rf, fscore_mod_1, fscore_mod_2, fscore_mod_3],
})
summary_table

|   | Version | Threshold | Precision | Recall | ROC-AUC | F-Score |
|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.31 | 0.663812 | 0.609037 | 0.851757 | 0.635246 |
| 1 | RandomForestClassifier + PU(0.1) | 0.55 | 0.453526 | 0.555992 | 0.780916 | 0.499559 |
| 2 | RandomForestClassifier + PU(0.5) | 0.59 | 0.543046 | 0.644401 | 0.830752 | 0.589398 |
| 3 | RandomForestClassifier + PU(0.9) | 0.67 | 0.683951 | 0.544204 | 0.850173 | 0.606127 |


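__CONCLUSION:__  
As the share of labeled positives grows (0.1, 0.5, 0.9), __ROC-AUC__ (0.781, 0.831, 0.850) and __F-Score__ (0.500, 0.589, 0.606) approach the values of the initially trained model (0.852 and 0.635). __Precision__ also grows (0.454, 0.543, 0.684), while __Recall__ changes non-monotonically (0.556, 0.644, 0.544): both are measured at the threshold that maximizes the F-score, so they trade off against each other.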
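
A single run per share of P is noisy, because both the labeled positives and the pseudo-negatives are drawn at random. One way to check the trend is to repeat the experiment over several seeds. The sketch below does this on synthetic data from `make_classification`, which is only a stand-in for the churn dataset; the function name `pu_roc_auc`, the sample sizes and the number of seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (an assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)


def pu_roc_auc(labeled_frac, seed):
    """ROC-AUC on the true test labels of a forest trained with random negative sampling."""
    rng = np.random.default_rng(seed)
    pos_ind = np.flatnonzero(y_tr == 1)
    n_labeled = int(np.ceil(labeled_frac * len(pos_ind)))
    labeled = rng.choice(pos_ind, size=n_labeled, replace=False)
    # Everything that is not a labeled positive is treated as unlabeled.
    unlabeled = np.setdiff1d(np.arange(len(y_tr)), labeled)
    pseudo_neg = rng.choice(unlabeled, size=n_labeled, replace=False)
    train_idx = np.concatenate([labeled, pseudo_neg])
    target = np.concatenate([np.ones(n_labeled, dtype=int),
                             np.zeros(n_labeled, dtype=int)])
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr[train_idx], target)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


# Three seeds per labeled share; report the mean and spread of ROC-AUC.
results = {
    frac: [pu_roc_auc(frac, seed) for seed in range(3)]
    for frac in (0.1, 0.5, 0.9)
}
for frac, scores in results.items():
    print(f'labeled_frac={frac}: mean ROC-AUC={np.mean(scores):.3f} '
          f'(std={np.std(scores):.3f})')
```

Reporting the spread across seeds next to the mean helps separate a real effect of the labeled share from run-to-run variation.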