#### Homework
1. Take any dataset for binary classification (you can download one of the benchmark datasets from https://archive.ics.uci.edu/ml/datasets.php)  
2. Do feature engineering  
3. Train any classifier (whichever you like)  
4. Then split your dataset into two sets:  
P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples as P, not all of them  
5. Apply random negative sampling to build a classifier under the new conditions  
6. Compare its quality with the solution from step 3 (build a report: a table of metrics)  
7. Experiment with the share of P in step 5 (how does model quality change as P shrinks or grows?)  

Bonus question:  

Which method do you think is preferable in practice: random negative sampling or the 2-step approach?

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score, precision_recall_curve
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import itertools
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
def evaluate_results(y_test, y_predict):
    print('Classification results:')
    f1 = f1_score(y_test, y_predict)
    print("f1: %.2f%%" % (f1 * 100.0)) 
    roc = roc_auc_score(y_test, y_predict)
    print("roc: %.2f%%" % (roc * 100.0)) 
    rec = recall_score(y_test, y_predict, average='binary')
    print("recall: %.2f%%" % (rec * 100.0)) 
    prc = precision_score(y_test, y_predict, average='binary')
    print("precision: %.2f%%" % (prc * 100.0)) 
    

def get_metrics(y_test, probs):
    precision, recall, thresholds = precision_recall_curve(y_test, probs)

    # F-score at every point of the curve; it is NaN where precision + recall == 0
    fscore = (2 * precision * recall) / (precision + recall)
    # locate the index of the largest F-score (np.argmax returns the first NaN if one is present)
    ix = np.argmax(fscore)
    roc_auc = roc_auc_score(y_test, probs)
    print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f, Roc-AUC=%.3f'
          % (thresholds[ix], fscore[ix], precision[ix], recall[ix], roc_auc))
    return roc_auc, fscore[ix], precision[ix], recall[ix], thresholds[ix]
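
The F-score in `get_metrics` becomes NaN wherever precision and recall are both zero, and `np.argmax` then selects that NaN entry. A minimal sketch of a NaN-safe variant is given below; the name `best_f1_threshold` and the toy inputs are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, probs):
    """Return (threshold, f1, precision, recall) at the best F1 (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    # The final curve point (precision=1, recall=0) has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # Where precision + recall == 0 the F-score is set to 0 instead of NaN.
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    ix = int(np.argmax(fscore))
    return thresholds[ix], fscore[ix], precision[ix], recall[ix]


# Toy labels and scores, made up for the example.
y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])
thr, f1, prec, rec = best_f1_threshold(y_true, probs)
print(thr, f1, prec, rec)
```

Dropping the last curve point also keeps `ix` a valid index into `thresholds`, which is one element shorter than `precision` and `recall`.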
    

In [3]:
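def random_negative_sampling(data, estimator, target_name, use_share_pos):
    # Expects the target to be the last column of `data`: the helper column
    # 'class_test' is appended after it, so iloc[:, -2] is the true target
    # and iloc[:, -1] is the PU label.
    mod_data = data.copy()
    positiv_ind = data[(data[target_name] == 1)].index.to_numpy()
    count_positiv = len(positiv_ind)
    count_positiv_sample = int(use_share_pos * count_positiv)
    # shuffle the positive indices and keep the first count_positiv_sample as labeled positives
    np.random.shuffle(positiv_ind)
    print(f'Using {count_positiv_sample}/{count_positiv} as positives and unlabeling the rest')
    positiv_sample = positiv_ind[:count_positiv_sample]

    # PU label: 1 = labeled positive (P), 0 = unlabeled (U)
    mod_data['class_test'] = 0
    mod_data.loc[positiv_sample, 'class_test'] = 1
    print('target variable:\n', mod_data.iloc[:, -1].value_counts())

    # shuffle the rows, draw as many unlabeled rows as there are labeled positives,
    # and keep the remaining unlabeled rows as the test set
    mod_data = mod_data.sample(frac=1)
    PU_sample = mod_data[mod_data['class_test'] == 0][:count_positiv_sample]
    sample_test = mod_data[mod_data['class_test'] == 0][count_positiv_sample:]
    pos_sample = mod_data[mod_data['class_test'] == 1]
    sample_train = pd.concat([PU_sample, pos_sample]).sample(frac=1)
    print(f'PU_sample  - {PU_sample.shape}, pos_sample - {pos_sample.shape},\
        sample_test -{sample_test.shape}, sample_train -{sample_train.shape}')

    # NOTE: the model is fitted on the true target (iloc[:, -2]), not on 'class_test'
    estimator.fit(sample_train.iloc[:, :-2],
                  sample_train.iloc[:, -2])
    y_predict = estimator.predict_proba(sample_test.iloc[:, :-2].values)[:, 1]
    # evaluate against the true target of the held-out unlabeled rows
    return get_metrics(sample_test.iloc[:, -2].values, y_predict)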
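
`random_negative_sampling` fits the estimator on the original target column, so the sampled unlabeled rows keep their true labels. In a stricter PU setting only the P/U label would be available for training. A minimal sketch of that variant follows, assuming synthetic data from `make_classification`; the function name `pu_random_negative_sampling` and all parameter values are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def pu_random_negative_sampling(X, y_true, labeled_share, estimator, seed=0):
    """Train on PU labels only, evaluate on true labels (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y_true == 1)
    n_labeled = int(labeled_share * len(pos_idx))
    labeled = rng.choice(pos_idx, size=n_labeled, replace=False)

    # PU label s: 1 for labeled positives, 0 for everything unlabeled.
    s = np.zeros(len(y_true), dtype=int)
    s[labeled] = 1

    # Randomly pick as many unlabeled rows as labeled positives and treat them as negatives.
    unlabeled = rng.permutation(np.flatnonzero(s == 0))
    pseudo_neg, held_out = unlabeled[:n_labeled], unlabeled[n_labeled:]
    train_idx = np.concatenate([labeled, pseudo_neg])

    # The estimator only ever sees the PU label, never the true target.
    estimator.fit(X[train_idx], s[train_idx])
    scores = estimator.predict_proba(X[held_out])[:, 1]
    # The true labels are used for evaluation only.
    return roc_auc_score(y_true[held_out], scores)


X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=1.5, random_state=0)
auc = pu_random_negative_sampling(X, y, 0.5, LogisticRegression(max_iter=1000))
print(f"ROC-AUC on held-out unlabeled rows: {auc:.3f}")
```

Because some of the pseudo-negatives are hidden positives, the predicted probabilities are biased downward, which is why a ranking metric such as ROC-AUC is the safer comparison here.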


2. Sources:
   (a) Origin:  This dataset is a subset of the 1987 National Indonesia
                Contraceptive Prevalence Survey
   (b) Creator: Tjen-Sien Lim (limt@stat.wisc.edu)
   (c) Donor:   Tjen-Sien Lim (limt@stat.wisc.edu)
   (c) Date:    June 7, 1997

3. Past Usage:
   Lim, T.-S., Loh, W.-Y. & Shih, Y.-S. (1999). A Comparison of
   Prediction Accuracy, Complexity, and Training Time of Thirty-three
   Old and New Classification Algorithms. Machine Learning. Forthcoming.
   (ftp://ftp.stat.wisc.edu/pub/loh/treeprogs/quest1.7/mach1317.pdf or
   (http://www.stat.wisc.edu/~limt/mach1317.pdf)

4. Relevant Information:
   This dataset is a subset of the 1987 National Indonesia Contraceptive
   Prevalence Survey. The samples are married women who were either not 
   pregnant or do not know if they were at the time of interview. The 
   problem is to predict the current contraceptive method choice 
   (no use, long-term methods, or short-term methods) of a woman based 
   on her demographic and socio-economic characteristics.

5. Number of Instances: 1473

6. Number of Attributes: 10 (including the class attribute)

7. Attribute Information:

   1. Wife's age                     (numerical)
   2. Wife's education               (categorical)      1=low, 2, 3, 4=high
   3. Husband's education            (categorical)      1=low, 2, 3, 4=high
   4. Number of children ever born   (numerical)
   5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
   6. Wife's now working?            (binary)           0=Yes, 1=No
   7. Husband's occupation           (categorical)      1, 2, 3, 4
   8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
   9. Media exposure                 (binary)           0=Good, 1=Not good
   10. Contraceptive method used     (class attribute)  1=No-use 
                                                        2=Long-term
                                                        3=Short-term

8. Missing Attribute Values: None


In [4]:

data = pd.read_csv("cmc.data", header=None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,24,2,3,3,1,1,2,3,0,1
1,45,1,3,10,1,1,3,4,0,1
2,43,2,3,7,1,1,3,4,0,1
3,42,3,2,9,1,1,3,3,0,1
4,36,3,3,8,1,1,3,2,0,1
...,...,...,...,...,...,...,...,...,...,...
1468,33,4,4,2,1,0,2,4,0,3
1469,33,4,4,3,1,1,1,4,0,3
1470,39,3,3,8,1,0,1,4,0,3
1471,33,3,3,4,1,0,2,2,0,3


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1473 entries, 0 to 1472
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       1473 non-null   int64
 1   1       1473 non-null   int64
 2   2       1473 non-null   int64
 3   3       1473 non-null   int64
 4   4       1473 non-null   int64
 5   5       1473 non-null   int64
 6   6       1473 non-null   int64
 7   7       1473 non-null   int64
 8   8       1473 non-null   int64
 9   9       1473 non-null   int64
dtypes: int64(10)
memory usage: 115.2 KB


In [6]:
data[9].value_counts()

1    629
3    511
2    333
Name: 9, dtype: int64

Transform the target. We will look for the women who do not use contraception (1=No-use); this is our positive class. Those who use a method (2=Long-term or 3=Short-term) form the negative class.

In [7]:
data[9]=data[9].map({2:0,3:0,1:1})

In [8]:
data[9].value_counts()

0    844
1    629
Name: 9, dtype: int64

In [9]:
res= pd.DataFrame(index=['roc_auc','f1','precision','recall','thresholds'])

In [10]:
X_train,X_test,y_train,y_test = train_test_split(data.drop(columns=9), data[9], random_state=0)

In [11]:
model_lr = LogisticRegression(random_state = 0,max_iter=1000)

In [12]:
model_lr.fit(X_train,y_train)
predict = model_lr.predict_proba(X_test)[:,1]

In [13]:
res['lr_out_dumm'] =get_metrics(y_test,predict)

Best Threshold=0.364757, F-Score=0.656, Precision=0.571, Recall=0.772, Roc-AUC=0.709


In [14]:
dummies = [0,1,2,3,6,7]

In [15]:
data_lr = data.copy()

In [16]:
new_dummies = pd.DataFrame()
for i in dummies:
    # one-hot encode column i; the new columns are named "<i>_<value>"
    new_dummies = pd.concat([new_dummies, pd.get_dummies(data_lr[i], prefix=i)], axis=1)
data_lr = pd.concat([data_lr, new_dummies], axis=1)
data_lr.drop(columns=dummies, inplace=True)
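
The loop above can also be written as a single `pd.get_dummies` call with the `columns` argument, which encodes the listed columns and drops the originals in one step. A small sketch on a toy frame; the column names and values are made up for the example (the real `cmc.data` has integer column labels).

```python
import pandas as pd

# Toy frame with hypothetical column names; values are illustrative only.
toy = pd.DataFrame({
    "wife_age": [24, 45, 43],
    "wife_edu": [2, 1, 2],
    "husband_occ": [2, 3, 3],
    "target": [1, 1, 0],
})

# Encode the categorical columns; untouched columns come first in the result.
encoded = pd.get_dummies(toy, columns=["wife_edu", "husband_occ"])
print(encoded.columns.tolist())
```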

In [17]:
def min_max(arr):
    return (arr-arr.min())/(arr.max()-arr.min())

In [18]:
data_lr

Unnamed: 0,4,5,8,9,0_16,0_17,0_18,0_19,0_20,0_21,...,3_13,3_16,6_1,6_2,6_3,6_4,7_1,7_2,7_3,7_4
0,1,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
3,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1468,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
1469,1,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
1470,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
1471,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0


In [19]:
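X_train,X_test,y_train,y_test = train_test_split(data_lr.drop(columns=9), data_lr[9], random_state=0)
model_lr.fit(X_train, y_train)
# predict is not recomputed after this fit, so the metrics below are those of the previous model
res['lr_with_dumm'] = get_metrics(y_test, predict)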


Best Threshold=0.364757, F-Score=0.656, Precision=0.571, Recall=0.772, Roc-AUC=0.709
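
The `lr_with_dumm` metrics are identical to `lr_out_dumm`, which is consistent with `predict` still holding the probabilities from the earlier fit. One way to avoid stale predictions is to keep the split, fit, and scoring in a single helper. The sketch below assumes synthetic data; `fit_and_score` is a hypothetical name, not a function from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def fit_and_score(model, X, y, seed=0):
    """Split, fit, predict and score in one place so predictions always match the fit."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model.fit(X_tr, y_tr)
    probs = model.predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, probs)


X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
# Two feature sets are scored independently; each call recomputes its own predictions.
auc_all = fit_and_score(LogisticRegression(max_iter=1000), X, y)
auc_subset = fit_and_score(LogisticRegression(max_iter=1000), X[:, :4], y)
print(auc_all, auc_subset)
```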


In [20]:
model_cb = CatBoostClassifier(verbose=False,random_state=0)

In [21]:
X_train,X_test,y_train,y_test = train_test_split(data.drop(columns=9), data[9], random_state=0)
model_cb.fit(X_train,y_train)
predict = model_cb.predict_proba(X_test)[:,1]
res['catboost']=get_metrics(y_test,predict)

Best Threshold=0.413598, F-Score=0.692, Precision=0.705, Recall=0.679, Roc-AUC=0.776


In [22]:
a = list(zip(range(9),model_cb.feature_importances_))
a.sort(key = lambda x: x[1],reverse=True)
a

[(3, 31.96961828135555),
 (0, 25.49212522367243),
 (1, 11.080901808028834),
 (7, 9.024267261228506),
 (6, 7.478283609335111),
 (2, 5.895902971220775),
 (5, 4.088420549670451),
 (4, 2.4988635592140254),
 (8, 2.471616736274314)]
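
The importances are easier to read with the attribute names from the dataset description in place of column indices. A short sketch, with the importance values copied from the output above and rounded to two decimals:

```python
import pandas as pd

# Attribute names 1-9 from the dataset description, in column order 0..8.
feature_names = [
    "Wife's age", "Wife's education", "Husband's education",
    "Number of children ever born", "Wife's religion", "Wife's now working?",
    "Husband's occupation", "Standard-of-living index", "Media exposure",
]
# CatBoost importances from the output above, reordered by column index and rounded.
importances = [25.49, 11.08, 5.90, 31.97, 2.50, 4.09, 7.48, 9.02, 2.47]

ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

The number of children and the wife's age together account for more than half of the total importance.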

### random negative sampling

In [23]:

for share in [0.2 + k * 5 / 100 for k in range(11)]:
    res[f'cb_PU_{round(share, 2)}'] = random_negative_sampling(data, model_cb, 9, share)
    



Using 125/629 as positives and unlabeling the rest
target variable:
 0    1348
1     125
Name: class_test, dtype: int64
PU_sample  - (125, 11), pos_sample - (125, 11),        sample_test -(1223, 11), sample_train -(250, 11)


  fscore = (2 * precision * recall) / (precision + recall)


Best Threshold=0.982194, F-Score=nan, Precision=0.000, Recall=0.000, Roc-AUC=0.715
Using 157/629 as positives and unlabeling the rest
target variable:
 0    1316
1     157
Name: class_test, dtype: int64
PU_sample  - (157, 11), pos_sample - (157, 11),        sample_test -(1159, 11), sample_train -(314, 11)
Best Threshold=0.691487, F-Score=0.624, Precision=0.528, Recall=0.761, Roc-AUC=0.747
Using 188/629 as positives and unlabeling the rest
target variable:
 0    1285
1     188
Name: class_test, dtype: int64
PU_sample  - (188, 11), pos_sample - (188, 11),        sample_test -(1097, 11), sample_train -(376, 11)
Best Threshold=0.709419, F-Score=0.601, Precision=0.538, Recall=0.679, Roc-AUC=0.744
Using 220/629 as positives and unlabeling the rest
target variable:
 0    1253
1     220
Name: class_test, dtype: int64
PU_sample  - (220, 11), pos_sample - (220, 11),        sample_test -(1033, 11), sample_train -(440, 11)
Best Threshold=0.614616, F-Score=0.580, Precision=0.477, Recall=0.740, Roc-

In [24]:
res

Unnamed: 0,lr_out_dumm,lr_with_dumm,catboost,cb_PU_0.2,cb_PU_0.25,cb_PU_0.3,cb_PU_0.35,cb_PU_0.4,cb_PU_0.45,cb_PU_0.5,cb_PU_0.55,cb_PU_0.6,cb_PU_0.65,cb_PU_0.7
roc_auc,0.70852,0.70852,0.775586,0.715228,0.74701,0.744232,0.734053,0.766605,0.750237,0.75826,0.768185,0.742867,0.76744,0.750196
f1,0.656168,0.656168,0.691824,,0.623529,0.600715,0.580046,0.610108,0.579646,0.556,0.567164,0.522013,0.535088,0.511905
precision,0.570776,0.570776,0.705128,0.0,0.528239,0.538462,0.477099,0.676,0.708108,0.503623,0.505703,0.542484,0.655914,0.754386
recall,0.771605,0.771605,0.679012,0.0,0.760766,0.679245,0.739645,0.555921,0.490637,0.620536,0.645631,0.50303,0.451852,0.387387
thresholds,0.364757,0.364757,0.413598,0.982194,0.691487,0.709419,0.614616,0.748957,0.786706,0.657767,0.679626,0.79977,0.773514,0.903336


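As the table shows, with about 60% of the positives labeled, ROC-AUC comes close to the result of the fully supervised model (0.768 at 55% and 0.767 at 65% versus 0.776 for the CatBoost baseline), although it fluctuates from one share to the next; the best-threshold F1, by contrast, tends to fall as the share grows.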
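
To make the comparison with the baseline explicit, the ROC-AUC gap can be computed per share. The sketch below copies the `roc_auc` and `f1` rows from the `res` table above; the column name `auc_gap` is introduced only for this example. Keep in mind that each share is scored on a different set of held-out unlabeled rows, so the columns are not strictly comparable.

```python
import pandas as pd

shares = [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7]
# Values copied from the res table above (cb_PU_* columns).
roc_auc = [0.715228, 0.74701, 0.744232, 0.734053, 0.766605, 0.750237,
           0.75826, 0.768185, 0.742867, 0.76744, 0.750196]
f1 = [float("nan"), 0.623529, 0.600715, 0.580046, 0.610108, 0.579646,
      0.556, 0.567164, 0.522013, 0.535088, 0.511905]
baseline_auc = 0.775586  # 'catboost' column of the same table

summary = pd.DataFrame({"roc_auc": roc_auc, "f1": f1},
                       index=pd.Index(shares, name="share_of_P"))
# Distance from the fully supervised CatBoost baseline.
summary["auc_gap"] = baseline_auc - summary["roc_auc"]
best_share = summary["roc_auc"].idxmax()
print(summary.round(3))
print("share with the highest ROC-AUC:", best_share)
```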

Let's repeat the experiment on a larger dataset.

In [25]:
data = pd.read_csv('course_project_train.csv')

In [26]:
data

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,,0.0,11.0,26.3,685960.0,1.0,,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,Rent,402192.0,< 1 year,0.0,3.0,8.5,107866.0,0.0,,0.0,other,Short Term,129360.0,73492.0,1900.0,697.0,0
7496,Home Mortgage,1533984.0,1 year,0.0,10.0,26.5,686312.0,0.0,43.0,0.0,debt consolidation,Long Term,444048.0,456399.0,12783.0,7410.0,1
7497,Rent,1878910.0,6 years,0.0,12.0,32.1,1778920.0,0.0,,0.0,buy a car,Short Term,99999999.0,477812.0,12479.0,748.0,0
7498,Home Mortgage,,,0.0,21.0,26.5,1141250.0,0.0,,0.0,debt consolidation,Short Term,615274.0,476064.0,37118.0,,0


In [27]:
res= pd.DataFrame(index=['roc_auc','f1','precision','recall','thresholds'])

In [28]:
object_features = data.select_dtypes(exclude=[np.number]).columns
data[object_features]=data[object_features].astype(str)
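
The cast to `str` is there because, as far as I know, CatBoost expects categorical values to be strings or integers and rejects float NaN in categorical columns. A cast with `astype(str)` typically leaves missing values as the literal string `'nan'`; a slightly more explicit alternative is to fill them with a named placeholder first. A sketch on a toy frame (column names borrowed from the dataset, values and the `"missing"` placeholder are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame: one categorical column with a gap, one numeric column with a gap.
toy = pd.DataFrame({
    "Years in current job": ["10+ years", np.nan, "8 years"],
    "Annual Income": [482087.0, np.nan, 751412.0],
})

object_features = toy.select_dtypes(exclude=[np.number]).columns
# Give missing categories an explicit label before casting to str;
# numeric NaNs are left untouched.
toy[object_features] = toy[object_features].fillna("missing").astype(str)
print(toy)
```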

In [29]:
param_cb = {
    "n_estimators": 800,
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "task_type": "CPU",
    "thread_count": -1,
    "early_stopping_rounds": 90,
    "custom_metric": ["AUC"],
    "cat_features": object_features,
    "verbose": False,
    "max_bin": 30,
    "learning_rate": 0.05,
    "max_depth": 2,
    "l2_leaf_reg": 28,
    "random_state": 0,
}
model_cb = CatBoostClassifier(**param_cb)

In [30]:
X_train,X_test,y_train,y_test = train_test_split(data.drop(columns='Credit Default'), data['Credit Default'], random_state=0)
model_cb.fit(X_train,y_train)
predict = model_cb.predict_proba(X_test)[:,1]
res['catboost']=get_metrics(y_test,predict)

Best Threshold=0.294262, F-Score=0.582, Precision=0.533, Recall=0.640, Roc-AUC=0.771


In [31]:
for share in [0.2 + k * 5 / 100 for k in range(11)]:
    res[f'cb_PU_{round(share, 2)}'] = random_negative_sampling(data, model_cb, 'Credit Default', share)

Using 422/2113 as positives and unlabeling the rest
target variable:
 0    7078
1     422
Name: class_test, dtype: int64
PU_sample  - (422, 18), pos_sample - (422, 18),        sample_test -(6656, 18), sample_train -(844, 18)
Best Threshold=0.708395, F-Score=0.503, Precision=0.438, Recall=0.591, Roc-AUC=0.749
Using 528/2113 as positives and unlabeling the rest
target variable:
 0    6972
1     528
Name: class_test, dtype: int64
PU_sample  - (528, 18), pos_sample - (528, 18),        sample_test -(6444, 18), sample_train -(1056, 18)
Best Threshold=0.673741, F-Score=0.475, Precision=0.411, Recall=0.562, Roc-AUC=0.743
Using 633/2113 as positives and unlabeling the rest
target variable:
 0    6867
1     633
Name: class_test, dtype: int64
PU_sample  - (633, 18), pos_sample - (633, 18),        sample_test -(6234, 18), sample_train -(1266, 18)
Best Threshold=0.684999, F-Score=0.471, Precision=0.432, Recall=0.517, Roc-AUC=0.745
Using 739/2113 as positives and unlabeling the rest
target variable:

In [32]:
res

Unnamed: 0,catboost,cb_PU_0.2,cb_PU_0.25,cb_PU_0.3,cb_PU_0.35,cb_PU_0.4,cb_PU_0.45,cb_PU_0.5,cb_PU_0.55,cb_PU_0.6,cb_PU_0.65,cb_PU_0.7
roc_auc,0.77111,0.749011,0.743036,0.745087,0.754687,0.755466,0.761189,0.751094,0.761204,0.760081,0.752163,0.766474
f1,0.581731,0.503212,0.474665,0.470548,0.46477,0.459536,0.44719,0.417957,0.442163,0.413015,0.387097,0.36413
precision,0.53304,0.438228,0.410579,0.431804,0.397221,0.420074,0.371523,0.379925,0.463415,0.383477,0.45283,0.523438
recall,0.640212,0.590823,0.562457,0.51693,0.56,0.507181,0.561562,0.46445,0.422775,0.447482,0.338028,0.279167
thresholds,0.294262,0.708395,0.673741,0.684999,0.645467,0.664033,0.624114,0.683934,0.712693,0.669784,0.733294,0.754852
