1. взять любой набор данных для бинарной классификации (можно скачать один из модельных с https://archive.ics.uci.edu/ml/datasets.php)

In [35]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score

In [36]:
df = pd.read_csv('Sleep_health_and_lifestyle_dataset.csv')
df.head(3)

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,


3. сделать feature engineering

In [37]:
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Occupation'] = df['Occupation'].map({v: i for i, v in enumerate(df['Occupation'].unique())})
df['BMI Category'] = df['BMI Category'].map({'Normal': 0, 'Normal Weight': 0, 'Obese': 1, 'Overweight': 2})
df['Blood Pressure 1'] = df['Blood Pressure'].apply(lambda x: int(x.split('/')[0]))
df['Blood Pressure 2'] = df['Blood Pressure'].apply(lambda x: int(x.split('/')[1]))
df['Sleep Disorder'] = df['Sleep Disorder'].map({'Sleep Apnea': 1, 'Insomnia': 1})
df['Sleep Disorder'] = df['Sleep Disorder'].fillna(0)
df = df.drop(columns=['Blood Pressure', 'Person ID'])

4. обучить любой классификатор (какой вам нравится)

In [38]:
x_train, x_test, y_train, y_test = train_test_split(df.drop(columns=['Sleep Disorder']), df['Sleep Disorder'], test_size=0.2, random_state=7)

In [39]:
model = CatBoostClassifier(silent=True)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

5. далее разделить ваш набор данных на два множества: P (positives) и U (unlabeled). Причем брать нужно не все положительные (класс 1) примеры, а только лишь часть

In [40]:
df_modified = df.copy()
true_values_ind = np.where(df_modified['Sleep Disorder'].values == 1)[0]
np.random.shuffle(true_values_ind)
true_values_sample_len = int(np.ceil(0.15*len(true_values_ind)))
true_values_sample = true_values_ind[:true_values_sample_len][0]

In [41]:
df_modified['Class Test'] = -1
df_modified.loc[true_values_sample, 'Class Test'] = 1

6. применить random negative sampling для построения классификатора в новых условиях

In [42]:
df_modified = df_modified.sample(frac=1)
neg_sample = df_modified[df_modified['Class Test']==-1][:len(df_modified[df_modified['Class Test']==1])]
sample_test = df_modified[df_modified['Class Test']==-1][len(df_modified[df_modified['Class Test']==1]):]
pos_sample = df_modified[df_modified['Class Test']==1]
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

7. сравнить качество с решением из пункта 4 (построить отчет - таблицу метрик)

In [43]:
model_np = CatBoostClassifier(silent=True)
model_np.fit(
    sample_train.drop(columns=['Sleep Disorder', 'Class Test']).values,
    sample_train['Sleep Disorder'].values
    )
y_pred_np = model_np.predict(sample_test.drop(columns=['Sleep Disorder', 'Class Test']).values)

In [44]:
def evaluate_results(y_test, y_predict, model_name):
    metrics_dict = {
        r'model': model_name,
        r'f1': f1_score(y_test, y_predict),
        r'roc': roc_auc_score(y_test, y_predict),
        r'recall': recall_score(y_test, y_predict, average='binary'),
        r'precision': precision_score(y_test, y_predict, average='binary')
    }
    return pd.DataFrame.from_dict(metrics_dict, orient='index').T

In [45]:
metrics_table = evaluate_results(y_test, y_pred, 'Without NP')
metrics_table = pd.concat([metrics_table, evaluate_results(sample_test['Sleep Disorder'].values, y_pred_np, 'With NP (p = 0.15)')], axis=0).reset_index(drop=True)
metrics_table

Unnamed: 0,model,f1,roc,recall,precision
0,Without NP,0.962025,0.960714,0.95,0.974359
1,With NP (p = 0.15),0.032258,0.169903,0.032468,0.032051


8. поэкспериментировать с долей P на шаге 5 (как будет меняться качество модели при уменьшении/увеличении размера P)

In [46]:
for p in [0.1, 0.2]:
    df_modified = df.copy()
    true_values_ind = np.where(df_modified['Sleep Disorder'].values == 1)[0]
    np.random.shuffle(true_values_ind)
    true_values_sample_len = int(np.ceil(p*len(true_values_ind)))
    true_values_sample = true_values_ind[:true_values_sample_len][0]
    df_modified['Class Test'] = -1
    df_modified.loc[true_values_sample, 'Class Test'] = 1
    df_modified = df_modified.sample(frac=1)
    neg_sample = df_modified[df_modified['Class Test']==-1][:len(df_modified[df_modified['Class Test']==1])]
    sample_test = df_modified[df_modified['Class Test']==-1][len(df_modified[df_modified['Class Test']==1]):]
    pos_sample = df_modified[df_modified['Class Test']==1]
    sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)
    model_np = CatBoostClassifier(silent=True)
    model_np.fit(
        sample_train.drop(columns=['Sleep Disorder', 'Class Test']).values,
        sample_train['Sleep Disorder'].values
        )
    y_pred_np = model_np.predict(sample_test.drop(columns=['Sleep Disorder', 'Class Test']).values)
    metrics_table = pd.concat([metrics_table, evaluate_results(sample_test['Sleep Disorder'].values, y_pred_np, f'With NP (p = {p})')], axis=0).reset_index(drop=True)

In [47]:
metrics_table

Unnamed: 0,model,f1,roc,recall,precision
0,Without NP,0.962025,0.960714,0.95,0.974359
1,With NP (p = 0.15),0.032258,0.169903,0.032468,0.032051
2,With NP (p = 0.1),0.770992,0.811867,0.655844,0.935185
3,With NP (p = 0.2),0.75,0.797927,0.623377,0.941176


Видимо, качество должно расти с увеличением p. Хотя, по приведенному примеру трудно делать какие-то выводы. Видимо, был выбран слишком маленький датасет.