<p style="text-align: center;">Importing the required libraries and DataFrames</p>

In [31]:
# Data handling
import pandas as pd
import numpy as np

# Machine learning utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# Models
from catboost import CatBoostClassifier, Pool

In [32]:
# Runtime environment setup
import warnings

warnings.filterwarnings("ignore")

In [33]:
train_df = pd.read_parquet("../data/processed/train_data.pqt", engine = "auto")
test_df = pd.read_parquet("../data/raw/test_data.pqt", engine = "auto")

<p style="text-align: center;">Data preprocessing</p>

In [34]:
df = pd.concat([train_df, test_df], ignore_index = True)
df["date"] = df["date"].replace(
    {
        "month_4": "month_1",
        "month_5": "month_2",
        "month_6": "month_3"
    }
)

> 1. Concatenate the training and test `DataFrames` into one.
> 2. Relabel the values in the `date` column so that both sets use the same month names.
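
A minimal sketch of this step on toy data. The two frames and their values are made up for illustration; only the `date` mapping is taken from the cell above.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (hypothetical values).
toy_train = pd.DataFrame({"id": [0, 0, 0], "date": ["month_1", "month_2", "month_3"]})
toy_test = pd.DataFrame({"id": [1, 1, 1], "date": ["month_4", "month_5", "month_6"]})

# Stack the frames and rebuild a clean 0..n-1 index.
toy = pd.concat([toy_train, toy_test], ignore_index=True)

# Map the test months onto the same three labels used by the training set.
toy["date"] = toy["date"].replace(
    {"month_4": "month_1", "month_5": "month_2", "month_6": "month_3"}
)

print(toy["date"].tolist())
```

After the relabelling, each client has at most three rows whose `date` values are directly comparable across the two sets.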

In [35]:
numeric_columns = df.drop(columns = ["id"]).select_dtypes(include = ["number"]).columns.tolist()

for column in numeric_columns:
    mean_value = df[column].mean()
    # Assign the result back: inplace fillna on a column selection may not update df in newer pandas
    df[column] = df[column].fillna(mean_value)

> 1. Build the list `numeric_columns` with the names of all numeric columns in `df` except `id`.
> 2. For each column in `numeric_columns`:
>    - Compute the mean (`mean_value`).
>    - Replace every missing value in the column with `mean_value`.
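
As a small illustration of the mean imputation described above, here is a sketch on a made-up three-row frame (the column names `balance` and `city` are hypothetical). Only the numeric column is filled; the categorical one is left for a later step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical column with gaps.
toy = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "balance": [1.0, np.nan, 3.0],
        "city": pd.Series(["a", None, "c"], dtype=object),
    }
)

# Select numeric columns, excluding the identifier.
numeric_columns = toy.drop(columns=["id"]).select_dtypes(include=["number"]).columns.tolist()

for column in numeric_columns:
    # The mean ignores NaN by default, so it is computed over observed values only.
    toy[column] = toy[column].fillna(toy[column].mean())

print(toy)
```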

In [36]:
df["avg_a_oper_1m"] = df["sum_a_oper_1m"] / df["cnt_a_oper_1m"]
df["avg_b_oper_1m"] = df["sum_b_oper_1m"] / df["cnt_b_oper_1m"]
df["avg_c_oper_1m"] = df["sum_c_oper_1m"] / df["cnt_c_oper_1m"]

df["avg_deb_d_oper_1m"] = df["sum_deb_d_oper_1m"] / df["cnt_deb_d_oper_1m"]
df["avg_cred_d_oper_1m"] = df["sum_cred_d_oper_1m"] / df["cnt_cred_d_oper_1m"]

df["avg_deb_e_oper_1m"] = df["sum_deb_e_oper_1m"] / df["cnt_deb_e_oper_1m"]
df["avg_cred_e_oper_1m"] = df["sum_cred_e_oper_1m"] / df["cnt_cred_e_oper_1m"]

df["avg_deb_f_oper_1m"] = df["sum_deb_f_oper_1m"] / df["cnt_deb_f_oper_1m"]
df["avg_cred_f_oper_1m"] = df["sum_cred_f_oper_1m"] / df["cnt_cred_f_oper_1m"]

df["avg_deb_g_oper_1m"] = df["sum_deb_g_oper_1m"] / df["cnt_deb_g_oper_1m"]
df["avg_cred_g_oper_1m"] = df["sum_cred_g_oper_1m"] / df["cnt_cred_g_oper_1m"]

df["avg_deb_h_oper_1m"] = df["sum_deb_h_oper_1m"] / df["cnt_deb_h_oper_1m"]
df["avg_cred_h_oper_1m"] = df["sum_cred_h_oper_1m"] / df["cnt_cred_h_oper_1m"]

df["avg_a_oper_3m"] = df["sum_a_oper_3m"] / df["cnt_a_oper_3m"]
df["avg_b_oper_3m"] = df["sum_b_oper_3m"] / df["cnt_b_oper_3m"]
df["avg_c_oper_3m"] = df["sum_c_oper_3m"] / df["cnt_c_oper_3m"]

df["avg_deb_d_oper_3m"] = df["sum_deb_d_oper_3m"] / df["cnt_deb_d_oper_3m"]
df["avg_cred_d_oper_3m"] = df["sum_cred_d_oper_3m"] / df["cnt_cred_d_oper_3m"]

df["avg_deb_e_oper_3m"] = df["sum_deb_e_oper_3m"] / df["cnt_deb_e_oper_3m"]
df["avg_cred_e_oper_3m"] = df["sum_cred_e_oper_3m"] / df["cnt_cred_e_oper_3m"]

df["avg_deb_f_oper_3m"] = df["sum_deb_f_oper_3m"] / df["cnt_deb_f_oper_3m"]
df["avg_cred_f_oper_3m"] = df["sum_cred_f_oper_3m"] / df["cnt_cred_f_oper_3m"]

df["avg_deb_g_oper_3m"] = df["sum_deb_g_oper_3m"] / df["cnt_deb_g_oper_3m"]
df["avg_cred_g_oper_3m"] = df["sum_cred_g_oper_3m"] / df["cnt_cred_g_oper_3m"]

df["avg_deb_h_oper_3m"] = df["sum_deb_h_oper_3m"] / df["cnt_deb_h_oper_3m"]
df["avg_cred_h_oper_3m"] = df["sum_cred_h_oper_3m"] / df["cnt_cred_h_oper_3m"]

> Compute the average operation amount for the 1-month and 3-month periods by dividing each sum by the corresponding number of operations in that period, and store the results as new features.
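
The same ratios could also be generated in a loop over the column-name pattern `sum_<group>_oper_<period>` / `cnt_<group>_oper_<period>`. The sketch below assumes that pattern and uses a made-up frame with only three of the operation groups; it is an alternative way to write the cell, not the notebook's own code.

```python
import pandas as pd

# Hypothetical subset of operation groups; the notebook spells all of them out.
operation_groups = ["a", "deb_d", "cred_d"]
periods = ["1m", "3m"]

# Toy data: every sum column is [2, 6], every count column is [1, 3].
toy = pd.DataFrame(
    {
        f"{stat}_{group}_oper_{period}": ([2.0, 6.0] if stat == "sum" else [1.0, 3.0])
        for stat in ("sum", "cnt")
        for group in operation_groups
        for period in periods
    }
)

for period in periods:
    for group in operation_groups:
        # Average amount per operation = total amount / number of operations.
        toy[f"avg_{group}_oper_{period}"] = (
            toy[f"sum_{group}_oper_{period}"] / toy[f"cnt_{group}_oper_{period}"]
        )

print(toy.filter(like="avg_").columns.tolist())
```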

In [37]:
# def restore_categorical_feature(values: pd.Series) -> pd.Series:
#     if values.isna().any() and not values.isna().all():
#         return values.fillna(values.dropna().iloc[-1])
#     elif values.isna().all():
#         return values.fillna("missing")
    
#     return values

In [38]:
def restore_categorical_feature(values: pd.Series) -> pd.Series:
    if values.isna().all():
        return values.fillna("missing")
    return values.fillna(values.dropna().iloc[-1])

> The `restore_categorical_feature` function works as follows:
> - It checks whether all values in the group are missing or only some.
> - If all values in the group are missing, they are filled with the value `missing`.
> - If only some values are missing, they are filled with the last non-missing value in the group.
> - If there are no missing values, the original group is returned unchanged.
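
The three cases can be checked on small made-up series. The function body is copied from the cell above; the sample values are hypothetical.

```python
import pandas as pd


def restore_categorical_feature(values: pd.Series) -> pd.Series:
    if values.isna().all():
        return values.fillna("missing")
    return values.fillna(values.dropna().iloc[-1])


# Hypothetical per-client histories of one categorical feature.
partly_missing = pd.Series(["Moscow", None, "Kazan", None], dtype=object)
all_missing = pd.Series([None, None], dtype=object)
complete = pd.Series(["a", "b"], dtype=object)

restored_partial = restore_categorical_feature(partly_missing)
restored_empty = restore_categorical_feature(all_missing)
restored_complete = restore_categorical_feature(complete)

print(restored_partial.tolist())
print(restored_empty.tolist())
print(restored_complete.tolist())
```

Note that every gap receives the last observed value of the group, including gaps that occur before that value in time.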

In [39]:
columns_to_restore = [
    "channel_code",
    "city",
    "city_type",
    "ogrn_month",
    "ogrn_year",
    "okved",
    "segment"
]

for column in columns_to_restore:
    # transform keeps the original row index, so the filled values stay aligned with df
    df[column] = df.groupby("id")[column].transform(restore_categorical_feature)

> For each column in `columns_to_restore`, group the data by `id`, apply `restore_categorical_feature`, and write the result back to the corresponding column of `df`.
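
One way to apply a per-client fill while keeping every value on its original row is `groupby(...).transform`. The sketch below uses a made-up frame whose rows are deliberately not sorted by `id`, to show that the alignment does not depend on row order.

```python
import pandas as pd


def restore_categorical_feature(values: pd.Series) -> pd.Series:
    if values.isna().all():
        return values.fillna("missing")
    return values.fillna(values.dropna().iloc[-1])


# Hypothetical data: rows of clients 0, 1 and 2 are interleaved on purpose.
toy = pd.DataFrame(
    {
        "id": [1, 0, 1, 0, 2],
        "city": pd.Series([None, "Kazan", "Moscow", None, None], dtype=object),
    }
)

# transform returns a Series indexed like toy, so assignment is row-aligned.
toy["city"] = toy.groupby("id")["city"].transform(restore_categorical_feature)

print(toy)
```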

In [40]:
categorical_columns = [
    "channel_code",
    "city",
    "city_type",
    "okved",
    "segment",
    "ogrn_month",
    "ogrn_year",
]

categorical_columns_month_1 = [f"{column}_month_1" for column in categorical_columns]
categorical_columns_month_2 = [f"{column}_month_2" for column in categorical_columns]

df = df.pivot_table(index = "id", columns = "date", aggfunc = "first")

df.columns = [f"{column[0]}_{column[1]}" for column in df.columns]
df.reset_index(inplace = True)

df = df.drop(columns = ["end_cluster_month_1", "end_cluster_month_2"] + categorical_columns_month_1 + categorical_columns_month_2)
categorical_columns = df.select_dtypes(include = ["object"]).columns

df[categorical_columns] = df[categorical_columns].fillna("missing")

> 1. Build a pivot table by client and month.
> 2. Since some clients may lack the 4th month (`month_1` after renaming), fill the resulting missing values in categorical columns with `missing`.
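
A sketch of the long-to-wide reshaping on made-up data, where client 1 has no row for `month_1`. The columns `balance` and `segment` are hypothetical, and the categorical columns are picked by name prefix here rather than by dtype to keep the example short.

```python
import pandas as pd

# Hypothetical long-format data: one row per client and month.
toy = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "date": ["month_1", "month_2", "month_3", "month_2", "month_3"],
        "balance": [1.0, 2.0, 3.0, 5.0, 6.0],
        "segment": ["s1", "s1", "s2", "s3", "s3"],
    }
)

# One row per client; columns become (feature, month) pairs.
wide = toy.pivot_table(index="id", columns="date", aggfunc="first")

# Flatten the two-level column index into names like "balance_month_1".
wide.columns = [f"{name}_{month}" for name, month in wide.columns]
wide = wide.reset_index()

# Months a client does not have show up as NaN; mark them in categorical columns.
categorical = [column for column in wide.columns if column.startswith("segment_")]
wide[categorical] = wide[categorical].fillna("missing")

print(wide)
```

The numeric gap (`balance_month_1` for client 1) stays NaN at this point and is handled by a separate imputation step.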

In [41]:
df.head()

Unnamed: 0,id,avg_a_oper_1m_month_1,avg_a_oper_1m_month_2,avg_a_oper_1m_month_3,avg_a_oper_3m_month_1,avg_a_oper_3m_month_2,avg_a_oper_3m_month_3,avg_b_oper_1m_month_1,avg_b_oper_1m_month_2,avg_b_oper_1m_month_3,...,sum_deb_h_oper_3m_month_3,sum_of_paym_1y_month_1,sum_of_paym_1y_month_2,sum_of_paym_1y_month_3,sum_of_paym_2m_month_1,sum_of_paym_2m_month_2,sum_of_paym_2m_month_3,sum_of_paym_6m_month_1,sum_of_paym_6m_month_2,sum_of_paym_6m_month_3
0,0,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,0.87705,0.51149,0.486425,0.480547,0.942275,0.645704,0.403604,0.536013,0.536378,0.613167
1,1,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,0.043221,0.052041,0.033554,0.039472,0.014051,-0.057593,-0.092059,0.0438,0.035027,0.025233
2,2,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,-0.165588,-0.291924,-0.290712,-0.288318,-0.255837,-0.267913,-0.255946,-0.287121,-0.284955,-0.280676
3,3,-0.449255,-0.452818,-0.449255,-0.993386,-0.993386,-0.993386,-0.069031,-0.069323,-0.069031,...,-0.165588,-0.242793,-0.262878,-0.273303,-0.273969,-0.273969,-0.273969,-0.268832,-0.294398,-0.294447
4,4,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,-0.078297,-0.124641,-0.121939,-0.128903,-0.103807,-0.134192,-0.16674,-0.130025,-0.134049,-0.142831


In [42]:
numeric_columns = df.drop(columns = ["id"]).select_dtypes(include = ["number"]).columns.tolist()

for column in numeric_columns:
    mean_value = df[column].mean()
    # Assign the result back: inplace fillna on a column selection may not update df in newer pandas
    df[column] = df[column].fillna(mean_value)

> Once more, go through all numeric columns and fill the missing values with the mean value of the feature.

In [43]:
train_data = df[df["start_cluster_month_3"] != "missing"].drop(["id", "end_cluster_month_3"], axis = 1)
predict_data = df[df["start_cluster_month_3"] == "missing"].drop(["id", "end_cluster_month_3"], axis = 1)

X = train_data.drop("start_cluster_month_3", axis = 1)
y = train_data["start_cluster_month_3"]

X_train, X_value, y_train, y_value = train_test_split(X, y, test_size = 0.2, random_state = 42)

> 1. Obtain the training set and the set to be predicted.
> 2. Separate the training features and the target variable.
> 3. Split the data into training and validation sets in an 80/20 ratio.

<p style="text-align: center;">Training</p>

In [44]:
catboost_model_start_cluster = CatBoostClassifier(
    iterations = 1024,
    depth = 6,
    learning_rate = 0.075,
    random_seed = 47,
    loss_function = "MultiClass",
    task_type = "GPU",
    devices = "0",
    early_stopping_rounds = 20    # early stopping: halt if the validation metric does not improve for 20 consecutive iterations
)

> Create a classifier model (gradient boosting over decision trees) with the following parameters:
> - Number of training iterations = 1024
> - Tree depth = 6
> - Learning rate = 0.075
> - Training runs on the GPU

In [45]:
def train_catboost(
        model: CatBoostClassifier,
        x_train: pd.DataFrame, y_train: pd.Series,
        x_value: pd.DataFrame, y_value: pd.Series,
        cat_names: pd.core.indexes.base.Index,
        model_name: str,
        verbose_step: int = 100
    ) -> pd.DataFrame:
    model.fit(
        x_train,
        y_train,
        cat_features = np.array(cat_names),
        eval_set = (x_value, y_value),
        verbose = verbose_step
    )

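    # Note: save_model writes CatBoost's binary "cbm" format by default; the .json extension alone does not change it
    model.save_model(f"../models/{model_name}.json")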

    feature_importance = model.get_feature_importance(prettified = True)
    return feature_importance

> Define the function `train_catboost`, which trains the model on `x_train` and `y_train`, saves it to a file, and returns the model's feature importance.

In [46]:
cat_names = X.select_dtypes(include = ["object"]).columns

feature_importance = train_catboost(
    catboost_model_start_cluster,
    X_train,
    y_train,
    X_value,
    y_value,
    cat_names,
    "catboost_model_start_cluster"
)

0:	learn: 1.9166965	test: 1.8946141	best: 1.8946141 (0)	total: 271ms	remaining: 4m 36s
100:	learn: 0.2487399	test: 0.2270516	best: 0.2270516 (100)	total: 12.1s	remaining: 1m 50s
200:	learn: 0.2258419	test: 0.2141865	best: 0.2141865 (200)	total: 23.4s	remaining: 1m 35s
300:	learn: 0.2184614	test: 0.2110014	best: 0.2110014 (300)	total: 33.8s	remaining: 1m 21s
400:	learn: 0.2123362	test: 0.2088580	best: 0.2088580 (400)	total: 44.4s	remaining: 1m 8s
500:	learn: 0.2075920	test: 0.2075796	best: 0.2075796 (500)	total: 55s	remaining: 57.4s
600:	learn: 0.2039694	test: 0.2067356	best: 0.2067327 (598)	total: 1m 5s	remaining: 46s
700:	learn: 0.2003251	test: 0.2060063	best: 0.2060063 (700)	total: 1m 15s	remaining: 35s
800:	learn: 0.1965581	test: 0.2053582	best: 0.2053500 (799)	total: 1m 26s	remaining: 24.2s
900:	learn: 0.1932139	test: 0.2047788	best: 0.2047769 (899)	total: 1m 37s	remaining: 13.3s
1000:	learn: 0.1900095	test: 0.2042862	best: 0.2042721 (996)	total: 1m 48s	remaining: 2.49s
1023:	learn

> Train the model and obtain the feature importance.

In [47]:
y_pred = catboost_model_start_cluster.predict(X_value)

print(classification_report(y_value, y_pred))

              precision    recall  f1-score   support

     {other}       0.93      0.92      0.92      2432
          {}       0.88      0.86      0.87      3847
      {α, β}       0.92      0.93      0.92       710
      {α, γ}       0.93      0.93      0.93      2274
      {α, δ}       0.85      0.86      0.86       575
   {α, ε, η}       0.92      0.89      0.90       125
   {α, ε, θ}       0.86      0.83      0.85        53
   {α, ε, ψ}       0.88      0.83      0.85        35
      {α, ε}       0.90      0.84      0.87       369
      {α, η}       0.96      0.97      0.97      3651
      {α, θ}       0.91      0.88      0.90       322
      {α, λ}       0.77      0.84      0.80        67
      {α, μ}       0.85      0.84      0.84        93
      {α, π}       0.00      0.00      0.00         1
      {α, ψ}       0.94      0.92      0.93       331
         {α}       0.96      0.97      0.97     25112
         {λ}       1.00      0.33      0.50         3

    accuracy              

> Predict the classes for the validation subset and print per-class quality metrics (precision, recall, F1-score) for the model.

In [48]:
X_predict = predict_data.drop("start_cluster_month_3", axis = 1)
predicted_clusters = catboost_model_start_cluster.predict(X_predict)
predicted_clusters_flat = np.ravel(predicted_clusters)
class_counts = pd.Series(predicted_clusters_flat).value_counts()

print(class_counts)

{α}          68396
{α, η}        7997
{}            6710
{other}       5773
{α, γ}        5096
{α, β}        1952
{α, δ}        1341
{α, ε}         792
{α, θ}         715
{α, ψ}         446
{α, μ}         268
{α, ε, η}      202
{α, λ}         147
{α, ε, θ}      112
{α, ε, ψ}       45
{λ}              8
Name: count, dtype: int64


> Predict the clusters for `predict_data` and print the count of each unique predicted cluster as a table.

In [49]:
df_restore_start_cluster = df.copy()

# Rows with id >= 200000 come from the test set; fill in their predicted start cluster.
# This relies on these rows appearing in the same order as in predict_data.
is_test = df_restore_start_cluster["id"] >= 200000
df_restore_start_cluster.loc[is_test, "start_cluster_month_3"] = np.ravel(predicted_clusters)

# Test clients whose start cluster is the same in all three months
same_cluster = (
    (df_restore_start_cluster["start_cluster_month_1"] == df_restore_start_cluster["start_cluster_month_2"])
    & (df_restore_start_cluster["start_cluster_month_2"] == df_restore_start_cluster["start_cluster_month_3"])
)
matching_rows = df_restore_start_cluster[is_test & same_cluster]

matching_rows

Unnamed: 0,id,avg_a_oper_1m_month_1,avg_a_oper_1m_month_2,avg_a_oper_1m_month_3,avg_a_oper_3m_month_1,avg_a_oper_3m_month_2,avg_a_oper_3m_month_3,avg_b_oper_1m_month_1,avg_b_oper_1m_month_2,avg_b_oper_1m_month_3,...,sum_deb_h_oper_3m_month_3,sum_of_paym_1y_month_1,sum_of_paym_1y_month_2,sum_of_paym_1y_month_3,sum_of_paym_2m_month_1,sum_of_paym_2m_month_2,sum_of_paym_2m_month_3,sum_of_paym_6m_month_1,sum_of_paym_6m_month_2,sum_of_paym_6m_month_3
200000,200000,1.261947,-0.164805,6.259527,4.461107,4.636003,5.464893,-0.069323,-0.069323,-0.069323,...,-0.152800,0.676573,0.688449,0.671862,0.416833,0.433195,0.223961,0.332409,0.284317,0.285376
200001,200001,-0.449255,-0.449255,-0.449255,-0.979398,-0.979398,-0.979398,-0.069031,-0.069031,-0.069031,...,-0.165588,0.003539,0.003539,0.003539,0.001738,0.001738,0.001738,-0.001459,-0.001459,-0.001459
200002,200002,9.330325,43.041202,24.722909,16.991300,45.016891,49.509455,-0.069323,-0.069323,-0.069323,...,2.614870,0.365591,0.970531,1.211625,1.303965,3.870911,4.142519,0.550373,1.620822,1.969559
200003,200003,-0.449255,-0.449255,-0.449255,-0.979398,-0.979398,-0.979398,-0.069031,-0.069031,-0.069031,...,-0.165588,0.003539,0.003539,0.003539,0.001738,0.001738,0.001738,-0.001459,-0.001459,-0.001459
200006,200006,-0.449255,-0.449255,-0.449255,-0.979398,-0.979398,-0.979398,-0.069031,-0.069031,-0.069031,...,-0.165588,-0.279586,0.003539,0.003539,-0.273969,0.001738,0.001738,-0.294633,-0.001459,-0.001459
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299994,299994,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,-0.165588,-0.265564,-0.270446,-0.270747,-0.256200,-0.263090,-0.267985,-0.262542,-0.279291,-0.282271
299995,299995,-0.449255,-0.449255,-0.449255,-0.979398,-0.979398,-0.979398,-0.069031,-0.069031,-0.069031,...,-0.165588,-0.296115,-0.296115,-0.296115,-0.273969,-0.273969,-0.273969,-0.294633,-0.294633,-0.294633
299996,299996,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,-0.155776,-0.289371,-0.289810,-0.287516,-0.273496,-0.272045,-0.259945,-0.284472,-0.283784,-0.284207
299997,299997,-0.452818,-0.452818,-0.452818,-0.993386,-0.993386,-0.993386,-0.069323,-0.069323,-0.069323,...,0.087901,-0.137811,-0.111877,-0.084739,-0.070691,-0.041622,-0.008728,-0.098975,-0.082558,-0.068838


> Write the predicted `start_cluster_month_3` values into the test rows (`id >= 200000`), preparing the data for the split back into training and test DataFrames.

In [50]:
train_df = df_restore_start_cluster[df_restore_start_cluster["id"] < 200000]
test_df = df_restore_start_cluster[df_restore_start_cluster["id"] >= 200000]

X = train_df.drop(["id", "end_cluster_month_3"], axis = 1)
y = train_df["end_cluster_month_3"]

X_train, X_value, y_train, y_value = train_test_split(X, y, test_size = 0.2, random_state = 42)

> 1. Split the combined DataFrame back into training and test DataFrames.
> 2. Separate the training features and the target variable.
> 3. Split the data into training and validation sets in an 80/20 ratio.

In [51]:
catboost_model_end_cluster = CatBoostClassifier(
    iterations = 2025,
    depth = 6,
    learning_rate = 0.075,
    random_seed = 47,
    loss_function = "MultiClass",
    task_type = "GPU",
    devices = "0",
    early_stopping_rounds = 20
)

> Create a new classifier model with one changed parameter:
> - Number of training iterations = 2025

In [52]:
cat_names = X_train.select_dtypes(include = ["object"]).columns

feature_importance = train_catboost(
    catboost_model_end_cluster,
    X_train,
    y_train,
    X_value,
    y_value,
    cat_names,
    "catboost_model_end_cluster"
)

0:	learn: 2.2483055	test: 2.2032291	best: 2.2032291 (0)	total: 400ms	remaining: 13m 29s
100:	learn: 0.8388352	test: 0.7671292	best: 0.7671292 (100)	total: 14.3s	remaining: 4m 31s
200:	learn: 0.8097360	test: 0.7511976	best: 0.7511976 (200)	total: 27.1s	remaining: 4m 6s
300:	learn: 0.7932072	test: 0.7453097	best: 0.7453097 (300)	total: 39.4s	remaining: 3m 45s
400:	learn: 0.7820404	test: 0.7423467	best: 0.7423467 (400)	total: 51s	remaining: 3m 26s
500:	learn: 0.7718062	test: 0.7406590	best: 0.7406508 (498)	total: 1m 2s	remaining: 3m 10s
600:	learn: 0.7632253	test: 0.7394039	best: 0.7394039 (600)	total: 1m 14s	remaining: 2m 55s
700:	learn: 0.7557146	test: 0.7386782	best: 0.7386782 (700)	total: 1m 25s	remaining: 2m 41s
800:	learn: 0.7480914	test: 0.7380538	best: 0.7380531 (795)	total: 1m 36s	remaining: 2m 28s
bestTest = 0.738053125
bestIteration = 795
Shrink model to first 796 iterations.


> Train the model and obtain the feature importance.

In [53]:
y_pred = catboost_model_end_cluster.predict(X_value)

print(classification_report(y_value, y_pred))

              precision    recall  f1-score   support

     {other}       0.67      0.54      0.60      3381
          {}       0.72      0.70      0.71      7488
      {α, β}       0.56      0.32      0.41       772
      {α, γ}       0.64      0.64      0.64      2114
      {α, δ}       0.32      0.05      0.09       335
   {α, ε, η}       0.56      0.36      0.44       148
   {α, ε, θ}       0.50      0.14      0.22        76
   {α, ε, ψ}       0.40      0.06      0.11        32
      {α, ε}       0.47      0.26      0.34       343
      {α, η}       0.76      0.88      0.82      3050
      {α, θ}       0.60      0.44      0.50       385
      {α, λ}       0.32      0.08      0.13        75
      {α, μ}       0.71      0.33      0.45       129
      {α, π}       0.00      0.00      0.00         1
      {α, ψ}       0.52      0.58      0.55       262
         {α}       0.80      0.86      0.83     21405
         {λ}       0.00      0.00      0.00         4

    accuracy              

> Predict the classes for the validation subset and print per-class quality metrics (precision, recall, F1-score) for the model.

In [54]:
def weighted_roc_auc(y_true, y_pred, labels, weights_dict):
    unnorm_weights = np.array([weights_dict[label] for label in labels])
    weights = unnorm_weights / unnorm_weights.sum()
    classes_roc_auc = roc_auc_score(
        y_true,
        y_pred,
        labels = labels,
        multi_class = "ovr",
        average = None
    )
    
    return sum(weights * classes_roc_auc)

> Define the function `weighted_roc_auc`, which computes a weighted average of `ROC AUC` (area under the receiver operating characteristic curve) for a multiclass classification task.
> - Compute normalised weights for each class from the supplied `weights_dict`.
> - Compute the per-class `ROC AUC` scores with `roc_auc_score` from `scikit-learn`, using the `"ovr"` (one-vs-rest) strategy without averaging, then combine them into a weighted average with the normalised weights.
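
Written as a formula, the function above computes the following quantity. The symbols are notation introduced here, not taken from the notebook: $w_c$ is the unnormalised weight of class $c$ from `weights_dict`, and $\mathrm{AUC}_c$ is the one-vs-rest ROC AUC of class $c$.

```latex
\mathrm{AUC}_{\text{weighted}}
  = \sum_{c \in \text{labels}} \frac{w_c}{\sum_{k \in \text{labels}} w_k} \, \mathrm{AUC}_c
```

Because the normalised weights sum to 1, the result lies between the smallest and the largest per-class AUC.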

In [56]:
cluster_weights = pd.read_excel("../models/model weights/cluster_weights.xlsx").set_index("cluster")
weights_dict = cluster_weights["unnorm_weight"].to_dict()
y_pred_proba = catboost_model_end_cluster.predict_proba(X_value)

weighted_roc_auc(y_value, y_pred_proba, catboost_model_end_cluster.classes_, weights_dict)

0.9181205040128173

> Compute the weighted area under the ROC curve (ROC AUC) on the validation set, using the per-cluster weights from `cluster_weights.xlsx`.

In [57]:
sample_submission_df = pd.read_csv("../models/sample_submission.csv")
last_m_test_df = test_df.drop(["id" , 'end_cluster_month_3'], axis = 1)

pool = Pool(data = last_m_test_df, cat_features = np.array(cat_names))

test_pred_proba = catboost_model_end_cluster.predict_proba(pool)
test_pred_proba_df = pd.DataFrame(test_pred_proba, columns = catboost_model_end_cluster.classes_)
sorted_classes = sorted(test_pred_proba_df.columns.to_list())
test_pred_proba_df = test_pred_proba_df[sorted_classes]

sample_submission_df[sorted_classes] = test_pred_proba_df
sample_submission_df.to_csv("../models/final_model.csv", index = False)

> Predict class probabilities for the test data with the trained model and save the predictions to a .csv file for later evaluation and analysis.
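
The column handling in this step can be sketched without a trained model. In the example below the class order, the probabilities, and the submission template are all made up; the point is only that probability columns are matched to the submission by class name rather than by position.

```python
import numpy as np
import pandas as pd

# Hypothetical class order as a model might report it, and matching probabilities.
classes = np.array(["{α}", "{}", "{other}"])
proba = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])

# Hypothetical submission template with one column per class.
submission = pd.DataFrame(
    {"id": [200000, 200001], "{other}": 0.0, "{}": 0.0, "{α}": 0.0}
)

# Label the probability columns with class names, then order them alphabetically.
proba_df = pd.DataFrame(proba, columns=classes)
sorted_classes = sorted(proba_df.columns.to_list())

# Both frames share a default 0..n-1 index, so rows line up one to one.
submission[sorted_classes] = proba_df[sorted_classes]

print(submission)
```

This only gives correct results if the rows of the test features are in the same order as the `id` values in the submission template.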

In [58]:
sample_submission_df

Unnamed: 0,id,{other},{},"{α, β}","{α, γ}","{α, δ}","{α, ε, η}","{α, ε, θ}","{α, ε, ψ}","{α, ε}","{α, η}","{α, θ}","{α, λ}","{α, μ}","{α, π}","{α, ψ}",{α},{λ}
0,200000,0.009987,0.017949,0.022878,0.019627,0.008703,0.000227,0.003121,0.000310,0.007130,0.002590,0.024372,0.000709,0.001978,2.988509e-05,0.002008,0.878376,0.000005
1,200001,0.005875,0.516558,0.000745,0.001394,0.000681,0.000235,0.000475,0.000015,0.001311,0.005744,0.002270,0.000187,0.000535,1.131388e-06,0.000511,0.463138,0.000324
2,200002,0.559171,0.004891,0.006374,0.093288,0.012383,0.005413,0.006088,0.014471,0.072812,0.014171,0.025888,0.007740,0.003321,7.174726e-05,0.050556,0.123299,0.000060
3,200003,0.025114,0.580016,0.000563,0.001423,0.000388,0.000395,0.000542,0.000023,0.001157,0.014868,0.002724,0.000027,0.000708,5.569543e-07,0.000391,0.371626,0.000036
4,200004,0.072791,0.122713,0.026744,0.020878,0.010139,0.004550,0.001979,0.000130,0.012249,0.039036,0.005210,0.001736,0.030281,1.221541e-05,0.000827,0.650643,0.000083
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,299995,0.014917,0.331213,0.001382,0.004759,0.001501,0.000052,0.000375,0.000014,0.001081,0.001387,0.001666,0.000351,0.000413,1.977595e-06,0.000805,0.639982,0.000101
99996,299996,0.018823,0.047659,0.009912,0.031876,0.007700,0.000152,0.000977,0.000068,0.009283,0.005115,0.011261,0.001507,0.001790,4.938733e-06,0.001707,0.851553,0.000613
99997,299997,0.029824,0.039586,0.025063,0.062239,0.015198,0.000146,0.001363,0.000537,0.018174,0.003847,0.007731,0.000895,0.002162,7.713720e-05,0.018239,0.774893,0.000026
99998,299998,0.070579,0.099832,0.034245,0.049981,0.013570,0.000832,0.002628,0.001493,0.020297,0.007317,0.009502,0.019047,0.002267,1.660793e-05,0.005231,0.661481,0.001682


Our task was to build a CLTV model that predicts the probability of a client moving into each of 17 product clusters within 12 months.

For this we used data provided by Alfa-Bank:

- The training dataset `train_data.pqt` contains information on 200,000 bank clients and their target variables for three consecutive months (month_1, month_2, month_3).
- The test dataset `test_data.pqt` contains records for 100,000 clients over three consecutive months (month_4, month_5, month_6).

For each client, the data specify the product cluster the client is expected to belong to in a year (end_cluster). Our task is to predict the probabilities of clients moving into these product clusters for the last month (month_6).

While analysing the data we found the following patterns:

- A client stays in the same cluster from month 2 to month 3 in 93% of cases (0.93).
- The missing values in the data followed a definite pattern.

We created new features holding average values computed from combinations of other columns. We also merged all data about one client into a single row.

Model quality was evaluated with the ROC-AUC metric.

The CatBoost gradient boosting model gave the best result in terms of quality and speed.
Result obtained: weighted ROC-AUC = 0.918120504 on the validation split (20% of the training data).

<p style="text-align: center;">Let us into the qualifying round</p>
<p align="center">
    <img src="sticker.gif" width="350" align="center">
</p>
<!-- <p style="text-align: center;">We really want an indulgence! 🙏</p> -->