In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
ABALONE_DATASET = 'https://raw.githubusercontent.com/akiltrebreg/eda_and_dev_tools/main/abalone.csv'

In [3]:
df = pd.read_csv(ABALONE_DATASET)

In [4]:
df.head()

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,age
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,16.5
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,8.5
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,10.5
3,M,0.44,0.365,0.125,0.7985,0.2155,0.114,0.155,11.5
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,8.5


# Simple nonlinear models: classification

## Naive Bayes classifier

In [5]:
X = df[['length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight']]
y = df['age']

We split the abalones into two classes: "young" (0) and "old" (1). As the threshold we take 12, the mean age of sexually mature abalones. Abalones younger than 12 years are labeled young, and those older than 12 are labeled old (the comparison `y > 12` below would put an age of exactly 12 in the young class).

In [6]:
y_class = (y > 12).astype(int)

In [7]:
y_class.value_counts()

0    2730
1    1447
Name: age, dtype: int64

The classes are imbalanced: there are about 1.9 times as many young abalones as old ones.
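The ratio quoted above can be checked directly from the class counts. The sketch below rebuilds a label series with the same counts as the `value_counts()` output; `y_demo` is a stand-in introduced for illustration, not the real `y_class`.

```python
import pandas as pd

# Stand-in labels with the same class counts as the output above
# (2730 "young" = 0, 1447 "old" = 1); this is not the real y_class.
y_demo = pd.Series([0] * 2730 + [1] * 1447, name="age")

counts = y_demo.value_counts()
imbalance_ratio = counts.loc[0] / counts.loc[1]        # majority / minority
class_shares = y_demo.value_counts(normalize=True)     # fraction of each class

print(f"young/old ratio: {imbalance_ratio:.2f}")
print(class_shares.round(3))
```

`value_counts(normalize=True)` is often more convenient than the raw ratio when the shares are later compared with the class shares in the train and test splits.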

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.25, random_state=42)

In [9]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

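We choose ROC-AUC as the target metric because it is relatively insensitive to class imbalance. Note that `roc_auc_score` is meant to receive scores or probabilities; the cells below pass it the hard labels returned by `predict`, in which case the value reduces to the mean of the recalls of the two classes (balanced accuracy).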

In [10]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred)

0.6951756018093561
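The score above is computed from the output of `predict`. A minimal sketch of the difference between label-based and probability-based ROC-AUC follows; it uses synthetic data from `make_classification` as a stand-in for the abalone features, so the printed numbers are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (assumption: not the abalone dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.65, 0.35], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = GaussianNB().fit(X_tr, y_tr)
labels = clf.predict(X_te)

# ROC-AUC from hard 0/1 labels: the ROC curve has a single interior point,
# so the area equals balanced accuracy.
auc_from_labels = roc_auc_score(y_te, labels)
bal_acc = balanced_accuracy_score(y_te, labels)

# ROC-AUC from the predicted probability of class 1: uses the full ranking.
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"from labels: {auc_from_labels:.3f} (balanced accuracy: {bal_acc:.3f})")
print(f"from probabilities: {auc_from_proba:.3f}")
```

Passing `predict_proba(...)[:, 1]` is what `scoring='roc_auc'` does internally in `cross_val_score` and `GridSearchCV`, which is why the two kinds of numbers should not be compared directly.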

In [11]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

array([[487, 212],
       [106, 240]])
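The confusion matrix above can be read as follows (rows are true classes, columns are predicted classes). This sketch copies the matrix from the output and derives per-class rates from it; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the naive Bayes classifier, copied from the output above.
# scikit-learn layout: rows = true class, columns = predicted class.
cm = np.array([[487, 212],
               [106, 240]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of "old" abalones that were found
specificity = tn / (tn + fp)   # share of "young" abalones labeled correctly
precision = tp / (tp + fp)     # share of predicted "old" that are truly "old"
mean_recall = (recall + specificity) / 2

print(f"recall={recall:.3f}  specificity={specificity:.3f}  precision={precision:.3f}")
print(f"mean of the two recalls = {mean_recall:.4f}")
```

The mean of the two recalls reproduces the 0.695 reported by `roc_auc_score` above, which is what one expects when the metric is given hard labels.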

## KNN

In [12]:
from sklearn.neighbors import KNeighborsClassifier

knn_cl = KNeighborsClassifier()

knn_cl.fit(X_train, y_train)

pred_knn = knn_cl.predict(X_test)

In [13]:
roc_auc_score(y_test, pred_knn)

0.7291341057001332

In [14]:
confusion_matrix(y_test, pred_knn)

array([[587, 112],
       [132, 214]])

We have looked at two baseline models; both give mediocre results.
The k-nearest neighbors method did better (roc_auc_score = 0.73) than the naive Bayes classifier (roc_auc_score = 0.70).

## Cross-validation

In [15]:
from sklearn.model_selection import cross_val_score

cross_val_score(knn_cl, X, y_class, scoring='roc_auc', cv=3, n_jobs=-1).mean()

0.8075799910623754

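Cross-validation gives a more reliable estimate of model quality: roc_auc_score = 0.81. This number is not directly comparable with the 0.73 above: `scoring='roc_auc'` ranks objects by predicted scores, whereas the earlier value was computed from the hard labels returned by `predict`.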
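With `cv=3` and a classifier, scikit-learn uses stratified folds without shuffling. A hedged sketch of the same evaluation with an explicit, shuffled splitter and a per-fold report is shown below; the data is synthetic and the splitter settings (`shuffle=True`, `random_state=42`) are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the abalone dataset).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=7, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# Explicit splitter: stratified by class, shuffled, reproducible.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

scores = cross_val_score(
    KNeighborsClassifier(), X_demo, y_demo, scoring="roc_auc", cv=cv
)

print(f"ROC-AUC per fold: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a later change of a few hundredths is meaningful.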

## Hyperparameter tuning

In [16]:
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.25, random_state=42)

model = KNeighborsClassifier()

params = {'n_neighbors' : np.arange(2, 30, 2),
          'weights' : ['uniform', 'distance'],
          'metric' : ['manhattan', 'euclidean', 'chebyshev', 'minkowski']}

gs = GridSearchCV(model, params, scoring='roc_auc', cv=3, n_jobs=-1, verbose=2)
gs.fit(X_train, y_train)

Fitting 3 folds for each of 112 candidates, totalling 336 fits


In [17]:
gs.best_score_, gs.best_params_

(0.8479185700658861,
 {'metric': 'euclidean', 'n_neighbors': 26, 'weights': 'distance'})
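One detail of the grid above: in scikit-learn, `metric='minkowski'` with the default `p=2` is the Euclidean distance, so the `'euclidean'` and `'minkowski'` candidates are equivalent and the 112 candidates contain duplicates. The sketch below illustrates this on synthetic data; the split sizes and `n_neighbors=26` are chosen only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the abalone dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=1
)
X_fit, y_fit, X_new = X_demo[:450], y_demo[:450], X_demo[450:]

proba = {}
for metric in ("euclidean", "minkowski"):
    # minkowski uses p=2 by default, i.e. the Euclidean distance
    knn = KNeighborsClassifier(n_neighbors=26, weights="distance", metric=metric)
    proba[metric] = knn.fit(X_fit, y_fit).predict_proba(X_new)

same = np.allclose(proba["euclidean"], proba["minkowski"])
print("identical probabilities:", same)

# Grid size: 14 values of n_neighbors x 2 weightings x 4 metrics
n_candidates = len(np.arange(2, 30, 2)) * 2 * 4
print("grid size:", n_candidates)
```

Dropping one of the two equivalent metrics, or adding `p` to the grid instead, would reduce the number of fits without losing any distinct candidate.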

In [18]:
pred = gs.best_estimator_.predict(X_test)

roc_auc_score(y_test, pred)

0.7499069686670471

In [19]:
confusion_matrix(y_test, pred)

array([[612,  87],
       [130, 216]])

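Hyperparameter tuning did not change the quality of the trained model drastically: on the test set roc_auc_score rose from 0.73 to 0.75. (The cross-validated `best_score_` of 0.85 is computed from predicted scores, so it is not comparable with this label-based value.)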

## Handling categorical features

The original dataset has one categorical feature, 'sex' (the sex of the abalone).

In [20]:
X_full = df.drop('age', axis=1)

X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_class, test_size=0.25, random_state=42)

In [21]:
X_train_full.dtypes

sex                object
length            float64
diameter          float64
height            float64
whole_weight      float64
shucked_weight    float64
viscera_weight    float64
shell_weight      float64
dtype: object

In [22]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer

categorical = ['sex']
numeric_features = ['length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight']

ct = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown="ignore"), categorical),
    ('scaling', MinMaxScaler(), numeric_features)
])

X_train_transformed = ct.fit_transform(X_train_full)
X_test_transformed = ct.transform(X_test_full)

In [23]:
# extract the column names after encoding

new_features = list(ct.named_transformers_['ohe'].get_feature_names_out())
new_features.extend(numeric_features)

new_features

['sex_F',
 'sex_I',
 'sex_M',
 'length',
 'diameter',
 'height',
 'whole_weight',
 'shucked_weight',
 'viscera_weight',
 'shell_weight']

In [24]:
X_train_transformed = pd.DataFrame(X_train_transformed, columns=new_features)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=new_features)

X_train_transformed.head()

Unnamed: 0,sex_F,sex_I,sex_M,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight
0,1.0,0.0,0.0,0.72973,0.672269,0.119469,0.374358,0.317754,0.345622,0.27155
1,1.0,0.0,0.0,0.594595,0.571429,0.123894,0.242253,0.188299,0.164582,0.217738
2,0.0,0.0,1.0,0.790541,0.798319,0.154867,0.559943,0.496638,0.460829,0.402093
3,0.0,0.0,1.0,0.513514,0.521008,0.084071,0.181335,0.150303,0.132324,0.147982
4,0.0,0.0,1.0,0.567568,0.579832,0.137168,0.285638,0.157028,0.151415,0.347285


In [25]:
model = KNeighborsClassifier()

params = {'n_neighbors' : np.arange(2, 30, 2),
          'weights' : ['uniform', 'distance'],
          'metric' : ['manhattan', 'euclidean', 'chebyshev', 'minkowski']}

gs = GridSearchCV(model, params, scoring='roc_auc', cv=3, n_jobs=-1, verbose=5)
gs.fit(X_train_transformed, y_train_full)

Fitting 3 folds for each of 112 candidates, totalling 336 fits


In [26]:
gs.best_score_, gs.best_params_

(0.8366148942078975,
 {'metric': 'manhattan', 'n_neighbors': 28, 'weights': 'distance'})

In [27]:
pred = gs.best_estimator_.predict(X_test_transformed)

roc_auc_score(y_test_full, pred)

0.7333680650309691

In [28]:
confusion_matrix(y_test_full, pred)

array([[601,  98],
       [136, 210]])

Adding the categorical column (together with min-max scaling of the numeric features) did not improve the model, but it did not change the result drastically either: roc_auc_score = 0.73.

## Pipeline

In [29]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('transformer', ct),
    ('model', KNeighborsClassifier())
    ])

pipe.fit(X_train_full, y_train_full)

pred_pipe = pipe.predict(X_test_full)

In [30]:
roc_auc_score(y_test_full, pred_pipe)

0.7377612113093024

Hyperparameter tuning in the pipeline

In [31]:
params = {'model__n_neighbors' : np.arange(2, 30, 2),
          'model__weights': ['uniform', 'distance'],
          'model__metric': ['manhattan', 'euclidean', 'chebyshev', 'minkowski']}

gs = GridSearchCV(pipe, params, scoring='roc_auc', cv=3, n_jobs=-1, verbose=2)

gs.fit(X_train_full, y_train_full)

Fitting 3 folds for each of 112 candidates, totalling 336 fits


In [32]:
gs.best_estimator_

Thus, the k-nearest neighbors method was chosen to train the model.
The selected hyperparameters are: Manhattan metric, number of neighbors = 24, distance-based weights (closer neighbors weigh more).
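The selected values can be read programmatically instead of from the printed estimator. The sketch below shows the `step__parameter` naming on a small pipeline with synthetic data; the pipeline, the reduced grid, and all `*_demo` names are illustrative assumptions rather than the notebook's objects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the abalone dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=2
)

pipe_demo = Pipeline([
    ("scaling", MinMaxScaler()),
    ("model", KNeighborsClassifier()),
])

# Keys are "<step name>__<parameter name>".
param_grid = {
    "model__n_neighbors": np.arange(2, 12, 4),   # 2, 6, 10
    "model__weights": ["uniform", "distance"],
}

gs_demo = GridSearchCV(pipe_demo, param_grid, scoring="roc_auc", cv=3)
gs_demo.fit(X_demo, y_demo)

# Best parameters as a dict, and the fitted final step of the best pipeline.
print(gs_demo.best_params_)
best_knn = gs_demo.best_estimator_.named_steps["model"]
print(best_knn.n_neighbors, best_knn.weights)
```

Because the scaler lives inside the pipeline, it is refitted on the training part of every fold, so the cross-validated score is not affected by information from the validation part.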


# Explainer Dashboard

In [33]:
!pip install explainerdashboard -q

In [34]:
from explainerdashboard import ClassifierExplainer, ExplainerDashboard

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)


In [35]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [36]:
explainer = ClassifierExplainer(gs.best_estimator_, X_test_full.iloc[:1000], y_test_full.iloc[:1000])

splitting pipeline...
Detected sklearn/imblearn Pipeline and succesfully extracted final output dataframe with column names and final model...
Note: shap values for shap='kernel' normally get calculated against X_background, but paramater X_background=None, so setting X_background=shap.sample(X, 50)...
Generating self.shap_explainer = shap.KernelExplainer(model, X, link='identity')


In [37]:
db = ExplainerDashboard(explainer)

Building ExplainerDashboard..
Detected google colab environment, setting mode='external'
For this type of model and model_output interactions don't work, so setting shap_interaction=False...
The explainer object has no decision_trees property. so setting decision_trees=False...
Generating layout...
Calculating shap values...


See https://dash.plotly.com/dash-in-jupyter for more details.



Calculating prediction probabilities...
Calculating metrics...
Calculating confusion matrices...
Calculating classification_dfs...
Calculating roc auc curves...
Calculating pr auc curves...
Calculating liftcurve_dfs...
Calculating dependencies...
Calculating permutation importances (if slow, try setting n_jobs parameter)...



X has feature names, but KNeighborsClassifier was fitted without feature names




Calculating predictions...
Calculating pred_percentiles...
Reminder: you can store the explainer (including calculated dependencies) with explainer.dump('explainer.joblib') and reload with e.g. ClassifierExplainer.from_file('explainer.joblib')
Registering callbacks...






In [38]:
db.run()

Starting ExplainerDashboard on http://172.28.0.12:8050
You can terminate the dashboard with ExplainerDashboard.terminate(8050)



On average, the most important features for the prediction were shell weight ('shell_weight') and shucked (meat) weight ('shucked_weight'). <br><br>
Precision is the fraction of objects labeled positive by the classifier that are actually positive. For our model it equals 0.472. <br>
Recall is the fraction of all positive-class objects that the classifier found; here it found almost all of them (score = 0.997). <br>
Thus nearly all "old" abalones were detected, but "young" ones were often mistaken for "old". <br>
Because precision is low while recall is high, the reported f1-score is a moderate 0.557. <br>
Consider the integral metrics roc_auc_score (score = 0.853) and pr_auc_score (score = 0.742). The closer ROC AUC and Precision-Recall AUC are to 1, the better the model ranks objects by predicted class probability; at these values the model performs moderately.<br><br>
For index 3229, the predicted probability of belonging to the "old" class was 75.2%. The largest contribution came from 'shell_weight' (20.26%), while 'shucked_weight' had almost no effect on the result (-0.32%). <br>
For index 416, the predicted probability of belonging to the "old" class was 60.4%. The largest contribution came from 'shell_weight' (20.11%); 'shucked_weight' was second in importance, with a contribution of -8.34%. <br>
For index 1422, the predicted probability of belonging to the "old" class was 86.3%. The largest contribution came from 'shell_weight' (18.47%), while 'shucked_weight' had a smaller effect (-5.77%). <br>
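The precision/recall trade-off described above depends on the probability cutoff used to turn scores into labels. The sketch below illustrates this on synthetic data with two arbitrary cutoffs (0.1 and 0.5); the cutoff used in the dashboard is not stated in the notebook, so these values and the model settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (assumption: not the abalone dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.65, 0.35], random_state=3,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

knn = KNeighborsClassifier(n_neighbors=25).fit(X_tr, y_tr)
proba_old = knn.predict_proba(X_te)[:, 1]   # predicted probability of class 1

results = {}
for cutoff in (0.1, 0.5):
    pred = (proba_old >= cutoff).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    f1 = f1_score(y_te, pred, zero_division=0)   # harmonic mean of p and r
    results[cutoff] = (p, r, f1)
    print(f"cutoff={cutoff}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Lowering the cutoff can only add positive predictions, so recall never decreases, while precision typically falls; a very high recall with low precision, as in the text above, is the pattern a low cutoff produces.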