## Binary classification metrics

**P (positive)** - the actual number of positive cases / sick patients / objects belonging to class 1

**N (negative)** - the actual number of negative cases / healthy patients / objects belonging to class 0

**TP (true positive)** - the number of positive cases predicted correctly, 1 ✅

**TN (true negative)** - the number of negative cases predicted correctly, 0 ✅

**FP (false positive)** - the number of negative cases predicted incorrectly (labeled as positive), 0 ❌ (type I error)

**FN (false negative)** - the number of positive cases predicted incorrectly (labeled as negative), 1 ❌ (type II error)

**Sensitivity, recall, true positive rate (TPR)**:

$$ TPR = \frac{TP}{P} = \frac{TP}{TP + FN} = \frac{1 ✅}{1 ✅ + 1 ❌}$$

**Specificity, true negative rate (TNR)**:

$$ TNR = \frac{TN}{N} = \frac{TN}{TN + FP} = \frac{0 ✅}{0 ✅ + 0 ❌}$$

**Precision, positive predictive value (PPV)**:

$$ PPV = \frac{TP}{TP + FP} = \frac{1 ✅}{1 ✅ + 0 ❌} $$

**Accuracy (ACC)**:

$$ ACC = \frac{TP + TN}{P + N} = \frac{1 ✅ + 0 ✅}{1 ✅ + 1 ❌ + 0 ✅ + 0 ❌} $$
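
As a minimal sketch of the four formulas above, the function below counts TP, TN, FP, and FN from two label lists and derives each metric from those counts. The function name `binary_metrics` and the sample labels are illustrative assumptions, not part of the original analysis.

```python
def binary_metrics(y_true, y_pred):
    """Return TPR, TNR, PPV and ACC for binary labels (1 = positive, 0 = negative).

    Assumes both classes occur and at least one positive prediction is made,
    otherwise one of the denominators would be zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 1 predicted correctly
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 0 predicted correctly
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 0 predicted incorrectly
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 1 predicted incorrectly
    return {
        'TPR': tp / (tp + fn),
        'TNR': tn / (tn + fp),
        'PPV': tp / (tp + fp),
        'ACC': (tp + tn) / len(pairs),
    }


# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
example_metrics = binary_metrics(y_true, y_pred)
print(example_metrics)
```

Here TP = 3, FN = 1, FP = 1, and TN = 5, so sensitivity and precision are both 3/4, specificity is 5/6, and accuracy is 8/10.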

![accprecc](https://www.researchgate.net/profile/Anni-Helena-Ruotsala/publication/304674901/figure/fig6/AS:668649476067338@1536429866393/Precision-versus-accuracy-The-bullseye-represents-the-true-value-eg-the-true.ppm)

**F1 score**

The F1 score combines precision and recall (sensitivity) into a single number using the harmonic mean.

Combining them lets both metrics be taken into account at once when training a model and searching for the best hyperparameters.

$$ F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} $$
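
A short sketch of why the harmonic mean is a reasonable choice here: unlike the arithmetic mean, it stays low whenever either precision or recall is low. The helper name `f1_from` and the input values are illustrative assumptions; the expression `2 * p * r / (p + r)` is algebraically the same as the formula above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; taken to be 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced precision and recall: F1 equals their common value (about 0.8).
balanced = f1_from(0.8, 0.8)

# Perfect precision but poor recall: F1 is pulled toward the weaker metric.
lopsided = f1_from(1.0, 0.1)

# The arithmetic mean of the same pair hides the poor recall.
arithmetic_lopsided = (1.0 + 0.1) / 2

print(balanced, lopsided, arithmetic_lopsided)
```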




In [1]:
import pandas as pd
import numpy as np
from sklearn import (
    preprocessing,
    linear_model,
    svm,
    neighbors,
    naive_bayes,
    ensemble,
    neural_network,
    model_selection,
    pipeline,
    feature_selection,
    metrics,
)
from featuretools import selection
import matplotlib.pyplot as plt

In [7]:
t_cells = pd.read_table('T_cells.tsv')
b_cells = pd.read_table('B_cells.tsv')

# Preprocessing

## Add class

In [None]:
t_cells_modified = t_cells.drop(t_cells.columns[0], axis=1)
t_cells_modified['CELL_TYPE'] = 1

b_cells_modified = b_cells.drop(b_cells.columns[0], axis=1)
b_cells_modified['CELL_TYPE'] = 0

## Concatenate matrices

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same
# stacked frame with a fresh 0..n-1 index.
all_cells = pd.concat([t_cells_modified, b_cells_modified], ignore_index=True)
all_cells

## Normalize features

In [None]:
all_cells_X = all_cells.iloc[: , :-1]
temp_X = preprocessing.normalize(all_cells_X, norm='l2')
all_cells_X = pd.DataFrame(temp_X, columns=all_cells_X.columns)
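
Note that `preprocessing.normalize` with `norm='l2'` rescales each row (sample), so every cell's feature vector ends up with unit Euclidean length; it does not standardize individual columns. A pure-Python sketch of the same per-row operation, using a made-up two-row matrix and a hypothetical helper `l2_normalize_rows`:

```python
import math


def l2_normalize_rows(rows):
    """Divide each row by its Euclidean norm; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row))
        normalized.append([value / norm for value in row] if norm else list(row))
    return normalized


# Illustrative data: two samples with two features each.
rows = [[3.0, 4.0], [0.0, 2.0]]
normalized_rows = l2_normalize_rows(rows)
print(normalized_rows)
```

The first row has norm 5 and the second has norm 2, so both rows become unit vectors regardless of their original scale.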

## Remove highly correlated features

In [None]:
X_all_cells = selection.remove_highly_correlated_features(all_cells_X)

## Save feature names

In [None]:
feature_names = X_all_cells.columns.T

## Get Train and Test data

In [None]:
TEST_SIZE = 0.2
y_all_cells = all_cells.iloc[: , -1:].values.ravel()

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_all_cells.values, y_all_cells, test_size=TEST_SIZE, random_state=np.random.RandomState(0),
)

# Classifiers preparation

In [None]:
CLASSIFIERS = {
    # Logistic Regression.
    linear_model.LogisticRegression(): {'C': [1, 5, 10]},
    
    # Perceptron.
    linear_model.Perceptron(
        penalty='elasticnet', early_stopping=True
    ): {'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0]},
    
    # Support-vector machine.
    svm.SVC(): {
        'kernel': ['rbf', 'linear'], 'C': [1, 5, 10]
    },
    
    # k-nearest Neighbors algorithm.
    neighbors.KNeighborsClassifier(): {
        'n_neighbors': [5, 10, 15, 20], 'weights': ('uniform', 'distance')
    },
    
    # Gaussian Naive Bayes.
    naive_bayes.GaussianNB(): None,
    
    # Gradient Boosting.
    ensemble.GradientBoostingClassifier(): {
        'learning_rate': [0.01, 0.1, 1.0], 'n_estimators': [10, 100, 1000]
    },
    
    # Multilayer Perceptron.
    neural_network.MLPClassifier(learning_rate='adaptive', early_stopping=True): None,
}

# Finding the best features using grid search

## Finding features

In [None]:
N_FEATURES_RANGE = 3

scores = ['accuracy', 'f1']  # Classifiers scores for grid search
all_best_features_counts = {}

for k in range(1, N_FEATURES_RANGE+1):
    # ANOVA F-value scoring function.
    SCORE_FUNCTION = feature_selection.f_classif

    # Number of selected features.
    N_FEATURES = k

    # Select features according to the k highest scores.
    skbest = feature_selection.SelectKBest(
        score_func=SCORE_FUNCTION,
        k=N_FEATURES
    )
    # Iterate over all classifiers.
    current_best_features_counts = {}
    for classifier, tuned_parameters in CLASSIFIERS.items():
        for score in scores:
            # Init Pipeline.
            pipe = pipeline.Pipeline(steps=[
                ('filter', skbest),
                ('classifier', classifier),
            ])

            # Init parameters for grid search.
            param_grid = {}
            if tuned_parameters is not None:
                for param_name, values in tuned_parameters.items():
                    param_grid[f'classifier__{param_name}'] = values

            # Prepare cross-validation method.
            cv = model_selection.ShuffleSplit(
                n_splits=5,
                test_size=0.2,
                random_state=np.random.RandomState(0)
            )

            # Run grid search.
            search = model_selection.GridSearchCV(
                pipe,
                param_grid,
                scoring=score,
                cv=cv,
                n_jobs=-1,
            )
            search.fit(X_train, y_train)

            # Count how often each feature is kept by the best pipeline.
            best_features_names = list(feature_names[
                search.best_estimator_['filter'].get_support()
            ])
            for feature_name in best_features_names:
                if feature_name not in current_best_features_counts:
                    current_best_features_counts[feature_name] = 1
                else:
                    current_best_features_counts[feature_name] += 1
    all_best_features_counts[N_FEATURES] = current_best_features_counts
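
The bookkeeping at the end of the loop is a plain frequency tally: each grid search contributes one vote for every feature its best pipeline kept. As a sketch, the same tally can be written with `collections.Counter`; the feature names and the list of per-search selections below are invented for illustration.

```python
from collections import Counter

# Hypothetical output of three grid searches with k = 2: each inner list
# holds the feature names kept by the 'filter' step of the best pipeline.
selected_per_search = [
    ['feature_a', 'feature_b'],
    ['feature_a', 'feature_b'],
    ['feature_a', 'feature_c'],
]

# Counter.update adds one to the count of every name in the iterable.
feature_counts = Counter()
for selected in selected_per_search:
    feature_counts.update(selected)

print(dict(feature_counts))
```

A feature that is selected in every search ends up with a count equal to the number of searches, which makes consistently useful features easy to spot.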

## Found features

In [None]:
all_best_features_counts

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

# Collect every feature selected for at least one k, in a stable order.
LABELS = sorted(
    {name for counts in all_best_features_counts.values() for name in counts}
)

width = 0.2
x = np.arange(len(LABELS))
for i in range(N_FEATURES_RANGE):
    counts = all_best_features_counts[i+1]
    # Look counts up by label so each bar lines up with its tick label;
    # features not selected for this k get a zero-height bar.
    features_scores = [counts.get(label, 0) for label in LABELS]
    ax.bar(x - width*i, features_scores, width, color=np.random.random(3), label=i+1)

ax.set_ylabel('Count')
ax.set_title('Feature counts for different numbers of features to select')
ax.set_xticks(x)
ax.set_xticklabels(list(LABELS))
ax.legend()

fig.tight_layout()

plt.show()

# Test best features

## Function for testing

In [None]:
def eval_features(features_names, df_to_use):
    # Cut data by features.
    X_all_cells_cut = df_to_use[features_names]

    # Prepare data for testing.
    X_train_eval, X_test_eval, y_train_eval, y_test_eval = model_selection.train_test_split(
        X_all_cells_cut.values, y_all_cells, test_size=0.2, random_state=np.random.RandomState(0),
    )

    test_results = []
    for classifier, tuned_parameters in CLASSIFIERS.items():
        # Prepare cross-validation method.
        cv = model_selection.ShuffleSplit(
            n_splits=5,
            test_size=0.2,
            random_state=np.random.RandomState(0)
        )

        # Prepare grid search.
        param_grid = {}
        if tuned_parameters is not None:
            for param_name, values in tuned_parameters.items():
                param_grid[f'{param_name}'] = values
        test_search = model_selection.GridSearchCV(
            classifier,
            param_grid,
            cv=cv,
            n_jobs=-1,
        )

        test_search.fit(X_train_eval, y_train_eval)
        
        y_pred_eval_train = test_search.predict(X_train_eval)
        tn, fp, fn, tp = metrics.confusion_matrix(
            y_train_eval, y_pred_eval_train
        ).ravel()
        TRAIN_SENS = round(tp / (tp + fn), 2)
        TRAIN_SPEC = round(tn / (tn + fp), 2)
        TRAIN_ACC = round(metrics.accuracy_score(y_train_eval, y_pred_eval_train), 2)
        
        y_pred_eval_test = test_search.predict(X_test_eval)
        tn, fp, fn, tp = metrics.confusion_matrix(
            y_test_eval, y_pred_eval_test
        ).ravel()
        TEST_SENS = round(tp / (tp + fn), 2)
        TEST_SPEC = round(tn / (tn + fp), 2)
        TEST_ACC = round(metrics.accuracy_score(y_test_eval, y_pred_eval_test), 2)

        test_results.append({
            'classifier': classifier.__class__.__name__,
            'TRAIN sens': TRAIN_SENS,
            'TRAIN spec': TRAIN_SPEC,
            'TRAIN ACC': TRAIN_ACC,
            'TEST sens': TEST_SENS,
            'TEST spec': TEST_SPEC,
            'TEST ACC': TEST_ACC, 
        })
    test_results_df = pd.DataFrame(
        test_results).set_index('classifier').sort_values(
            by=['TEST ACC', 'TRAIN ACC'], ascending=False
    )
    return (X_all_cells_cut, test_results_df)

## Test Best 1

In [None]:
best_features_names = list(all_best_features_counts[1].keys())
slice_df_1, test_results_1 = eval_features(best_features_names, X_all_cells)

In [None]:
print('Selected features: ', list(slice_df_1.columns))
test_results_1

## Test Best 2

In [None]:
best_features_names = list(all_best_features_counts[2].keys())
slice_df_2, test_results_2 = eval_features(best_features_names, X_all_cells)

In [None]:
print('Selected features: ', list(slice_df_2.columns))
test_results_2

## Test Best 3

In [None]:
best_features_names = list(all_best_features_counts[3].keys())
slice_df_3, test_results_3 = eval_features(best_features_names, X_all_cells)

In [None]:
print('Selected features: ', list(slice_df_3.columns))
test_results_3

In [None]:
best_features_names

# Test combinations of given features

In [None]:
given_features = {
    482: 'CMY_Y_Wavelet_energy',
    66: 'RGB_B_Wavelet_energy',
    79: 'HSVIL_H_Матрица простр смежности_ASM',
    54: 'RGB_B_Матрица простр смежности_CON',
    466: 'CMY_M_pix_mean',
    83: 'HSVIL_H_Матрица простр смежности_MPR',
    77: 'RGB_B_pix_stddev',
    492: 'CMY_Y_pix_mean',
    2: 'RGB_R_Матрица простр смежности_CON',
    51: 'RGB_G_pix_stddev',
}

## 1 Feature

In [None]:
given_1 = []
for feature in given_features.values():
    slice_df_given_1, test_results_given_1 = eval_features([feature], all_cells_X)
    item = {
        'FEATURE_NAME': feature,
        'best classifier': test_results_given_1.index[0],
    }
    item_keys = list(test_results_given_1.iloc[0].index)
    item_values = list(test_results_given_1.iloc[0].values)
    item = {**item, **dict(zip(item_keys, item_values))}
    given_1.append(item)

In [None]:
results_given_1 = pd.DataFrame(given_1).set_index('FEATURE_NAME')
results_given_1

In [None]:
ax = results_given_1['TEST ACC'].plot.barh(title='accuracy')

## 2 Features

In [None]:
features_2 = [
    [66, 79],
    [482, 79],
    [482, 492],
    [66, 83],
    [482, 77],
    [66, 54],
]
features_2 = [[given_features[pair[0]], given_features[pair[1]]] for pair in features_2]
features_2

In [None]:
given_2 = []
for features_pair in features_2:
    slice_df_given_2, test_results_given_2 = eval_features(features_pair, all_cells_X)
    item = {
        'FEATURE_PAIR': features_pair,
        'best classifier': test_results_given_2.index[0],
    }
    item_keys = list(test_results_given_2.iloc[0].index)
    item_values = list(test_results_given_2.iloc[0].values)
    item = {**item, **dict(zip(item_keys, item_values))}
    given_2.append(item)

In [None]:
results_given_2 = pd.DataFrame(given_2).set_index('FEATURE_PAIR')
results_given_2

In [None]:
ax = results_given_2['TEST ACC'].plot.barh(title='accuracy')

## 3 Features

In [None]:
features_3 = [
    [66, 79, 466],
    [482, 79, 54],
    [66, 79, 51],
    [66, 83, 2],
]
# Map each triple of feature ids to the corresponding feature names.
features_3 = [
    [given_features[key] for key in triple] for triple in features_3
]
features_3

In [None]:
given_3 = []
for features_triple in features_3:
    slice_df_given_3, test_results_given_3 = eval_features(features_triple, all_cells_X)
    item = {
        'FEATURE_TRIPLE': features_triple,
        'best classifier': test_results_given_3.index[0],
    }
    item_keys = list(test_results_given_3.iloc[0].index)
    item_values = list(test_results_given_3.iloc[0].values)
    item = {**item, **dict(zip(item_keys, item_values))}
    given_3.append(item)

In [None]:
results_given_3 = pd.DataFrame(given_3).set_index('FEATURE_TRIPLE')
results_given_3

In [None]:
ax = results_given_3['TEST ACC'].plot.barh(title='accuracy')