# Wstęp do uczenia maszynowego. Projekt nr 1.

*Maciej Borkowski, Michał Chęć
21.04.2023r.*

Zadaniem projektowym jest zrealizowanie zadania klasyfikacji binarnej na zbiorze danych numerycznych ze strony
[https://www.kaggle.com/datasets/nextbigwhat/dataset-1](https://www.kaggle.com/datasets/nextbigwhat/dataset-1)

In [1]:
# ładujemy potrzebne pakiety
import pandas as pd
import numpy as np
from tabulate import tabulate
import matplotlib.pyplot as plt

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.feature_selection import SelectKBest, SelectFromModel, SequentialFeatureSelector, RFE, VarianceThreshold
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.ensemble import AdaBoostClassifier

import warnings
import os
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"

np.random.seed(42)

# 1. Import i podział danych

In [2]:
# wczytujemy całą ramkę danych
dataset = pd.read_csv("dataset_1.csv")
df = dataset.copy()

In [3]:
# dzielimy dane, tworzymy zbiór do ewaluacji i do testów
train_set, eval_set = train_test_split(df, test_size=0.3, random_state=42)
train_df, test_df = train_test_split(train_set, test_size=0.3, random_state=42)
df = train_df.copy()

In [4]:
class ColumnRemover(BaseEstimator, TransformerMixin):

    def __init__(self, threshold_constant, threshold_corr, n_info_vals):
        self.threshold_constant = threshold_constant
        self.threshold_corr = threshold_corr
        self.n_info_vals = n_info_vals
        self.columns_to_remove = []
        self.columns_to_keep = []

    def fit(self, X, y=None):
        # usuwanie zduplikowanych kolumn
        self.columns_to_remove.extend(X.loc[:, X.T.duplicated()].columns.tolist())

        # usuwanie stałych i prawie stałych kolumn
        for column in X.columns:
            if X[column].value_counts(normalize=True).iloc[0] >= self.threshold_constant:
                self.columns_to_remove.append(column)

        # usuwanie skorelowanych kolumn, jeśli pierwsza kolumna jest ciągła stosujemy korelację pearsona,
        # jeśli dyskretna to korelację spearmana; pierwsza kolumna będzie typu takiego samego jak cały X ponieważ
        # klasę tą stosujemy w column tranformerze ograniczając podzbiór kolumn do okreslonego typu
        if X.dtypes[0] == 'float64':
            corr = X.corr(method='pearson')
        else:
            corr = X.corr(method='spearman')
        corr = corr[corr > self.threshold_corr]
        dependent_columns = corr.apply(lambda row: row[row > 0].index, axis=1)
        for j in range (len(dependent_columns)):
            for k in dependent_columns[j]:
                if k is not dependent_columns.index[j]:
                    if k not in dependent_columns.index[0:j]:
                        self.columns_to_remove.append(k)

        # usuwanie kolumn nie niosących informacji
        amount_of_ones = y[y == 1].shape[0]
        X = X.join(y)
        for column in X.columns:
            tmp = X.groupby(column)['target'].agg(['sum','count']).sort_values('sum',ascending = False).reset_index()
            if any(tmp[column] == 0) and (tmp.loc[tmp[column] == 0, 'sum'] > amount_of_ones - self.n_info_vals).bool():
                self.columns_to_remove.append(column)
        X.drop('target', axis=1, inplace=True)

        self.columns_to_keep = [col for col in X.columns if col not in self.columns_to_remove]

        return self

    def transform(self, X):
        return X[self.columns_to_keep]

In [5]:
# zarówno dla zmiennych dyskretnych i ciągłych stosujemy nasz transformator ColumnRemover z różnymi parametrami - w kolejnych krokach będziemy szukać najlepszej ich kombinacji

# dokonujemy kodowania one hot encoding zmiennych dyskretnych - traktujemy je jako kategoryczne
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9995, 0.99, 1)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse=False, dtype='int64'))])

# dokojumeny standaryzacji zmiennych ciągłych - w kolejnych krokach sprawdzimy, czy jest lepsza od normalizacji min_max
float_transformer = Pipeline([
    ('float', ColumnRemover(0.9999, 0.99, 10)),
    ('min_max', StandardScaler())])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

In [6]:
X_train = train_df.drop('target', axis=1)
y_train = train_df.target
x_eval = eval_set.drop('target', axis=1)
y_eval = eval_set.target

Utworzyliśmy ponadto funkcję liczącą wszystko za nas i wyświetlającą macierz pomyłek oraz interesujące nas metryki

In [7]:
def show_scores(clf, X, y):
    y_pred = clf.predict(X)
    y_pred_prob = clf.predict_proba(X)
    print(tabulate(confusion_matrix(y, y_pred), headers=['Predicted 0', 'Predicted 1'], tablefmt='orgtbl'))
    print()
    print(f'accuracy:              {round(accuracy_score(y, y_pred), 4)}')
    print(f'precision:             {round(precision_score(y, y_pred), 4)}')
    print(f'recall:                {round(recall_score(y, y_pred), 4)}')
    print(f'f1:                    {round(f1_score(y, y_pred), 4)}')
    print(f'roc_auc_discrete:      {round(roc_auc_score(y, y_pred), 4)}')
    print(f'roc_auc_continuous:    {round(roc_auc_score(y, y_pred_prob[:, 1]), 4)}')

Przejdźmy zatem do modeli

# 3. Regresja logistyczna

## 3.1 Preprocessing

In [8]:
# zarówno dla zmiennych dyskretnych i ciągłych stosujemy nasz transformator ColumnRemover z różnymi parametrami - w kolejnych krokach będziemy szukać najlepszej ich kombinacji

# dokonujemy kodowania one hot encoding zmiennych dyskretnych - traktujemy je jako kategoryczne
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9995, 0.99, 1)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

# dokojumeny standaryzacji zmiennych ciągłych - sprawdziliśmy, że jest lepsza od normalizacji min-max dla tego modelu
float_transformer = Pipeline([
    ('float', ColumnRemover(0.9999, 0.99, 10)),
    ('standard_scaler', StandardScaler())])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

## 3.2 Trening pierwszego modelu

In [9]:
clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', LogisticRegression(random_state=42))])

clf.fit(X_train, y_train)

# wyniki dla danych treningowych
show_scores(clf, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23550 |            10 |
|           934 |             6 |

accuracy:              0.9615
precision:             0.375
recall:                0.0064
f1:                    0.0126
roc_auc_discrete:      0.503
roc_auc_continuous:    0.8057


In [10]:
# wyniki dla danych testowych
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         14391 |             8 |
|           600 |             1 |

accuracy:              0.9595
precision:             0.1111
recall:                0.0017
f1:                    0.0033
roc_auc_discrete:      0.5006
roc_auc_continuous:    0.7864


Model się nie uczy. Wiemy, że klasy są niezbalansowane (jest ok. 4% jedynek). Zastosujemy parametr class_weight = 'balanced')

In [11]:
clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16261 |          7299 |
|           207 |           733 |

accuracy:              0.6936
precision:             0.0913
recall:                0.7798
f1:                    0.1634
roc_auc_discrete:      0.735
roc_auc_continuous:    0.8125


In [12]:
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9873 |          4526 |
|           163 |           438 |

accuracy:              0.6874
precision:             0.0882
recall:                0.7288
f1:                    0.1574
roc_auc_discrete:      0.7072
roc_auc_continuous:    0.7848


Daje nam to minimalne podstawy do dalszego doboru parametrów.

## 3.3 Strojenie hiperparametrów i dobór cech

### 3.3.1 Dobór parametrów dla ColumnRemover'a


In [13]:
#X_train = train_df.drop('target', axis=1)
#y_train = train_df.target
#X_test = test_df.drop('target', axis=1)
#y_test = test_df.target

In [14]:
clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))])

parameters = dict(preprocessing__int_pipe__int__threshold_constant = np.arange(0.9995, 1, 0.0001),
                  preprocessing__int_pipe__int__threshold_corr = np.arange(0.96, 1, 0.01),
                  preprocessing__int_pipe__int__n_info_vals = np.arange(0, 5, 1),
                  preprocessing__float_pipe__float__threshold_constant = np.arange(0.9995, 1, 0.0001),
                  preprocessing__float_pipe__float__threshold_corr = np.arange(0.96, 1, 0.01),
                  preprocessing__float_pipe__float__n_info_vals = np.arange(0, 16, 3))

col_remove_search = RandomizedSearchCV(clf, scoring='roc_auc', param_distributions=parameters, cv=3, n_iter=500, n_jobs=-1, random_state=42).fit(X_train, y_train)

Sprawdzamy rezultaty treningu

In [15]:
print(round(col_remove_search.best_score_, 4), col_remove_search.best_params_)

0.7856 {'preprocessing__int_pipe__int__threshold_corr': 1.0, 'preprocessing__int_pipe__int__threshold_constant': 0.9997, 'preprocessing__int_pipe__int__n_info_vals': 0, 'preprocessing__float_pipe__float__threshold_corr': 0.96, 'preprocessing__float_pipe__float__threshold_constant': 0.9999, 'preprocessing__float_pipe__float__n_info_vals': 3}


In [16]:
show_scores(col_remove_search.best_estimator_, X_train, y_train)
show_scores(col_remove_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16409 |          7151 |
|           206 |           734 |

accuracy:              0.6997
precision:             0.0931
recall:                0.7809
f1:                    0.1663
roc_auc_discrete:      0.7387
roc_auc_continuous:    0.8141
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9958 |          4441 |
|           166 |           435 |

accuracy:              0.6929
precision:             0.0892
recall:                0.7238
f1:                    0.1588
roc_auc_discrete:      0.7077
roc_auc_continuous:    0.7855


Otrzymujemy porównywalne wyniki.

### 3.3.2 Regularyzacja modelu

Zajmiemy się regularyzacją modelu regresji liniowej - będziemy sprawdzać odwrotność współczynnika regularyzacji (im większy tym mniejsza regulacja), oraz rodzaj kary (l1 - regresja LASSO, l2 - regresja grzbietowa). Zastosujemy podadto parametry ColumnRemovera ustalone w poprzednich podpunktach.


In [17]:
# ustawiamy parametry ColumnRemovera na znalezione w 3.3.1
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9998, 1, 0)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

float_transformer = Pipeline([
    ('float', ColumnRemover(0.9996, 0.97, 0)),
    ('standard_scaler', StandardScaler())])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

X_train = col_transformer.fit_transform(train_df.drop('target', axis=1), train_df.target)
y_train = train_df.target
x_eval = col_transformer.transform(eval_set.drop('target', axis=1))
y_eval = eval_set.target

In [18]:
clf = LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear')

parameters = dict(C=np.logspace(-6, 2, 20), penalty=['l1', 'l2'])
reg_search = GridSearchCV(clf, scoring='roc_auc', cv=3, return_train_score=True, param_grid=parameters, n_jobs=-1).fit(X_train, y_train)

Sprawdzamy najlepsze parametry i wyniki dla najlepszego modelu

In [19]:
print(round(reg_search.best_score_, 4), reg_search.best_params_)

0.7879 {'C': 0.11288378916846883, 'penalty': 'l1'}


In [20]:
show_scores(reg_search.best_estimator_, X_train, y_train)
show_scores(reg_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16357 |          7203 |
|           213 |           727 |

accuracy:              0.6973
precision:             0.0917
recall:                0.7734
f1:                    0.1639
roc_auc_discrete:      0.7338
roc_auc_continuous:    0.8121
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9918 |          4481 |
|           169 |           432 |

accuracy:              0.69
precision:             0.0879
recall:                0.7188
f1:                    0.1567
roc_auc_discrete:      0.7038
roc_auc_continuous:    0.7893


In [21]:
cvres = reg_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(round(mean_score, 4), "   ", params)

0.5     {'C': 1e-06, 'penalty': 'l1'}
0.7605     {'C': 1e-06, 'penalty': 'l2'}
0.5     {'C': 2.6366508987303555e-06, 'penalty': 'l1'}
0.761     {'C': 2.6366508987303555e-06, 'penalty': 'l2'}
0.5     {'C': 6.951927961775606e-06, 'penalty': 'l1'}
0.7628     {'C': 6.951927961775606e-06, 'penalty': 'l2'}
0.5     {'C': 1.8329807108324375e-05, 'penalty': 'l1'}
0.766     {'C': 1.8329807108324375e-05, 'penalty': 'l2'}
0.5     {'C': 4.8329302385717524e-05, 'penalty': 'l1'}
0.7709     {'C': 4.8329302385717524e-05, 'penalty': 'l2'}
0.5     {'C': 0.00012742749857031334, 'penalty': 'l1'}
0.7755     {'C': 0.00012742749857031334, 'penalty': 'l2'}
0.5     {'C': 0.0003359818286283781, 'penalty': 'l1'}
0.7801     {'C': 0.0003359818286283781, 'penalty': 'l2'}
0.7333     {'C': 0.0008858667904100823, 'penalty': 'l1'}
0.7832     {'C': 0.0008858667904100823, 'penalty': 'l2'}
0.7714     {'C': 0.002335721469090121, 'penalty': 'l1'}
0.7851     {'C': 0.002335721469090121, 'penalty': 'l2'}
0.7792     {'C': 0.0061

Skorzystamy z wyznaczonych najlepszych parametrów przy selekcji cech metodą lasso.

### 3.3.3 Dobór zmiennych nie wymagajacy modelu

Korelację uwzględniliśmy w transformatorze ColumnRemover. Technika którą wykorzystamy poniżej to Univariate feature selection - SelectKBest.


### SelectKBest

In [22]:
# szukamy najlepszego parametru k w SelectKBest
clf = Pipeline([
    ('select', SelectKBest()),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))])

parameters = dict(select__k=np.arange(1, 200, 1))
k_best_search = GridSearchCV(clf, scoring='roc_auc', cv=3, return_train_score=True, param_grid=parameters, n_jobs=-1).fit(X_train, y_train)

In [23]:
# wyniki dla najlepszego znalezionego parametru
print(round(k_best_search.best_score_, 4), k_best_search.best_params_)

0.7928 {'select__k': 154}


In [24]:
show_scores(k_best_search.best_estimator_, X_train, y_train)
show_scores(k_best_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16290 |          7270 |
|           208 |           732 |

accuracy:              0.6948
precision:             0.0915
recall:                0.7787
f1:                    0.1637
roc_auc_discrete:      0.7351
roc_auc_continuous:    0.8102
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9863 |          4536 |
|           164 |           437 |

accuracy:              0.6867
precision:             0.0879
recall:                0.7271
f1:                    0.1568
roc_auc_discrete:      0.706
roc_auc_continuous:    0.7835


Wydaje się dość duże k, zobaczmy, jak prezentują się wszystkie wyniki

In [25]:
cvres = k_best_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(round(mean_score, 4), "   ", params)

0.6717     {'select__k': 1}
0.6741     {'select__k': 2}
0.6741     {'select__k': 3}
0.675     {'select__k': 4}
0.6772     {'select__k': 5}
0.6819     {'select__k': 6}
0.6822     {'select__k': 7}
0.6914     {'select__k': 8}
0.7     {'select__k': 9}
0.7114     {'select__k': 10}
0.7108     {'select__k': 11}
0.7657     {'select__k': 12}
0.7659     {'select__k': 13}
0.7651     {'select__k': 14}
0.7779     {'select__k': 15}
0.7781     {'select__k': 16}
0.778     {'select__k': 17}
0.7785     {'select__k': 18}
0.7786     {'select__k': 19}
0.7785     {'select__k': 20}
0.78     {'select__k': 21}
0.7802     {'select__k': 22}
0.7804     {'select__k': 23}
0.7805     {'select__k': 24}
0.7806     {'select__k': 25}
0.7806     {'select__k': 26}
0.7811     {'select__k': 27}
0.7815     {'select__k': 28}
0.7816     {'select__k': 29}
0.7817     {'select__k': 30}
0.7819     {'select__k': 31}
0.7843     {'select__k': 32}
0.7844     {'select__k': 33}
0.7843     {'select__k': 34}
0.7843     {'select__k': 35}
0

Wnioski: roc_auc maleje nieznacznie i wybierając dużo mniej zmiennych nie stracimy zbytnio na wyniku, a zmniejszymy drastycznie liczbę cech.

In [26]:
clf = Pipeline([
    ('select', SelectKBest(k=33)),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         15821 |          7739 |
|           231 |           709 |

accuracy:              0.6747
precision:             0.0839
recall:                0.7543
f1:                    0.151
roc_auc_discrete:      0.7129
roc_auc_continuous:    0.7852
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9603 |          4796 |
|           162 |           439 |

accuracy:              0.6695
precision:             0.0839
recall:                0.7304
f1:                    0.1504
roc_auc_discrete:      0.6987
roc_auc_continuous:    0.7713


Obserwujemy nieznaczne pogorszenie wyników (szczególnie nieznaczne jest na zbiorze testowym), jednak zredukowaliśmy liczbę cech do zaledwie 33.

### 3.3.4 Dobór zmiennych na podstawie modelu

Dotychczasowe starania miały na celu osiągnięcie jak najlepszych rezultatów dla modelu. Teraz przeprowadzimy dobór cech na podstawie modelu ze znalezionymi najlepszymi parametrami. Zastosujemy metody L1-based feature selection, Sequential Feature Selection i RFE

### SelectFromModel - L1-based feature selection

In [27]:
clf = LogisticRegression(**reg_search.best_params_, random_state=42, solver='liblinear', class_weight='balanced')
clf.fit(X_train, y_train)

show_scores(clf, X_train, y_train)
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16357 |          7203 |
|           213 |           727 |

accuracy:              0.6973
precision:             0.0917
recall:                0.7734
f1:                    0.1639
roc_auc_discrete:      0.7338
roc_auc_continuous:    0.8121
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9918 |          4481 |
|           169 |           432 |

accuracy:              0.69
precision:             0.0879
recall:                0.7188
f1:                    0.1567
roc_auc_discrete:      0.7038
roc_auc_continuous:    0.7893


In [28]:
X_train.shape[1]

406

In [29]:
sfl = SelectFromModel(clf, prefit=True)
X_train_t = sfl.transform(X_train)
X_eval_t = sfl.transform(x_eval)
X_train_t.shape[1]

114

In [31]:
clf.fit(X_train_t, y_train)

show_scores(clf, X_train_t, y_train)
show_scores(clf, X_eval_t, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16357 |          7203 |
|           213 |           727 |

accuracy:              0.6973
precision:             0.0917
recall:                0.7734
f1:                    0.1639
roc_auc_discrete:      0.7338
roc_auc_continuous:    0.8121
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9918 |          4481 |
|           169 |           432 |

accuracy:              0.69
precision:             0.0879
recall:                0.7188
f1:                    0.1567
roc_auc_discrete:      0.7038
roc_auc_continuous:    0.7893


Nie obserwujemy żadnego spadku w wynikach - zreudkowaliśmy natomiast liczbę cech do 114.

### Sequential Feature Selection

In [32]:
sfs = SequentialFeatureSelector(
    LogisticRegression(random_state=42, class_weight='balanced'),
    direction='forward',
    scoring='roc_auc',
    n_features_to_select=40,
    cv=3,
    n_jobs=-1)

pipe = Pipeline([
    ('selector', sfs),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))]).fit(X_train, y_train)

show_scores(pipe, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16089 |          7471 |
|           221 |           719 |

accuracy:              0.686
precision:             0.0878
recall:                0.7649
f1:                    0.1575
roc_auc_discrete:      0.7239
roc_auc_continuous:    0.809


In [33]:
show_scores(pipe, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9765 |          4634 |
|           168 |           433 |

accuracy:              0.6799
precision:             0.0855
recall:                0.7205
f1:                    0.1528
roc_auc_discrete:      0.6993
roc_auc_continuous:    0.7931


Po kilku próbach otrzymaliśmy niepogorszone wyniki dla zalwdwie 40 zmiennych na podstawie selekcji SequentialFeaturSelector. Nie szukamy w tym przypadku parametru n_features_to_select za pomocą grid search, ponieważ zajęło by nam to z pół roku, a zajęliśmy się empiryczym sprawdzeniem najlepszej wartości.


### Recursive Feature Elimination

Dla najlepszego modelu regresji logistycznej sprawdźmy które cechy niosą za sobą najwięcej informacji

In [36]:
#3 wyniki w 81 minut to trochę za długo :(
'''k_val = [k for k in range (2,100,2)]
score = []
for k in k_val:
    print(k)
    log_reg = LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear')
    rfe = RFE(log_reg,n_features_to_select = k)
    rfe.fit(X_train, y_train)
    X_train_rfe = rfe.transform(X_train)
    X_eval_rfe = rfe.transform(x_eval)
    log_reg.fit(X_train_rfe, y_train)

    y_pred_prob = log_reg.predict_proba(pd.DataFrame(X_eval_rfe))

    score_roc = roc_auc_score(y_eval, y_pred_prob[:, 1])
    score.append(score_roc)
    print(score_roc, k)'''

"k_val = [k for k in range (2,100,2)]\nscore = []\nfor k in k_val:\n    print(k)\n    log_reg = LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear')\n    rfe = RFE(log_reg,n_features_to_select = k)\n    rfe.fit(X_train, y_train)\n    X_train_rfe = rfe.transform(X_train)\n    X_eval_rfe = rfe.transform(x_eval)\n    log_reg.fit(X_train_rfe, y_train)\n\n    y_pred_prob = log_reg.predict_proba(pd.DataFrame(X_eval_rfe))\n\n    score_roc = roc_auc_score(y_eval, y_pred_prob[:, 1])\n    score.append(score_roc)\n    print(score_roc, k)"

In [37]:
'''plt.stem(score)
plt.xticks(k_val, score)
plt.xlim([-1, 50])
plt.xlabel("K Values")
plt.ylabel("roc_auc Score")'''

'plt.stem(score)\nplt.xticks(k_val, score)\nplt.xlim([-1, 50])\nplt.xlabel("K Values")\nplt.ylabel("roc_auc Score")'

![fre_logreg.png](./Img/rfe_logreg.png)

Z powyższego wykresu możemy wyczytać, że najlepiej będzie użyć rfe dla 72 cech.
W związku z tym spróbujmy użyć PCA na wybranych kolumnach.

(uwaga: ten i pozostałe wykresy zawarte w tym notatniku zostały przeklejone z innych naszych notatników ze względu na długi czas liczenia kodu rfe)

### Principal Component Analysis

In [38]:
rfe = RFE(LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear'),n_features_to_select = 72).fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
X_eval_rfe = rfe.transform(x_eval)

k_values = [i for i in range (2,72)]
scores = []

for k in k_values:

    log_reg = LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear')
    pca = PCA(n_components=k)

    X_red_train = pca.fit_transform(X_train_rfe)
    X_red_eval = pca.fit_transform(X_eval_rfe)

    log_reg.fit(X_red_train, y_train)
    y_pred_prob = log_reg.predict_proba(pd.DataFrame(X_red_eval))

    score_roc = roc_auc_score(y_eval, y_pred_prob[:, 1])
    scores.append(score_roc)


In [None]:
plt.stem(scores)
plt.xticks(k_values, scores)
plt.xlim([-1, 72])
plt.xlabel("K Values")
plt.ylabel("roc_auc Score")
plt.title("PCA for linear regression")

![linear_reg.png](./Img/linear_reg.png)

Widzimy więc, że dla tego modelu najlepiej będzie wybrać wartość n_components = 3

### 3.4 Podsumowanie

Model regresji logistycznej działa przyzwoicie jak na warunki otrzymanego zbioru danych. Istotne wydaje się użycie parametry class_weight = 'balanced'. Przeprowadziliśmy przeszukanie siatki parametrów autorskiego ColumnRemover'a, a następnie przeszukiwanie siatki parametru C i kar w celu dobrania najlepszych parametrów regularyzacji.

Następnie przeszliśmy do selekcji cech różnymi metodami. W wielu przypadkach zaobserwowaliśmy znaczny spadek liczby kolumn przy jednoczesnym braku znaczącego pogorszenia rezultatów (nie zaobserwowaliśmy znaczącego wzrostu rezultatów przy jakiejkolwiek redukcji cech). Metody jakie wybraliśmy i ostateczna liczba zmiennych prezentują się następujaco:

- SelectKBest - 33
- L1-based feature selection - 114
- Sequential feature selection - 40
- Recursive feature elimination - 72
- RFE + PCA - 3

# 4. Drzewa decyzyjne

## 4.1 Preprocessing

In [None]:
# zarówno dla zmiennych dyskretnych i ciągłych stosujemy nasz transformator ColumnRemover z różnymi parametrami - w kolejnych krokach będziemy szukać najlepszej ich kombinacji

# dokonujemy kodowania one hot encoding zmiennych dyskretnych - traktujemy je jako kategoryczne
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9995, 0.99, 1)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

# jesteśmy przy drzewach decyzyjnych, więc nie musimy skalować cech
float_transformer = Pipeline([
    ('float', ColumnRemover(0.9999, 0.99, 10))])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

## 4.2 Trening pierwszego modelu

In [None]:
X_train = train_df.drop('target', axis=1)
y_train = train_df.target
x_eval = eval_set.drop('target', axis=1)
y_eval = eval_set.target

In [None]:
clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', DecisionTreeClassifier(random_state=42))])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|             3 |           937 |

accuracy:              0.9999
precision:             1.0
recall:                0.9968
f1:                    0.9984
roc_auc_discrete:      0.9984
roc_auc_continuous:    1.0


In [None]:
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9654 |           396 |
|           384 |            66 |

accuracy:              0.9257
precision:             0.1429
recall:                0.1467
f1:                    0.1447
roc_auc_discrete:      0.5536
roc_auc_continuous:    0.5536


Model jest zdecydowanie przeuczony. Spróbujmy z parametrem class_weight='balanced'.

In [None]:
clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', DecisionTreeClassifier(random_state=42, class_weight='balanced'))])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23557 |             3 |
|             0 |           940 |

accuracy:              0.9999
precision:             0.9968
recall:                1.0
f1:                    0.9984
roc_auc_discrete:      0.9999
roc_auc_continuous:    1.0


In [None]:
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9705 |           345 |
|           399 |            51 |

accuracy:              0.9291
precision:             0.1288
recall:                0.1133
f1:                    0.1206
roc_auc_discrete:      0.5395
roc_auc_continuous:    0.5395


Za dużo to nie dało, musimy skupić się na regularyzacji modelu drzewa decyzyjnego.

## 4.3 Strojenie hiperparametrów i wybór zmiennych

### 4.3.1 Dobór parametrów dla ColumnRemover'a


In [None]:
'''clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', DecisionTreeClassifier(random_state=42))])

parameters = dict(preprocessing__int_pipe__int__threshold_constant = np.arange(0.9995, 1, 0.0001),
                  preprocessing__int_pipe__int__threshold_corr = np.arange(0.96, 1, 0.01),
                  preprocessing__int_pipe__int__n_info_vals = np.arange(0, 5, 1),
                  preprocessing__float_pipe__float__threshold_constant = np.arange(0.9995, 1, 0.0001),
                  preprocessing__float_pipe__float__threshold_corr = np.arange(0.96, 1, 0.01),
                  preprocessing__float_pipe__float__n_info_vals = np.arange(0, 16, 3))

# siatka parametrów
col_remove_search = RandomizedSearchCV(clf, scoring='roc_auc', param_distributions=parameters, cv=3, n_iter=1000, n_jobs=-1, random_state=42).fit(X_train, y_train)'''

In [None]:
#print(round(col_remove_search.best_score_, 4), col_remove_search.best_params_)

0.5594 {'preprocessing__int_pipe__int__threshold_corr': 0.99, 'preprocessing__int_pipe__int__threshold_constant': 0.9998, 'preprocessing__int_pipe__int__n_info_vals': 2, 'preprocessing__float_pipe__float__threshold_corr': 0.96, 'preprocessing__float_pipe__float__threshold_constant': 0.9998, 'preprocessing__float_pipe__float__n_info_vals': 15}


In [None]:
#show_scores(col_remove_search.best_estimator_, X_train, y_train)
#show_scores(col_remove_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|             3 |           937 |

accuracy:              0.9999
precision:             1.0
recall:                0.9968
f1:                    0.9984
roc_auc_discrete:      0.9984
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9653 |           397 |
|           379 |            71 |

accuracy:              0.9261
precision:             0.1517
recall:                0.1578
f1:                    0.1547
roc_auc_discrete:      0.5591
roc_auc_continuous:    0.5591


### 4.3.2 Regularyzacja modelu

Zajmiemy się regularyzacją modelu drzewa decyzyjnego - będziemy sprawdzać parametry max_depth, min_samples_split, min_samples_leaf, max_features. Na początek jednak skupimy się tylko na max_depth - być może z jego powodu drzewo jest tak mocno przeuczone. Zastosujemy również najlepsze parametry ColumnRemovera znalezione w poprzednim podpunkcie


In [None]:
'''int_transformer = Pipeline([
    ('int', ColumnRemover(0.9998, 0.99, 2)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

float_transformer = Pipeline([
    ('float', ColumnRemover(0.9998, 0.96, 15))])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

# od tej pory dane testowe i treningowe będą już wstępnie przeprocesowane
X_train = col_transformer.fit_transform(train_df.drop('target', axis=1), train_df.target)
y_train = train_df.target
x_eval = col_transformer.transform(eval_set.drop('target', axis=1))
y_eval = eval_set.target'''

In [None]:
clf = DecisionTreeClassifier(random_state=42)

parameters = dict(max_depth=np.arange(1, 100))
depth_search = GridSearchCV(clf, cv=3, scoring='roc_auc', return_train_score=True, param_grid=parameters, n_jobs=-1).fit(X_train, y_train)

In [None]:
print(round(depth_search.best_score_, 4), depth_search.best_params_)

0.8008 {'max_depth': 4}


In [None]:
cvres = depth_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(round(mean_score, 4), "   ", params)

0.6825     {'max_depth': 1}
0.7438     {'max_depth': 2}
0.7854     {'max_depth': 3}
0.8008     {'max_depth': 4}
0.8004     {'max_depth': 5}
0.7919     {'max_depth': 6}
0.784     {'max_depth': 7}
0.7719     {'max_depth': 8}
0.7647     {'max_depth': 9}
0.7471     {'max_depth': 10}
0.7353     {'max_depth': 11}
0.7092     {'max_depth': 12}
0.7082     {'max_depth': 13}
0.6954     {'max_depth': 14}
0.669     {'max_depth': 15}
0.6699     {'max_depth': 16}
0.6698     {'max_depth': 17}
0.6551     {'max_depth': 18}
0.6485     {'max_depth': 19}
0.6425     {'max_depth': 20}
0.6273     {'max_depth': 21}
0.6207     {'max_depth': 22}
0.6081     {'max_depth': 23}
0.5975     {'max_depth': 24}
0.5985     {'max_depth': 25}
0.5914     {'max_depth': 26}
0.581     {'max_depth': 27}
0.5752     {'max_depth': 28}
0.5757     {'max_depth': 29}
0.5654     {'max_depth': 30}
0.5616     {'max_depth': 31}
0.5602     {'max_depth': 32}
0.5574     {'max_depth': 33}
0.5581     {'max_depth': 34}
0.5567     {'max_depth': 3

In [None]:
show_scores(depth_search.best_estimator_, X_train, y_train)
show_scores(depth_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23557 |             3 |
|           925 |            15 |

accuracy:              0.9621
precision:             0.8333
recall:                0.016
f1:                    0.0313
roc_auc_discrete:      0.5079
roc_auc_continuous:    0.8135
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10040 |            10 |
|           450 |             0 |

accuracy:              0.9562
precision:             0.0
recall:                0.0
f1:                    0.0
roc_auc_discrete:      0.4995
roc_auc_continuous:    0.7994


Dość podejrzane te wyniki. Spróbujmy znaleźć parametry z metryką f1

In [None]:
clf = DecisionTreeClassifier(random_state=42)

parameters = dict(max_depth=np.arange(1, 100))
depth_search = GridSearchCV(clf, cv=3, scoring='f1', return_train_score=True, param_grid=parameters, n_jobs=-1).fit(X_train, y_train)

In [None]:
print(round(depth_search.best_score_, 4), depth_search.best_params_)

0.1421 {'max_depth': 29}


In [None]:
cvres = depth_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(round(mean_score, 4), "   ", params)

0.0     {'max_depth': 1}
0.0     {'max_depth': 2}
0.0042     {'max_depth': 3}
0.0082     {'max_depth': 4}
0.0162     {'max_depth': 5}
0.0219     {'max_depth': 6}
0.0288     {'max_depth': 7}
0.0409     {'max_depth': 8}
0.047     {'max_depth': 9}
0.049     {'max_depth': 10}
0.0675     {'max_depth': 11}
0.076     {'max_depth': 12}
0.0827     {'max_depth': 13}
0.0854     {'max_depth': 14}
0.089     {'max_depth': 15}
0.0801     {'max_depth': 16}
0.086     {'max_depth': 17}
0.0932     {'max_depth': 18}
0.106     {'max_depth': 19}
0.1114     {'max_depth': 20}
0.1125     {'max_depth': 21}
0.122     {'max_depth': 22}
0.1213     {'max_depth': 23}
0.112     {'max_depth': 24}
0.1327     {'max_depth': 25}
0.1356     {'max_depth': 26}
0.129     {'max_depth': 27}
0.1317     {'max_depth': 28}
0.1421     {'max_depth': 29}
0.1343     {'max_depth': 30}
0.1357     {'max_depth': 31}
0.1409     {'max_depth': 32}
0.1354     {'max_depth': 33}
0.1382     {'max_depth': 34}
0.1385     {'max_depth': 35}
0.1383   

In [None]:
show_scores(depth_search.best_estimator_, X_train, y_train)
show_scores(depth_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23546 |            14 |
|            29 |           911 |

accuracy:              0.9982
precision:             0.9849
recall:                0.9691
f1:                    0.9769
roc_auc_discrete:      0.9843
roc_auc_continuous:    0.9999
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9641 |           409 |
|           378 |            72 |

accuracy:              0.925
precision:             0.1497
recall:                0.16
f1:                    0.1547
roc_auc_discrete:      0.5597
roc_auc_continuous:    0.5665


Wyniki wyglądają sensowniej niż dla roc_auc, co nie zmienia faktu, że drzewo nadal jest przeuczone i nie dostaliśmy lepszych wyników od braku regulacji. Spróbujemy określić pewną podprzestrzeń parametrów max_depth, min_samples_split, min_samples_leaf, max_features i wykorzystamy klasę RandomizedSearchCV do znalezienia optymalnej kombinacji. Wybierzemy metrykę f1, bo zauważyliśmy już, że dzięki roc_auc algorytm nie wyłapuje żadnych 1.

In [None]:
clf = DecisionTreeClassifier(random_state=42)

parameters = dict(max_depth=np.arange(25, 65, 3), min_samples_split=np.arange(2,8),
                  min_samples_leaf=np.arange(1, 20, 2), max_features=np.arange(20, 150, 5))

# siatka parametrów ma liczność 21840, przeszukujemy więc ok. 45,7% wszystkich możliwości
rand_search = RandomizedSearchCV(clf, scoring='f1', cv=3, return_train_score=True, param_distributions=parameters, n_iter=5000, n_jobs=-1, random_state=42).fit(X_train, y_train)

Sprawdzamy najlepszą kombinację

In [None]:
print(round(rand_search.best_score_, 4), rand_search.best_params_)

0.1182 {'min_samples_split': 3, 'min_samples_leaf': 3, 'max_features': 125, 'max_depth': 25}


In [None]:
show_scores(rand_search.best_estimator_, X_train, y_train)
show_scores(rand_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23394 |           166 |
|           439 |           501 |

accuracy:              0.9753
precision:             0.7511
recall:                0.533
f1:                    0.6235
roc_auc_discrete:      0.763
roc_auc_continuous:    0.9878
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9778 |           272 |
|           400 |            50 |

accuracy:              0.936
precision:             0.1553
recall:                0.1111
f1:                    0.1295
roc_auc_discrete:      0.542
roc_auc_continuous:    0.5879


Jak na kilka godzin szukania wyniki nie są powalające. Ciężko powiedzieć, czy regularyzacja coś daje w tym przypadku. Ale może chociaż inżynieria cech nam wyjdzie.

### 4.3.3 Dobór zmiennych nie wymagajacy modelu

Korelację uwzględniliśmy w transformatorze ColumnRemover. Technika którą wykorzystamy poniżej to Univariate feature selection - SelectKBest.

### SelectKBest

In [None]:
# szukamy najlepszego parametru k w SelectKBest
clf = Pipeline([
    ('select', SelectKBest()),
    ('model', DecisionTreeClassifier(random_state=42))])

parameters = dict(select__k=np.arange(1, 200, 1))
k_best_search = GridSearchCV(clf, scoring='roc_auc', cv=3, return_train_score=True, param_grid=parameters, n_jobs=-1).fit(X_train, y_train)

In [None]:
# wyniki dla najlepszego znalezionego parametru
print(round(k_best_search.best_score_, 4), k_best_search.best_params_)

0.715 {'select__k': 10}


In [None]:
show_scores(k_best_search.best_estimator_, X_train, y_train)
show_scores(k_best_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|           940 |             0 |

accuracy:              0.9616
precision:             0.0
recall:                0.0
f1:                    0.0
roc_auc_discrete:      0.5
roc_auc_continuous:    0.7189
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10050 |             0 |
|           450 |             0 |

accuracy:              0.9571
precision:             0.0
recall:                0.0
f1:                    0.0
roc_auc_discrete:      0.5
roc_auc_continuous:    0.7151


Wyniki dla metryki roc_auc ponownie nic nie dają, sprawdzimy f1

In [None]:
clf = Pipeline([
    ('select', SelectKBest()),
    ('model', DecisionTreeClassifier(random_state=42))])

parameters = dict(select__k=np.arange(1, 200, 1))
k_best_search = GridSearchCV(clf, scoring='f1', cv=5, return_train_score=True, param_grid=parameters, n_jobs=-1).fit(X_train, y_train)

print(round(k_best_search.best_score_, 4), k_best_search.best_params_)

0.1531 {'select__k': 106}


In [None]:
show_scores(k_best_search.best_estimator_, X_train, y_train)
show_scores(k_best_search.best_estimator_, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|            15 |           925 |

accuracy:              0.9994
precision:             1.0
recall:                0.984
f1:                    0.992
roc_auc_discrete:      0.992
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9624 |           426 |
|           373 |            77 |

accuracy:              0.9239
precision:             0.1531
recall:                0.1711
f1:                    0.1616
roc_auc_discrete:      0.5644
roc_auc_continuous:    0.5651


Zobaczmy, jak prezentują się wszystkie wyniki

In [None]:
cvres = k_best_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(round(mean_score, 4), "   ", params)

0.0     {'select__k': 1}
0.0     {'select__k': 2}
0.0     {'select__k': 3}
0.0     {'select__k': 4}
0.0     {'select__k': 5}
0.0     {'select__k': 6}
0.0     {'select__k': 7}
0.0     {'select__k': 8}
0.0     {'select__k': 9}
0.0     {'select__k': 10}
0.0     {'select__k': 11}
0.0343     {'select__k': 12}
0.0472     {'select__k': 13}
0.0419     {'select__k': 14}
0.0346     {'select__k': 15}
0.0388     {'select__k': 16}
0.0393     {'select__k': 17}
0.0421     {'select__k': 18}
0.04     {'select__k': 19}
0.0428     {'select__k': 20}
0.0347     {'select__k': 21}
0.0361     {'select__k': 22}
0.0417     {'select__k': 23}
0.0403     {'select__k': 24}
0.0399     {'select__k': 25}
0.0424     {'select__k': 26}
0.0336     {'select__k': 27}
0.0451     {'select__k': 28}
0.046     {'select__k': 29}
0.042     {'select__k': 30}
0.0487     {'select__k': 31}
0.0451     {'select__k': 32}
0.0465     {'select__k': 33}
0.0503     {'select__k': 34}
0.0463     {'select__k': 35}
0.0528     {'select__k': 36}
0.

In [None]:
clf = Pipeline([
    ('select', SelectKBest(k=106)),
    ('model', DecisionTreeClassifier(random_state=42))])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|            15 |           925 |

accuracy:              0.9994
precision:             1.0
recall:                0.984
f1:                    0.992
roc_auc_discrete:      0.992
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9624 |           426 |
|           373 |            77 |

accuracy:              0.9239
precision:             0.1531
recall:                0.1711
f1:                    0.1616
roc_auc_discrete:      0.5644
roc_auc_continuous:    0.5651


### 4.3.4 Dobór zmiennych na podstawie modelu

Przeprowadzimy dobór cech na podstawie modelu ze znalezionymi najlepszymi parametrami. Zastosujemy metody SelectFromModel - feature importance, Sequential Feature Selection i RFE

### SelectFromModel - tree-based feature selection

In [None]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

show_scores(clf, X_train, y_train)
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|             3 |           937 |

accuracy:              0.9999
precision:             1.0
recall:                0.9968
f1:                    0.9984
roc_auc_discrete:      0.9984
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9653 |           397 |
|           379 |            71 |

accuracy:              0.9261
precision:             0.1517
recall:                0.1578
f1:                    0.1547
roc_auc_discrete:      0.5591
roc_auc_continuous:    0.5591


In [None]:
X_train.shape[1]

197

In [None]:
sfl = SelectFromModel(clf, prefit=True)
X_train_t = sfl.transform(X_train)
X_eval_t = sfl.transform(x_eval)
X_train_t.shape[1]

22

In [None]:
clf.fit(X_train_t, y_train)

show_scores(clf, X_train_t, y_train)
show_scores(clf, X_eval_t, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|             3 |           937 |

accuracy:              0.9999
precision:             1.0
recall:                0.9968
f1:                    0.9984
roc_auc_discrete:      0.9984
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          9603 |           447 |
|           381 |            69 |

accuracy:              0.9211
precision:             0.1337
recall:                0.1533
f1:                    0.1429
roc_auc_discrete:      0.5544
roc_auc_continuous:    0.5544


Obserwujemy niewielkie pogorszenie wyników modelu, ale zredukowaliśmy liczbę cech do zaledwie 22!

### Sequential Feature Selection

In [None]:
sfs = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=42),
    direction='forward',
    scoring='f1',
    n_features_to_select=50,
    cv=3,
    n_jobs=-1)

pipe = Pipeline([
    ('selector', sfs),
    ('model', DecisionTreeClassifier(random_state=42))]).fit(X_train, y_train)

show_scores(pipe, X_train, y_train)

In [None]:
show_scores(pipe, X_eval_t, y_eval)

Zredukowaliśmy liczbę cech do 50 nie pogarszając wyniku.

### Recursive Feature Elimination

Dla najlepszego modelu drzewa decyzyjnego sprawdźmy które cechy niosą za sobą najwięcej informacji

In [None]:
k_val = [k for k in range (2,100,2)]
score = []
for k in k_val:

    clf = DecisionTreeClassifier(**rand_search.best_params_)
    rfe = RFE(clf,n_features_to_select = k)
    rfe.fit(X_train, y_train)
    X_train_rfe = rfe.transform(X_train)
    X_eval_rfe = rfe.transform(x_eval)
    clf.fit(X_train_rfe, y_train)

    y_pred_prob = clf.predict_proba(pd.DataFrame(X_eval_rfe))

    score_roc = roc_auc_score(y_eval, y_pred_prob[:, 1])
    score.append(score_roc)
    print(score_roc, k)

In [None]:
plt.stem(score)
plt.xticks(k_val, score)
plt.xlim([-1, 50])
plt.xlabel("K Values")
plt.ylabel("f1 Score")

![DCT.png](./Img/dct_rfe.png)

Z powyższego wykresu możemy wyczytać, że najlepiej będzie użyć rfe dla 26 cech.
W związku z tym spróbujmy użyć PCA na wybranych kolumnach.

### Principal component analysis

In [None]:
rfe = RFE(DecisionTreeClassifier(**rand_search.best_params_, random_state=42),n_features_to_select = 26).fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
X_eval_rfe = rfe.transform(x_eval)

k_values = [i for i in range (2,26)]
scores = []

for k in k_values:

    clf = DecisionTreeClassifier(**rand_search.best_params_)
    pca = PCA(n_components=k)

    X_red_train = pca.fit_transform(X_train_rfe)
    X_red_eval = pca.fit_transform(X_eval_rfe)

    clf.fit(X_red_train, y_train)
    y_pred_prob = clf.predict_proba(pd.DataFrame(X_red_eval))

    score_roc = roc_auc_score(y_eval, y_pred_prob[:, 1])
    scores.append(score_roc)


In [None]:
plt.stem(scores)
plt.xticks(k_values, scores)
plt.xlim([-1, 72])
plt.xlabel("K Values")
plt.ylabel("roc_auc Score")
plt.title("PCA for decision Tree")

![DCT.png](./Img/DCT.png)

Widzimy więc, że dla tego modelu najlepiej będzie wybrać wartość n_components = 64

### 4.4 Podsumowanie

Model drzewa decyzyjnego działa bardzo słabo w przypadku naszego zbioru danych i wykazujemy pewną bezradność wobec tego algorytmu nawet po dość szczegółowym i wielokrotnym przeszukiwaniu siatki parametrów. Niemniej jednak udało nam się przeprowadzić selekcję cech niepogarszającą rezultatów modelu. Metody jakie wybraliśmy i ostateczna liczba zmiennych prezentują się następujaco:

- SelectKBest - 106
- Tree-based feature selection (feature_importance) - 22
- Sequential feature selection - 70
- Recursive feature elimination - 26
- RFE + PCA - 64

Zwycięzcą w tym modelu okazało się podejście oparte na modelu i parametrze feature_importance. Trzeba jednak pamiętać, że jest ono skuteczne tylko w przypadku nieprzetrenowanego modelu, czego w tym przypadku nie uzyskaliśmy.


# 5. SVM

## 5.1 Preprocessing

In [None]:
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9995, 0.99, 1)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

float_transformer = Pipeline([
    ('float', ColumnRemover(0.9999, 0.99, 10)),
    ('standard_scaler', StandardScaler())])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

In [None]:
X_train = train_df.drop('target', axis=1)
y_train = train_df.target
x_eval = eval_set.drop('target', axis=1)
y_eval = eval_set.target

## 5.2 Trening pierwszego modelu

In [None]:
clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', SVC(random_state=42, probability=True))])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|           939 |             1 |

accuracy:              0.9617
precision:             1.0
recall:                0.0011
f1:                    0.0021
roc_auc_discrete:      0.5005
roc_auc_continuous:    0.5118


In [None]:
show_scores(clf, X_train, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10050 |             0 |
|           450 |             0 |

accuracy:              0.9571
precision:             0.0
recall:                0.0
f1:                    0.0
roc_auc_discrete:      0.5
roc_auc_continuous:    0.494


Spróbujmy wyciągnąć lepsze wyniki

### 5.2.1 Trening z wybranymi hiperparametrami

In [None]:
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9995, 0.99, 1)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

float_transformer = Pipeline([
    ('float', ColumnRemover(0.9999, 0.99, 10))])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

X_train = col_transformer.fit_transform(train_df.drop('target', axis=1), train_df.target)
y_train = train_df.target
x_eval = col_transformer.transform(eval_set.drop('target', axis=1))
y_eval = eval_set.target

In [None]:
svm2 = SVC(kernel='linear', class_weight='balanced', gamma='auto',probability=True, random_state=42).fit(X_train,y_train)

In [None]:
show_scores(svm2, X_train, y_train)
show_scores(svm2, x_eval, y_eval)

![svm.png](./img/svm.png)

Pozostaje kwestia strojenia hiperparametrów i wyboru cech dla tego modelu. Niestety, z naszego doświadczenia wiemy, że czas trwania przeliczenia tego dla SVM jest liczony w dniach (co najmniej kilku). Z tego powodu ograniczyliśmy się jedynie do empirycznego wyboru powyższych hiperparametrów.

# 6. Random forest

## 6.1 Preprocessing

In [None]:
int_transformer = Pipeline([
    ('int', ColumnRemover(0.9995, 0.99, 1)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))])

float_transformer = Pipeline([
    ('float', ColumnRemover(0.9999, 0.99, 10))])

col_transformer = ColumnTransformer([
    ('int_pipe', int_transformer, make_column_selector(dtype_include=np.int64)),
    ('float_pipe', float_transformer, make_column_selector(dtype_include=np.float64))
])

## 6.2 Trening modelu

In [None]:
X_train = train_df.drop('target', axis=1)
y_train = train_df.target
x_eval = eval_set.drop('target', axis=1)
y_eval = eval_set.target

In [None]:
rf = RandomForestClassifier(random_state=42, n_estimators=1000)

clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', rf)])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)
show_scores(clf, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|             3 |           937 |

accuracy:              0.9999
precision:             1.0
recall:                0.9968
f1:                    0.9984
roc_auc_discrete:      0.9984
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10028 |            22 |
|           443 |             7 |

accuracy:              0.9557
precision:             0.2414
recall:                0.0156
f1:                    0.0292
roc_auc_discrete:      0.5067
roc_auc_continuous:    0.7948


Model jest bardzo przeuczony, spróbujmy ustawić parametry dla znalezione dla najlepszego pojedynczego drzewa decyzyjnego.

In [None]:
rf = RandomForestClassifier(random_state=42, n_estimators=1000, **rand_search.best_params_)

clf = Pipeline([
    ('preprocessing', col_transformer),
    ('model', rf)])

clf.fit(X_train, y_train)
show_scores(clf, X_train, y_train)
show_scores(clf, x_eval, y_eval)


|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|           873 |            67 |

accuracy:              0.9644
precision:             1.0
recall:                0.0713
f1:                    0.1331
roc_auc_discrete:      0.5356
roc_auc_continuous:    0.9968
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10050 |             0 |
|           450 |             0 |

accuracy:              0.9571
precision:             0.0
recall:                0.0
f1:                    0.0
roc_auc_discrete:      0.5
roc_auc_continuous:    0.8218


# 7. Voting

Zbierzemy teraz najlepsze otrzymane algorytmy regresji logistycznej, drzewa decyzyjnego i svm łącząc je w klasyfikatorze głosującym. Zastosujemy głosowanie miękkie, które lepiej się sprawdzi w naszym przypadku.

In [None]:
log_reg = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('int_pipe', Pipeline([
            ('int', ColumnRemover(0.9998, 1, 0)),
            ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))]), make_column_selector(dtype_include=np.int64)),
        ('float_pipe', Pipeline([
            ('float', ColumnRemover(0.9996, 0.97, 0)),
            ('standardization', StandardScaler())]), make_column_selector(dtype_include=np.float64))
    ])),
    ('selector', SelectKBest(k=33)),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))])

dct = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('int_pipe', Pipeline([
            ('int', ColumnRemover(0.9998, 0.99, 2)),
            ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))]), make_column_selector(dtype_include=np.int64)),
        ('float_pipe', Pipeline([
            ('float', ColumnRemover(0.9998, 0.96, 15)),
            ]), make_column_selector(dtype_include=np.float64))
    ])),
    ('selector', SelectFromModel(DecisionTreeClassifier(random_state=42))),
    ('model', DecisionTreeClassifier(random_state=42))])

svc = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('int_pipe', Pipeline([
            ('int', ColumnRemover(0.9995, 0.99, 1)),
            ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))]), make_column_selector(dtype_include=np.int64)),
        ('float_pipe', Pipeline([
            ('float', ColumnRemover(0.9999, 0.99, 10)),
            ('standard_scaler', StandardScaler())]), make_column_selector(dtype_include=np.float64))
    ])),
    ('model', SVC(random_state=42, probability=True))])

In [None]:
estimators=[('DecisionTree', dct), ('SVM', svc), ('LR', log_reg)]
vc = VotingClassifier(estimators=estimators, voting='soft', weights=[0.2, 0.1, 0.7]).fit(X_train, y_train)

In [None]:
show_scores(vc, X_train, y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         21333 |          2227 |
|           180 |           760 |

accuracy:              0.9018
precision:             0.2544
recall:                0.8085
f1:                    0.3871
roc_auc_discrete:      0.857
roc_auc_continuous:    0.9434


In [None]:
show_scores(vc, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          8944 |          1106 |
|           236 |           214 |

accuracy:              0.8722
precision:             0.1621
recall:                0.4756
f1:                    0.2418
roc_auc_discrete:      0.6828
roc_auc_continuous:    0.7792


Otrzymaliśmy niespotykanie wysoki dotąd rezultat f1 oraz całkiem przyzwoite inne metryki.

# 8. Adaboost

Spróbujmy zastosować Adaboost na pewnych modelach.

### Regresja Logistyczna

In [None]:
log_reg = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('int_pipe', Pipeline([
            ('int', ColumnRemover(0.9998, 1, 0)),
            ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))]), make_column_selector(dtype_include=np.int64)),
        ('float_pipe', Pipeline([
            ('float', ColumnRemover(0.9996, 0.97, 0)),
            ('standardization', StandardScaler())]), make_column_selector(dtype_include=np.float64))
    ])),
    ('selector', SelectKBest(k=33)),
    ('model', LogisticRegression(random_state=42, class_weight='balanced'))])

In [None]:
ADB_logreg = AdaBoostClassifier(base_estimator=log_reg, n_estimators=5000,
    learning_rate=0.5, random_state=42)

In [None]:
ADB_logreg.fit(X_train, y_train)
show_scores(ADB_logreg, X_train,y_train)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         16137 |          7423 |
|           222 |           718 |

accuracy:              0.688
precision:             0.0882
recall:                0.7638
f1:                    0.1581
roc_auc_discrete:      0.7244
roc_auc_continuous:    0.7998


In [None]:
show_scores(ADB_logreg, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|          7167 |          2883 |
|           142 |           308 |

accuracy:              0.7119
precision:             0.0965
recall:                0.6844
f1:                    0.1692
roc_auc_discrete:      0.6988
roc_auc_continuous:    0.7702


### Drzewa decyzyjne

In [None]:
dct = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('int_pipe', Pipeline([
            ('int', ColumnRemover(0.9998, 0.99, 2)),
            ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype='int64'))]), make_column_selector(dtype_include=np.int64)),
        ('float_pipe', Pipeline([
            ('float', ColumnRemover(0.9998, 0.96, 15)),
        ]), make_column_selector(dtype_include=np.float64))
    ])),
    ('selector', SelectFromModel(DecisionTreeClassifier(random_state=42))),
    ('model', DecisionTreeClassifier(random_state=42))])

In [None]:
ADB_dct = AdaBoostClassifier(base_estimator=dct, n_estimators=1000,learning_rate=0.5,
                         algorithm='SAMME.R', random_state=42)

In [None]:
ADB_dct.fit(X_train, y_train)
show_scores(ADB_dct, X_train,y_train)
show_scores(ADB_dct, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23560 |             0 |
|             3 |           937 |

accuracy:              0.9999
precision:             1.0
recall:                0.9968
f1:                    0.9984
roc_auc_discrete:      0.9984
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10032 |            18 |
|           443 |             7 |

accuracy:              0.9561
precision:             0.28
recall:                0.0156
f1:                    0.0295
roc_auc_discrete:      0.5069
roc_auc_continuous:    0.6767


In [None]:
ADB_dct = AdaBoostClassifier(base_estimator=dct, n_estimators=2000,learning_rate=0.5,
                         algorithm='SAMME.R', random_state=42)

In [None]:
ADB_dct.fit(X_train, y_train)
show_scores(ADB_dct, X_train,y_train)
show_scores(ADB_dct, x_eval, y_eval)

|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         23559 |             1 |
|             2 |           938 |

accuracy:              0.9999
precision:             0.9989
recall:                0.9979
f1:                    0.9984
roc_auc_discrete:      0.9989
roc_auc_continuous:    1.0
|   Predicted 0 |   Predicted 1 |
|---------------+---------------|
|         10031 |            19 |
|           443 |             7 |

accuracy:              0.956
precision:             0.2692
recall:                0.0156
f1:                    0.0294
roc_auc_discrete:      0.5068
roc_auc_continuous:    0.6844


Jak widać, adaboost może pomóc w drzewach decyzyjnych, ponieważ znacznie poprawia roc_auc_continuous. Jednakże dla regresji logistycznej jest on niepotrzebny, wręcz niewskazany

# 9. Podsumowanie projektu

W powyższym projekcie rozpoczęliśmy z 300 zmiennymi i jedną zmienną przewidywaną. 

Na początku zajeliśmy się preprocessingiem. Postanowiliśmy podzielić go na zmienne całkowitoliczbowe i zmiennoprzecinkowe. Wyrzuciliśmy zmienne noszące takie same dane, zmienne skorelowane, stałe i nie noszące informacji. 


Następnie zajęliśmy się modelami. Postanowiliśmy oceniać wyniki na podstawie polem pod krzywą roc_auc. Wyniki wyglądają następująco:

|Model   |No_hyper   |Col_rem   |with_hyper   |PCA   |K_best   |SFS   |num_rfe   |num_PCA   |
|---|---|---|---|---|---|---|---|---|
|Log_reg   |0.806   |0.777   |0.815   |0.738   |   33 | 40   |72   |3|
|dec_tree   |0.555   |0.780   |0.799   |0.682   |106 |70 |26 | 64|

|Model   |No_hyper   |with_hyper   |
|---|---|---|
|SVC   |0.670   |0.768   |
|KNN   |0.542 |0.570   |
|rand_forest   |0.794   | 0.821   |
|AdaBoost logreg  |0.774   |0.799   |
|AdaBoost dct  |0.676   |0.684   |
|Voting   | ***  | 0.779  | 
