# Lab 8. Choosing the Optimal Classifier

In this lab you need to choose the best classifier, with optimal parameters, for the ["Titanic"](https://ru.wikipedia.org/wiki/Титаник) passenger problem.

__Task 1.__  
Load the data (see the previous lab).

In [141]:
import pandas as pd

In [142]:
train = pd.read_csv('../data/lab7/train.csv')
test = pd.read_csv('../data/lab7/test.csv')

__Task 2.__  
Preprocess the data (see the previous lab).

In [143]:
train['Age'] = train['Age'].fillna(-0.5)
test['Age'] = test['Age'].fillna(-0.5)

In [144]:
def process_age(df, cut_points, label_names):
    df = df.copy()
    df['Age'] = df['Age'].fillna(-0.5)
    df['Age_categories'] = pd.cut(df['Age'], bins=cut_points, labels=label_names, right=True)
    return df

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager", "Young_Adult", "Adult", "Senior"]

train = process_age(train,cut_points,label_names)
test = process_age(test,cut_points,label_names)
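
To make the binning concrete, here is a minimal sketch of how `pd.cut` assigns labels with the same `cut_points` and `label_names`. The ages in `ages` are hypothetical values chosen to sit on and around the bin edges; they are not taken from the dataset.

```python
import pandas as pd

# Same bin edges and labels as in the cell above.
cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager", "Young_Adult", "Adult", "Senior"]

# Hypothetical ages (assumption, for illustration only).
ages = pd.Series([-0.5, 0.4, 5, 5.1, 18, 35, 80])

# right=True means each bin includes its right edge: (-1, 0], (0, 5], (5, 12], ...
categories = pd.cut(ages, bins=cut_points, labels=label_names, right=True)
print(list(categories))
```

Because the right edge is inclusive, an age of exactly 5 still counts as `Infant`, and the `-0.5` placeholder for missing ages falls into `Missing`.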

In [145]:
def create_dummies(df: pd.DataFrame, column_name):
    return pd.get_dummies(df, columns=[column_name])

train = create_dummies(train, "Pclass")
test = create_dummies(test, "Pclass")

train = create_dummies(train, "Sex")
test = create_dummies(test, "Sex")

train = create_dummies(train, "Age_categories")
test = create_dummies(test, "Age_categories")
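
As a small illustration of what `pd.get_dummies` does here, the sketch below encodes a toy frame. The frames `toy_train` and `toy_test` are hypothetical. The `reindex` step is an optional safeguard, not part of the original preprocessing, for the case where the test set lacks a category that the training set has.

```python
import pandas as pd

# Hypothetical toy frames (assumption): the test frame has no third-class passengers.
toy_train = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Pclass": [1, 2], "Sex": ["female", "male"]})

# One indicator column per distinct value; untouched columns come first.
train_encoded = pd.get_dummies(toy_train, columns=["Pclass"])
print(train_encoded.columns.tolist())

# Align the test columns with the training columns, filling absent indicators with False.
test_encoded = pd.get_dummies(toy_test, columns=["Pclass"])
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)
print(test_encoded.columns.tolist())
```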

In [146]:
columns_to_drop = ['Name', 'Age', 'Ticket', 'PassengerId', 'Cabin', 'Embarked', 'Fare', 'Parch', 'SibSp']
train = train.drop(columns_to_drop, axis=1)
test = test.drop(columns_to_drop, axis=1)

In [147]:
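from sklearn.model_selection import train_test_split


# Note: test_size=0.8 holds out 80% of the rows, so x_train keeps only 20% of them
# (178 rows, see the output below). This split is not used later: the grid search
# cross-validates on the full X and y.
x_train, x_valid, y_train, y_valid = train_test_split(train.drop('Survived', axis=1), train['Survived'], random_state=42, test_size=0.8)
x_train.info()
X = train.drop('Survived', axis=1)
y = train['Survived']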

<class 'pandas.core.frame.DataFrame'>
Index: 178 entries, 761 to 102
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Pclass_1                    178 non-null    bool 
 1   Pclass_2                    178 non-null    bool 
 2   Pclass_3                    178 non-null    bool 
 3   Sex_female                  178 non-null    bool 
 4   Sex_male                    178 non-null    bool 
 5   Age_categories_Missing      178 non-null    bool 
 6   Age_categories_Infant       178 non-null    bool 
 7   Age_categories_Child        178 non-null    bool 
 8   Age_categories_Teenager     178 non-null    bool 
 9   Age_categories_Young_Adult  178 non-null    bool 
 10  Age_categories_Adult        178 non-null    bool 
 11  Age_categories_Senior       178 non-null    bool 
dtypes: bool(12)
memory usage: 3.5 KB
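
The 178 rows reported above follow from `test_size=0.8`. The sketch below reproduces the arithmetic on a placeholder array of 891 rows (the size of the Kaggle Titanic training file) and contrasts it with the more common `test_size=0.2`; `X_demo` is a stand-in, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with 891 rows (assumption: only the row count matters here).
X_demo = np.arange(891).reshape(-1, 1)

# test_size is the fraction that goes to the validation part.
small_train, large_valid = train_test_split(X_demo, test_size=0.8, random_state=42)
large_train, small_valid = train_test_split(X_demo, test_size=0.2, random_state=42)

print(len(small_train), len(large_valid))
print(len(large_train), len(small_valid))
```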


__Task 3.__  
Apply feature scaling (`StandardScaler`, `MinMaxScaler`).

In [148]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler


min_max_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
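
For reference, a minimal sketch of what the two scalers do to a single column. The input `X_demo` is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-feature data (assumption, for illustration only).
X_demo = np.array([[0.0], [5.0], [10.0]])

# StandardScaler: subtract the column mean and divide by the column standard deviation.
standardized = StandardScaler().fit_transform(X_demo)

# MinMaxScaler: map the column minimum to 0 and the maximum to 1.
rescaled = MinMaxScaler().fit_transform(X_demo)

print(standardized.ravel())
print(rescaled.ravel())
```

Note that `StandardScaler` produces negative values, while `MinMaxScaler` keeps everything in the range from 0 to 1.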

__Task 4.__  
Apply various feature transformations (`PolynomialFeatures`).

In [149]:
from sklearn.preprocessing import PolynomialFeatures


features = PolynomialFeatures()

__Task 5.__  
Train several classifiers, including:  
1. Logistic regression (`LogisticRegression`).
1. Support vector machine (`SVC`).
1. *k*-nearest neighbors (`KNeighborsClassifier`).
1. Naive Bayes classifier (`MultinomialNB`).
1. Decision trees (`DecisionTreeClassifier`).
1. Random forest (`RandomForestClassifier`).
1. AdaBoost (`AdaBoostClassifier`).
1. Gradient boosting (`GradientBoostingClassifier`).

You can use the `train_test_split()` function for training and quality evaluation.

In [150]:
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
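
The train-and-validate loop that this task asks for could look like the sketch below. It runs on synthetic binary features rather than the Titanic frame, and the three models are just a sample of the list above; `X_demo`, `y_demo`, and `scores` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 0/1 features (assumption); the label is 1 when either of the first two is set.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 5))
y_demo = X_demo[:, 0] | X_demo[:, 1]

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit each model on the training part and record accuracy on the validation part.
scores = {}
for model in (LogisticRegression(), MultinomialNB(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = model.score(X_va, y_va)

print(scores)
```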

__Task 6.__  
Use `Pipeline` and `GridSearchCV` to choose the optimal architecture:
1. The scaling method.
1. The polynomial degree in `PolynomialFeatures`.
1. The classifier parameters (including regularization parameters).

Record the test results (parameter variants, quality scores) in an Excel table.

In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline


def train_with_cv(X, y, clf, scaler, param_grid):
    """Grid-search a scaler -> polynomial features -> classifier pipeline."""
    pipeline = Pipeline(steps=[
        ("scaler", scaler),
        ("poly", PolynomialFeatures()),
        ("model", clf),
    ])
    # 10-fold cross-validation with a fixed seed so that runs are comparable.
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    grid = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring="accuracy",
        cv=cv,
        n_jobs=4)
    grid.fit(X, y)

    return grid
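
The task also asks to select the scaling method. One way to do that, sketched below on synthetic data, is to put the `scaler` step itself into the parameter grid, since `GridSearchCV` can swap whole pipeline steps by name. The data and the small grid are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

# Synthetic 0/1 features (assumption, stands in for the Titanic frame).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(120, 4)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] >= 1).astype(int)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures()),
    ("model", LogisticRegression(max_iter=500)),
])

# The "scaler" key replaces the whole step; "step__param" keys set parameters of a step.
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "poly__degree": [1, 2],
    "model__C": [0.1, 1, 10],
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)
grid.fit(X_demo, y_demo)

print(type(grid.best_params_["scaler"]).__name__, grid.best_score_)
```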

In [172]:
classifiers = {
    "lg": LogisticRegression(),
    "svm": svm.SVC(),
    "knn": KNeighborsClassifier(),
    "mnb": MultinomialNB(),
    "dtc": DecisionTreeClassifier(),
    "rfc": RandomForestClassifier(),
    "ada": AdaBoostClassifier(),
    "gbc": GradientBoostingClassifier(),
}

classifier_params = {
    "lg": {
        'poly__degree': [1, 2, 3, 4],
        # 'model__penalty': ['l1', 'l2', 'elasticnet'],
        'model__C': [0.01, 0.1, 1, 10],
        'model__max_iter': [100, 500]
    },
    "svm": {
        'poly__degree': [1, 2, 3, 4],
        'model__C': [0.1, 1, 10],
        'model__kernel': ['linear', 'rbf', 'poly'],
        'model__gamma': ['scale', 'auto', 0.1],
        'model__degree': [2, 3]
    },
    "knn": {
        'poly__degree': [1, 2, 3, 4],
        'model__n_neighbors': [3, 5, 7, 9],
        'model__weights': ['uniform', 'distance'],
        'model__algorithm': ['auto', 'ball_tree'],
    },
    "mnb": {
        'poly__degree': [1, 2, 3, 4],
        'model__alpha': [0.1, 0.5, 1.0],
        'model__fit_prior': [True, False]
    },
    "dtc": {
        'poly__degree': [1, 2, 3, 4],
        'model__max_depth': [None, 5, 10],
        'model__criterion': ['gini', 'entropy']
    },
    "rfc": {
        'poly__degree': [1, 2, 3, 4],
        'model__n_estimators': [100, 200],
        'model__max_depth': [None, 5, 10],
    },
    "ada": {
        'poly__degree': [1, 2, 3, 4],
        'model__n_estimators': [50, 100],
        'model__learning_rate': [0.01, 0.1, 1.0],
        # 'SAMME.R' is deprecated in recent scikit-learn releases; drop it if the fit fails.
        'model__algorithm': ['SAMME', 'SAMME.R']
    },
    "gbc": {
        'poly__degree': [1, 2, 3, 4],
        'model__n_estimators': [100, 200],
        'model__learning_rate': [0.05, 0.1, 0.2],
        'model__max_depth': [3, 5],
    }
}

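results = {}
# MinMaxScaler keeps every feature non-negative, which MultinomialNB requires;
# it is also the scaler that appears in the results shown below.
for name, classifier in classifiers.items():
    print(f"\n{'=' * 50}")
    print(f"Training {classifier.__class__.__name__}...")
    results[name] = train_with_cv(
        X, y,
        classifier,
        MinMaxScaler(),
        classifier_params[name])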



Training LogisticRegression...

Training SVC...


KeyboardInterrupt: 
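
One caveat when choosing the scaler for this loop: `MultinomialNB` rejects negative inputs, and `StandardScaler` produces them. The sketch below demonstrates this on a tiny made-up array; `standard_ok` and `minmax_ok` are illustrative flags.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up 0/1 data (assumption, for illustration only).
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_demo = np.array([0, 1, 1, 0])

# Standardized features are centered on zero, so some of them are negative.
try:
    MultinomialNB().fit(StandardScaler().fit_transform(X_demo), y_demo)
    standard_ok = True
except ValueError as error:
    standard_ok = False
    print("StandardScaler input rejected:", error)

# Min-max scaled features stay in [0, 1], so the fit succeeds.
MultinomialNB().fit(MinMaxScaler().fit_transform(X_demo), y_demo)
minmax_ok = True
print(standard_ok, minmax_ok)
```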

In [153]:
results

{'lg': GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
              estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                        ('poly', PolynomialFeatures()),
                                        ('model', LogisticRegression())]),
              n_jobs=-1,
              param_grid={'model__C': [0.01, 0.1, 1, 10],
                          'model__max_iter': [100, 500],
                          'poly__degree': [1, 2, 3, 4]},
              scoring='accuracy'),
 'svm': GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
              estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                        ('poly', PolynomialFeatures()),
                                        ('model', SVC())]),
              n_jobs=-1,
              param_grid={'model__C': [0.1, 1, 10], 'model__degree': [2, 3],
                          'model__gamma': ['scale', 'auto', 0.1],
                          'model__ke

In [161]:
result_dict = []
for key, grid in results.items():
    result_dict.append({
        "model": key,
        "estimator": grid.best_estimator_,
        "best_params": grid.best_params_,
        "best_score": grid.best_score_
    })

result_dict_df = pd.DataFrame(result_dict)
result_dict_df.sort_values(by=["best_score"], ascending=False)

| | model | estimator | best_params | best_score |
|---|---|---|---|---|
| 7 | gbc | (MinMaxScaler(), PolynomialFeatures(degree=1),... | {'model__learning_rate': 0.05, 'model__max_dep... | 0.8204 |
| 6 | ada | (MinMaxScaler(), PolynomialFeatures(), (Decisi... | {'model__algorithm': 'SAMME', 'model__learning... | 0.819276 |
| 2 | knn | (MinMaxScaler(), PolynomialFeatures(), KNeighb... | {'model__algorithm': 'auto', 'model__n_neighbo... | 0.817066 |
| 5 | rfc | (MinMaxScaler(), PolynomialFeatures(degree=1),... | {'model__max_depth': 5, 'model__n_estimators':... | 0.815918 |
| 0 | lg | (MinMaxScaler(), PolynomialFeatures(), Logisti... | {'model__C': 0.1, 'model__max_iter': 100, 'pol... | 0.814819 |
| 4 | dtc | (MinMaxScaler(), PolynomialFeatures(degree=1),... | {'model__criterion': 'gini', 'model__max_depth... | 0.81367 |
| 1 | svm | (MinMaxScaler(), PolynomialFeatures(degree=4),... | {'model__C': 0.1, 'model__degree': 2, 'model__... | 0.812547 |
| 3 | mnb | (MinMaxScaler(), PolynomialFeatures(degree=4),... | {'model__alpha': 1.0, 'model__fit_prior': True... | 0.802422 |
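
To record every tested parameter combination rather than only the best one per model, the `cv_results_` attribute of a fitted `GridSearchCV` can be turned into a table. The sketch below does this on synthetic data and writes CSV to an in-memory buffer. The commented `to_excel` line is an assumption: it needs an Excel writer such as `openpyxl` installed, and the file name is hypothetical.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption, stands in for the Titanic features).
rng = np.random.default_rng(0)
X_demo = rng.random((60, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, scoring="accuracy", cv=3)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
report = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")

buffer = io.StringIO()
report.to_csv(buffer, index=False)
# report.to_excel("grid_results.xlsx", index=False)  # hypothetical file name; requires openpyxl
print(buffer.getvalue())
```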


__Task 7.__  
1. Choose several of the best classifiers (from 3 to 10).
1. Train the chosen classifiers on all available labeled data.
1. Get predictions for the test data.
1. Submit the results to the [Kaggle](https://ru.wikipedia.org/wiki/Титаник) server.

In [None]:
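# Take the five pipelines with the best cross-validation accuracy.
clfs = result_dict_df.sort_values(by=["best_score"], ascending=False)['estimator']
trained_clfs = []
for clf in clfs.head(5):
    print(type(clf["model"]).__name__)
    # Refit each pipeline on all labeled data.
    trained_clfs.append(clf.fit(X, y))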

GradientBoostingClassifier
AdaBoostClassifier
KNeighborsClassifier
RandomForestClassifier
LogisticRegression
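
A side note on the refit step: with the default `refit=True`, `GridSearchCV` already retrains `best_estimator_` on all data passed to `fit`, so an extra `fit` call is harmless but not strictly required when the search ran on the full labeled set. The sketch below checks this on synthetic data (an assumption, not the Titanic frame).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 0/1 features (assumption); the label simply copies the first feature.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(80, 3))
y_demo = X_demo[:, 0]

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), {"max_depth": [1, 2]}, cv=4)
grid.fit(X_demo, y_demo)

# best_estimator_ has already been refit on all 80 rows, so it can predict directly.
predictions = grid.best_estimator_.predict(X_demo)
print(grid.refit, (predictions == y_demo).mean())
```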


In [171]:
# PassengerId was dropped during preprocessing, so reload it once from the raw file.
test_ids = pd.read_csv('../data/lab7/test.csv')["PassengerId"]
for clf in trained_clfs:
    predictions = clf.predict(test)
    submission = pd.DataFrame({"PassengerId": test_ids, "Survived": predictions})
    # One submission file per model, named after the final pipeline step.
    submission.to_csv(f'{type(clf["model"]).__name__}__titanic_submission.csv', index=False)