<a href="https://colab.research.google.com/github/ccal2/dataScienceProject/blob/master/project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This project uses a *dataset* with data from the book series *A Song of Ice and Fire*, better known by the title of its first book, *A Game of Thrones*.

The file `battles_2.csv` was exported from the `project.ipynb` notebook and contains information about several battles that take place during the story.

In this project we use *Machine Learning* algorithms to build a classifier that predicts `attacker_outcome` for each battle. This column holds a boolean value indicating whether the attacker won the battle.

# Setup

**Remember to upload the `battles_2.csv` file.**

In [1]:
!pip install optuna --quiet
!pip install mlflow --quiet
!pip install pyngrok --quiet

In [2]:
import pandas as pd
import numpy as np

In [3]:
battles = pd.read_csv('battles_2.csv')
battles.head()

Unnamed: 0,name,year,attacker_king,defender_king,attacker_1,defender_1,attacker_outcome,battle_type,major_death,major_capture,attacker_size,defender_size,summer,location,region,attacker_commander_1,defender_commander_1,size_difference,size_difference_disc,total_size
0,Battle of the Golden Tooth,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,Tully,1,pitched battle,1,0,15000.0,4000.0,1,Golden Tooth,The Westerlands,Jaime Lannister,Clement Piper,11000.0,advantage_2,19000.0
1,Battle at the Mummer's Ford,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,Baratheon,1,ambush,1,0,7713.0,120.0,1,Mummer's Ford,The Riverlands,Gregor Clegane,Beric Dondarrion,7593.0,advantage_2,7833.0
2,Battle of Riverrun,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,Tully,1,pitched battle,0,1,15000.0,10000.0,1,Riverrun,The Riverlands,Jaime Lannister,Edmure Tully,5000.0,advantage_2,25000.0
3,Battle of the Green Fork,298,Robb Stark,Joffrey/Tommen Baratheon,Stark,Lannister,0,pitched battle,1,1,18000.0,20000.0,1,Green Fork,The Riverlands,Roose Bolton,Tywin Lannister,-2000.0,disavantage_2,38000.0
4,Battle of the Whispering Wood,298,Robb Stark,Joffrey/Tommen Baratheon,Stark,Lannister,1,ambush,1,1,1875.0,6000.0,1,Whispering Wood,The Riverlands,Robb Stark,Jaime Lannister,-4125.0,disavantage_3,7875.0


In [4]:
battles.dtypes

name                     object
year                      int64
attacker_king            object
defender_king            object
attacker_1               object
defender_1               object
attacker_outcome          int64
battle_type              object
major_death               int64
major_capture             int64
attacker_size           float64
defender_size           float64
summer                    int64
location                 object
region                   object
attacker_commander_1     object
defender_commander_1     object
size_difference         float64
size_difference_disc     object
total_size              float64
dtype: object

Because the `name` column holds a unique value per battle (its name), we drop it. The `size_difference_disc` column was created in project 1 as a discretization of `size_difference`, so we drop it as well: otherwise the difference in size between the attacking and defending forces would be duplicated across the two columns.

In [5]:
battles.drop(columns=['name', 'size_difference_disc'], inplace=True)

In [6]:
# adjust the data types
battles['attacker_king'] = battles['attacker_king'].astype('category')
battles['defender_king'] = battles['defender_king'].astype('category')
battles['attacker_1'] = battles['attacker_1'].astype('category')
battles['defender_1'] = battles['defender_1'].astype('category')
battles['attacker_outcome'] = battles['attacker_outcome'].astype('category')
battles['battle_type'] = battles['battle_type'].astype('category')
battles['attacker_size'] = battles['attacker_size'].astype('int64')
battles['defender_size'] = battles['defender_size'].astype('int64')
battles['location'] = battles['location'].astype('category')
battles['region'] = battles['region'].astype('category')
battles['attacker_commander_1'] = battles['attacker_commander_1'].astype('category')
battles['defender_commander_1'] = battles['defender_commander_1'].astype('category')
battles['size_difference'] = battles['size_difference'].astype('int64')
battles['total_size'] = battles['total_size'].astype('int64')

battles.dtypes

year                       int64
attacker_king           category
defender_king           category
attacker_1              category
defender_1              category
attacker_outcome        category
battle_type             category
major_death                int64
major_capture              int64
attacker_size              int64
defender_size              int64
summer                     int64
location                category
region                  category
attacker_commander_1    category
defender_commander_1    category
size_difference            int64
total_size                 int64
dtype: object

Although the `major_death`, `major_capture` and `summer` columns represent categorical data, their values are boolean (always 0 or 1). To make prediction easier, we keep these columns in numeric form.
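A quick way to confirm that a column is effectively boolean is to check that every value is 0 or 1. The sketch below uses a small made-up DataFrame; the `toy` frame and the `is_binary` helper are illustrative names and are not part of the notebook.

```python
import pandas as pd


def is_binary(series: pd.Series) -> bool:
    """Return True when every value in the series is 0 or 1."""
    return bool(series.isin([0, 1]).all())


# Made-up data for illustration only; not taken from battles_2.csv.
toy = pd.DataFrame({
    "major_death": [0, 1, 1, 0],
    "summer": [1, 1, 0, 1],
    "attacker_size": [15000, 7713, 18000, 1875],
})

# Collect the columns whose values are all 0 or 1.
binary_columns = [col for col in toy.columns if is_binary(toy[col])]
print(binary_columns)
```

Applied to `battles`, the same check would give a one-line confirmation for each of the three columns.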

In [7]:
# show that the columns 'major_death', 'major_capture' and 'summer' only contain the values 0 and 1
battles.groupby('major_death')['attacker_outcome'].count(), \
battles.groupby('major_capture')['attacker_outcome'].count(), \
battles.groupby('summer')['attacker_outcome'].count()

(major_death
 0    24
 1    12
 Name: attacker_outcome, dtype: int64, major_capture
 0    26
 1    10
 Name: attacker_outcome, dtype: int64, summer
 0    10
 1    26
 Name: attacker_outcome, dtype: int64)

In [8]:
# separate the 'attacker_outcome' column as the classification target
x_battles = battles.drop(columns=['attacker_outcome'])
y_battles = battles['attacker_outcome']

In [9]:
# one-hot encoding
# leave the columns with numeric values unchanged
columns_to_keep = ['year', 'major_death', 'major_capture', 'attacker_size', 'defender_size', 'summer', 'size_difference', 'total_size']

x_battles = pd.get_dummies(x_battles, columns=x_battles.columns.drop(columns_to_keep))
x_battles.head()

Unnamed: 0,year,major_death,major_capture,attacker_size,defender_size,summer,size_difference,total_size,attacker_king_Balon/Euron Greyjoy,attacker_king_Joffrey/Tommen Baratheon,attacker_king_None,attacker_king_Robb Stark,attacker_king_Stannis Baratheon,defender_king_Balon/Euron Greyjoy,defender_king_Joffrey/Tommen Baratheon,defender_king_None,defender_king_Renly Baratheon,defender_king_Robb Stark,defender_king_Stannis Baratheon,attacker_1_Baratheon,attacker_1_Bolton,attacker_1_Bracken,attacker_1_Brave Companions,attacker_1_Darry,attacker_1_Frey,attacker_1_Greyjoy,attacker_1_Lannister,attacker_1_Stark,defender_1_Baratheon,defender_1_Blackwood,defender_1_Bolton,defender_1_Brave Companions,defender_1_Darry,defender_1_Greyjoy,defender_1_Lannister,defender_1_Mallister,defender_1_None,defender_1_Stark,defender_1_Tully,defender_1_Tyrell,...,attacker_commander_1_Loras Tyrell,attacker_commander_1_Mace Tyrell,attacker_commander_1_Ramsay Snow,attacker_commander_1_Ramsey Bolton,attacker_commander_1_Robb Stark,attacker_commander_1_Robertt Glover,attacker_commander_1_Rodrik Cassel,attacker_commander_1_Roose Bolton,attacker_commander_1_Rorge,attacker_commander_1_Stannis Baratheon,attacker_commander_1_Theon Greyjoy,attacker_commander_1_Tywin Lannister,attacker_commander_1_Victarion Greyjoy,attacker_commander_1_Walder Frey,defender_commander_1_Amory Lorch,defender_commander_1_Asha Greyjoy,defender_commander_1_Beric Dondarrion,defender_commander_1_Bran Stark,defender_commander_1_Brynden Tully,defender_commander_1_Clement Piper,defender_commander_1_Dagmer Cleftjaw,defender_commander_1_Edmure Tully,defender_commander_1_Gilbert Farring,defender_commander_1_Jaime Lannister,defender_commander_1_Jason Mallister,defender_commander_1_Lord Andros Brax,defender_commander_1_Lyman Darry,defender_commander_1_Randyll Tarly,defender_commander_1_Renly Baratheon,defender_commander_1_Robb Stark,defender_commander_1_Rodrik Cassel,defender_commander_1_Rolland Storm,defender_commander_1_Rolph Spicer,defender_commander_1_Roose Bolton,defender_commander_1_Stafford Lannister,defender_commander_1_Tyrion Lannister,defender_commander_1_Tytos Blackwood,defender_commander_1_Tywin Lannister,defender_commander_1_Unknown,defender_commander_1_Vargo Hoat
0,298,1,0,15000,4000,1,11000,19000,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,298,1,0,7713,120,1,7593,7833,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,298,0,1,15000,10000,1,5000,25000,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,298,1,1,18000,20000,1,-2000,38000,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,298,1,1,1875,6000,1,-4125,7875,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
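To make the encoding step easier to follow, here is a minimal sketch of `pd.get_dummies` on a made-up three-row frame. The rows are invented for illustration; only the column names mirror the notebook's. Each categorical column is replaced by one indicator column per distinct value, which is why high-cardinality columns such as the commander names inflate the feature count.

```python
import pandas as pd

# Made-up rows for illustration; not taken from battles_2.csv.
toy = pd.DataFrame({
    "year": [298, 299, 299],
    "battle_type": ["ambush", "siege", "ambush"],
    "region": ["The North", "The Riverlands", "The North"],
})

# Keep numeric columns as they are and encode everything else.
numeric_columns = ["year"]
categorical_columns = toy.columns.drop(numeric_columns)

encoded = pd.get_dummies(toy, columns=categorical_columns)
print(list(encoded.columns))
```

The untouched numeric column comes first, followed by one `<column>_<value>` indicator per category.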


# Model generation

In [10]:
x_battles.shape

(36, 124)

Since we have little data, we will use cross-validation.
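With only 36 rows, the fold sizes matter. The sketch below assumes a recent scikit-learn, where `cross_validate` defaults to 5 stratified folds for classifiers, and prints the training and test sizes of each fold. The labels are made up (30 wins, 6 losses) and may not match the real class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

n_samples = 36  # same number of rows as x_battles
X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters
# Made-up labels for illustration; the real class balance may differ.
y = np.array([1] * 30 + [0] * 6)

cv = StratifiedKFold(n_splits=5)
splits = list(cv.split(X, y))
train_sizes = [len(train_idx) for train_idx, _ in splits]
test_sizes = [len(test_idx) for _, test_idx in splits]
print("train sizes:", train_sizes)
print("test sizes:", test_sizes)
```

Under these assumptions the smallest training fold holds 28 samples, which is the upper limit for `n_neighbors` in the KNN search later in the notebook.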

In [11]:
from sklearn.model_selection import cross_validate, cross_val_score

The algorithms used are:
1. Gaussian Naive Bayes
2. KNN
3. Decision Tree
4. Random Forest

We will use Optuna to select the hyperparameters.

In [12]:
import optuna
from optuna.visualization import plot_param_importances

In [13]:
np.random.seed(10)
NUMBER_OF_TRIALS = 50

In [14]:
# logging
import mlflow
import warnings

mlflow.sklearn.autolog()
warnings.filterwarnings("ignore")

## Gaussian Naive Bayes

In [15]:
from sklearn.naive_bayes import GaussianNB

In [16]:
experiment_id = mlflow.create_experiment(name='gaussianNB')

def gaussianNB(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'var_smoothing': trial.suggest_float('var_smoothing', 1e-10, 1e-08)}

    model = GaussianNB(var_smoothing=params['var_smoothing'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [17]:
study_GNB = optuna.create_study(direction='maximize')
study_GNB.optimize(gaussianNB, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-17 02:37:49,066] A new study created in memory with name: no-name-d86e9838-590d-4eaf-95a7-688a7c07c643
[I 2021-08-17 02:37:50,729] Trial 0 finished with value: 0.8071428571428572 and parameters: {'var_smoothing': 2.0235640964632587e-10}. Best is trial 0 with value: 0.8071428571428572.
[I 2021-08-17 02:37:52,398] Trial 1 finished with value: 0.7285714285714285 and parameters: {'var_smoothing': 2.4096838771467805e-09}. Best is trial 0 with value: 0.8071428571428572.
[I 2021-08-17 02:37:54,108] Trial 2 finished with value: 0.7571428571428571 and parameters: {'var_smoothing': 8.90731265753177e-09}. Best is trial 0 with value: 0.8071428571428572.
[I 2021-08-17 02:37:55,703] Trial 3 finished with value: 0.7571428571428571 and parameters: {'var_smoothing': 6.659529338292176e-09}. Best is trial 0 with value: 0.8071428571428572.
[I 2021-08-17 02:37:57,431] Trial 4 finished with value: 0.7571428571428571 and par

In [18]:
print('Best hyperparameters:\t', study_GNB.best_trial.params)
print('Training accuracy:\t', study_GNB.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_GNB.best_trial.value)

Best hyperparameters:	 {'var_smoothing': 1.2804668486697132e-10}
Training accuracy:	 1.0
Test accuracy:		 0.8357142857142857


In [19]:
optuna.visualization.plot_optimization_history(study_GNB)

## KNN


In [20]:
from sklearn.neighbors import KNeighborsClassifier

In [21]:
experiment_id = mlflow.create_experiment(name='knn')

def knn(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'n_neighbors': trial.suggest_int('n_neighbors', 1, 28),
              'algorithm': trial.suggest_categorical('algorithm', ['ball_tree', 'kd_tree', 'brute']),
              'p': trial.suggest_int('p', 1, 10)}

    model = KNeighborsClassifier(n_neighbors=params['n_neighbors'], algorithm=params['algorithm'], p=params['p'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [22]:
study_KNN = optuna.create_study(direction='maximize')
study_KNN.optimize(knn, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-17 02:39:11,985] A new study created in memory with name: no-name-db148686-cbce-47f5-9a62-39d4aa692022
[I 2021-08-17 02:39:13,666] Trial 0 finished with value: 0.8607142857142858 and parameters: {'n_neighbors': 7, 'algorithm': 'brute', 'p': 2}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-17 02:39:17,602] Trial 1 finished with value: 0.8607142857142858 and parameters: {'n_neighbors': 20, 'algorithm': 'ball_tree', 'p': 5}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-17 02:39:19,245] Trial 2 finished with value: 0.8607142857142858 and parameters: {'n_neighbors': 16, 'algorithm': 'ball_tree', 'p': 6}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-17 02:39:21,075] Trial 3 finished with value: 0.8321428571428571 and parameters: {'n_neighbors': 6, 'algorithm': 'ball_tree', 'p': 8}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-17 02:39:22,710] Trial 4 fi

In [23]:
print('Best hyperparameters:\t', study_KNN.best_trial.params)
print('Training accuracy:\t', study_KNN.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_KNN.best_trial.value)

Best hyperparameters:	 {'n_neighbors': 8, 'algorithm': 'ball_tree', 'p': 1}
Training accuracy:	 0.8679802955665024
Test accuracy:		 0.8857142857142858


In [24]:
optuna.visualization.plot_optimization_history(study_KNN)

In [25]:
plot_param_importances(study_KNN)

2021/08/17 02:40:39 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'e069c5ce991f415496a26a0898094032', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


## Decision Tree


In [26]:
from sklearn.tree import DecisionTreeClassifier

In [27]:
experiment_id = mlflow.create_experiment(name='decision_tree')

def decision_tree(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
              'max_features': trial.suggest_int('max_features', 1, len(x_battles.columns))}

    model = DecisionTreeClassifier(criterion=params['criterion'], max_features=params['max_features'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [28]:
study_DT = optuna.create_study(direction='maximize')
study_DT.optimize(decision_tree, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-17 02:40:40,304] A new study created in memory with name: no-name-d9111523-94e3-46e4-b94f-7fa516e95f84
[I 2021-08-17 02:40:44,223] Trial 0 finished with value: 0.6928571428571428 and parameters: {'criterion': 'gini', 'max_features': 77}. Best is trial 0 with value: 0.6928571428571428.
[I 2021-08-17 02:40:45,864] Trial 1 finished with value: 0.75 and parameters: {'criterion': 'entropy', 'max_features': 48}. Best is trial 1 with value: 0.75.
[I 2021-08-17 02:40:47,651] Trial 2 finished with value: 0.6928571428571428 and parameters: {'criterion': 'gini', 'max_features': 75}. Best is trial 1 with value: 0.75.
[I 2021-08-17 02:40:49,283] Trial 3 finished with value: 0.8 and parameters: {'criterion': 'gini', 'max_features': 109}. Best is trial 3 with value: 0.8.
[I 2021-08-17 02:40:50,902] Trial 4 finished with value: 0.8357142857142857 and parameters: {'criterion': 'gini', 'max_features': 24}. Best is trial

In [29]:
print('Best hyperparameters:\t', study_DT.best_trial.params)
print('Training accuracy:\t', study_DT.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_DT.best_trial.value)

Best hyperparameters:	 {'criterion': 'gini', 'max_features': 8}
Training accuracy:	 1.0
Test accuracy:		 0.9464285714285714


In [30]:
optuna.visualization.plot_optimization_history(study_DT)

In [31]:
plot_param_importances(study_DT)

2021/08/17 02:42:07 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'dff4adec41424dd2b792b4701b55bb0c', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


## Random Forest


In [32]:
from sklearn.ensemble import RandomForestClassifier

In [33]:
experiment_id = mlflow.create_experiment(name='random_forest')

def random_forest(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'n_estimators': trial.suggest_int('n_estimators', 50, 150),
              'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
              'max_features': trial.suggest_int('max_features', 1, len(x_battles.columns))}

    model = RandomForestClassifier(n_estimators=params['n_estimators'], criterion=params['criterion'], max_features=params['max_features'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [34]:
study_RF = optuna.create_study(direction='maximize')
study_RF.optimize(random_forest, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-17 02:42:08,830] A new study created in memory with name: no-name-031bfd20-31f3-4c89-a134-3a569ce0e5b2
[I 2021-08-17 02:42:12,531] Trial 0 finished with value: 0.8892857142857142 and parameters: {'n_estimators': 143, 'criterion': 'entropy', 'max_features': 24}. Best is trial 0 with value: 0.8892857142857142.
[I 2021-08-17 02:42:16,188] Trial 1 finished with value: 0.8607142857142858 and parameters: {'n_estimators': 141, 'criterion': 'entropy', 'max_features': 21}. Best is trial 0 with value: 0.8892857142857142.
[I 2021-08-17 02:42:18,711] Trial 2 finished with value: 0.8321428571428571 and parameters: {'n_estimators': 56, 'criterion': 'gini', 'max_features': 74}. Best is trial 0 with value: 0.8892857142857142.
[I 2021-08-17 02:42:23,677] Trial 3 finished with value: 0.8892857142857142 and parameters: {'n_estimators': 62, 'criterion': 'gini', 'max_features': 49}. Best is trial 0 with value: 0.8892857142857142.


In [35]:
print('Best hyperparameters:\t', study_RF.best_trial.params)
print('Training accuracy:\t', study_RF.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_RF.best_trial.value)

Best hyperparameters:	 {'n_estimators': 143, 'criterion': 'entropy', 'max_features': 24}
Training accuracy:	 1.0
Test accuracy:		 0.8892857142857142


In [36]:
optuna.visualization.plot_optimization_history(study_RF)

In [37]:
plot_param_importances(study_RF)

2021/08/17 02:44:59 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '940b3117f28a4ae9b6b4c8ed7a0ee6b1', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


# Algorithm selection

In [38]:
# best model for each algorithm:
best_GNB = GaussianNB(var_smoothing=study_GNB.best_trial.params['var_smoothing'])
best_KNN = KNeighborsClassifier(n_neighbors=study_KNN.best_trial.params['n_neighbors'], algorithm=study_KNN.best_trial.params['algorithm'], p=study_KNN.best_trial.params['p'])
best_DT = DecisionTreeClassifier(criterion=study_DT.best_trial.params['criterion'], max_features=study_DT.best_trial.params['max_features'])
best_RF = RandomForestClassifier(n_estimators=study_RF.best_trial.params['n_estimators'], criterion=study_RF.best_trial.params['criterion'], max_features=study_RF.best_trial.params['max_features'])

best_models = [best_GNB, best_KNN, best_DT, best_RF]

In [39]:
# find the best algorithm

experiment_id = mlflow.create_experiment(name='best_algorithm')

best_model = best_models[0]
best_train_accuracy = 0
best_test_accuracy = 0

for model in best_models:
  with mlflow.start_run(experiment_id=experiment_id):
    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # logging
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

    # comparison
    if test_accuracy > best_test_accuracy:
      best_test_accuracy = test_accuracy
      best_train_accuracy = train_accuracy
      best_model = model

print('Best model:\t', best_model)
print('Training accuracy:\t', best_train_accuracy)
print('Test accuracy:\t\t', best_test_accuracy)

Best model:	 KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=1,
                     weights='uniform')
Training accuracy:	 0.8679802955665024
Test accuracy:		 0.8857142857142858


# Diagnosis and optimization

From the accuracies on the training and test sets (about 86.8% and 88.6%), we can see that the model has high bias and low variance, which characterizes *underfitting*. In other words, the model does not fit the training set well.
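As a rough rule of thumb, a large gap between training and test accuracy points to high variance (overfitting), while two similar scores that both sit well below 100% point to high bias (underfitting). The sketch below shows how that gap can be read from `cross_validate`. It uses synthetic data from `make_classification` rather than `x_battles`, and the `train_test_gap` helper is a hypothetical name introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook's x_battles / y_battles are not used here.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def train_test_gap(model):
    """Return mean train accuracy, mean test accuracy and their difference."""
    scores = cross_validate(model, X, y, scoring="accuracy",
                            return_train_score=True)
    train = scores["train_score"].mean()
    test = scores["test_score"].mean()
    return train, test, train - test


# A very flexible model (k=1) versus a much smoother one (k=25).
for k in (1, 25):
    train, test, gap = train_test_gap(KNeighborsClassifier(n_neighbors=k))
    print(f"n_neighbors={k}: train={train:.3f} test={test:.3f} gap={gap:.3f}")
```

With `n_neighbors=1` every training point is its own nearest neighbor, so the training accuracy is perfect and the gap is attributable entirely to variance.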

Let us now try to optimize the selected model.

The hyperparameter-importance plot from the Optuna *study* shows that changing the `algorithm` and `p` parameters had little effect on the model's accuracy, while `n_neighbors` was the most important parameter. For that reason, we keep the current model's `algorithm` and `p` and vary other parameters of the model.
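The same kind of exhaustive search can also be expressed with scikit-learn's `GridSearchCV`. The sketch below is an alternative formulation, not the code this notebook runs: it fixes `algorithm='ball_tree'` and `p=1` (the values from the best Optuna trial), varies `metric`, `weights` and `n_neighbors`, and uses synthetic data with 36 rows as a stand-in for `x_battles`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with 36 rows, matching the size of x_battles.
X, y = make_classification(n_samples=36, n_features=8, random_state=0)

param_grid = {
    "metric": ["minkowski", "chebyshev"],
    "weights": ["uniform", "distance"],
    "n_neighbors": list(range(1, 29)),  # capped by the smallest training fold
}

# algorithm and p are held fixed; only the grid parameters vary.
search = GridSearchCV(
    KNeighborsClassifier(algorithm="ball_tree", p=1),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

`GridSearchCV` keeps every fold score in `search.cv_results_`, which can replace the manual `scores` list when MLflow logging is not needed.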

In [40]:
plot_param_importances(study_KNN)

2021/08/17 02:45:09 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'c07fcf0a96de4071b2895ebc649307b9', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


In [41]:
metrics = ['minkowski', 'chebyshev']
weights = ['uniform', 'distance']
neighbors = range(1, 29) # maximum number of neighbors possible with this dataset (the smallest training fold has 28 of the 36 samples)

scores = []
best_accuracy = best_test_accuracy
best_model = best_KNN

experiment_id = mlflow.create_experiment(name='best_KNN')

for metric in metrics:
  for weight in weights:
    for n_neighbors in neighbors:
      with mlflow.start_run(experiment_id=experiment_id):
        # model
        model = KNeighborsClassifier(algorithm=study_KNN.best_trial.params['algorithm'], leaf_size=30,
                                     metric=metric, metric_params=None, n_jobs=None, n_neighbors=n_neighbors,
                                     p=study_KNN.best_trial.params['p'], weights=weight)

        # training and evaluation with cross-validation
        accuracies = cross_val_score(model, x_battles, y_battles, scoring='accuracy')

        # results
        accuracy = accuracies.mean()
        scores.append(accuracy)

        print('{ metric:', metric, ', weight:', weight, ', n_neighbors:', n_neighbors, '}, accuracy:', accuracy)

        # logging
        mlflow.log_metric('accuracy', accuracy)
        mlflow.sklearn.log_model(model, "model")
        mlflow.end_run()

        # comparison
        if accuracy > best_accuracy:
          best_accuracy = accuracy
          best_model = model

{ metric: minkowski , weight: uniform , n_neighbors: 1 }, accuracy: 0.7821428571428571
{ metric: minkowski , weight: uniform , n_neighbors: 2 }, accuracy: 0.7
{ metric: minkowski , weight: uniform , n_neighbors: 3 }, accuracy: 0.8357142857142857
{ metric: minkowski , weight: uniform , n_neighbors: 4 }, accuracy: 0.8071428571428572
{ metric: minkowski , weight: uniform , n_neighbors: 5 }, accuracy: 0.8321428571428571
{ metric: minkowski , weight: uniform , n_neighbors: 6 }, accuracy: 0.8321428571428571
{ metric: minkowski , weight: uniform , n_neighbors: 7 }, accuracy: 0.8857142857142858
{ metric: minkowski , weight: uniform , n_neighbors: 8 }, accuracy: 0.8857142857142858
{ metric: minkowski , weight: uniform , n_neighbors: 9 }, accuracy: 0.8607142857142858
{ metric: minkowski , weight: uniform , n_neighbors: 10 }, accuracy: 0.8607142857142858
{ metric: minkowski , weight: uniform , n_neighbors: 11 }, accuracy: 0.8607142857142858
{ metric: minkowski , weight: uniform , n_neighbors: 12 

In [42]:
print('Best model:\t', best_model)
print('Accuracy:\t', best_accuracy)

Best model:	 KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=1,
                     weights='uniform')
Accuracy:	 0.8857142857142858


## Comparison

In [43]:
# before
print('Model:\t\t', best_KNN)
print('Accuracy:\t', best_test_accuracy)

Model:		 KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=1,
                     weights='uniform')
Accuracy:	 0.8857142857142858


In [44]:
# after
print('Model:\t\t', best_model)
print('Accuracy:\t', best_accuracy)

Model:		 KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=1,
                     weights='uniform')
Accuracy:	 0.8857142857142858


Even after varying other KNN parameters, we did not find a model with higher accuracy than the one found previously.

The best model found therefore has an accuracy of 88.57%, which is not a particularly good value, but this was expected given how little data the *dataset* contains.
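One way to put this accuracy in context is to compare it with a majority-class baseline, that is, a model that always predicts the most frequent outcome. The sketch below uses made-up labels (30 wins, 6 losses) because the real class balance is not shown in this notebook. With the actual `y_battles` the baseline may differ.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up labels: 30 attacker wins and 6 losses. The real class balance
# in battles_2.csv is not shown in this notebook and may differ.
y = np.array([1] * 30 + [0] * 6)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline_accuracy = cross_val_score(baseline, X, y, scoring="accuracy").mean()
print(baseline_accuracy)
```

If the baseline on the real data turns out to be close to the tuned model's accuracy, that would suggest the classifier has learned little beyond the class frequencies.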