<a href="https://colab.research.google.com/github/ccal2/dataScienceProject/blob/master/project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This project was developed using a *dataset* containing data on the book series *A Song of Ice and Fire*, better known by the title of its first book, *A Game of Thrones*.

The file `battles_2.csv` was exported from the `project.ipynb` notebook and contains information about several battles that take place during the story.

In this project we will use *Machine Learning* algorithms to build a classifier that predicts a battle's `attacker_outcome`. This column holds a boolean value indicating whether or not the attacker won the battle.

# Setup

**Remember to upload the `battles_2.csv` file.**

In [1]:
!pip install optuna --quiet
!pip install mlflow --quiet
!pip install pyngrok --quiet

In [2]:
import pandas as pd
import numpy as np

In [3]:
battles = pd.read_csv('battles_2.csv')
battles.head()

Unnamed: 0,name,year,attacker_king,defender_king,attacker_1,defender_1,attacker_outcome,battle_type,major_death,major_capture,attacker_size,defender_size,summer,location,region,attacker_commander_1,defender_commander_1,size_difference,size_difference_disc,total_size
0,Battle of the Golden Tooth,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,Tully,1,pitched battle,1,0,15000.0,4000.0,1,Golden Tooth,The Westerlands,Jaime Lannister,Clement Piper,11000.0,advantage_2,19000.0
1,Battle at the Mummer's Ford,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,Baratheon,1,ambush,1,0,7713.0,120.0,1,Mummer's Ford,The Riverlands,Gregor Clegane,Beric Dondarrion,7593.0,advantage_2,7833.0
2,Battle of Riverrun,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,Tully,1,pitched battle,0,1,15000.0,10000.0,1,Riverrun,The Riverlands,Jaime Lannister,Edmure Tully,5000.0,advantage_2,25000.0
3,Battle of the Green Fork,298,Robb Stark,Joffrey/Tommen Baratheon,Stark,Lannister,0,pitched battle,1,1,18000.0,20000.0,1,Green Fork,The Riverlands,Roose Bolton,Tywin Lannister,-2000.0,disavantage_2,38000.0
4,Battle of the Whispering Wood,298,Robb Stark,Joffrey/Tommen Baratheon,Stark,Lannister,1,ambush,1,1,1875.0,6000.0,1,Whispering Wood,The Riverlands,Robb Stark,Jaime Lannister,-4125.0,disavantage_3,7875.0


In [4]:
battles.dtypes

name                     object
year                      int64
attacker_king            object
defender_king            object
attacker_1               object
defender_1               object
attacker_outcome          int64
battle_type              object
major_death               int64
major_capture             int64
attacker_size           float64
defender_size           float64
summer                    int64
location                 object
region                   object
attacker_commander_1     object
defender_commander_1     object
size_difference         float64
size_difference_disc     object
total_size              float64
dtype: object

Since the `name` column holds a unique value per row (the battle's name), we will remove it. The `size_difference_disc` column was created in project 1 as a discretization of the `size_difference` column, so it will also be removed: the information about the size difference between the attacking and defending forces is duplicated across these two columns.

In [5]:
battles.drop(columns=['name', 'size_difference_disc'], inplace=True)

In [6]:
# adjust the data types
battles['attacker_king'] = battles['attacker_king'].astype('category')
battles['defender_king'] = battles['defender_king'].astype('category')
battles['attacker_1'] = battles['attacker_1'].astype('category')
battles['defender_1'] = battles['defender_1'].astype('category')
battles['attacker_outcome'] = battles['attacker_outcome'].astype('category')
battles['battle_type'] = battles['battle_type'].astype('category')
battles['attacker_size'] = battles['attacker_size'].astype('int64')
battles['defender_size'] = battles['defender_size'].astype('int64')
battles['location'] = battles['location'].astype('category')
battles['region'] = battles['region'].astype('category')
battles['attacker_commander_1'] = battles['attacker_commander_1'].astype('category')
battles['defender_commander_1'] = battles['defender_commander_1'].astype('category')
battles['size_difference'] = battles['size_difference'].astype('int64')
battles['total_size'] = battles['total_size'].astype('int64')

battles.dtypes

year                       int64
attacker_king           category
defender_king           category
attacker_1              category
defender_1              category
attacker_outcome        category
battle_type             category
major_death                int64
major_capture              int64
attacker_size              int64
defender_size              int64
summer                     int64
location                category
region                  category
attacker_commander_1    category
defender_commander_1    category
size_difference            int64
total_size                 int64
dtype: object
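
The per-column conversions above can also be written as a single `astype` call with a column-to-dtype mapping. The following is a minimal sketch on a small illustrative frame (the name `sample` and its two rows are assumptions made for the example, with values taken from the `battles.head()` output), not a replacement for the cell above.

```python
import pandas as pd

# Two illustrative rows using column names from the battles table.
sample = pd.DataFrame({
    'attacker_king': ['Joffrey/Tommen Baratheon', 'Robb Stark'],
    'battle_type': ['pitched battle', 'ambush'],
    'attacker_size': [15000.0, 1875.0],
})

# One mapping replaces several single-column assignments.
sample = sample.astype({
    'attacker_king': 'category',
    'battle_type': 'category',
    'attacker_size': 'int64',
})

print(sample.dtypes)
```

Note that casting a float column to `int64` raises an error if the column still contains missing values, so this form only works once those have been handled.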

Although the columns `major_death`, `major_capture`, and `summer` represent categorical data, their values are boolean (always 0 or 1). To make prediction easier, we will keep these columns in numeric form.

In [7]:
# show that the columns 'major_death', 'major_capture' and 'summer' only contain the values 0 and 1
battles.groupby('major_death')['attacker_outcome'].count(), \
battles.groupby('major_capture')['attacker_outcome'].count(), \
battles.groupby('summer')['attacker_outcome'].count()

(major_death
 0    24
 1    12
 Name: attacker_outcome, dtype: int64, major_capture
 0    26
 1    10
 Name: attacker_outcome, dtype: int64, summer
 0    10
 1    26
 Name: attacker_outcome, dtype: int64)
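
A more direct way to run the same check is to test each column against the set `{0, 1}`. The sketch below assumes a small made-up frame named `sample`; only the column names come from the notebook.

```python
import pandas as pd

# Made-up rows; only the column names come from the notebook.
sample = pd.DataFrame({
    'major_death': [1, 1, 0, 1],
    'major_capture': [0, 0, 1, 1],
    'summer': [1, 1, 1, 0],
})

binary_columns = ['major_death', 'major_capture', 'summer']

# True for a column only if every value in it is 0 or 1.
is_binary = sample[binary_columns].isin([0, 1]).all()
print(is_binary)
```

Unlike the `groupby` counts, this returns one boolean per column, which is easier to assert on in an automated check.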

In [8]:
# separate the 'attacker_outcome' column to use as the classification target
x_battles = battles.drop(columns=['attacker_outcome'])
y_battles = battles['attacker_outcome']

In [9]:
# one-hot encoding
# leave columns with numeric values unchanged
columns_to_keep = ['year', 'major_death', 'major_capture', 'attacker_size', 'defender_size', 'summer', 'size_difference', 'total_size']

x_battles = pd.get_dummies(x_battles, columns=x_battles.columns.drop(columns_to_keep))
x_battles.head()

Unnamed: 0,year,major_death,major_capture,attacker_size,defender_size,summer,size_difference,total_size,attacker_king_Balon/Euron Greyjoy,attacker_king_Joffrey/Tommen Baratheon,attacker_king_None,attacker_king_Robb Stark,attacker_king_Stannis Baratheon,defender_king_Balon/Euron Greyjoy,defender_king_Joffrey/Tommen Baratheon,defender_king_None,defender_king_Renly Baratheon,defender_king_Robb Stark,defender_king_Stannis Baratheon,attacker_1_Baratheon,attacker_1_Bolton,attacker_1_Bracken,attacker_1_Brave Companions,attacker_1_Darry,attacker_1_Frey,attacker_1_Greyjoy,attacker_1_Lannister,attacker_1_Stark,defender_1_Baratheon,defender_1_Blackwood,defender_1_Bolton,defender_1_Brave Companions,defender_1_Darry,defender_1_Greyjoy,defender_1_Lannister,defender_1_Mallister,defender_1_None,defender_1_Stark,defender_1_Tully,defender_1_Tyrell,...,attacker_commander_1_Loras Tyrell,attacker_commander_1_Mace Tyrell,attacker_commander_1_Ramsay Snow,attacker_commander_1_Ramsey Bolton,attacker_commander_1_Robb Stark,attacker_commander_1_Robertt Glover,attacker_commander_1_Rodrik Cassel,attacker_commander_1_Roose Bolton,attacker_commander_1_Rorge,attacker_commander_1_Stannis Baratheon,attacker_commander_1_Theon Greyjoy,attacker_commander_1_Tywin Lannister,attacker_commander_1_Victarion Greyjoy,attacker_commander_1_Walder Frey,defender_commander_1_Amory Lorch,defender_commander_1_Asha Greyjoy,defender_commander_1_Beric Dondarrion,defender_commander_1_Bran Stark,defender_commander_1_Brynden Tully,defender_commander_1_Clement Piper,defender_commander_1_Dagmer Cleftjaw,defender_commander_1_Edmure Tully,defender_commander_1_Gilbert Farring,defender_commander_1_Jaime Lannister,defender_commander_1_Jason Mallister,defender_commander_1_Lord Andros Brax,defender_commander_1_Lyman Darry,defender_commander_1_Randyll Tarly,defender_commander_1_Renly Baratheon,defender_commander_1_Robb Stark,defender_commander_1_Rodrik Cassel,defender_commander_1_Rolland Storm,defender_commander_1_Rolph Spicer,defender_commander_1_Roose Bolton,defender_commander_1_Stafford Lannister,defender_commander_1_Tyrion Lannister,defender_commander_1_Tytos Blackwood,defender_commander_1_Tywin Lannister,defender_commander_1_Unknown,defender_commander_1_Vargo Hoat
0,298,1,0,15000,4000,1,11000,19000,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,298,1,0,7713,120,1,7593,7833,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,298,0,1,15000,10000,1,5000,25000,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,298,1,1,18000,20000,1,-2000,38000,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,298,1,1,1875,6000,1,-4125,7875,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
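
To make the encoding step easier to follow, here is a minimal sketch of `pd.get_dummies` on a three-row frame. The frame `sample` is made up for illustration; it only reuses two column names and a few values from the battles table.

```python
import pandas as pd

sample = pd.DataFrame({
    'attacker_size': [15000, 7713, 1875],
    'battle_type': ['pitched battle', 'ambush', 'ambush'],
})

# Only the listed columns are expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(sample, columns=['battle_type'])
dummy_columns = [c for c in encoded.columns if c.startswith('battle_type_')]

print(list(encoded.columns))
```

Each category becomes one indicator column named `<column>_<value>`, and every row has exactly one indicator set per encoded column. Depending on the pandas version, the indicators are stored as 0/1 integers or as booleans.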


# Model generation

In [10]:
x_battles.shape

(36, 124)

Since we have little data (only 36 rows), we will use cross-validation.

In [11]:
from sklearn.model_selection import cross_validate
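
With so few rows, each validation fold is very small. The sketch below illustrates this under two assumptions: that `cross_validate` is called with its default of 5 folds (stratified for classifiers), and a made-up target vector `y` whose class balance is not meant to match the real data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up target with 36 entries; the real class balance is not assumed here.
y = np.array([1] * 30 + [0] * 6)
X = np.zeros((36, 1))  # placeholder features, only the row count matters

# For a classifier, cross_validate with its default cv uses stratified folds.
cv = StratifiedKFold(n_splits=5)
test_sizes = [len(test_idx) for _, test_idx in cv.split(X, y)]

print(test_sizes)
```

Each test fold holds only 7 or 8 rows, so a single misclassified battle shifts a fold's accuracy by more than ten percentage points. The averaged accuracies reported later should be read with that granularity in mind.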

The algorithms used will be:
1. Gaussian Naive Bayes
2. KNN
3. Decision Tree
4. Random Forest

We will use Optuna to select the hyperparameters.

In [12]:
import optuna
from optuna.visualization import plot_param_importances

In [13]:
np.random.seed(10)
NUMBER_OF_TRIALS = 50

In [14]:
# logging
import mlflow
import warnings

mlflow.sklearn.autolog()
warnings.filterwarnings("ignore")

## Gaussian Naive Bayes

In [15]:
from sklearn.naive_bayes import GaussianNB

In [16]:
experiment_id = mlflow.create_experiment(name='gaussianNB')

def gaussianNB(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'var_smoothing': trial.suggest_float('var_smoothing', 1e-10, 1e-08)}

    model = GaussianNB(var_smoothing=params['var_smoothing'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [17]:
study_GNB = optuna.create_study(direction='maximize')
study_GNB.optimize(gaussianNB, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-16 21:31:07,244] A new study created in memory with name: no-name-84700337-9dda-4b8e-b3f6-d6914debb856
[I 2021-08-16 21:31:09,082] Trial 0 finished with value: 0.7285714285714285 and parameters: {'var_smoothing': 3.805268071110081e-09}. Best is trial 0 with value: 0.7285714285714285.
[I 2021-08-16 21:31:10,923] Trial 1 finished with value: 0.7571428571428572 and parameters: {'var_smoothing': 3.130158451488061e-09}. Best is trial 1 with value: 0.7571428571428572.
[I 2021-08-16 21:31:12,824] Trial 2 finished with value: 0.7571428571428572 and parameters: {'var_smoothing': 3.122840903404032e-09}. Best is trial 1 with value: 0.7571428571428572.
[I 2021-08-16 21:31:14,598] Trial 3 finished with value: 0.7285714285714285 and parameters: {'var_smoothing': 2.151558345779655e-09}. Best is trial 1 with value: 0.7571428571428572.
[I 2021-08-16 21:31:16,566] Trial 4 finished with value: 0.7571428571428571 and para

In [18]:
print('Best hyperparameters:\t', study_GNB.best_trial.params)
print('Training accuracy:\t', study_GNB.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_GNB.best_trial.value)

Best hyperparameters:	 {'var_smoothing': 1.2627252715048782e-10}
Training accuracy:	 1.0
Test accuracy:		 0.8357142857142857
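
It may help to see what `var_smoothing` does in absolute terms. In scikit-learn, the value is multiplied by the largest feature variance and the result is added to every variance; the fitted model exposes that amount as `epsilon_`. The sketch below uses made-up data (a troop-size-like column and a 0/1 indicator), not the battles table.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up features: one troop-size-like column and one 0/1 indicator.
X = np.array([[15000, 1], [7713, 0], [18000, 1], [1875, 0]], dtype=float)
y = np.array([1, 1, 0, 1])

var_smoothing = 1e-9
model = GaussianNB(var_smoothing=var_smoothing).fit(X, y)

# Absolute amount added to each variance: var_smoothing * largest feature variance.
expected_epsilon = var_smoothing * X.var(axis=0).max()
print(model.epsilon_, expected_epsilon)
```

Because columns such as `attacker_size` are in the thousands, the largest variance is large, so even a very small `var_smoothing` can add a noticeable amount to the variance of the 0/1 indicator columns. That may be one reason the accuracy is sensitive to this hyperparameter.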


In [19]:
optuna.visualization.plot_optimization_history(study_GNB)

## KNN


In [20]:
from sklearn.neighbors import KNeighborsClassifier

In [21]:
experiment_id = mlflow.create_experiment(name='knn')

def knn(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'n_neighbors': trial.suggest_int('n_neighbors', 1, 28),
              'algorithm': trial.suggest_categorical('algorithm', ['ball_tree', 'kd_tree', 'brute']),
              'p': trial.suggest_int('p', 1, 2)}

    model = KNeighborsClassifier(n_neighbors=params['n_neighbors'], algorithm=params['algorithm'], p=params['p'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [22]:
study_KNN = optuna.create_study(direction='maximize')
study_KNN.optimize(knn, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-16 21:32:39,457] A new study created in memory with name: no-name-7ee938eb-3f1f-40c4-b872-feafdf7f8d7c
[I 2021-08-16 21:32:41,391] Trial 0 finished with value: 0.8607142857142858 and parameters: {'n_neighbors': 20, 'algorithm': 'brute', 'p': 1}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-16 21:32:45,736] Trial 1 finished with value: 0.8321428571428571 and parameters: {'n_neighbors': 6, 'algorithm': 'ball_tree', 'p': 1}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-16 21:32:47,627] Trial 2 finished with value: 0.8607142857142858 and parameters: {'n_neighbors': 16, 'algorithm': 'kd_tree', 'p': 2}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-16 21:32:49,682] Trial 3 finished with value: 0.8357142857142857 and parameters: {'n_neighbors': 3, 'algorithm': 'brute', 'p': 2}. Best is trial 0 with value: 0.8607142857142858.
[I 2021-08-16 21:32:51,546] Trial 4 finished

In [23]:
print('Best hyperparameters:\t', study_KNN.best_trial.params)
print('Training accuracy:\t', study_KNN.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_KNN.best_trial.value)

Best hyperparameters:	 {'n_neighbors': 8, 'algorithm': 'brute', 'p': 1}
Training accuracy:	 0.8679802955665024
Test accuracy:		 0.8857142857142858
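
The `p` hyperparameter selects the order of the Minkowski distance used to find neighbors: `p=1` is the Manhattan distance and `p=2` the Euclidean distance. As a small worked example, the sketch below computes both for the `attacker_size` and `defender_size` values of the first two battles shown by `battles.head()`; the helper `minkowski` is written here for illustration and is not part of scikit-learn's API.

```python
import numpy as np

# attacker_size and defender_size of the first two battles shown by battles.head().
battle_a = np.array([15000, 4000])
battle_b = np.array([7713, 120])

def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors."""
    return float((np.abs(u - v) ** p).sum() ** (1 / p))

manhattan = minkowski(battle_a, battle_b, p=1)
euclidean = minkowski(battle_a, battle_b, p=2)
print(manhattan, euclidean)
```

Both distances are in the thousands, while two rows can differ by at most 1 in any one-hot column. On unscaled data, the troop-size columns therefore likely dominate the neighbor search.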


In [24]:
optuna.visualization.plot_optimization_history(study_KNN)

In [25]:
plot_param_importances(study_KNN)



## Decision Tree


In [26]:
from sklearn.tree import DecisionTreeClassifier

In [27]:
experiment_id = mlflow.create_experiment(name='decision_tree')

def decision_tree(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
              'max_features': trial.suggest_int('max_features', 1, len(x_battles.columns))}

    model = DecisionTreeClassifier(criterion=params['criterion'], max_features=params['max_features'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [28]:
study_DT = optuna.create_study(direction='maximize')
study_DT.optimize(decision_tree, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-16 21:34:19,107] A new study created in memory with name: no-name-5fc5e203-ac5c-400c-85b8-571ffcf6fb85
[I 2021-08-16 21:34:21,010] Trial 0 finished with value: 0.5821428571428572 and parameters: {'criterion': 'entropy', 'max_features': 65}. Best is trial 0 with value: 0.5821428571428572.
[I 2021-08-16 21:34:25,501] Trial 1 finished with value: 0.8857142857142858 and parameters: {'criterion': 'gini', 'max_features': 38}. Best is trial 1 with value: 0.8857142857142858.
[I 2021-08-16 21:34:27,385] Trial 2 finished with value: 0.7142857142857143 and parameters: {'criterion': 'entropy', 'max_features': 27}. Best is trial 1 with value: 0.8857142857142858.
[I 2021-08-16 21:34:29,291] Trial 3 finished with value: 0.7535714285714286 and parameters: {'criterion': 'entropy', 'max_features': 64}. Best is trial 1 with value: 0.8857142857142858.
[I 2021-08-16 21:34:31,408] Trial 4 finished with value: 0.667857142857

In [29]:
print('Best hyperparameters:\t', study_DT.best_trial.params)
print('Training accuracy:\t', study_DT.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_DT.best_trial.value)

Best hyperparameters:	 {'criterion': 'gini', 'max_features': 26}
Training accuracy:	 1.0
Test accuracy:		 0.9714285714285713


In [30]:
optuna.visualization.plot_optimization_history(study_DT)

In [31]:
plot_param_importances(study_DT)



## Random Forest


In [32]:
from sklearn.ensemble import RandomForestClassifier

In [33]:
experiment_id = mlflow.create_experiment(name='random_forest')

def random_forest(trial):
  with mlflow.start_run(experiment_id=experiment_id):
    # hyperparameter definition
    params = {'n_estimators': trial.suggest_int('n_estimators', 50, 150),
              'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
              'max_features': trial.suggest_int('max_features', 1, len(x_battles.columns))}

    model = RandomForestClassifier(n_estimators=params['n_estimators'], criterion=params['criterion'], max_features=params['max_features'])

    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # store the training accuracy on the trial
    trial.set_user_attr('train_accuracy', train_accuracy)

    # logging
    mlflow.log_params(params)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

  return test_accuracy

In [34]:
study_RF = optuna.create_study(direction='maximize')
study_RF.optimize(random_forest, n_trials=NUMBER_OF_TRIALS)

[I 2021-08-16 21:35:59,396] A new study created in memory with name: no-name-06084c55-193d-47c6-9c62-cbe378f6ccf1
[I 2021-08-16 21:36:03,226] Trial 0 finished with value: 0.8321428571428571 and parameters: {'n_estimators': 114, 'criterion': 'gini', 'max_features': 36}. Best is trial 0 with value: 0.8321428571428571.
[I 2021-08-16 21:36:05,890] Trial 1 finished with value: 0.8571428571428571 and parameters: {'n_estimators': 50, 'criterion': 'gini', 'max_features': 123}. Best is trial 1 with value: 0.8571428571428571.
[I 2021-08-16 21:36:09,086] Trial 2 finished with value: 0.8321428571428571 and parameters: {'n_estimators': 82, 'criterion': 'entropy', 'max_features': 63}. Best is trial 1 with value: 0.8571428571428571.
[I 2021-08-16 21:36:12,756] Trial 3 finished with value: 0.8607142857142858 and parameters: {'n_estimators': 108, 'criterion': 'entropy', 'max_features': 8}. Best is trial 3 with value: 0.8607142857142858.


In [35]:
print('Best hyperparameters:\t', study_RF.best_trial.params)
print('Training accuracy:\t', study_RF.best_trial.user_attrs['train_accuracy'])
print('Test accuracy:\t\t', study_RF.best_trial.value)

Best hyperparameters:	 {'n_estimators': 122, 'criterion': 'gini', 'max_features': 26}
Training accuracy:	 1.0
Test accuracy:		 0.9142857142857143


In [36]:
optuna.visualization.plot_optimization_history(study_RF)

In [37]:
plot_param_importances(study_RF)



# Choosing the algorithm

In [38]:
# best model for each algorithm:
best_GNB = GaussianNB(var_smoothing=study_GNB.best_trial.params['var_smoothing'])
best_KNN = KNeighborsClassifier(n_neighbors=study_KNN.best_trial.params['n_neighbors'], algorithm=study_KNN.best_trial.params['algorithm'], p=study_KNN.best_trial.params['p'])
best_DT = DecisionTreeClassifier(criterion=study_DT.best_trial.params['criterion'], max_features=study_DT.best_trial.params['max_features'])
# pass the tuned max_features as well, so the model matches the best trial
best_RF = RandomForestClassifier(n_estimators=study_RF.best_trial.params['n_estimators'], criterion=study_RF.best_trial.params['criterion'], max_features=study_RF.best_trial.params['max_features'])

best_models = [best_GNB, best_KNN, best_DT, best_RF]
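
When the Optuna parameter names match the estimator's constructor arguments, as they do in this notebook, the best parameters can be unpacked directly. The sketch below uses a plain dictionary as a stand-in for `study_RF.best_trial.params`, with the values taken from the Random Forest output printed earlier; the names `best_params` and `best_rf` are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

# Stand-in for study_RF.best_trial.params; values taken from the printed RF result.
best_params = {'n_estimators': 122, 'criterion': 'gini', 'max_features': 26}

# Unpacking passes every tuned hyperparameter, so none can be left out by accident.
best_rf = RandomForestClassifier(**best_params)
print(best_rf.get_params()['max_features'])
```

This keeps the model definition in sync with the search space: adding a hyperparameter to the study does not require touching the line that rebuilds the best model.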

In [39]:
# find the best algorithm

experiment_id = mlflow.create_experiment(name='best_algorithm')

best_model = best_models[0]
best_train_accuracy = 0
best_test_accuracy = 0

for model in best_models:
  with mlflow.start_run(experiment_id=experiment_id):
    # training and evaluation with cross-validation
    scores = cross_validate(model, x_battles, y_battles,
                            scoring='accuracy', return_train_score=True)

    # accuracies
    test_accuracies = scores['test_score']
    test_accuracy = test_accuracies.mean()
    train_accuracies = scores['train_score']
    train_accuracy = train_accuracies.mean()

    # logging
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()

    # comparison
    if test_accuracy > best_test_accuracy:
      best_test_accuracy = test_accuracy
      best_train_accuracy = train_accuracy
      best_model = model

print('Best model:\t', best_model)
print('Training accuracy:\t', best_train_accuracy)
print('Test accuracy:\t\t', best_test_accuracy)

Best model:	 KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=8, p=1,
                     weights='uniform')
Training accuracy:	 0.8679802955665024
Test accuracy:		 0.8857142857142858
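
This re-evaluation selects KNN at about 0.886, even though the Decision Tree study reported a best test accuracy of about 0.971. One plausible explanation (an assumption, not something verified in this notebook) is that the tree-based models subsample features at random when `max_features` is below the number of columns, and no `random_state` is fixed, so each cross-validation run can score differently. The sketch below shows on synthetic data that fixing `random_state` makes repeated evaluations identical.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same number of rows as the battles table.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 10))
y = np.array([0, 1] * 18)

# Fixing random_state makes the feature subsampling reproducible.
model = DecisionTreeClassifier(max_features=3, random_state=0)
first = cross_validate(model, X, y, scoring='accuracy')['test_score']
second = cross_validate(model, X, y, scoring='accuracy')['test_score']

print(np.array_equal(first, second))
```

Passing the same `random_state` inside the Optuna objective and when rebuilding the best models would make the study results and this final comparison directly comparable.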
