**Table of contents**<a id='toc0_'></a>    
- [Imports](#toc1_)    
- [Read Data](#toc2_)    
- [Data Preprocessing](#toc3_)
  - [Drop rows with missing values](#toc3_2_)    
  - [Removing Categorical Columns](#toc4_)    
  - [Split Train and Test Data](#toc5_)    
  - [Data Cleaning](#toc6_)    
    - [Impute missing numeric data](#toc6_1_)    
  - [Data Normalization](#toc7_)    
- [Model training](#toc8_)    
  - [KNN](#toc8_1_)  
  - [LVQ](#toc8_2_)
  - [Decision Tree](#toc8_3_)  
  - [MLP](#toc8_4_)
  - [SVM](#toc8_5_)  
  - [Stacking](#toc8_6_)  
  - [Random Forest](#toc8_7_)  

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

----

# <a id='toc1_'></a>[Imports](#toc0_)

In [25]:
import numpy as np
from matplotlib import pyplot as plt
from math import sqrt
from tqdm import tqdm

import pandas as pd
pd.options.display.max_colwidth = 1000
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 200

import sklearn
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate,train_test_split, GridSearchCV
from sklvq import GLVQ
from sklearn.preprocessing import LabelEncoder,MinMaxScaler,StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC

from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score, make_scorer
import random
from random import seed,randrange
import requests
import io
import pickle

In [26]:
%pip install -U git+https://github.com/rickvanveen/sklvq.git

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/rickvanveen/sklvq.git
  Cloning https://github.com/rickvanveen/sklvq.git to c:\users\pichau\appdata\local\temp\pip-req-build-8fam6gx1
Note: you may need to restart the kernel to use updated packages.


  Running command git clone --filter=blob:none --quiet https://github.com/rickvanveen/sklvq.git 'C:\Users\pichau\AppData\Local\Temp\pip-req-build-8fam6gx1'
  fatal: unable to access 'https://github.com/rickvanveen/sklvq.git/': Could not resolve host: github.com
  error: subprocess-exited-with-error
  
  × git clone --filter=blob:none --quiet https://github.com/rickvanveen/sklvq.git 'C:\Users\pichau\AppData\Local\Temp\pip-req-build-8fam6gx1' did not run successfully.
  │ exit code: 128
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.


# <a id='toc2_'></a>[Read Data](#toc0_)

In [34]:
# Downloading the csv file from your GitHub account

url = "https://raw.githubusercontent.com/Zuluke/Projetos-AM/main/spotify_activity/dataset.csv" # Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

df = pd.read_csv(io.StringIO(download.decode('utf-8')))

## <a id='toc2_1_'></a>[Visualize Data](#toc0_)

In [35]:
print(df.shape)
print('\n')
df.info()
print('\n')
df.head()

(114000, 21)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  liven

| | Unnamed: 0 | track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.461 | 1 | -6.746 | 0 | 0.143 | 0.0322 | 1e-06 | 0.358 | 0.715 | 87.917 | 4 | acoustic |
| 1 | 1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.42 | 0.166 | 1 | -17.235 | 1 | 0.0763 | 0.924 | 6e-06 | 0.101 | 0.267 | 77.489 | 4 | acoustic |
| 2 | 2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.359 | 0 | -9.734 | 1 | 0.0557 | 0.21 | 0.0 | 0.117 | 0.12 | 76.332 | 4 | acoustic |
| 3 | 3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Soundtrack) | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.905 | 7.1e-05 | 0.132 | 0.143 | 181.74 | 3 | acoustic |
| 4 | 4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.443 | 2 | -9.681 | 1 | 0.0526 | 0.469 | 0.0 | 0.0829 | 0.167 | 119.949 | 4 | acoustic |


# <a id='toc3_'></a>[Data Preprocessing](#toc0_)

## <a id='toc3_2_'></a>[Drop rows with missing values](#toc0_)

In [36]:
df.dropna(inplace=True, axis=0, how='any')

## <a id='toc4_'></a>[Removing Categorical Columns](#toc0_)

In [37]:
categorical_columns = ["Unnamed: 0", "track_id", "track_name", "album_name", "artists"]
df = df.drop(categorical_columns, axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 113999 entries, 0 to 113999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   popularity        113999 non-null  int64  
 1   duration_ms       113999 non-null  int64  
 2   explicit          113999 non-null  bool   
 3   danceability      113999 non-null  float64
 4   energy            113999 non-null  float64
 5   key               113999 non-null  int64  
 6   loudness          113999 non-null  float64
 7   mode              113999 non-null  int64  
 8   speechiness       113999 non-null  float64
 9   acousticness      113999 non-null  float64
 10  instrumentalness  113999 non-null  float64
 11  liveness          113999 non-null  float64
 12  valence           113999 non-null  float64
 13  tempo             113999 non-null  float64
 14  time_signature    113999 non-null  int64  
 15  track_genre       113999 non-null  object 
dtypes: bool(1), float64(9), i

## <a id='toc5_'></a>[Split Train and Test Data](#toc0_)

In [38]:
def train_validation_test_split(df, target_column, validation_size=0.1, test_size=0.1, random_state=42):
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=random_state, stratify=df[target_column])
    
    df_train, df_validation = train_test_split(df_train,
                                               test_size=validation_size/(1 - test_size),
                                               random_state=random_state,
                                               stratify=df_train[target_column])
    return df_train, df_validation, df_test  
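
The second split in `train_validation_test_split` divides `validation_size` by `(1 - test_size)` because it operates on the rows left over after the test set has been removed. The following is a minimal sketch of that arithmetic in plain Python; `effective_split` is a hypothetical helper introduced only for this illustration and does not call scikit-learn.

```python
def effective_split(validation_size: float, test_size: float) -> dict:
    """Fractions of the full dataset that end up in each subset."""
    # First split: `test_size` of all rows goes to the test set.
    remaining = 1.0 - test_size
    # Second split: this fraction of the remaining rows goes to validation.
    second_split_fraction = validation_size / remaining
    validation = remaining * second_split_fraction
    train = remaining - validation
    return {
        "train": train,
        "validation": validation,
        "test": test_size,
        "second_split_fraction": second_split_fraction,
    }


# Same sizes as the call used in this notebook (0.2 validation, 0.2 test).
fractions = effective_split(validation_size=0.2, test_size=0.2)
print(fractions)
```

With these sizes the second split takes a quarter of the remaining 80% of the rows, so validation ends up with roughly 20% of the full dataset and training with roughly 60%.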

In [39]:
df_train, df_validation, df_test = train_validation_test_split(df, "track_genre",0.2, 0.2)
df.info()

# Check that the split proportions match the requested sizes (train, test, validation)
print('\n', len(df_train)/len(df), len(df_test)/len(df), len(df_validation)/len(df))

<class 'pandas.core.frame.DataFrame'>
Index: 113999 entries, 0 to 113999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   popularity        113999 non-null  int64  
 1   duration_ms       113999 non-null  int64  
 2   explicit          113999 non-null  bool   
 3   danceability      113999 non-null  float64
 4   energy            113999 non-null  float64
 5   key               113999 non-null  int64  
 6   loudness          113999 non-null  float64
 7   mode              113999 non-null  int64  
 8   speechiness       113999 non-null  float64
 9   acousticness      113999 non-null  float64
 10  instrumentalness  113999 non-null  float64
 11  liveness          113999 non-null  float64
 12  valence           113999 non-null  float64
 13  tempo             113999 non-null  float64
 14  time_signature    113999 non-null  int64  
 15  track_genre       113999 non-null  object 
dtypes: bool(1), float64(9), i

## <a id='toc6_'></a>[Data Cleaning](#toc0_)

### <a id='toc6_1_'></a>[Impute missing numeric data](#toc0_)

In [40]:
numeric_columns = df_train.select_dtypes(include=['number']).columns

numeric_imputer = SimpleImputer(strategy='median')
numeric_imputer.fit(df_train[numeric_columns])

df_train[numeric_columns] = numeric_imputer.transform(df_train[numeric_columns])
df_validation[numeric_columns] = numeric_imputer.transform(df_validation[numeric_columns])
df_test[numeric_columns] = numeric_imputer.transform(df_test[numeric_columns])
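
The imputer is fitted on the training split only and then applied to all three splits, so no statistics leak from validation or test data. Since rows with missing values were already dropped earlier, this step is expected to leave the data unchanged here. The sketch below illustrates the fit/transform idea in plain Python; `fit_median` and `impute` are hypothetical helpers, not the `SimpleImputer` implementation, and `None` stands in for a missing value.

```python
from statistics import median


def fit_median(values):
    """Learn the fill value from the observed (non-missing) training values."""
    return median(v for v in values if v is not None)


def impute(values, fill):
    """Replace missing entries with the fill value learned on the training data."""
    return [fill if v is None else v for v in values]


# Toy column with one missing value in each split (made-up numbers).
train_column = [3.0, None, 1.0, 10.0]
test_column = [None, 4.0]

fill_value = fit_median(train_column)        # learned from the training split only
train_imputed = impute(train_column, fill_value)
test_imputed = impute(test_column, fill_value)  # reuses the training median
print(fill_value, train_imputed, test_imputed)
```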

## <a id='toc7_'></a>[Data Normalization](#toc0_)

In [41]:
normalizer = MinMaxScaler()

normalizer.fit(df_train[numeric_columns])

df_train[numeric_columns] = normalizer.transform(df_train[numeric_columns])
df_validation[numeric_columns] = normalizer.transform(df_validation[numeric_columns])
df_test[numeric_columns] = normalizer.transform(df_test[numeric_columns])
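
Min-max scaling maps each feature to `(x - min) / (max - min)`, where `min` and `max` come from the training split. A consequence is that validation or test values outside the training range fall outside `[0, 1]`. The following plain-Python sketch shows this for a single feature; the helper names and the toy numbers are assumptions made for illustration.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)


def min_max_transform(values, lo, hi):
    """Scale values with the training minimum and maximum."""
    span = hi - lo
    # Guard against a constant feature, where the span would be zero.
    return [(v - lo) / span if span else 0.0 for v in values]


train_feature = [50.0, 100.0, 200.0]
validation_feature = [25.0, 250.0]  # both values lie outside the training range

lo, hi = fit_min_max(train_feature)
train_scaled = min_max_transform(train_feature, lo, hi)
validation_scaled = min_max_transform(validation_feature, lo, hi)
print(train_scaled)
print(validation_scaled)
```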

In [42]:
# Split the data into features and class labels
df_cara_train = df_train[numeric_columns].values  # features
df_clas_train = df_train['track_genre'].values  # class labels

df_cara_validation = df_validation[numeric_columns].values  # features
df_clas_validation = df_validation['track_genre'].values  # class labels

df_cara_test = df_test[numeric_columns].values  # features
df_clas_test = df_test['track_genre'].values  # class labels

# <a id='toc8_'></a>[Model training](#toc0_)

## <a id='toc8_1_'></a>[KNN](#toc0_)

In [43]:
df_cara_train_scaled = df_cara_train
df_cara_valid_scaled = df_cara_validation
df_cara_test_scaled = df_cara_test

In [44]:
# GridSearchCV clones and fits the estimator itself, so no prior fit is needed
knn = KNeighborsClassifier()
param_grid = {
    'n_neighbors': np.arange(1, 81, 2),
    'metric': ['euclidean', 'manhattan']
}
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='f1_weighted')
grid_search.fit(df_cara_train, df_clas_train)

# Persist the fitted search so it can be reloaded without retraining
with open('KNN_model_searcher.pkl', 'wb') as f:
    pickle.dump(grid_search, f)

print(grid_search.cv_results_['mean_test_score'], '\n\n')
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}', '\n\n')

best_knn = grid_search.best_estimator_
df_clas_pred = best_knn.predict(df_cara_test)

# Evaluate accuracy, recall, F1 score and precision on each data split.
# The metric values themselves are stored, not scorer objects.
splits = {
    'train': (df_cara_train_scaled, df_clas_train),
    'validation': (df_cara_valid_scaled, df_clas_validation),
    'test': (df_cara_test_scaled, df_clas_test),
}
evaluation = {}
for split_name, (X, y_true) in splits.items():
    y_pred = best_knn.predict(X)  # predict once per split and reuse the result
    evaluation[split_name] = {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall_macro': recall_score(y_true, y_pred, average='macro'),
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
        'precision_macro': precision_score(y_true, y_pred, average='macro', zero_division=0),
    }
    print(f'{split_name}: {evaluation[split_name]}')

[0.1748636  0.15861048 0.16880577 0.17154974 0.17500064 0.17567899
 0.17537008 0.17594873 0.17530662 0.17495474 0.17563676 0.17429404
 0.17423267 0.17333656 0.17353709 0.1732115  0.17287979 0.17233861
 0.17125046 0.17021634 0.1702484  0.16896469 0.16769784 0.16724919
 0.16641829 0.16637438 0.16600326 0.16487446 0.16456805 0.16393318
 0.16359721 0.16302103 0.16188695 0.16125885 0.16108099 0.16073382
 0.16078705 0.15982873 0.1588715  0.15828664 0.18945929 0.17724955
 0.18825781 0.19278721 0.19557491 0.19736664 0.19715771 0.19794149
 0.19734906 0.1991353  0.19896836 0.19766001 0.19836991 0.19772185
 0.19737806 0.19758553 0.19734314 0.19711047 0.19644256 0.19637877
 0.19560007 0.19604323 0.19556347 0.19502116 0.1942498  0.19444474
 0.19418936 0.19393728 0.19386993 0.19266551 0.19213404 0.19144797
 0.19135651 0.19105361 0.19006619 0.18963809 0.19024574 0.19016178
 0.18950826 0.1891294 ] 


Best parameters: {'metric': 'manhattan', 'n_neighbors': 19}
Best score: 0.1991353044549033 
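
The grid above compares the `euclidean` and `manhattan` metrics. As a rough illustration of how the metric enters a k-nearest-neighbours prediction, here is a minimal plain-Python sketch with a majority vote. The toy points, labels, and the `knn_predict` helper are assumptions made for this example; scikit-learn's `KNeighborsClassifier` uses optimized neighbour search rather than this brute-force loop.

```python
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k, distance):
    """Return the majority label among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, already-normalized features with made-up genre labels.
train_X = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0), (0.8, 1.0)]
train_y = ["rock", "rock", "jazz", "jazz", "jazz"]
query = (0.2, 0.1)

pred_euclidean = knn_predict(train_X, train_y, query, k=3, distance=euclidean)
pred_manhattan = knn_predict(train_X, train_y, query, k=3, distance=manhattan)
print(pred_euclidean, pred_manhattan)
```

On this toy data both metrics select the same neighbours; on real data the two metrics can rank neighbours differently, which is why the grid search tries both.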

## <a id='toc8_2_'></a>[LVQ](#toc0_)

In [45]:
df_cara_train_scaled = df_cara_train
df_cara_valid_scaled = df_cara_validation
df_cara_test_scaled = df_cara_test

In [46]:
# Define the LVQ classifier
lvq = GLVQ()

# Parameter dictionary for the grid search
param_grid = {
    "prototype_n_per_class": [1,3],  # Number of prototypes per class
    "distance_type": ["euclidean"],
    "solver_params": [{"max_runs": 5, "step_size": step} for step in [0.1, 0.5]]  # One dictionary per step_size
}

# Custom scorers
scorers = {
    "accuracy": make_scorer(accuracy_score),
    "precision_macro": make_scorer(precision_score, average='macro', zero_division = 0),
    "recall_macro": make_scorer(recall_score, average='macro'),
    "f1_macro": make_scorer(f1_score, average='macro')
}

# Create the GridSearchCV object
grid_search = GridSearchCV(lvq, param_grid, cv=5, scoring=scorers, refit="accuracy")

# Fit the GridSearchCV on the scaled training data
grid_search.fit(df_cara_train_scaled, df_clas_train)

with open('LVQ_model_searcher.pkl', 'wb') as f:
    pickle.dump(grid_search,f)
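
For intuition about what the prototypes and the step size do, the sketch below implements one update of classic LVQ1 in plain Python. This is a simplified relative of the method and not the GLVQ algorithm used above, which optimizes a different cost function; `lvq1_step` and the toy data are hypothetical, and `step_size` here only mirrors the parameter name, not its exact behaviour in `sklvq`.

```python
def lvq1_step(prototypes, prototype_labels, x, y, step_size):
    """Move the prototype nearest to x towards x (same label) or away from x (different label)."""
    nearest = min(
        range(len(prototypes)),
        key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i], x)),
    )
    sign = 1.0 if prototype_labels[nearest] == y else -1.0
    prototypes[nearest] = [
        p + sign * step_size * (v - p) for p, v in zip(prototypes[nearest], x)
    ]
    return nearest


# One prototype per class, analogous in spirit to prototype_n_per_class=1.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
prototype_labels = ["a", "b"]

# A sample of class "a" near the "a" prototype pulls that prototype closer.
lvq1_step(prototypes, prototype_labels, x=[0.2, 0.0], y="a", step_size=0.5)
# A sample of class "a" near the "b" prototype pushes that prototype away.
lvq1_step(prototypes, prototype_labels, x=[0.8, 1.0], y="a", step_size=0.5)
print(prototypes)
```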



In [47]:
# Best parameters found by the grid search
print("Best parameters: ", grid_search.best_params_)

# Best classifier found by the grid search
best_lvq = grid_search.best_estimator_

# Re-create the best configuration reported by the grid search
LVQ_best = GLVQ(distance_type='euclidean', prototype_n_per_class=3, solver_params={"max_runs": 5, "step_size": 0.1})

# Fit on the training split (cross_validate below works on clones and does not reuse this fit)
LVQ_best.fit(df_cara_train_scaled, df_clas_train)

# Scorers passed to cross_validate
evaluation_avg = {
    'accuracy': make_scorer(accuracy_score),
    'recall_macro': make_scorer(recall_score, average='macro'),
    'f1_macro': make_scorer(f1_score, average='macro'),
    'precision_macro': make_scorer(precision_score, average='macro', zero_division = 0)
}

# 5-fold cross-validation on the validation split
cv_results = cross_validate(LVQ_best, df_cara_valid_scaled, df_clas_validation, cv=5, scoring=evaluation_avg)

results_df = pd.DataFrame()
# Results per metric
for metric in evaluation_avg:
    print(f"{metric} per fold: ", cv_results[f'test_{metric}'])
    results_df[f'{metric}_per_fold'] = cv_results[f'test_{metric}']

# Mean of each metric over the folds, in percent
for i in range(4):
    print(np.mean(results_df.values[:,i])*100)

# Evaluate accuracy, recall, F1 score and precision on each data split.
# The metric values themselves are stored, not scorer objects.
splits = {
    'train': (df_cara_train_scaled, df_clas_train),
    'validation': (df_cara_valid_scaled, df_clas_validation),
    'test': (df_cara_test_scaled, df_clas_test),
}
evaluation = {}
for split_name, (X, y_true) in splits.items():
    y_pred = best_lvq.predict(X)  # predict once per split and reuse the result
    evaluation[split_name] = {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall_macro': recall_score(y_true, y_pred, average='macro'),
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
        'precision_macro': precision_score(y_true, y_pred, average='macro', zero_division=0),
    }
    print(f'{split_name}: {evaluation[split_name]}')

Best parameters:  {'distance_type': 'euclidean', 'prototype_n_per_class': 3, 'solver_params': {'max_runs': 5, 'step_size': 0.1}}
accuracy per fold:  [0.12982456 0.12938596 0.13377193 0.13333333 0.13881579]
recall_macro per fold:  [0.12982456 0.12938596 0.13377193 0.13333333 0.13881579]
f1_macro per fold:  [0.1140344  0.11959892 0.11408463 0.11836989 0.12084056]
precision_macro per fold:  [0.12466036 0.13964932 0.14156033 0.1312936  0.13110597]
13.302631578947368
13.302631578947368
11.738567900442213
13.365391757392203
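
The per-fold accuracy and macro recall reported for the LVQ cross-validation are identical. That is what one would expect when every class has the same number of samples in each fold, which a stratified split of a balanced dataset would produce (an assumption here, since the class counts are not shown in this notebook). The sketch below checks the relationship on toy labels in plain Python; the helper functions are written for this illustration and are not scikit-learn's implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recall values."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


# Balanced case: two samples per class.
balanced_true = ["a", "a", "b", "b", "c", "c"]
balanced_pred = ["a", "b", "b", "b", "a", "c"]

# Imbalanced case: four samples of "a", two of "b".
imbalanced_true = ["a", "a", "a", "a", "b", "b"]
imbalanced_pred = ["a", "a", "a", "a", "a", "b"]

print(accuracy(balanced_true, balanced_pred), macro_recall(balanced_true, balanced_pred))
print(accuracy(imbalanced_true, imbalanced_pred), macro_recall(imbalanced_true, imbalanced_pred))
```

In the balanced case the two numbers coincide; in the imbalanced case macro recall weights the small class as heavily as the large one and the numbers diverge.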

## <a id='toc8_3_'></a>[Decision Tree](#toc0_)

## <a id='toc8_4_'></a>[MLP](#toc0_)

In [48]:
df_cara_train_scaled = df_cara_train
df_cara_valid_scaled = df_cara_validation
df_cara_test_scaled = df_cara_test

In [49]:
mlp = MLPClassifier()
param_grid = {
    'hidden_layer_sizes': [(100,)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam'], #, 'sgd'],
    'alpha': [0.0001],# 0.01],
    'learning_rate': ['adaptive'],
    'max_iter': [700]#300,500,
}

grid_search = GridSearchCV(mlp, param_grid, cv=5, scoring='accuracy', verbose=3)
grid_search.fit(df_cara_train, df_clas_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END activation=relu, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.296 total time= 7.4min
[CV 2/5] END activation=relu, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.291 total time= 7.2min
[CV 3/5] END activation=relu, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.310 total time= 6.9min
[CV 4/5] END activation=relu, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.307 total time= 8.0min
[CV 5/5] END activation=relu, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.302 total time= 7.0min
[CV 1/5] END activation=tanh, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.301 total time= 2.9min
[CV 2/5] END activation=tanh, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.299 total time= 2.8min
[CV 3/5] END activation=tanh, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.305 total time= 2.8min
[CV 4/5] END activation=tanh, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.303 total time= 2.8min
[CV 5/5] END activation=tanh, alpha=0.0001, hidden_layer_sizes=(100,), learning_rate=adaptive, max_iter=700, solver=adam;, score=0.305 total time= 2.8min
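
The number of fits reported by the grid search is the number of parameter combinations multiplied by the number of folds. The plain-Python sketch below reproduces that count for the MLP and KNN grids used in this notebook; `count_fits` is a hypothetical helper that only enumerates combinations and does not train anything.

```python
from itertools import product


def count_fits(param_grid, cv):
    """Return (number of candidate combinations, total number of fits across folds)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * cv


# Same grids as in the notebook, written with plain Python lists.
mlp_grid = {
    "hidden_layer_sizes": [(100,)],
    "activation": ["relu", "tanh"],
    "solver": ["adam"],
    "alpha": [0.0001],
    "learning_rate": ["adaptive"],
    "max_iter": [700],
}
knn_grid = {
    "n_neighbors": list(range(1, 81, 2)),
    "metric": ["euclidean", "manhattan"],
}

mlp_counts = count_fits(mlp_grid, cv=5)
knn_counts = count_fits(knn_grid, cv=5)
print(mlp_counts, knn_counts)
```

Because every added value multiplies the total, widening a grid for a slow model such as the MLP quickly becomes expensive.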




In [50]:
with open('MLP_model_searcher.pkl', 'wb') as f:
    pickle.dump(grid_search, f)
print(grid_search.cv_results_['mean_test_score'])
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')

[0.3010278  0.30273837]
Best parameters: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'max_iter': 700, 'solver': 'adam'}
Best score: 0.30273837281631594


In [51]:
# Predict the labels of the test data
best_mlp = grid_search.best_estimator_
df_clas_pred = best_mlp.predict(df_cara_test)

# Evaluate accuracy, recall, F1 score and precision on each data split.
# The metric values themselves are stored, not scorer objects.
splits = {
    'train': (df_cara_train_scaled, df_clas_train),
    'validation': (df_cara_valid_scaled, df_clas_validation),
    'test': (df_cara_test_scaled, df_clas_test),
}
evaluation = {}
for split_name, (X, y_true) in splits.items():
    y_pred = best_mlp.predict(X)  # predict once per split and reuse the result
    evaluation[split_name] = {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall_macro': recall_score(y_true, y_pred, average='macro'),
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
        'precision_macro': precision_score(y_true, y_pred, average='macro', zero_division=0),
    }
    print(f'{split_name}: {evaluation[split_name]}')


## <a id='toc8_5_'></a>[SVM](#toc0_)

In [52]:
# GridSearchCV fits its own clones, so the estimator is left unfitted here
class_svm = SVC()
### CAUTION WHEN RUNNING THE CELLS BELOW

In [53]:
lista_kernels=['rbf']
lista_c =[100]
lista_gamma = [2]

# Dictionary of hyperparameters and the values to be tested
param_grid = {'kernel': lista_kernels,'C': lista_c, 'gamma':lista_gamma}
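
For the `rbf` kernel, `gamma` controls how quickly similarity decays with distance: the kernel value is `exp(-gamma * ||x - y||^2)`. The following plain-Python sketch evaluates that formula for a few toy points so the effect of `gamma=2` can be seen directly. The `rbf_kernel` helper and the sample points are assumptions made for this illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


origin = (0.0, 0.0)
near = (0.1, 0.0)
far = (1.0, 0.0)

same_point = rbf_kernel(origin, origin, gamma=2)
near_value = rbf_kernel(origin, near, gamma=2)
far_value = rbf_kernel(origin, far, gamma=2)
print(same_point, near_value, far_value)
```

Larger `gamma` values make the kernel more local, so each training point influences a smaller neighbourhood.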

In [54]:
grid_search = GridSearchCV(class_svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(df_cara_train,df_clas_train)

with open('SVM_model_searcher.pkl', 'wb') as f:
    pickle.dump(grid_search,f)
print(grid_search.cv_results_['mean_test_score'])
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
# Previously tried grid and its recorded result:
#lista_kernels=['linear','rbf']
#lista_c =[2,3,4,5,7,10,100]
#lista_gamma = [2,3,4,5,7,10,100]
# Best parameters: {'C': 100, 'gamma': 2, 'kernel': 'rbf'}
# Best score: 0.26281478175137607

[0.2449159]
Best parameters: {'C': 100, 'gamma': 2, 'kernel': 'rbf'}
Best score: 0.24491589853230442


In [55]:
# Predict the labels of the test data
best_svm = grid_search.best_estimator_
df_clas_pred = best_svm.predict(df_cara_test)

# Evaluate accuracy, recall, F1 score and precision on each data split.
# The metric values themselves are stored, not scorer objects.
splits = {
    'train': (df_cara_train_scaled, df_clas_train),
    'validation': (df_cara_valid_scaled, df_clas_validation),
    'test': (df_cara_test_scaled, df_clas_test),
}
evaluation = {}
for split_name, (X, y_true) in splits.items():
    y_pred = best_svm.predict(X)  # predict once per split and reuse the result
    evaluation[split_name] = {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall_macro': recall_score(y_true, y_pred, average='macro'),
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
        'precision_macro': precision_score(y_true, y_pred, average='macro', zero_division=0),
    }
    print(f'{split_name}: {evaluation[split_name]}')


## <a id='toc8_6_'></a>[Stacking](#toc0_)

## <a id='toc8_7_'></a>[Random Forest](#toc0_)