<a href="https://colab.research.google.com/github/afss2/Projetos_CD/blob/main/Projeto_TAGDI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install fancyimpute


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Project 1

## Data collection

In [None]:
# Importing the players dataset from the FIFA 20 database, with 20 columns

from google.colab import drive
drive.mount('/content/gdrive')
df = pd.read_csv('gdrive/MyDrive/ProjetoTAGDI/players_20.csv')

df.dataframeName = 'players_20.csv'

len(df)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


18278

## Preprocessing

Type definitions

In [None]:
# First, we check the column types and recast any that are not correct.

df.dtypes

In [None]:
# Now assigning the new types

df['nationality'] = df['nationality'].astype('category')
df['club'] = df['club'].astype('category')
df['player_positions'] = df['player_positions'].astype('category')
df['dob'] = df['dob'].astype('datetime64[ns]')

In [None]:
# Replacing the values in the dataframe with their codes and saving the categories in a dict beforehand, so that the original values can be looked up later

nat = dict(enumerate(df['nationality'].cat.categories))
df['nationality'] = df['nationality'].cat.codes

d = dict(enumerate(df['club'].cat.categories))
df['club'] = df['club'].cat.codes
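The encode-and-keep-a-lookup step above can be sketched without pandas. This is a minimal illustration, assuming categories are ordered alphabetically (the default ordering pandas uses for string categories); the helper `encode_categories` and the sample club list are hypothetical.

```python
def encode_categories(values):
    """Return (codes, code_to_label) for a list of string labels."""
    # Sorted unique labels play the role of `.cat.categories`.
    categories = sorted(set(values))
    label_to_code = {label: code for code, label in enumerate(categories)}
    codes = [label_to_code[v] for v in values]
    # Same shape as dict(enumerate(categories)) in the cell above.
    code_to_label = dict(enumerate(categories))
    return codes, code_to_label


clubs = ["Real Madrid", "Ajax", "Real Madrid", "Chelsea"]  # hypothetical sample
codes, lookup = encode_categories(clubs)
decoded = [lookup[c] for c in codes]  # round trip back to the labels
print(codes, decoded)
```

Keeping the lookup dict is what later allows an integer code to be turned back into a club or nationality name.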

In [None]:
# After the type changes, the dataframe looks like this
df.dtypes

Handling missing data

In [None]:
# With the types adjusted, we check whether any data is missing:
print(df.isnull().sum())

In [None]:
# Exactly 2036 players are missing the attributes 'pace', 'shooting', 'passing', 'dribbling', 'defending' and 'physic', so we suspected they had something in common.
# From domain knowledge, we suspected they were goalkeepers. We tested the hypothesis and confirmed it.
dfs = df[(df['pace'].isnull()) & (df['shooting'].isnull()) & (df['passing'].isnull()) & (df['dribbling'].isnull()) & (df['defending'].isnull()) & (df['physic'].isnull()) & (df['player_positions'] == 'GK')]

print(len(dfs))


In [None]:
# We decided to fill the goalkeepers' empty columns with each column's median. We preferred the median over the mean to reduce the influence of outliers.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions deprecate.
for col in ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']:
    df[col] = df[col].fillna(df[col].median())
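As a small illustration of why the median was preferred over the mean, the sketch below fills missing entries in a plain Python list. The values and the helper `fill_with_median` are hypothetical and only mirror the idea of the cell above.

```python
import statistics


def fill_with_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in column]


pace = [90, 70, None, 65, 20, None]  # hypothetical ratings, 20 is an outlier
filled = fill_with_median(pace)
mean_of_observed = statistics.mean(v for v in pace if v is not None)
print(filled, mean_of_observed)
```

Here the single low value pulls the mean down to 61.25, while the median stays at 67.5, closer to the bulk of the observed ratings.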



In [None]:
print(df.isnull().sum())
df.head()

In [None]:
# The column with the most missing data is 'nation_position'; from domain knowledge, we know that not every player plays for their national team.
# Since it has many missing values and the dataframe already has player_positions, we decided to remove 'nation_position'. The line below also drops 'team_position'.
df = df.drop(columns=['nation_position', 'team_position'])
df.head()

In [None]:
# Here we make a few more changes to the dataframe (dropping the short_name and player_positions columns) so that KNN can be used to impute the remaining null values

# Saving the information to add it back later
short_name_column = df['short_name']
player_positions_column = df['player_positions']

df = df.drop(columns=['short_name', 'player_positions'])

# Represent dates of birth as floats (nanoseconds since the epoch) so that every column is numeric
df['dob'] = df['dob'].values.astype("float64")

In [None]:
# Here we use KNN with 3 neighbors to perform the imputation

from fancyimpute import KNN
fit_knn = KNN(k=3).fit_transform(df)

fit_knn.shape

In [None]:
# Here we build the dataframe from the values returned by KNN

imputed_df = pd.DataFrame(
    data=fit_knn,
    columns=['age', 'dob', 'height_cm', 'nationality', 'club', 'overall', 'potential', 'value_eur',
             'wage_eur', 'release_clause_eur', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic'],
)

In [None]:
# And we can see that there are no null values left

print(imputed_df.isnull().sum())

In [None]:
imputed_df.describe()

Normalization and discretization

In [None]:
imputed_df.head()

In [None]:
# Some values, such as date of birth (converted to float), market value, release clause and wage, end up dominating the distance calculation.

dist = np.linalg.norm(imputed_df.values[1]-imputed_df.values[2])
print(dist)

In [None]:
imputed_df_norm = (imputed_df - imputed_df.min()) / (imputed_df.max() - imputed_df.min())
print(imputed_df_norm.head())

In [None]:
# Recomputing the distance after normalization, for the same pair of rows as before
dist = np.linalg.norm(imputed_df_norm.values[1]-imputed_df_norm.values[2])
print(dist)
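The effect of min-max normalization on Euclidean distance can be reproduced on a toy example. This is only a sketch: the three players and the helper `min_max_scale` are hypothetical, and the helper assumes no column is constant (otherwise the division would fail).

```python
import math


def min_max_scale(rows):
    """Scale every column of `rows` to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical players described by (overall, value_eur)
players = [[90, 100_000_000], [60, 99_000_000], [89, 1_000_000]]

raw_dist = math.dist(players[0], players[1])   # dominated by value_eur
scaled = min_max_scale(players)
scaled_dist = math.dist(scaled[0], scaled[1])  # overall now counts as much
print(raw_dist, scaled_dist)
```

Before scaling, the 30-point gap in `overall` is invisible next to a difference of one million euros; after scaling, that same gap accounts for almost all of the distance.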

In [None]:
# Now discretizing some important columns:

imputed_df['age_dist'] = pd.qcut(imputed_df['age'],4)
imputed_df['overall_dist'] = pd.qcut(imputed_df['overall'],4)
imputed_df['potential_dist'] = pd.qcut(imputed_df['potential'],4)
imputed_df['wage_eur_dist'] = pd.qcut(imputed_df['wage_eur'],4)
imputed_df['value_eur_dist'] = pd.qcut(imputed_df['value_eur'],4)


In [None]:
imputed_df['overall'].describe()

In [None]:
imputed_df['overall_dist'].value_counts()
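Quantile-based discretization, as `pd.qcut(..., 4)` does above, assigns each value to one of four bins with roughly equal counts. The sketch below illustrates the idea with the standard library only; it assumes the "inclusive" quantile method and right-closed bins, and the ratings and the helper `quartile_bins` are hypothetical.

```python
import bisect
import statistics


def quartile_bins(values):
    """Return (bin index per value, the three quartile cut points)."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")
    # bisect_left puts a value equal to a cut point into the lower bin,
    # which corresponds to right-closed intervals.
    return [bisect.bisect_left(cuts, v) for v in values], cuts


overall = [50, 55, 60, 62, 66, 70, 75, 80]  # hypothetical ratings
bins, cuts = quartile_bins(overall)
print(cuts, bins)
```

Each of the four bins receives two of the eight values, which is the equal-frequency property that distinguishes quantile binning from equal-width binning.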

Data cleaning

(Univariate)

In [None]:
# Checking that there are no duplicated instances
imputed_df[imputed_df.duplicated()].sort_values("release_clause_eur").head()

In [None]:
imputed_df["release_clause_eur"].describe()

In [None]:
imputed_df["release_clause_eur"].plot.box()

In [None]:
len(imputed_df)

In [None]:
imputed_df["release_clause_eur"].hist()

In [None]:
from numpy import log10
imputed_df['release_clause_eur_log'] = log10(imputed_df['release_clause_eur'])
imputed_df['release_clause_eur_log'].hist()

In [None]:
from numpy import abs
mad = abs(imputed_df['release_clause_eur_log'] - imputed_df['release_clause_eur_log'].median()).median()*(1/0.6745)
print(mad)

In [None]:
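# Mean absolute deviation, for comparison (Series.mad() was removed in pandas 2.0)
(imputed_df['release_clause_eur_log'] - imputed_df['release_clause_eur_log'].mean()).abs().mean()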

In [None]:
(abs(imputed_df['release_clause_eur_log']-imputed_df['release_clause_eur_log'].median())/mad).hist()

In [None]:
len(imputed_df)

In [None]:
imputed_df[abs(imputed_df['release_clause_eur_log']-imputed_df['release_clause_eur_log'].median())/mad > 3.5]

In [None]:
imputed_df = imputed_df[abs(imputed_df['release_clause_eur_log']-imputed_df['release_clause_eur_log'].median())/mad < 3.5].copy()
print(len(imputed_df))
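The univariate filter above is a modified z-score: the absolute deviation from the median is divided by the median absolute deviation, scaled by 1/0.6745, and values with a score of 3.5 or more are discarded. A minimal standard-library sketch follows; the sample values and the helper `modified_z_scores` are hypothetical, and the helper assumes the MAD is not zero.

```python
import statistics


def modified_z_scores(values):
    """Absolute deviation from the median, in units of the scaled MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) / 0.6745
    return [abs(v - med) / mad for v in values]


log_clause = [5.9, 6.0, 6.1, 6.2, 6.3, 9.5]  # hypothetical log10 release clauses
scores = modified_z_scores(log_clause)
kept = [v for v, s in zip(log_clause, scores) if s < 3.5]
print(scores, kept)
```

Because both the center and the spread are medians, the single extreme value barely affects the scores of the other points, which is why this rule is more robust than a mean and standard deviation cutoff.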

(Bivariate)

In [None]:
imputed_df.plot.scatter(x='overall',y='release_clause_eur')

In [None]:
imputed_df['release_clause_per_overall'] = imputed_df['release_clause_eur'] / imputed_df['overall']

In [None]:
imputed_df['release_clause_per_overall'].describe()

In [None]:
imputed_df['release_clause_per_overall'].plot.box()

In [None]:
imputed_df['release_clause_per_overall'].hist()

In [None]:
imputed_df['release_clause_per_overall'] = log10(imputed_df['release_clause_per_overall'])

In [None]:
imputed_df['release_clause_per_overall'].hist()

In [None]:
mad = abs(imputed_df['release_clause_per_overall'] - imputed_df['release_clause_per_overall'].median()).median()*(1/0.6745)

In [None]:
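# Mean absolute deviation, for comparison (Series.mad() was removed in pandas 2.0)
(imputed_df['release_clause_per_overall'] - imputed_df['release_clause_per_overall'].mean()).abs().mean()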

In [None]:
((abs(imputed_df['release_clause_per_overall']-imputed_df['release_clause_per_overall'].median()))/mad).describe()

In [None]:
(abs(imputed_df['release_clause_per_overall']-imputed_df['release_clause_per_overall'].median())/mad).hist()

In [None]:
imputed_df[abs(imputed_df['release_clause_per_overall']-imputed_df['release_clause_per_overall'].median())/mad > 2.5]

In [None]:
imputed_df = imputed_df[abs(imputed_df['release_clause_per_overall']-imputed_df['release_clause_per_overall'].median())/mad < 2.5]
print(len(imputed_df))

(Multivariate)

In [None]:
cleaned_df = imputed_df

cleaned_df = cleaned_df.drop(columns=['age_dist', 'overall_dist', 'potential_dist', 'wage_eur_dist', 'value_eur_dist'])

cleaned_df.head()

In [None]:
from sklearn.covariance import EllipticEnvelope
detector = EllipticEnvelope(contamination=0.01)
detector.fit(cleaned_df)

In [None]:
scores = detector.predict(cleaned_df)

In [None]:
scores

In [None]:
cleaned_df['outlier'] = scores
print(cleaned_df.head())

In [None]:
cleaned_df[cleaned_df['outlier'] == -1]

In [None]:
cleaned_df.head()

In [None]:
len(cleaned_df)

In [None]:
cleaned_df = cleaned_df[cleaned_df['outlier'] != -1].copy()
len(cleaned_df)

In [None]:
# Adding some columns back

cleaned_df.insert(0, 'short_name', short_name_column)
cleaned_df.insert(6, 'player_positions', player_positions_column)
print(cleaned_df)

## Descriptive statistics

In [None]:
cleaned_df['overall'].describe()

In [None]:
cleaned_df['potential'].describe()

In [None]:
print(d)

In [None]:
# Inspect the Real Madrid players

cleaned_df[cleaned_df['club'] == 504].describe()


In [None]:
# Finding the club with the largest number of players

cleaned_df['club'].value_counts().idxmax()



In [None]:
# FC Union Berlin is the club with the most players (it was the first one returned)
cleaned_df[cleaned_df['club'] == 317]

In [None]:
# Now let's find the club with the largest number of players whose overall is above 81.

print(cleaned_df[cleaned_df['overall'] > 81]['club'].value_counts().idxmax())
# Combine both conditions into one mask instead of chaining boolean indexers
print(cleaned_df[(cleaned_df['overall'] > 81) & (cleaned_df['club'] == 225)])

# The club is FC Bayern Munchen, with 2 players

In [None]:
# Let's find the club with the best average overall among players who act as attacking midfielders (CAM)

grouped_df1 = cleaned_df[cleaned_df['player_positions'].str.contains('CAM')].groupby(['club'])['overall'].mean()
print(grouped_df1.idxmax())

# The result is the Uruguay national team


In [None]:
cleaned_df.boxplot(column=['potential'])

In [None]:
cleaned_df[cleaned_df['nationality'] == 18].cov(numeric_only=True)

In [None]:
cleaned_df[cleaned_df['nationality'] == 18].corr(method='pearson', numeric_only=True)

In [None]:
cleaned_df[cleaned_df['nationality'] == 18].corr(method='spearman', numeric_only=True)

In [None]:
cleaned_df.plot.scatter(x='overall',y='potential')

In [None]:
cleaned_df.plot.scatter(x='potential',y='release_clause_eur_log')

In [None]:
cleaned_df.plot.scatter(x='overall',y='shooting')

## Hypothesis tests

In [None]:
print(nat[18])

In [None]:
cleaned_df.head()

In [None]:
cleaned_df['nationality'] 

In [None]:
cleaned_df.dtypes

In [None]:
print(nat)
codes = cleaned_df['nationality']
print(codes)
cleaned_df['nationality'] = codes.map(nat)
# print(ad)

In [None]:
cleaned_df.head(15)

In [None]:
cleaned_df[cleaned_df['nationality'] == 'Brazil']['release_clause_eur'].plot(kind='hist')

In [None]:
stats.shapiro(cleaned_df[cleaned_df['nationality'] == 'Brazil']['release_clause_eur'])

In [None]:
cleaned_df[cleaned_df['nationality'] != 'Brazil']['release_clause_eur'].plot(kind='hist')

In [None]:
stats.shapiro(cleaned_df[cleaned_df['nationality'] != 'Brazil']['release_clause_eur'])
# Again, we see that the data do not follow a normal distribution

In [None]:
stats.mannwhitneyu(cleaned_df[cleaned_df['nationality'] == 'Brazil']['release_clause_eur'], cleaned_df[cleaned_df['nationality'] != 'Brazil']['release_clause_eur'], alternative ='greater')
# The hypothesis test indicates that Brazilian players tend to have a more expensive release clause than players of other nationalities (Mann-Whitney U compares the two distributions rather than their means)

In [None]:
cleaned_df[cleaned_df['nationality'] == 'Brazil']['overall'].plot(kind='hist')

In [None]:


stats.shapiro(cleaned_df[cleaned_df['nationality'] == 'Brazil']['overall'])

In [None]:
cleaned_df[cleaned_df['nationality'] == 'Argentina']['overall'].plot(kind='hist')

In [None]:
stats.shapiro(cleaned_df[cleaned_df['nationality'] == 'Argentina']['overall'])
# Although the histograms resemble normal distributions, the test shows that they are not

In [None]:
import statistics
# We will test the hypothesis that Brazilian players (code 18) have, on average, a higher overall than Argentine players (code 5)
print(statistics.mean(cleaned_df[cleaned_df['nationality'] == 'Brazil']['overall']))
print(statistics.mean(cleaned_df[cleaned_df['nationality'] == 'Argentina']['overall']))

In [None]:
from scipy import stats
# The hypothesis test indicates that Brazilian players do tend to have a higher overall than Argentine players
stats.mannwhitneyu(cleaned_df[cleaned_df['nationality'] == 'Brazil']['overall'], cleaned_df[cleaned_df['nationality'] == 'Argentina']['overall'], alternative ='greater')
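To make the test above less of a black box, the sketch below computes the U statistic on which the Mann-Whitney test is based: the number of cross-group pairs in which the first group has the larger value, with ties counted as half. The ratings and the helper `mann_whitney_u` are hypothetical, and no p-value is computed here.

```python
def mann_whitney_u(x, y):
    """U statistic for x: pairs where x wins, ties counted as half."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


brazil = [70, 72, 75, 80]     # hypothetical overall ratings
argentina = [68, 70, 71, 74]  # hypothetical overall ratings

u = mann_whitney_u(brazil, argentina)
# Share of pairs in which the first group comes out ahead
prob_superiority = u / (len(brazil) * len(argentina))
print(u, prob_superiority)
```

A share well above 0.5 suggests that values in the first group tend to be larger, which is the kind of evidence the one-sided `alternative='greater'` test evaluates.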

# Project 2

For this second part, we use this dataset to classify players into 4 different positions: forward, midfielder, defender and goalkeeper.
We do not reuse the dataset as modified in the first project, because the data cleaning step (outlier removal) discarded some important data.

In [None]:
# Importing the players dataset from the FIFA 20 database

from google.colab import drive
drive.mount('/content/gdrive')
raw_df = pd.read_csv('gdrive/MyDrive/ProjetoTAGDI/players_20_2.csv')

raw_df.dataframeName = 'players_20_2.csv'

len(raw_df)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


18278

## Choosing a column for the classification task


We now map player positions according to the zone of the pitch in which they play. There are 15 possible positions in the dataset, and each player may have more than one. We map them so that each player has a single position among `forward`, `midfielder`, `defender` and `goalkeeper`.

In [None]:
df = raw_df.copy()  # work on a copy so raw_df keeps the original position strings

def combine_positions(row):
    # There are 15 different positions and each player may have more than one
    pos = row['player_positions'].split(', ')
    N = len(pos)
    if N < 3:
        # If a player has 2 or fewer positions, only the first one is considered
        pos = pos[0]
        if pos in ['ST', 'LW', 'RW','CF']: #4
            return 0 # Forward
        elif pos in ['CAM', 'LM', 'CM', 'RM', 'CDM']: #5
            return 1 # Midfielder
        elif pos in ['LWB', 'RWB', 'LB', 'CB', 'RB']: #5
            return 2 # Defender
        elif pos in ['GK']: #1
            return 3 # Goalkeeper
    else: # If the player has more than 2 positions
        position_count = [0, 0, 0, 0]
        # In this loop we count how often each zone appears among the player's positions
        # and assign the player to the most frequent zone (ties go to the lowest index)
        for p in pos:
            if p in ['ST', 'LW', 'RW','CF']: #4
                index = 0 # Forward
            elif p in ['CAM', 'LM', 'CM', 'RM', 'CDM']: #5
                index = 1 # Midfielder
            elif p in ['LWB', 'RWB', 'LB', 'CB', 'RB']: #5
                index = 2 # Defender
            elif p in ['GK']: #1
                index = 3 # Goalkeeper
            else:
                continue
            position_count[index] += 1

        return position_count.index(max(position_count))

df['player_positions'] = df.apply(combine_positions, axis=1)
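The mapping rule can be tried on a few position strings without loading the dataset. The sketch below restates the same rule with a lookup table; the names `ZONES`, `ZONE_OF` and `zone_of_player` are hypothetical and not part of the notebook.

```python
# Zone index -> positions, following the grouping used in the cell above
ZONES = {
    0: ["ST", "LW", "RW", "CF"],            # forward
    1: ["CAM", "LM", "CM", "RM", "CDM"],    # midfielder
    2: ["LWB", "RWB", "LB", "CB", "RB"],    # defender
    3: ["GK"],                              # goalkeeper
}
ZONE_OF = {pos: zone for zone, members in ZONES.items() for pos in members}


def zone_of_player(player_positions):
    """Map a comma-separated position string to a single zone index."""
    pos = player_positions.split(", ")
    if len(pos) < 3:
        # With one or two positions, only the first one counts
        return ZONE_OF.get(pos[0])
    counts = [0, 0, 0, 0]
    for p in pos:
        zone = ZONE_OF.get(p)
        if zone is not None:
            counts[zone] += 1
    # list.index returns the first maximum, so ties favor the lowest index
    return counts.index(max(counts))


examples = ["RW, CF, ST", "CAM, CM", "GK", "RB, RM, RWB", "ST, CM, CB"]
zones = [zone_of_player(e) for e in examples]
print(zones)
```

The last example shows the tie-breaking behavior: with one position in each of three zones, the player is labeled a forward simply because that zone has the lowest index.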

In [None]:
# Restricting the dataset to the attributes that matter for classification:

df = df[['skill_moves', 'player_positions', 'attacking_crossing', 'attacking_finishing',
         'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys',
         'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
         'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 
         'movement_agility', 'movement_reactions', 'movement_balance', 'power_shot_power',
         'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots',
         'mentality_aggression', 'mentality_interceptions', 'mentality_positioning',
         'mentality_vision', 'mentality_penalties', 'mentality_composure',
         'defending_marking', 'defending_standing_tackle', 'defending_sliding_tackle',
         'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking',
         'goalkeeping_positioning', 'goalkeeping_reflexes']]

Checking for missing attributes

In [None]:
df.isnull().sum()

skill_moves                   0
player_positions              0
attacking_crossing            0
attacking_finishing           0
attacking_heading_accuracy    0
attacking_short_passing       0
attacking_volleys             0
skill_dribbling               0
skill_curve                   0
skill_fk_accuracy             0
skill_long_passing            0
skill_ball_control            0
movement_acceleration         0
movement_sprint_speed         0
movement_agility              0
movement_reactions            0
movement_balance              0
power_shot_power              0
power_jumping                 0
power_stamina                 0
power_strength                0
power_long_shots              0
mentality_aggression          0
mentality_interceptions       0
mentality_positioning         0
mentality_vision              0
mentality_penalties           0
mentality_composure           0
defending_marking             0
defending_standing_tackle     0
defending_sliding_tackle      0
goalkeep

## Splitting the data into training, validation and test sets

Removing the `player_positions` column, which is the classification target, and splitting the data into training (60%), validation (20%) and test (20%) sets

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(["player_positions"],axis = 1)
y = df.player_positions

# Split the data to 60-20-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
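The two calls yield a 60-20-20 split because the second one takes 25% of the remaining 80%. The arithmetic can be checked as below, assuming the held-out size is rounded up, which is consistent with the support of 3656 reported for the test set later in the notebook.

```python
import math

n_samples = 18278  # row count printed when the CSV was loaded

# First split: 20% held out for testing
n_test = math.ceil(0.2 * n_samples)
n_remaining = n_samples - n_test

# Second split: 25% of the remainder, i.e. 0.25 * 0.8 = 20% of the total
n_val = math.ceil(0.25 * n_remaining)
n_train = n_remaining - n_val

fractions = [round(n / n_samples, 3) for n in (n_train, n_val, n_test)]
print(n_train, n_val, n_test, fractions)
```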


## Adding MLflow + algorithms used

In [None]:
!pip install mlflow --quiet
!pip install pyngrok --quiet
!pip install hyperopt --quiet



Configuring MLflow to run on Google Colab. Since we are not running locally, this configuration is needed to view the UI

In [None]:
import mlflow

with mlflow.start_run(experiment_id = 1, run_name="MLflow on Colab"):
  mlflow.log_metric("m1", 2.0)
  mlflow.log_param("p1", "mlflow-colab")

get_ipython().system_raw("mlflow ui --port 5000 &")

from pyngrok import ngrok

ngrok.kill()

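# Read the ngrok token from the environment instead of hard-coding a secret in the notebook
# (assumes the NGROK_AUTH_TOKEN environment variable has been set beforehand)
NGROK_AUTH_TOKEN = os.environ["NGROK_AUTH_TOKEN"]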
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

MLflow Tracking UI: https://5421-34-74-21-189.ngrok.io


In [None]:
from hyperopt import fmin, hp, tpe, STATUS_OK
from hyperopt.pyll import scope
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import math

## KNN



The first algorithm is KNN. We tune three hyperparameters: the number of neighbors, the weight function and the distance metric (the power `p` of the Minkowski distance).
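To show what these three hyperparameters control, here is a small pure-Python sketch of a nearest-neighbor vote. It is an illustration only, not scikit-learn's implementation: the helpers `minkowski` and `knn_predict`, the toy points and the small constant that avoids division by zero are all assumptions made for the example.

```python
from collections import defaultdict


def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, n_neighbors, weights, p):
    """Predict the label of `query` by voting among its nearest neighbors."""
    nearest = sorted(
        (minkowski(x, query, p), label) for x, label in zip(train_X, train_y)
    )[:n_neighbors]
    votes = defaultdict(float)
    for dist, label in nearest:
        if weights == "uniform":
            votes[label] += 1.0                    # every neighbor counts equally
        else:
            votes[label] += 1.0 / (dist + 1e-12)   # closer neighbors count more
    return max(votes, key=votes.get)


X = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]]
y = [0, 0, 1, 1, 1]
query = [1.5, 1.0]

pred_uniform = knn_predict(X, y, query, 5, "uniform", 2)
pred_distance = knn_predict(X, y, query, 5, "distance", 2)
print(pred_uniform, pred_distance)
```

With all five points voting equally, the majority class wins even though its points are far away; with distance weighting, the two nearby points outweigh the three distant ones.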

In [None]:
def train(params):
  with mlflow.start_run(experiment_id = 1, run_name='KNN'):
    neighbors = params['n_neighbors']
    weights = params['weights']
    p = params['p']

    neigh = KNeighborsClassifier(n_neighbors=neighbors, weights=weights, p=p)
    neigh.fit(X_train, y_train)
    predictions = neigh.predict(X_val)
    score = neigh.score(X_val, y_val)

    # Hyperparameters to be logged by MLflow
    mlflow.log_param("n_neighbors", neighbors)
    mlflow.log_param("weights", weights)
    mlflow.log_param("p", p)

    reports = classification_report(list(y_val), predictions,output_dict=True)
    precision = reports["weighted avg"]["precision"]
    recall = reports["weighted avg"]["recall"]
    f1score = reports["weighted avg"]["f1-score"]
    support = reports["weighted avg"]["support"]

    # Metrics to be logged by MLflow
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Recall", recall)
    mlflow.log_metric("F1-Score", f1score)
    mlflow.log_metric("Support", support)
    return {'loss': -score, 'status': STATUS_OK}

search_space = {  # range of hyperparameter values explored during tuning
  'n_neighbors': scope.int(hp.quniform('n_neighbors', 1, 14, q=1)),
  'weights': hp.choice('weights', ['uniform','distance']),
  'p': hp.choice('p', [1,2]),

}

algo=tpe.suggest

best_hyperparameters = fmin(
fn=train,
space=search_space,
algo=algo,
max_evals=60)

100%|██████████| 60/60 [05:15<00:00,  5.26s/it, best loss: -0.8864879649890591]


In [None]:
print(best_hyperparameters)

neighbors = math.floor(best_hyperparameters['n_neighbors'])

# hyperopt returns parameters chosen through hp.choice as their position in the list of options,
# so we need an if to map each index back to its value.
if (best_hyperparameters['weights'] == 0):
  weights = 'uniform'
else:
  weights = 'distance'

if (best_hyperparameters['p'] == 0):
  p = 1
else:
  p = 2

best_knn = KNeighborsClassifier(n_neighbors=neighbors, weights=weights, p=p)
best_knn.fit(X_train, y_train)
predictions = best_knn.predict(X_test)
score = best_knn.score(X_test, y_test)

print(classification_report(list(y_test), predictions))

{'n_neighbors': 14.0, 'p': 0, 'weights': 1}
              precision    recall  f1-score   support

           0       0.87      0.76      0.81       720
           1       0.82      0.89      0.85      1352
           2       0.94      0.91      0.93      1189
           3       1.00      1.00      1.00       395

    accuracy                           0.88      3656
   macro avg       0.90      0.89      0.90      3656
weighted avg       0.88      0.88      0.88      3656
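The index-to-value mapping can also be written once as a lookup instead of one `if` per parameter. The sketch below is a hypothetical helper applied to the values printed above; hyperopt also provides `space_eval(space, best)`, which performs this lookup against the original search space.

```python
# Options in the same order as they were passed to hp.choice
KNN_CHOICES = {
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}


def decode_choices(best, choices):
    """Replace hp.choice indices in `best` with the options they point to."""
    decoded = dict(best)
    for name, options in choices.items():
        decoded[name] = options[best[name]]
    return decoded


best = {"n_neighbors": 14.0, "p": 0, "weights": 1}  # as printed by the cell above
params = decode_choices(best, KNN_CHOICES)
params["n_neighbors"] = int(params["n_neighbors"])  # quniform returns a float
print(params)
```

So the best configuration found uses 14 neighbors, Manhattan distance (`p=1`) and distance-based weights.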



## Decision Tree

The second algorithm is the Decision Tree, again with three hyperparameters: `max_depth`, the maximum depth of the tree; `splitter`, the strategy used to choose the split at each node; and `criterion`, the function used to measure the quality of a split.
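The two `criterion` options searched below, `gini` and `entropy`, are both impurity measures: they are zero when a node contains a single class and largest when the classes are evenly mixed. A minimal sketch of the two formulas follows; the helper names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of samples belonging to each class present in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


pure = [3, 3, 3, 3]    # a node holding only goalkeepers
mixed = [0, 0, 1, 1]   # a node split evenly between two classes
print(gini(pure), entropy(pure), gini(mixed), entropy(mixed))
```

A split is considered good when it lowers the impurity of the child nodes relative to the parent, whichever of the two measures is used.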

In [None]:
def train(params):
  with mlflow.start_run(experiment_id = 1, run_name='DecTree'):
    max_depth = params['max_depth']
    criterion = params['criterion']
    splitter = params['splitter']

    dec = DecisionTreeClassifier(max_depth=max_depth, criterion=criterion, splitter=splitter)
    dec.fit(X_train, y_train)
    predictions = dec.predict(X_val)
    score = dec.score(X_val, y_val)

    # Hyperparameters to be logged by MLflow
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("criterion", criterion)
    mlflow.log_param("splitter", splitter)

    reports = classification_report(list(y_val), predictions,output_dict=True)
    precision = reports["weighted avg"]["precision"]
    recall = reports["weighted avg"]["recall"]
    f1score = reports["weighted avg"]["f1-score"]
    support = reports["weighted avg"]["support"]

    # Metrics to be logged by MLflow
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Recall", recall)
    mlflow.log_metric("F1-Score", f1score)
    mlflow.log_metric("Support", support)
    return {'loss': -score, 'status': STATUS_OK}

search_space = {  # range of hyperparameter values explored during tuning
  'max_depth': scope.int(hp.quniform('max_depth', 3, 15, q=1)),
  'criterion': hp.choice('criterion', ['gini', 'entropy'],),
  'splitter': hp.choice('splitter', ['best', 'random'],),

}

algo=tpe.suggest

best_hyperparameters = fmin(
fn=train,
space=search_space,
algo=algo,
max_evals=60)

100%|██████████| 60/60 [00:08<00:00,  6.82it/s, best loss: -0.8583150984682714]


In [None]:
print(best_hyperparameters)

max_depth = math.floor(best_hyperparameters['max_depth'])
if (best_hyperparameters['criterion'] == 0):
  criterion = 'gini'
else:
  criterion = 'entropy'

if (best_hyperparameters['splitter'] == 0):
  splitter = 'best'
else:
  splitter = 'random'

best_dec = DecisionTreeClassifier(max_depth=max_depth, criterion=criterion, splitter=splitter)
best_dec.fit(X_train, y_train)
predictions = best_dec.predict(X_test)
score = best_dec.score(X_test, y_test)

print(classification_report(list(y_test), predictions))

{'criterion': 0, 'max_depth': 8.0, 'splitter': 0}
              precision    recall  f1-score   support

           0       0.85      0.74      0.79       720
           1       0.78      0.84      0.81      1352
           2       0.89      0.89      0.89      1189
           3       1.00      1.00      1.00       395

    accuracy                           0.85      3656
   macro avg       0.88      0.87      0.87      3656
weighted avg       0.85      0.85      0.85      3656



## SVM / SVC

As the third classifier, we chose SVC (C-Support Vector Classification). Because the dataset is small we were able to use it, but for large datasets it becomes very costly and impractical.
As hyperparameters, we chose the kernel and C. The kernel is the function that measures similarity between samples, implicitly mapping the data into a higher-dimensional space, and C is the regularization parameter.

In [None]:
def train(params):
  with mlflow.start_run(experiment_id = 1, run_name='SVM'):
    kernel = params['kernel']
    C = params['C']

    # Hyperparameters to be logged by MLflow
    mlflow.log_param("kernel", kernel)
    mlflow.log_param("C", C)

    svm = SVC(kernel=kernel, C=C)
    svm.fit(X_train, y_train)

    predictions = svm.predict(X_val)
    score = svm.score(X_val, y_val)

    reports = classification_report(list(y_val), predictions,output_dict=True)
    precision = reports["weighted avg"]["precision"]
    recall = reports["weighted avg"]["recall"]
    f1score = reports["weighted avg"]["f1-score"]
    support = reports["weighted avg"]["support"]

    # Metrics to be logged by MLflow
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Recall", recall)
    mlflow.log_metric("F1-Score", f1score)
    mlflow.log_metric("Support", support)
    return {'loss': -score, 'status': STATUS_OK}

search_space = {  # range of hyperparameter values explored during tuning
  'C': hp.choice('C', np.logspace(-2, 3, 13)),
  'kernel': hp.choice('kernel', ['rbf', 'sigmoid']),
}

algo=tpe.suggest

best_hyperparameters = fmin(
fn=train,
space=search_space,
algo=algo,
max_evals=50)

100%|██████████| 50/50 [07:22<00:00,  8.85s/it, best loss: -0.9026258205689278]


In [None]:
print(best_hyperparameters)

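# hp.choice returns the index of the chosen option, so look the value up in the same grid used in the search space
C = np.logspace(-2, 3, 13)[best_hyperparameters['C']]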
if (best_hyperparameters['kernel'] == 0):
  kernel = 'rbf'
else:
  kernel = 'sigmoid'

best_svm = SVC(kernel=kernel, C=C)
best_svm.fit(X_train, y_train)

predictions = best_svm.predict(X_test)
score = best_svm.score(X_test, y_test)
  
print(classification_report(list(y_test), predictions))

{'C': 7, 'kernel': 0}
              precision    recall  f1-score   support

           0       0.89      0.76      0.82       720
           1       0.83      0.90      0.86      1352
           2       0.94      0.94      0.94      1189
           3       1.00      1.00      1.00       395

    accuracy                           0.89      3656
   macro avg       0.91      0.90      0.90      3656
weighted avg       0.90      0.89      0.89      3656
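Since `C` was sampled with `hp.choice` over `np.logspace(-2, 3, 13)`, the printed `'C': 7` is a position in that grid rather than the value of C itself. The sketch below rebuilds the grid with the standard library to show which value that position corresponds to; the helper `log_grid` is hypothetical.

```python
import math


def log_grid(start_exp, stop_exp, num):
    """Values evenly spaced on a log10 scale, like np.logspace(start, stop, num)."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


c_grid = log_grid(-2, 3, 13)
chosen_c = c_grid[7]  # index reported by hyperopt for C
print([round(c, 3) for c in c_grid])
print(round(chosen_c, 3))
```

The grid runs from 0.01 to 1000, and position 7 corresponds to a C of roughly 8.25.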



## Regressão Logística

As the fourth classifier, we used logistic regression (which, despite its name, is a classification method). As hyperparameters, we chose the solver and, once again, C. The solver is the algorithm used for the optimization problem and C is the inverse of the regularization strength.

In [None]:
def train(params):
  with mlflow.start_run(experiment_id = 1, run_name='REGL'):
    solver = params['solver']
    C = params['C']

    lr = LogisticRegression(solver=solver, C=C, max_iter = 200)
    lr.fit(X_train, y_train)

    predictions = lr.predict(X_val)
    score = lr.score(X_val, y_val)

    # Hyperparameters to be logged by MLflow
    mlflow.log_param("solver", solver)
    mlflow.log_param("C", C)

    reports = classification_report(list(y_val), predictions,output_dict=True)
    precision = reports["weighted avg"]["precision"]
    recall = reports["weighted avg"]["recall"]
    f1score = reports["weighted avg"]["f1-score"]
    support = reports["weighted avg"]["support"]

    # Metrics to be logged by MLflow
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Recall", recall)
    mlflow.log_metric("F1-Score", f1score)
    mlflow.log_metric("Support", support)
    return {'loss': -score, 'status': STATUS_OK}

search_space = {  # range of hyperparameter values explored during tuning
  'C': hp.choice('C', np.logspace(-2, 3, 13)),
  'solver': hp.choice('solver', ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']),
}

algo=tpe.suggest

best_hyperparameters = fmin(
fn=train,
space=search_space,
algo=algo,
max_evals=50)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



100%|██████████| 50/50 [04:16<00:00,  5.13s/it, best loss: -0.8886761487964989]


In [None]:
print(best_hyperparameters)

# For hp.choice parameters, fmin returns the *index* of the selected option,
# so map each index back to the corresponding value of the search space.
C_values = np.logspace(-2, 3, 13)
solvers = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

C = C_values[best_hyperparameters['C']]
solver = solvers[best_hyperparameters['solver']]

# Retrain with the selected hyperparameters and evaluate on the test set
best_lr = LogisticRegression(solver=solver, C=C, max_iter=200)
best_lr.fit(X_train, y_train)

predictions = best_lr.predict(X_test)
score = best_lr.score(X_test, y_test)

print(classification_report(list(y_test), predictions))

{'C': 7, 'solver': 4}
              precision    recall  f1-score   support

           0       0.86      0.80      0.83       720
           1       0.83      0.87      0.85      1352
           2       0.92      0.92      0.92      1189
           3       1.00      1.00      1.00       395

    accuracy                           0.88      3656
   macro avg       0.90      0.90      0.90      3656
weighted avg       0.89      0.88      0.88      3656
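
When a search space is built with `hp.choice`, the dictionary returned by `fmin` holds the index of each selected option rather than the option itself. The sketch below decodes the dictionary printed above using only the standard library. The helper `decode_choices` is hypothetical, and the list comprehension is assumed to reproduce the grid `np.logspace(-2, 3, 13)` from the search space.

```python
# Same grid as np.logspace(-2, 3, 13): 13 values evenly spaced in log10 from 1e-2 to 1e3.
c_grid = [10 ** (-2 + 5 * i / 12) for i in range(13)]
solvers = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']


def decode_choices(best, choices):
    """Map the indices returned for hp.choice parameters back to their values."""
    return {name: choices[name][index] for name, index in best.items()}


best = {'C': 7, 'solver': 4}  # the dictionary printed above
decoded = decode_choices(best, {'C': c_grid, 'solver': solvers})
print(decoded['solver'], round(decoded['C'], 2))
```

Under this reading, index 4 corresponds to the `saga` solver and index 7 to a C of roughly 8.25, not 7. Hyperopt also provides `space_eval(space, best)`, which performs the same decoding directly from the search space.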





## Classifier Comparison / Diagnosis

With every classifier trained on its best hyperparameters, we obtained the following results on the test set (weighted averages of precision, recall, and f1-score, plus the total number of samples):

| Classifier          | Precision | Recall | F1-score | Samples |
|---------------------|-----------|--------|----------|---------|
| KNN                 | 0.88      | 0.88   | 0.88     | 3656    |
| Decision Tree       | 0.85      | 0.85   | 0.85     | 3656    |
| SVC                 | 0.90      | 0.89   | 0.89     | 3656    |
| Logistic Regression | 0.89      | 0.88   | 0.88     | 3656    |








Of these models, SVC performed best, with a slight edge over logistic regression (1 percentage point ahead on every metric) and KNN (2 percentage points ahead in precision and 1 in the other metrics).

SVC looks for the hyperplane that best separates the data, maximizing the margin (distance) between the hyperplane and the closest observations (the support vectors).
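
As a small illustration of the margin, the sketch below computes the distance from a few points to a hyperplane `w·x + b = 0` using the formula `|w·x + b| / ||w||`. The weights, bias, and points are made-up values chosen for easy arithmetic, not anything learned from the dataset, and `distance_to_hyperplane` is a hypothetical helper.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w·x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))


# Made-up hyperplane and points (illustrative only)
w, b = [3.0, 4.0], -5.0
points = [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]]

distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)  # the closest point plays the role of a support vector
print(distances, margin)
```

An SVM chooses `w` and `b` so that this smallest distance is as large as possible, subject to the classes being separated (exactly or, with a soft margin, approximately).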

For multiclass problems, SVC reduces the task to multiple binary classification problems (one-vs-one), for example:

* Forward x Midfielder
* Forward x Goalkeeper
* Midfielder x Goalkeeper...
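
The one-vs-one decomposition can be sketched with the standard library alone. The class ids 0 to 3 are the ones shown in the classification reports; the pairwise outcomes and the `ovo_vote` helper are hypothetical, and the simple majority vote is a simplification of how scikit-learn combines the pairwise decisions.

```python
from collections import Counter
from itertools import combinations

classes = [0, 1, 2, 3]  # class ids as they appear in the classification reports

# One binary classifier per unordered pair of classes: n * (n - 1) / 2 in total
pairs = list(combinations(classes, 2))
print(len(pairs), pairs)


def ovo_vote(pairwise_winners):
    """Return the class that wins the most pairwise duels."""
    return Counter(pairwise_winners).most_common(1)[0][0]


# Hypothetical winners for one sample, in the same order as `pairs`:
# (0,1)->1, (0,2)->2, (0,3)->0, (1,2)->1, (1,3)->1, (2,3)->2
winners = [1, 2, 0, 1, 1, 2]
predicted = ovo_vote(winners)
print(predicted)
```

With four classes this yields six binary problems, and in this made-up example class 1 wins three of its duels and is the predicted label.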


The worst model was the decision tree, which trailed SVC by 4 to 5 percentage points.
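
The gaps quoted above can be checked with a few lines of Python. The scores are the precision, recall, and f1-score values listed in the comparison; the helper `gap_in_points` is a hypothetical name introduced for this sketch.

```python
# (precision, recall, f1-score) as listed in the comparison above
scores = {
    'KNN': (0.88, 0.88, 0.88),
    'Decision Tree': (0.85, 0.85, 0.85),
    'SVC': (0.90, 0.89, 0.89),
    'Logistic Regression': (0.89, 0.88, 0.88),
}


def gap_in_points(best, other):
    """Difference between two score tuples, in whole percentage points."""
    return tuple(round((b - o) * 100) for b, o in zip(best, other))


gaps = {
    name: gap_in_points(scores['SVC'], values)
    for name, values in scores.items()
    if name != 'SVC'
}
for name, gap in gaps.items():
    print(f"SVC vs {name}: {gap}")
```

Each tuple gives SVC's lead in precision, recall, and f1-score, in that order.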

Regarding training time, the SVC and logistic regression classifiers took considerably longer, as can be seen in MLflow.

In conclusion, even the weakest model still achieves good precision. KNN would also be a reasonable choice if the dataset were even larger and processing time were a criterion for selecting the classifier, since its results are still very good.