# 120 years of the Olympics

Coursework for the Computational Intelligence course.


**Professor:** Carine G. Webber

**Students:**

- Luis Henrique Ziliotto Salamon
- Rafael Bourscheid da Silveira

**Dataset** available on Kaggle: [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)


## Importing dependencies

In [23]:
# Standard library
import csv
from time import time

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, scale



## Step 1: Data input

### Importing the datasets
#### Athlete events
Main dataset. Target class: ```Medal```

In [24]:
athlete_events = pd.read_csv('data/athlete_events.csv', dtype='str')
athlete_events_original = athlete_events.copy()
athlete_events.head()


Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


#### NOC
Three-letter National Olympic Committee code.

In [25]:
noc_regions = pd.read_csv('data/noc_regions.csv', dtype=str)
noc_regions_original = noc_regions.copy()
noc_regions.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,




### Preview

In [26]:
athlete_events.describe()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
count,271116,271116,271116,261642,210945,208241,271116,271116,271116,271116,271116,271116,271116,271116,39783
unique,135571,134732,2,74,95,220,1184,230,51,35,2,42,66,765,3
top,77710,Robert Tait McKenzie,M,23,180,70,United States,USA,2000 Summer,1992,Summer,London,Athletics,Football Men's Football,Gold
freq,58,58,196594,21875,12492,9625,17847,18853,13821,16413,222552,22426,38624,5733,13372


 
 
## Step 2: Sanitization
We will classify on the ```Medal``` column of the ```athlete_events.csv``` file.

Before that, the file needs some light cleanup:
- Some values (weight, height, or age) are unknown, so the whole record is removed.
- Some athletes did not win a medal, so they receive the value "No" to indicate that.
- Unneeded fields are removed.


The code below does this and writes a new CSV in the process.

In [27]:
first_line = True
# newline='' leaves line-ending handling to the csv module
with open('data/athlete_events.csv', 'r', newline='') as csv_original:
    with open('data/athlete_events_sanitized.csv', 'w', newline='') as csv_limpo:
        reader = csv.reader(csv_original, delimiter=",")
        writer = csv.writer(csv_limpo, delimiter=",")
        for row in reader:
            # ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, Medal
            _, _, Sex, Age, Height, Weight, _, NOC, _, Year, Season, _, Sport, Event, Medal = row

            if not first_line:
                # Skip records with unknown age, height, or weight
                if Age == 'NA' or Height == 'NA' or Weight == 'NA':
                    continue

                # Athletes without a medal get the explicit class "No"
                if Medal == 'NA':
                    Medal = 'No'

                # Parse as float first, then truncate to an integer
                Weight = int(float(Weight))

            # The header row is written too, reduced to the kept columns
            new_row = [Sex, Age, Height, Weight, NOC, Year, Season, Sport, Event, Medal]
            writer.writerow(new_row)
            first_line = False

In [28]:
athlete_events_clean = pd.read_csv('data/athlete_events_sanitized.csv', dtype=str)
print(f" original dataset: \t{athlete_events.shape} \n sanitized dataset: \t{athlete_events_clean.shape}")
display(athlete_events_clean.head())
display(athlete_events_clean.describe())

 original dataset: 	(271116, 15) 
 sanitized dataset: 	(206165, 10)


Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event,Medal
0,M,24,180,80,CHN,1992,Summer,Basketball,Basketball Men's Basketball,No
1,M,23,170,60,CHN,2012,Summer,Judo,Judo Men's Extra-Lightweight,No
2,F,21,185,82,NED,1988,Winter,Speed Skating,Speed Skating Women's 500 metres,No
3,F,21,185,82,NED,1988,Winter,Speed Skating,"Speed Skating Women's 1,000 metres",No
4,F,25,185,82,NED,1992,Winter,Speed Skating,Speed Skating Women's 500 metres,No


Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event,Medal
count,206165,206165,206165,206165,206165,206165,206165,206165,206165,206165
unique,2,61,94,143,226,35,2,56,590,4
top,M,23,180,70,USA,2000,Summer,Athletics,Ice Hockey Men's Ice Hockey,No
freq,139454,17743,12184,9563,14214,13682,166706,32374,3825,175984
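
The same three cleanup steps can also be written directly in pandas. The sketch below is an illustrative alternative, not the code that produced the output above. It runs on a small made-up sample (`sample_csv` and its rows are hypothetical) and relies on `read_csv` parsing the string `NA` as a missing value by default.

```python
import io

import pandas as pd

# Hypothetical three-row sample with the same columns as athlete_events.csv
sample_csv = io.StringIO(
    "ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal\n"
    "1,Athlete A,M,24,180,80,Team A,AAA,1992 Summer,1992,Summer,City A,Sport A,Event A,NA\n"
    "2,Athlete B,M,24,NA,NA,Team B,BBB,1920 Summer,1920,Summer,City B,Sport B,Event B,NA\n"
    "3,Athlete C,F,30,170,63.5,Team C,CCC,1900 Summer,1900,Summer,City C,Sport C,Event C,Gold\n"
)

keep = ["Sex", "Age", "Height", "Weight", "NOC", "Year", "Season", "Sport", "Event", "Medal"]

df = pd.read_csv(sample_csv, dtype=str)  # "NA" becomes a missing value

clean = (
    df.dropna(subset=["Age", "Height", "Weight"])  # drop incomplete records
    .assign(
        # Athletes without a medal get the explicit class "No"
        Medal=lambda d: d["Medal"].fillna("No"),
        # Parse as float, then truncate to an integer
        Weight=lambda d: d["Weight"].astype(float).astype(int),
    )[keep]  # keep only the needed columns
    .reset_index(drop=True)
)

print(clean)
```

On this sample the second row is dropped because its height and weight are missing, and the remaining two rows keep ten columns.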




### Converting text fields to numeric
Our perceptron likes numbers, so let's give it numbers.
From this point on the dataset is ready to use. We will work with the ```data``` variable from here on.

In [29]:
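def index_of_dic(dic, key):
    """Return the integer code stored for `key`."""
    return dic[key]

def StrList_to_UniqueIndexList(lista):
    """Replace each string in `lista` with an integer code, one code per distinct string."""
    # Codes follow set iteration order, so they are arbitrary labels, not a ranking
    dic = {g: i for i, g in enumerate(set(lista))}

    return [index_of_dic(dic, p) for p in lista]

def cast_list_int(the_list):
    """Convert every element of `the_list` to int."""
    return [int(x) for x in the_list]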

In [30]:
data = athlete_events_clean.copy()

# Sex, Age, Height, Weight, NOC, Year, Season, Sport, Event, Medal

data['Sex'] = StrList_to_UniqueIndexList(data['Sex'])
data['Age'] = cast_list_int(data['Age'])
data['Height'] = cast_list_int(data['Height'])
data['Weight'] = cast_list_int(data['Weight'])
data['NOC'] = StrList_to_UniqueIndexList(data['NOC'])
data['Year'] = cast_list_int(data['Year'])
data['Season'] = StrList_to_UniqueIndexList(data['Season'])
data['Sport'] = StrList_to_UniqueIndexList(data['Sport'])
data['Event'] = StrList_to_UniqueIndexList(data['Event'])
# data['Medal'] = StrList_to_UniqueIndexList(data['Medal'])

In [31]:
display(data.head())
display(data.describe())

Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event,Medal
0,1,24,180,80,122,1992,1,42,111,No
1,1,23,170,60,122,2012,1,55,349,No
2,0,21,185,82,33,1988,0,44,551,No
3,0,21,185,82,33,1988,0,44,166,No
4,0,25,185,82,33,1992,0,44,551,No


Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event
count,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0
mean,0.676419,25.055509,175.37195,70.686004,96.210264,1989.674678,0.808605,32.577261,286.79384
std,0.467843,5.483096,10.546088,14.339753,62.295254,20.130865,0.3934,16.303681,171.788472
min,0.0,11.0,127.0,25.0,0.0,1896.0,0.0,0.0,0.0
25%,0.0,21.0,168.0,60.0,44.0,1976.0,1.0,20.0,134.0
50%,1.0,24.0,175.0,70.0,88.0,1992.0,1.0,35.0,276.0
75%,1.0,28.0,183.0,79.0,145.0,2006.0,1.0,49.0,439.0
max,1.0,71.0,226.0,214.0,225.0,2016.0,1.0,55.0,589.0
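
Because `StrList_to_UniqueIndexList` takes its codes from iterating a `set`, the number assigned to a given string is arbitrary and, for strings, can change from one Python process to the next. If reproducible codes matter, one option is `pandas.factorize` with `sort=True`. The sketch below uses a made-up list (`nocs` is illustrative, not taken from the dataset) and is not the encoding used for the tables above.

```python
import pandas as pd

# Illustrative values only
nocs = pd.Series(["CHN", "NED", "CHN", "USA", "NED"])

# sort=True assigns codes in sorted order of the distinct values,
# so the mapping is the same on every run
codes, uniques = pd.factorize(nocs, sort=True)

print(codes.tolist())  # [0, 1, 0, 2, 1]
print(list(uniques))   # ['CHN', 'NED', 'USA']
```

Either way the codes are only labels: their numeric order carries no meaning for fields such as `NOC` or `Sport`.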



## Step 3: MLP
### Preparing the training and test data

- ```X```: our features (Sex, Age, Height, Weight, NOC, Year, Season, Sport). The slice ```0:-2``` used below stops before the last two columns, so ```Event``` is not part of ```X```.
- ```y```: our classes (No, Bronze, Silver, Gold)

In [32]:
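# Every column except the last two (Event and Medal)
X = data.iloc[:, 0:-2]
# Last column: Medal
y = data.iloc[:, -1]

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_test)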

[[-1.44538314  0.53727026 -1.64423267 ...  0.11311621  0.48667747
   0.51665644]
 [ 0.69185807  0.53727026  0.62918172 ...  0.90826798  0.48667747
   0.6394444 ]
 [ 0.69185807 -0.92149935 -0.60225107 ... -1.27839937  0.48667747
  -0.8340112 ]
 ...
 [-1.44538314  1.81369367 -0.60225107 ...  0.31190415  0.48667747
   1.13059627]
 [ 0.69185807 -0.37446075 -0.50752547 ... -1.07961143 -2.0547489
  -0.71122323]
 [ 0.69185807 -0.37446075 -0.41279987 ... -1.27839937 -2.0547489
  -1.26376908]]
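
With classes this unbalanced, a plain random split can leave the training and test sets with slightly different class proportions. `train_test_split` accepts a `stratify` argument that keeps them aligned, and `random_state` makes the split repeatable. The sketch below uses synthetic labels with a roughly similar 85/5/5/5 mix. The arrays and the seed are assumptions for illustration, and the results in this notebook were not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, illustrative data: 1000 rows, 8 features, 85% "No"
y_demo = np.array(["No"] * 850 + ["Bronze"] * 50 + ["Silver"] * 50 + ["Gold"] * 50)
X_demo = rng.normal(size=(len(y_demo), 8))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# Share of the majority class in each split
share_train = np.mean(y_tr == "No")
share_test = np.mean(y_te == "No")
print(f"train: {share_train:.3f}  test: {share_test:.3f}")
```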




### Configuring and using the MLP

#### Creating a hyperparameter space
This is used to automate the search for the best settings.


In [19]:
parameter_space = {
    'hidden_layer_sizes': [(10,10,10), (15,15,15,15), (20, 20, 20)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'sgd', 'adam'],
    'alpha': [0.0001, 0.05, 0.1],
    'learning_rate': ['constant','adaptive'],
}
mlp = MLPClassifier(max_iter=250)  
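
To see why an exhaustive search over this space is slow, it helps to count the candidates. The sketch below enumerates the same dictionary with scikit-learn's `ParameterGrid` and multiplies by the three folds (`cv=3`) used in the grid search. The final refit on the full training set is not counted.

```python
from sklearn.model_selection import ParameterGrid

parameter_space = {
    'hidden_layer_sizes': [(10, 10, 10), (15, 15, 15, 15), (20, 20, 20)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'sgd', 'adam'],
    'alpha': [0.0001, 0.05, 0.1],
    'learning_rate': ['constant', 'adaptive'],
}

# One candidate per combination of values: 3 * 3 * 3 * 3 * 2
n_candidates = len(ParameterGrid(parameter_space))

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 162 486
```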

#### Grid Search
Searches for the best combination of the parameters defined above.

**Warning:** this code takes a few hours to run. It is commented out to avoid accidental execution.

In [20]:
# clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
# clf.fit(X_train, y_train)

# print('Best parameters found:\n', clf.best_params_)

# means = clf.cv_results_['mean_test_score']
# stds = clf.cv_results_['std_test_score']
# for mean, std, params in zip(means, stds, clf.cv_results_['params']):
#     print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))


Best parameters found:
 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (15, 15, 15, 15), 'learning_rate': 'constant', 'solver': 'adam'}
0.853 (+/-0.000) for {'activation': 'logistic', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 10, 10), 'learning_rate': 'constant', 'solver': 'lbfgs'}
0.853 (+/-0.000) for {'activation': 'logistic', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 10, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.853 (+/-0.000) for {'activation': 'logistic', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 10, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.853 (+/-0.000) for {'activation': 'logistic', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 10, 10), 'learning_rate': 'adaptive', 'solver': 'lbfgs'}
0.853 (+/-0.000) for {'activation': 'logistic', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 10, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.853 (+/-0.000) for {'activation': 'logistic', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 10, 10), 'learnin



From the output above:
```
Best parameters found:
 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (15, 15, 15, 15), 'learning_rate': 'constant', 'solver': 'adam'}
```

In [45]:
mlp = MLPClassifier(hidden_layer_sizes=(15, 15, 15, 15), max_iter=500, activation='tanh', solver='adam', learning_rate='constant', alpha=0.0001)  
mlp.fit(X_train, y_train.values.ravel()) 
predictions = mlp.predict(X_test)  

print(confusion_matrix(y_test,predictions))  
print(classification_report(y_test,predictions, zero_division=0)) 

[[    5    56  3007     5]
 [    1   180  2809     6]
 [    1    96 52686    16]
 [    3    55  2919     5]]
              precision    recall  f1-score   support

      Bronze       0.50      0.00      0.00      3073
        Gold       0.47      0.06      0.11      2996
          No       0.86      1.00      0.92     52799
      Silver       0.16      0.00      0.00      2982

    accuracy                           0.85     61850
   macro avg       0.49      0.27      0.26     61850
weighted avg       0.79      0.85      0.79     61850



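**Interpreting the result:**

The network predicts ```No``` for almost every athlete: 61,421 of the 61,850 test rows. Those predictions are correct 86% of the time, which is close to the share of non-medalists in the test set (52,799 of 61,850, about 85%).

For the medal classes its guesses are unreliable: recall is 0.06 for Gold and rounds to 0.00 for Bronze and Silver.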
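
The figures in the report can be recomputed from the confusion matrix itself. In the sketch below the matrix is copied from the output above. Rows are the true classes and columns the predicted ones, in the alphabetical label order that `confusion_matrix` uses by default.

```python
import numpy as np

labels = ["Bronze", "Gold", "No", "Silver"]

# Rows: true class, columns: predicted class
cm = np.array([
    [5,  56,  3007,  5],
    [1, 180,  2809,  6],
    [1,  96, 52686, 16],
    [3,  55,  2919,  5],
])

precision = cm.diagonal() / cm.sum(axis=0)  # correct / everything predicted as the class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / everything truly in the class
accuracy = cm.diagonal().sum() / cm.sum()
predicted_no = cm[:, labels.index("No")].sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>6}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
print(f"rows predicted as 'No': {predicted_no} of {cm.sum()}")
```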



### Cross-validation

In [46]:
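# Out-of-fold predictions from 10-fold cross-validation.
# X here is the unscaled DataFrame, not the standardized X_train/X_test.
predictions = cross_val_predict(mlp,X,y,cv=10)

# Cross-validated accuracy in percent (value is not displayed by this cell)
accuracy_score(y,predictions)*100

# Cross-validated confusion matrix (value is not displayed by this cell)
confusion_matrix(y,predictions)

# Per-class precision, recall, and F1 over all out-of-fold predictions
print(classification_report(y,predictions,zero_division=0))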

              precision    recall  f1-score   support

      Bronze       0.00      0.00      0.00     10148
        Gold       0.00      0.00      0.00     10167
          No       0.85      1.00      0.92    175984
      Silver       0.00      0.00      0.00      9866

    accuracy                           0.85    206165
   macro avg       0.21      0.25      0.23    206165
weighted avg       0.73      0.85      0.79    206165
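
Since the cross-validation call works on `X` directly, the scaling fitted earlier on `X_train` is not applied there. One common way to standardize inside each fold is to wrap the scaler and the classifier in a pipeline. The sketch below shows the pattern on a small synthetic dataset from `make_classification`. The data, the reduced layer sizes, the fold count, and the seeds are assumptions chosen to keep it fast, so it does not reproduce the report above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# The scaler is re-fitted on the training part of every fold,
# so no statistics from the held-out part leak into training
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15, 15), max_iter=300, random_state=0),
)

cv_predictions = cross_val_predict(model, X_demo, y_demo, cv=5)
print(cv_predictions.shape)
```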



## Conclusion
The predictions are not very good, and cross-validation confirms it.

Because non-medalists make up about 85% of the records (175,984 of 206,165), the network does not learn to separate medalists from the rest using the available features. It can, however, predict with good accuracy that an athlete will *not* win a medal.
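
To put the 85% accuracy in context, it can be compared with a baseline that always predicts the most frequent class. The sketch below rebuilds a label vector from the class counts in the cross-validation report and scores scikit-learn's `DummyClassifier`. The all-zero feature matrix is a placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class supports taken from the cross-validation report
counts = {"Bronze": 10148, "Gold": 10167, "No": 175984, "Silver": 9866}
y_all = np.repeat(list(counts.keys()), list(counts.values()))

# Placeholder features: the most_frequent strategy never looks at them
X_placeholder = np.zeros((len(y_all), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
y_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_all, y_pred)
baseline_balanced = balanced_accuracy_score(y_all, y_pred)
print(f"accuracy:          {baseline_accuracy:.3f}")
print(f"balanced accuracy: {baseline_balanced:.3f}")
```

Balanced accuracy averages recall over the four classes, so a model that only ever answers `No` scores 0.25 on it, the same macro-average recall shown in the cross-validation report.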