# 120 years of Olympics

Assignment for the Computational Intelligence course.


**Instructor:** Carine G. Webber

**Students:**

- Luis Henrique Ziliotto Salamon
- Rafael Bourscheid da Silveira

**Dataset** available on Kaggle: [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)


## Importing dependencies

In [1]:
from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler  
from sklearn.neural_network import MLPClassifier  
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_predict

import pandas as pd
import csv



## Step 1: Data input

### Importing the datasets
#### Athlete events
Main dataset. Target class: ```Medal```

In [2]:
athlete_events = pd.read_csv('data/athlete_events.csv', dtype='str')
athlete_events_original = athlete_events.copy()
athlete_events.head()


Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


#### NOC
Three-letter code of the National Olympic Committee

In [3]:
noc_regions = pd.read_csv('data/noc_regions.csv', dtype=str)
noc_regions_original = noc_regions.copy()
noc_regions.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,




### Preview

In [4]:
athlete_events.describe()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
count,271116,271116,271116,261642,210945,208241,271116,271116,271116,271116,271116,271116,271116,271116,39783
unique,135571,134732,2,74,95,220,1184,230,51,35,2,42,66,765,3
top,77710,Robert Tait McKenzie,M,23,180,70,United States,USA,2000 Summer,1992,Summer,London,Athletics,Football Men's Football,Gold
freq,58,58,196594,21875,12492,9625,17847,18853,13821,16413,222552,22426,38624,5733,13372


 
 
## Step 2: Sanitization
Classification targets the ```Medal``` column of ```athlete_events.csv```.

Before that, the file needs a little cleanup:
- Records with unknown age, height, or weight are removed entirely.
- Athletes who did not win a medal get the value "No" in ```Medal```.
- Unneeded fields are dropped.
- Weight is truncated to an integer.


The code below does this and writes a new CSV file in the process.

In [5]:
# newline='' is the setting the csv module documentation recommends for file objects.
with open('data/athlete_events.csv', 'r', newline='') as csv_original:
    with open('data/athlete_events_sanitized.csv', 'w', newline='') as csv_limpo:
        reader = csv.reader(csv_original, delimiter=",")
        writer = csv.writer(csv_limpo, delimiter=",")
        first_line = True
        for row in reader:
            # Columns: ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, Medal
            _, _, Sex, Age, Height, Weight, _, NOC, _, Year, Season, _, Sport, Event, Medal = row

            if not first_line:
                # Skip records with unknown age, height, or weight.
                if Age == 'NA' or Height == 'NA' or Weight == 'NA':
                    continue

                # Athletes without a medal get the explicit class 'No'.
                if Medal == 'NA':
                    Medal = 'No'

                # Truncate fractional weights to whole numbers.
                Weight = int(float(Weight))

            # The header row passes through unchanged, minus the dropped columns.
            writer.writerow([Sex, Age, Height, Weight, NOC, Year, Season, Sport, Event, Medal])
            first_line = False
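
The same cleanup rules can also be sketched with pandas. This is only an illustration, not the code used in this notebook: the three inline rows are made-up stand-ins for ```data/athlete_events.csv```, and the names ```raw_csv```, ```keep```, and ```sanitized``` are hypothetical.

```python
import io
import pandas as pd

# Tiny made-up stand-in for data/athlete_events.csv (same columns, hypothetical rows).
raw_csv = io.StringIO(
    "ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal\n"
    "1,A,M,24,180,80,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,NA\n"
    "2,B,M,24,NA,NA,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,NA\n"
    "3,C,F,21,185,82.5,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,Gold\n"
)

# Columns kept after sanitization, in the same order as the loop version.
keep = ["Sex", "Age", "Height", "Weight", "NOC", "Year", "Season", "Sport", "Event", "Medal"]

# Read everything as text; keep_default_na=False leaves the literal "NA" strings untouched.
df = pd.read_csv(raw_csv, dtype=str, keep_default_na=False)

# Drop rows with unknown age, height, or weight.
known = (df[["Age", "Height", "Weight"]] != "NA").all(axis=1)
sanitized = df.loc[known, keep].copy()

# Athletes without a medal get the explicit class "No".
sanitized["Medal"] = sanitized["Medal"].replace("NA", "No")

# Truncate weight to an integer, like int(float(...)) in the loop version.
sanitized["Weight"] = sanitized["Weight"].astype(float).astype(int)

print(sanitized)
```

Working on a DataFrame avoids the manual header handling, at the cost of loading the whole file into memory.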

In [6]:
athlete_events_clean = pd.read_csv('data/athlete_events_sanitized.csv', dtype=str)
print(f" original dataset: \t{athlete_events.shape} \n sanitized dataset: \t{athlete_events_clean.shape}")
display(athlete_events_clean.head())
display(athlete_events_clean.describe())

 original dataset: 	(271116, 15) 
 sanitized dataset: 	(206165, 10)


Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event,Medal
0,M,24,180,80,CHN,1992,Summer,Basketball,Basketball Men's Basketball,No
1,M,23,170,60,CHN,2012,Summer,Judo,Judo Men's Extra-Lightweight,No
2,F,21,185,82,NED,1988,Winter,Speed Skating,Speed Skating Women's 500 metres,No
3,F,21,185,82,NED,1988,Winter,Speed Skating,"Speed Skating Women's 1,000 metres",No
4,F,25,185,82,NED,1992,Winter,Speed Skating,Speed Skating Women's 500 metres,No


Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event,Medal
count,206165,206165,206165,206165,206165,206165,206165,206165,206165,206165
unique,2,61,94,143,226,35,2,56,590,4
top,M,23,180,70,USA,2000,Summer,Athletics,Ice Hockey Men's Ice Hockey,No
freq,139454,17743,12184,9563,14214,13682,166706,32374,3825,175984




### Converting text fields to numeric
Our Perceptron likes numbers, so let's give it numbers.
From this point on the dataset is ready to use. We will work with the ```data``` variable from here on.

In [7]:
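def index_of_dic(dic, key):
    # Look up the integer code assigned to a label.
    return dic[key]

def StrList_to_UniqueIndexList(lista):
    # Replace each distinct string with an integer code.
    # Codes follow the set's iteration order, which for strings can differ between Python runs.
    group = set(lista)
    
    dic = {}
    for i, g in enumerate(group):
        dic[g] = i

    return [index_of_dic(dic, p) for p in lista]

def cast_list_int(the_list):
    # Convert a list of numeric strings to integers.
    return [int(x) for x in the_list]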
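
Because the codes above depend on set iteration order, and Python randomizes string hashes per process by default, the same label may receive a different code on another run. A minimal sketch of a reproducible variant, which sorts the labels before numbering them, is shown here. The name ```encode_categories``` is hypothetical, and its codes will not necessarily match the ones in the outputs of this notebook.

```python
def encode_categories(values):
    """Map each distinct string to an integer code, assigned in sorted label order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = encode_categories(["Summer", "Winter", "Summer"])
print(codes)    # [0, 1, 0]
print(mapping)  # {'Summer': 0, 'Winter': 1}
```

Returning the mapping alongside the codes makes it possible to translate predictions back to the original labels later.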

In [8]:
data = athlete_events_clean.copy()

# Sex, Age, Height, Weight, NOC, Year, Season, Sport, Event, Medal

data['Sex'] = StrList_to_UniqueIndexList(data['Sex'])
data['Age'] = cast_list_int(data['Age'])
data['Height'] = cast_list_int(data['Height'])
data['Weight'] = cast_list_int(data['Weight'])
data['NOC'] = StrList_to_UniqueIndexList(data['NOC'])
data['Year'] = cast_list_int(data['Year'])
data['Season'] = StrList_to_UniqueIndexList(data['Season'])
data['Sport'] = StrList_to_UniqueIndexList(data['Sport'])
data['Event'] = StrList_to_UniqueIndexList(data['Event'])
# data['Medal'] = StrList_to_UniqueIndexList(data['Medal'])

In [9]:
display(data.head())
display(data.describe())

Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event,Medal
0,0,24,180,80,58,1992,0,24,301,No
1,0,23,170,60,58,2012,0,17,476,No
2,1,21,185,82,203,1988,1,50,0,No
3,1,21,185,82,203,1988,1,50,223,No
4,1,25,185,82,203,1992,1,50,0,No


Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Event
count,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0,206165.0
mean,0.323581,25.055509,175.37195,70.686004,116.198026,1989.674678,0.191395,29.266098,298.78136
std,0.467843,5.483096,10.546088,14.339753,65.244418,20.130865,0.3934,14.100937,167.508616
min,0.0,11.0,127.0,25.0,0.0,1896.0,0.0,0.0,0.0
25%,0.0,21.0,168.0,60.0,51.0,1976.0,0.0,19.0,164.0
50%,0.0,24.0,175.0,70.0,129.0,1992.0,0.0,29.0,303.0
75%,1.0,28.0,183.0,79.0,171.0,2006.0,0.0,42.0,450.0
max,1.0,71.0,226.0,214.0,225.0,2016.0,1.0,55.0,589.0



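## Step 3: MLP
### Preparing the training and test data

- ```X```: our attributes (Sex, Age, Height, Weight, NOC, Year, Season, Sport). The slice ```0:-2``` used below also leaves out Event, the column just before Medal; ```0:-1``` would include it.
- ```y```: our classes (No, Bronze, Silver, Gold)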

In [10]:
X = data.iloc[:, 0:-2]
y = data.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)  
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test) 

print(X_test)

[[ 1.44801778 -0.55658971 -0.03564355 ...  0.71273323 -0.48753766
  -1.79065862]
 [-0.69059925  0.35598254  0.24937386 ... -1.07724466  2.05112361
   0.97493031]
 [-0.69059925 -0.37407526 -0.79568998 ... -0.48058536 -0.48753766
  -0.0178452 ]
 ...
 [ 1.44801778 -0.19156081 -1.74574802 ...  1.01106288  2.05112361
   1.47131807]
 [ 1.44801778  1.26855479 -0.70068418 ...  1.20994931  2.05112361
   0.26580494]
 [-0.69059925 -1.10413306 -0.70068418 ...  0.31496037 -0.48753766
   0.90401777]]
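
The split above is not seeded, so the exact numbers change from run to run, and the class proportions in the test set are left to chance. A hedged sketch of a reproducible, stratified variant follows. It uses synthetic data, and all names (```X_demo```, ```y_demo```, ```scaler_demo```, and so on) are hypothetical. As in the cell above, the scaler is fitted on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 numeric features and an imbalanced label column.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = np.array(["No"] * 170 + ["Gold"] * 30)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
print("share of 'Gold' in the test set:", (y_te == "Gold").mean())
```

Stratification matters most when one class dominates, as "No" does in this dataset.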




### Configuring and using the MLP

In [11]:
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)  
mlp.fit(X_train, y_train.values.ravel()) 
predictions = mlp.predict(X_test)  

print(confusion_matrix(y_test,predictions))  
print(classification_report(y_test,predictions)) 

[[    0    33  3006     0]
 [    0   163  2892     0]
 [    0    96 52704     0]
 [    0    51  2905     0]]




              precision    recall  f1-score   support

      Bronze       0.00      0.00      0.00      3039
        Gold       0.48      0.05      0.10      3055
          No       0.86      1.00      0.92     52800
      Silver       0.00      0.00      0.00      2956

    accuracy                           0.85     61850
   macro avg       0.33      0.26      0.25     61850
weighted avg       0.75      0.85      0.79     61850
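
To put the 0.85 accuracy in context, the numbers in the confusion matrix above can be recomputed by hand. Rows are the true classes and columns the predicted ones, in the same alphabetical order as the report (the row sums match the support column). The sketch below only re-derives figures from that printed matrix; the variable names are illustrative.

```python
import numpy as np

labels = ["Bronze", "Gold", "No", "Silver"]
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [0,  33,  3006, 0],
    [0, 163,  2892, 0],
    [0,  96, 52704, 0],
    [0,  51,  2905, 0],
])

total = cm.sum()
support = cm.sum(axis=1)            # true samples per class
accuracy = np.trace(cm) / total     # correct predictions / all predictions
baseline = support.max() / total    # accuracy of always predicting the majority class
recall = np.diag(cm) / support      # per-class recall

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.8548
print(f"baseline: {baseline:.4f}")  # baseline: 0.8537
for name, r in zip(labels, recall):
    print(f"recall {name}: {r:.2f}")
```

The model never predicts Bronze or Silver (both columns are all zeros), and its accuracy is only about 0.1 percentage points above the baseline of always answering "No".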





### Cross-validation

In [12]:
# # Store the cross-validated predictions
# predictions = cross_val_predict(mlp,X,y,cv=10)

# # Compute the cross-validated accuracy (in percent)
# accuracy_score(y,predictions)*100

# # Build the confusion matrix from the cross-validated predictions
# confusion_matrix(y,predictions)

# # Print the per-class classification report
# print(classification_report(y,predictions,zero_division=0))
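
The commented-out cell passes the unscaled ```X``` to ```cross_val_predict```. One possible way to keep the scaling step inside cross-validation is to wrap the scaler and the MLP in a scikit-learn pipeline, so the scaler is refitted on the training part of each fold. The sketch below assumes synthetic data from ```make_classification``` and 5 folds to keep it quick; ```X_cv```, ```y_cv```, and ```model``` are hypothetical names, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the athlete features and labels.
X_cv, y_cv = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# The pipeline refits StandardScaler on the training part of every fold.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000, random_state=0),
)

# Each sample is predicted by a model that never saw it during training.
cv_predictions = cross_val_predict(model, X_cv, y_cv, cv=5)
print("cross-validated accuracy:", accuracy_score(y_cv, cv_predictions))
```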