# Mentoring Practical - Unsupervised Learning

The goal of this practical is to perform [Clustering](https://es.wikipedia.org/wiki/Algoritmo_de_agrupamiento) on the player characteristics dataset.

The clusters should group players with similar characteristics. In particular, this practical examines whether those clusters correspond to the position each player plays in.

---

### Imports

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import warnings

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [3]:
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

In [4]:
# Set a seed for reproducibility
np.random.seed(1)

---

### Loading the Dataset

In [39]:
player_df = pd.read_csv('../Datasets/football_player_full.csv', index_col='player_name')

#player_df.set_index('player_name', inplace=True)
print("Shape 'player_df' = {}".format(player_df.shape))

# Shallow copy of the DataFrame (deep=False does not duplicate the data)
player2_df = player_df.copy(deep=False)

Shape 'player_df' = (9925, 36)


In [40]:
player_df.sample(10)

player_name,overall_rating,potential,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,...,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes,position
Gnaly Maxwell Cornet,63.31,77.69,39.94,65.88,56.62,55.31,64.06,72.0,60.69,39.06,...,66.19,21.25,23.44,21.56,15.31,12.31,7.31,11.31,8.31,FW
Kamil Wlodyka,57.93,66.0,47.93,50.29,44.07,63.29,51.21,61.21,52.29,49.0,...,50.0,45.5,50.29,46.14,7.0,13.0,10.0,13.0,14.0,MID
Luigi Bruins,67.77,71.54,64.77,64.62,50.0,67.08,68.23,71.0,78.23,70.31,...,52.62,38.54,40.54,40.08,10.62,11.0,30.77,10.38,11.77,MID
Dinei,70.09,72.55,48.27,71.64,72.09,60.73,59.64,66.55,59.0,54.09,...,70.0,28.36,50.0,21.0,6.0,13.0,25.18,11.0,12.0,FW
Savio Nsereko,65.38,74.25,64.88,60.12,46.0,61.75,67.0,68.5,39.0,41.0,...,58.0,28.0,34.0,27.0,9.75,17.0,31.0,17.25,18.5,FW
Roman Golobart,60.76,71.06,34.12,20.06,65.06,41.59,16.18,21.12,31.0,27.88,...,36.24,63.71,64.35,58.71,12.29,8.88,9.18,7.12,11.53,DEF
Sebastien Bassong,74.28,77.52,51.17,23.34,76.21,58.28,27.21,29.69,56.31,38.93,...,49.41,74.48,78.79,75.62,9.0,10.83,19.86,9.97,8.38,DEF
Lesly Malouda,58.4,63.7,58.8,33.7,44.2,56.6,39.7,63.3,57.7,42.7,...,55.5,40.7,46.3,36.7,12.6,16.8,19.7,14.0,16.1,MID
Michel Breuer,66.25,67.81,50.44,35.38,68.81,61.31,26.0,48.12,43.88,32.75,...,63.75,64.94,65.88,66.62,7.38,11.31,26.69,15.12,12.38,DEF
Richard Sukuta-Pasu,66.85,74.37,54.85,64.44,64.81,60.22,62.0,64.37,56.89,47.63,...,64.48,20.48,35.74,19.78,6.37,9.59,14.89,15.3,12.04,FW


In [41]:
player_df.dtypes

overall_rating        float64
potential             float64
crossing              float64
finishing             float64
heading_accuracy      float64
short_passing         float64
volleys               float64
dribbling             float64
curve                 float64
free_kick_accuracy    float64
long_passing          float64
ball_control          float64
acceleration          float64
sprint_speed          float64
agility               float64
reactions             float64
balance               float64
shot_power            float64
jumping               float64
stamina               float64
strength              float64
long_shots            float64
aggression            float64
interceptions         float64
positioning           float64
vision                float64
penalties             float64
marking               float64
standing_tackle       float64
sliding_tackle        float64
gk_diving             float64
gk_handling           float64
gk_kicking            float64
gk_positio

In [52]:
# Store the list of player positions
from collections import Counter
player_position_list = player_df.position.tolist()
player_position_dict = Counter(player_position_list)
player_position_dict

Counter({'DEF': 3664, 'MID': 3473, 'GK': 869, 'FW': 1919})

In [9]:
player_df = player_df[[
    'overall_rating', 'potential', 'crossing', 'finishing', 'heading_accuracy',
    'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
    'long_passing', 'ball_control', 'acceleration', 'sprint_speed', 'agility',
    'reactions', 'balance', 'shot_power', 'jumping', 'stamina', 'strength',
    'long_shots', 'aggression', 'interceptions', 'positioning', 'vision',
    'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
    'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes',
]]

In [10]:
player_df.dtypes

overall_rating        float64
potential             float64
crossing              float64
finishing             float64
heading_accuracy      float64
short_passing         float64
volleys               float64
dribbling             float64
curve                 float64
free_kick_accuracy    float64
long_passing          float64
ball_control          float64
acceleration          float64
sprint_speed          float64
agility               float64
reactions             float64
balance               float64
shot_power            float64
jumping               float64
stamina               float64
strength              float64
long_shots            float64
aggression            float64
interceptions         float64
positioning           float64
vision                float64
penalties             float64
marking               float64
standing_tackle       float64
sliding_tackle        float64
gk_diving             float64
gk_handling           float64
gk_kicking            float64
gk_positio

In [11]:
player_df.sample(10)

player_name,overall_rating,potential,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
Rolando Mandragora,60.93,73.13,47.87,44.33,48.07,69.33,41.67,60.07,49.67,33.67,...,65.07,31.67,46.67,57.0,52.73,12.67,13.67,8.67,13.67,15.67
Daniel Pinillos,59.71,66.14,59.57,32.14,48.14,48.29,33.14,52.29,57.57,39.14,...,43.86,46.14,57.86,63.57,70.43,6.14,15.14,10.14,12.14,6.14
Stopira,60.25,65.0,56.0,28.0,32.0,49.0,32.0,44.0,47.0,42.0,...,51.0,45.0,62.0,59.5,66.0,8.0,13.0,11.0,13.0,5.0
Kakha Kaladze,78.5,83.1,67.3,32.8,77.1,71.2,46.0,51.7,44.0,48.3,...,61.0,64.2,81.9,81.6,72.7,11.0,15.3,49.1,14.6,17.8
Sergi Darder,69.43,75.61,48.91,39.13,35.65,77.17,36.04,63.83,61.87,54.26,...,75.0,38.26,52.61,64.48,61.13,6.26,9.26,5.26,13.26,5.26
Zeljko Brkic,75.0,77.12,18.5,19.0,17.5,32.71,16.58,20.17,17.88,18.42,...,27.33,31.54,19.0,19.92,20.92,80.58,70.0,59.5,76.88,78.92
Stephen Elliott,66.5,70.93,52.79,67.14,65.64,59.79,61.14,64.21,52.0,47.71,...,64.93,63.43,32.5,31.43,25.0,13.57,10.79,22.79,9.29,9.43
Adil Ramzi,66.17,66.17,61.67,55.67,50.0,71.0,54.0,69.33,49.0,70.0,...,70.67,75.0,48.67,33.33,35.33,12.0,6.0,10.0,10.0,9.0
Igor Bubnjic,69.47,76.18,28.76,20.59,68.12,45.65,30.12,36.12,34.76,34.12,...,34.06,43.12,74.12,75.59,72.29,11.76,11.76,13.76,12.76,4.76
Igor Lolo,68.19,69.52,58.1,44.43,69.43,59.33,47.0,60.0,42.62,30.24,...,53.57,39.57,64.29,70.0,63.76,10.05,10.67,13.67,9.38,15.67


---

> ### Apply Clustering to the player features

Use [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) for the clustering.

Start with 4 clusters. This number matches the number of classes for player position:
* **GK**: Goalkeeper
* **DEF**: Defender
* **MID**: Midfielder
* **FW**: Forward

After clustering, check how many elements each cluster has.

In [30]:
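# Note: `n_jobs` was deprecated in scikit-learn 0.23 and removed in 1.0;
# omit it when running on newer versions.
kmeans = KMeans(n_clusters=4, n_jobs=-1)
kmeans.fit(player_df)
# Keep the cluster labels in a copy so the feature DataFrame stays unchanged
player_df_clusters = player_df.copy()
player_df_clusters['kmedias_4'] = kmeans.labels_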
print(kmeans)
player_df_clusters.head(10)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=-1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)


player_name,overall_rating,potential,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,...,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes,kmedias_4
Aaron Appindangoye,63.6,67.6,48.6,43.6,70.6,60.6,43.6,50.6,44.6,38.6,...,47.6,63.8,66.0,67.8,5.6,10.6,9.6,7.6,7.6,1
Aaron Cresswell,66.97,74.48,70.79,49.45,52.94,62.27,29.15,61.09,61.88,62.12,...,53.12,69.39,68.79,71.52,12.18,8.67,14.24,10.36,12.91,2
Aaron Doran,67.0,74.19,68.12,57.92,58.69,65.12,54.27,69.04,60.19,55.62,...,60.54,22.04,21.12,21.35,14.04,11.81,17.73,10.12,13.5,0
Aaron Galindo,69.09,70.78,57.22,26.26,69.26,64.7,47.78,55.57,37.78,40.39,...,41.74,70.61,70.65,68.04,14.17,11.17,22.87,11.17,10.17,1
Aaron Hughes,73.24,74.68,45.08,38.84,73.04,64.76,32.08,50.6,45.48,26.36,...,52.96,77.6,76.04,74.6,8.28,8.32,24.92,12.84,11.92,1
Aaron Hunt,77.26,80.15,73.89,72.81,65.52,78.26,77.67,78.81,77.85,68.44,...,75.59,31.7,31.52,32.33,13.22,12.41,15.07,15.56,14.85,0
Aaron Kuhl,60.57,76.0,47.57,31.57,46.57,63.57,33.57,53.57,55.57,39.57,...,41.57,51.57,57.14,56.57,7.57,12.57,13.57,13.57,14.57,1
Aaron Lennon,79.77,82.0,78.04,65.96,30.46,76.27,68.38,85.19,62.08,54.35,...,63.46,23.23,26.15,20.88,12.85,9.81,17.88,16.92,13.12,0
Aaron Lennox,48.0,56.86,12.0,15.0,16.0,23.0,14.0,15.0,14.0,18.0,...,41.0,15.0,15.0,12.0,53.0,41.0,39.0,51.0,53.0,3
Aaron Meijers,67.05,69.42,63.89,46.05,56.84,68.95,59.21,69.74,70.89,64.37,...,54.42,62.58,64.58,61.74,6.21,14.21,6.21,9.21,14.21,2


In [34]:
player_df_clusters['kmedias_4'].value_counts().sort_values()

3     869
1    2673
2    2877
0    3506
Name: kmedias_4, dtype: int64
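
Cluster sizes alone do not show which positions ended up in which cluster. One way to inspect this is a contingency table of cluster labels against positions. The following is a minimal sketch with toy data: `toy_clusters` and `toy_positions` are hypothetical stand-ins for `kmeans.labels_` and `player_position_list`.

```python
import pandas as pd

# Hypothetical toy data standing in for kmeans.labels_ and player_position_list
toy_clusters = [0, 0, 1, 1, 2, 3, 3, 0]
toy_positions = ["FW", "FW", "DEF", "DEF", "MID", "GK", "GK", "MID"]

# Rows are clusters, columns are positions, cells are player counts
contingency = pd.crosstab(
    pd.Series(toy_clusters, name="cluster"),
    pd.Series(toy_positions, name="position"),
)
print(contingency)
```

Each row shows how one cluster is split across positions, which makes it easy to see which positions are being confused.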

> ##### Evaluate the results

Evaluate the clustering results using a measure such as [Purity](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).


**Hint 1**: The clusters may mix up different positions on the pitch. That is not a problem, since the positions are simplified.


**Hint 2**: A sign of poor quality is having several very small clusters and one very large one, which indicates that the groups are not well separated in this space and that a different space should be used.

In [58]:
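# player_position_dict = {'DEF': 3664, 'MID': 3473, 'GK': 869, 'FW': 1919}
# Rough estimate only: each ratio pairs a class count with a cluster size
# (matched by size rank), assuming every cluster maps to exactly one position.
# The purity measure linked above needs the per-cluster position counts instead.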
gk_purity = 869 / 869
fw_purity = 1919 / 2673
mid_purity = 2877 / 3473
def_purity = 3506 / 3664
purity = (gk_purity + fw_purity + mid_purity + def_purity) / 4
print("GK purity: {}\nFW purity: {}\nMID purity: {}\nDEF purity: {}".format(
    gk_purity, fw_purity, mid_purity, def_purity))
print("\nOverall purity: {}".format(purity))

GK purity: 1.0
FW purity: 0.7179199401421623
MID purity: 0.8283904405413187
DEF purity: 0.9568777292576419

Overall purity: 0.8757970274852808
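
For comparison, purity as defined in the linked reference assigns each cluster to its most frequent class and reports the fraction of samples that fall in their cluster's majority class. The following is a minimal sketch of that definition. `purity_score` is a hypothetical helper name, and the toy lists stand in for `kmeans.labels_` and `player_position_list`.

```python
import pandas as pd

def purity_score(cluster_labels, class_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    contingency = pd.crosstab(
        pd.Series(list(cluster_labels), name="cluster"),
        pd.Series(list(class_labels), name="position"),
    )
    # Largest class count in each cluster, summed, over the total sample count
    return contingency.max(axis=1).sum() / contingency.values.sum()

# Hypothetical toy data: majority counts are 2, 2, 1 and 1 out of 8 samples
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 3]
toy_positions = ["FW", "FW", "MID", "DEF", "DEF", "MID", "DEF", "GK"]
toy_purity = purity_score(toy_clusters, toy_positions)
print(toy_purity)  # 0.75
```

On the notebook's data the call would presumably be `purity_score(kmeans.labels_, player_position_list)`, since selecting columns leaves the row order unchanged.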


> ### Different numbers of clusters

Try different numbers of clusters, especially high ones, to observe how the classes get subdivided and which classes are confused the most.


**Note**: The positions assigned to the players are simplified. This means that with more than 4 clusters we may discover more specific positions on the pitch (for example: centre back, right/left back, defensive/attacking midfielder, etc.)


**Remember**: Compute the Purity to analyse whether a larger number of clusters gives better results.

In [None]:
# TODO
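
One possible way to approach this exercise is to sweep over several values of `n_clusters` and record the purity for each. The sketch below is not the notebook's solution: it runs on synthetic blobs from `make_blobs` as a stand-in for `player_df`, and `purity_score` is a hypothetical helper.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def purity_score(cluster_labels, class_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    contingency = pd.crosstab(
        pd.Series(list(cluster_labels)), pd.Series(list(class_labels))
    )
    return contingency.max(axis=1).sum() / contingency.values.sum()

# Synthetic stand-in for the player features: 4 classes of 100 samples each
X_toy, y_toy = make_blobs(n_samples=400, centers=4, n_features=5, random_state=1)

purity_by_k = {}
for k in (2, 4, 8, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_toy)
    purity_by_k[k] = purity_score(labels, y_toy)
    print(k, round(purity_by_k[k], 3))
```

Purity tends to rise as the number of clusters grows, reaching 1 when every sample is its own cluster, so a higher value at a larger `k` does not by itself mean a better clustering.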

> ### Feature Subsets

Try different subsets of the dataset's features to analyse whether the results improve.

For example, try the following subset of features:
* `gk_diving`
* `gk_handling`
* `gk_kicking`
* `gk_positioning`
* `standing_tackle`
* `sliding_tackle`
* `short_passing`
* `vision`
* `finishing`
* `volleys`

Also try other subsets.


**Remember**: Compute the Purity

In [None]:
# TODO
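
Clustering on a feature subset could be sketched as follows. The column names come from the list above, while `toy_df` is a hypothetical DataFrame of random values standing in for `player_df`, so the resulting clusters carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

subset_features = [
    "gk_diving", "gk_handling", "gk_kicking", "gk_positioning",
    "standing_tackle", "sliding_tackle", "short_passing", "vision",
    "finishing", "volleys",
]

# Hypothetical random data with the same column names as the real dataset
rng = np.random.default_rng(1)
toy_df = pd.DataFrame(
    rng.uniform(0, 100, size=(200, len(subset_features))),
    columns=subset_features,
)
toy_df["overall_rating"] = rng.uniform(40, 90, size=200)  # column left out below

# Keep only the chosen features before fitting K-Means
subset_df = toy_df[subset_features]
subset_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(subset_df)
print(pd.Series(subset_labels).value_counts().sort_index())
```

With the real data, the purity of `subset_labels` against the stored positions can then be compared with the purity obtained on the full feature set.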

> ### Using Embeddings

Apply an embedding, for example [PCA](https://es.wikipedia.org/wiki/PCA), and compare what happens in that space with what happens in the original space.

In [None]:
# TODO
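
A possible starting point is to project the features with PCA, cluster in both spaces, and measure how much the two labelings agree. This is a sketch under assumptions, not the notebook's solution: the data are synthetic blobs with 35 features, and the standardisation step and the choice of 2 components are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 35 player features
X_toy, _ = make_blobs(n_samples=300, centers=4, n_features=35, random_state=1)

# Standardise so no single feature dominates the principal components
X_scaled = StandardScaler().fit_transform(X_toy)

# Project onto the first two principal components
pca = PCA(n_components=2, random_state=1)
X_embedded = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)

# Cluster in the original (scaled) space and in the embedded space
labels_original = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_scaled)
labels_embedded = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_embedded)

# Adjusted Rand index: 1.0 means both spaces produce the same partition
agreement = adjusted_rand_score(labels_original, labels_embedded)
print(agreement)
```

The two-dimensional embedding can also be drawn as a scatter plot coloured by cluster or by position, which helps to see whether the groups are visually separated.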

---

**Communicating Results**

This information should not live only in a Jupyter Notebook. It should also be laid out as a written or interactive report (for example Google Docs, PDF or Markdown).

The report should target a technical audience with no knowledge of this particular topic, such as your classmates.