# EDA & Profiling

This EDA is based on the <a href="https://www.kaggle.com/hugomathien/soccer">European Soccer Database</a>, which covers more than 25,000 matches and more than 10,000 players from European professional soccer seasons between 2008 and 2016.

### Import Libraries

customplot: contains functions written for this notebook

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

In [None]:
!find . | grep customplot

In [None]:
%%!
mkdir customplot && touch ./customplot/__init__.py
cp ../_pycode/customplot.py ./customplot/customplot.py

In [None]:
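# The helpers live in customplot/customplot.py; importing the bare package would only load the empty __init__.py
from customplot.customplot import *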

### Data 

Download the data from: <a href="https://www.kaggle.com/hugomathien/soccer">https://www.kaggle.com/hugomathien/soccer</a>

#### Ingest Data

In [None]:
!find .. -name 'database.sqlite'

In [None]:
# Create your connection.
cnx = sqlite3.connect('../_data/database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)

### Exploring DataFrame

In [None]:
df.columns

### Feature stats

In [None]:
df.describe().T

#### Check nulls, NaN's, etc.

In [None]:
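# True if any value in the DataFrame is missing
print('any nulls:', df.isnull().any().any())
# Percentage of rows that contain at least one missing value
print('percentage of rows with nulls: %.2f' % (df.isnull().any(axis=1).mean() * 100))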

#### Percentage nulls

In [None]:
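%precision 2
# Summary of the null count per column (printed, since only the last expression is displayed)
print(df.isnull().sum(axis=0).describe())
df.isnull().sum(axis=0).max() * 100 / df.shape[0], 'max % NaN'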
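
The two null checks above answer slightly different questions: how many values are missing in each column, and how many rows would be lost by dropping every row with a missing value. A minimal sketch on a hypothetical toy frame (the values below are made up and are not taken from the database):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Player_Attributes
toy = pd.DataFrame({
    "overall_rating": [67.0, np.nan, 71.0, 80.0],
    "penalties": [48.0, 55.0, np.nan, 60.0],
    "stamina": [54.0, 60.0, 66.0, 70.0],
})

# Share of missing values in each column, as a percentage
pct_null_per_column = toy.isnull().mean() * 100

# Share of rows that contain at least one missing value
pct_rows_with_null = toy.isnull().any(axis=1).mean() * 100

print(pct_null_per_column)
print(f"rows with at least one NaN: {pct_rows_with_null:.1f}%")
```

The per-row figure is the one that tells us how much data `dropna()` removes.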

#### Drop nulls

In [None]:
df = df.dropna()

##### Sanity check

In [None]:
print(df.isnull().sum(axis=0).max() * 100 / df.shape[0], 'max % NaN')
df.info()

#### Shuffle df

In [None]:
df = df.reindex(np.random.permutation(df.index))

### Predicting: 'overall_rating' of a player

In [None]:
df.sample(5)

### Feature Correlation Analysis 
Next, we will check whether 'penalties' is correlated with 'overall_rating'. We use a similar selection operation, but this time over all the rows and inside the correlation function.

In [None]:
df[:10][['penalties', 'overall_rating']]

In [None]:
df['overall_rating'].corr(df['penalties'])
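
To make the number returned by `corr` less of a black box, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with pandas. The toy values are hypothetical and only illustrate the formula:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the database
toy = pd.DataFrame({
    "penalties": [40.0, 55.0, 60.0, 72.0, 80.0],
    "overall_rating": [58.0, 63.0, 70.0, 69.0, 78.0],
})

# Centre both columns on their means
x = toy["penalties"] - toy["penalties"].mean()
y = toy["overall_rating"] - toy["overall_rating"].mean()

# Pearson r: sum of products of centred values, normalised by both spreads
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["overall_rating"].corr(toy["penalties"])

print(r_manual, r_pandas)
```

Both values should agree up to floating-point error, since `Series.corr` uses the Pearson method by default.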

### Create a list of potentially correlated features

In [None]:
potentialFeatures = ['acceleration', 'curve', 'free_kick_accuracy', 'ball_control', 'shot_power', 'stamina']

#### Check correlation coefficient of "overall_rating" of a player with each feature we added to the list as potential.

In [None]:
for f in potentialFeatures:
    related = df['overall_rating'].corr(df[f])
    print("%s: %f" % (f,related))

In [None]:
df.columns.values.shape

In [None]:
cols = ['potential',  'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle']

In [None]:
corr_list = [(f, df['overall_rating'].corr(df[f])) for f in cols]

In [None]:
df2 = pd.DataFrame(corr_list, columns=['attributes', 'correlation'])

In [None]:
df2.sample(5)
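
As an alternative to the list comprehension, `DataFrame.corrwith` computes the same column-by-column correlations in one call. The sketch below uses synthetic data; the column names are borrowed from the dataset, but the values are randomly generated assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic data: 'reactions' is built to track the rating, 'balance' is independent noise
rng = np.random.default_rng(0)
n = 200
rating = rng.normal(70, 7, n)
toy = pd.DataFrame({
    "overall_rating": rating,
    "reactions": rating + rng.normal(0, 3, n),
    "balance": rng.normal(60, 10, n),
})

features = ["reactions", "balance"]

# Correlation of every feature column with the target, returned as a Series
corr_series = toy[features].corrwith(toy["overall_rating"])

# Same two-column layout as the attributes/correlation table, sorted by strength
toy_corr = (corr_series.rename("correlation")
            .rename_axis("attributes")
            .reset_index()
            .sort_values("correlation", ascending=False))
print(toy_corr)
```

Sorting the table makes it easier to read off the strongest candidates than sampling random rows.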

In [None]:
# Quartiles of the correlation coefficients, used as reference lines in the plot
p25, p50, p75 = df2['correlation'].quantile([0.25, 0.50, 0.75])
p25, p50, p75

### Visualisation of correlations

In [None]:
def plot_dataframe(df, y_label):  
    global p25, p50, p75
    color='coral'
    fig = plt.gcf()
    fig.set_size_inches(20, 6)
    plt.title(y_label)

    ax = df['correlation'].plot(linewidth=3.3, color=color)
    ax.axhline(p25, c='gray')
    ax.axhline(p50, c='k')
    ax.axhline(p75, c='gray')
    ax.xaxis.grid()
    ax.set_xticks(df.index)
    ax.set_xticklabels(df.attributes, rotation=75); #Notice the ; (remove it and see what happens !)
    plt.show()

In [None]:
plot_dataframe(df2, 'Player\'s Overall Rating')

### Correlation heatmap

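The features with the highest correlation coefficients are indicative of a high overall rating. However, a high correlation with the target does not tell us whether those top features are independent of one another, so we also look at the pairwise correlations between features.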

In [None]:
import seaborn as sns

plt.figure(figsize=(20, 12))
sns.set(style="white")
cmap = sns.diverging_palette(220, 10, as_cmap=True)

cor = df.loc[:, cols].corr()
cor.shape
mask = np.zeros_like(cor)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(cor, mask=mask, cmap=cmap, vmax=.85);

## Clustering Players into similar groups

We can group similar players based on certain features.

<b>Note:</b> Generally, someone with domain knowledge should decide which features matter. We could instead have selected the features with the highest correlation with overall_rating, but that does not always guarantee the best outcome, because we cannot be sure that the top features are independent. For example, if 4 of the top 5 features depend on the remaining one, taking all 5 adds little new information.
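
One way to check for this kind of redundancy is to list feature pairs whose absolute correlation exceeds a threshold. The sketch below uses synthetic data, and the 0.9 cut-off is an arbitrary assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: 'sliding_tackle' is nearly a copy of 'standing_tackle', 'vision' is unrelated
rng = np.random.default_rng(1)
n = 300
tackle = rng.normal(50, 15, n)
toy = pd.DataFrame({
    "standing_tackle": tackle,
    "sliding_tackle": tackle + rng.normal(0, 3, n),
    "vision": rng.normal(55, 12, n),
})

corr = toy.corr().abs()
threshold = 0.9  # illustrative cut-off, not a recommended value

# Visit each unordered pair once and keep the strongly correlated ones
redundant = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(redundant)
```

If a pair shows up here, keeping both features adds little information to the clustering, and dropping one of them is usually a reasonable simplification.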

#### Select features for clustering - looking for young midfield players

In [None]:
sel_features = ['reactions', 'short_passing', 'long_passing', 'vision', 'interceptions', 'standing_tackle', 'potential']

In [None]:
df_select = df[sel_features].copy(deep=True)

In [None]:
df_select.head()

### Perform K-Means Clustering

We use K-Means to group the players into K clusters based on the selected features.

In [None]:
# Perform scaling on the dataframe containing the features
data = scale(df_select)

# Define number of clusters
k = 4

# Train a model
model = KMeans(init='k-means++', n_clusters=k, n_init=20).fit(data)

### DataFrame with feature coords for each cluster center

In [None]:
df = pd.DataFrame(model.cluster_centers_)
df.columns = sel_features
# Number of players per cluster; the assignment aligns on the cluster label in the index
df['players'] = pd.Series(model.labels_).value_counts()
df['cluster'] = df.index.astype(int)
df
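
Because K-Means was fitted on standardised data, the centre coordinates above are in units of standard deviations rather than rating points. A minimal sketch of how the centres could be mapped back to the original scale follows. It swaps `scale` for `StandardScaler`, which applies the same standardisation but keeps the fitted mean and standard deviation so the transform can be inverted. The data is synthetic, and the names `toy`, `scaler`, and `km` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with two well-separated groups of players
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "vision": np.concatenate([rng.normal(40, 3, 50), rng.normal(75, 3, 50)]),
    "potential": np.concatenate([rng.normal(65, 2, 50), rng.normal(85, 2, 50)]),
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Centres in standardised units, as returned by the model
centers_scaled = pd.DataFrame(km.cluster_centers_, columns=toy.columns)

# Each centre is the mean of the standardised points assigned to that cluster
means_scaled = pd.DataFrame(scaled, columns=toy.columns).groupby(km.labels_).mean()

# Centres converted back to the original rating scale
centers_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=toy.columns
)
print(centers_original)
```

Reading the centres in original units makes statements such as "this cluster averages about 75 for vision" possible, which is harder to see from standardised values.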

## Cluster profiles
We now have K clusters based on the selected features and visualise them as profiles of similar groups of players. Each point is the cluster's mean value for that feature, in standardised units.

In [None]:
# One colour per cluster (this palette supports up to five clusters)
my_colors = list('brgyk')[:k]

In [None]:
from pandas.plotting import parallel_coordinates

plt.figure(figsize=(15,8)).gca().axes.set_ylim([-2.5, +2.5])
# Drop the player counts without mutating df, so the cell can be re-run safely
parallel_coordinates(df.drop(columns='players'), 'cluster', color=my_colors, marker='o');