## K-Means with Python – Clustering Shot Creators in the Premier League

We will use the k-means algorithm to group players by their shot-creating actions (SCA).

The process will take the following steps:

1. Check and tidy dataset
2. Create k-means model and assign each player into a cluster of similar players
3. Describe & visualise results

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.cluster import KMeans

In [None]:
#Allow for full tables to be shown
pd.options.display.max_columns = None
pd.options.display.max_rows = None

In [None]:
data = pd.read_csv('./data/SCA.csv')

In [None]:
data.head()

_Check & Tidy Dataset_

In [None]:
#Split the player names on the backslash, and use the first part
data['Player'] = data['Player'].str.split('\\', expand=True)[0]

#Split the nation names by the space, and use the second one
data['Nation'] = data['Nation'].str.split(' ', expand=True)[1]

#Some positions have 2 (e.g. MFFW), let's just use the first two letters for now
data['Pos'] = data['Pos'].str[:2]

data.head(2)

One more thing to consider is the effect of playing for a stronger team. As a broad assumption, we can expect players in better teams to create more shots, and players in worse teams to create fewer.

If we cluster on raw totals, the results might group players by how much they produce, not by the style of their production.

As such, let’s create some new columns that hold each action type’s share of a player’s total. We’ll do this by creating a sum column, then dividing each action column by that sum.

In [None]:
#Create list of columns to sum, then assign the sum to a new column
add_list = ['Pass SCA', 'Deadball SCA', 'Dribble SCA', 'Shot SCA', 'Fouled SCA']
data['Sum SCA'] = data[add_list].sum(axis=1)

#Create our first new column
data['Pass SCA Ratio'] = data['Pass SCA']/data['Sum SCA']
data.head()

First, we’ll create the new column names in a loop. Then we will run another loop with the code that we just used to create our remaining columns.

In [None]:
#Create new column names by adding ' Ratio' to each name in our previous list
new_cols_list = [each + ' Ratio' for each in add_list]

#For each pair of old and new column names, calculate the column exactly as we did a minute ago
for old_col, new_col in zip(add_list, new_cols_list):
    data[new_col] = data[old_col]/data['Sum SCA']

#Create a sum of the percentages to check that they all add to 1
data['Sum SCA Ratio'] = data[new_cols_list].sum(axis=1)
data.head(5)
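
To see why ratios remove the volume effect, here is a minimal sketch on made-up numbers. The `toy` dataframe and its values are hypothetical and not taken from the dataset. It also shows that the loop can be written as a single `div` call, which is an alternative and not a change to the method above.

```python
import pandas as pd

# Toy data (hypothetical numbers): two players with the same style,
# but the second one produces twice the volume
toy = pd.DataFrame({
    'Pass SCA': [30, 60],
    'Dribble SCA': [10, 20],
    'Shot SCA': [10, 20],
})
cols = ['Pass SCA', 'Dribble SCA', 'Shot SCA']

# Divide every column by the row total in one step
row_totals = toy[cols].sum(axis=1)
ratios = toy[cols].div(row_totals, axis=0).add_suffix(' Ratio')

print(ratios)
```

Both rows end up with identical ratios, so a clustering based on these columns would treat the two players as the same type of creator.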

Next, we’ll create a new dataframe containing only forwards and midfielders. Let’s also set a floor for playing time and shots created, to cut out anyone with low appearance or creation numbers.

In [None]:
#New dataframe where Pos is FW or MF, AND played more than 5 90s, AND created more than 15 shots
#.copy() gives us an independent dataframe, so adding a column later won't raise a SettingWithCopyWarning
data_mffw = data[data['Pos'].isin(['FW', 'MF']) & (data['90s'] > 5) & (data['SCA'] > 15)].copy()

data_mffw.head()
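
The floor on shots created has a useful side effect: a player with no recorded actions would have ratios of 0/0, which pandas stores as `NaN`, and scikit-learn's `KMeans` raises an error on missing values. The sketch below illustrates this on hypothetical toy data. It filters on the toy `Sum SCA` column, which is an assumption for the example; the notebook itself filters on `SCA`.

```python
import pandas as pd

# Toy data (hypothetical): the last player has no shot-creating actions at all
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Pass SCA': [30, 12, 0],
    'Dribble SCA': [10, 8, 0],
})
cols = ['Pass SCA', 'Dribble SCA']
ratio_cols = [c + ' Ratio' for c in cols]

toy['Sum SCA'] = toy[cols].sum(axis=1)
for col, ratio_col in zip(cols, ratio_cols):
    toy[ratio_col] = toy[col] / toy['Sum SCA']  # 0/0 gives NaN

# Rows with a missing ratio would make KMeans raise an error
has_nan = toy[ratio_cols].isna().any(axis=1)
print(toy.loc[has_nan, 'Player'].tolist())

# A minimum-volume filter (here: more than 15 actions) removes them
filtered = toy[toy['Sum SCA'] > 15].copy()
print(filtered[ratio_cols].isna().any().any())
```

A quick check like `data_mffw[new_cols_list].isna().any()` before fitting would confirm that the real feature columns are complete.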

_Create k-means model and assign each player into a cluster of similar players_

Put simply, k-means splits all of our players into a number of clusters that we choose in advance.

One way to start is to place the centre of each cluster at random within our data. From here, each player is assigned to the cluster whose centre they are closest to.

Each cluster’s centre then moves to the average of its players’ datapoints, and the players are re-assigned to their nearest centre. This process repeats until no players change cluster after the centres move to their new averages. Once this process stops, we have our final clusters!
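
The loop described above can be sketched in a few lines of NumPy. This is an illustrative toy version only, not what scikit-learn runs internally; the function name `simple_kmeans` and the sample points are made up for the example.

```python
import numpy as np

def simple_kmeans(points, k, seed=0, max_iter=100):
    """Bare-bones k-means for illustration only."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Distance from every point to every centre, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once no point changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centre to the mean of its points (leave it alone if the cluster is empty)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return labels, centres

# Two well-separated groups of made-up 2D points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centres = simple_kmeans(points, k=2)
print(labels)
print(centres)
```

The first three points should share one label and the last three the other. Which group gets label 0 depends on the random start, which is why the cluster numbers themselves carry no meaning.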

In [None]:
km = KMeans(n_clusters=5, init='random', random_state=0)

In [None]:
y_km = km.fit_predict(data_mffw[new_cols_list])
y_km

In [None]:
data_mffw['Cluster'] = y_km
data_mffw.head()

_Describe & Visualise Results_

In [None]:
data_mffw[data_mffw['Cluster'] == 0].head()

In [None]:
data_mffw[data_mffw['Cluster'] == 1].head()
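
Scanning rows cluster by cluster works, but a per-cluster average gives a quicker read on what each group looks like. The sketch below uses a hypothetical toy stand-in for `data_mffw` with made-up players and cluster labels; on the real data the same `groupby` would run over `new_cols_list`.

```python
import pandas as pd

# Toy stand-in for data_mffw (hypothetical players and cluster labels)
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'Pass SCA Ratio': [0.80, 0.70, 0.50, 0.40],
    'Dribble SCA Ratio': [0.10, 0.20, 0.40, 0.50],
    'Cluster': [0, 0, 1, 1],
})
ratio_cols = ['Pass SCA Ratio', 'Dribble SCA Ratio']

# Average profile of each cluster, plus how many players it holds
summary = toy.groupby('Cluster')[ratio_cols].mean().round(2)
summary['Players'] = toy.groupby('Cluster').size()
print(summary)
```

In this toy case cluster 0 is pass-heavy and cluster 1 is more balanced between passing and dribbling. For a fitted model, `km.cluster_centers_` should hold values close to these per-cluster averages.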

In [None]:
#We'll do this a couple of times, so let's make a function
def plotClusters(xAxis, yAxis):
    colours = ['red', 'blue', 'green', 'pink', 'gold']
    #Draw one scatter series per cluster so each gets its own colour and legend entry
    #Legend labels are 1-based; the Cluster column itself runs from 0 to 4
    for cluster, colour in enumerate(colours):
        members = data_mffw[data_mffw['Cluster'] == cluster]
        plt.scatter(members[xAxis], members[yAxis], s=40, c=colour, label=f'Cluster {cluster + 1}')
    plt.xlabel(xAxis)
    plt.ylabel(yAxis)
    plt.legend()

plotClusters('Pass SCA Ratio', 'Dribble SCA Ratio')

For one final plot, let’s look at age against shots created per 90. A view like this might help when we are replacing a player and want to find younger players with similar contributions.

In [None]:
#Age vs number of shot creations per 90, split by cluster
plotClusters('SCA90', 'Age')
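
The same idea can be turned into a shortlist. As a rough sketch on hypothetical toy data (the player names, ages and values are invented), we can keep players in the same cluster as the one being replaced, drop anyone who is not younger, and sort by output:

```python
import pandas as pd

# Toy stand-in for data_mffw (hypothetical players and values)
toy = pd.DataFrame({
    'Player': ['Veteran', 'Prospect 1', 'Prospect 2', 'Other'],
    'Age': [31, 22, 24, 21],
    'SCA90': [4.0, 3.1, 3.6, 3.9],
    'Cluster': [2, 2, 2, 0],
})

# The player we want to replace
target = toy[toy['Player'] == 'Veteran'].iloc[0]

# Same cluster, younger than the target, best creators first
candidates = toy[(toy['Cluster'] == target['Cluster']) & (toy['Age'] < target['Age'])]
candidates = candidates.sort_values('SCA90', ascending=False)
print(candidates[['Player', 'Age', 'SCA90']])
```

Note that "Other" is left out despite a high SCA90, because that player sits in a different cluster and so creates shots in a different way.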