# Premier League Midfielder Analysis - Clustering Players By Playing Style

In this workbook, we will use Python to access data that I have stored in a series of IBM DB2 tables, looking at various performance metrics for players in the Premier League. This data comes from fbref.com.

We will use this data to try to cluster midfielders into playing styles, using the K Means Clustering algorithm.

Firstly, we need to install and load the packages required to query the IBM DB2 data with SQL and transform the result into a Pandas dataframe for further analysis.

In [1]:
!pip install ipython_sql
print("Installed!")

Collecting ipython_sql
  Downloading https://files.pythonhosted.org/packages/ab/3d/0d38357c620df31cebb056ca1804027112e5c008f4c2c0e16d879996ad9f/ipython_sql-0.4.0-py3-none-any.whl
Collecting prettytable<1 (from ipython_sql)
  Downloading https://files.pythonhosted.org/packages/ef/30/4b0746848746ed5941f052479e7c23d2b56d174b82f4fd34a25e389831f5/prettytable-0.7.2.tar.bz2
Collecting sqlparse (from ipython_sql)
  Downloading https://files.pythonhosted.org/packages/85/ee/6e821932f413a5c4b76be9c5936e313e4fc626b33f16e027866e1d60f588/sqlparse-0.3.1-py2.py3-none-any.whl (40kB)
Building wheels for collected packages: prettytable
  Building wheel for prettytable (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/80/34/1c/3967380d9676d162cb59513bd9dc862d0584e045a162095606
Successfully built prettytable
Installing collected packages: prettytable, sqlparse, ipython-sql
Successfully installed 

In [2]:
import pandas as pd
import numpy as np
import ibm_db
import ibm_db_dbi
from sklearn.cluster import KMeans

## Getting the data from SQL

With these packages now loaded, we will create a connection to the IBM DB2 database, and bring the data that we need into Python. We will return data for midfielders who have played at least 500 minutes in the 2019/20 season, prior to the break in the season enforced by the Coronavirus pandemic.

In [3]:
%load_ext sql

In [4]:
# The code was removed by Watson Studio for sharing.

In [5]:
%%sql result_set << 
SELECT STD."Player",STD."Pos",STD."Squad",STD."Min",STD."xG90",STD."xA90",PAS."TotalAtt",PAS."TotalCmpProp",PAS."TotDist" as "PassTotDist",PAS."PrgDist" AS "PassPrgDist",PAS."ShortAtt",PAS."MidAtt",PAS."LongAtt",PAS."KP",PAS."ToFinalThird",PAS."Prog" as "ProgressivePasses",CRE."SCA90",CRE."SCAPassLive",CRE."SCAPassDead",CRE."SCADrib",CRE."GCA90",DEF."Tkl",DEF."Press",DEF."Blocks",DEF."Pass_Blocks",DEF."Int",DEF."Clr",POS."Touches",POS."Att_Drib",POS."Drib_Succ_Prop",POS."Carries",POS."TotDist" as "CarryTotDist",POS."PrgDist" as "CarryPrgDist",POS."Pass_Targ",POS."Miscon",POS."Dispos"
FROM PLAYERSSTANDARD AS STD
JOIN PLAYERSPASSING AS PAS ON PAS."Player"=STD."Player"
JOIN PLAYERSCREATION AS CRE ON CRE."Player"=STD."Player"
JOIN PLAYERDEFENSIVE AS DEF ON DEF."Player"=STD."Player"
JOIN PLAYERPOSSESSION AS POS ON POS."Player"=STD."Player"
WHERE STD."Pos" LIKE '%MF%'
AND STD."Min" >= 500

 * ibm_db_sa://pgv93227:***@dashdb-txn-sbox-yp-dal09-08.services.dal.bluemix.net:50000/BLUDB
Done.
Returning data to local variable result_set
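
The query pattern above (join per-player tables, then filter on position and minutes) can be tried without a DB2 instance. The following is a minimal sketch that assumes an in-memory SQLite database as a stand-in; the table names, column names and values are hypothetical toy data, not the real FBRef tables.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the DB2 tables (hypothetical toy data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE standard (Player TEXT, Pos TEXT, Minutes INTEGER, xG90 REAL);
CREATE TABLE passing  (Player TEXT, TotalAtt REAL);
INSERT INTO standard VALUES ('Player A', 'MF',    1200, 0.20),
                            ('Player B', 'MF,FW',  450, 0.35),
                            ('Player C', 'DF',    2000, 0.02);
INSERT INTO passing VALUES ('Player A', 55.1), ('Player B', 30.4), ('Player C', 48.9);
""")

# Same shape as the notebook query: join on player name, keep midfielders
# with at least 500 minutes
query = """
SELECT s.Player, s.Pos, s.Minutes, s.xG90, p.TotalAtt
FROM standard AS s
JOIN passing AS p ON p.Player = s.Player
WHERE s.Pos LIKE '%MF%' AND s.Minutes >= 500
"""
toy_df = pd.read_sql_query(query, conn)
conn.close()
print(toy_df)
```

Only the toy midfielder with enough minutes survives the filter. Joining on the player name alone works here because the names are unique; with real data, two players sharing a name would produce duplicate rows.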


## Converting to a *Pandas* Dataframe

And we will transform this data into a Pandas dataframe for ease of further analysis.

Using the approach below, we can order the resulting dataframe by a particular attribute - in this case the average distance that a player carries the ball forwards per 90 minutes.

In [6]:
MF_Stats=result_set.DataFrame()
MF_Stats.sort_values(by="CarryPrgDist",ascending=False).head(10)

Unnamed: 0,Player,Pos,Squad,Min,xG90,xA90,TotalAtt,TotalCmpProp,PassTotDist,PassPrgDist,...,Clr,Touches,Att_Drib,Drib_Succ_Prop,Carries,CarryTotDist,CarryPrgDist,Pass_Targ,Miscon,Dispos
14,Sofiane Boufal,"MF,FW",Southampton,762,0.11,0.31,38.1,75.9,441.5,124.5,...,0.24,60.4,11.3,56.3,52.5,449.6,291.2,54.9,2.47,4.82
4,Felipe Anderson,"MF,FW",West Ham,1407,0.16,0.18,57.2,77.5,730.7,215.3,...,0.83,71.3,6.15,67.7,57.2,458.1,285.1,64.0,2.05,2.05
41,Jack Grealish,"FW,MF",Aston Villa,2333,0.21,0.2,44.6,78.9,674.2,185.0,...,0.39,59.3,3.71,61.5,46.7,453.7,269.2,51.9,1.89,2.36
108,Allan Saint-Maximin,"MF,FW",Newcastle Utd,1268,0.16,0.18,22.0,70.6,269.4,62.8,...,0.14,36.9,7.59,64.5,35.5,374.3,268.4,36.5,2.62,2.98
137,Wilfried Zaha,"MF,FW",Crystal Palace,2546,0.14,0.08,33.6,78.2,417.8,117.6,...,0.11,52.2,7.88,61.9,48.3,388.7,257.5,54.5,3.39,4.56
55,Diogo Jota,"FW,MF",Wolves,1761,0.43,0.13,33.4,78.0,428.1,83.2,...,0.36,49.4,5.56,58.7,41.4,328.9,236.4,50.3,2.6,3.01
114,Bernardo Silva,"FW,MF",Manchester City,1497,0.28,0.23,64.2,85.7,902.1,181.3,...,0.54,75.4,3.13,73.1,58.6,379.0,234.1,67.0,0.54,1.14
17,Dani Ceballos,MF,Arsenal,908,0.04,0.09,73.3,85.3,1253.5,301.4,...,0.89,83.5,2.28,69.6,63.9,431.8,231.5,65.6,0.59,1.58
49,Onel Hernández,"FW,MF",Norwich City,765,0.13,0.07,22.6,68.2,216.7,44.2,...,0.71,39.8,5.41,56.5,33.3,322.9,231.1,38.1,2.71,3.88
72,Riyad Mahrez,"FW,MF",Manchester City,1389,0.25,0.46,56.7,84.1,858.3,207.6,...,0.45,70.5,4.55,62.9,54.5,397.7,229.4,63.5,1.17,1.82


Ok - now that we have the data in Python, we can begin to consider how we might look to cluster these players. Midfielders are a particularly interesting group, because there are many different styles. As a general rule, players break down into the following groups:

- Holding midfielders: Primarily tasked with breaking up opposition play
- Deep Lying "Quarterbacks": The focal point for the team's build-up play, getting on the ball frequently and setting the tempo with their passing
- Box-To-Box midfielders: All-action players who contribute to both defensive and attacking phases of the game
- Attacking midfielders: Creative players primarily tasked with generating goal-scoring opportunities, and perhaps taking shots themselves
- Wide Men: Players who rely primarily on pace and dribbling skills to create goal-scoring chances for themselves and others

It would be really interesting to see whether we can identify these different playing styles by running a clustering algorithm on our dataset.

## Preprocessing the Data

To do this, we first need to remove columns such as name, team and minutes played, which are not relevant to playing style.

In [7]:
cluster_df = MF_Stats.drop(['Player','Pos','Squad','Min'], axis=1)
cluster_df.head()

Unnamed: 0,xG90,xA90,TotalAtt,TotalCmpProp,PassTotDist,PassPrgDist,ShortAtt,MidAtt,LongAtt,kp,...,Clr,Touches,Att_Drib,Drib_Succ_Prop,Carries,CarryTotDist,CarryPrgDist,Pass_Targ,Miscon,Dispos
0,0.36,0.11,38.6,77.7,496.6,114.7,2.62,28.1,7.9,1.08,...,0.77,54.6,3.23,50.8,36.1,220.4,118.8,53.1,2.21,2.77
1,0.2,0.03,24.9,74.5,294.2,74.1,1.71,18.5,4.71,0.57,...,0.49,38.0,3.23,49.4,29.2,277.5,159.2,38.3,1.86,1.67
2,0.08,0.05,50.3,84.9,679.5,197.3,2.12,41.9,6.29,0.98,...,1.06,63.5,3.79,66.0,46.8,289.5,162.1,48.3,1.52,2.12
3,0.12,0.01,43.0,85.9,757.1,213.1,0.71,29.0,13.2,0.12,...,6.9,56.0,0.36,66.7,30.1,119.4,56.4,35.4,0.12,0.12
4,0.16,0.18,57.2,77.5,730.7,215.3,3.72,44.2,9.29,1.6,...,0.83,71.3,6.15,67.7,57.2,458.1,285.1,64.0,2.05,2.05


We then need to transform the resulting dataset into a NumPy array, and standardize it so that attributes measured on larger numeric scales don't unduly influence the clustering model.

In [8]:
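from sklearn.preprocessing import StandardScaler
# Note: [:, 1:] skips the first column of cluster_df (xG90), so xG90 is not part of X
X1 = cluster_df.values[:,1:]
# Replace any missing values with 0 before scaling
X = np.nan_to_num(X1)
# Standardize each column to zero mean and unit variance
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet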



array([[-0.05124941, -0.50449997, -0.23114308, ...,  0.48314811,
         0.94714454,  1.31926829],
       [-0.97877977, -1.41898478, -0.74692269, ..., -0.79077612,
         0.52648172,  0.10605616],
       [-0.74689718,  0.2764834 ,  0.92936104, ...,  0.0699835 ,
         0.11783785,  0.60237021],
       ...,
       [-0.86283848,  1.50469657,  1.46125877, ...,  1.02542667,
        -1.48068086, -1.21744799],
       [ 1.80381131, -0.73145241,  0.17180974, ...,  0.21631263,
         0.6466711 ,  0.06193935],
       [-0.3990733 , -0.83825355, -0.15055252, ...,  0.60365446,
         2.36537918,  3.29349531]])
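
To make the scaling step concrete, here is a small sketch of what `StandardScaler` computes, using hypothetical values for three players and two attributes. Each column is centred on its mean and divided by its population standard deviation, so both columns end up on the same scale regardless of their original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical per-90 values for three players: [xA90, PassTotDist]
toy = np.array([[0.10, 300.0],
                [0.20, 700.0],
                [0.30, 1100.0]])

scaled = StandardScaler().fit_transform(toy)

# Equivalent manual calculation: subtract the column mean and divide by the
# population standard deviation (ddof=0, which is what StandardScaler uses)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, a one-unit difference means "one standard deviation" in every column, which is what stops a large-valued attribute such as total passing distance from dominating the distance calculation.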

## Building our Model

Now that the data is in the format we need, we can run the algorithm itself. Given we listed 5 playing styles earlier, we'll tell the algorithm to see if it can find 5 groups within our list of players. We'll also set a "random state" so that we can repeat our results in future.

In [9]:
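clusterNum = 5
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12,random_state=3425)
# Note: this fits on X, the unscaled array; the results below come from that fit.
# Pass Clus_dataSet instead to cluster on the standardized features.
k_means.fit(X)
labels = k_means.labels_
print(labels)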

[2 1 3 3 3 1 1 1 1 0 1 2 2 3 1 3 3 4 2 3 2 3 2 2 0 0 3 3 1 3 2 3 4 2 2 2 2
 4 1 0 2 3 1 3 0 4 2 4 1 1 2 1 3 2 4 1 3 1 1 1 4 2 3 0 3 2 0 2 3 3 3 0 0 3
 0 2 2 2 3 1 3 1 3 4 3 2 1 0 2 3 3 1 0 3 4 3 3 2 2 1 4 2 0 1 3 2 4 3 1 1 1
 0 3 2 0 0 3 2 1 4 1 3 3 0 1 1 2 3 1 3 3 3 2 2 4 4 2 1]
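
Note that the cell above calls `k_means.fit(X)`, i.e. it clusters the unscaled array rather than `Clus_dataSet`. The sketch below, on purely synthetic data, illustrates why that choice can matter: two small-valued features carry the real group structure, while one large-valued feature is noise. All names and numbers here are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
true_group = np.repeat([0, 1], n)

# Two small-scale features that separate the groups, one large-scale noise feature
small_a = np.where(true_group == 0, 0.1, 0.4) + rng.normal(0, 0.02, 2 * n)
small_b = np.where(true_group == 0, 1.0, 3.0) + rng.normal(0, 0.2, 2 * n)
large_noise = rng.normal(800, 300, 2 * n)
toy_X = np.column_stack([small_a, small_b, large_noise])

def fit_labels(data):
    """Cluster into two groups and return the label for each row."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Adjusted Rand index: 1.0 means perfect agreement with the true groups,
# values near 0 mean agreement no better than chance
ari_raw = adjusted_rand_score(true_group, fit_labels(toy_X))
ari_scaled = adjusted_rand_score(true_group, fit_labels(StandardScaler().fit_transform(toy_X)))
print(f"raw: {ari_raw:.2f}, scaled: {ari_scaled:.2f}")
```

On the raw array the large-valued noise feature drives the split, whereas the standardized array lets the informative features decide. In this notebook the equivalent change would be `k_means.fit(Clus_dataSet)`, which would produce different labels from the ones printed above.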


Great - we've generated a list of labels as an output here, ranging from 0 to 4 inclusive. However, they're not much use in isolation. We need to bring them back into our original dataset:

In [10]:
MF_Stats["Cluster"] = labels
MF_Stats.head(5)

Unnamed: 0,Player,Pos,Squad,Min,xG90,xA90,TotalAtt,TotalCmpProp,PassTotDist,PassPrgDist,...,Touches,Att_Drib,Drib_Succ_Prop,Carries,CarryTotDist,CarryPrgDist,Pass_Targ,Miscon,Dispos,Cluster
0,Dele Alli,"MF,FW",Tottenham,1758,0.36,0.11,38.6,77.7,496.6,114.7,...,54.6,3.23,50.8,36.1,220.4,118.8,53.1,2.21,2.77,2
1,Miguel Almirón,"FW,MF",Newcastle Utd,2370,0.2,0.03,24.9,74.5,294.2,74.1,...,38.0,3.23,49.4,29.2,277.5,159.2,38.3,1.86,1.67,1
2,Steven Alzate,"MF,DF",Brighton,1188,0.08,0.05,50.3,84.9,679.5,197.3,...,63.5,3.79,66.0,46.8,289.5,162.1,48.3,1.52,2.12,3
3,Ibrahim Amadou,"DF,MF",Norwich City,759,0.12,0.01,43.0,85.9,757.1,213.1,...,56.0,0.36,66.7,30.1,119.4,56.4,35.4,0.12,0.12,3
4,Felipe Anderson,"MF,FW",West Ham,1407,0.16,0.18,57.2,77.5,730.7,215.3,...,71.3,6.15,67.7,57.2,458.1,285.1,64.0,2.05,2.05,3


Ok - now we can see the group that each player falls into. Dele Alli, for example, falls into cluster 2, Miguel Almirón into cluster 1, and so on. However, what we really want to understand at this stage is which attributes are most prominent in each cluster, so that we can judge whether the clusters as a whole align with our initial ideas about playing styles.

## Making Sense of Our Results

We can get a sense of this by adding the cluster labels to the dataset we originally created for the clustering model, and then looking at the average of every attribute for each cluster:

In [11]:
cluster_df2=cluster_df.apply(pd.to_numeric, errors='coerce')
cluster_df2["Cluster"]=labels
summary_stats=pd.DataFrame(cluster_df2.groupby('Cluster').mean())
summary_stats["Cluster"]=[0,1,2,3,4]
summary_stats

Cluster,xG90,xA90,TotalAtt,TotalCmpProp,PassTotDist,PassPrgDist,ShortAtt,MidAtt,LongAtt,kp,...,Touches,Att_Drib,Drib_Succ_Prop,Carries,CarryTotDist,CarryPrgDist,Pass_Targ,Miscon,Dispos,Cluster
0,0.111176,0.164118,59.782353,83.652941,967.982353,245.811765,1.993529,41.782353,16.02,1.638824,...,70.782353,2.385882,72.011765,50.958824,291.447059,154.723529,55.988235,0.96,1.307059,0
1,0.201875,0.114063,28.51875,73.153125,326.003125,86.5,2.2175,21.38125,4.914375,1.072188,...,42.921875,4.232188,58.084375,32.634375,251.715625,152.76875,43.084375,2.274375,2.483125,1
2,0.143529,0.116176,39.120588,77.055882,528.764706,145.723529,2.290588,27.720588,9.104412,1.271176,...,50.961765,1.945588,58.079412,33.55,191.385294,99.085294,41.514706,1.519706,1.512353,2
3,0.091463,0.097317,50.329268,81.160976,765.131707,200.4,1.897317,35.495122,12.929268,1.132195,...,61.865854,1.980732,66.817073,41.436585,226.839024,112.860976,46.292683,1.153415,1.273415,3
4,0.092857,0.100714,74.807143,86.428571,1311.242857,316.264286,1.843571,51.292857,21.664286,1.244286,...,84.607143,1.386429,74.164286,60.307143,297.685714,146.964286,65.228571,0.583571,0.848571,4


Looking at the first column, we can see that players in cluster 1 tend to have higher xG90 (expected goals per 90 minutes played), while the second column shows us that cluster 0 typically performs best for expected assists.

This feels like we're going in the right direction, but it would be great if we could see a summary of the top attributes for each cluster.

We can do this by scaling the *summary_stats* dataframe we just created, and then writing a loop to return the top 5 attributes for each cluster.

In [12]:
from sklearn.preprocessing import MinMaxScaler

In [13]:
min_max_scaler = MinMaxScaler()

new_array = min_max_scaler.fit_transform(summary_stats)
scaled_df = pd.DataFrame(new_array, index=summary_stats.index, columns=summary_stats.columns)
scaled_df["Cluster"]=summary_stats["Cluster"]
scaled_df



Cluster,xG90,xA90,TotalAtt,TotalCmpProp,PassTotDist,PassPrgDist,ShortAtt,MidAtt,LongAtt,kp,...,Touches,Att_Drib,Drib_Succ_Prop,Carries,CarryTotDist,CarryPrgDist,Pass_Targ,Miscon,Dispos,Cluster
0,0.178542,1.0,0.675409,0.79092,0.651597,0.69337,0.335464,0.682046,0.663026,1.0,...,0.668353,0.351208,0.866177,0.662183,0.941311,1.0,0.61034,0.222633,0.280497,0
1,1.0,0.250678,0.0,0.0,0.0,0.0,0.836498,0.0,0.0,0.0,...,0.0,1.0,0.000309,0.0,0.567546,0.964866,0.066192,1.0,1.0,1
2,0.471563,0.282324,0.229039,0.293983,0.205799,0.257758,1.0,0.211936,0.250153,0.351176,...,0.192871,0.196489,0.0,0.033088,0.0,0.0,0.0,0.553662,0.406093,2
3,0.0,0.0,0.471188,0.603208,0.445707,0.495725,0.120232,0.471853,0.478504,0.105902,...,0.454453,0.208838,0.543222,0.318082,0.333524,0.247594,0.201485,0.337025,0.259914,3
4,0.012623,0.050856,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.303719,...,1.0,0.0,1.0,1.0,1.0,0.860541,1.0,0.0,0.0,4
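
As a worked example of the min-max step, the sketch below rescales two of the columns from the cluster summary (values rounded from the table of cluster means above) and checks the result against the formula (x - min) / (max - min).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Cluster means for two attributes, rounded from the summary table
means = pd.DataFrame(
    {"xG90": [0.111, 0.202, 0.144, 0.091, 0.093],
     "TotalAtt": [59.78, 28.52, 39.12, 50.33, 74.81]},
    index=pd.Index(range(5), name="Cluster"),
)

scaled = pd.DataFrame(MinMaxScaler().fit_transform(means),
                      index=means.index, columns=means.columns)

# Same result by hand: (x - column min) / (column max - column min)
manual = (means - means.min()) / (means.max() - means.min())
print(scaled.round(3))
```

Because the scaling is applied per column across only five cluster means, every attribute has exactly one cluster at 1.0 and one at 0.0. A cluster that leads on more than five attributes will therefore show a top-5 list made up entirely of 1.0 scores, and which five are displayed is arbitrary.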


In [14]:
num_top_attributes = 5

for cluster in scaled_df["Cluster"]:
    print("----"+str(cluster)+"----")
    # Transpose this cluster's row into a two-column (stat, score) table
    temp = scaled_df[scaled_df['Cluster'] == cluster].T.reset_index()
    temp.columns = ['stat','score']
    # Drop the label row so it is not ranked as an attribute
    temp = temp[temp.stat != 'Cluster'].copy()
    temp['score'] = temp['score'].astype(float)
    temp = temp.round({'score': 2})
    # Show the highest-scoring attributes (ties at 1.0 appear in arbitrary order)
    print(temp.sort_values('score', ascending=False).reset_index(drop=True).head(num_top_attributes))
    print('\n')

----0----
           stat  score
0            kp    1.0
1   SCAPassDead    1.0
2         gca90    1.0
3  CarryPrgDist    1.0
4   SCAPassLive    1.0


----1----
       stat  score
0      xG90    1.0
1  Att_Drib    1.0
2    Dispos    1.0
3    Miscon    1.0
4   SCADrib    1.0


----2----
          stat  score
0     ShortAtt    1.0
1  Pass_Blocks    1.0
2       Blocks    1.0
3        Press    1.0
4  SCAPassDead    1.0


----3----
          stat  score
0          Clr   1.00
1          Int   0.96
2       Blocks   0.83
3  SCAPassDead   0.71
4          Tkl   0.68


----4----
             stat  score
0  Drib_Succ_Prop    1.0
1    CarryTotDist    1.0
2       Pass_Targ    1.0
3         LongAtt    1.0
4         Touches    1.0
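
The same "top attributes per cluster" idea can be written more compactly with `Series.nlargest`. This is a sketch on a hypothetical three-cluster table rather than the real `scaled_df`; the helper name `top_attributes` is made up for illustration.

```python
import pandas as pd

# Hypothetical min-max scaled cluster means (rows are clusters)
toy_scaled = pd.DataFrame(
    {"xG90": [0.18, 1.00, 0.47],
     "KP": [1.00, 0.00, 0.35],
     "Press": [0.20, 0.60, 1.00],
     "Att_Drib": [0.35, 1.00, 0.20]},
    index=pd.Index([0, 1, 2], name="Cluster"),
)

def top_attributes(scaled, n):
    """Return {cluster: Series of the n highest-scoring attributes}."""
    return {cluster: row.nlargest(n) for cluster, row in scaled.iterrows()}

top = top_attributes(toy_scaled, n=2)
for cluster, scores in top.items():
    print(f"----{cluster}----")
    print(scores)
```

To use this on the notebook's data, the `Cluster` label column would need to be dropped first, for example `top_attributes(scaled_df.drop(columns="Cluster"), 5)`.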




This is really useful. Using the output above, we can see whether each cluster aligns to our original thoughts on the various midfield playing styles:

- Cluster 0 players tend to score highly for key passes, goal creating actions, and shot creation from open play, as well as their ability to carry the ball. This would seem to align closely to our earlier definition of an *Attacking Midfielder*
- Cluster 1 players score highly for expected goals, dribble attempts and shot creation from dribbles, which would align with our initial definition of *Wide Men*
- Cluster 2 players score highly for short pass frequency, pressing, and pass blocking. We probably need to investigate this further, but they could be *Box-To-Box* midfielders
- Cluster 3 players score highly for clearances, interceptions, blocks and tackles - these look like our *Holding Midfielders*
- Cluster 4 players look to carry the ball a lot, tend to be the target of a lot of passes, and generally seem to get on the ball a lot. These look like our *Deep Lying "Quarterbacks"*

## Reviewing Player Allocations

This looks promising - this cursory review of the clusters looks to align pretty closely to our expectations. However, we might confirm this by reviewing which players fall into each category:

In [17]:
MF_Stats[['Player','Pos','Squad','Min','kp','gca90','SCAPassLive']].loc[MF_Stats['Cluster'] == 0].sort_values(by=['SCAPassLive'],ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,Player,Pos,Squad,Min,kp,gca90,SCAPassLive
0,Kevin De Bruyne,MF,Manchester City,2147,3.85,1.13,4.77
1,Riyad Mahrez,"FW,MF",Manchester City,1389,2.86,1.49,4.48
2,David Silva,MF,Manchester City,1255,2.23,0.72,4.32
3,Bernardo Silva,"FW,MF",Manchester City,1497,1.99,0.54,3.43
4,Ross Barkley,MF,Chelsea,675,2.67,0.53,3.2
5,Giovani Lo Celso,"MF,FW",Tottenham,878,1.73,0.21,2.86
6,James Maddison,MF,Leicester City,2399,3.0,0.45,2.62
7,Nemanja Matić,MF,Manchester Utd,959,1.12,0.09,2.34
8,Davy Pröpper,MF,Brighton,2340,0.92,0.15,2.0
9,João Moutinho,MF,Wolves,2436,2.47,0.37,1.88


Cluster 0 does indeed look to feature a good number of *Attacking Midfielders* - Kevin De Bruyne, James Maddison, Ross Barkley, and Giovani Lo Celso all fall into this category. However, there are also a few players here who wouldn't typically seem to fit this definition - Nemanja Matic, in particular, is more of a holding midfielder.

In [18]:
MF_Stats[['Player','Pos','Squad','Min','xG90','Att_Drib','SCADrib']].loc[MF_Stats['Cluster'] == 1].sort_values(by=['xG90'],ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,Player,Pos,Squad,Min,xG90,Att_Drib,SCADrib
0,Diogo Jota,"FW,MF",Wolves,1761,0.43,5.56,0.61
1,Michail Antonio,"FW,MF",West Ham,1030,0.41,6.32,0.88
2,Moise Kean,"FW,MF",Everton,687,0.4,4.08,0.66
3,Ismaila Sarr,"FW,MF",Watford,1226,0.36,5.29,0.22
4,Harvey Barnes,MF,Leicester City,1709,0.36,3.58,0.42
5,Ayoze Pérez,"MF,FW",Leicester City,1581,0.29,3.24,0.28
6,Lucas Moura,"FW,MF",Tottenham,1682,0.29,4.71,0.53
7,Anwar El Ghazi,"FW,MF",Aston Villa,1675,0.29,2.2,0.32
8,Mason Greenwood,"FW,MF",Manchester Utd,660,0.29,1.92,0.14
9,Joshua King,"FW,MF",Bournemouth,1462,0.29,5.49,0.49


Cluster 1 looks to be accurate as our *Wide Men* group - all of the players above could reasonably be classed in this category.

In [19]:
MF_Stats[['Player','Pos','Squad','Min','ShortAtt','Pass_Blocks','Press']].loc[MF_Stats['Cluster'] == 2].sort_values(by=['Press'],ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,Player,Pos,Squad,Min,ShortAtt,Pass_Blocks,Press
0,Jesse Lingard,"MF,FW",Manchester Utd,892,2.53,0.71,32.6
1,Matthew Longstaff,MF,Newcastle Utd,545,1.31,0.33,30.2
2,Pablo Fornals,MF,West Ham,1465,3.07,2.21,29.0
3,James McArthur,MF,Crystal Palace,2425,2.34,2.27,28.1
4,Joe Willock,MF,Arsenal,584,2.31,1.08,26.2
5,Andreas Pereira,"MF,FW",Manchester Utd,1446,2.24,1.43,26.2
6,Tom Cleverley,MF,Watford,657,2.19,1.51,25.8
7,Ondrej Duda,MF,Norwich City,577,3.44,2.03,25.5
8,Gylfi Sigurðsson,MF,Everton,1994,1.44,1.49,25.4
9,Philip Billing,MF,Bournemouth,2164,1.75,2.54,25.1


Cluster 2 does indeed look like a good sample of *Box-To-Box* midfielders. These players tend to have a broad skill base, so can be hard to group together precisely, but most of these players could reasonably be claimed to help out at both ends of the pitch.

In [20]:
MF_Stats[['Player','Pos','Squad','Min','Clr','Int','Blocks']].loc[MF_Stats['Cluster'] == 3].sort_values(by=['Int'],ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,Player,Pos,Squad,Min,Clr,Int,Blocks
0,Wilfred Ndidi,MF,Leicester City,1894,3.05,3.1,2.52
1,Hamza Choudhury,MF,Leicester City,784,1.95,2.99,2.3
2,N'Golo Kanté,MF,Chelsea,1385,1.88,2.6,2.47
3,Étienne Capoue,MF,Watford,2138,2.65,2.35,1.85
4,Declan Rice,MF,West Ham,2610,1.76,2.1,2.0
5,Ibrahim Amadou,"DF,MF",Norwich City,759,6.9,2.02,0.95
6,Jefferson Lerma,MF,Bournemouth,1943,2.22,1.94,1.48
7,Alexander Tettey,"MF,DF",Norwich City,1782,2.93,1.77,1.72
8,Lewis Cook,MF,Bournemouth,1009,1.25,1.7,1.34
9,Scott McTominay,MF,Manchester Utd,1547,2.62,1.57,1.98


Cluster 3 certainly matches our *Holding Midfielder* group - all of these players are widely regarded as fitting this description.

In [22]:
MF_Stats[['Player','Pos','Squad','Min','CarryTotDist','Pass_Targ','Touches']].loc[MF_Stats['Cluster'] == 4].sort_values(by=['Pass_Targ'],ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,Player,Pos,Squad,Min,CarryTotDist,Pass_Targ,Touches
0,Rodri,"MF,DF",Manchester City,1957,350.0,80.6,99.2
1,İlkay Gündoğan,MF,Manchester City,1593,296.6,78.4,94.0
2,Mateo Kovačić,MF,Chelsea,1685,449.5,78.3,93.9
3,Paul Pogba,MF,Manchester Utd,522,322.2,72.4,91.7
4,Jorginho,MF,Chelsea,2011,287.6,69.1,92.2
5,Jordan Henderson,MF,Liverpool,1886,297.5,67.2,84.3
6,Dani Ceballos,MF,Arsenal,908,431.8,65.6,83.5
7,James Milner,"MF,DF",Liverpool,769,228.0,65.6,84.6
8,Harry Winks,MF,Tottenham,1587,326.5,62.0,78.6
9,Fred,MF,Manchester Utd,2029,324.0,60.3,83.5


And finally, cluster 4 seems to capture our *Deep Lying "Quarterbacks"* well. Ilkay Gundogan, Paul Pogba, Jorginho, Jordan Henderson, Dani Ceballos, and Harry Winks all play this role for their respective clubs. A few of these players - Rodri and Fred in particular - would arguably be thought of more as *Holding Midfielders*. One reason they may fall into this category according to our model is that their teams tend to dominate possession, so alongside their defensive activities they also see a lot of the ball.

### Conclusions and Possible Next Steps

Our clustering model seems to have worked very successfully. We have identified 5 clusters that align very closely to our initial expectations, both in terms of the key attributes highlighted, and the players that fall into these groups.

A model like this could be very useful for teams looking to identify gaps in their squad (for example, those lacking players from one or more of the groups), or when seeking to replace a departing player.

We might look to improve this model by adjusting each player's statistics to reflect their team's style. For example, adjusting for team possession might help us to identify players who perform more creative or attacking roles in teams that see less of the ball, but who are currently being put into a more defensive cluster because they don't get the opportunity to shine at the attacking end as often. Felipe Anderson at West Ham is a good example: he is an attacking midfielder, but is classified in the *Holding Midfielder* group because of his high defensive output. Equally, this might see players on high possession teams switch groups, as suggested above with Rodri or Fred.
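
One simple way to sketch a possession adjustment is to rescale defensive counts to what a player might record if the opposition had the ball half the time. The formula, the `TeamPossession` column and the figures below are all assumptions for illustration; they are not part of the dataset used in this workbook.

```python
import pandas as pd

# Hypothetical figures: tackles per 90 and team possession (%)
players = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Tkl": [2.0, 2.0],
    "TeamPossession": [60.0, 40.0],
})

# Assumed adjustment: scale by 50 / (opposition share of possession), so a
# player on a high-possession team gets credit for having fewer chances to tackle
opp_share = 100.0 - players["TeamPossession"]
players["Tkl_PossAdj"] = players["Tkl"] * 50.0 / opp_share
print(players)
```

Both toy players make the same number of raw tackles, but the one on the 60% possession side ends up with the higher adjusted figure. Attacking statistics could be adjusted in the opposite direction using the team's own share of the ball.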

We might also enhance the model further with data from other leagues, or else build a Nearest Neighbours model using this additional data to see how players from other leagues would be grouped by this existing model.
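
As a rough sketch of that last idea, the example below trains a k-nearest-neighbours classifier on the cluster labels from an existing model and uses it to place new players. The data is synthetic, and the two "styles" and two features are assumptions made purely to keep the example small. The key detail is that new players must be transformed with the scaler fitted on the original data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic "existing league": two well-separated styles, two features each
existing = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
                      rng.normal([5, 5], 0.3, (30, 2))])

# Fit the scaler and the clustering model on the existing players only
scaler = StandardScaler().fit(existing)
existing_scaled = scaler.transform(existing)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(existing_scaled)

# Train a nearest-neighbours classifier on the cluster labels
knn = KNeighborsClassifier(n_neighbors=5).fit(existing_scaled, kmeans.labels_)

# Hypothetical players from another league, scaled with the SAME scaler
new_players = np.array([[0.1, -0.2], [4.8, 5.3]])
new_scaled = scaler.transform(new_players)
knn_labels = knn.predict(new_scaled)
centroid_labels = kmeans.predict(new_scaled)  # nearest-centroid alternative
print(knn_labels, centroid_labels)
```

`KMeans.predict` assigns each new player to the nearest cluster centre, which is a simpler alternative when no retraining is wanted. The two approaches can disagree for players who sit near a boundary between clusters.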

But all of this is for another day.

Huge thanks to FBRef, and their partners StatsBomb, for making this brilliant dataset publicly available.