 # Cluster analysis of the top 100 valuable FIFA 2020 players
 
In this exercise, soccer players' skill ratings are used to infer their positions with K-means clustering, an unsupervised machine learning algorithm. The algorithm groups observations that show similar patterns into clusters. The data set does contain the players' positions, but they are excluded from the clustering; otherwise the exercise would be meaningless.
After the cluster analysis, the players' positions are compared with the clusters to see how well the analysis performed.
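
To make the idea concrete, the core loop of K-means can be sketched in a few lines of NumPy. This is a simplified illustration on made-up 2-D points, not the scikit-learn implementation used later in this notebook; the function name `kmeans_sketch` and its parameters are assumptions chosen for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Simplified K-means: alternate between assigning points and moving centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every observation to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move every center to the mean of its assigned observations;
        # a center without observations stays where it is
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Two made-up groups of points, far apart from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
```

The first three points should end up in one cluster and the last three in the other. Which cluster gets the label 0 or 1 is arbitrary, which is why the clusters later have to be interpreted by hand.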


### Table of contents:
*   [Data cleaning](#cleaning)
*   [Exploratory data analysis](#eda)
*   [Cluster analysis](#analys) 
*   [Conclusion](#conclusion)



In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import seaborn as sns

df = pd.read_csv('../Data/players_20.csv')


## Data cleaning <a id='cleaning'></a>

In [None]:
#Check the shape of the data frame
df.shape


There are 18278 players and 104 features in the data set. We will just focus on the top 100 players in this exercise.

In [None]:
# Extract the top100 players based on their market value
df_top100 = df.sort_values('value_eur',ascending=False)[:100]
df_top100.head()            

In [None]:
# Keep only the numerical columns (personal data and skill ratings) and drop the sofifa_id column
df_top100_new = df_top100[df_top100.describe().columns].copy()
df_top100_new.drop(columns=['sofifa_id'],inplace=True)
df_top100_new.head()

In [None]:
# Check for null values
df_top100_new.isnull().any()

It appears that some skill columns have null values. This is probably because goalkeepers are not rated on certain skills that outfield players have, and vice versa. We will replace the nulls with 0.

In [None]:
df_top100_new=df_top100_new.fillna(0)
df_top100_new.isnull().any()

All null values are now replaced. Let's explore the data set before cluster analysis.


## Exploratory data analysis (EDA) <a id='eda'></a>

Time to do some EDA to get a better picture of our data set.

Let's plot the age distribution among the 100 most valuable players.

In [None]:
df_top100['age'].hist()
plt.xlabel('Age')
plt.title('Age distribution among top 100 valuable  players')
plt.show()

From the histogram one can see that the average age is around 26 years and that the ages are roughly normally distributed.


As mentioned earlier, the players' positions will not be included in the clustering. For the EDA, however, the positions are interesting to explore. Let's first see how the positions are denoted for some of the players.


In [None]:
df_top100[['short_name','player_positions']].head()

For convenience, the positions from the data set need to be translated into the four positions: <i>Goalkeeper</i>, <i>Defender</i>, <i>Midfielder</i> and <i>Striker</i>.

At the moment the positions are divided into sub-groups. For instance, Messi is listed with the positions <i>RW</i>, <i>CF</i> and <i>ST</i>, which are all variants of the striker position.
Each sub-position will be replaced with one of the main positions.

In [None]:
# Plot the number of players for each position

def change_pos_name(row):
    """
    INPUT: A player's positions (sub-groups) as a comma-separated string
    OUTPUT: One of the four main positions
    
    The function splits the input string into sub-positions.
    Each sub-position is mapped to a main position and stored in the list "pos".
    If the list holds only one position, that position is returned.
    Otherwise the main position that occurs most often in the list is returned.
    If there is a tie, the first position in the list is returned.
    """ 
    pos = []
    positions = row.replace(",",'').split() # Split the string with the positions

    for i in positions:
        if i in ["RB" , "CB" , "LB" , "RCB" , "RWB" , "LCB"]:
            pos.append('Defender')
        if i in ["RW" , "CF" , "LW" , "ST" , "RS" , "LS" , "LF" , "RF"] :
            pos.append('Striker')
        if i in ["RM" , "CM" , "LM" , "CAM" , "LDM" , "RDM" , "LAM" , "RAM" , "CDM", "RCM", "LCM"]:
            pos.append('Midfielder')
        if i in ["GK"]:
            pos.append('Goalkeeper')

    if len(pos) == 1:
        return pos[0]
    else:
        if pos.count(pos[0]) == 1 and pos.count(pos[1]) == 1:
            return pos[0]
        if pos.count(pos[0]) == 1 and pos.count(pos[1]) != 1:
            return pos[1]
        if pos.count(pos[0]) != 1: 
            return pos[0]


df_top100['player_positions_update']=df_top100['player_positions'].apply(lambda row:change_pos_name(row))
df_top100['player_positions_update'].hist()
plt.show()

Midfielder is the most common position and goalkeeper the least common.
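
The majority-vote rule in `change_pos_name` can also be written more compactly with a lookup table and `collections.Counter`. The sketch below is an alternative illustration, not the function used in this notebook; the names `MAIN_POSITION` and `main_position` are assumptions, and unlike the original it returns `None` when no sub-position is recognised.

```python
from collections import Counter

# Sub-position lists copied from change_pos_name
SUB_POSITIONS = {
    "Defender": ["RB", "CB", "LB", "RCB", "RWB", "LCB"],
    "Striker": ["RW", "CF", "LW", "ST", "RS", "LS", "LF", "RF"],
    "Midfielder": ["RM", "CM", "LM", "CAM", "LDM", "RDM",
                   "LAM", "RAM", "CDM", "RCM", "LCM"],
    "Goalkeeper": ["GK"],
}

# Invert the table: sub-position -> main position (hypothetical helper)
MAIN_POSITION = {sub: main for main, subs in SUB_POSITIONS.items() for sub in subs}

def main_position(positions):
    """Return the most frequent main position; ties go to the first one listed."""
    mapped = [MAIN_POSITION[p]
              for p in positions.replace(",", " ").split()
              if p in MAIN_POSITION]
    if not mapped:
        return None  # assumption: unknown codes yield None instead of an error
    # most_common keeps first-seen order for equal counts
    return Counter(mapped).most_common(1)[0][0]

print(main_position("RW, CF, ST"))   # Striker
print(main_position("ST, CM, CDM"))  # Midfielder
print(main_position("CAM, ST"))      # Midfielder (tie, first listed wins)
```

Keeping the sub-position lists in one table means a new code only has to be added in a single place.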

In [None]:
# Plot the average market value for each position (error bars show the standard deviation)
sns.barplot(x='player_positions_update', y='value_eur',data=df_top100,ci="sd")
plt.show()

Although goalkeeper is the least common position among the 100 most valuable players, it has the second-highest average market value. Strikers have the highest average value and also the highest variance of all positions. Defenders have the lowest variance.
Let's plot the relation between wage and value for each position.

In [None]:
# Plot Wage vs Value for the positions

plt.figure(figsize=(12,4))
sns.scatterplot(data=df_top100, x="value_eur", y="wage_eur", hue='player_positions_update')
plt.title('Wage vs Value')
plt.legend()
plt.show()

When it comes to value, the scatterplot agrees with the previous barplot. One midfielder stands out in terms of value. However, the majority of the midfielders have only about 50% of that value, which lowers the average value seen in the barplot. The direction of the trend is the same for every position: a higher value comes with a higher wage. The strength of the trend differs, though: strikers show the strongest trend, while goalkeepers seem to have an almost horizontal one.

Let's also see whether age matters, and label the 10 players with the highest value in the plot. Age is divided into three categories: <i>Low age</i>, <i>Mid age</i> and <i>High age</i>.
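
As a small illustration of how `pd.cut` assigns the three categories, the sketch below applies the bin edges 19, 24, 29 and 34 to a few made-up ages (not taken from the data set). With `include_lowest=True` the lowest edge, 19, falls into the first bin, and each upper edge belongs to the bin it closes.

```python
import pandas as pd

# Made-up ages that sit on and between the bin edges
ages = pd.Series([19, 24, 25, 29, 30, 34])

age_cat = pd.cut(ages,
                 bins=[19, 24, 29, 34],
                 include_lowest=True,
                 labels=['Low age', 'Mid age', 'High age'])

print(age_cat.tolist())
# ['Low age', 'Low age', 'Mid age', 'Mid age', 'High age', 'High age']
```

Ages outside the range 19 to 34 would become `NaN`, so the edges have to cover every age in the data.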

In [None]:
df_top100_new['age_cat']=pd.cut(df_top100['age'],bins=[19,24,29,34],include_lowest=True,labels=['Low age', 'Mid age', 'High age'] )

fig, ax = plt.subplots(figsize=(13,5))

sns.scatterplot(data=df_top100_new, x="value_eur", y="wage_eur", hue='age_cat')

# The 10 most valuable players and their names
top10 = df_top100.sort_values(ascending=False, by='value_eur')[:10]
top10_name = top10['short_name']

# Annotate the 10 most valuable players with their names
for i, txt in enumerate(top10_name):
    ax.annotate(txt, (top10['value_eur'].iloc[i], top10['wage_eur'].iloc[i]))
plt.title('Wage vs Value')
plt.show()

From the plot we can see a trend that a higher value yields a higher wage.
Older players tend to have a higher wage than younger players with a similar value. Most of the younger players also have a lower value.

Now let's see which are the top 10 countries among the top 100 valuable players

In [None]:
nationality_count = df_top100.groupby('nationality')['sofifa_id'].count().sort_values(ascending=False)
plt.bar(nationality_count.index[:10],nationality_count[:10])
plt.xticks(rotation=45 )
plt.title('Top 10 nationalities among top 100 valuable players')
plt.show()


The European countries are dominating. Only Brazil and Argentina are from outside Europe.

## Cluster analysis with K-means clustering <a id='analys'></a>

Before the clustering, some features will be chosen. There are 104 features in total in the data set, but only a few of them will be included in this analysis. The feature <i>player_positions</i> will not be included. The values will be standardized to zero mean and unit variance, so that every feature contributes on the same scale to the distances K-means computes. The number of clusters will be determined with the <i>Elbow method</i>. Lastly, the clusters will be translated to player positions using different methods, such as the average skill set for each cluster, histograms and pairplots.
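
As a quick check of what standardization does, the sketch below compares `StandardScaler` with the same calculation done by hand. The ratings are made up for illustration and are not taken from the data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ratings for three players and two skills
X = np.array([[90.0, 30.0],
              [70.0, 50.0],
              [50.0, 70.0]])

scaled = StandardScaler().fit_transform(X)

# By hand: subtract the column mean, divide by the population standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Each column ends up with mean 0 and standard deviation 1. The shape of the distribution is unchanged; only its location and scale are.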

In [None]:
# Choose the features we will use
col = ['weak_foot', 'skill_moves',
       'shooting','passing', 'dribbling', 'defending',  
       'defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle', 'goalkeeping_diving',
       'goalkeeping_handling', 'goalkeeping_kicking',
       'goalkeeping_positioning', 'goalkeeping_reflexes',
      'attacking_crossing', 'attacking_finishing',
       'attacking_heading_accuracy', 'attacking_short_passing',
       'attacking_volleys']

df_top100_update = df_top100_new[col]


In [None]:
# Standardise the values
standard = StandardScaler()
df_standard = standard.fit_transform(df_top100_update)
df_standard

In [None]:
# Perform K-means clustering. Assuming 1 to 10 clusters
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, 
                    init = 'k-means++', 
                    random_state = 7)
    kmeans.fit(df_standard)
    wcss.append(kmeans.inertia_)

In [None]:
# Plot sum of squared distances of samples to their closest cluster center vs number of clusters
plt.plot(range(1,11),wcss,'o--')
plt.xlabel('Number of clusters')
plt.ylabel('Within-Cluster-Sum-of-Squares (WCSS)')
plt.title('Elbow method plot')
plt.show()

From the elbow method plot we can see that the elbow starts at 3 clusters.
However, we know that there are 4 distinct positions in football (goalkeeper, defender, midfielder, striker), so we will choose 4 clusters for our case. The elbow method is just an indicator, and the elbow can be hard to find if the sum of squared distances does not drop sharply.

We will use 4 clusters for our cluster analysis.
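
For reference, the WCSS plotted in the elbow chart is the sum of squared distances from each sample to the centre of its assigned cluster, which scikit-learn exposes as `inertia_`. The sketch below checks this on synthetic 2-D data (three made-up blobs, not the FIFA data); the blob centres and the explicit `n_init=10` are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs around made-up centres
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=7).fit(X)

# WCSS by hand: squared distance of each sample to its own cluster centre
wcss_manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

print(wcss_manual, km.inertia_)
```

The two numbers should agree up to floating-point error. Adding clusters can only lower this quantity, which is why the elbow, rather than the minimum, is used as the indicator.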

In [None]:
# K-means with 4 clusters

kmeans =KMeans(n_clusters = 4,
               init = 'k-means++',
               random_state = 7)
kmeans.fit(df_standard)

In [None]:
# Show the cluster label of each of the 100 players

kmeans.labels_

In [None]:
df_kmeans = df_top100_update.copy()
df_kmeans['cluster'] = kmeans.labels_
df_kmeans_analysis = df_kmeans.groupby('cluster').mean() 
df_kmeans_analysis

We grouped each cluster and averaged all the values. From the table we can already see some interesting patterns:

*    Cluster 1 has very high values when it comes to defending
*    Cluster 2 has value zero for skills like shooting, passing, dribbling and defending (these are the nulls that were replaced with 0), but very high values for the goalkeeping skills


We can almost conclude that Cluster 1 is defender and Cluster 2 is goalkeeper. Let's look at how each cluster is distributed for each skill.

In [None]:
#Visualise each skill for each cluster in a histogram 
pd.options.display.max_rows = 10500
df_kmeans_i = df_kmeans.reset_index()
for i in df_kmeans_i.iloc[:,1:20]:
    grid= sns.FacetGrid(df_kmeans_i, col='cluster')
    grid.map(plt.hist, i)
    plt.show()

From the histograms one can see that cluster 1 and cluster 2 each have some skills that are clearly stronger than in the other clusters. This agrees with the table above.

Let's use pairplots to see the correlation between the skills with respect to the clusters, and how the clusters are distributed with respect to each other.

In [None]:
sns.pairplot(df_kmeans,vars=['shooting','passing', 'dribbling', 'defending',  
                            'defending_marking', 'defending_standing_tackle'],                           
                             hue='cluster',palette="tab10")


plt.show()

In [None]:
sns.pairplot(df_kmeans,vars=[  'goalkeeping_handling', 'goalkeeping_kicking',
                           'goalkeeping_positioning', 'goalkeeping_reflexes',
                            'defending_sliding_tackle', 'goalkeeping_diving'],
                             hue='cluster',palette="tab10")
plt.show()

Cluster 2 is very easy to distinguish in the pairplot. The other clusters lie close to cluster 0 for certain skills, possibly because those players have a more offensive or defensive role. From earlier, the conclusion was that cluster 2 is goalkeeper and cluster 1 is defender. Cluster 3 is probably striker and cluster 0 midfielder. That makes sense, since some players in cluster 1 and cluster 3 are very close to cluster 0. It suggests that they sometimes switch positions, which is common for some players.

From the clustering:
*  cluster 0: Midfielder
*  cluster 1: Defender
*  cluster 2: Goalkeeper
*  cluster 3: Striker


In [None]:
# Change cluster number to position name
df_kmeans['cluster']=df_kmeans['cluster'].map({0:"Midfielder",
                                               1:"Defender",
                                               2:"Goalkeeper",
                                               3:"Striker"})


Compare the cluster results with the player positions that were provided from the data set.

In [None]:
# Take only the top 100 valuable players from the original data set
player_name = df.loc[df_kmeans.index][['short_name','player_positions']]

# Update players positions with the new position name
player_name['player_positions']=player_name['player_positions'].apply(lambda row:change_pos_name(row))

player_cluster = pd.concat([player_name,df_kmeans],axis=1)[['short_name','player_positions','cluster']]
player_cluster

Some midfielders seem to be difficult to cluster, which agrees with the pairplot as mentioned before.
Let's calculate how many mismatches there are among the 100 players.

In [None]:
def compare(row):
    """
    INPUT: A player's position from the data set and from the cluster
    OUTPUT: 1 or 0 depending on whether the two differ or not
    
    The function compares the position from the data set with the cluster analysis.
    If the positions are equal the function returns 0, otherwise it returns 1.
    """
    # Access by column label; integer positions on a labelled row are deprecated in recent pandas
    if row['player_positions'] != row['cluster']:
        return 1
    else:
        return 0

# Flag every player whose data set position differs from the cluster position
not_equal = player_cluster.apply(compare, axis=1)

not_equal.sum()


There are 22 mismatches among the 100 players. If we include all the positions for the players, how well do the cluster results hold up then?
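
The count alone does not show which positions are confused with each other. A cross-tabulation is one way to see that. The sketch below uses a made-up six-player frame with the same column names as `player_cluster`; the values are illustrative and are not results from the data set.

```python
import pandas as pd

# Made-up example: data set position vs. position assigned by the clustering
toy = pd.DataFrame({
    'player_positions': ['Striker', 'Midfielder', 'Midfielder',
                         'Defender', 'Goalkeeper', 'Midfielder'],
    'cluster':          ['Striker', 'Striker', 'Midfielder',
                         'Defender', 'Goalkeeper', 'Defender'],
})

# Rows: position in the data set, columns: position from the clustering
confusion = pd.crosstab(toy['player_positions'], toy['cluster'])
mismatches = int((toy['player_positions'] != toy['cluster']).sum())

print(confusion)
print(mismatches)  # 2
```

Off-diagonal cells mark the mismatches. In this toy frame both of them sit in the midfielder row.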

In [None]:
# Extract a first and a second main position for each player

def change_pos_name1(row):
    """
    INPUT: A player's positions (sub-groups) as a comma-separated string
    OUTPUT: Two main positions; the second one is '' if there is none
    
    The function splits the input string into sub-positions.
    The first sub-position is mapped to a main position and stored in the list "pos1".
    The remaining sub-positions are mapped and stored in the list "pos2"
    (for a player with three sub-positions, only the first and the third are mapped).
    The first entries of "pos1" and "pos2" are returned.
    """ 
    pos1 = []
    pos2 = []
    positions = row.replace(",",'').split() # Split the string with the positions
    
    if len(positions) == 3:
           
        for i in range(0,len(positions),2):

            if i == 0:  
                    if positions[i] in ["RB" , "CB" , "LB" , "RCB" , "RWB" , "LCB"]:
                        pos1.append('Defender')

                    if positions[i] in ["RW" , "CF" , "LW" , "ST" , "RS" , "LS" , "LF" , "RF"] :
                        pos1.append('Striker')

                    if positions[i] in ["RM" , "CM" , "LM" , "CAM" , "LDM" , "RDM" , "LAM" , "RAM" , "CDM", "RCM", "LCM"]:
                        pos1.append('Midfielder')

                    if positions[i] in ["GK"]:
                        pos1.append('Goalkeeper')

            else:
                if positions[i] in ["RB" , "CB" , "LB" , "RCB" , "RWB" , "LCB"]:
                        pos2.append('Defender')
                if positions[i] in ["RW" , "CF" , "LW" , "ST" , "RS" , "LS" , "LF" , "RF"] :
                        pos2.append('Striker')
                if positions[i] in ["RM" , "CM" , "LM" , "CAM" , "LDM" , "RDM" , "LAM" , "RAM" , "CDM", "RCM", "LCM"]:
                        pos2.append('Midfielder')
                if positions[i] in ["GK"]:
                        pos2.append('Goalkeeper')
    else:
        for i in range(0,len(positions)):
            if i == 0:  
                    if positions[i] in ["RB" , "CB" , "LB" , "RCB" , "RWB" , "LCB"]:
                        pos1.append('Defender')

                    if positions[i] in ["RW" , "CF" , "LW" , "ST" , "RS" , "LS" , "LF" , "RF"] :
                        pos1.append('Striker')

                    if positions[i] in ["RM" , "CM" , "LM" , "CAM" , "LDM" , "RDM" , "LAM" , "RAM" , "CDM", "RCM", "LCM"]:
                        pos1.append('Midfielder')

                    if positions[i] in ["GK"]:
                        pos1.append('Goalkeeper')

            else:
                if positions[i] in ["RB" , "CB" , "LB" , "RCB" , "RWB" , "LCB"]:
                        pos2.append('Defender')
                if positions[i] in ["RW" , "CF" , "LW" , "ST" , "RS" , "LS" , "LF" , "RF"] :
                        pos2.append('Striker')
                if positions[i] in ["RM" , "CM" , "LM" , "CAM" , "LDM" , "RDM" , "LAM" , "RAM" , "CDM", "RCM", "LCM"]:
                        pos2.append('Midfielder')
                if positions[i] in ["GK"]:
                        pos2.append('Goalkeeper')
    try:
        return [pos1[0], pos2[0]]
    except IndexError:
        # No second position listed: leave the second entry empty
        return [pos1[0], '']
lst_pos1 = []
lst_pos2 = []

df_top100['player_positions1']=df_top100['player_positions'].apply(lambda row:change_pos_name1(row))

for i in df_top100['player_positions1']:
    if len(i) == 1:
        lst_pos1.append(i[0])
    else:
        lst_pos1.append(i[0])
        lst_pos2.append(i[1])
        
df_player_pos = pd.DataFrame(list(zip(lst_pos1,lst_pos2)),columns=['position 1','position 2'])

df_kmeans = df_kmeans.reset_index(drop=True)
# Named df_compare so that the compare() function defined earlier is not overwritten
df_compare = pd.concat([df_player_pos,df_kmeans],axis=1)[['position 1','position 2','cluster']]
df_compare
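
One possible way to score this comparison is to count a player as matched when the cluster equals either of the two listed positions. The sketch below shows the idea on a made-up four-player frame that reuses the column names `position 1`, `position 2` and `cluster`; the values are illustrative and are not results from the data set.

```python
import pandas as pd

# Made-up example with a first position, an optional second position and the cluster
toy = pd.DataFrame({
    'position 1': ['Striker', 'Midfielder', 'Defender', 'Midfielder'],
    'position 2': ['Midfielder', 'Striker', '', 'Defender'],
    'cluster':    ['Midfielder', 'Midfielder', 'Defender', 'Striker'],
})

# A player matches if the cluster agrees with the first or the second position
match_any = (toy['cluster'] == toy['position 1']) | (toy['cluster'] == toy['position 2'])
n_mismatch = int((~match_any).sum())

print(n_mismatch)  # 1
```

This rule is more lenient than the single-position comparison, so it can only give the same number of mismatches or fewer.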

## Conclusion <a id='conclusion'></a>

From the data set, I took the 100 most valuable players.
EDA showed:
*    Midfielder is the most common position and goalkeeper the least common
*    Strikers have the highest average market value and goalkeepers the second highest. Defenders have the lowest average value.
*    The average age among the players is around 26 years
*    The values of the younger players are skewed more towards the lower range compared to the other age groups
*    The top 10 countries are dominated by European countries, with France, Spain and Germany as the top 3. Brazil and Argentina are the only countries from outside Europe.

Different methods were used to find patterns among the clusters during the cluster analysis. Two distinct clusters, defender and goalkeeper, could be identified from the average skill set. The midfielder cluster was not well separated from the defender and striker clusters. One reason may be that players switch between two positions depending on the situation, which makes their skills less distinct.
Lastly, the positions determined by the cluster analysis were compared with the positions provided in the data set. 22 of the 100 positions did not match the provided data. Most of these mismatches arise because the midfielder skills are not well separated from the striker and defender skills for some players.