## *Player Similarity Study; call it a Player Recommender*

Find a replacement who resembles a departing player by comparing several aspects of the game: offensive contributions such as shots, dribbles, and presence in the penalty box, and defensive contributions such as aerial duels and defensive actions.

The Jupyter notebook provides an overview of the following:

- Accessing Wyscout data and selecting role/position templates
- Understanding data scaling and its purpose
- Why data is aggregated per 90 minutes
- Adjusting metrics to emphasize specific player profiles
- Comparing how well the cosine and Euclidean metrics capture similarities
- Presenting a use case study and using radar charts for data visualization
    
*By: EL Mehdi DAHBI*

### *Imports*

In [1]:
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import pairwise_distances
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
df= pd.read_parquet('../data/players.parquet')
df.head()

Unnamed: 0,Joueur,Équipe,Équipe dans la période sélectionnée,Place,Âge,Valeur marchande,Minutes jouées,Buts,xG,Passes décisives,...,Passes pénétrantes par 90,"Passes en profondeur précises, %",Passes progressives par 90,"Passes progressives précises, %",League,id,90s,Duels aériens gagnés par 90,Dribbles réussis par 90,xG/Tir
0,P. Onuachu,Southampton,Genk,CF,28,17000000,1263,16,11.71,0,...,0.29,0.0,1.07,46.67,Jupiler Pro League,0.0,14.03,2.99376,1.142108,0.29275
1,A. Skov Olsen,Club Brugge,Club Brugge,"RWB, RAMF, RW",23,17000000,1681,7,5.79,7,...,0.64,50.0,7.01,87.79,Jupiler Pro League,0.0,18.68,0.215,4.38864,0.103393
2,Sergio Gómez,Manchester City,Anderlecht,LB,22,15000000,14,0,0.0,0,...,0.0,0.0,0.0,0.0,Jupiler Pro League,0.0,0.16,0.0,0.0,0.0
3,N. Lang,Club Brugge,Club Brugge,"LAMF, CF, LWF",23,15000000,2456,9,9.13,6,...,2.31,39.68,6.71,86.34,Jupiler Pro League,0.0,27.29,0.255884,3.663378,0.169074
4,Fábio Silva,PSV,Anderlecht,CF,20,13000000,1632,7,7.66,1,...,0.94,29.41,1.93,80.0,Jupiler Pro League,0.0,18.13,1.048662,1.819298,0.170222


### *Player metrics templates*
The templates below define distinct player roles, modeled on the StatsBomb similarity tool. Each role groups the performance metrics that matter most for that position on the field.

For instance, the "Striker" template emphasizes offensive output such as *expected goals per 90, shots per 90, and successful dribbles per 90,* alongside *successful defensive actions per 90.* By contrast, the "Midfielder" template highlights *passing accuracy, progressive passes per 90, expected assists per 90, and successful dribbles per 90,* which are crucial for orchestrating play in the middle of the field.

Each template tailors its metrics to the demands and responsibilities of the position, providing a framework to evaluate and study similarities between players' performances within these roles.

In [3]:
templates = {
    "Striker": [
        '90s',
        'Joueur',
        'Équipe',
        'Place',
        'Âge',
        'xG par 90', 
        'Tirs par 90',
        'Touches de balle dans la surface de réparation sur 90',
        'Actions défensives réussies par 90',
        'Duels aériens gagnés par 90',
        'Dribbles réussis par 90',
        'xG/Tir',
        'Passes réceptionnées par 90'
        ],
    
    "Winger/Attacking Midfielder": [
        '90s',
        'Joueur',
        'Équipe',
        'Place',
        'Âge',
        'xG par 90',
        'xA par 90',
        'Tirs par 90', 
        'Touches de balle dans la surface de réparation sur 90',
        'Сentres précises, %', 
        'Fautes subies par 90',
        'Dribbles réussis par 90', 
        'Passes vers la surface de réparation précises, %', 
        'Interceptions PAdj'
        ],
    
    "Midfielder": [
        '90s',
        'Joueur',
        'Équipe',
        'Place',
        'Âge',
        'Passes précises, %',
        'Passes progressives par 90', 
        'xA par 90', 
        'Dribbles réussis par 90',
        'Fautes subies par 90', 
        'Interceptions PAdj', 
        'Courses progressives par 90'
        ],
    
    "Defender": [
        '90s',
        'Joueur',
        'Équipe',
        'Place',
        'Âge',
        'Passes réceptionnées par 90',
        'Passes en avant précises, %',
        'Interceptions PAdj',
        'Tacles glissés PAdj',
        'Fautes par 90', 
        'Duels aériens par 90', 
        'Duels aériens gagnés, %'
        ],
    
    "Full Wing Back": [
        '90s',
        'Joueur',
        'Équipe',
        'Place',
        'Âge',
        'Passes réceptionnées par 90',
        'Passes en avant précises, %',
        'Interceptions PAdj',
        'Tacles glissés PAdj',
        'Fautes par 90',
        'Duels aériens par 90',
        'Duels aériens gagnés, %',
        'Сentres précises, %',
        'xA par 90']
}

# Collect every column referenced by any template
# (columns shared between templates, such as 'Âge', appear more than once)
columns_to_keep = [col for template in templates.values() for col in template]

# Filter the DataFrame to keep only the specified columns
filtered_df = df[columns_to_keep]

filtered_df.head()

Unnamed: 0,90s,Joueur,Équipe,Place,Âge,xG par 90,Tirs par 90,Touches de balle dans la surface de réparation sur 90,Actions défensives réussies par 90,Duels aériens gagnés par 90,...,Âge.1,Passes réceptionnées par 90,"Passes en avant précises, %",Interceptions PAdj,Tacles glissés PAdj,Fautes par 90,Duels aériens par 90,"Duels aériens gagnés, %","Сentres précises, %",xA par 90
0,14.03,P. Onuachu,Southampton,CF,28,0.83,2.85,4.42,3.28,2.99376,...,28,13.61,57.89,2.38,0.1,2.21,7.2,41.58,0.0,0.05
1,18.68,A. Skov Olsen,Club Brugge,"RWB, RAMF, RW",23,0.31,3.0,4.12,4.18,0.215,...,23,27.52,65.52,3.45,0.08,0.64,0.86,25.0,42.68,0.26
2,0.16,Sergio Gómez,Manchester City,LB,22,0.0,0.0,0.0,0.0,0.0,...,22,0.0,0.0,22.5,0.0,0.0,0.0,0.0,0.0,0.0
3,27.29,N. Lang,Club Brugge,"LAMF, CF, LWF",23,0.33,1.98,3.88,3.44,0.255884,...,23,29.24,68.86,2.28,0.05,0.84,1.06,24.14,30.77,0.37
4,18.13,Fábio Silva,PSV,CF,20,0.42,2.48,5.68,2.81,1.048662,...,20,12.19,47.37,1.07,0.0,1.49,4.14,25.33,9.09,0.08


*You may have noticed that most metrics are expressed on a per-90-minute basis. This puts players on a level footing regardless of how much they played and keeps the comparison fair.*
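
As a minimal sketch of what per-90 normalization means, the snippet below recomputes a per-90 value from raw totals. The `per_90` helper is hypothetical (it is not part of the notebook), and the sample figures are P. Onuachu's minutes and xG from the `df.head()` output above.

```python
import pandas as pd

def per_90(total: pd.Series, minutes: pd.Series) -> pd.Series:
    """Convert a raw total into a per-90-minute rate (hypothetical helper)."""
    nineties = minutes / 90  # number of full 90s played
    return total / nineties

# P. Onuachu's totals, copied from the df.head() output above
sample = pd.DataFrame({
    "Joueur": ["P. Onuachu"],
    "Minutes jouées": [1263],
    "xG": [11.71],
})
sample["90s"] = (sample["Minutes jouées"] / 90).round(2)
sample["xG par 90"] = per_90(sample["xG"], sample["Minutes jouées"]).round(2)
print(sample)
```

The recomputed values agree with the `90s` (14.03) and `xG par 90` (0.83) columns shown for the same player. A player with zero minutes would produce a division by zero, so such rows would need to be handled separately.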

### *Scaling data*
Scaling the data with StandardScaler before computing cosine similarities ensures that the similarity measure reflects the relationship between features rather than their scale.

For instance, passes received per 90 range from about 10 to 22, while expected goals (xG) per 90 range from about 0.1 to 2. If both metrics enter the calculation unscaled, the one with the larger magnitude dominates the result. Scaling brings all metrics onto a comparable scale and removes this bias.
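
To illustrate, here is a small sketch using made-up values within the ranges quoted above (the numbers are illustrative, not taken from the dataset). It shows that the two columns have very different spreads before scaling and the same spread afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative values only (not from the dataset)
toy = pd.DataFrame({
    "Passes réceptionnées par 90": [10.0, 14.0, 18.0, 22.0],
    "xG par 90": [0.1, 0.5, 1.2, 2.0],
})

# StandardScaler centers each column on 0 and divides by its population std
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy.std(ddof=0))     # spreads differ widely before scaling
print(scaled.std(ddof=0))  # both columns have unit spread after scaling
```

Note that `StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when checking the result with pandas.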

In [4]:
def scale_data(df, template):
    """
    Scale the data in the DataFrame based on the specified template using StandardScaler.

    Parameters:
        df (DataFrame): The input DataFrame.
        template (list): List of column names to be included in the scaled DataFrame.

    Returns:
        DataFrame: Scaled DataFrame based on the specified template.
    """
    # Identifier columns that should not be scaled
    no_to_scale = ['90s', 'Joueur', 'Équipe', 'Place', 'Âge']
    df = df[template]
    features = df.drop(columns=no_to_scale)

    # Fit the scaler and transform the metric columns in one step
    scaled_features = StandardScaler().fit_transform(features)
    # Reuse the original index so the concat below aligns row by row
    scaled_feat_df = pd.DataFrame(scaled_features, columns=features.columns, index=df.index)

    return pd.concat([df[no_to_scale], scaled_feat_df], axis=1)

### *Placing importance on specific metrics, Cosine or Euclidean similarity?*

Giving more weight to particular metrics lets us look not just for a replacement, but for a precise match to the desired profile (emphasizing strengths and overlooking weaknesses).
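
A toy sketch of the effect of weighting, using made-up, already-scaled values for two metrics and hypothetical weights. Without weights, players B and C are equally similar to the target A; once xG is weighted more heavily, B (who matches A on xG) ranks clearly above C.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Illustrative scaled values for [xG par 90, Dribbles réussis par 90]
players = ["A (target)", "B", "C"]
X = np.array([
    [1.0, 1.0],    # A: strong in both metrics
    [1.0, -1.0],   # B: matches A on xG only
    [-1.0, 1.0],   # C: matches A on dribbles only
])

weights = np.array([1.0, 0.2])  # hypothetical weights: xG matters most

unweighted = 1 - pairwise_distances(X, metric="cosine")
weighted = 1 - pairwise_distances(X * weights, metric="cosine")

# Compare each candidate with the target (row 0)
for name, u, w in zip(players[1:], unweighted[0, 1:], weighted[0, 1:]):
    print(f"{name}: unweighted={u:.2f}, weighted={w:.2f}")
```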

The choice of cosine similarity is motivated by a previous document-similarity project of mine, from which I concluded that comparing documents without considering magnitude tends to give better results. The same reasoning applies here: even if two documents or players are far apart in Euclidean distance, the angle between them can still be small, and that angle captures what is commonly referred to as style.
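
The difference between the two metrics can be sketched with three made-up profiles (illustrative values, not from the dataset). Player 2 has the same proportions as player 1 at three times the magnitude, while player 3 has a similar magnitude but different proportions.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Illustrative scaled profiles
X = np.array([
    [1.0, 2.0, 0.5],   # player 1
    [3.0, 6.0, 1.5],   # player 2: same proportions, larger magnitude
    [2.0, 0.5, 1.0],   # player 3: similar magnitude, different proportions
])

euclidean = pairwise_distances(X, metric="euclidean")
cosine_sim = 1 - pairwise_distances(X, metric="cosine")

print("Euclidean distance 1-2:", round(euclidean[0, 1], 2), "| 1-3:", round(euclidean[0, 2], 2))
print("Cosine similarity  1-2:", round(cosine_sim[0, 1], 2), "| 1-3:", round(cosine_sim[0, 2], 2))
```

Euclidean distance places player 3 closer to player 1, whereas cosine similarity treats player 2 as the better match because the two profiles point in the same direction.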

In [5]:
# Defining selected features for the similarity computation (you can adjust this list)
selected_features = [
    'xG par 90', 'Tirs par 90', 'Touches de balle dans la surface de réparation sur 90',
    'Actions défensives réussies par 90', 'Duels aériens gagnés par 90',
    'Dribbles réussis par 90', 'xG/Tir', 'Passes réceptionnées par 90'
]

# Define weights for each metric 
metric_weights = {
    'xG par 90': 0.5,
    'Tirs par 90': 0.3,
    'Touches de balle dans la surface de réparation sur 90': 0.2,
    'Actions défensives réussies par 90': 0.4,
    'Duels aériens gagnés par 90': 0.3,
    'Dribbles réussis par 90': 0.2,
    'xG/Tir': 0.5,
    'Passes réceptionnées par 90': 0.4
}

# Scaling data (copy so the weights below modify an independent DataFrame)
X = scale_data(df, templates["Striker"])[selected_features].copy()

# Applying weights to data
for metric, weight in metric_weights.items():
    X[metric] *= weight

# Calculate pairwise cosine similarities
cosine_similarities = 1 - pairwise_distances(X, metric='cosine')

# Define a function to find similar players based on cosine similarity (or an appropriate metric)
def find_similar_players(player_name, df, cosine_similarities):
    """
    Find similar players based on cosine similarity.

    Parameters:
        player_name (str): The name of the player to find similarities for.
        df (DataFrame): The input DataFrame containing player data.
        cosine_similarities (array): Pairwise cosine similarities between players.

    Returns:
        DataFrame: DataFrame containing similar players and their similarity percentages.
    """
    # Find the row position of the specified player (assumes df has a default RangeIndex)
    player_index = df[df['Joueur'] == player_name].index[0]
    # Express the player's similarity to every other player as a percentage
    similarities = cosine_similarities[player_index] * 100
    similar_players = df.copy()

    similar_players['Similarity Percentage'] = similarities
    # Sort from most to least similar and round numeric columns to one decimal
    similar_players = similar_players.sort_values(by='Similarity Percentage', ascending=False).round(1)

    return similar_players

### *Find similar players to Serge Gnabry*

In [6]:
# S.Gnabry Use case:
player_name = 'S. Gnabry'  
similar_players = find_similar_players(player_name, df, cosine_similarities)
similar_players[['Joueur', 'Équipe', 'League', 'Place', 'Âge', '90s', 'Similarity Percentage']].reset_index(drop=True)[1:11]

Unnamed: 0,Joueur,Équipe,League,Place,Âge,90s,Similarity Percentage
1,Ansu Fati,Barcelona,La Liga,"LWF, CF, LAMF",20,16.8,99.4
2,K. Mbappé,PSG,Ligue 1,CF,24,33.4,98.8
3,D. Berardi,Sassuolo,Italian Serie A,"RWF, RW, RAMF",28,22.3,98.6
4,M. Vlap,Twente,Eredevise,"AMF, RCMF, LCMF",26,29.6,98.5
5,N. Pépé,Nice,Ligue 1,"CF, RW, RWF",27,18.0,98.0
6,M. Tel,Bayern München,German Bundesliga,"CF, RAMF",18,5.4,97.7
7,S. Szymański,Fenerbahçe,Eredevise,"AMF, LCMF",24,23.5,97.4
8,P. Dybala,Roma,Italian Serie A,"CF, AMF, RWF",29,20.9,97.2
9,J. Ito,Reims,Jupiler Pro League,"RAMF, RW",30,1.1,97.1
10,Z. Aboukhlal,Toulouse,Ligue 1,"RAMF, RW, RWF",23,29.3,97.1


***Results Interpretation:** Keep in mind that the players returned may not have played as much as the selected player. For instance, Ito has played barely more than one full 90 and Bayern's wonderkid (M. Tel) only 5.4 full 90s, compared with Serge Gnabry, who played almost 17 full 90s. Filtering by attributes such as playing time is therefore important to get the most out of the model, which is why checking the Streamlit app is worthwhile.*
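
A minimal sketch of such a filter, assuming the results keep the `90s` column. The `filter_by_playing_time` helper and the threshold of 10 full 90s are hypothetical choices for illustration; the sample rows are copied from the results table above.

```python
import pandas as pd

def filter_by_playing_time(similar_players: pd.DataFrame, min_90s: float) -> pd.DataFrame:
    """Keep only players with at least `min_90s` full 90s played (hypothetical helper)."""
    return similar_players[similar_players["90s"] >= min_90s].reset_index(drop=True)

# A few rows copied from the results table above
results = pd.DataFrame({
    "Joueur": ["Ansu Fati", "K. Mbappé", "M. Tel", "J. Ito"],
    "90s": [16.8, 33.4, 5.4, 1.1],
    "Similarity Percentage": [99.4, 98.8, 97.7, 97.1],
})

# The threshold is an arbitrary example value
print(filter_by_playing_time(results, min_90s=10))
```

With this threshold, M. Tel and J. Ito drop out of the list while the ranking of the remaining players is preserved.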