## Similarity Functions

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

In [None]:
df_full = pd.read_csv('https://raw.githubusercontent.com/brendad8/Datasets/main/t5_leagues.csv')

In [None]:
model_cols = ["PK","PKatt","Gls/90","Ast/90","xG/90","xAG/90","CrdY","CrdR","pass.Att","pass.Cmp%","pass.TotDist","pass.PrgDist","pass.short.Att","pass.med.Att","pass.long.Att","pass.KP","pass.1/3","pass.PPA","pass.CrsPA","pass.Prog","shoot.Sh/90","shoot.SoT/90","shoot.Dist","shoot.FK","defense.Plyrs_Tkld","defense.TklW","defense.Def 3rd","defense.Mid 3rd","defense.Att 3rd","defense.Tkl_Drib","defense.Past","defense.Blocks","defense.Int","defense.Clr","pos.Touches","pos.Def Pen","pos.Def 3rd","pos.Mid 3rd","pos.Att 3rd","pos.Att Pen","pos.drib.Att","pos.drib.Mis","pos.drib.Dis","pos.rec.Rec","pos.rec.Prog","sca.S.SCA90","sca.S.PassLive","sca.S.PassDead","sca.S.Drib","sca.S.Fld","sca.S.Def","sca.S.GCA90"]

In [None]:
per90_cols = ["pass.Att","pass.TotDist","pass.PrgDist","pass.short.Att","pass.med.Att","pass.long.Att","pass.KP","pass.1/3","pass.PPA","pass.CrsPA","pass.Prog","defense.Plyrs_Tkld","defense.TklW","defense.Def 3rd","defense.Mid 3rd","defense.Att 3rd","defense.Tkl_Drib","defense.Past","defense.Blocks","defense.Int","defense.Clr","pos.Touches","pos.Def Pen","pos.Def 3rd","pos.Mid 3rd","pos.Att 3rd","pos.Att Pen","pos.drib.Att","pos.drib.Mis","pos.drib.Dis","pos.rec.Rec","pos.rec.Prog","sca.S.PassLive","sca.S.PassDead","sca.S.Drib","sca.S.Fld","sca.S.Def"]

In [None]:
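# Keep outfield players with at least 1000 minutes played
# (.copy() so the per-90 conversion below does not write into a view of df_full)
df_final = df_full[df_full['Pos'] != 'GK']
df_final = df_final[df_final['Min'] >= 1000].copy()

# Convert raw count stats to per-90 values by dividing by the number of 90s played
for col in per90_cols:
  df_final[col] = df_final[col] / df_final['90s']
df_model = df_final[model_cols].fillna(0)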

Standardize variables

In [None]:
scaler = StandardScaler()
scaler.fit(df_model)
df_model_std = pd.DataFrame(scaler.transform(df_model), columns=df_model.columns, index=df_model.index)
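As a quick sanity check on what standardizing does, here is a minimal sketch on made-up numbers (the `toy` frame is a stand-in for `df_model`, not real data). It assumes the default `StandardScaler` settings, which subtract each column's mean and divide by its population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for df_model (values are made up for illustration)
toy = pd.DataFrame({"Gls/90": [0.1, 0.3, 0.5], "pass.Att": [40.0, 60.0, 80.0]})

# StandardScaler with default settings
scaled = StandardScaler().fit_transform(toy)

# Same thing by hand: subtract the column mean, divide by the population std
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))  # True
```

After this step every column is on the same scale, so a high-volume stat like pass attempts does not dominate the distance calculations over a low-volume stat like goals per 90.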

### Euclidean Distance Method

In [None]:
# Input --> Player Name (str)
# Output --> df of 10 most similar players (using euclidean distance) from most similar to least similar

In [None]:
euc_full = pd.DataFrame(euclidean_distances(df_model_std), columns=df_model.index, index=df_model.index)

In [None]:
def get_sim_players_euc(player):
  player_idx = df_final[df_final['Player'] == player].index[0]
  order = euc_full[player_idx].sort_values(ascending=True).index.to_numpy()
  return df_final.loc[order][['Player', 'Age', 'Squad', 'Pos']][1:11]
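The `[1:11]` slice relies on the player sorting first in their own column (distance 0 to themselves). A possible alternative, sketched below on a made-up distance matrix, is to drop the player's own label before sorting, which still behaves if another row sits at distance 0 as well. `top_k_neighbors` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy symmetric distance matrix indexed by row label (values are made up).
# Rows 10 and 30 are identical profiles, so their distance is 0.
labels = [10, 20, 30, 40]
dist = pd.DataFrame(
    [[0.0, 1.5, 0.0, 3.0],
     [1.5, 0.0, 1.5, 2.0],
     [0.0, 1.5, 0.0, 3.0],
     [3.0, 2.0, 3.0, 0.0]],
    index=labels, columns=labels,
)

def top_k_neighbors(dist, idx, k):
    """Hypothetical helper: the k nearest labels to idx, excluding idx itself."""
    return dist[idx].drop(idx).sort_values(kind="stable").index[:k].tolist()

print(top_k_neighbors(dist, 10, 2))  # [30, 20]
```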

### Cosine Similarity Method

In [None]:
# Input --> Player Name (str)
# Output --> df of 10 most similar players (using cosine similarity) from most similar to least similar

In [None]:
cos_sim_full = pd.DataFrame(cosine_similarity(df_model_std), columns=df_model.index, index=df_model.index)

In [None]:
def get_sim_players_cos(player):
  player_idx = df_final[df_final['Player'] == player].index[0]
  order = cos_sim_full[player_idx].sort_values(ascending=False).index.to_numpy()
  return df_final.loc[order][['Player', 'Age', 'Squad', 'Pos']][1:11]
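The two methods can disagree about who is "closest". A small sketch with made-up vectors (not real player data): cosine similarity only looks at the direction of a profile, while Euclidean distance also cares about its magnitude.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Made-up standardized profiles:
# b has the same shape as a at twice the magnitude,
# c has a different shape but sits closer to a in absolute terms.
a = np.array([[1.0, 2.0, -1.0]])
b = 2 * a
c = np.array([[1.0, 1.0, 0.0]])

cos_ab = cosine_similarity(a, b)[0, 0]
cos_ac = cosine_similarity(a, c)[0, 0]
euc_ab = euclidean_distances(a, b)[0, 0]
euc_ac = euclidean_distances(a, c)[0, 0]

print(f"cosine:    a~b = {cos_ab:.3f}, a~c = {cos_ac:.3f}")
print(f"euclidean: a~b = {euc_ab:.3f}, a~c = {euc_ac:.3f}")
```

Here cosine similarity ranks `b` as the better match (same style, more volume), while Euclidean distance ranks `c` as the better match. For comparing playing styles across players with different overall output, that is one reason to lean on the cosine version.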

## Using Similarity Functions

GOAL: Find a lesser-known Defender, Hybrid Player, and Attacker who each compare favorably to elite players


Plan:

* Select several elite players
* For each elite player, get the most similar players (cos_sim) into a common table
* Group by Player and see which players appear most often
* Exclude the elite players used to build the lists
* Select a player by considering all the data on them
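The plan above can be sketched on toy data as follows. This is only an illustration: `common_similar` and `toy_similar` are hypothetical names, and the dictionary lookup stands in for `get_sim_players_cos`.

```python
import pandas as pd

# Made-up lookup standing in for get_sim_players_cos: player -> similar players
toy_similar = {
    "Elite A": ["Elite B", "Prospect X", "Prospect Y"],
    "Elite B": ["Elite A", "Prospect X", "Prospect Z"],
    "Elite C": ["Prospect X", "Prospect Y", "Elite A"],
}

def common_similar(elite, get_similar, min_count):
    """Hypothetical helper: non-elite players appearing in >= min_count lists."""
    # Pool every elite player's similar-player list into one Series
    names = pd.Series([p for e in elite for p in get_similar(e)])
    # Drop the elite players themselves, then count appearances
    counts = names[~names.isin(elite)].value_counts()
    return counts[counts >= min_count]

result = common_similar(list(toy_similar), toy_similar.get, min_count=2)
print(result.to_dict())  # {'Prospect X': 3, 'Prospect Y': 2}
```

Working with the counts as a Series (rather than wrapping them in a DataFrame and indexing a column by name) avoids depending on how a given pandas version names the `value_counts` output.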

In [None]:
attackers = ['Kevin De Bruyne', 'Lionel Messi', 'Serge Gnabry', 'Kylian Mbappé', 'Vinicius Júnior', 'Raheem Sterling', 'Mohamed Salah', 'Ousmane Dembélé', 'Sadio Mané', 'Ángel Di María', 'Riyad Mahrez', 'Gabriel Jesus', 'Karim Benzema', 'Harry Kane', 'Son Heung-min']
hybrids = ['Frenkie de Jong', 'Luka Modrić', 'Rodri', 'Toni Kroos', 'Nicolò Barella', 'Paul Pogba', 'Jordan Henderson', 'Joshua Kimmich', 'Marco Verratti', 'Thiago Alcántara', 'Bernardo Silva']
defenders = ['John Stones', 'Aymeric Laporte', 'Antonio Rüdiger', 'Presnel Kimpembe', 'Rúben Dias', 'Marquinhos', 'Thiago Silva', 'David Alaba', 'Joël Matip', 'Dayot Upamecano']

#### Find attacker

In [None]:
# Seed the results table with Neymar's most similar players
# (Neymar acts as one more elite attacker); this makes concatenating results much easier

# The code below lists players outside the elite list who were in the
# top 10 most similar players for at least 4 of the elite attackers

df_att = get_sim_players_cos('Neymar')
for att in attackers:
  df_sim = get_sim_players_cos(att)
  df_att = pd.concat([df_att, df_sim], ignore_index=False)
# Drop the elite players themselves (including Neymar), then count appearances of everyone else
att_res = df_att[~df_att['Player'].isin(attackers + ['Neymar'])]['Player']
att_counts = att_res.value_counts()
df_final[df_final['Player'].isin(att_counts[att_counts >= 4].index)][['Player', 'Nation', 'Pos', 'Squad', 'Age', '90s']]


Unnamed: 0,Player,Nation,Pos,Squad,Age,90s
345,Mason Mount,eng ENG,MF,Chelsea,22.0,26.3
1210,Memphis Depay,nl NED,FW,Barcelona,27.0,20.6
1933,Amine Gouiri,fr FRA,FWMF,Nice,21.0,30.6
2061,Lovro Majer,hr CRO,MFFW,Rennes,23.0,20.4
2256,Karl Toko Ekambi,cm CMR,FW,Lyon,28.0,25.5
2417,Joaquín Correa,ar ARG,FW,Inter,26.0,11.4
2664,Luis Muriel,co COL,FWMF,Atalanta,30.0,17.1
2908,Duván Zapata,co COL,FW,Atalanta,30.0,19.1


I like the look of Lovro Majer from Rennes.

Really good goal and assist numbers per 90 (0.29 goals and 0.39 assists)

Good chance creation with passing:

* 2.2 key passes per 90 (passes leading to shots)
* 4.16 passes into the final third per 90

Touches:

* High touch counts in the middle and final thirds
* 3.03 touches in the opposition penalty area per 90

In [None]:
display(df_final[df_final['Player'] == 'Lovro Majer'][['Player', 'Squad', 'Age'] + model_cols].transpose())

Unnamed: 0,2061
Player,Lovro Majer
Squad,Rennes
Age,23.0
PK,0
PKatt,0
Gls/90,0.29
Ast/90,0.39
xG/90,0.26
xAG/90,0.29
CrdY,1


Which players are most similar to Majer?

In [None]:
get_sim_players_cos('Lovro Majer')

Unnamed: 0,Player,Age,Squad,Pos
936,Marco Reus,32.0,Dortmund,MFFW
1009,Dominik Szoboszlai,20.0,RB Leipzig,MFFW
112,Kevin De Bruyne,30.0,Manchester City,MF
345,Mason Mount,22.0,Chelsea,MF
1848,Ángel Di María,33.0,Paris S-G,FW
1693,Yacine Adli,21.0,Bordeaux,MFFW
608,Julian Brandt,25.0,Dortmund,MFFW
1056,Florian Wirtz,18.0,Leverkusen,MFFW
1208,Ousmane Dembélé,24.0,Barcelona,FW
739,Jonas Hofmann,29.0,M'Gladbach,MFFW


Being similar to Kevin De Bruyne can never be a bad thing. It looks to me like Majer matches best with attacking midfielders (Szoboszlai, De Bruyne, Mount, Brandt, Wirtz...).

#### Find hybrid player

In [None]:
df_mids = get_sim_players_cos('Granit Xhaka')
for mid in hybrids:
  df_sim = get_sim_players_cos(mid)
  df_mids = pd.concat([df_mids, df_sim], ignore_index=False)
# Drop the elite players themselves, then count appearances of everyone else
mid_res = df_mids[~df_mids['Player'].isin(hybrids + ['Granit Xhaka'])]['Player']
mid_counts = mid_res.value_counts()
df_final[df_final['Player'].isin(mid_counts[mid_counts >= 3].index)][['Player', 'Nation', 'Pos', 'Squad', 'Age', '90s']]


Unnamed: 0,Player,Nation,Pos,Squad,Age,90s
67,João Cancelo,pt POR,DF,Manchester City,27.0,35.9
542,Oleksandr Zinchenko,ua UKR,DF,Manchester City,24.0,11.6
1081,Jordi Alba,es ESP,DF,Barcelona,32.0,29.4
1161,Sergio Canales,es ESP,FWMF,Betis,30.0,31.0
1501,Daniel Parejo,es ESP,MF,Villarreal,32.0,30.0
1601,David Silva,es ESP,MF,Real Sociedad,35.0,19.1
1942,Mattéo Guendouzi,fr FRA,MF,Marseille,22.0,35.0
1963,Ander Herrera,es ESP,MF,Paris S-G,31.0,12.1
2207,Renato Sanches,pt POR,MFFW,Lille,23.0,18.3
2297,Luis Alberto,es ESP,MF,Lazio,28.0,26.1


I really like the look of Oleksandr Zinchenko (despite his smaller sample size of 11.6 90s).

Zinchenko has really good assist numbers (0.34 assists per 90).

Other notable points:

* 92.5 pass attempts per 90 with an 88.8% completion rate
* 8.9 passes into the final third per 90
* 1.9 passes into the penalty area per 90
* Very high touch volume in the middle third (53.4 touches per 90)

This is clearly a player who does a lot of damage on the ball via his passing and is responsible with it (he demands the ball and doesn't lose it a lot).

In [None]:
display(df_final[df_final['Player'] == 'Oleksandr Zinchenko'][['Player', 'Squad', 'Age'] + model_cols].transpose())

Unnamed: 0,542
Player,Oleksandr Zinchenko
Squad,Manchester City
Age,24.0
PK,0
PKatt,0
Gls/90,0.0
Ast/90,0.34
xG/90,0.05
xAG/90,0.22
CrdY,0


Which players are most similar to Zinchenko?

In [None]:
get_sim_players_cos('Oleksandr Zinchenko')

Unnamed: 0,Player,Age,Squad,Pos
210,Jordan Henderson,31.0,Liverpool,MF
67,João Cancelo,27.0,Manchester City,DF
1081,Jordi Alba,32.0,Barcelona,DF
2776,Fabián Ruiz Peña,25.0,Napoli,MF
1963,Ander Herrera,31.0,Paris S-G,MF
1446,Luka Modrić,35.0,Real Madrid,MF
2261,Hamari Traoré,29.0,Rennes,DF
409,Andrew Robertson,27.0,Liverpool,DF
6,Thiago Alcántara,30.0,Liverpool,MF
785,Joshua Kimmich,26.0,Bayern Munich,MF


Zinchenko matches well with both elite fullbacks (Cancelo, Alba, Robertson) and elite center midfielders (Henderson, Modrić, Kimmich). This paints the picture of a left back who is as ball-dominant as a center midfielder, which is very apparent when watching Zinchenko play.

#### Find defender

In [None]:
df_cbs = get_sim_players_cos('Virgil van Dijk')
for cb in defenders:
  df_sim = get_sim_players_cos(cb)
  df_cbs = pd.concat([df_cbs, df_sim], ignore_index=False)
# Drop the elite players themselves, then count appearances of everyone else
cb_res = df_cbs[~df_cbs['Player'].isin(defenders + ['Virgil van Dijk'])]['Player']
cb_counts = cb_res.value_counts()
df_final[df_final['Player'].isin(cb_counts[cb_counts >= 3].index)][['Player', 'Nation', 'Pos', 'Squad', 'Age', '90s']]


Unnamed: 0,Player,Nation,Pos,Squad,Age,90s
83,Andreas Christensen,dk DEN,DF,Chelsea,25.0,16.6
552,Manuel Akanji,ch SUI,DF,Dortmund,26.0,25.1
1007,Niklas Süle,de GER,DF,Bayern Munich,25.0,20.4
1793,Duje Ćaleta-Car,hr CRO,DF,Marseille,24.0,22.9
2202,William Saliba,fr FRA,DF,Marseille,20.0,36.0
2290,Francesco Acerbi,it ITA,DF,Lazio,33.0,28.2
2335,Alessandro Bastoni,it ITA,DF,Inter,22.0,25.7


I like the look of Dortmund defender Manuel Akanji.

He is 26, so he is coming into his prime as a defender. He also plays for Dortmund, which is known as a selling club.

He has only 3 yellow cards in 25.1 90s played.

4.886 passes into the final third per 90 as a centerback

High number of touches per 90 in the defensive and middle thirds of the pitch (38 and 47)

I find it hard to read much into any of the defensive stats.

They can be heavily influenced by which team a player plays for and its play style. (This is true for many other stats as well.)

Additionally, the number of tackles/interceptions/aerials attempted doesn't necessarily reflect a player's quality in those duels.

So I will just have to trust that Akanji comparing favorably to proven elite defenders is a good sign. The same applies to the other players above.

In [None]:
display(df_final[df_final['Player'] == 'Manuel Akanji'][['Player', 'Squad', 'Age'] + model_cols].transpose())

Unnamed: 0,552
Player,Manuel Akanji
Squad,Dortmund
Age,26.0
PK,0
PKatt,0
Gls/90,0.04
Ast/90,0.0
xG/90,0.08
xAG/90,0.01
CrdY,3


Which players are most similar to Akanji?

In [None]:
get_sim_players_cos('Manuel Akanji')

Unnamed: 0,Player,Age,Squad,Pos
750,Mats Hummels,32.0,Dortmund,DF
1793,Duje Ćaleta-Car,24.0,Marseille,DF
1718,Benoît Badiashile,20.0,Monaco,DF
2255,Jean-Clair Todibo,21.0,Nice,DF
2076,Marquinhos,27.0,Paris S-G,DF
454,Thiago Silva,36.0,Chelsea,DF
2202,William Saliba,20.0,Marseille,DF
1011,Edmond Tapsoba,22.0,Leverkusen,DF
317,Joël Matip,29.0,Liverpool,DF
2075,Guillermo Maripán,27.0,Monaco,DF


Akanji has a clear centerback profile. He is similar to veterans Mats Hummels, Marquinhos, Thiago Silva, and Joël Matip. He is also similar to up-and-coming talents Badiashile, Saliba, and Tapsoba. Many of these centerbacks are great on the ball and more than competent defensively, which bodes well for Akanji.

The next step is to compare these three players to their respective subgroups. I do this in the machine_learning_model file.