In [1]:
import pandas as pd
import numpy as np

## Cleaning the target variable
* Removing observations that can't be used 
    - Loans 
    - Free transfers 
    - Academy promotions
* Converting the fee to an integer
     - Stripping the currency symbol
     - Converting string decimal notation to a multiplier
     - Converting fee to integer on same scale

In [2]:
tf_df = pd.read_csv('data/Top_8_leagues_past_6_windows.csv')

In [3]:
tf_df.drop(labels="Unnamed: 0", axis=1, inplace=True)

In [4]:
tf_df.head()

Unnamed: 0,player,age,nationality,position,selling_club,previous_league,est_market_value,fee,buying_club,window,year,buying_league
0,Ante Palaversa,18,Croatia,Defensive Midfield,HNK Hajduk Split,Croatia,£495Th.,£5.67m,Manchester City,s_w=w,2018,GB1
1,Ko Itakura,21,Japan,Centre-Back,Kawasaki Frontale,Japan,£630Th.,£990Th.,Manchester City,s_w=w,2018,GB1
2,Yangel Herrera,20,Venezuela,Central Midfield,New York City FC,United States,£900Th.,"End of loanDec 31, 2018",Manchester City,s_w=w,2018,GB1
3,Marlos Moreno,22,Colombia,Right Winger,Clube de Regatas do Flamengo,Brazil,£450Th.,"End of loanDec 31, 2018",Manchester City,s_w=w,2018,GB1
4,Anthony Cáceres,26,Australia,Central Midfield,Melbourne City FC,Australia,£450Th.,"End of loanDec 31, 2018",Manchester City,s_w=w,2018,GB1


In [5]:
# Multiple players have duplicate entries e.g., Nikola Kalinic 2020, Wesley Fofana 2020, Moussa Konaté 2020
# tf_df.loc[tf_df['player']=='Nikola Kalinic',:]

In [6]:
tf_df.shape

(9666, 12)

In [7]:
# It is not unusual for players to be transferred multiple times, even in the same year, so only removing duplicates with the
# exact same fee, buying club and selling club in the same year.
tf_df.drop_duplicates(subset=['player','selling_club','buying_club','fee','year'], keep='first', inplace=True)

In [8]:
tf_df.shape

(9378, 12)
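
The effect of the `subset` argument can be seen on a toy frame. This is a minimal sketch with made-up rows (the player and club names are hypothetical, not taken from the transfer file): two rows that agree on every key column collapse to one, while a second genuine move by the same player in the same year is kept.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from the transfer file.
toy = pd.DataFrame({
    "player": ["A. Player", "A. Player", "A. Player", "B. Player"],
    "selling_club": ["Club X", "Club X", "Club Y", "Club Z"],
    "buying_club": ["Club Y", "Club Y", "Club Z", "Club X"],
    "fee": ["£1.80m", "£1.80m", "£2.70m", "free transfer"],
    "year": [2020, 2020, 2020, 2019],
})

key_cols = ["player", "selling_club", "buying_club", "fee", "year"]

# Rows 0 and 1 agree on every key column, so only the first is kept.
# Row 2 is a second move by the same player in the same year and survives.
deduped = toy.drop_duplicates(subset=key_cols, keep="first")
n_removed = len(toy) - len(deduped)
```

Comparing `n_removed` against the change in `shape` is a quick sanity check that only exact repeats were dropped.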

In [9]:
tf_df['fee'].value_counts()

free transfer              1139
loan transfer              1088
-                          1010
End of loanJun 30, 2019     988
End of loanJun 30, 2018     932
                           ... 
£78.30m                       1
End of loanNov 30, 2018       1
£432Th.                       1
End of loanAug 21, 2019       1
End of loanJan 20, 2019       1
Name: fee, Length: 627, dtype: int64

### Counting and labeling loan deals

In [10]:
tf_df['loan'] = tf_df['fee'].apply(lambda x: 1 if 'loan' in x.lower() else 0)

In [11]:
tf_df['loan'].value_counts()

1    5056
0    4322
Name: loan, dtype: int64

5056 loan deals.  Not unexpected.

### Counting and labeling free transfers 

In [12]:
tf_df['free'] = tf_df['fee'].apply(lambda x: 1 if 'free' in x.lower() else 0)

In [13]:
tf_df['free'].value_counts()

0    8239
1    1139
Name: free, dtype: int64

In [14]:
#Want to see if there are any other types of free transfers
# pd.options.display.max_rows = 1200
tf_df.loc[tf_df['free'] == 1, 'fee']

20      free transfer
33      free transfer
46      free transfer
86      free transfer
131     free transfer
            ...      
9636    free transfer
9637    free transfer
9638    free transfer
9652    free transfer
9653    free transfer
Name: fee, Length: 1139, dtype: object

Free transfers happen when a contract is run down. Contract length is an important determinant of transfer fee, but I don't have that information. Clubs would want to know how much time is left on a contract (they may be able to use it as leverage to get a better deal), but here I'll use the model to benchmark the going market rate for a player of that profile.

### In the fee value count: 1010 "-"
These appear to be internal promotions, e.g., from the U23 team. Investigating and labeling here.

In [15]:
# pd.options.display.max_rows = 30
tf_df.loc[tf_df['fee'] == "-", ['player','fee','selling_club','buying_club']]

Unnamed: 0,player,fee,selling_club,buying_club
16,Callum Hudson-Odoi,-,Chelsea FC U23,Chelsea FC
19,Eddie Nketiah,-,Arsenal FC U23,Arsenal FC
29,Sean Longstaff,-,Newcastle United U23,Newcastle United
42,Kyle Taylor,-,AFC Bournemouth U21,AFC Bournemouth
44,Samir Nasri,-,Disqualification,West Ham United
...,...,...,...,...
9591,Andrey Bokovoy,-,FK Sochi II,FC Sochi
9603,Arsen Adamov,-,Akhmat Grozny II,Akhmat Grozny
9621,Vladimir Kabakhidze,-,FK Tambov II,PFK Tambov
9639,Nikita Repin,-,Rotor 2 Volgograd,Rotor Volgograd


The "-" fees are internal promotions or free signings of players without a club. These will not be used in the model.
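
The three exclusion rules (loans, free transfers, "-" entries) could also be expressed as one labeling step. The sketch below is an illustration only: `classify_fee` is a hypothetical helper, and the label names are my own choice. The example strings are ones that appear in the fee column above.

```python
import numpy as np
import pandas as pd

def classify_fee(fee: pd.Series) -> pd.Series:
    """Label each raw fee string with a coarse deal type (hypothetical helper)."""
    text = fee.str.lower()
    conditions = [
        text.str.contains("loan", regex=False).to_numpy(),  # "loan transfer", "End of loan..."
        text.str.contains("free", regex=False).to_numpy(),  # "free transfer"
        fee.eq("-").to_numpy(),                             # promotions / players without a club
    ]
    labels = np.select(conditions, ["loan", "free", "no_fee"], default="other")
    return pd.Series(labels, index=fee.index, name="deal_type")

fees = pd.Series(["£5.67m", "End of loanDec 31, 2018", "free transfer", "-", "loan transfer"])
deal_type = classify_fee(fees)
```

Anything labeled "other" is a candidate paid transfer, although that group can still contain placeholder values that need their own inspection.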

In [16]:
# removing loans, free transfers, academy promotions
tfdf2 = tf_df.loc[(tf_df['fee'] != "-")&(tf_df['loan'] == 0)&(tf_df['free'] == 0), :]

In [17]:
tfdf2.head()

Unnamed: 0,player,age,nationality,position,selling_club,previous_league,est_market_value,fee,buying_club,window,year,buying_league,loan,free
0,Ante Palaversa,18,Croatia,Defensive Midfield,HNK Hajduk Split,Croatia,£495Th.,£5.67m,Manchester City,s_w=w,2018,GB1,0,0
1,Ko Itakura,21,Japan,Centre-Back,Kawasaki Frontale,Japan,£630Th.,£990Th.,Manchester City,s_w=w,2018,GB1,0,0
10,Christian Pulisic,20,United States,Left Winger,Borussia Dortmund,Germany,£45.00m,£57.60m,Chelsea FC,s_w=w,2018,GB1,0,0
27,Miguel Almirón,24,Paraguay,Attacking Midfield,Atlanta United FC,United States,£13.50m,£21.60m,Newcastle United,s_w=w,2018,GB1,0,0
38,Dominic Solanke,21,England,Centre-Forward,Liverpool FC,England,£9.00m,£19.08m,AFC Bournemouth,s_w=w,2018,GB1,0,0


In [18]:
tfdf2.shape

(2173, 14)

In [19]:
# pd.options.display.max_rows = 20
# tfdf2['previous_league'].value_counts()
# Top 6 countries:
# Italy                   299
# France                  258
# England                 225
# Spain                   222
# Portugal                175
# Germany                 172

Evaluating the value counts of previous league: there are only 1176 observations where a player was bought from a league in the Big 5 (Italy, France, England, Spain, Germany). I currently only have detailed statistical profiles of players from the top 5 leagues. I looked into pulling in more years and data from more leagues, but fbref doesn't have detailed stats for leagues outside the top 5 (except for MLS) or for more than 3 years. I will have to proceed with what I have, understanding that the scope is somewhat reduced by virtue of the data.

In [20]:
#Creating an indicator for players bought from a top five league to facilitate merging with statistical data
tfdf2['buying_top_5'] = tfdf2['previous_league'].map(lambda x: 1 if x in ["Italy",'France','England','Spain','Germany'] else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [21]:
pd.options.mode.chained_assignment = None  # default='warn'
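
An alternative to silencing the warning globally is to take an explicit copy when filtering. The sketch below uses a toy frame and a hypothetical column name (`from_top_5`) to show the pattern. It is not the code that was run above.

```python
import pandas as pd

toy = pd.DataFrame({
    "previous_league": ["Italy", "Portugal", "Spain"],
    "fee": ["£1.80m", "£900Th.", "£2.70m"],
    "loan": [0, 0, 0],
})
big_five = ["Italy", "France", "England", "Spain", "Germany"]

# .copy() makes the filtered frame independent of `toy`, so adding a column
# is unambiguous and pandas has no reason to warn about a view versus a copy.
subset = toy.loc[toy["loan"] == 0, :].copy()
subset["from_top_5"] = subset["previous_league"].isin(big_five).astype(int)
```

The original frame is left untouched, and the warning setting can stay at its default.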

In [22]:
tfdf2['buying_top_5'].value_counts()

1    1176
0     997
Name: buying_top_5, dtype: int64

In [23]:
tfdf3 = tfdf2.loc[(tfdf2['buying_top_5'] == 1), :]

In [24]:
tfdf3.shape

(1176, 15)

### Turning fee column into an integer

In [25]:
tfdf3['fee'].value_counts()

?          69
£1.80m     65
£2.70m     45
£900Th.    42
£3.60m     41
           ..
£30.06m     1
£35.28m     1
£248Th.     1
£20.16m     1
£4.77m      1
Name: fee, Length: 265, dtype: int64

In [26]:
filt =  tfdf3['fee']=='?'
tfdf3.drop(index=tfdf3[filt].index, inplace=True)

In [27]:
tfdf3['fee'].value_counts()

£1.80m     65
£2.70m     45
£900Th.    42
£4.50m     41
£3.60m     41
           ..
£30.06m     1
£35.28m     1
£248Th.     1
£20.16m     1
£4.77m      1
Name: fee, Length: 264, dtype: int64

In [28]:
tfdf3['currency'] = tfdf3['fee'].apply(lambda x: x[0])

In [29]:
tfdf3['currency'].value_counts()

£    1107
Name: currency, dtype: int64

In [30]:
tfdf3['multiplier'] = tfdf3['fee'].str.extract(r'([a-zA-Z]+)')
                                                                         

In [31]:
tfdf3['multiplier'].value_counts()

m     936
Th    168
Name: multiplier, dtype: int64

In [32]:
tfdf3.isna().sum()

player              0
age                 0
nationality         0
position            0
selling_club        0
previous_league     0
est_market_value    0
fee                 0
buying_club         0
window              0
year                0
buying_league       0
loan                0
free                0
buying_top_5        0
currency            0
multiplier          3
dtype: int64

In [33]:
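# If no multiplier suffix was found, assume the fee is already in whole pounds (multiplier of 1)
# Filling only the multiplier column so that no other column is touched
tfdf3['multiplier'] = tfdf3['multiplier'].fillna(1)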

In [34]:
tfdf3['multiplier'].value_counts()

m     936
Th    168
1       3
Name: multiplier, dtype: int64

In [35]:
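# [0-9]* (rather than [0-9]+) after the optional decimal point so a single-digit amount would also match
tfdf3['fee_numerical'] = tfdf3['fee'].str.extract(r'([0-9]+\.?[0-9]*)')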

In [36]:
# convert numeric columns to appropriate dtypes
tfdf3 = tfdf3.astype({'fee_numerical':'float'})

In [37]:
tfdf3.loc[tfdf3['multiplier'] == 'm', 'mult_num'] = 1000000
tfdf3.loc[tfdf3['multiplier'] == 'Th', 'mult_num'] = 1000
tfdf3.loc[tfdf3['multiplier'] == 1, 'mult_num'] = 1

In [38]:
tfdf3['fee_final'] = tfdf3['fee_numerical']*tfdf3['mult_num']

In [39]:
tfdf3['fee_final']

10      57600000.0
38      19080000.0
39      12240000.0
57       1530000.0
65      18900000.0
           ...    
9232     2700000.0
9320    18000000.0
9321     5400000.0
9455    10800000.0
9466     4950000.0
Name: fee_final, Length: 1107, dtype: float64
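
The currency, multiplier and numeric steps above can also be folded into a single parsing function. This is a minimal sketch under the assumption that every paid fee looks like `£<number><m|Th.>`. `parse_fee`, `FEE_PATTERN` and `MULTIPLIERS` are hypothetical names introduced for illustration.

```python
import re
from typing import Optional

# Suffixes seen in the fee column and the scale each one implies.
MULTIPLIERS = {"m": 1_000_000, "Th": 1_000}

# Currency symbol, digits with an optional decimal part, optional suffix, optional trailing dot.
FEE_PATTERN = re.compile(r"^£(\d+(?:\.\d+)?)(m|Th)?\.?$")

def parse_fee(fee: str) -> Optional[int]:
    """Convert a string such as '£5.67m' or '£990Th.' to whole pounds.

    Returns None for anything that is not a paid fee (loans, free transfers, '-', '?').
    """
    match = FEE_PATTERN.match(fee)
    if match is None:
        return None
    number, suffix = match.groups()
    # No suffix means the amount is assumed to be in pounds already.
    return round(float(number) * MULTIPLIERS.get(suffix, 1))

examples = {fee: parse_fee(fee) for fee in ["£5.67m", "£990Th.", "£57.60m", "free transfer"]}
```

Rounding to an integer avoids carrying floating-point noise into the target variable.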

In [None]:
# tfdf3.head()

## Creating variables to index on for merging:
Need to assess and clean variables for this purpose in both dataframes:
* Player name
* Year of stats/year of transfer - using a one year lag (e.g., a purchase in 2018 will be modeled with 2017 stats).
    - Note: the winter transfer window is labeled a year early (winter 2018 was the Jan 2019 window). Therefore use the 2017/2018 stats to predict winter/summer 2018 prices
* Columns with same information but different format:
    * club name in stats/selling club in transfers
    * nationality
    * previous_league/league
    * position

In [40]:
statsdf = pd.read_csv('data/Player_stats_top_5_leagues_2017_to_2020.csv')

In [None]:
# tfdf3.head()

In [None]:
# tfdf3['player'].value_counts()

In [None]:
# tfdf3.loc[tfdf3['player']=='Enric Gallego',:]

In [None]:
# statsdf.loc[statsdf['players']=='Enric Gallego',:]

In [None]:
# statsdf.head()

## ASSESSING MISSINGNESS

In [41]:
tfdf3.isna().sum()

player              0
age                 0
nationality         0
position            0
selling_club        0
previous_league     0
est_market_value    0
fee                 0
buying_club         0
window              0
year                0
buying_league       0
loan                0
free                0
buying_top_5        0
currency            0
multiplier          0
fee_numerical       0
mult_num            0
fee_final           0
dtype: int64

In [42]:
pd.options.display.max_rows = 140
statsdf.isna().sum()

players                          0
nationality                      1
team                             0
position                         1
age                              7
birth_year                       7
games                            0
games_start                      0
mins                             0
goals                            0
assists                          0
pens_successful                  0
pens_attempts                    0
yellow_cards                     0
red_cards                        0
goals_per_90                     0
assists_per_90                   0
goals_and_assists_per_90         0
goals_pk_per_90                  0
goals_assists_pk_per_90          0
xg                              10
npxp                            10
xa                              10
xg_per90                        20
xa_per90                        20
xg_xa_per90_list                20
npxg_per90_list                 20
npxg_xa_per90                   20
full_90s_played     

* variables that can be calculated:
    - 'aerials_won_pct', 'shots_on_target_pct', 'goals_per_shot', 'goals_per_shot_on_target', 'avg_shot_dist', 'npxg_per_shot', 'pass_percent', 'pass_percent_short', 'pass_percent_medium', 'pass_percent_long', 'dribble_tackles_pct', 'pressure_regain_pct', 'dribbles_completed_pct', 'passes_received_pct'
* variables to use fillna:
    - progressive_passes

In [43]:
#Following variables are missing hundreds of observations or are duplicates between scraped pages
# Dropping columns
statsdf.drop(labels=['aerials_won_pct', 'shots_on_target_pct','goals_per_shot','goals_per_shot_on_target',
                    'avg_shot_dist','npxg_per_shot','pass_percent','pass_percent_short','pass_percent_medium',
                    'pass_percent_long','dribble_tackles_pct','pressure_regain_pct','dribbles_completed_pct', 
                    'passes_received_pct','goals.1','pens_successful.1','pens_attempts.1','xg.1',
                    'full_90s_played.1','assists.1','xa.1','full_90s_played.2'], axis=1, inplace=True)

In [44]:
#Assessing remaining missing variables 
pd.options.display.max_rows = 140
statsdf.isna().sum()

players                        0
nationality                    1
team                           0
position                       1
age                            7
birth_year                     7
games                          0
games_start                    0
mins                           0
goals                          0
assists                        0
pens_successful                0
pens_attempts                  0
yellow_cards                   0
red_cards                      0
goals_per_90                   0
assists_per_90                 0
goals_and_assists_per_90       0
goals_pk_per_90                0
goals_assists_pk_per_90        0
xg                            10
npxp                          10
xa                            10
xg_per90                      20
xa_per90                      20
xg_xa_per90_list              20
npxg_per90_list               20
npxg_xa_per90                 20
full_90s_played                0
shots_total                    7
shots_on_t

In [45]:
statsdf2 = statsdf.fillna(value='?')

In [46]:
# Lots of variables are missing 10 or 20 observations.  Guessing these are all on the same player,
# likely with few appearances and therefore no statistics to report.  Confirming here.
pd.options.display.max_columns = 116
statsdf2.loc[statsdf2['xg_per90'] == '?',:]

Unnamed: 0,players,nationality,team,position,age,birth_year,games,games_start,mins,goals,assists,pens_successful,pens_attempts,yellow_cards,red_cards,goals_per_90,assists_per_90,goals_and_assists_per_90,goals_pk_per_90,goals_assists_pk_per_90,xg,npxp,xa,xg_per90,xa_per90,xg_xa_per90_list,npxg_per90_list,npxg_xa_per90,full_90s_played,shots_total,shots_on_target,shots_total_per90,shots_on_target_per90,npxg,xg_net,npxg_net,passes_completed,passes_attempted,passes_total_dist,passes_prog_dist,passes_completed_short,passes_attempted_short,passes_completed_medium,passes_attempted_medium,passes_completed_long,passes_attempted_long,xa_net,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,progressive_passes,sca,sca_per90,sca_passes_live,sca_passes_dead,sca_dribbles,sca_shots,sca_fouled,sca_defense,gca,gca_per90,gca_passes_live,gca_passes_dead,gca_dribbles,gca_shots,gca_fouled,gca_defense,gca_og_for,tackles,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,dribble_tackles,dribble_vs,dribbled_past,pressures,pressure_regains,pressures_def_3rd,pressures_mid_3rd,pressures_att_3rd,blocks,blocked_shots,blocked_shots_saves,blocked_passes,interceptions,tackles_interceptions,clearances,errors,touches,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,dribbles_completed,dribbles,players_dribbled_past,nutmegs,carries,carry_distance,carry_progressive_distance,pass_targets,passes_received,miscontrols,dispossessed,passes_left_foot,passes_right_foot,aerials_won,aerials_lost,year,league
1685,Vincent Janssen,nl NED,Tottenham,DF,23,1994,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2017-2018,Premier-League
1808,Georges-Kévin N'Koudou,fr FRA,Tottenham,MF,22,1995,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,1,?,0,0,0,0,1,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,1,1,0,1,1,2,0,0,0,0,1,2,0,1,1,0,0,0,0,0,0,2017-2018,Premier-League
1821,Aiden O'Neill,au AUS,Burnley,MF,19,1998,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,?,?,?,?,?,?,?,?,0.0,?,0,?,0.0,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,0,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,2017-2018,Premier-League
1937,Axel Tuanzebe,eng ENG,Manchester Utd,FW,19,1997,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2017-2018,Premier-League
3308,Kévin Zohi,ml MLI,Strasbourg,MF,20,1996,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,2,1,0,1,0,0,0,0,2017-2018,Ligue-1
3344,Mariusz Stępiński,pl POL,Nantes,FW,22,1995,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,1,2,9,0,1,1,0,1,0,0,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,1,0,2,0,0,0,0,1,10,8,3,2,0,0,0,1,0,0,2017-2018,Ligue-1
3446,Romain Perraud,fr FRA,Nice,MF,19,1997,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,2,3,40,30,1,1,0,1,1,1,0,0,1,0,0,1,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,1,2,0,2,0,0,0,0,2,20,18,2,2,0,0,1,1,0,0,2017-2018,Ligue-1
3472,Sloan Privat,fr FRA,Guingamp,FW,28,1989,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,1,1,7,1,1,1,0,0,0,0,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,2017-2018,Ligue-1
3559,Yusuf Sari,tr TUR,Marseille,FW,18,1998,1,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,?,?,?,?,?,0.0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,?,0,?,0,0,0,0,0,0,0,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2017-2018,Ligue-1
3836,Christian Kouakou,ci CIV,Caen,FW,27,1991,1,0,13,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,?,?,?,?,?,?,?,?,0.1,0,0,0,0.0,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,0,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,0,0,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,2018-2019,Ligue-1


In [47]:
#Confirmed - of the players with missing values, 17 have 1 appearance, one has 2, one has 8 and one has 11 (the latter two
#with more than 6 full 90s). Going to zero fill.
statsdf.fillna(value=0, inplace=True)
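
The check behind that comment can be sketched without the `'?'` placeholder by counting missing fields per row and viewing them next to playing time. The frame below is toy data with made-up numbers. Only the column names come from the stats file.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "players": ["P1", "P2", "P3"],
    "games": [1, 30, 2],
    "mins": [1, 2500, 40],
    "xg": [np.nan, 7.1, 0.0],
    "xg_per90": [np.nan, 0.26, np.nan],
})

# Rows with at least one missing value, plus how many fields each one is missing.
has_gap = toy.isna().any(axis=1)
summary = toy.loc[has_gap, ["players", "games", "mins"]].assign(
    n_missing=toy.isna().sum(axis=1)[has_gap]
)
```

If the rows in `summary` all show very little playing time, zero-filling is a defensible choice.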

In [48]:
#Not sure why, but Christian Rutjens (index=4774) has no information
statsdf.drop(index=4774, inplace=True)

In [49]:
statsdf.isna().sum()

players                       0
nationality                   0
team                          0
position                      0
age                           0
birth_year                    0
games                         0
games_start                   0
mins                          0
goals                         0
assists                       0
pens_successful               0
pens_attempts                 0
yellow_cards                  0
red_cards                     0
goals_per_90                  0
assists_per_90                0
goals_and_assists_per_90      0
goals_pk_per_90               0
goals_assists_pk_per_90       0
xg                            0
npxp                          0
xa                            0
xg_per90                      0
xa_per90                      0
xg_xa_per90_list              0
npxg_per90_list               0
npxg_xa_per90                 0
full_90s_played               0
shots_total                   0
shots_on_target               0
shots_to

### Years
* Year of stats/year of transfer - using a one year lag (e.g., a purchase in 2018 will be modeled with 2017 stats).
    - Note: the winter transfer window is labeled a year early (winter 2018 was the Jan 2019 window). Therefore use the 2017/2018 stats to predict winter/summer 2018 prices

In [50]:
print(tfdf3['year'].value_counts())
print(tfdf3['window'].value_counts())
print(statsdf['year'].value_counts())

2019    435
2018    401
2020    271
Name: year, dtype: int64
s_w=s    976
s_w=w    131
Name: window, dtype: int64
2019-2020    2732
2017-2018    2686
2018-2019    2658
Name: year, dtype: int64


In [51]:
# tfdf3.head(20)

In [52]:
tfdf3['index_year'] = tfdf3['year']

In [53]:
#renaming year as transfer_year so that after merging the column is preserved
#and interpretable
tfdf3.rename(columns={'year':'transfer_year'}, inplace=True)

In [54]:
#Select the first 4 digits (the first year) for the stats years
statsdf['index_year'] = statsdf['year'].apply(lambda x: x[0:4])

In [55]:
#convert to int
statsdf = statsdf.astype({'index_year':'int'})

In [56]:
#Adding 1 to facilitate matching of the lag
#index_year will now match up the transfer to the performance stats lagged by one year
statsdf['index_year'] = statsdf['index_year']+1

In [57]:
#renaming year as stats_year so that after merging the column is preserved
#and interpretable
statsdf.rename(columns={'year':'stats_year'}, inplace=True)

In [58]:
statsdf['index_year'].value_counts()

2020    2732
2018    2686
2019    2658
Name: index_year, dtype: int64
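
The three steps used to build the lag key (slice, cast, add one) can be written as one vectorised expression. A minimal sketch on the three season labels found in the stats file:

```python
import pandas as pd

seasons = pd.Series(["2017-2018", "2018-2019", "2019-2020"], name="stats_year")

# The first calendar year of the season plus one gives the transfer year it is matched to,
# e.g. 2017-2018 stats are paired with transfers labeled 2018.
index_year = seasons.str[:4].astype(int) + 1
```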

### Player name

In [59]:
tfdf3['player'].nunique()

1024

In [60]:
statsdf['players'].value_counts()

Raúl García          8
Adama Traoré         6
Rafinha              6
Naldo                6
Marcelo              6
                    ..
Maxence Rivera       1
Francisco Montero    1
Hugo Magnetti        1
Ørjan Nyland         1
Álvaro Vadillo       1
Name: players, Length: 4018, dtype: int64

In [61]:
statsdf.loc[statsdf['players']=='Raúl García',['players','nationality','team','position','age','birth_year','index_year','stats_year']]

Unnamed: 0,players,nationality,team,position,age,birth_year,index_year,stats_year
6578,Raúl García,es ESP,Leganés,DF,28.0,1989.0,2018,2017-2018
6579,Raúl García,es ESP,Athletic Club,"MF,FW",31.0,1986.0,2018,2017-2018
7362,Raúl García,es ESP,Athletic Club,"MF,FW",32.0,1986.0,2019,2018-2019
7363,Raúl García,es ESP,Girona,DF,29.0,1989.0,2019,2018-2019
7364,Raúl García,es ESP,Leganés,DF,29.0,1989.0,2019,2018-2019
7927,Raúl García,es ESP,Athletic Club,"MF,FW",33.0,1986.0,2020,2019-2020
7928,Raúl García,es ESP,Getafe,DF,30.0,1989.0,2020,2019-2020
7929,Raúl García,es ESP,Valladolid,"DF,MF",30.0,1989.0,2020,2019-2020


* Looking up Raúl García: one of them (born 1986) is the Raúl García who has been at Athletic Club in Bilbao for years.
* The other 5 entries here are all a second Raúl García (born 1989). He spent time on loan and therefore has rows for two clubs in index years 2019 and 2020.
* Options:
    - Combine the stats from the same year (all % columns will be unusable)
    - Only use the stats for the selling club. This is of course limited, but no more limited than not considering more than one season of work.
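
The first option could look roughly like the sketch below. It assumes `birth_year` is reliable enough to tell namesakes apart, which may not hold for every row. The data is made up, and only counting columns are summed. Rates are rebuilt from the summed counts rather than averaged.

```python
import pandas as pd

# Toy rows modeled on the namesake situation above; the numbers are made up.
toy = pd.DataFrame({
    "players": ["Same Name", "Same Name", "Same Name"],
    "birth_year": [1986, 1989, 1989],
    "team": ["Club A", "Club B", "Club C"],
    "index_year": [2019, 2019, 2019],
    "mins": [2700, 900, 1800],
    "goals": [9, 1, 2],
})

# birth_year separates the two namesakes; summing within each group merges
# one player's spells at different clubs in the same season.
combined = toy.groupby(["players", "birth_year", "index_year"], as_index=False).agg(
    mins=("mins", "sum"),
    goals=("goals", "sum"),
    n_clubs=("team", "nunique"),
)

# Per-90 rates are recomputed from the summed counts.
combined["goals_per_90"] = combined["goals"] / combined["mins"] * 90
```

Percentage columns would need the same treatment (recomputed from their numerators and denominators), which is why they cannot simply be summed or averaged.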

In [62]:
tfdf3['index_name'] = tfdf3['player']

In [63]:
statsdf['index_name'] = statsdf['players']

### nationality

In [64]:
tfdf3['nationality']

10      United States
38            England
39              Wales
57            England
65              Spain
            ...      
9232        Argentina
9320          Germany
9321         Cameroon
9455          Croatia
9466           Guinea
Name: nationality, Length: 1107, dtype: object

In [65]:
statsdf['nationality']

0       ar ARG
1       al ALB
2       de GER
3       br BRA
4       ch SUI
         ...  
8072    es ESP
8073    es ESP
8074    es ESP
8075    tr TUR
8076    hr CRO
Name: nationality, Length: 8076, dtype: object

Creating a dictionary of country abbreviations to create a new column in the stats df that is the full name of the nation to facilitate index matching with nationality in the transfer df

In [66]:
#from: https://www.realifewebdesigns.com/web-marketing/abbreviations-countries.asp
country_abb = pd.read_excel('data/country_abbreviations.xlsx', header=None)

In [67]:
#format: "AF = Afghanistan"
#extracting the country code abbreviation and country name as key and value for dictionary
#splitting on "=" and stripping whitespace so no stray spaces end up in the keys or values
parts = country_abb[0].str.split('=', n=1)
country_abb['key'] = parts.str[0].str.strip().str.lower()
country_abb['value'] = parts.str[1].str.strip()

In [68]:
#dropping original column
country_abb.drop(labels=0, axis=1, inplace=True)

In [69]:
#Convert series to lists to facilitate creation of dictionary
key_list = list(country_abb['key'])
value_list = list(country_abb['value'])

In [70]:
#Pairing each abbreviation with its country name
country_dict = dict(zip(key_list, value_list))

In [75]:
#Extracting the abbreviation from the nationality column (formatting as "ar ARG")
statsdf['nationality_abb'] = statsdf['nationality'].str.extract(r'([a-z]+)')

In [76]:
#The entries for Guadeloupe were missing the two-letter code.
#French Guiana's three-letter code is GUF; the GYF entry is a data entry error, corrected here.
#This explains the nunique discrepancy.
statsdf.loc[statsdf['nationality'] == ' GPE', 'nationality_abb'] = 'gp' 
statsdf.loc[statsdf['nationality'] == ' GYF', 'nationality_abb'] = 'gf'

In [None]:
# print(statsdf['nationality'].nunique())
# print(statsdf['nationality_abb'].nunique())

# print(statsdf['nationality'].value_counts())
# print(statsdf['nationality_abb'].value_counts())

In [77]:
#Manually entering abbreviations missing from the website's list to facilitate recoding:
country_dict['cw'] = 'Curacao'
country_dict['eng'] = 'England'
country_dict['is'] = 'Iceland'
country_dict['rs'] = 'Serbia'
country_dict['xk'] = 'Kosovo'
country_dict['wal'] = 'Wales'
country_dict['sco'] = 'Scotland'
country_dict['nir'] = 'Northern Ireland'
country_dict['me'] = 'Montenegro'

In [78]:
statsdf['index_nationality'] = statsdf['nationality_abb'].map(lambda x: country_dict[x])
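
A lookup through `country_dict[x]` raises a `KeyError` on the first unknown code. A gentler way to find which abbreviations still need a manual entry is to pass the dictionary to `Series.map` directly, which leaves unmapped codes as NaN. A minimal sketch with a small stand-in dictionary (the code "zz" is deliberately fake):

```python
import pandas as pd

codes = pd.Series(["ar", "eng", "zz"], name="nationality_abb")
lookup = {"ar": "Argentina", "eng": "England"}  # small stand-in for country_dict

# Series.map with a dict returns NaN for codes that are not keys,
# so the gaps can be listed instead of stopping at the first one.
names = codes.map(lookup)
unmapped = sorted(codes[names.isna()].unique())
```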

In [79]:
tfdf3['index_nationality'] = tfdf3['nationality']

## Indexing on selling club in transfer data and club in stats data

* The previous league column in tfdf3 is more accurately the previous country. It therefore includes purchases from lower leagues and academies, not just the top tier, so tfdf3 has more clubs than appear in the stats dataset. The team name in statsdf is the shorter, more generic version.
* Import a dictionary from py_files with the generic name as key and the full name as value, and use it to add the full name to statsdf for index matching.

In [83]:
import py_files.team_dictionary as team_dictionary

In [84]:
#binding the dictionary to its own name so the module alias is not overwritten
team_dict = team_dictionary.team_dict

In [85]:
statsdf['index_selling_club'] = statsdf['team'].map(lambda x: team_dict[x])

In [86]:
tfdf3['index_selling_club'] = tfdf3['selling_club']

In [None]:
# These were to be used for merging (see below)
# tfdf3.set_index(keys=['index_name','index_nationality','index_selling_club','index_year'], inplace=True)
# statsdf.set_index(keys=['index_name','index_nationality','index_selling_club','index_year'], inplace=True)

# Merging transfer and stats dataframes by index

## Original idea
* Merging on name, nationality, selling club and year (stats year lagged one behind sell year)
* Did not use age because the timing of the entry in either database was unclear, so I couldn't get a consistent number, lagged or otherwise

## Problems with this approach
* There are players that are transferred multiple times in the same year
    - e.g., Omar Mascarell. Madrid exercised a buy-back and then sold him (one stats year, two transfers)
    - e.g., Marc Cucurella. Loaned to Eibar for 2018/2019 with an option to make the move permanent in summer 2019. Barca bought him back 16 days later and loaned him to Getafe. Permanently sold to Getafe in summer 2020.
    - IMPACT: We might assume buy-backs are set based on the performance prior to the loan, but there is nothing in the dataset indicating options to buy (or release fees). Treating these as independent transactions based on the same performance year.
* There are players in the stats with multiple observations per year:
    - Loans/January moves/extended window, e.g., Naldo, Adama Traoré, Rafinha: one full year of stats spread across multiple teams
    - Multiple people with the same name and nationality, e.g., Naldo, which means we can't just sum by player and year
    - IMPACT: We can't match on club because of the multi-club problem, and we can't sum by player and year because of the multi-player problem.
* Solution: merge by name and year, which gives 736 rows
    - An inner merge of the transfers and stats dataframes returns only the players that appear in both, so it keeps all transfers that involve moves between teams in the Big 5 leagues.
    - There will be two kinds of dual labeling, which need to be identified manually:
        - Different players with the same name
        - Players that played for multiple clubs in one year

In [92]:
df = tfdf3.merge(statsdf, on=['index_name','index_year'])

In [93]:
df.shape

(736, 141)
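
The manual review of dual-labeled rows can be narrowed down by flagging transfers that matched more than one stats row. The sketch below uses toy frames with hypothetical names and the same merge keys. It assumes the combination of name, year, buying club and fee identifies a single transfer.

```python
import pandas as pd

transfers = pd.DataFrame({
    "index_name": ["Same Name", "Unique Name"],
    "index_year": [2019, 2019],
    "buying_club": ["Club Z", "Club Y"],
    "fee_final": [1_800_000.0, 2_700_000.0],
})
stats = pd.DataFrame({
    "index_name": ["Same Name", "Same Name", "Unique Name"],
    "index_year": [2019, 2019, 2019],
    "team": ["Club A", "Club B", "Club C"],
})

merged = transfers.merge(stats, on=["index_name", "index_year"], how="inner")

# A transfer that matched several stats rows is repeated in the result.
# keep=False marks every copy so that all of them show up for review.
transfer_key = ["index_name", "index_year", "buying_club", "fee_final"]
merged["ambiguous"] = merged.duplicated(subset=transfer_key, keep=False)
```

Filtering on `merged["ambiguous"]` leaves only the rows where a namesake or a multi-club season has to be resolved by hand.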

In [94]:
pd.options.display.max_rows = 736
df

Unnamed: 0,player,age_x,nationality_x,position_x,selling_club,previous_league,est_market_value,fee,buying_club,window,transfer_year,buying_league,loan,free,buying_top_5,currency,multiplier,fee_numerical,mult_num,fee_final,index_year,index_name,index_nationality_x,index_selling_club_x,players,nationality_y,team,position_y,age_y,birth_year,games,games_start,mins,goals,assists,pens_successful,pens_attempts,yellow_cards,red_cards,goals_per_90,assists_per_90,goals_and_assists_per_90,goals_pk_per_90,goals_assists_pk_per_90,xg,npxp,xa,xg_per90,xa_per90,xg_xa_per90_list,npxg_per90_list,npxg_xa_per90,full_90s_played,shots_total,shots_on_target,shots_total_per90,shots_on_target_per90,npxg,...,sca_defense,gca,gca_per90,gca_passes_live,gca_passes_dead,gca_dribbles,gca_shots,gca_fouled,gca_defense,gca_og_for,tackles,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,dribble_tackles,dribble_vs,dribbled_past,pressures,pressure_regains,pressures_def_3rd,pressures_mid_3rd,pressures_att_3rd,blocks,blocked_shots,blocked_shots_saves,blocked_passes,interceptions,tackles_interceptions,clearances,errors,touches,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,dribbles_completed,dribbles,players_dribbled_past,nutmegs,carries,carry_distance,carry_progressive_distance,pass_targets,passes_received,miscontrols,dispossessed,passes_left_foot,passes_right_foot,aerials_won,aerials_lost,stats_year,league,nationality_abb,index_nationality_y,index_selling_club_y
0,Christian Pulisic,20,United States,Left Winger,Borussia Dortmund,Germany,£45.00m,£57.60m,Chelsea FC,s_w=w,2018,GB1,0,0,1,£,m,57.6,1000000.0,57600000.0,2018,Christian Pulisic,United States,Borussia Dortmund,Christian Pulisic,us USA,Dortmund,FW,18.0,1998.0,32,27,2302,4,5,0,0,1,0,0.16,0.2,0.35,0.16,0.35,5.2,5.2,5.9,0.2,0.23,0.43,0.2,0.43,25.6,36.0,15,1.41,0.59,5.2,...,1.0,5.0,0.19,4.0,0.0,1.0,0.0,0.0,0.0,0.0,36.0,15.0,11.0,14.0,11.0,6.0,30.0,24.0,431.0,124.0,106.0,174.0,151.0,29.0,4.0,0.0,25.0,15.0,51,11.0,0.0,1282.0,16.0,136.0,597.0,705.0,140.0,1248.0,89.0,149.0,99.0,4.0,1122.0,9110.0,5699.0,1301.0,971.0,76.0,90.0,167.0,649.0,12.0,45.0,2017-2018,Bundesliga,us,United States,Borussia Dortmund
1,Dominic Solanke,21,England,Centre-Forward,Liverpool FC,England,£9.00m,£19.08m,AFC Bournemouth,s_w=w,2018,GB1,0,0,1,£,m,19.08,1000000.0,19080000.0,2018,Dominic Solanke,England,Liverpool FC,Dominic Solanke,eng ENG,Liverpool,FW,19.0,1997.0,21,5,596,1,1,0,0,0,0,0.15,0.15,0.3,0.15,0.3,2.9,2.9,2.0,0.45,0.31,0.76,0.45,0.76,6.6,23.0,7,3.47,1.06,2.9,...,0.0,2.0,0.31,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,0.0,3.0,0.0,0.0,5.0,5.0,134.0,43.0,6.0,58.0,70.0,10.0,1.0,0.0,9.0,2.0,5,5.0,0.0,309.0,7.0,19.0,126.0,191.0,55.0,305.0,10.0,17.0,12.0,1.0,199.0,1219.0,657.0,397.0,238.0,32.0,13.0,22.0,141.0,9.0,14.0,2017-2018,Premier-League,eng,England,Liverpool FC
2,Emiliano Sala,28,Argentina,Centre-Forward,FC Nantes,France,£14.40m,£15.30m,Cardiff City,s_w=w,2018,GB1,0,0,1,£,m,15.3,1000000.0,15300000.0,2018,Emiliano Sala,Argentina,FC Nantes,Emiliano Sala,ar ARG,Nantes,FW,26.0,1990.0,36,34,3032,12,3,4,5,4,2,0.36,0.09,0.45,0.24,0.33,15.8,12.0,2.8,0.47,0.08,0.55,0.36,0.44,33.7,97.0,25,2.88,0.74,12.0,...,1.0,4.0,0.12,4.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,8.0,5.0,8.0,3.0,5.0,30.0,25.0,489.0,126.0,22.0,234.0,233.0,30.0,3.0,0.0,27.0,16.0,32,43.0,1.0,1321.0,49.0,105.0,578.0,690.0,177.0,1293.0,22.0,46.0,25.0,0.0,828.0,4344.0,1870.0,1695.0,977.0,121.0,79.0,47.0,572.0,41.0,77.0,2017-2018,Ligue-1,ar,Argentina,FC Nantes
3,Riyad Mahrez,27,Algeria,Right Winger,Leicester City,England,£45.00m,£61.02m,Manchester City,s_w=s,2018,GB1,0,0,1,£,m,61.02,1000000.0,61020000.0,2018,Riyad Mahrez,Algeria,Leicester City,Riyad Mahrez,dz ALG,Leicester City,"MF,FW",26.0,1991.0,36,34,2950,12,10,0,0,2,0,0.37,0.31,0.67,0.37,0.67,5.3,5.3,8.0,0.16,0.24,0.41,0.16,0.41,32.8,73.0,35,2.23,1.07,5.3,...,1.0,21.0,0.64,12.0,5.0,2.0,2.0,0.0,0.0,0.0,37.0,24.0,16.0,15.0,6.0,4.0,41.0,37.0,382.0,89.0,109.0,168.0,105.0,42.0,1.0,0.0,41.0,16.0,53,5.0,0.0,1879.0,21.0,208.0,941.0,909.0,126.0,1680.0,72.0,132.0,86.0,3.0,1334.0,11512.0,6616.0,1585.0,1279.0,83.0,87.0,1184.0,129.0,15.0,25.0,2017-2018,Premier-League,dz,Algeria,Leicester City
4,Lee Grant,35,England,Goalkeeper,Stoke City,England,£900Th.,£1.53m,Manchester United,s_w=s,2018,GB1,0,0,1,£,m,1.53,1000000.0,1530000.0,2018,Lee Grant,England,Stoke City,Lee Grant,eng ENG,Stoke City,GK,34.0,1983.0,3,3,270,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,80.0,63.0,80.0,0.0,0.0,0.0,58.0,0.0,0.0,0.0,0.0,43.0,274.0,115.0,31.0,30.0,0.0,0.0,5.0,52.0,0.0,0.0,2017-2018,Premier-League,eng,England,Stoke City
5,Alisson,25,Brazil,Goalkeeper,AS Roma,Italy,£54.00m,£56.25m,Liverpool FC,s_w=s,2018,GB1,0,0,1,£,m,56.25,1000000.0,56250000.0,2018,Alisson,Brazil,AS Roma,Alisson,br BRA,Roma,GK,24.0,1992.0,37,37,3330,0,0,0,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.0,0.0,0,0.0,0.0,0.0,...,0.0,1.0,0.03,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1.0,2.0,1326.0,1095.0,1318.0,12.0,0.0,0.0,1089.0,6.0,6.0,6.0,0.0,793.0,3063.0,1575.0,644.0,643.0,2.0,0.0,89.0,847.0,0.0,0.0,2017-2018,Serie-A,br,Brazil,AS Roma
6,Naby Keïta,23,Guinea,Central Midfield,RB Leipzig,Germany,£58.50m,£54.00m,Liverpool FC,s_w=s,2018,GB1,0,0,1,£,m,54.0,1000000.0,54000000.0,2018,Naby Keïta,Guinea,RB Leipzig,Naby Keïta,gn GUI,RB Leipzig,MF,22.0,1995.0,27,23,1962,6,5,0,0,8,2,0.28,0.23,0.5,0.28,0.5,3.5,3.5,3.7,0.16,0.17,0.33,0.16,0.33,21.8,44.0,12,2.02,0.55,3.5,...,3.0,10.0,0.46,8.0,0.0,0.0,1.0,0.0,1.0,0.0,49.0,33.0,20.0,22.0,7.0,12.0,52.0,40.0,437.0,137.0,112.0,234.0,91.0,24.0,4.0,0.0,20.0,22.0,71,17.0,0.0,1590.0,29.0,239.0,1064.0,441.0,59.0,1564.0,71.0,111.0,76.0,4.0,1249.0,8497.0,5554.0,1242.0,1113.0,43.0,60.0,184.0,1013.0,12.0,24.0,2017-2018,Bundesliga,gn,Guinea,RB Leipzig
7,Fabinho,24,Brazil,Defensive Midfield,AS Monaco,France,£40.50m,£40.50m,Liverpool FC,s_w=s,2018,GB1,0,0,1,£,m,40.5,1000000.0,40500000.0,2018,Fabinho,Brazil,AS Monaco,Fabinho,br BRA,Monaco,MF,23.0,1993.0,34,34,3060,7,3,4,4,8,0,0.21,0.09,0.29,0.09,0.18,5.5,2.4,3.7,0.16,0.11,0.27,0.07,0.18,34.0,22.0,5,0.65,0.15,2.4,...,6.0,16.0,0.47,10.0,0.0,2.0,0.0,2.0,2.0,0.0,96.0,61.0,33.0,50.0,13.0,29.0,97.0,68.0,545.0,159.0,171.0,309.0,65.0,50.0,10.0,0.0,40.0,39.0,135,37.0,0.0,2308.0,74.0,555.0,1538.0,375.0,35.0,2273.0,28.0,38.0,31.0,0.0,1523.0,9688.0,5102.0,1550.0,1469.0,21.0,20.0,129.0,1662.0,28.0,18.0,2017-2018,Ligue-1,br,Brazil,AS Monaco
8,Xherdan Shaqiri,26,Switzerland,Right Winger,Stoke City,England,£16.20m,£13.23m,Liverpool FC,s_w=s,2018,GB1,0,0,1,£,m,13.23,1000000.0,13230000.0,2018,Xherdan Shaqiri,Switzerland,Stoke City,Xherdan Shaqiri,ch SUI,Stoke City,"FW,MF",25.0,1991.0,36,36,3039,8,7,0,1,5,0,0.24,0.21,0.44,0.24,0.44,5.3,4.6,7.0,0.16,0.21,0.36,0.13,0.34,33.8,70.0,30,2.07,0.89,4.6,...,0.0,11.0,0.32,4.0,5.0,0.0,1.0,1.0,0.0,0.0,22.0,12.0,9.0,6.0,7.0,10.0,33.0,23.0,379.0,91.0,74.0,186.0,119.0,18.0,2.0,0.0,16.0,15.0,37,2.0,0.0,1605.0,12.0,190.0,750.0,808.0,70.0,1384.0,35.0,66.0,36.0,2.0,1168.0,9239.0,4538.0,1382.0,1053.0,90.0,55.0,964.0,233.0,6.0,10.0,2017-2018,Premier-League,ch,Switzerland,Stoke City
9,Jorginho,26,Italy,Defensive Midfield,SSC Napoli,Italy,£45.00m,£51.30m,Chelsea FC,s_w=s,2018,GB1,0,0,1,£,m,51.3,1000000.0,51300000.0,2018,Jorginho,Italy,SSC Napoli,Jorginho,it ITA,Napoli,MF,25.0,1991.0,33,33,2650,2,3,1,2,5,0,0.07,0.1,0.17,0.03,0.14,2.2,0.5,3.6,0.07,0.12,0.2,0.02,0.14,29.4,15.0,3,0.51,0.1,0.5,...,2.0,12.0,0.41,10.0,1.0,0.0,1.0,0.0,0.0,0.0,67.0,44.0,18.0,36.0,13.0,16.0,71.0,55.0,574.0,180.0,135.0,331.0,108.0,37.0,3.0,0.0,34.0,52.0,119,11.0,0.0,3466.0,47.0,537.0,2395.0,704.0,9.0,3393.0,13.0,25.0,15.0,0.0,2546.0,8222.0,4637.0,2936.0,2833.0,26.0,25.0,259.0,2900.0,13.0,18.0,2017-2018,Serie-A,it,Italy,SSC Napoli


In [97]:
df['player'].value_counts()

João Cancelo              3
Leandro Cabrera           3
Leonardo Bittencourt      3
Enric Gallego             3
Vincent Pajot             3
Giovani Lo Celso          3
Marc Cucurella            3
Vincenzo Grifo            3
Karl Toko Ekambi          3
Jeison Murillo            3
Alfred Gomis              3
Rémy Cabella              3
Marius Wolf               2
Jeff Reine-Adélaïde       2
Kevin-Prince Boateng      2
Lukas Lerager             2
Kyle Walker-Peters        2
Jordan Ferri              2
Emil Krafth               2
Martin Braithwaite        2
Gabriel                   2
Davy Klaassen             2
Yannis Salibur            2
Emre Can                  2
Lucas Pérez               2
Luis Muriel               2
Khouma Babacar            2
Loïc Rémy                 2
Youssef En-Nesyri         2
Ryad Boudebouz            2
Alfred Duncan             2
Marco Tumminello          2
Diego Falcinelli          2
Christian Kouamé          2
Andrea Petagna            2
Vincent Laurini     

In [201]:
df.loc[df['player']=="Terence Kongolo"]

Unnamed: 0,player,age_x,nationality_x,position_x,selling_club,previous_league,est_market_value,fee,buying_club,window,transfer_year,buying_league,loan,free,buying_top_5,currency,multiplier,fee_numerical,mult_num,fee_final,index_year,index_name,index_nationality_x,index_selling_club_x,players,nationality_y,team,position_y,age_y,birth_year,games,games_start,mins,goals,assists,pens_successful,pens_attempts,yellow_cards,red_cards,goals_per_90,assists_per_90,goals_and_assists_per_90,goals_pk_per_90,goals_assists_pk_per_90,xg,npxp,xa,xg_per90,xa_per90,xg_xa_per90_list,npxg_per90_list,npxg_xa_per90,full_90s_played,shots_total,shots_on_target,shots_total_per90,shots_on_target_per90,npxg,...,sca_defense,gca,gca_per90,gca_passes_live,gca_passes_dead,gca_dribbles,gca_shots,gca_fouled,gca_defense,gca_og_for,tackles,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,dribble_tackles,dribble_vs,dribbled_past,pressures,pressure_regains,pressures_def_3rd,pressures_mid_3rd,pressures_att_3rd,blocks,blocked_shots,blocked_shots_saves,blocked_passes,interceptions,tackles_interceptions,clearances,errors,touches,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,dribbles_completed,dribbles,players_dribbled_past,nutmegs,carries,carry_distance,carry_progressive_distance,pass_targets,passes_received,miscontrols,dispossessed,passes_left_foot,passes_right_foot,aerials_won,aerials_lost,stats_year,league,nationality_abb,index_nationality_y,index_selling_club_y
38,Terence Kongolo,24,Netherlands,Centre-Back,AS Monaco,France,£9.00m,£18.00m,Huddersfield Town,s_w=s,2018,GB1,0,0,1,£,m,18.0,1000000.0,18000000.0,2018,Terence Kongolo,Netherlands,AS Monaco,Terence Kongolo,nl NED,Huddersfield,DF,23.0,1994.0,13,11,1049,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.3,0.3,0.3,0.02,0.03,0.05,0.02,0.05,11.7,2.0,0,0.17,0.0,0.3,...,0.0,1.0,0.09,1.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,15.0,16.0,15.0,1.0,8.0,17.0,9.0,144.0,46.0,94.0,39.0,11.0,26.0,8.0,0.0,18.0,20.0,52,41.0,1.0,540.0,56.0,223.0,227.0,106.0,10.0,484.0,4.0,10.0,8.0,2.0,244.0,1215.0,685.0,248.0,226.0,5.0,3.0,260.0,34.0,16.0,7.0,2017-2018,Premier-League,nl,Netherlands,Huddersfield Town
39,Terence Kongolo,24,Netherlands,Centre-Back,AS Monaco,France,£9.00m,£18.00m,Huddersfield Town,s_w=s,2018,GB1,0,0,1,£,m,18.0,1000000.0,18000000.0,2018,Terence Kongolo,Netherlands,AS Monaco,Terence Kongolo,nl NED,Monaco,DF,23.0,1994.0,3,3,256,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.1,0.1,0.3,0.04,0.12,0.16,0.04,0.16,2.8,1.0,0,0.35,0.0,0.1,...,0.0,1.0,0.35,1.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,5.0,5.0,1.0,0.0,3.0,5.0,2.0,24.0,4.0,11.0,8.0,5.0,6.0,3.0,0.0,3.0,3.0,9,7.0,0.0,189.0,13.0,52.0,91.0,58.0,3.0,156.0,1.0,1.0,1.0,0.0,93.0,587.0,329.0,99.0,92.0,2.0,2.0,100.0,15.0,4.0,0.0,2017-2018,Ligue-1,nl,Netherlands,AS Monaco


Time at multiple clubs in one year (player, transfer year, row indices):
Terence Kongolo 2018 38/39
Enzo Crivelli 2018 131/132
Clément Grenier 2018 161/162
Alessandro Murgia 2019 645/646
Youri Tielemans 2019 70/71
Nenad Tomović 2018 514/515
André Silva 2020 490/491
Suso 2020 387/388
Marco Capuano 2018 525/526 
Gelson Martins 2019 191/192
Carles Pérez 2020 692/693
Koffi Djidji 2019 652/653
Rúben Vezo 2019 368/369
Facundo Roncaglia 2019 377/378
Roberto Soriano 2019 592/593
Luca Pellegrini 2019 614/615
Souleyman Doumbia 2020 242/243
Sebastien De Maio 2019 656/657
Coke 2018 287/288
Gonçalo Guedes 2018 306/307
Valon Berisha 2020 232/233
Giuseppe Pezzella 2020 687/688
Yassine Benrahou 2020 253/254
Soualiho Meïté 2018 569/570
Martin Hinteregger 2019 458/459
Dawid Kownacki 2019 438/439
Nicola Sansone 2019 590/591
Roque Mesa 2018 303/304
Kylian Mbappé 2018 156/157
John Guidetti 2018 274/275
Takashi Inui 2019 355/356
Fabrizio Cacciatore 2019 598/599
Pablo Hervías 2019 372/373
Gerard Deulofeu 2018 29/30
Vid Belec 2018 556/557
Ondrej Duda 2020 496/497
Denis Suárez 2019 374/375
Dimitri Foulquier 2020 394/395
Sergi Darder 2018 289/290
Christian Kouamé 2020 673/674
Diego Falcinelli 2018 509/510
Marco Tumminello 2018 506/507
Alfred Duncan 2020 670/671
Ryad Boudebouz 2019 212/213
João Cancelo 2018 533/534 (533 is Inter, 534 is one game for Valencia)
Leandro Cabrera 2018 283/284 (Crotone and Getafe)
Leonardo Bittencourt 2020 499/500
Enric Gallego 2020 396/397
Vincent Pajot 2020 250/251
Giovani Lo Celso 2019 360/361
Vincenzo Grifo 2019 471/472
Karl Toko Ekambi 2020 236/237
Jeison Murillo 2019 584/585
Rémy Cabella 2018 164/165
Kevin-Prince Boateng 2019 601/602
Lukas Lerager 2019 607/608
Kyle Walker-Peters 2020 117/118
Jordan Ferri 2019 198/199
Yannis Salibur 2019 380/381
Emre Can 2020 484/485
Luis Muriel 2019 587/588
Khouma Babacar 2018 558/559
Loïc Rémy 2018 139/140
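
The index pairs above were collected by hand. As a rough sketch, the same rows can be flagged programmatically: a transfer that matched more than one stats row repeats every transfer-side column and differs only in the stats-side ones. The toy frame below reuses a few values from the Pulisic and Kongolo rows shown earlier. The `transfer_key` list and the "keep the row with the most minutes" rule are assumptions for illustration, not necessarily what this notebook does next.

```python
import pandas as pd

# Toy stand-in for the merged frame, using values from the rows displayed above.
df = pd.DataFrame(
    {
        "player": ["Christian Pulisic", "Terence Kongolo", "Terence Kongolo"],
        "selling_club": ["Borussia Dortmund", "AS Monaco", "AS Monaco"],
        "buying_club": ["Chelsea FC", "Huddersfield Town", "Huddersfield Town"],
        "fee": ["£57.60m", "£18.00m", "£18.00m"],
        "transfer_year": [2018, 2018, 2018],
        "team": ["Dortmund", "Huddersfield", "Monaco"],
        "mins": [2302, 1049, 256],
    },
    index=[0, 38, 39],
)

# Assumed key: the columns that identify a single transfer.
transfer_key = ["player", "selling_club", "buying_club", "fee", "transfer_year"]

# keep=False marks every row of a repeated transfer, not just the later copies.
multi_club = df[df.duplicated(subset=transfer_key, keep=False)]
print(multi_club[["player", "team", "mins"]])

# One possible resolution (an assumption): keep the stats row with the most minutes.
most_minutes = df.loc[df.groupby(transfer_key)["mins"].idxmax()]
print(most_minutes[["player", "team", "mins"]])
```

Summing the counting stats across both clubs would be another option, but the per-90 columns would then need to be recomputed rather than added.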

Purchased multiple times in the same year:
Ionuț Radu 2019
Mikel Merino 2018 (purchased twice in the summer window)
Marc Cucurella 2019
Marius Wolf 2018
Emil Krafth 2019
Martin Braithwaite 2019
Andrea Petagna 2019
Andrea Pinamonti 2020
Omar Mascarell 2018
Paco Alcácer 2019
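
Players in this second group have genuinely distinct transfers, so their repeated names should not be collapsed. A minimal sketch of how to tell them apart from the multi-club case: drop repeated transfer keys first, then count what remains per player and year. The player and club names below are made-up placeholders, and the key columns are an assumption.

```python
import pandas as pd

# Placeholder data: "Player A" is bought twice in 2019;
# "Player B" has one transfer that matched two stats rows.
transfers = pd.DataFrame(
    {
        "player": ["Player A", "Player A", "Player B", "Player B"],
        "selling_club": ["Club 1", "Club 2", "Club 3", "Club 3"],
        "buying_club": ["Club 2", "Club 4", "Club 5", "Club 5"],
        "transfer_year": [2019, 2019, 2018, 2018],
        "team": ["Club 1", "Club 2", "Club 3", "Club 6"],
    }
)

# Assumed key: the columns that identify a single transfer.
transfer_key = ["player", "selling_club", "buying_club", "transfer_year"]

# Collapse repeated stats rows so each transfer is counted once.
distinct = transfers.drop_duplicates(subset=transfer_key)

# Count distinct transfers per player and year; more than one means a real second move.
n_moves = distinct.groupby(["player", "transfer_year"]).size()
moved_twice = n_moves[n_moves > 1]
print(moved_twice)
```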

Danilo index 595 has the Manchester City Danilo's statistics (wrong player)
Danilo index 596 is correct
Naldo index 127: delete
Gabriel to Arsenal 2020 (index 95): stats are wrong
Gabriel to Benfica 2018 (index 708): stats are correct
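
Rows flagged for deletion can be removed by index label with `DataFrame.drop`. The sketch below uses a toy frame carrying only the labels mentioned in the notes. The `bad_rows` list is an assumption: it holds the mismatched Danilo row and the Naldo row, and leaves out index 95 because the notes say only that its stats are wrong, not that the row should go.

```python
import pandas as pd

# Toy frame with the index labels from the notes above.
df = pd.DataFrame(
    {"player": ["Naldo", "Danilo", "Danilo"]},
    index=[127, 595, 596],
)

# Labels flagged above: 127 (delete) and 595 (another Danilo's statistics).
bad_rows = [127, 595]

# drop() matches on index labels, not positions, and returns a new frame.
cleaned = df.drop(index=bad_rows)
print(cleaned)
```

These labels are only valid while the frame keeps the index it had when the notes were written. After a `reset_index` or a re-merge, the same numbers may point at different rows.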