# Football Transfer Data Project
##### by David VanHeeswijk

The purpose of this notebook is to explore the data sets in the [Kaggle: European Football Transfers Dataset](https://www.kaggle.com/giovannibeli/european-football-transfers-database). This dataset covers many areas of European football, including:
* Player Stats
* Player Information
* Club Records and Stats
* Transfers from season to season across Europe's football leagues
* League Basic information
* Coaches and Stadia
* National Team stats for players
* etc.

We would like to answer the following questions:
1. What are the best indicators for predicting market value and transfer fees?
2. Which nationality produces the best players for value across the entire European football system?
3. What player positions produce the most *bang for your buck*?
4. What is the ideal age to purchase/sell a player?

In this notebook, we focus on wrangling the data: merging several of the CSV files into one unified data set with a limited number of features from which we can build a model. We start by importing our libraries.

In [632]:
import pandas as pd
import numpy as np

import pandas_profiling
from pandas_profiling.utils.cache import cache_file

from pathlib import Path

import datetime

In [633]:
# Loading in csv files into DataFrames to explore
# First, we will load our data sets containing transfer data

transfers= pd.read_csv('transfers.csv', delimiter=';')

In [634]:
transfers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111602 entries, 0 to 111601
Data columns (total 19 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   111602 non-null  int64  
 1   player_id            111602 non-null  int64  
 2   player_name          111602 non-null  object 
 3   season               111602 non-null  int64  
 4   date                 111530 non-null  object 
 5   from_club_id         111602 non-null  int64  
 6   from_club_name       111602 non-null  object 
 7   to_club_id           111602 non-null  int64  
 8   to_club_name         111602 non-null  object 
 9   market_value         72589 non-null   float64
 10  fee                  42799 non-null   float64
 11  from_coach_name      38687 non-null   object 
 12  to_coach_name        38690 non-null   object 
 13  from_sport_dir_name  17854 non-null   object 
 14  to_sport_dir_name    18226 non-null   object 
 15  contract_was_till

We want to make a list of features that will be used to create our model for predicting fees. We will narrow it down to these 10:

* season
* position
* nationality
* league
* goals + assists
* total minutes played
* height
* dob
* club position in league
* market value
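
The `dob` feature is likely to be most useful to a model as an age at the time of each season. A minimal sketch of that conversion on toy rows follows. It assumes that a season is labelled by its starting year and begins on 1 July; neither is stated in the dataset, and the rows are made up for illustration.

```python
import pandas as pd

# Toy rows with made-up values; the column names mirror the ones used in this notebook
toy = pd.DataFrame({
    'player_name': ['A', 'B'],
    'season': [2011, 2017],
    'date_of_birth': ['1989-12-15', '1983-12-09'],
})

# Assumption: a season labelled 2011 starts on 1 July 2011
season_start = pd.to_datetime(toy['season'].astype(str) + '-07-01')
dob = pd.to_datetime(toy['date_of_birth'])

# True where the player's birthday falls on or before the season start date
had_birthday = (season_start.dt.month > dob.dt.month) | (
    (season_start.dt.month == dob.dt.month) & (season_start.dt.day >= dob.dt.day))

# Completed years between date of birth and season start
toy['age'] = season_start.dt.year - dob.dt.year - (~had_birthday).astype(int)
print(toy)
```

Computing age per row rather than once per player matters here because the same player appears in many seasons.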

In [635]:
transfers.columns

Index(['id', 'player_id', 'player_name', 'season', 'date', 'from_club_id',
       'from_club_name', 'to_club_id', 'to_club_name', 'market_value', 'fee',
       'from_coach_name', 'to_coach_name', 'from_sport_dir_name',
       'to_sport_dir_name', 'contract_was_till', 'is_loan', 'is_end_of_loan',
       'is_future_transfer'],
      dtype='object')

In [636]:
new_transfers = transfers[['player_name', 'season','market_value', 'fee', 'from_club_id', 'from_club_name', 'to_club_id','to_club_name', 'is_loan', 'is_end_of_loan']].copy()
# Where market_value is missing, fall back to the fee paid
new_transfers['market_value'] = new_transfers['market_value'].fillna(new_transfers['fee'])

new_transfers.head()

Unnamed: 0,player_name,season,market_value,fee,from_club_id,from_club_name,to_club_id,to_club_name,is_loan,is_end_of_loan
0,Jermaine Beckford,2017,500000.0,0.0,391,Preston NE,392,Bury,0,0
1,Jermaine Beckford,2015,750000.0,0.0,289,Bolton,391,Preston NE,0,0
2,Jermaine Beckford,2014,750000.0,,391,Preston NE,289,Bolton,0,1
3,Jermaine Beckford,2014,1200000.0,,289,Bolton,391,Preston NE,1,0
4,Jermaine Beckford,2013,1500000.0,,271,Leicester,289,Bolton,0,0
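
Taking a column subset of a DataFrame and then modifying it in place is what typically triggers pandas' `SettingWithCopyWarning`. A minimal sketch of one way to avoid it, using a toy frame with made-up values rather than the real transfers data: take an explicit `.copy()` of the subset and assign the filled column back.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transfers table (values are made up for illustration)
toy = pd.DataFrame({
    'player_name': ['A', 'B', 'C'],
    'market_value': [500000.0, np.nan, np.nan],
    'fee': [0.0, 250000.0, np.nan],
    'extra': [1, 2, 3],
})

# .copy() makes the subset an independent DataFrame, so later writes are unambiguous
subset = toy[['player_name', 'market_value', 'fee']].copy()

# Assign the result back instead of calling fillna(inplace=True) on a selected column
subset['market_value'] = subset['market_value'].fillna(subset['fee'])

print(subset)
```

Rows where both `market_value` and `fee` are missing stay null, and the original frame is left unchanged.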


In [637]:
new_transfers = new_transfers[new_transfers['to_club_name'] != 'Retired']

new_transfers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 108090 entries, 0 to 111601
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   player_name     108090 non-null  object 
 1   season          108090 non-null  int64  
 2   market_value    82487 non-null   float64
 3   fee             42783 non-null   float64
 4   from_club_id    108090 non-null  int64  
 5   from_club_name  108090 non-null  object 
 6   to_club_id      108090 non-null  int64  
 7   to_club_name    108090 non-null  object 
 8   is_loan         108090 non-null  int64  
 9   is_end_of_loan  108090 non-null  int64  
dtypes: float64(2), int64(5), object(3)
memory usage: 9.1+ MB


In [638]:
new_transfers.loc[(new_transfers['fee'].isnull())&(new_transfers['is_loan'] + new_transfers['is_end_of_loan'] > 0),'fee'] = 0

new_transfers.head()

Unnamed: 0,player_name,season,market_value,fee,from_club_id,from_club_name,to_club_id,to_club_name,is_loan,is_end_of_loan
0,Jermaine Beckford,2017,500000.0,0.0,391,Preston NE,392,Bury,0,0
1,Jermaine Beckford,2015,750000.0,0.0,289,Bolton,391,Preston NE,0,0
2,Jermaine Beckford,2014,750000.0,0.0,391,Preston NE,289,Bolton,0,1
3,Jermaine Beckford,2014,1200000.0,0.0,289,Bolton,391,Preston NE,1,0
4,Jermaine Beckford,2013,1500000.0,,271,Leicester,289,Bolton,0,0


In [639]:
new_transfers['is_loan'] = new_transfers['is_loan'] + new_transfers['is_end_of_loan']
new_transfers.drop('is_end_of_loan', axis=1, inplace=True)
new_transfers.head()

Unnamed: 0,player_name,season,market_value,fee,from_club_id,from_club_name,to_club_id,to_club_name,is_loan
0,Jermaine Beckford,2017,500000.0,0.0,391,Preston NE,392,Bury,0
1,Jermaine Beckford,2015,750000.0,0.0,289,Bolton,391,Preston NE,0
2,Jermaine Beckford,2014,750000.0,0.0,391,Preston NE,289,Bolton,1
3,Jermaine Beckford,2014,1200000.0,0.0,289,Bolton,391,Preston NE,1
4,Jermaine Beckford,2013,1500000.0,,271,Leicester,289,Bolton,0


In [640]:
# A transfer counts as free unless a fee was paid or the move was a loan
new_transfers['free_transfer'] = ~((new_transfers['fee'] > 0) | (new_transfers['is_loan'] == 1))
        
new_transfers.head()

Unnamed: 0,player_name,season,market_value,fee,from_club_id,from_club_name,to_club_id,to_club_name,is_loan,free_transfer
0,Jermaine Beckford,2017,500000.0,0.0,391,Preston NE,392,Bury,0,True
1,Jermaine Beckford,2015,750000.0,0.0,289,Bolton,391,Preston NE,0,True
2,Jermaine Beckford,2014,750000.0,0.0,391,Preston NE,289,Bolton,1,False
3,Jermaine Beckford,2014,1200000.0,0.0,289,Bolton,391,Preston NE,1,False
4,Jermaine Beckford,2013,1500000.0,,271,Leicester,289,Bolton,0,True


In [642]:
new_transfers.drop('is_loan',axis=1,inplace=True)

In [643]:
new_transfers.head()

Unnamed: 0,player_name,season,market_value,fee,from_club_id,from_club_name,to_club_id,to_club_name,free_transfer
0,Jermaine Beckford,2017,500000.0,0.0,391,Preston NE,392,Bury,True
1,Jermaine Beckford,2015,750000.0,0.0,289,Bolton,391,Preston NE,True
2,Jermaine Beckford,2014,750000.0,0.0,391,Preston NE,289,Bolton,False
3,Jermaine Beckford,2014,1200000.0,0.0,289,Bolton,391,Preston NE,False
4,Jermaine Beckford,2013,1500000.0,,271,Leicester,289,Bolton,True


In [644]:
new_transfers.drop('free_transfer',axis=1,inplace=True)

While some values are still missing in the 'fee' and 'market_value' columns, we have gathered the data we need from the transfers DataFrame. Next, we pick up features from the player stats and player dictionary data sets and combine the three.

In [645]:
player_stats = pd.read_csv('stats_of_players.csv', delimiter=';')
player_dict = pd.read_csv('dict_players.csv', delimiter=';')

player_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231379 entries, 0 to 231378
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                231379 non-null  int64  
 1   player_id         231379 non-null  int64  
 2   player_name       231379 non-null  object 
 3   season            231379 non-null  int64  
 4   league_id         231379 non-null  int64  
 5   league_name       231379 non-null  object 
 6   club_id           231379 non-null  int64  
 7   club_name         231379 non-null  object 
 8   apps              231379 non-null  int64  
 9   points_per_match  225422 non-null  float64
 10  goals             166957 non-null  float64
 11  assists           159543 non-null  float64
 12  conceded_goals    118416 non-null  float64
 13  clean_sheets      117263 non-null  float64
 14  yellow_card       190724 non-null  float64
 15  two_yellow_cards  117566 non-null  float64
 16  red_card          11

In [646]:
player_dict.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11382 entries, 0 to 11381
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           11382 non-null  int64  
 1   name                         11382 non-null  object 
 2   original_name                6456 non-null   object 
 3   club_id                      11382 non-null  int64  
 4   club_name                    11382 non-null  object 
 5   position_main                11362 non-null  object 
 6   other_positions              8043 non-null   object 
 7   nationality_name             11382 non-null  object 
 8   nationality_code             11088 non-null  object 
 9   other_nationality_name       3427 non-null   object 
 10  other_nationality_code       3215 non-null   object 
 11  date_of_birth                11362 non-null  object 
 12  place_of_birth_name          11278 non-null  object 
 13  place_of_birth_c

We will start with the player_stats dataframe, and gather only the columns that we need for our model.

In [647]:
player_stats.columns

Index(['id', 'player_id', 'player_name', 'season', 'league_id', 'league_name',
       'club_id', 'club_name', 'apps', 'points_per_match', 'goals', 'assists',
       'conceded_goals', 'clean_sheets', 'yellow_card', 'two_yellow_cards',
       'red_card', 'minutes_played'],
      dtype='object')

In [648]:
# Select the numeric columns before summing so that text columns are not aggregated
player_stats = player_stats.groupby(['player_name','season','club_name'])[['goals','assists', 'apps', 'minutes_played']].sum()

player_stats.reset_index().head()

Unnamed: 0,player_name,season,club_name,goals,assists,apps,minutes_played
0,Aaron Cresswell,2008,Tranmere Rovers,1.0,0.0,14,852
1,Aaron Cresswell,2009,Tranmere Rovers,0.0,1.0,16,1386
2,Aaron Cresswell,2010,Tranmere Rovers,5.0,6.0,47,4020
3,Aaron Cresswell,2011,Ipswich Town,1.0,6.0,46,4111
4,Aaron Cresswell,2012,Ipswich Town,4.0,5.0,49,4440


In [649]:
player_stats['goal_contributions'] = player_stats['goals']+player_stats['assists']
player_stats['minutes_per_appearance'] = player_stats['minutes_played']//player_stats['apps']

player_stats.drop(['goals', 'assists', 'apps'],axis=1, inplace=True)

player_stats.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,minutes_played,goal_contributions,minutes_per_appearance
player_name,season,club_name,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60
Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86
Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85
Aaron Cresswell,2011,Ipswich Town,4111,7.0,89
Aaron Cresswell,2012,Ipswich Town,4440,9.0,90


In [650]:
player_stats.reset_index(inplace=True)

In [651]:
player_stats.head()

Unnamed: 0,player_name,season,club_name,minutes_played,goal_contributions,minutes_per_appearance
0,Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60
1,Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86
2,Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85
3,Aaron Cresswell,2011,Ipswich Town,4111,7.0,89
4,Aaron Cresswell,2012,Ipswich Town,4440,9.0,90


Now that we have gathered the relevant columns that we will need for our analysis from the stats dataframe, we pull in the player info from the player_dict dataframe and merge the two together.

In [652]:
player_dict.columns

Index(['id', 'name', 'original_name', 'club_id', 'club_name', 'position_main',
       'other_positions', 'nationality_name', 'nationality_code',
       'other_nationality_name', 'other_nationality_code', 'date_of_birth',
       'place_of_birth_name', 'place_of_birth_country_name',
       'place_of_birth_country_code', 'foot', 'height', 'player_agent',
       'joined', 'contract_until', 'outfiter', 'last_extention',
       'contract_options', 'current_market_value'],
      dtype='object')

In [653]:
player_info = player_dict[['name','position_main', 'nationality_name','nationality_code', 'date_of_birth', 'height']]

player_info.head()

Unnamed: 0,name,position_main,nationality_name,nationality_code,date_of_birth,height
0,Jermaine Beckford,Centre-Forward,Jamaica,JAM,1983-12-09,188.0
1,Harry Charsley,Central Midfield,Ireland,IRL,1996-11-01,
2,Mark Davies,Central Midfield,England,GBR,1988-02-18,180.0
3,Alex McQuade,Centre-Back,England,GBR,1992-11-07,
4,Przemyslaw Kazimierczak,Goalkeeper,Poland,POL,1988-05-05,191.0


In [654]:
player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11382 entries, 0 to 11381
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              11382 non-null  object 
 1   position_main     11362 non-null  object 
 2   nationality_name  11382 non-null  object 
 3   nationality_code  11088 non-null  object 
 4   date_of_birth     11362 non-null  object 
 5   height            11046 non-null  float64
dtypes: float64(1), object(5)
memory usage: 533.7+ KB


We are missing some nationality codes, dates of birth, and positions. We can leave the null positions for now and focus on the missing nationality codes and dates of birth.

In [655]:
player_info[player_info['nationality_code'].isnull()]

Unnamed: 0,name,position_main,nationality_name,nationality_code,date_of_birth,height
56,Sead Kolasinac,Left-Back,Bosnia-Herzegovina,,1993-06-20,183.0
100,Wilfried Bony,Centre-Forward,Cote d'Ivoire,,1988-12-10,181.0
130,Franck Kessié,Central Midfield,Cote d'Ivoire,,1996-12-19,183.0
131,Didier Drogba,Centre-Forward,Cote d'Ivoire,,1978-03-11,189.0
138,Zvjezdan Misimovic,Attacking Midfield,Bosnia-Herzegovina,,1982-06-05,179.0
...,...,...,...,...,...,...
11321,Clarck Nsikulu,Left Winger,DR Congo,,1992-07-10,180.0
11325,Yohan Boli,Centre-Forward,Cote d'Ivoire,,1993-11-17,181.0
11345,Wilfred Moke,Defensive Midfield,DR Congo,,1988-02-12,183.0
11356,Elie Kroupi,Centre-Forward,Cote d'Ivoire,,1979-10-18,175.0


In [656]:
player_info[player_info['nationality_code'].isnull()]['nationality_name'].unique()

array(['Bosnia-Herzegovina', "Cote d'Ivoire", 'DR Congo', 'Curacao',
       'Tahiti', 'Cape Verde', 'Kosovo', 'Korea, North', 'Palästina',
       'Chinese Taipei (Taiwan)'], dtype=object)

We see that there are a few countries that pop up with no nationality code. After a quick search of FIFA country codes, we found the missing info and will fill it in now.

In [657]:
# FIFA codes for the nationalities that have no code in the data
# (DR Congo is COD; CGO belongs to the Republic of the Congo)
country_codes = {'Bosnia-Herzegovina': 'BIH', "Cote d'Ivoire": 'CIV', 'DR Congo': 'COD',
                 'Curacao': 'CUW', 'Tahiti': 'TAH', 'Cape Verde': 'CPV',
                 'Kosovo': 'KVX', 'Korea, North': 'PRK', 'Palästina': 'PLE', 'Chinese Taipei (Taiwan)': 'TPE'}

# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
player_info = player_info.copy()

# Set the code for every player whose nationality is in the lookup
mask = player_info['nationality_name'].isin(list(country_codes))
player_info.loc[mask, 'nationality_code'] = player_info.loc[mask, 'nationality_name'].map(country_codes)

player_info.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11382 entries, 0 to 11381
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              11382 non-null  object 
 1   position_main     11362 non-null  object 
 2   nationality_name  11382 non-null  object 
 3   nationality_code  11382 non-null  object 
 4   date_of_birth     11362 non-null  object 
 5   height            11046 non-null  float64
dtypes: float64(1), object(5)
memory usage: 533.7+ KB


In [658]:
player_info[player_info['date_of_birth'].isnull()]

Unnamed: 0,name,position_main,nationality_name,nationality_code,date_of_birth,height
20,Karim Matmour,Right Winger,Algeria,DZA,,181.0
3408,Mineiro,Defensive Midfield,Brazil,BRA,,169.0
3947,David Odonkor,Right Winger,Germany,DEU,,172.0
5485,Carsten Ramelow,Defensive Midfield,Germany,DEU,,186.0
7209,Faruk Namdar,Attacking Midfield,Turkey,TUR,,184.0
7549,Markus Bollmann,Centre-Back,Germany,DEU,,190.0
7841,Markus Kurth,Centre-Forward,Germany,DEU,,180.0
7922,Philipp Bönig,Left-Back,Germany,DEU,,175.0
8378,Daniel Halfar,Attacking Midfield,Germany,DEU,,173.0
8631,Moses Sichone,Centre-Back,Zambia,ZMB,,187.0


After doing some quick searches, we realize that these players are all retired, so rather than filling in the information now, we will wait until after we have merged the player info to the stats and transfer data frames before deciding if these need to be fixed.

As for height, we will fill in the missing values with the mean height for each position, as there are too many missing entries to look up online.
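
Filling missing values with a per-group mean can be sketched on a toy frame as follows. The names and heights are made up, and `transform('mean')` is one common way to do this rather than necessarily the approach used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

# Toy player table with two missing heights (values are made up)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'position_main': ['Goalkeeper', 'Goalkeeper', 'Goalkeeper', 'Left-Back', 'Left-Back'],
    'height': [190.0, 192.0, np.nan, 176.0, np.nan],
})

# transform('mean') returns one value per row: the mean height of that row's position
position_mean = toy.groupby('position_main')['height'].transform('mean')

# Only the missing heights are replaced; known heights are left untouched
toy['height'] = toy['height'].fillna(position_mean)
print(toy)
```

A row whose position is itself missing has no group to draw a mean from, so its height stays null.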

In [659]:
player_info[player_info['height'].isnull()][['name','position_main']].head(10)

Unnamed: 0,name,position_main
1,Harry Charsley,Central Midfield
3,Alex McQuade,Centre-Back
76,Marcelo Bordon,Centre-Back
317,Michael Ballack,Central Midfield
335,Ewerthon,Centre-Forward
388,Fernando Morientes,Centre-Forward
468,Cris,Centre-Back
595,Martin Petrov,Left Winger
773,Maniche,Central Midfield
779,Tomás Ujfalusi,Centre-Back


In [660]:
height_means = player_info.groupby('position_main')['height'].mean().reset_index()

height_means = pd.DataFrame(height_means)

Note that the listed positions can be simplified, since many are similar, such as Centre-Back and Sweeper, or Right Winger and Right Midfield. Combining them should make the analysis more realistic: position names depend on the formation, which should not affect market value.

In [661]:
# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
player_info = player_info.copy()

# Look up each player's position mean and use it only where height is missing
position_means = height_means.set_index('position_main')['height']
player_info['height'] = player_info['height'].fillna(player_info['position_main'].map(position_means))

player_info.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11382 entries, 0 to 11381
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              11382 non-null  object 
 1   position_main     11362 non-null  object 
 2   nationality_name  11382 non-null  object 
 3   nationality_code  11382 non-null  object 
 4   date_of_birth     11362 non-null  object 
 5   height            11381 non-null  float64
dtypes: float64(1), object(5)
memory usage: 533.7+ KB


In [662]:
# Removing the only player in our database that has no position or height.
player_info = player_info[player_info['name'] != 'Müslim Can']
player_info.head()

Unnamed: 0,name,position_main,nationality_name,nationality_code,date_of_birth,height
0,Jermaine Beckford,Centre-Forward,Jamaica,JAM,1983-12-09,188.0
1,Harry Charsley,Central Midfield,Ireland,IRL,1996-11-01,179.777417
2,Mark Davies,Central Midfield,England,GBR,1988-02-18,180.0
3,Alex McQuade,Centre-Back,England,GBR,1992-11-07,186.662449
4,Przemyslaw Kazimierczak,Goalkeeper,Poland,POL,1988-05-05,191.0


In [663]:
player_info['position_main'].unique()

array(['Centre-Forward', 'Central Midfield', 'Centre-Back', 'Goalkeeper',
       'Right Winger', 'Left Winger', 'Second Striker',
       'Defensive Midfield', 'Attacking Midfield', 'Right Midfield',
       'Right-Back', 'Left-Back', 'Left Midfield', nan, 'Sweeper'],
      dtype=object)

In [664]:
positions = {'Centre-Forward':'S', 'Central Midfield': 'CM', 'Centre-Back':'CB', 'Goalkeeper':'GK',
       'Right Winger':'RM', 'Left Winger':'LM', 'Second Striker':'S',
       'Defensive Midfield':'CDM', 'Attacking Midfield':'CAM', 'Right Midfield':'RM',
       'Right-Back':'RB', 'Left-Back':'LB', 'Left Midfield':'LM', 'Sweeper':'CB'}

player_info.replace({'position_main':positions}, inplace=True)
player_info.head()

Unnamed: 0,name,position_main,nationality_name,nationality_code,date_of_birth,height
0,Jermaine Beckford,S,Jamaica,JAM,1983-12-09,188.0
1,Harry Charsley,CM,Ireland,IRL,1996-11-01,179.777417
2,Mark Davies,CM,England,GBR,1988-02-18,180.0
3,Alex McQuade,CB,England,GBR,1992-11-07,186.662449
4,Przemyslaw Kazimierczak,GK,Poland,POL,1988-05-05,191.0


In [665]:
player_info.rename(columns={'position_main':'position', 'name':'player_name'},inplace=True)
player_info.head()

Unnamed: 0,player_name,position,nationality_name,nationality_code,date_of_birth,height
0,Jermaine Beckford,S,Jamaica,JAM,1983-12-09,188.0
1,Harry Charsley,CM,Ireland,IRL,1996-11-01,179.777417
2,Mark Davies,CM,England,GBR,1988-02-18,180.0
3,Alex McQuade,CB,England,GBR,1992-11-07,186.662449
4,Przemyslaw Kazimierczak,GK,Poland,POL,1988-05-05,191.0


Now we merge the two player-specific data frames, using the player name as the merge key.

In [666]:
player_info = player_info[['player_name','position','nationality_code','date_of_birth','height']]
player_df = pd.merge(player_stats,player_info, how='left', on='player_name')

player_df.head()

Unnamed: 0,player_name,season,club_name,minutes_played,goal_contributions,minutes_per_appearance,position,nationality_code,date_of_birth,height
0,Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60,LB,GBR,1989-12-15,170.0
1,Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86,LB,GBR,1989-12-15,170.0
2,Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85,LB,GBR,1989-12-15,170.0
3,Aaron Cresswell,2011,Ipswich Town,4111,7.0,89,LB,GBR,1989-12-15,170.0
4,Aaron Cresswell,2012,Ipswich Town,4440,9.0,90,LB,GBR,1989-12-15,170.0


In [667]:
player_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146277 entries, 0 to 146276
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   player_name             146277 non-null  object 
 1   season                  146277 non-null  int64  
 2   club_name               146277 non-null  object 
 3   minutes_played          146277 non-null  int64  
 4   goal_contributions      146277 non-null  float64
 5   minutes_per_appearance  146277 non-null  int64  
 6   position                146086 non-null  object 
 7   nationality_code        146264 non-null  object 
 8   date_of_birth           145958 non-null  object 
 9   height                  146264 non-null  float64
dtypes: float64(2), int64(3), object(5)
memory usage: 12.3+ MB
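
Merging on `player_name` alone assumes that each name appears only once in `player_info`. If two different players share a name, a left merge repeats every stats row once per match. A minimal sketch of how this could be checked, on toy data with made-up names; the `validate` argument of `pd.merge` raises a `MergeError` when the stated key relationship does not hold.

```python
import pandas as pd

# Toy stats and info tables; 'B' appears twice in the info table
stats = pd.DataFrame({'player_name': ['A', 'A', 'B'], 'season': [2010, 2011, 2010]})
info = pd.DataFrame({'player_name': ['A', 'B', 'B'], 'position': ['LB', 'GK', 'S']})

# Names that occur more than once in the lookup table
dup_names = info.loc[info['player_name'].duplicated(keep=False), 'player_name'].unique()
print(dup_names)

# validate='many_to_one' requires the right-hand keys to be unique
try:
    pd.merge(stats, info, how='left', on='player_name', validate='many_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)
```

If the check fails on the real data, merging on an identifier such as `player_id` (present in the raw stats and transfers tables) may be a safer key than the name.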


In [668]:
player_df = player_df[player_df['date_of_birth'].notnull()]
player_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 145958 entries, 0 to 146276
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   player_name             145958 non-null  object 
 1   season                  145958 non-null  int64  
 2   club_name               145958 non-null  object 
 3   minutes_played          145958 non-null  int64  
 4   goal_contributions      145958 non-null  float64
 5   minutes_per_appearance  145958 non-null  int64  
 6   position                145780 non-null  object 
 7   nationality_code        145958 non-null  object 
 8   date_of_birth           145958 non-null  object 
 9   height                  145958 non-null  float64
dtypes: float64(2), int64(3), object(5)
memory usage: 12.2+ MB


In [669]:
player_df[player_df['position'].isnull()]['player_name'].unique()

array(['Abdelmajid Oulmers', "Alain N'Kong",
       'Aleksandar Yordanov Aleksandrov', 'André Paulo Pinto',
       'Can Cumhur Bozaci', 'Cristiano', 'Cyrille Watier', 'Devran Ayhan',
       'Emra Tahirovic', 'Fransergio', 'Gastón Curbelo',
       'Goran Stavrevski', 'Hasan Yurt', 'Laurentiu Rosu', 'Lubos Pecka',
       'Mithat Yavas', 'Pini Balili', 'Ramazan Tunc', 'Serkan Bensol'],
      dtype=object)

With position missing for only 19 players, we can simply remove these entries to get a DataFrame with no null values.

In [670]:
player_df = player_df[player_df['position'].notnull()]
player_df.head(10)

Unnamed: 0,player_name,season,club_name,minutes_played,goal_contributions,minutes_per_appearance,position,nationality_code,date_of_birth,height
0,Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60,LB,GBR,1989-12-15,170.0
1,Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86,LB,GBR,1989-12-15,170.0
2,Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85,LB,GBR,1989-12-15,170.0
3,Aaron Cresswell,2011,Ipswich Town,4111,7.0,89,LB,GBR,1989-12-15,170.0
4,Aaron Cresswell,2012,Ipswich Town,4440,9.0,90,LB,GBR,1989-12-15,170.0
5,Aaron Cresswell,2013,Ipswich Town,3835,16.0,89,LB,GBR,1989-12-15,170.0
6,Aaron Cresswell,2014,West Ham United,3810,6.0,90,LB,GBR,1989-12-15,170.0
7,Aaron Cresswell,2015,West Ham United,4305,6.0,91,LB,GBR,1989-12-15,170.0
8,Aaron Cresswell,2016,West Ham United,2344,2.0,80,LB,GBR,1989-12-15,170.0
9,Aaron Cresswell,2017,West Ham United,3317,8.0,85,LB,GBR,1989-12-15,170.0


Now, we refer back to our transfer data set and combine the columns we want to include.

In [671]:
new_transfers.head()

Unnamed: 0,player_name,season,market_value,fee,from_club_id,from_club_name,to_club_id,to_club_name
0,Jermaine Beckford,2017,500000.0,0.0,391,Preston NE,392,Bury
1,Jermaine Beckford,2015,750000.0,0.0,289,Bolton,391,Preston NE
2,Jermaine Beckford,2014,750000.0,0.0,391,Preston NE,289,Bolton
3,Jermaine Beckford,2014,1200000.0,0.0,289,Bolton,391,Preston NE
4,Jermaine Beckford,2013,1500000.0,,271,Leicester,289,Bolton


In [672]:
df = pd.merge(player_df, new_transfers, how='left', on=['player_name','season'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189527 entries, 0 to 189526
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   player_name             189527 non-null  object 
 1   season                  189527 non-null  int64  
 2   club_name               189527 non-null  object 
 3   minutes_played          189527 non-null  int64  
 4   goal_contributions      189527 non-null  float64
 5   minutes_per_appearance  189527 non-null  int64  
 6   position                189527 non-null  object 
 7   nationality_code        189527 non-null  object 
 8   date_of_birth           189527 non-null  object 
 9   height                  189527 non-null  float64
 10  market_value            104544 non-null  float64
 11  fee                     95084 non-null   float64
 12  from_club_id            121799 non-null  float64
 13  from_club_name          121799 non-null  object 
 14  to_club_id          
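
The merged frame has 189,527 rows, more than `player_df` contained. A left merge on `player_name` and `season` repeats a stats row once for each transfer the player made in that season. A minimal sketch of the effect on toy data, with one possible way to collapse transfers to a single row per player and season first; the choice of `max` as the aggregation is an assumption for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stats: one row per player and season (values are made up)
stats = pd.DataFrame({'player_name': ['A', 'A'], 'season': [2013, 2014], 'minutes_played': [900, 1800]})

# Toy transfers: two moves in 2014, one in 2013
moves = pd.DataFrame({
    'player_name': ['A', 'A', 'A'],
    'season': [2014, 2014, 2013],
    'market_value': [750000.0, 1200000.0, 1500000.0],
    'fee': [0.0, 0.0, np.nan],
})

# A plain left merge repeats the 2014 stats row once per 2014 transfer
merged_raw = pd.merge(stats, moves, how='left', on=['player_name', 'season'])

# One possible collapse (an assumption): keep the largest market value and fee per player and season
moves_one = moves.groupby(['player_name', 'season'], as_index=False)[['market_value', 'fee']].max()
merged_one = pd.merge(stats, moves_one, how='left', on=['player_name', 'season'], validate='one_to_one')

print(len(stats), len(merged_raw), len(merged_one))
```

Whether repeated rows are acceptable depends on the model: if each transfer is treated as its own observation, the plain merge may be what is wanted.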

In [680]:
df.drop(['from_club_id','from_club_name','to_club_id','to_club_name'],axis=1, inplace=True)

df.head()

Unnamed: 0,player_name,season,club_name,minutes_played,goal_contributions,minutes_per_appearance,position,nationality_code,date_of_birth,height,market_value,fee
0,Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60,LB,GBR,1989-12-15,170.0,,
1,Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86,LB,GBR,1989-12-15,170.0,,
2,Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85,LB,GBR,1989-12-15,170.0,,
3,Aaron Cresswell,2011,Ipswich Town,4111,7.0,89,LB,GBR,1989-12-15,170.0,50000.0,275000.0
4,Aaron Cresswell,2012,Ipswich Town,4440,9.0,90,LB,GBR,1989-12-15,170.0,,


The last piece of the dataframe we would like to include is the league that the player plays in. For this information, we need the clubs_in_leagues.csv file as well as the dict_leagues.csv file.

In [262]:
clubs_leagues = pd.read_csv('clubs_in_leagues.csv', delimiter=';')

clubs_leagues.head()

Unnamed: 0.1,Unnamed: 0,id,club_id,club_name,league_id,season,matches_played,matches_overall,wins,draws,loses,goals_scored,goals_cons,goals_diff,points,place,qualified_to,is_champion,is_cup_winner,is_promoted
0,0,1,1,Bayern Munich,4,1999,34,34,22,7,5,73,28,45,73,1,CL,1.0,1.0,0.0
1,1,2,2,Bay. Leverkusen,4,1999,34,34,21,10,3,74,36,38,73,2,CL,0.0,0.0,0.0
2,2,3,3,Hamburger SV,4,1999,34,34,16,11,7,63,39,24,59,3,CL Quals,0.0,0.0,0.0
3,3,4,4,1860 Munich,4,1999,34,34,14,11,9,55,48,7,53,4,CL Quals,0.0,0.0,0.0
4,4,5,5,1.FC K'lautern,4,1999,34,34,15,5,14,54,59,-5,50,5,EL Quals,0.0,0.0,0.0


In [674]:
clubs_leagues.columns

Index(['Unnamed: 0', 'id', 'club_id', 'club_name', 'league_id', 'season',
       'matches_played', 'matches_overall', 'wins', 'draws', 'loses',
       'goals_scored', 'goals_cons', 'goals_diff', 'points', 'place',
       'qualified_to', 'is_champion', 'is_cup_winner', 'is_promoted'],
      dtype='object')

In [676]:
clubs_leagues = clubs_leagues[['club_id', 'club_name', 'league_id']]
clubs_leagues.groupby('club_id').head()

Unnamed: 0,club_id,club_name,league_id
0,1,Bayern Munich,4
1,2,Bay. Leverkusen,4
2,3,Hamburger SV,4
3,4,1860 Munich,4
4,5,1.FC K'lautern,4
...,...,...,...
3556,389,KV Oostende,9
3559,390,Mouscron,9
3560,386,KAS Eupen,9
3571,390,Mouscron,9


In [677]:
leagues = pd.read_csv('dict_leagues.csv', delimiter=';')

leagues.head()

Unnamed: 0,id,name,country,country_id,num,evaluation,group
0,1,Premier League,England,GBR,1,8270000000.0,1
1,2,LaLiga,Spain,ESP,1,5530000000.0,1
2,3,Serie A,Italy,ITA,1,4700000000.0,2
3,4,Bundesliga,Germany,DEU,1,4290000000.0,2
4,5,Ligue 1,France,FRA,1,3330000000.0,2


In [678]:
leagues = leagues[['id','name']]

leagues.head()

Unnamed: 0,id,name
0,1,Premier League
1,2,LaLiga
2,3,Serie A
3,4,Bundesliga
4,5,Ligue 1


In [682]:
clubs = pd.merge(clubs_leagues, leagues, how='left',left_on='league_id',right_on='id')
clubs.rename(columns={'name':'league'},inplace=True)
clubs.head()

Unnamed: 0,club_id,club_name,league_id,id,league
0,1,Bayern Munich,4,4,Bundesliga
1,2,Bay. Leverkusen,4,4,Bundesliga
2,3,Hamburger SV,4,4,Bundesliga
3,4,1860 Munich,4,4,Bundesliga
4,5,1.FC K'lautern,4,4,Bundesliga


In [683]:
df_final = pd.merge(df,clubs[['club_name','league']], how='left', on='club_name')

df_final.head()

Unnamed: 0,player_name,season,club_name,minutes_played,goal_contributions,minutes_per_appearance,position,nationality_code,date_of_birth,height,market_value,fee,league
0,Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60,LB,GBR,1989-12-15,170.0,,,
1,Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86,LB,GBR,1989-12-15,170.0,,,
2,Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85,LB,GBR,1989-12-15,170.0,,,
3,Aaron Cresswell,2011,Ipswich Town,4111,7.0,89,LB,GBR,1989-12-15,170.0,50000.0,275000.0,
4,Aaron Cresswell,2012,Ipswich Town,4440,9.0,90,LB,GBR,1989-12-15,170.0,,,


In [685]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 776902 entries, 0 to 776901
Data columns (total 13 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   player_name             776902 non-null  object 
 1   season                  776902 non-null  int64  
 2   club_name               776902 non-null  object 
 3   minutes_played          776902 non-null  int64  
 4   goal_contributions      776902 non-null  float64
 5   minutes_per_appearance  776902 non-null  int64  
 6   position                776902 non-null  object 
 7   nationality_code        776902 non-null  object 
 8   date_of_birth           776902 non-null  object 
 9   height                  776902 non-null  float64
 10  market_value            394794 non-null  float64
 11  fee                     367102 non-null  float64
 12  league                  636155 non-null  object 
dtypes: float64(4), int64(3), object(6)
memory usage: 83.0+ MB
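
`df_final` has 776,902 rows against 189,527 in `df`. One likely cause (an inference from the row counts, not verified against the data) is that `clubs` still holds one row per club per season, so merging on `club_name` repeats each player row once for every season the club appears. A minimal sketch on toy data of de-duplicating the lookup before merging:

```python
import pandas as pd

# Toy player rows (values are made up)
players = pd.DataFrame({'player_name': ['A', 'B'], 'club_name': ['Bayern Munich', 'Hamburger SV']})

# Toy lookup with one row per club and season, as in clubs_in_leagues.csv
clubs = pd.DataFrame({
    'club_name': ['Bayern Munich', 'Bayern Munich', 'Hamburger SV'],
    'season': [1999, 2000, 1999],
    'league': ['Bundesliga', 'Bundesliga', 'Bundesliga'],
})

# Merging on club_name alone repeats player A once per Bayern season
merged_raw = pd.merge(players, clubs[['club_name', 'league']], how='left', on='club_name')

# Keep one row per club; this assumes each club maps to a single league,
# which would not hold for clubs that were promoted or relegated
club_lookup = clubs[['club_name', 'league']].drop_duplicates(subset='club_name')
merged_one = pd.merge(players, club_lookup, how='left', on='club_name', validate='many_to_one')

print(len(merged_raw), len(merged_one))
```

Keeping `season` in the lookup and merging on both `club_name` and `season` would be an alternative that respects promotion and relegation.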


In [687]:
df_final['league'].fillna('Other League',inplace=True)

df_final.head()

Unnamed: 0,player_name,season,club_name,minutes_played,goal_contributions,minutes_per_appearance,position,nationality_code,date_of_birth,height,market_value,fee,league
0,Aaron Cresswell,2008,Tranmere Rovers,852,1.0,60,LB,GBR,1989-12-15,170.0,,,Other League
1,Aaron Cresswell,2009,Tranmere Rovers,1386,1.0,86,LB,GBR,1989-12-15,170.0,,,Other League
2,Aaron Cresswell,2010,Tranmere Rovers,4020,11.0,85,LB,GBR,1989-12-15,170.0,,,Other League
3,Aaron Cresswell,2011,Ipswich Town,4111,7.0,89,LB,GBR,1989-12-15,170.0,50000.0,275000.0,Other League
4,Aaron Cresswell,2012,Ipswich Town,4440,9.0,90,LB,GBR,1989-12-15,170.0,,,Other League


And there we have it! A data set that contains all the relevant data, with null values only in the market value and fee columns. We will now save this DataFrame to a CSV file and explore creating models in the next Jupyter notebook.

In [688]:
df_final.to_csv('football_data.csv',index=False)
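
When the file is loaded again, `date_of_birth` comes back as plain strings unless it is parsed on read. A minimal sketch of the round trip, using an in-memory buffer and a made-up row instead of the real file:

```python
import io
import pandas as pd

# Toy row with the same kind of columns as the saved file (values are made up)
toy = pd.DataFrame({'player_name': ['A'], 'season': [2011], 'date_of_birth': ['1989-12-15'], 'fee': [275000.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the date column into datetime64 on load
loaded = pd.read_csv(buffer, parse_dates=['date_of_birth'])
print(loaded.dtypes)
```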