In [1]:
# Importing the libraries
import pandas as pd

## Obtaining the data

This phase collects player performance data from the most recently completed season of the Premier League (England), La Liga (Spain), Ligue 1 (France), Bundesliga (Germany), Serie A TIM (Italy), Eredivisie (Netherlands), Primeira Liga (Portugal) and Campeonato Brasileiro Série A (Brazil).

Other championships with available data, such as Liga MX (Mexico) and MLS (USA), will not be considered because their competition format differs from the others, which all use a round-robin points system.

In addition, the data refer to the last completed season: 2021/2022 for the European championships and 2022 for the Brazilian Série A. Data from the current seasons will not be taken into account, because the European season is still in progress and the Brazilian season has only just begun, so some players may still be listed at their old clubs.

All data has been taken from the website [fbref](https://fbref.com).

The following data regarding players in each league will be collected:
- Goalkeeping
- Shooting
- Passing
- Defensive Actions

#### Big 5 European Leagues
In this section, we will load the data for the five biggest leagues in Europe: Premier League, La Liga, Bundesliga, Ligue 1 and Serie A TIM.

In [2]:
# Reading all data
goalkeepers_bg5_df = pd.read_csv('../Data/goalkeepers_bg5.csv')
shoots_bg5_df = pd.read_csv('../Data/shoots_bg5.csv')
passes_bg5_df = pd.read_csv('../Data/passes_bg5.csv')
defense_bg5_df = pd.read_csv('../Data/defense_actions_bg5.csv')

#### Eredivisie
In this section, we will collect the Eredivisie league data.

Now, let's check the datasets.

In [3]:
# Checking the first rows
goalkeepers_bg5_df.head()

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,L,CS,CS%,PKatt,PKA,PKsv,PKm,Save%.1,Matches,-9999
0,1,Julen Agirrezabala,es ESP,GK,Athletic Club,es La Liga,20,2000,4,4,...,1,1,25.0,0,0,0,0,,Matches,a2c1a8d3
1,2,Doğan Alemdar,tr TUR,GK,Rennes,fr Ligue 1,18,2002,12,12,...,5,4,33.3,2,1,1,0,50.0,Matches,9e17ccff
2,3,Alisson,br BRA,GK,Liverpool,eng Premier League,28,1992,36,36,...,2,20,55.6,0,0,0,0,,Matches,7a2e46a8
3,4,Alphonse Areola,fr FRA,GK,West Ham,eng Premier League,28,1993,1,1,...,1,0,0.0,0,0,0,0,,Matches,2f965a72
4,5,Kepa Arrizabalaga,es ESP,GK,Chelsea,eng Premier League,26,1994,4,4,...,1,2,50.0,0,0,0,0,,Matches,28d596a0


In [4]:
# Checking the first rows
shoots_bg5_df.head()

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,90s,Gls,...,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Matches,-9999
0,1,Max Aarons,eng ENG,DF,Norwich City,eng Premier League,21,2000,32.0,0,...,0.0,0,0,0.9,0.9,0.07,-0.9,-0.9,Matches,774cf58b
1,2,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,33,1987,33.1,2,...,0.0,0,0,1.5,1.5,0.08,0.5,0.5,Matches,32c2d95f
2,3,Salis Abdul Samed,gh GHA,MF,Clermont Foot,fr Ligue 1,21,2000,27.4,1,...,0.0,0,0,1.1,1.1,0.06,-0.1,-0.1,Matches,82464ce3
3,4,Laurent Abergel,fr FRA,MF,Lorient,fr Ligue 1,28,1993,32.8,0,...,0.0,0,0,2.1,2.1,0.07,-2.1,-2.1,Matches,31626657
4,5,Charles Abi,fr FRA,FW,Saint-Étienne,fr Ligue 1,21,2000,0.5,0,...,0.0,0,0,0.0,0.0,,0.0,0.0,Matches,469d3d84


In [5]:
# Checking the first rows
passes_bg5_df.head()

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,90s,Cmp,...,xAG,xA,A-xAG,KP,1/3,PPA,CrsPA,Prog,Matches,-9999
0,1,Max Aarons,eng ENG,DF,Norwich City,eng Premier League,21,2000,32.0,1107.0,...,1.6,1.7,0.4,20.0,50.0,37.0,9.0,86.0,Matches,774cf58b
1,2,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,33,1987,33.1,1284.0,...,0.9,0.6,-0.9,9.0,95.0,7.0,0.0,96.0,Matches,32c2d95f
2,3,Salis Abdul Samed,gh GHA,MF,Clermont Foot,fr Ligue 1,21,2000,27.4,1535.0,...,1.0,0.8,-1.0,17.0,87.0,13.0,1.0,80.0,Matches,82464ce3
3,4,Laurent Abergel,fr FRA,MF,Lorient,fr Ligue 1,28,1993,32.8,1341.0,...,4.4,2.6,-2.4,35.0,147.0,23.0,9.0,126.0,Matches,31626657
4,5,Charles Abi,fr FRA,FW,Saint-Étienne,fr Ligue 1,21,2000,0.5,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Matches,469d3d84


In [6]:
# Checking the first rows
defense_bg5_df.head()

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,90s,Tkl,...,Past,Blocks,Sh,Pass,Int,Tkl+Int,Clr,Err,Matches,-9999
0,1,Max Aarons,eng ENG,DF,Norwich City,eng Premier League,21,2000,32.0,64.0,...,18.0,39.0,19.0,20.0,28,92.0,96.0,1.0,Matches,774cf58b
1,2,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,33,1987,33.1,48.0,...,16.0,50.0,25.0,25.0,68,116.0,104.0,0.0,Matches,32c2d95f
2,3,Salis Abdul Samed,gh GHA,MF,Clermont Foot,fr Ligue 1,21,2000,27.4,43.0,...,37.0,18.0,1.0,17.0,42,85.0,15.0,0.0,Matches,82464ce3
3,4,Laurent Abergel,fr FRA,MF,Lorient,fr Ligue 1,28,1993,32.8,110.0,...,95.0,58.0,1.0,57.0,55,165.0,13.0,0.0,Matches,31626657
4,5,Charles Abi,fr FRA,FW,Saint-Étienne,fr Ligue 1,21,2000,0.5,0.0,...,1.0,2.0,0.0,2.0,0,0.0,0.0,0.0,Matches,469d3d84


Looking at the datasets, we can see that:

- There are missing values. These probably belong to players who were injured during the season or who simply recorded nothing for that statistic, so missing values will be imputed with 0.
- All datasets have a Matches column, which can be discarded.
- The last column appears to be an identifier for each player.
- Country data for each league can be taken from the Comp column.
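The observations above can be checked with a short sketch. It assumes a small hand-built sample (three goalkeeper rows copied from the output above, with only three of the columns) rather than the full CSV files, and the `Country` and `League` column names are hypothetical choices for illustration. It counts missing values per column and splits Comp at the first space, assuming every value follows the "code league name" pattern seen in the sample.

```python
import pandas as pd

# Hypothetical miniature of the goalkeeper table, built from rows shown above
sample = pd.DataFrame({
    "Player": ["Julen Agirrezabala", "Doğan Alemdar", "Alisson"],
    "Comp": ["es La Liga", "fr Ligue 1", "eng Premier League"],
    "Save%.1": [None, 50.0, None],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()

# Split Comp at the first space: country code on the left, league name on the right
comp_parts = sample["Comp"].str.split(" ", n=1, expand=True)
sample["Country"] = comp_parts[0]
sample["League"] = comp_parts[1]

print(missing_per_column)
print(sample[["Player", "Country", "League"]])
```

Splitting only at the first space (`n=1`) keeps multi-word league names such as "Premier League" intact.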

In [7]:
# Creating a function to fill missing values with 0
def input_zero(dataset):
    # fillna returns a new DataFrame; the original is left unchanged
    new_dataset = dataset.fillna(0)

    return new_dataset

In [8]:
# Creating a function to drop the unneeded Matches column from a dataset
def clean_columns(dataset):
    # drop returns a new DataFrame without the Matches column
    new_dataset = dataset.drop(columns="Matches")

    return new_dataset

In [9]:
# Filling NaN values with 0
goalkeepers_bg5_df = input_zero(goalkeepers_bg5_df)
passes_bg5_df = input_zero(passes_bg5_df)
shoots_bg5_df = input_zero(shoots_bg5_df)
defense_bg5_df = input_zero(defense_bg5_df)

In [10]:
# Cleaning columns
goalkeepers_bg5_df = clean_columns(goalkeepers_bg5_df)
passes_bg5_df = clean_columns(passes_bg5_df)
shoots_bg5_df = clean_columns(shoots_bg5_df)
defense_bg5_df = clean_columns(defense_bg5_df)
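As an alternative to repeating each call per DataFrame, the two cleaning steps could be chained over a dictionary of datasets. This is only a sketch: the `frames` dictionary and its tiny stand-in DataFrames are hypothetical, and the two functions are restated here so the example runs on its own.

```python
import pandas as pd

def input_zero(dataset):
    # Same behavior as the notebook function: replace NaN with 0
    return dataset.fillna(0)

def clean_columns(dataset):
    # Same behavior as the notebook function: drop the Matches column
    return dataset.drop(columns="Matches")

# Hypothetical stand-ins for the Big 5 DataFrames (one row each)
frames = {
    "goalkeepers": pd.DataFrame(
        {"Player": ["Alisson"], "Save%.1": [float("nan")], "Matches": ["Matches"]}
    ),
    "shoots": pd.DataFrame(
        {"Player": ["Charles Abi"], "npxG/Sh": [float("nan")], "Matches": ["Matches"]}
    ),
}

# Apply both steps to every dataset in one pass
cleaned = {
    name: df.pipe(input_zero).pipe(clean_columns)
    for name, df in frames.items()
}

print(cleaned["goalkeepers"])
```

Keeping the datasets in a dictionary means a new league's tables can go through the same steps without adding more assignment lines.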