In [1]:
# Importing the libraries
import pandas as pd

## Obtaining the data

In this phase we collect player performance data from the most recently 
completed season of the Premier League (England), La Liga (Spain), 
Ligue 1 (France), Bundesliga (Germany), Serie A TIM (Italy), Eredivisie 
(Netherlands), Primeira Liga (Portugal) and Campeonato Brasileiro Série A 
(Brazil).

Other championships with available data, such as Liga MX (Mexico) and MLS 
(USA), are not considered because their competition format differs from the 
others, which all follow a round-robin points system.

In addition, the data refer to the last completed season: 2021/2022 for the 
European championships and 2022 for the Brazilian Série A. Data from the 
current seasons are not used, because the European season is still in 
progress and the Brazilian one has only just begun, so some players may 
still be listed at their old clubs.

All data has been taken from the website [fbref](https://fbref.com/en/).

The following data regarding players in each league will be collected:

- Goalkeeping
- Shooting
- Passing
- Defensive Actions

#### Big 5 European Leagues
In this section, we will load the data for the 5 biggest leagues in Europe: 
Premier League, La Liga, Bundesliga, Ligue 1 and Serie A TIM.

In [2]:
# Reading all data
goalkeepers_bg5_df = pd.read_csv('../data/raw/goalkeepers_bg5.csv')
shoots_bg5_df = pd.read_csv('../data/raw/shoots_bg5.csv')
passes_bg5_df = pd.read_csv('../data/raw/passes_bg5.csv')
defense_bg5_df = pd.read_csv('../data/raw/defense_actions_bg5.csv')

#### Eredivisie
In this section, we will load the Eredivisie league data.

In [3]:
# Reading all data
goalkeepers_Eredivisie_df = pd.read_csv('../data/raw/goalkeepers_Eredivisie.csv')
shoots_Eredivisie_df = pd.read_csv('../data/raw/shoots_Eredivisie.csv')
passes_Eredivisie_df = pd.read_csv('../data/raw/passes_Eredivisie.csv')
defense_Eredivisie_df = pd.read_csv('../data/raw/defense_actions_Eredivisie.csv')

#### Primeira Liga
Now, let's load the Portuguese league data.

In [4]:
# Reading all data
goalkeepers_PrimeiraLiga_df = pd.read_csv('../data/raw/goalkeepers_PrimeiraLiga.csv')
shoots_PrimeiraLiga_df = pd.read_csv('../data/raw/shoots_PrimeiraLiga.csv')
passes_PrimeiraLiga_df = pd.read_csv('../data/raw/passes_PrimeiraLiga.csv')
defense_PrimeiraLiga_df = pd.read_csv('../data/raw/defense_actions_PrimeiraLiga.csv')

#### Brasileiro Serie A
And finally, we want to load the data of the Brazilian league.

In [5]:
goalkeepers_BrasileiroSerieA_df = pd.read_csv('../data/raw/goalkeepers_BrasileiroSerieA.csv')
shoots_BrasileiroSerieA_df = pd.read_csv('../data/raw/shoots_BrasileiroSerieA.csv')
passes_BrasileiroSerieA_df = pd.read_csv('../data/raw/passes_BrasileiroSerieA.csv')
defense_BrasileiroSerieA_df = pd.read_csv('../data/raw/defense_actions_BrasileiroSerieA.csv')
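
The four loading cells above repeat the same pattern, so the reads could also be driven by a loop. The following is only a minimal sketch, assuming every file follows the `<stat>_<league>.csv` naming seen above; `load_raw`, `LEAGUES` and `STATS` are hypothetical names that do not appear in the original notebook.

```python
from pathlib import Path

import pandas as pd

# League suffixes and stat prefixes as they appear in the file names above.
LEAGUES = ["bg5", "Eredivisie", "PrimeiraLiga", "BrasileiroSerieA"]
STATS = ["goalkeepers", "shoots", "passes", "defense_actions"]


def load_raw(data_dir, leagues=LEAGUES, stats=STATS):
    """Return {(stat, league): DataFrame} for every <stat>_<league>.csv found.

    Hypothetical helper: files that do not exist are skipped, not raised on.
    """
    data_dir = Path(data_dir)
    frames = {}
    for league in leagues:
        for stat in stats:
            path = data_dir / f"{stat}_{league}.csv"
            if path.exists():
                frames[(stat, league)] = pd.read_csv(path)
    return frames
```

With this helper, `frames = load_raw('../data/raw')` would give access to the Big 5 passing data as `frames[('passes', 'bg5')]`. Keeping the frames in a dictionary trades the explicit variable names used in this notebook for less repetition.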


## Understanding the data
At this stage, I will take a quick look at the data. The steps are:

- Inspect the first rows

- Observe the shape of the data

- Look for differences between datasets

- Check column types
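
The shape and column-type steps can be gathered into one small table per group of datasets. This is a rough sketch rather than the notebook's own method: `summarize` is a hypothetical helper, and the toy frames only stand in for the real datasets.

```python
import pandas as pd


def summarize(datasets):
    """Build one row per dataset: row count, column count, and dtype counts."""
    rows = []
    for name, df in datasets.items():
        # Count how many columns share each dtype, e.g. {'float64': 1, ...}.
        dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
        rows.append(
            {"dataset": name, "rows": df.shape[0], "columns": df.shape[1], **dtype_counts}
        )
    # A dtype absent from a dataset shows up as 0 instead of NaN.
    return pd.DataFrame(rows).set_index("dataset").fillna(0)


# Toy frames standing in for the real datasets (values are made up).
toy = {
    "a": pd.DataFrame({"Player": ["x", "y"], "Cmp": [1.0, 2.0], "Comp": ["l1", "l2"]}),
    "b": pd.DataFrame({"Player": ["z"], "Cmp": [3.0]}),
}
summary = summarize(toy)
print(summary)
```

In the notebook, the same call could take a dictionary such as `{'bg5': passes_bg5_df, 'Eredivisie': passes_Eredivisie_df}`.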


#### Passes
In this section, we will try to understand the datasets that contain each 
player's passing data.




First, let's take a look at the data.

In [15]:
# Checking the first rows
passes_bg5_df.head()

| Unnamed: 0 | Rk | Player | Nation | Pos | Squad | Comp | Age | Born | 90s | Cmp | ... | xAG | xA | A-xAG | KP | 1/3 | PPA | CrsPA | Prog | Matches | -9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Max Aarons | eng ENG | DF | Norwich City | eng Premier League | 21 | 2000 | 32.0 | 1107.0 | ... | 1.6 | 1.7 | 0.4 | 20.0 | 50.0 | 37.0 | 9.0 | 86.0 | Matches | 774cf58b |
| 1 | 2 | Yunis Abdelhamid | ma MAR | DF | Reims | fr Ligue 1 | 33 | 1987 | 33.1 | 1284.0 | ... | 0.9 | 0.6 | -0.9 | 9.0 | 95.0 | 7.0 | 0.0 | 96.0 | Matches | 32c2d95f |
| 2 | 3 | Salis Abdul Samed | gh GHA | MF | Clermont Foot | fr Ligue 1 | 21 | 2000 | 27.4 | 1535.0 | ... | 1.0 | 0.8 | -1.0 | 17.0 | 87.0 | 13.0 | 1.0 | 80.0 | Matches | 82464ce3 |
| 3 | 4 | Laurent Abergel | fr FRA | MF | Lorient | fr Ligue 1 | 28 | 1993 | 32.8 | 1341.0 | ... | 4.4 | 2.6 | -2.4 | 35.0 | 147.0 | 23.0 | 9.0 | 126.0 | Matches | 31626657 |
| 4 | 5 | Charles Abi | fr FRA | FW | Saint-Étienne | fr Ligue 1 | 21 | 2000 | 0.5 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Matches | 469d3d84 |


In [16]:
# Checking the first rows
passes_Eredivisie_df.head()

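| Unnamed: 0 | Rk | Player | Nation | Pos | Squad | Age | Born | 90s | Cmp | Att | ... | xAG | xA | A-xAG | KP | 1/3 | PPA | CrsPA | Prog | Matches | -9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Trustin van 't Loo | nl NED | MF | Heerenveen | 17 | 2004 | 0.3 | 2.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Matches | b9887ae0 |
| 1 | 2 | Dirk Abels | nl NED | DFMF | Sparta R'dam | 24 | 1997 | 32.2 | 1049.0 | 1460.0 | ... | 0.9 | 1.0 | -0.9 | 12.0 | 90.0 | 7.0 | 3.0 | 81.0 | Matches | 3c3bd200 |
| 2 | 3 | Zakaria Aboukhlal | ma MAR | FWMF | AZ Alkmaar | 21 | 2000 | 9.1 | 170.0 | 249.0 | ... | 1.0 | 0.8 | 0.0 | 7.0 | 6.0 | 14.0 | 2.0 | 22.0 | Matches | c2a6033c |
| 3 | 4 | Paulos Abraham | se SWE | FWMF | Groningen | 19 | 2002 | 15.0 | 210.0 | 310.0 | ... | 1.3 | 1.8 | 0.7 | 13.0 | 25.0 | 12.0 | 4.0 | 29.0 | Matches | fd99de9d |
| 4 | 5 | Shawn Adewoye | be BEL | DF | RKC Waalwijk | 21 | 2000 | 19.1 | 579.0 | 748.0 | ... | 0.3 | 0.4 | -0.3 | 6.0 | 35.0 | 1.0 | 0.0 | 32.0 | Matches | be98fc34 |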


In [17]:
# Checking the first rows
passes_PrimeiraLiga_df.head()

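| Unnamed: 0 | Rk | Player | Nation | Pos | Squad | Age | Born | 90s | Cmp | Att | ... | xAG | xA | A-xAG | KP | 1/3 | PPA | CrsPA | Prog | Matches | -9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rodrigo Abascal | uy URU | DF | Boavista | 27 | 1994 | 26.5 | 1080.0 | 1425.0 | ... | 0.6 | 0.6 | 1.4 | 9.0 | 80.0 | 5.0 | 2.0 | 57.0 | Matches | 088ae0d1 |
| 1 | 2 | Giorgi Aburjania | ge GEO | MF | Gil Vicente FC | 26 | 1995 | 8.6 | 339.0 | 416.0 | ... | 1.4 | 1.2 | -0.4 | 11.0 | 48.0 | 6.0 | 0.0 | 40.0 | Matches | 0921a46c |
| 2 | 3 | Antonio Adán | es ESP | GK | Sporting CP | 34 | 1987 | 33.0 | 925.0 | 1099.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 1.0 | Matches | 65d62814 |
| 3 | 4 | João Afonso Crispim | br BRA | MF | Gil Vicente FC | 26 | 1995 | 0.2 | 21.0 | 22.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 2.0 | Matches | e40bdb5d |
| 4 | 5 | João Afonso | pt POR | DF | Santa Clara | 31 | 1990 | 12.4 | 387.0 | 483.0 | ... | 0.6 | 0.3 | -0.6 | 4.0 | 22.0 | 0.0 | 0.0 | 17.0 | Matches | a8f07c8f |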


In [18]:
# Checking the first rows
passes_BrasileiroSerieA_df.head()

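| Unnamed: 0 | Rk | Player | Nation | Pos | Squad | Age | Born | 90s | Cmp | Att | ... | xAG | xA | A-xAG | KP | 1/3 | PPA | CrsPA | Prog | Matches | -9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Abner | br BRA | DF | Atl Paranaense | 21 | 2000 | 23.4 | 986.0 | 1283.0 | ... | 2.2 | 1.7 | -0.2 | 31.0 | 105.0 | 16.0 | 6.0 | 84.0 | Matches | 7f9c5d2d |
| 1 | 2 | Adryelson | br BRA | DF | Botafogo (RJ) | 23 | 1998 | 16.1 | 540.0 | 662.0 | ... | 0.3 | 0.1 | -0.3 | 3.0 | 15.0 | 0.0 | 0.0 | 12.0 | Matches | e980e78d |
| 2 | 3 | Adson | br BRA | FWMF | Corinthians | 21 | 2000 | 13.8 | 522.0 | 633.0 | ... | 0.8 | 1.5 | -0.8 | 14.0 | 34.0 | 29.0 | 2.0 | 45.0 | Matches | eda38706 |
| 3 | 4 | Airton | br BRA | FW | Atl Goianiense | 22 | 1999 | 19.3 | 277.0 | 444.0 | ... | 2.8 | 3.0 | 1.2 | 30.0 | 20.0 | 17.0 | 9.0 | 19.0 | Matches | 751ef075 |
| 4 | 5 | Carlos Alberto | br BRA | FWMF | América (MG) | 19 | 2002 | 1.7 | 13.0 | 24.0 | ... | 0.1 | 0.0 | -0.1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | Matches | 08f48d96 |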


Now, let's check the shape of each dataset.

In [7]:
# Seeing the shape
print(passes_bg5_df.shape)
print(passes_Eredivisie_df.shape)
print(passes_PrimeiraLiga_df.shape)
print(passes_BrasileiroSerieA_df.shape)

(2921, 34)
(533, 33)
(581, 33)
(763, 33)


Note that the dataset for the top 5 European leagues has one more column 
than the others. Let's find out which one:

In [19]:
# Fetching the extra column name
[col for col in passes_bg5_df.columns if col not in passes_Eredivisie_df.columns]

['Comp']
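
The comprehension above only looks for columns that the Big 5 dataset has and the Eredivisie dataset lacks. A small sketch of a two-way check using pandas' `Index.difference` follows; `column_diff` is a hypothetical helper, and the empty frames only mimic part of the column layout seen above.

```python
import pandas as pd


def column_diff(left, right):
    """Return (columns only in left, columns only in right) as two lists."""
    # sort=False keeps the columns in their original order.
    only_left = left.columns.difference(right.columns, sort=False).tolist()
    only_right = right.columns.difference(left.columns, sort=False).tolist()
    return only_left, only_right


# Empty frames that carry a subset of the column names shown above.
big5 = pd.DataFrame(columns=["Rk", "Player", "Squad", "Comp", "Age"])
single = pd.DataFrame(columns=["Rk", "Player", "Squad", "Age"])

only_big5, only_single = column_diff(big5, single)
print(only_big5, only_single)
```

An empty second list confirms that the single-league dataset has no columns missing from the Big 5 dataset, which the one-way check cannot show.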

Let's check the values in that column.

In [10]:
# Checking the column values
passes_bg5_df.Comp

0       eng Premier League
1               fr Ligue 1
2               fr Ligue 1
3               fr Ligue 1
4               fr Ligue 1
               ...        
2916            es La Liga
2917            it Serie A
2918    eng Premier League
2919            it Serie A
2920            it Serie A
Name: Comp, Length: 2921, dtype: object

The extra column holds the league each player plays in. This makes sense: the ``passes_bg5_df`` dataset contains players from 5 different leagues, whereas each of the other datasets covers players from a single league.

This behavior is also expected in the other datasets referring to the 5 big leagues in Europe.
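
If the single-league datasets are later combined with the Big 5 data, one option is to give them a `Comp` column of their own first. The sketch below only illustrates that idea and is not a step taken in this notebook: the frames are toy stand-ins, and the label `'nl Eredivisie'` is an assumed value chosen to resemble the Big 5 labels.

```python
import pandas as pd

# Toy stand-ins for passes_bg5_df and passes_Eredivisie_df (values are made up).
big5 = pd.DataFrame(
    {"Player": ["A", "B"], "Comp": ["fr Ligue 1", "it Serie A"], "Cmp": [10.0, 20.0]}
)
eredivisie = pd.DataFrame({"Player": ["C"], "Cmp": [5.0]})

# Assumed label: the notebook does not say what value this column should take.
eredivisie = eredivisie.assign(Comp="nl Eredivisie")

# Reorder the columns to the Big 5 layout, then stack the rows.
combined = pd.concat([big5, eredivisie[big5.columns]], ignore_index=True)
print(combined)
```

After this step, every row carries its league, so a stacked dataset could still be filtered or grouped by competition.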