<a id='top'></a>

# Webscraping of FBref Data
##### Notebook to scrape raw data from [StatsBomb](https://statsbomb.com/) via [FBref](https://fbref.com/en/) using [Beautiful Soup](https://pypi.org/project/beautifulsoup4/)

### By [Edd Webster](https://www.twitter.com/eddwebster)
Notebook first written: 31/08/2020<br>
Notebook last updated: 05/07/2021

![title](../../img/fbref-logo-banner.png)

![title](../../img/stats-bomb-logo.png)

Click [here](#section3) to jump straight to the Data Sources section and skip the [Notebook Dependencies](#section1) and [Project Brief](#section2) sections. Or click [here](#section4) to jump straight to the Summary.

___

<a id='sectionintro'></a>

## <a id='import_libraries'>Introduction</a>
This notebook scrapes player statistics data from [StatsBomb](https://statsbomb.com/) via [FBref.com](https://fbref.com/en/), using [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) for web scraping and [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.

For more information about this notebook and its author, I can be reached through the following channels:
*    [eddwebster.com](https://www.eddwebster.com/);
*    edd.j.webster@gmail.com;
*    [@eddwebster](https://www.twitter.com/eddwebster);
*    [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
*    [github/eddwebster](https://github.com/eddwebster/);
*    [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);
*    [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and
*    [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).

![title](../../img/fifa21eddwebsterbanner.png)

The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/FBref%20Web%20Scraping%20and%20Parsing.ipynb).

___

<a id='sectioncontents'></a>

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Project Brief](#section2)<br>
3.    [Data Sources](#section3)<br>
      1.    [Introduction](#section3.1)<br>
      2.    [Teams](#section3.2)<br>
            1.    [Data Dictionary](#section3.2.1)<br>
            2.    [Creating the DataFrame](#section3.2.2)<br>
            3.    [Initial Data Handling](#section3.2.3)<br>
            4.    [Export the Raw DataFrame](#section3.2.4)<br>
      3.    [Outfield Players](#section3.3)<br>
            1.    [Data Dictionary](#section3.3.1)<br>
            2.    [Creating the DataFrame](#section3.3.2)<br>
            3.    [Initial Data Handling](#section3.3.3)<br>
            4.    [Export the Raw DataFrame](#section3.3.4)<br>
      4.    [Goalkeepers](#section3.4)<br>
            1.    [Data Dictionary](#section3.4.1)<br>
            2.    [Creating the DataFrame](#section3.4.2)<br>
            3.    [Initial Data Handling](#section3.4.3)<br>
            4.    [Export the Raw DataFrame](#section3.4.4)<br> 
4.    [Summary](#section4)<br>
5.    [Next Steps](#section5)<br>
6.    [Bibliography](#section6)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) for the notebook environment in which this project is presented;
*    [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
*    [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;
*    `tqdm` for a clean progress bar;
*    `requests` for executing HTTP requests;
*    [`Beautiful Soup`](https://pypi.org/project/beautifulsoup4/) for web scraping; and
*    [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.

All packages used for this notebook except for Beautiful Soup can be obtained by downloading and installing the [Anaconda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).

### Import Libraries and Modules

In [1]:
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd
import os
import re
import random
from io import BytesIO
from pathlib import Path

# Reading directories
import glob

# Working with JSON
import json
from pandas import json_normalize  # top-level import, available from pandas 1.0

# Web Scraping
import requests
from bs4 import BeautifulSoup

# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno

# Progress Bar
from tqdm import tqdm

# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

Setup Complete


In [2]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))

Python: 3.7.6
NumPy: 1.18.1
pandas: 1.0.1
matplotlib: 3.1.3
Seaborn: 0.10.0


### Defined Variables and Lists

#### Variables

In [3]:
# Datetime
import datetime
from datetime import date
import time

In [4]:
# Define today's date
# Formatted as DDMMYYYY so it can be used directly in file names
today = datetime.datetime.now().strftime('%d%m%Y')

#### Lists

In [5]:
# Standard(stats)
stats = ["player","nationality","position","squad","age","birth_year","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"]
stats3 = ["players_used","possession","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"] 

# Goalkeeping(keepers)
keepers = ["player","nationality","position","squad","age","birth_year","games_gk","games_starts_gk","minutes_gk","goals_against_gk","goals_against_per90_gk","shots_on_target_against","saves","save_pct","wins_gk","draws_gk","losses_gk","clean_sheets","clean_sheets_pct","pens_att_gk","pens_allowed","pens_saved","pens_missed_gk"]
keepers3 = ["players_used","games_gk","games_starts_gk","minutes_gk","goals_against_gk","goals_against_per90_gk","shots_on_target_against","saves","save_pct","wins_gk","draws_gk","losses_gk","clean_sheets","clean_sheets_pct","pens_att_gk","pens_allowed","pens_saved","pens_missed_gk"]

# Advanced goalkeeping(keepersadv)
keepersadv = ["player","nationality","position","squad","age","birth_year","minutes_90s_gk","goals_against_gk","pens_allowed","free_kick_goals_against_gk","corner_kick_goals_against_gk","own_goals_against_gk","psxg_gk","psnpxg_per_shot_on_target_against","psxg_net_gk","psxg_net_per90_gk","passes_completed_launched_gk","passes_launched_gk","passes_pct_launched_gk","passes_gk","passes_throws_gk","pct_passes_launched_gk","passes_length_avg_gk","goal_kicks","pct_goal_kicks_launched","goal_kick_length_avg","crosses_gk","crosses_stopped_gk","crosses_stopped_pct_gk","def_actions_outside_pen_area_gk","def_actions_outside_pen_area_per90_gk","avg_distance_def_actions_gk"]
keepersadv2 = ["minutes_90s_gk","goals_against_gk","pens_allowed","free_kick_goals_against_gk","corner_kick_goals_against_gk","own_goals_against_gk","psxg_gk","psnpxg_per_shot_on_target_against","psxg_net_gk","psxg_net_per90_gk","passes_completed_launched_gk","passes_launched_gk","passes_pct_launched_gk","passes_gk","passes_throws_gk","pct_passes_launched_gk","passes_length_avg_gk","goal_kicks","pct_goal_kicks_launched","goal_kick_length_avg","crosses_gk","crosses_stopped_gk","crosses_stopped_pct_gk","def_actions_outside_pen_area_gk","def_actions_outside_pen_area_per90_gk","avg_distance_def_actions_gk"]

# Shooting(shooting)
shooting = ["player","nationality","position","squad","age","birth_year","minutes_90s","goals","pens_made","pens_att","shots_total","shots_on_target","shots_free_kicks","shots_on_target_pct","shots_total_per90","shots_on_target_per90","goals_per_shot","goals_per_shot_on_target","xg","npxg","npxg_per_shot","xg_net","npxg_net"]
shooting2 = ["minutes_90s","goals","pens_made","pens_att","shots_total","shots_on_target","shots_free_kicks","shots_on_target_pct","shots_total_per90","shots_on_target_per90","goals_per_shot","goals_per_shot_on_target","xg","npxg","npxg_per_shot","xg_net","npxg_net"]
shooting3 = ["goals","pens_made","pens_att","shots_total","shots_on_target","shots_free_kicks","shots_on_target_pct","shots_total_per90","shots_on_target_per90","goals_per_shot","goals_per_shot_on_target","xg","npxg","npxg_per_shot","xg_net","npxg_net"]

# Passing(passing)
passing = ["player","nationality","position","squad","age","birth_year","minutes_90s","passes_completed","passes","passes_pct","passes_total_distance","passes_progressive_distance","passes_completed_short","passes_short","passes_pct_short","passes_completed_medium","passes_medium","passes_pct_medium","passes_completed_long","passes_long","passes_pct_long","assists","xa","xa_net","assisted_shots","passes_into_final_third","passes_into_penalty_area","crosses_into_penalty_area","progressive_passes"]
passing2 = ["passes_completed","passes","passes_pct","passes_total_distance","passes_progressive_distance","passes_completed_short","passes_short","passes_pct_short","passes_completed_medium","passes_medium","passes_pct_medium","passes_completed_long","passes_long","passes_pct_long","assists","xa","xa_net","assisted_shots","passes_into_final_third","passes_into_penalty_area","crosses_into_penalty_area","progressive_passes"]

# Passtypes(passing_types)
passing_types = ["player","nationality","position","squad","age","birth_year","minutes_90s","passes","passes_live","passes_dead","passes_free_kicks","through_balls","passes_pressure","passes_switches","crosses","corner_kicks","corner_kicks_in","corner_kicks_out","corner_kicks_straight","passes_ground","passes_low","passes_high","passes_left_foot","passes_right_foot","passes_head","throw_ins","passes_other_body","passes_completed","passes_offsides","passes_oob","passes_intercepted","passes_blocked"]
passing_types2 = ["passes","passes_live","passes_dead","passes_free_kicks","through_balls","passes_pressure","passes_switches","crosses","corner_kicks","corner_kicks_in","corner_kicks_out","corner_kicks_straight","passes_ground","passes_low","passes_high","passes_left_foot","passes_right_foot","passes_head","throw_ins","passes_other_body","passes_completed","passes_offsides","passes_oob","passes_intercepted","passes_blocked"]

# Goal and shot creation(gca)
gca = ["player","nationality","position","squad","age","birth_year","minutes_90s","sca","sca_per90","sca_passes_live","sca_passes_dead","sca_dribbles","sca_shots","sca_fouled","gca","gca_per90","gca_passes_live","gca_passes_dead","gca_dribbles","gca_shots","gca_fouled","gca_og_for"]
gca2 = ["sca","sca_per90","sca_passes_live","sca_passes_dead","sca_dribbles","sca_shots","sca_fouled","gca","gca_per90","gca_passes_live","gca_passes_dead","gca_dribbles","gca_shots","gca_fouled","gca_og_for"]

# Defensive actions(defense)
defense = ["player","nationality","position","squad","age","birth_year","minutes_90s","tackles","tackles_won","tackles_def_3rd","tackles_mid_3rd","tackles_att_3rd","dribble_tackles","dribbles_vs","dribble_tackles_pct","dribbled_past","pressures","pressure_regains","pressure_regain_pct","pressures_def_3rd","pressures_mid_3rd","pressures_att_3rd","blocks","blocked_shots","blocked_shots_saves","blocked_passes","interceptions","clearances","errors"]
defense2 = ["tackles","tackles_won","tackles_def_3rd","tackles_mid_3rd","tackles_att_3rd","dribble_tackles","dribbles_vs","dribble_tackles_pct","dribbled_past","pressures","pressure_regains","pressure_regain_pct","pressures_def_3rd","pressures_mid_3rd","pressures_att_3rd","blocks","blocked_shots","blocked_shots_saves","blocked_passes","interceptions","clearances","errors"]

# Possession(possession)
possession = ["player","nationality","position","squad","age","birth_year","minutes_90s","touches","touches_def_pen_area","touches_def_3rd","touches_mid_3rd","touches_att_3rd","touches_att_pen_area","touches_live_ball","dribbles_completed","dribbles","dribbles_completed_pct","players_dribbled_past","nutmegs","carries","carry_distance","carry_progressive_distance","pass_targets","passes_received","passes_received_pct","miscontrols","dispossessed"]
possession2 = ["touches","touches_def_pen_area","touches_def_3rd","touches_mid_3rd","touches_att_3rd","touches_att_pen_area","touches_live_ball","dribbles_completed","dribbles","dribbles_completed_pct","players_dribbled_past","nutmegs","carries","carry_distance","carry_progressive_distance","pass_targets","passes_received","passes_received_pct","miscontrols","dispossessed"]

# Playingtime(playingtime)
playingtime = ["player","nationality","position","squad","age","birth_year","minutes_90s","games","minutes","minutes_per_game","minutes_pct","games_starts","minutes_per_start","games_subs","minutes_per_sub","unused_subs","points_per_match","on_goals_for","on_goals_against","plus_minus","plus_minus_per90","plus_minus_wowy","on_xg_for","on_xg_against","xg_plus_minus","xg_plus_minus_per90","xg_plus_minus_wowy"]
playingtime2 = ["games","minutes","minutes_per_game","minutes_pct","games_starts","minutes_per_start","games_subs","minutes_per_sub","unused_subs","points_per_match","on_goals_for","on_goals_against","plus_minus","plus_minus_per90","plus_minus_wowy","on_xg_for","on_xg_against","xg_plus_minus","xg_plus_minus_per90","xg_plus_minus_wowy"]

# Miscellaneous(misc)
misc = ["player","nationality","position","squad","age","birth_year","minutes_90s","cards_yellow","cards_red","cards_yellow_red","fouls","fouled","offsides","crosses","interceptions","tackles_won","pens_won","pens_conceded","own_goals","ball_recoveries","aerials_won","aerials_lost","aerials_won_pct"]
misc2 = ["cards_yellow","cards_red","cards_yellow_red","fouls","fouled","offsides","crosses","interceptions","tackles_won","pens_won","pens_conceded","own_goals","ball_recoveries","aerials_won","aerials_lost","aerials_won_pct"]
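
The lists above follow a naming pattern: the unsuffixed list (e.g. `misc`) starts with the six descriptive player columns, while the `2` variant (e.g. `misc2`) keeps only the statistics so that the descriptive columns are not scraped again for every category. The sketch below shows how a `2` variant could be derived rather than typed out by hand. It uses a shortened, illustrative list (`misc_example` is hypothetical, not the full `misc` list), and note that not every pair follows the rule exactly: `shooting2`, for example, keeps `minutes_90s`.

```python
# Descriptive columns shared by every player list above
bio_columns = ["player", "nationality", "position", "squad", "age", "birth_year"]

# Shortened, illustrative stand-in for the full `misc` list
misc_example = bio_columns + ["minutes_90s", "cards_yellow", "cards_red", "fouls"]

# Columns to leave out of the statistics-only variant
columns_to_drop = set(bio_columns) | {"minutes_90s"}

# Keep the original column order while filtering
misc2_example = [col for col in misc_example if col not in columns_to_drop]

print(misc2_example)
```

Deriving the shorter list this way keeps the two lists in sync if FBref adds or renames a column, at the cost of hiding the exact column set from the reader.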

### Define Filepaths

In [6]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_fbref = os.path.join(base_dir, 'data', 'fbref')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
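
The scraping cells later in this notebook write CSV files into subfolders of `data_dir_fbref` such as `raw/outfield/2021/archive`. pandas' `to_csv` writes into an existing folder but does not create missing ones, so it can help to create the folders up front. The following is a minimal sketch of one way to do that. The helper name `ensure_dirs` is hypothetical, and the demonstration runs against a temporary folder rather than the real data directory.

```python
import os
import tempfile


def ensure_dirs(root, relative_dirs):
    """Create each directory under root if it does not already exist."""
    created = []
    for rel in relative_dirs:
        path = os.path.join(root, *rel.split('/'))
        # exist_ok=True makes the call safe to repeat on every run
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Demonstrated against a temporary folder; in this notebook the root
# would be data_dir_fbref instead.
with tempfile.TemporaryDirectory() as tmp_root:
    made = ensure_dirs(tmp_root, ['raw/outfield/2021/archive', 'raw/outfield/1920'])
    all_exist = all(os.path.isdir(p) for p in made)

print(all_exist)
```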

### Custom Functions

In [7]:
# The following code is from parth1902. His GitHub repository can be found here: https://github.com/parth1902/Scrape-FBref-data
# From this point, the code and comments are parth1902's

## Much of the scraping code is taken from this repository: https://github.com/chmartin/FBref_EPL
## I've made the necessary changes for the recently added data and for combining it

# Functions to get the data in a DataFrame using BeautifulSoup

def get_tables(url):
    res = requests.get(url)
    # Fail early on HTTP errors rather than parsing an error page
    res.raise_for_status()
    ## The next two lines get around the issue with comments breaking the parsing.
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
    all_tables = soup.find_all("tbody")
    # The first <tbody> holds the team table, the second the player table
    team_table = all_tables[0]
    player_table = all_tables[1]
    return player_table, team_table

def get_frame(features, player_table):
    # Columns that hold text and must not be converted to float
    text_features = {'player', 'nationality', 'position', 'squad', 'age', 'birth_year'}
    pre_df_player = dict()
    rows_player = player_table.find_all('tr')
    for row in rows_player:
        # Only data rows carry a <th scope="row"> cell; header rows are skipped
        if row.find('th', {"scope": "row"}) is not None:
            for f in features:
                cell = row.find("td", {"data-stat": f})
                text = cell.text.strip()
                # Empty cells are recorded as zero
                if text == '':
                    text = '0'
                if f not in text_features:
                    # Remove thousands separators before converting, e.g. '1,234'
                    text = float(text.replace(',', ''))
                pre_df_player.setdefault(f, []).append(text)
    df_player = pd.DataFrame.from_dict(pre_df_player)
    return df_player

def get_frame_team(features, team_table):
    # Columns that hold text and must not be converted to float
    text_features = {'player', 'nationality', 'position', 'squad', 'age', 'birth_year'}
    pre_df_squad = dict()
    # Note: features does not contain squad name, it requires special treatment
    rows_squad = team_table.find_all('tr')
    for row in rows_squad:
        # Only data rows carry a <th scope="row"> cell; header rows are skipped
        if row.find('th', {"scope": "row"}) is not None:
            # The squad name sits in the row's <th> cell rather than in a <td>
            name = row.find('th', {"data-stat": "squad"}).text.strip()
            pre_df_squad.setdefault('squad', []).append(name)
            for f in features:
                cell = row.find("td", {"data-stat": f})
                text = cell.text.strip()
                # Empty cells are recorded as zero
                if text == '':
                    text = '0'
                if f not in text_features:
                    # Remove thousands separators before converting, e.g. '1,234'
                    text = float(text.replace(',', ''))
                pre_df_squad.setdefault(f, []).append(text)
    df_squad = pd.DataFrame.from_dict(pre_df_squad)
    return df_squad

def frame_for_category(category,top,end,features):
    url = (top + category + end)
    player_table, team_table = get_tables(url)
    df_player = get_frame(features, player_table)
    return df_player

def frame_for_category_team(category,top,end,features):
    url = (top + category + end)
    player_table, team_table = get_tables(url)
    df_team = get_frame_team(features, team_table)
    return df_team
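
The comment-stripping step in `get_tables` is easier to follow on a small example. The sketch below applies the same regular expression to a made-up HTML fragment in which one table is wrapped in an HTML comment. To keep it runnable without third-party packages, it counts `<tbody>` tags with the standard library's `HTMLParser` as a stand-in for Beautiful Soup. The fragment and the `TbodyCounter` class are hypothetical and are not taken from FBref.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment: the second table is wrapped in an HTML comment,
# so a parser treats it as comment text rather than as markup.
SAMPLE_HTML = """
<table><tbody><tr><th scope="row">Arsenal</th></tr></tbody></table>
<!--
<table><tbody><tr><th scope="row">Player A</th></tr></tbody></table>
-->
"""


class TbodyCounter(HTMLParser):
    """Counts the <tbody> start tags that the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.count += 1


def count_tbodies(html):
    parser = TbodyCounter()
    parser.feed(html)
    return parser.count


# Same pattern as in get_tables: delete the comment delimiters, keep the content
comm = re.compile("<!--|-->")
stripped = comm.sub("", SAMPLE_HTML)

tbodies_before = count_tbodies(SAMPLE_HTML)
tbodies_after = count_tbodies(stripped)
print(tbodies_before, tbodies_after)
```

Because `get_tables` then picks tables by position (`all_tables[0]` and `all_tables[1]`), the scraper depends on the team table appearing before the player table in the page source.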

In [8]:
# Function to get the player data for outfield players, including all categories - standard stats, shooting, passing, passing types, goal and shot creation, defensive actions, possession, and miscellaneous
def get_outfield_data(top, end):
    df1 = frame_for_category('stats',top,end,stats)
    df2 = frame_for_category('shooting',top,end,shooting2)
    df3 = frame_for_category('passing',top,end,passing2)
    df4 = frame_for_category('passing_types',top,end,passing_types2)
    df5 = frame_for_category('gca',top,end,gca2)
    df6 = frame_for_category('defense',top,end,defense2)
    df7 = frame_for_category('possession',top,end,possession2)
    df8 = frame_for_category('misc',top,end,misc2)
    df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8], axis=1)
    df = df.loc[:,~df.columns.duplicated()]
    return df
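
The way `get_outfield_data` combines the per-category frames can be illustrated on toy data. The sketch below uses two small, made-up DataFrames (`df_a`, `df_b` and their values are hypothetical) that share a `goals` column, joins them side by side, and then drops the repeated column in the same way as the function above.

```python
import pandas as pd

# Two toy category frames that share the 'goals' column
df_a = pd.DataFrame({"player": ["A", "B"], "goals": [3.0, 1.0], "xg": [2.5, 0.8]})
df_b = pd.DataFrame({"goals": [3.0, 1.0], "shots_total": [10.0, 4.0]})

# axis=1 places the frames side by side, matching rows by index label
combined = pd.concat([df_a, df_b], axis=1)

# columns.duplicated() flags the second and later occurrences of a name,
# so the mask keeps only the first copy of each column
deduped = combined.loc[:, ~combined.columns.duplicated()]

print(list(deduped.columns))
```

Since every frame returned by `get_frame` has a default integer index, `pd.concat(..., axis=1)` pairs rows purely by position. This works only if each category page lists the same players in the same order, so comparing the row counts of the individual frames before concatenating is a cheap sanity check.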

In [9]:
# Function to get goalkeeping and advanced goalkeeping data
def get_keeper_data(top,end):
    df1 = frame_for_category('keepers',top,end,keepers)
    df2 = frame_for_category('keepersadv',top,end,keepersadv2)
    df = pd.concat([df1, df2], axis=1)
    df = df.loc[:,~df.columns.duplicated()]
    return df

In [10]:
# Function to get team-wise data across all categories as mentioned above
def get_team_data(top,end):
    df1 = frame_for_category_team('stats',top,end,stats3)
    df2 = frame_for_category_team('keepers',top,end,keepers3)
    df3 = frame_for_category_team('keepersadv',top,end,keepersadv2)
    df4 = frame_for_category_team('shooting',top,end,shooting3)
    df5 = frame_for_category_team('passing',top,end,passing2)
    df6 = frame_for_category_team('passing_types',top,end,passing_types2)
    df7 = frame_for_category_team('gca',top,end,gca2)
    df8 = frame_for_category_team('defense',top,end,defense2)
    df9 = frame_for_category_team('possession',top,end,possession2)
    df10 = frame_for_category_team('misc',top,end,misc2)
    df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10], axis=1)
    df = df.loc[:,~df.columns.duplicated()]
    return df

---

<a id='section2'></a>

## <a id='#section2'>2. Project Brief</a>
This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and then model the data, culminating in an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to determine the under- and over-performing players based on their on-the-pitch output against transfer fee and wages.

This particular notebook is one of several web scraping notebooks. It takes data from the [FBref](https://fbref.com/en/) website, provided by [StatsBomb](https://statsbomb.com/), scrapes it using [Beautiful Soup](https://pypi.org/project/beautifulsoup4/), and manipulates it as DataFrames using [pandas](http://pandas.pydata.org/).

This notebook, along with the other notebooks in this project workflow, is shown in the following diagram:

![roadmap](../../img/football_analytics_data_roadmap.png)

Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:
*    [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)
     +    [FBref Player Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Player%20Stats%20Web%20Scraping.ipynb)
     +    [TransferMarkt Player Bio and Status Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Bio%20and%20Status%20Web%20Scraping.ipynb)
     +    [TransferMarkt Player Recorded Transfer Fees Webscraping]()
     +    [Capology Player Salary Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/Capology%20Player%20Web%20Scraping.ipynb)
     +    [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb)
*    [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)
     +    [ELO Team Ratings Data Parsing]()
*    [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)
     +    [FBref Player Stats Data Engineering]()
     +    [TransferMarkt Player Bio and Status Data Engineering]()
     +    [TransferMarkt Player Recorded Transfer Fees Data Engineering]()
     +    [Capology Player Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Data%20Engineering.ipynb)
     +    [FBref Team Stats Data Engineering]()
     +    [ELO Team Ratings Data Parsing]()
     +    [TransferMarkt Team Recorded Transfer Fee Data Engineering]() (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook]())
     +    [Capology Team Salary Data Engineering]() (aggregated from [Capology Player Salary notebook]())
*    [Joining of Datasets]()
     +    [Player Golden ID of Football Datasets]()
     +    [Team Golden ID of Football Datasets]()
*    [Production Datasets]()
     +    [Player Performance/Market Value Dataset]()
     +    [Team Performance/Market Value Dataset]()
*    [Modeling]()
     +    [Expected Transfer (xTransfer) Model]()

---

<a id='section3'></a>

## <a id='#section3'>3. Data Sources</a>

### <a id='#section3.1'>3.1. Introduction</a>
This Data Sources section has been split into two subsections: outfield players and goalkeepers.

These [FBref](https://fbref.com/en/) web scrapers include the extensive data recently added for the top five leagues. The code for the [FBref](https://fbref.com/en/) web scraping is taken from [this](https://github.com/parth1902/Scrape-FBref-data) repository by [parth1902](https://github.com/parth1902), which is in turn based on [this](https://github.com/chmartin/FBref_EPL) repository written by [chmartin](https://github.com/chmartin).

The data needs to be scraped, converted to a pandas DataFrame ([Section 3](#section3)) and cleaned in the Data Engineering section ([Section 4](#section4)).

We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame.

The FBref web scraping results in the following two player DataFrames:
*    outfield data
*    keeper data

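It is also possible to scrape team data from FBref. This is covered separately in the [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb) notebook in the [Data Scraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping) folder.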

### <a id='#section3.3'>3.3. Outfield Players</a>

#### <a id='#section3.3.1'>3.3.1. Data Dictionary</a>
The raw dataset has one hundred and eighty eight features (columns) with the following definitions and data types:

| Variable     | Data Type    | Description    |
|------|-----|-----|
| `squad`    | object    | Squad name e.g. Arsenal    |
| `players_used`    | float64    | Number of Players used in Games    |
| `possession`    | float64    | Percentage of time with possession of the ball    |


<br>
The features will be cleaned, converted and also additional features will be created in the [Data Engineering](#section4) section (Section 4).

#### <a id='#section3.3.2'>3.3.2. Creating the DataFrame - scraping the data</a>
Scrape the data and save as a pandas DataFrame using the function `get_outfield_data`.

For outfielders, we are not required to download the data for individual leagues and concatenate them; it can be downloaded in one go from the 'Big 5' European leagues players page.
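
The `top` and `end` arguments passed to `get_outfield_data` are the two halves of the 'Standard stats' URL on either side of the `stats` path segment. `frame_for_category` then inserts each category name between them. A minimal sketch of that URL assembly, using the 2019/20 link from the scraping notes, is given here. The variable names are illustrative and no request is sent.

```python
# The 2019/20 'Standard stats' URL from the scraping notes
standard_stats_url = (
    "https://fbref.com/en/comps/Big5/2019-2020/stats/players/"
    "2019-2020-Big-5-European-Leagues-Stats"
)

# Split once at the first 'stats' segment; the capitalised 'Stats' at the
# end of the URL is untouched because the match is case-sensitive.
top, end = standard_stats_url.split("stats", 1)

# Category names as passed to frame_for_category inside get_outfield_data
categories = ["stats", "shooting", "passing", "passing_types",
              "gca", "defense", "possession", "misc"]

# Same concatenation as frame_for_category: top + category + end
category_urls = {category: top + category + end for category in categories}

for category, url in category_urls.items():
    print(f"{category:>13}: {url}")
```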

In [None]:
# Run this script to scrape latest version of the player data from FBref

## Notes
### Go to the 'Standard stats' page of the league
### For the Big 5 European Leagues for 2020/21, the link is this: https://fbref.com/en/comps/Big5/stats/players/Big-5-European-Leagues-Stats
### For the Big 5 European Leagues for 2019/20, the link is this: https://fbref.com/en/comps/Big5/2019-2020/stats/players/2019-2020-Big-5-European-Leagues-Stats
### Split the link at 'stats': pass the part before it as the first parameter and the part after it as the second, as below


## Start timer
tic = datetime.datetime.now()


## 20/21
df_fbref_outfield_big5_2021_raw = get_outfield_data('https://fbref.com/en/comps/Big5/','/players/Big-5-European-Leagues-Stats')
df_fbref_outfield_big5_2021_raw.to_csv(data_dir_fbref + f'/raw/outfield/2021/archive/' + f'player_big5_2021_raw_last_updated_{today}.csv', index=None, header=True)
df_fbref_outfield_big5_2021_raw.to_csv(data_dir_fbref + f'/raw/outfield/2021/' + f'player_big5_2021_raw_latest.csv', index=None, header=True)

## 19/20
df_fbref_outfield_big5_1920_raw = get_outfield_data('https://fbref.com/en/comps/Big5/2019-2020/','/players/2019-2020-Big-5-European-Leagues-Stats')
df_fbref_outfield_big5_1920_raw.to_csv(data_dir_fbref + '/raw/outfield/1920/' + 'player_big5_1920_raw.csv', index=None, header=True)

## 18/19
df_fbref_outfield_big5_1819_raw = get_outfield_data('https://fbref.com/en/comps/Big5/2018-2019/','/players/2018-2019-Big-5-European-Leagues-Stats')
df_fbref_outfield_big5_1819_raw.to_csv(data_dir_fbref + '/raw/outfield/1819/' + 'player_big5_1819_raw.csv', index=None, header=True)

## 17/18
df_fbref_outfield_big5_1718_raw = get_outfield_data('https://fbref.com/en/comps/Big5/2017-2018/','/players/2017-2018-Big-5-European-Leagues-Stats')
df_fbref_outfield_big5_1718_raw.to_csv(data_dir_fbref + '/raw/outfield/1718/' + 'player_big5_1718_raw.csv', index=None, header=True)


## End timer
toc = datetime.datetime.now()


## Calculate time taken
total_time = (toc-tic).total_seconds()
print(f'Time taken to scrape player data for the Big 5 leagues is: {total_time/60:0.2f} minutes.')

#### <a id='#section3.3.3'>3.3.3. Preliminary Data Handling</a>
Let's check the quality of the dataset for the 2020/21 season by looking at the first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.

In [None]:
# Display the first 5 rows of the raw DataFrame, df_fbref_outfield_big5_2021_raw
df_fbref_outfield_big5_2021_raw.head()

In [None]:
# Display the last 5 rows of the raw DataFrame, df_fbref_outfield_big5_2021_raw
df_fbref_outfield_big5_2021_raw.tail()

[shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) returns a tuple representing the dimensionality of the DataFrame.

In [None]:
# Print the shape of the raw DataFrame, df_fbref_outfield_big5_2021_raw
print(df_fbref_outfield_big5_2021_raw.shape)

[columns](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) returns the column labels of the DataFrame.

In [None]:
# Features (column names) of the raw DataFrame, df_fbref_outfield_big5_2021_raw
df_fbref_outfield_big5_2021_raw.columns

The [dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) attribute returns the data type of each column in the DataFrame.

In [None]:
# Data types of the features of the raw DataFrame, df_fbref_outfield_big5_2021_raw
df_fbref_outfield_big5_2021_raw.dtypes

In [None]:
# Displays the data types of all columns without truncation
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_fbref_outfield_big5_2021_raw.dtypes)

The [info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method gives a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

In [None]:
# Info for the raw DataFrame, df_fbref_outfield_big5_2021_raw
df_fbref_outfield_big5_2021_raw.info()

The [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method shows some useful statistics for each numerical column in the DataFrame.

In [None]:
# Description of the raw DataFrame, df_fbref_outfield_big5_2021_raw, showing some summary statistics for each numerical column in the DataFrame
df_fbref_outfield_big5_2021_raw.describe()

Next, we will check to see how many missing values we have i.e. the number of NULL values in the dataset, and in what features these missing values are located. This can be plotted nicely using the [missingno](https://pypi.org/project/missingno/) library (pip install missingno).

In [None]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_fbref_outfield_big5_2021_raw
msno.matrix(df_fbref_outfield_big5_2021_raw, figsize = (30, 7))

In [None]:
# Counts of missing values
null_value_stats = df_fbref_outfield_big5_2021_raw.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

The visualisation shows us very quickly that there are no missing values in the dataset.
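
The absence of missing values is expected by construction: `get_frame` replaces every empty cell with `'0'` before building the DataFrame, so a value that was missing on the page shows up as a zero rather than as a NULL. The sketch below, on a small made-up DataFrame (`toy` and its values are hypothetical), shows the null count used above alongside a zero count, which may be the more informative check for this dataset.

```python
import pandas as pd

# Toy frame with one genuinely missing value and one zero
toy = pd.DataFrame({
    "player": ["A", "B", "C"],
    "xg": [0.4, None, 1.2],
    "goals": [1.0, 0.0, 2.0],
})

# Same approach as above: count NULLs per column, keep the non-zero counts
null_counts = toy.isnull().sum(axis=0)
cols_with_nulls = null_counts[null_counts != 0]

# Complementary check: count exact zeros in the numeric columns
zero_counts = (toy.select_dtypes("number") == 0).sum(axis=0)

print(cols_with_nulls.to_dict())
print(zero_counts.to_dict())
```

A zero count cannot tell a true zero (no goals scored) apart from a filled-in blank, so it is only a hint about which columns deserve a closer look.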

### <a id='#section3.4'>3.4. Goalkeepers</a>

#### <a id='#section3.4.1'>3.4.1. Data Dictionary</a>
The raw dataset has one hundred and eighty eight features (columns) with the following definitions and data types:

| Variable     | Data Type    | Description    |
|------|-----|-----|
| `squad`    | object    | Squad name e.g. Arsenal    |
| `players_used`    | float64    | Number of Players used in Games    |
| `possession`    | float64    | Percentage of time with possession of the ball    |


<br>
The features will be cleaned and converted, and additional features will be created, in the [Data Engineering](#section4) section (Section 4).

#### <a id='#section3.4.2'>3.4.2. Creating the DataFrame - scraping the data</a>
Scrape the data and save as a pandas DataFrame using the function `get_keeper_data`.

As with the outfielders, we do not need to download the goalkeeper data for individual leagues and concatenate them. The data can be downloaded in one go from the 'Big 5' European leagues goalkeepers page.

In [None]:
# Run this script to scrape latest version of the goalkeeper data from FBref

## Notes
### Go to the 'Standard stats' page of the league
### For the Big 5 European Leagues for 2020/21, the link is this: https://fbref.com/en/comps/Big5/stats/players/Big-5-European-Leagues-Stats
### For the Big 5 European Leagues for 2019/20, the link is this: https://fbref.com/en/comps/Big5/2019-2020/stats/players/2019-2020-Big-5-European-Leagues-Stats
### Remove the 'stats', and pass the first and third part of the link as parameters like below


## Start timer
tic = datetime.datetime.now()


## Download goalkeeper data from the 'Big 5' leagues for the 17/18 to 20/21 seasons

### 20/21
df_fbref_goalkeeper_big5_2021_raw = get_keeper_data('https://fbref.com/en/comps/Big5/','/players/Big-5-European-Leagues-Stats')
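# Save the 20/21 goalkeeper data; the paths assume the same folder layout as the other seasons, with an archive/ subfolder as used for the outfield data
df_fbref_goalkeeper_big5_2021_raw.to_csv(data_dir_fbref + '/raw/goalkeeper/2021/archive/' + f'goalkeeper_big5_2021_raw_last_updated_{today}.csv', index=None, header=True)
df_fbref_goalkeeper_big5_2021_raw.to_csv(data_dir_fbref + '/raw/goalkeeper/2021/' + 'goalkeeper_big5_2021_raw_latest.csv', index=None, header=True)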

### 19/20
df_fbref_goalkeeper_big5_1920_raw = get_keeper_data('https://fbref.com/en/comps/Big5/2019-2020/','/players/2019-2020-Big-5-European-Leagues-Stats')
df_fbref_goalkeeper_big5_1920_raw.to_csv(data_dir_fbref + '/raw/goalkeeper/1920/' + 'goalkeeper_big5_1920_raw.csv', index=None, header=True)

### 18/19
df_fbref_goalkeeper_big5_1819_raw = get_keeper_data('https://fbref.com/en/comps/Big5/2018-2019/','/players/2018-2019-Big-5-European-Leagues-Stats')
df_fbref_goalkeeper_big5_1819_raw.to_csv(data_dir_fbref + '/raw/goalkeeper/1819/' + 'goalkeeper_big5_1819_raw.csv', index=None, header=True)

### 17/18
df_fbref_goalkeeper_big5_1718_raw = get_keeper_data('https://fbref.com/en/comps/Big5/2017-2018/','/players/2017-2018-Big-5-European-Leagues-Stats')
df_fbref_goalkeeper_big5_1718_raw.to_csv(data_dir_fbref + '/raw/goalkeeper/1718/' + 'goalkeeper_big5_1718_raw.csv', index=None, header=True)


## End timer
toc = datetime.datetime.now()


## Calculate time taken
total_time = (toc-tic).total_seconds()
print(f'Time taken to scrape goalkeeper data for the Big 5 leagues is: {total_time:0.2f} seconds.')
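
The notes in the cell above describe removing the `stats` segment from the league URL and passing the two remaining parts as parameters. A minimal sketch of that split is shown here; `split_fbref_url` is a hypothetical helper written for illustration and is not part of the scraping code used in this notebook.

```python
def split_fbref_url(url, category='stats'):
    """Split an FBref stats-page URL around its category segment.

    Hypothetical helper: returns the part before '/<category>/' (with a
    trailing slash) and the part after it (with a leading slash).
    """
    marker = f'/{category}/'
    prefix, sep, suffix = url.partition(marker)
    if not sep:
        raise ValueError(f"'{marker}' not found in URL: {url}")
    return prefix + '/', '/' + suffix


# URL taken from the notes in the scraping cell
top, end = split_fbref_url(
    'https://fbref.com/en/comps/Big5/stats/players/Big-5-European-Leagues-Stats'
)
print(top)
print(end)
```

The two returned strings have the same form as the arguments passed to `get_keeper_data` above.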

#### <a id='#section3.4.3'>3.4.3. Preliminary Data Handling</a>
Let's check the quality of the dataset by looking at the first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.

In [None]:
# Display the first 5 rows of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
df_fbref_goalkeeper_big5_2021_raw.head()

In [None]:
# Display the last 5 rows of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
df_fbref_goalkeeper_big5_2021_raw.tail()

[shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) returns a tuple representing the dimensionality of the DataFrame.

In [None]:
# Print the shape of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
print(df_fbref_goalkeeper_big5_2021_raw.shape)

The raw DataFrame has:
*    744 observations (rows), each representing one goalkeeper record, and
*    20 attributes (columns).
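
As a small illustration of how the shape tuple maps to rows and columns, the sketch below unpacks it for a made-up three-row DataFrame (the `player` and `minutes` columns are hypothetical).

```python
import pandas as pd

# Hypothetical toy DataFrame with 3 rows and 2 columns
df_example = pd.DataFrame({'player': ['A', 'B', 'C'], 'minutes': [90, 45, 0]})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df_example.shape
print(f'{n_rows} observations (rows) and {n_cols} attributes (columns)')
```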

[columns](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) returns the column labels of the DataFrame.

In [None]:
# Features (column names) of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
df_fbref_goalkeeper_big5_2021_raw.columns

The [dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) attribute returns the data type of each column in the DataFrame.

In [None]:
# Data types of the features of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
df_fbref_goalkeeper_big5_2021_raw.dtypes

In [None]:
# Displays the data types of all columns without truncation
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_fbref_goalkeeper_big5_2021_raw.dtypes)

The [info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method gives a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

In [None]:
# Info for the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
df_fbref_goalkeeper_big5_2021_raw.info()

The [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method shows some useful statistics for each numerical column in the DataFrame.

In [None]:
# Description of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw, showing some summary statistics for each numerical column in the DataFrame
df_fbref_goalkeeper_big5_2021_raw.describe()
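
By default `describe()` summarises only the numerical columns, so a column whose numbers were scraped as text is left out of the summary. The sketch below shows this on made-up data; the column names are illustrative and the `saves` column is deliberately stored as strings.

```python
import pandas as pd

# Hypothetical toy data; 'saves' holds numbers stored as strings
df_example = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D'],
    'minutes': [90, 45, 0, 90],
    'saves': ['3', '1', '0', '4'],
})

# Default behaviour: numerical columns only
numeric_summary = df_example.describe()

# include='all' adds the non-numerical columns to the summary
full_summary = df_example.describe(include='all')

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

Comparing the two column lists is a quick way to spot features that may need converting to a numerical type before analysis.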

Next, we will check to see how many missing values we have i.e. the number of NULL values in the dataset, and in what features these missing values are located. This can be plotted nicely using the [missingno](https://pypi.org/project/missingno/) library (pip install missingno).

In [None]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_fbref_goalkeeper_big5_2021_raw
msno.matrix(df_fbref_goalkeeper_big5_2021_raw, figsize = (30, 7))

In [None]:
# Counts of missing values
null_value_stats = df_fbref_goalkeeper_big5_2021_raw.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

The visualisation shows us very quickly that there are no missing values in the dataset.

___

<a id='section4'></a>

## <a id='#section4'>4. Summary</a>
This notebook scrapes player statistics data from [StatsBomb](https://statsbomb.com/) via [FBref.com](https://fbref.com/en/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames and [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) for webscraping.

With this notebook, we now have aggregated player performance data for players in the 'Big 5' European leagues from the 17/18 season to the present.

___

<a id='section5'></a>

## <a id='#section5'>5. Next Steps</a>
This data is now ready to be exported and analysed in further Jupyter notebooks or Tableau.
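
As a rough sketch of how a follow-on notebook might pick the data up again, the example below writes a made-up DataFrame to CSV and reads it back with `pd.read_csv`. The file name and the temporary directory are assumptions for illustration; in practice the path would point at the files saved under `data_dir_fbref`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for a scraped DataFrame
df_example = pd.DataFrame({'player': ['A', 'B'], 'minutes': [90, 45]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Illustrative file name, not one of the notebook's actual output files
    csv_path = os.path.join(tmp_dir, 'goalkeeper_example_raw.csv')

    # Write without the index so no extra unnamed column appears on reload
    df_example.to_csv(csv_path, index=False, header=True)

    # Read the file back into a new DataFrame
    df_reloaded = pd.read_csv(csv_path)

print(df_reloaded.shape)
```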

The Data Engineering subfolder in GitHub can be found [here](https://github.com/eddwebster/football_analytics/tree/master/notebooks/B\)%20Data%20Engineering) and a static version of the record linkage notebook in which the FBref data is joined to TransferMarkt data can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/B%29%20Data%20Engineering/Record%20Linkage%20of%20FBref%20and%20TransferMarkt%20Datasets.ipynb).

___

<a id='section6'></a>

## <a id='#section6'>6. References</a>

#### Data and Web Scraping
*    [FBref](https://fbref.com/) for the data to scrape
*    FBref statement for using StatsBomb's data: https://fbref.com/en/statsbomb/
*    [StatsBomb](https://statsbomb.com/) providing the data to FBref
*    [FBref_EPL GitHub repository](https://github.com/chmartin/FBref_EPL) by [chmartin](https://github.com/chmartin) for the original web scraping code
*    [Scrape-FBref-data GitHub repository](https://github.com/parth1902/Scrape-FBref-data) by [parth1902](https://github.com/parth1902) for the revised web scraping code for the new FBref metrics


#### Countries
*    [Comparison of alphabetic country codes Wiki](https://en.wikipedia.org/wiki/Comparison_of_alphabetic_country_codes)

---

***Visit my website [EddWebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***

[Back to the top](#top)