# Portfolio Project - Web Scraping Football Matches From The EPL in Python

### Introduction

This project is the first part of a machine learning project in which I will predict the winner of each football match in the English Premier League (EPL).
The main goal of this part is to gather data via web scraping and assemble it into a cleaned pandas DataFrame.

### Downloading and Exploring the EPL Stats Page

In [45]:
import requests

In [46]:
url = "https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats"

In [47]:
domain = "https://fbref.com"

In [48]:
page = requests.get(url)

In [49]:
page.status_code # 200 means OK; 429 means "Too Many Requests" (the server is rate-limiting me)

429

In [50]:
page.headers

{'Date': 'Wed, 16 Oct 2024 19:20:02 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '109731', 'Connection': 'keep-alive', 'Retry-After': '3585', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=oS.BbroWotUVuwHwpJ3U02iqnMQHTwSBAiFonQdUHU0-1729106402-1.0.1.1-0GsmPbSUO8ofIyFScRenHfDLFYYejjfKL_7lN4vOqOgd3s.mQ3h347eVhp38NiQYj2fKEPBia24U62hVVhR3Jg; path=/; expires=Wed, 16-Oct-24 19:50:02 GMT; domain=.fbref.com; HttpOnly; Secure; SameSite=None', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '8d3a676969545adc-VIE'}
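The `429` status above means "Too Many Requests", and the `Retry-After` header in the same response says how many seconds to wait before trying again. A minimal sketch of reading that header is below. The helper name `seconds_to_wait` and the 60-second fallback are assumptions made for illustration; they are not part of the original notebook.

```python
def seconds_to_wait(status_code, headers, default_wait=60):
    """Return how long to pause before retrying, or 0 if no pause is needed."""
    if status_code != 429:
        return 0
    retry_after = headers.get("Retry-After")
    # Retry-After may also be an HTTP date; this sketch only handles the
    # plain number-of-seconds form seen in the response above.
    if retry_after is not None and retry_after.isdigit():
        return int(retry_after)
    return default_wait  # assumed fallback when the header is missing


print(seconds_to_wait(429, {"Retry-After": "3585"}))  # 3585
print(seconds_to_wait(200, {}))                       # 0
```

With `requests`, the call would be `seconds_to_wait(page.status_code, page.headers)`. A value of 3585 seconds is just under an hour, so rerunning the cell straight away would not help.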

Inspecting the page in a web browser, I can see that the `id` of the `table` tag that contains the teams and their results is `results2021-202291_overall`. Furthermore, the element that contains each team's name and the URL of its details page is labelled `team`.

### Parsing HTML Links

In [7]:
from bs4 import BeautifulSoup

In [8]:
soup = BeautifulSoup(page.content, "html.parser")

In [9]:
teams_table = soup.find("table", id="results2021-202291_overall")

In [10]:
links = teams_table.find_all("a")

In [11]:
"""
Example links:

Link: /en/squads/19538871/2021-2022/Manchester-United-Stats, Text: Manchester Utd    - a path containing the word 'squads' refers to a club
Link: /en/players/dea698d9/Cristiano-Ronaldo, Text: Cristiano Ronaldo
"""

club_paths = []

for link in links:
    path = link.get("href")
    
    if path and "squads" in path: # anchors without an href give None, so check that first
        club_paths.append(path)

In [12]:
club_paths

['/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats',
 '/en/squads/822bd0ba/2021-2022/Liverpool-Stats',
 '/en/squads/cff3d9bb/2021-2022/Chelsea-Stats',
 '/en/squads/361ca564/2021-2022/Tottenham-Hotspur-Stats',
 '/en/squads/18bb7c10/2021-2022/Arsenal-Stats',
 '/en/squads/19538871/2021-2022/Manchester-United-Stats',
 '/en/squads/7c21e445/2021-2022/West-Ham-United-Stats',
 '/en/squads/a2d435b3/2021-2022/Leicester-City-Stats',
 '/en/squads/d07537b9/2021-2022/Brighton-and-Hove-Albion-Stats',
 '/en/squads/8cec06e1/2021-2022/Wolverhampton-Wanderers-Stats',
 '/en/squads/b2b47a98/2021-2022/Newcastle-United-Stats',
 '/en/squads/47c64c55/2021-2022/Crystal-Palace-Stats',
 '/en/squads/cd051869/2021-2022/Brentford-Stats',
 '/en/squads/8602292d/2021-2022/Aston-Villa-Stats',
 '/en/squads/33c895d4/2021-2022/Southampton-Stats',
 '/en/squads/d3fd31cc/2021-2022/Everton-Stats',
 '/en/squads/5bfb9659/2021-2022/Leeds-United-Stats',
 '/en/squads/943e8050/2021-2022/Burnley-Stats',
 '/en/squads/2abfe087/2021-

In [13]:
club_urls = [domain + path for path in club_paths] # creating a list of full URLs (domain url + paths of each club)
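As an alternative to plain string concatenation, the standard library's `urllib.parse.urljoin` can build the full URLs. The sketch below also combines the `squads` filter with the join in one step. The sample `hrefs` list is made up for illustration from the example links shown earlier.

```python
from urllib.parse import urljoin

domain = "https://fbref.com"

# Sample hrefs for illustration: one club link, one player link, and one
# anchor without an href attribute (which BeautifulSoup reports as None).
hrefs = [
    "/en/squads/19538871/2021-2022/Manchester-United-Stats",
    "/en/players/dea698d9/Cristiano-Ronaldo",
    None,
]

# Keep only club paths and turn them into absolute URLs.
club_urls = [urljoin(domain, href) for href in hrefs if href and "/squads/" in href]
print(club_urls)
```

`urljoin` gives the same result as `domain + path` here, and it also copes with a base URL that has a trailing slash or a path that is already absolute.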

In [14]:
club_urls

['https://fbref.com/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/2021-2022/Liverpool-Stats',
 'https://fbref.com/en/squads/cff3d9bb/2021-2022/Chelsea-Stats',
 'https://fbref.com/en/squads/361ca564/2021-2022/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/18bb7c10/2021-2022/Arsenal-Stats',
 'https://fbref.com/en/squads/19538871/2021-2022/Manchester-United-Stats',
 'https://fbref.com/en/squads/7c21e445/2021-2022/West-Ham-United-Stats',
 'https://fbref.com/en/squads/a2d435b3/2021-2022/Leicester-City-Stats',
 'https://fbref.com/en/squads/d07537b9/2021-2022/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/8cec06e1/2021-2022/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/b2b47a98/2021-2022/Newcastle-United-Stats',
 'https://fbref.com/en/squads/47c64c55/2021-2022/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/cd051869/2021-2022/Brentford-Stats',
 'https://fbref.com/en/squads/8602292d/2021-2022/Aston-Vill

### Extracting Match Stats

In [15]:
page = requests.get(club_urls[5]) # Get the stats of Manchester United ( My favourite club :) )

In [16]:
page.status_code

200

In [17]:
soup = BeautifulSoup(page.content, 'html.parser')

In [18]:
mu_stats = soup.find(id="matchlogs_for") # the ID of the table that contains "Scores & Fixtures" is "matchlogs_for"

In [19]:
# Extracting stats from the table called "Scores & Fixtures"
def extracting_stats(table):

    columns = [] # These will be the column names of the DataFrame.

    for name in table.find_all("tr")[0]: # use the table passed in, not the global mu_stats
        if name.string != " ": # The row stores a space between elements, so I omit those.
            columns.append(name.string)

    match_stats = []

    for match in table.find_all("tr")[1:]: # Omitting the header (first "tr" tag)
        stats = []

        for stat in match:
            if stat.find("span"): # The time value is inside a "span" tag within the "td" tag
                stats.append(stat.find("span").text)
            else:
                stats.append(stat.string) # otherwise I add the value of the "td" tag to the list

        match_stats.append(stats)

    return columns, match_stats
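One thing to be aware of: the check `stat.find("span")` picks up the first `span` in any cell, not only in the Time cell. That may be why the Opponent column shows just `ch` for the Champions League match in the output further down: a cell holding a short country code in a `span` next to the club name would yield only the code. This is an assumption, not something verified against the page. The sketch below uses hypothetical, simplified markup and a hypothetical helper `cell_text` that prefers link text over span text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, simplified for illustration; the real page may differ.
html = """
<table>
  <tr><th>Date</th><th>Time</th><th>Opponent</th><th>GF</th></tr>
  <tr><th><a href="#">2021-09-14</a></th><td><span>18:45</span></td><td><span>ch</span> <a href="#">Young Boys</a></td><td>1</td></tr>
</table>
"""


def cell_text(cell):
    """Prefer link text, then span text, then the cell's own text."""
    for tag_name in ("a", "span"):
        inner = cell.find(tag_name)
        if inner is not None:
            return inner.get_text(strip=True)
    return cell.get_text(strip=True)


rows = BeautifulSoup(html, "html.parser").find_all("tr")
# Iterating over th/td tags only also skips whitespace between cells.
print([cell_text(cell) for cell in rows[1].find_all(["th", "td"])])
# ['2021-09-14', '18:45', 'Young Boys', '1']
```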

In [20]:
import pandas as pd

In [21]:
columns, data = extracting_stats(mu_stats)

In [22]:
mu_table = pd.DataFrame(data=data, columns=columns)

In [23]:
mu_table.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Opp Formation,Referee,Match Report,Notes
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,1.5,0.5,49,72732,Harry Maguire,4-2-3-1,4-1-4-1,Paul Tierney,Match Report,
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,1.8,0.7,63,32000,Harry Maguire,4-2-3-1,4-4-2,Craig Pawson,Match Report,
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,0.6,2.1,56,30621,Harry Maguire,4-2-3-1,3-4-3,Mike Dean,Match Report,
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,2.5,0.4,63,72732,Harry Maguire,4-2-3-1,5-4-1,Anthony Taylor,Match Report,
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch,0.5,1.4,46,31120,Harry Maguire,4-2-3-1,4-2-3-1,François Letexier,Match Report,


### Getting Match Shooting Stats

I will gather further stats for Manchester United (such as the number of shots, shots on target, free kicks, and penalty kicks), which are in another table on a tab called `Shooting`.

In [24]:
links_on_page = soup.find_all("a")

In [25]:
import re

shoots_link = domain + soup.find("a", href=re.compile("/matchlogs/all_comps/shooting/")).get("href")

In [26]:
shoots_link

'https://fbref.com/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/Manchester-United-Match-Logs-All-Competitions'

In [27]:
page = requests.get(shoots_link)
soup = BeautifulSoup(page.content, "html.parser")

shoots_stat_table = soup.find("table", id="matchlogs_for")

In [28]:
def extracting_shooting_stats(table):
    shooting_stats = table.find_all("tr") # use the table passed in, not the global shoots_stat_table

    columns = []

    for column_name in shooting_stats[1].find_all("th"): # the second row holds the column names
        columns.append(column_name.string)

    data = []

    for row in shooting_stats[2:]: # omitting the two header rows
        event_data = [] # stores the data of one row, which is then added to the data list that contains all rows
    
        for col in row:
            if col.find("span"): # the time value is inside a "span" tag within the "td" tag
                event_data.append(col.find("span").text)
            else:
                event_data.append(col.string)

        data.append(event_data)

    return columns, data

In [29]:
columns, data = extracting_shooting_stats(shoots_stat_table)

In [30]:
shoots_table = pd.DataFrame(data=data, columns=columns)

In [31]:
shoots_table.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,...,18.2,0,0,0,1.5,1.5,0.09,3.5,3.5,Match Report
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,...,15.1,1,0,0,1.8,1.8,0.14,-0.8,-0.8,Match Report
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,...,18.8,1,0,0,0.6,0.6,0.06,0.4,0.4,Match Report
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,...,20.5,0,0,0,2.5,2.5,0.12,1.5,1.5,Match Report
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch,...,10.8,0,0,0,0.5,0.5,0.26,0.5,0.5,Match Report


### Cleaning and Merging Scraped Data

In [32]:
shoots_table.shape

(50, 26)

In [33]:
mu_table.shape

(49, 20)

In [34]:
mu_table.tail()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Opp Formation,Referee,Match Report,Notes
44,2022-04-23,12:30,Premier League,Matchweek 34,Sat,Away,L,1,3,Arsenal,1.8,2.3,46,60223,Bruno Fernandes,4-2-3-1,4-2-3-1,Craig Pawson,Match Report,
45,2022-04-28,19:45,Premier League,Matchweek 37,Thu,Home,D,1,1,Chelsea,0.5,1.9,36,73564,Bruno Fernandes,4-2-3-1,3-4-1-2,Mike Dean,Match Report,
46,2022-05-02,20:00,Premier League,Matchweek 35,Mon,Home,W,3,0,Brentford,2.0,0.6,63,73482,Bruno Fernandes,4-2-3-1,5-3-2,Chris Kavanagh,Match Report,
47,2022-05-07,17:30,Premier League,Matchweek 36,Sat,Away,L,0,4,Brighton,0.9,2.3,58,31637,Bruno Fernandes,4-2-3-1,3-4-3,Andy Madley,Match Report,
48,2022-05-22,16:00,Premier League,Matchweek 38,Sun,Away,L,0,1,Crystal Palace,0.7,0.6,61,25434,Harry Maguire,4-2-3-1,4-3-3,Martin Atkinson,Match Report,


In [35]:
shoots_table.tail()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
45,2022-04-28,19:45,Premier League,Matchweek 37,Thu,Home,D,1.0,1.0,Chelsea,...,15.9,0,0,0,0.5,0.5,0.09,0.5,0.5,Match Report
46,2022-05-02,20:00,Premier League,Matchweek 35,Mon,Home,W,3.0,0.0,Brentford,...,20.3,1,1,1,2.0,1.2,0.15,1.0,0.8,Match Report
47,2022-05-07,17:30,Premier League,Matchweek 36,Sat,Away,L,0.0,4.0,Brighton,...,19.4,1,0,0,0.9,0.9,0.06,-0.9,-0.9,Match Report
48,2022-05-22,16:00,Premier League,Matchweek 38,Sun,Away,L,0.0,1.0,Crystal Palace,...,19.8,1,0,0,0.7,0.7,0.08,-0.7,-0.7,Match Report
49,,,,,,,--,,,,...,17.7,24,3,6,,,0.11,70.0,67.0,


In [36]:
shoots_table = shoots_table[:49] # The last row of the table holds the season totals, so I omit it
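The slice `[:49]` fits this one team and season. A sketch of a filter that does not depend on the row count is below, assuming the totals row is the only row without a date, as in the tail shown above. The toy frame reuses a few values from that output.

```python
import pandas as pd

# Toy frame: two match rows and a totals row with no date.
shots = pd.DataFrame(
    {
        "Date": ["2022-05-07", "2022-05-22", None],
        "Dist": ["19.4", "19.8", "17.7"],
        "FK": ["1", "1", "24"],
    }
)

# Keep only rows that have a date; this covers both None and empty strings.
has_date = shots["Date"].notna() & (shots["Date"] != "")
matches_only = shots[has_date]
print(matches_only.shape)  # (2, 3)
```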

In [37]:
shoots_table = shoots_table.drop(["xG", "Match Report"], axis=1) # These columns aren't needed (mu_table already has them). They aren't at the start of the table, so a slice can't skip them and I drop them before concatenating

In [38]:
scraped_stats = pd.concat([mu_table, shoots_table.loc[:, "Gls":]], axis=1) # The shooting table starts with the same columns as mu_table, so I keep only the columns from "Gls" onwards
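`pd.concat(..., axis=1)` lines rows up by index label, so it relies on both tables listing the same matches in the same order. A sketch of a key-based alternative with `DataFrame.merge` is below. It assumes `Date` is unique per match for a single team, and the toy frames reuse a few values from the outputs above.

```python
import pandas as pd

matches = pd.DataFrame(
    {"Date": ["2021-08-14", "2021-08-22"], "Opponent": ["Leeds United", "Southampton"]}
)
# Deliberately listed in a different order to show that the join is by key.
shooting = pd.DataFrame(
    {"Date": ["2021-08-22", "2021-08-14"], "Dist": ["15.1", "18.2"]}
)

# validate raises an error if a date appears more than once on either side.
merged = matches.merge(shooting, on="Date", how="left", validate="one_to_one")
print(merged)
```

The `how="left"` choice keeps every row of the match table even if the shooting table lacks that match, which makes gaps visible as missing values instead of silently shifting rows.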

In [39]:
scraped_stats.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,G/Sh,G/SoT,Dist,FK,PK,PKatt,npxG,npxG/Sh,G-xG,np:G-xG
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,...,0.31,0.63,18.2,0,0,0,1.5,0.09,3.5,3.5
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,...,0.07,0.33,15.1,1,0,0,1.8,0.14,-0.8,-0.8
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,...,0.1,0.33,18.8,1,0,0,0.6,0.06,0.4,0.4
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,...,0.19,0.67,20.5,0,0,0,2.5,0.12,1.5,1.5
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch,...,0.5,0.5,10.8,0,0,0,0.5,0.26,0.5,0.5


### Scraping Data for Multiple Seasons and Teams

In [40]:
import time # to pause between requests, so that I don't overload the website and get myself blocked
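One way to keep the pauses in a single place is a small generator that waits between items. The name `paced` and the 6-second delay are assumptions for illustration; the right delay depends on the site's own guidance for automated requests.

```python
import time


def paced(items, delay_seconds, sleep=time.sleep):
    """Yield items one by one, pausing before every item except the first."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


# Usage with a recording stand-in for time.sleep, so the example runs instantly.
pauses = []
for url in paced(["team-a", "team-b", "team-c"], delay_seconds=6.0, sleep=pauses.append):
    pass  # a real loop would call requests.get(url) here
print(pauses)  # [6.0, 6.0]
```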

In [41]:
# generating the season links for 2023-24 and 2022-23, in descending order

season_links = [f"https://fbref.com/en/comps/9/20{i:02d}-20{i+1:02d}/20{i:02d}-20{i+1:02d}-Premier-League-Stats" for i in range(23, 21, -1)]

In [42]:
season_links

['https://fbref.com/en/comps/9/2023-2024/2023-2024-Premier-League-Stats',
 'https://fbref.com/en/comps/9/2022-2023/2022-2023-Premier-League-Stats']
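The same links can be built from full four-digit years, which avoids the `20{i:02d}` formatting. This is a small sketch; the function name `build_season_url` is hypothetical.

```python
def build_season_url(start_year):
    """Build the Premier League stats URL for the season starting in start_year."""
    season = f"{start_year}-{start_year + 1}"
    return f"https://fbref.com/en/comps/9/{season}/{season}-Premier-League-Stats"


season_links = [build_season_url(year) for year in range(2023, 2021, -1)]
print(season_links)
```

Changing the range to `range(2023, 2020, -1)` would also include the 2021-2022 season.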

In [43]:
dataframes = [] # each DataFrame in the list is a match log for one team in one season

In [44]:
##### Scraping over multiple seasons

for season_url in season_links[:1]:
    
    print("Start a season ")
    
    page = requests.get(season_url)
    
    soup = BeautifulSoup(page.content, "html.parser")

    results_table = soup.find("table")

    team_urls = []

    for link in results_table.find_all("a"):
        path = link.get("href") 
    
        if path and "squads" in path: # this contains the url of club stats
            team_urls.append(domain + path)

    for team_url in team_urls:

        print("Start a team")
        time.sleep(0.5)
        
        page = requests.get(team_url)
        soup = BeautifulSoup(page.content, "html.parser")

        club_stats = soup.find(id="matchlogs_for")

        columns, data = extracting_stats(club_stats)

        club_dataframe = pd.DataFrame(data=data, columns=columns)

        # soup.find returns None if the page has no shooting link, and .get then raises an AttributeError
        shoots_link = domain + soup.find("a", href=re.compile("/matchlogs/all_comps/shooting/")).get("href")

        shoots_page = requests.get(shoots_link)
        soup = BeautifulSoup(shoots_page.content, "html.parser")
        time.sleep(0.5)

        shoots_stat_table = soup.find(id="matchlogs_for")

        columns, data = extracting_shooting_stats(shoots_stat_table)

        shoots_dataframe = pd.DataFrame(data=data, columns=columns)
        shoots_dataframe = shoots_dataframe[:-1] # the last row holds the season totals; the row count differs per team
        shoots_dataframe = shoots_dataframe.drop(["xG", "Match Report"], axis=1)

        # join column-wise, as in the single-team example, keeping only the columns from "Gls" onwards
        combined_dataframe = pd.concat([club_dataframe, shoots_dataframe.loc[:, "Gls":]], axis=1)

        dataframes.append(combined_dataframe)

Start a season 
Start a team
Start a team
Start a team
Start a team
Start a team
Start a team
Start a team


AttributeError: 'NoneType' object has no attribute 'get'
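The error says that `.get` was called on `None`. `soup.find(...)` returns `None` when no matching element exists, which is likely what happens when the server sends back an error page (such as the 429 response seen earlier) instead of the squad page. A minimal sketch of a guard is below. The helper `find_shooting_path` and both HTML snippets are hypothetical and only illustrate the check.

```python
import re

from bs4 import BeautifulSoup

SHOOTING_LINK = re.compile("/matchlogs/all_comps/shooting/")


def find_shooting_path(html):
    """Return the shooting match-log path, or None if the page has no such link."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", href=SHOOTING_LINK)
    return link.get("href") if link is not None else None


# Hypothetical pages: one with the link, one standing in for an error page.
squad_page = (
    '<a href="/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/'
    'Manchester-United-Match-Logs-All-Competitions">Shooting</a>'
)
blocked_page = "<html><body><h1>Rate limited</h1></body></html>"

for name, html in [("squad", squad_page), ("blocked", blocked_page)]:
    path = find_shooting_path(html)
    if path is None:
        print(name, "-> no shooting link, skipping")
    else:
        print(name, "->", path)
```

Inside the scraping loop, a `None` result could be handled with `continue` after checking `page.status_code`, so that one blocked request does not stop the whole run.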