## Web Scraping English Premier League Stats from FBref


#### Overview:
The aim of this project is to collect match-level statistics for the English Premier League (EPL) from the FBref website, covering the four seasons from 2019-2020 through 2022-2023 (labelled 2020-2023 by the year in which each season ends). The data covers team performance, match outcomes and additional match-related information.

#### Website:
The data will be extracted from FBref's dedicated page for the English Premier League: [FBref - Premier League Stats](https://fbref.com/en/comps/9/2022-2023/2022-2023-Premier-League-Stats).

#### Data to be Scraped:
The project will scrape the following key data points for each match:

- `Team`: Name of the participating team.
- `Year`: The year in which the season ends (2020-2023); for example, 2023 denotes the 2022-2023 season.
- `Date`: Date of the match.
- `Time`: Kick-off time of the match.
- `Competition (Comp)`: Indicates the competition (Premier League).
- `Round`: The round of the match within the season.
- `Day`: Day of the week the match was played.
- `Venue`: Whether the team played at home or away.
- `Result`: Final result of the match.
- `Goals For (GF)`: Total goals scored by the team.
- `Goals Against (GA)`: Total goals conceded by the team.
- `Opponent`: Name of the opposing team.
- `Expected Goals (xG)`: Predicted number of goals expected based on the shots taken.
- `Expected Goals Allowed (xGA)`: Predicted number of goals expected to be conceded based on opponent's shots.
- `Possession (Poss)`: The team's share of possession, as a percentage.
- `Attendance`: Number of spectators present at the venue.
- `Captain`: Captain of the team for the match.
- `Formation`: Tactical formation used by the team.
- `Referee`: Name of the match referee.
- `Notes`: Any additional notes.

#### Storage Format:
The collected data will be stored in an Excel file for easy access and analysis. Each row represents one match from the perspective of one team, so every fixture appears twice (once per team), with columns corresponding to the data points listed above.

#### Project Workflow:
- Fetch the HTML content of the FBref page using web scraping libraries like BeautifulSoup and requests.
- Extract the URLs for each team's page from the standings table.
- Iterate through each team's page URL and scrape the match data.
- Organize the extracted data into a structured format.
- Store the structured data in an Excel file using the pandas library.
- Ensure data integrity and correctness through validation and testing.
- Repeat the scraping process for each season (2020-2023) to accumulate comprehensive data.
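
The validation step listed above is not shown in the notebook cells. A minimal sketch of what such checks might look like, assuming the lower-cased column names used in the final table (`team`, `year`, `date`, `result`, `gf`, `ga`). The helper name `validate_matches` and the specific checks are illustrative assumptions, not part of the original project:

```python
import pandas as pd


def validate_matches(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in the match table.

    Hypothetical helper: the checks are examples, not an exhaustive test suite.
    """
    problems = []

    # A team should not have two rows for the same date.
    dupes = df.duplicated(subset=["team", "date"]).sum()
    if dupes:
        problems.append(f"{dupes} duplicate team/date rows")

    # The result letter should agree with the goals columns.
    gf = pd.to_numeric(df["gf"], errors="coerce")
    ga = pd.to_numeric(df["ga"], errors="coerce")
    expected = pd.Series("D", index=df.index)
    expected[gf > ga] = "W"
    expected[gf < ga] = "L"
    mismatched = (expected != df["result"]).sum()
    if mismatched:
        problems.append(f"{mismatched} rows where result disagrees with gf/ga")

    # In a 20-team league each team plays 38 league matches per season.
    per_team = df.groupby(["year", "team"]).size()
    short = per_team[per_team != 38]
    if not short.empty:
        problems.append(f"{len(short)} team-seasons without exactly 38 matches")

    return problems
```

An empty list means none of these particular checks failed; it does not prove the data is correct.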

#### Benefits:
- Provides access to detailed statistical insights into English Premier League matches.
- Facilitates historical analysis and comparison of team performances across multiple seasons.
- Enables data-driven decision-making for various stakeholders including fans, analysts, and sports professionals.


In [1]:
# import necessary libraries
import pandas as pd 
import requests
from bs4 import BeautifulSoup as bf
import time

This phase of the project retrieves the standings table for the 2022-2023 English Premier League (EPL) season. The table summarises each team's performance: rank, points, matches played, wins, draws, losses, goals scored and goals conceded.
For the remaining seasons, this table is not stored in a DataFrame.

In [2]:
# 2022 - 2023 season
url = 'https://fbref.com/en/comps/9/2022-2023/2022-2023-Premier-League-Stats'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win 64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
data = requests.get(url, headers = headers)
soup = bf(data.text, 'html.parser')
standing_table = soup.select('table.stats_table')[0]
standing_table_2022_2023 = pd.read_html(data.text, match = 'Regular season Table')[0]
standing_table_2022_2023

Unnamed: 0,Rk,Squad,MP,W,D,L,GF,GA,GD,Pts,Pts/MP,xG,xGA,xGD,xGD/90,Attendance,Top Team Scorer,Goalkeeper,Notes
0,1,Manchester City,38,28,5,5,94,33,61,89,2.34,78.6,32.1,46.5,1.22,53249,Erling Haaland - 36,Ederson,→ Champions League via league finish
1,2,Arsenal,38,26,6,6,88,43,45,84,2.21,71.9,42.0,29.9,0.79,60191,"Martin Ødegaard, Gabriel Martinelli - 15",Aaron Ramsdale,→ Champions League via league finish
2,3,Manchester Utd,38,23,6,9,58,43,15,75,1.97,67.7,50.4,17.3,0.45,73671,Marcus Rashford - 17,David de Gea,→ Champions League via league finish
3,4,Newcastle Utd,38,19,14,5,68,33,35,71,1.87,72.0,39.6,32.4,0.85,52127,Callum Wilson - 18,Nick Pope,→ Champions League via league finish
4,5,Liverpool,38,19,10,9,75,47,28,67,1.76,72.6,50.9,21.7,0.57,53163,Mohamed Salah - 19,Alisson,→ Europa League via league finish
5,6,Brighton,38,18,8,12,72,53,19,62,1.63,73.3,50.2,23.1,0.61,31477,Alexis Mac Allister - 10,Robert Sánchez,→ Europa League via league finish
6,7,Aston Villa,38,18,7,13,51,46,5,61,1.61,50.2,52.5,-2.2,-0.06,39485,Ollie Watkins - 15,Emiliano Martínez,→ Europa Conference League via league finish
7,8,Tottenham,38,18,6,14,70,63,7,60,1.58,57.1,49.7,7.4,0.2,61585,Harry Kane - 30,Hugo Lloris,
8,9,Brentford,38,15,14,9,58,46,12,59,1.55,56.8,49.9,6.8,0.18,17078,Ivan Toney - 20,David Raya,
9,10,Fulham,38,15,7,16,55,53,2,52,1.37,46.2,63.8,-17.6,-0.46,23746,Aleksandar Mitrović - 14,Bernd Leno,


This is the standings table for the 2022-2023 season. The aim is to scrape each team's match stats. To do that, the first step is to find the link to each team's page, which holds the detailed information.

In [3]:
team_links = standing_table.find_all('a')
team_links_squads= [link.get('href') for link in team_links if link.get('href') and '/squads/' in link.get('href')]
team_links_squads = [f'https://fbref.com{link}' for link in team_links_squads]
team_links_squads  # links to each team's page

['https://fbref.com/en/squads/b8fd03ef/2022-2023/Manchester-City-Stats',
 'https://fbref.com/en/squads/18bb7c10/2022-2023/Arsenal-Stats',
 'https://fbref.com/en/squads/19538871/2022-2023/Manchester-United-Stats',
 'https://fbref.com/en/squads/b2b47a98/2022-2023/Newcastle-United-Stats',
 'https://fbref.com/en/squads/822bd0ba/2022-2023/Liverpool-Stats',
 'https://fbref.com/en/squads/d07537b9/2022-2023/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/8602292d/2022-2023/Aston-Villa-Stats',
 'https://fbref.com/en/squads/361ca564/2022-2023/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/cd051869/2022-2023/Brentford-Stats',
 'https://fbref.com/en/squads/fd962109/2022-2023/Fulham-Stats',
 'https://fbref.com/en/squads/47c64c55/2022-2023/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/cff3d9bb/2022-2023/Chelsea-Stats',
 'https://fbref.com/en/squads/8cec06e1/2022-2023/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/7c21e445/2022-2023/West-Ham-United-Stats'
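
Each squad URL carries the team name in its last path segment, which the scraping loop further down uses to label rows. A small sketch of that extraction in isolation; the function name `team_name_from_url` is a hypothetical wrapper around the same string operations:

```python
def team_name_from_url(team_url: str) -> str:
    """Turn '.../Manchester-City-Stats' into 'Manchester City'."""
    last_segment = team_url.rstrip("/").split("/")[-1]
    # Drop the trailing '-Stats' and turn the remaining hyphens into spaces.
    return last_segment.removesuffix("-Stats").replace("-", " ")


print(team_name_from_url(
    "https://fbref.com/en/squads/b8fd03ef/2022-2023/Manchester-City-Stats"
))  # Manchester City
```

Names derived this way are the long forms from the URL (for example `Manchester United`), while the `Squad` column of the standings table uses shortened forms (for example `Manchester Utd`), so matching the two would need an explicit mapping.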

Scraping data for four seasons: 2019-2020, 2020-2021, 2021-2022 and 2022-2023.

In [4]:
# create a list of years
years = list(range(2023,2019, -1))
years

[2023, 2022, 2021, 2020]
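
Each entry in this list is the year in which a season ends, so 2023 stands for the 2022-2023 season. The sketch below shows how a season label and a season URL could be derived from such a year. It assumes that every season page follows the same URL pattern as the 2022-2023 page used in this notebook, which is not verified here; the notebook itself follows the "previous season" link on each page instead. The helper names are hypothetical:

```python
BASE = "https://fbref.com/en/comps/9"


def season_label(end_year: int) -> str:
    """Map the year a season ends to a label such as '2022-2023'."""
    return f"{end_year - 1}-{end_year}"


def season_url(end_year: int) -> str:
    """Build a season URL, assuming the 2022-2023 pattern holds for other seasons."""
    label = season_label(end_year)
    return f"{BASE}/{label}/{label}-Premier-League-Stats"


for year in range(2023, 2019, -1):
    print(year, season_url(year))
```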

In [5]:
all_matches_2020_2023 = [] # append each dataframe into this list

url = 'https://fbref.com/en/comps/9/2022-2023/2022-2023-Premier-League-Stats'
for year in years: # iterate through each year
    
    # Set user agent header to simulate a web browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win 64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
    data = requests.get(url, headers = headers)   # Send a GET request to the specified URL, with headers
    soup = bf(data.text, 'html.parser')  # Parse the HTML content of the response using BeautifulSoup
    standings_table = soup.select('table.stats_table')[0]   # Select the standings table from the HTML page
    team_links = standings_table.find_all('a')   # Find all links ('a' tags) within the standings table
    
    # Extract links that contain '/squads/' in their href attribute
    team_links_squads= [link.get('href') for link in team_links if link.get('href') and '/squads/' in link.get('href')]

    # Prepend 'https://fbref.com' to each squad link
    team_links_squads = [f'https://fbref.com{link}' for link in team_links_squads]
    
    # Iterate over each team's page
    for team_url in team_links_squads:
        data = requests.get(team_url, headers = headers)   # Send a GET request to the team URL, with the same headers
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")  # extracting the teamname from the link 
        
        # Read HTML tables from the response where the table contains 'Scores & Fixtures'
        matches = pd.read_html(data.text, match = 'Scores & Fixtures')[0] 
        matches['Team'] = team_name  # add team name as a new column
        matches['Year'] = year  # add year as a new column
        matches = matches[matches['Comp'] == 'Premier League']  # we are just interested in premier league matches
        matches = matches.drop('Match Report', axis = 1)  # this column is irrelevant
        matches.columns = [column.lower() for column in matches.columns]  # lower case each column name
        all_matches_2020_2023.append(matches)    # Append matches DataFrame to the list for the specified years
        time.sleep(1) # Sleep for 1 second to be polite with web scraping
        
    prevnext_div  = soup.select('div.prevnext')
    previous_season_link = prevnext_div[0].find('a', class_='prev')['href']  # link to previous season
    url = f'https://fbref.com{previous_season_link}'
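
The loop above pauses for one second between team pages. If the site answers with HTTP 429 (Too Many Requests), a longer wait followed by a retry may be needed. The following is a minimal sketch of that idea, not something the notebook does; `fetch_with_backoff` and its parameters are hypothetical, and `get` is assumed to be any callable that behaves like `requests.get`:

```python
import time


def fetch_with_backoff(get, url, headers=None, retries=3, base_delay=2.0,
                       sleep=time.sleep):
    """Call get(url, headers=...) and retry with growing pauses on HTTP 429."""
    for attempt in range(retries + 1):
        response = get(url, headers=headers)
        if response.status_code != 429:
            return response
        if attempt < retries:
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * 2 ** attempt)
    # All attempts were rate limited; hand back the last response as is.
    return response
```

Inside the loop this could be called as `fetch_with_backoff(requests.get, team_url, headers=headers)`. The `sleep` parameter exists only so the waiting can be replaced in tests.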
   

In [7]:
# Concatenate multiple DataFrames into one, ignoring existing indices
all_seasons_data = pd.concat(all_matches_2020_2023, ignore_index=True) 

In [8]:
all_seasons_data

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,xg,xga,poss,attendance,captain,formation,referee,notes,team,year
0,2022-08-07,16:30,Premier League,Matchweek 1,Sun,Away,W,2,0,West Ham,2.2,0.5,75.0,62443.0,İlkay Gündoğan,4-3-3,Michael Oliver,,Manchester City,2023
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,0,Bournemouth,1.7,0.1,67.0,53453.0,İlkay Gündoğan,4-2-3-1,David Coote,,Manchester City,2023
2,2022-08-21,16:30,Premier League,Matchweek 3,Sun,Away,D,3,3,Newcastle Utd,2.1,1.8,69.0,52258.0,İlkay Gündoğan,4-3-3,Jarred Gillett,,Manchester City,2023
3,2022-08-27,15:00,Premier League,Matchweek 4,Sat,Home,W,4,2,Crystal Palace,2.2,0.1,74.0,53112.0,Kevin De Bruyne,4-2-3-1,Darren England,,Manchester City,2023
4,2022-08-31,19:30,Premier League,Matchweek 5,Wed,Home,W,6,0,Nott'ham Forest,3.3,0.7,74.0,53409.0,İlkay Gündoğan,4-2-3-1,Paul Tierney,,Manchester City,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3035,2020-07-07,18:00,Premier League,Matchweek 34,Tue,Away,L,1,2,Watford,1.2,1.2,56.0,,Alexander Tettey,4-2-3-1,Anthony Taylor,,Norwich City,2020
3036,2020-07-11,12:30,Premier League,Matchweek 35,Sat,Home,L,0,4,West Ham,0.6,3.5,53.0,,Alexander Tettey,4-2-3-1,Kevin Friend,,Norwich City,2020
3037,2020-07-14,20:15,Premier League,Matchweek 36,Tue,Away,L,0,1,Chelsea,0.1,2.5,33.0,,Alexander Tettey,4-1-4-1,Jonathan Moss,,Norwich City,2020
3038,2020-07-18,17:30,Premier League,Matchweek 37,Sat,Home,L,0,2,Burnley,0.3,1.8,42.0,,Alexander Tettey,4-2-3-1,Kevin Friend,,Norwich City,2020
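
Because every team's fixture list is scraped, each fixture appears twice in this table, once from each side. A sketch of reducing it to one row per fixture by keeping only the `Home` rows; the helper `one_row_per_match` and the renamed columns are hypothetical and not part of the original notebook:

```python
import pandas as pd


def one_row_per_match(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the home side's row so each fixture appears once."""
    home = df[df["venue"] == "Home"].copy()
    home = home.rename(columns={
        "team": "home_team",
        "opponent": "away_team",
        "gf": "home_goals",
        "ga": "away_goals",
    })
    columns = ["year", "date", "home_team", "away_team", "home_goals", "away_goals"]
    return home[columns].reset_index(drop=True)
```

Note that `home_team` then holds the long name taken from the URL while `away_team` holds FBref's shortened opponent name, so the two columns are not directly comparable without a name mapping.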


In [9]:
# Write the concatenated DataFrame to an Excel file
with pd.ExcelWriter('EPL_stats_2020_2023.xlsx') as writer:
    # Write the DataFrame to a sheet named 'EPL', without including the index
    all_seasons_data.to_excel(writer, sheet_name='EPL', index=False)

#### Conclusion:
By systematically collecting and organizing English Premier League match data from FBref, this project aims to provide valuable insights and resources for football enthusiasts, analysts, and researchers interested in exploring and understanding the dynamics of one of the world's most popular football leagues.