Cell 1: Import Libraries

This cell imports pandas, which handles data manipulation and provides read_html() for extracting HTML tables from a web page, and os, which is used to build file paths and create the data directory.

In [18]:
import pandas as pd
import os

Cell 2: Configuration

Here we define the URL of the data source (FBref), the output directory, and the path of the raw CSV file, and we create the directory if it does not already exist. Keeping these values in one place makes it easy to change the season or file location later.

In [22]:
# Define the directory to store our data
DATA_DIR = 'data'

# Define the URL for the Premier League season stats on FBref
URL = 'https://fbref.com/en/comps/9/Premier-League-Stats'

# Define the full output path for the raw data CSV file
OUTPUT_CSV_PATH = os.path.join(DATA_DIR, 'raw_team_data.csv')

# Ensure the data directory exists before we try to save anything to it
os.makedirs(DATA_DIR, exist_ok=True)
print(f"Directory '{DATA_DIR}' is ready.")

Directory 'data' is ready.
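
If several seasons are collected, the output path can be derived from a season label instead of being hard-coded. The following is a minimal sketch of that idea. The SEASON value and the raw_csv_path helper are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import os

DATA_DIR = 'data'
SEASON = '2025-2026'  # hypothetical season label, an assumption for illustration


def raw_csv_path(data_dir, season):
    """Build a season-specific path for the raw team data CSV."""
    return os.path.join(data_dir, f'raw_team_data_{season}.csv')


OUTPUT_CSV_PATH = raw_csv_path(DATA_DIR, SEASON)
print(OUTPUT_CSV_PATH)
```

With this layout, changing SEASON is enough to keep each season's raw file separate.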


Cell 3: Scrape Data from Web

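This is the main step. pd.read_html() fetches the page and parses every HTML <table> element into a DataFrame, returning them as a list. The main team statistics table is typically the first one on the page (all_tables[0]), but because this selection depends on position, it is worth checking the preview printed below. Note that read_html() requires an HTML parser such as lxml (or BeautifulSoup with html5lib) to be installed.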

In [23]:
try:
    # Use pandas to scrape all tables from the URL
    all_tables = pd.read_html(URL)

    # The main team stats table is the first one in the list.
    team_stats_raw = all_tables[0]

    print("✅ Data successfully scraped from FBref.")
    print("First 5 rows of the scraped data:")
    display(team_stats_raw.head())

except Exception as e:
    print(f"❌ An error occurred during scraping: {e}")
    print("Please check your internet connection or the URL.")
    # Re-raise so later cells do not run with team_stats_raw undefined
    raise

✅ Data successfully scraped from FBref.
First 5 rows of the scraped data:


,Rk,Squad,MP,W,D,L,GF,GA,GD,Pts,Pts/MP,xG,xGA,xGD,xGD/90,Last 5,Attendance,Top Team Scorer,Goalkeeper,Notes
0,1,Arsenal,4,3,0,1,9,1,8,9,2.25,6.2,2.4,3.9,0.97,W W L W,60139,Viktor Gyökeres - 3,David Raya,
1,2,Tottenham,4,3,0,1,8,1,7,9,2.25,4.8,4.6,0.2,0.05,W W L W,61164,"Richarlison, Brennan Johnson - 2",Guglielmo Vicario,
2,3,Liverpool,3,3,0,0,8,4,4,9,3.0,5.8,3.3,2.5,0.62,W W W W,60385,"Mohamed Salah, Hugo Ekitike - 2",Alisson,
3,4,Bournemouth,4,3,0,1,6,5,1,9,2.25,6.0,3.5,2.5,0.63,L W W W,11119,Antoine Semenyo - 3,Đorđe Petrović,
4,5,Chelsea,4,2,2,0,9,3,6,8,2.0,7.7,3.7,4.0,1.0,D W W D,39712,"João Pedro, Enzo Fernández... - 2",Robert Sánchez,
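
Selecting all_tables[0] depends on the page layout staying the same. As a hedged alternative, the table can be chosen by the columns it contains. The sketch below uses small hand-made DataFrames in place of a real scrape. The pick_table helper and the list of required columns are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains the columns {sorted(required)}")


# Stand-in for the list returned by pd.read_html(URL) (made-up values)
all_tables = [
    pd.DataFrame({'Player': ['A', 'B'], 'Gls': [3, 2]}),
    pd.DataFrame({'Rk': [1, 2], 'Squad': ['Team A', 'Team B'], 'Pts': [9, 8]}),
]

team_stats_raw = pick_table(all_tables, ['Rk', 'Squad', 'Pts'])
print(team_stats_raw)
```

If no table matches, the helper raises a ValueError rather than silently returning the wrong table.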


Cell 4: Save Raw Data to CSV

To avoid re-scraping the website each time we run the analysis, we save the raw, unmodified DataFrame to a CSV file in the data directory.

In [24]:
# Save the raw DataFrame to a CSV file inside the 'data' directory
team_stats_raw.to_csv(OUTPUT_CSV_PATH, index=False)

print(f"✅ Raw data saved successfully to '{OUTPUT_CSV_PATH}'")
print(f"Shape of the saved data: {team_stats_raw.shape}")

✅ Raw data saved successfully to 'data/raw_team_data.csv'
Shape of the saved data: (20, 20)
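
The caching idea behind this cell can be made explicit: reuse the CSV when it already exists and scrape only when it does not. A minimal sketch, assuming a hypothetical load_or_scrape helper and a stand-in scraping function in place of the real pd.read_html(URL) call:

```python
import os
import tempfile

import pandas as pd


def load_or_scrape(csv_path, scrape):
    """Load the cached CSV if present; otherwise call scrape() and cache the result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = scrape()
    df.to_csv(csv_path, index=False)
    return df


def fake_scrape():
    # Stand-in for pd.read_html(URL)[0]; the values are made up
    return pd.DataFrame({'Rk': [1, 2], 'Squad': ['Team A', 'Team B']})


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'raw_team_data.csv')
    first = load_or_scrape(path, fake_scrape)   # scrapes and writes the file
    second = load_or_scrape(path, fake_scrape)  # reads the cached file
    print(first.shape, second.shape)
```

Deleting the cached file is then the only step needed to force a fresh scrape.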


Cell 5: Verification

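This final cell prints a confirmation message and the dimensions (rows, columns) of the scraped DataFrame. A shape of (20, 20) corresponds to the 20 Premier League teams and 20 columns. Note that it reports on the in-memory DataFrame and does not re-read the saved file.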

In [25]:
# Print a success message and the shape of the scraped data for verification
print(f"✅ Raw data successfully scraped and saved to '{OUTPUT_CSV_PATH}'")
print(f"Shape of the raw data: {team_stats_raw.shape}")

✅ Raw data successfully scraped and saved to 'data/raw_team_data.csv'
Shape of the raw data: (20, 20)
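
A stricter check reads the saved file back and compares it with the in-memory DataFrame. The sketch below does this with a small stand-in DataFrame and a temporary directory. The verify_saved_csv helper is a hypothetical name introduced for illustration.

```python
import os
import tempfile

import pandas as pd


def verify_saved_csv(df, csv_path):
    """Check that the CSV on disk has the same shape and column names as df."""
    saved = pd.read_csv(csv_path)
    same_shape = saved.shape == df.shape
    same_columns = list(saved.columns) == [str(c) for c in df.columns]
    return same_shape and same_columns


# Stand-in for team_stats_raw (made-up values)
team_stats_raw = pd.DataFrame({'Rk': [1, 2], 'Squad': ['Team A', 'Team B'], 'Pts': [9, 8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'raw_team_data.csv')
    team_stats_raw.to_csv(csv_path, index=False)
    ok = verify_saved_csv(team_stats_raw, csv_path)

print(f"Saved file matches the DataFrame: {ok}")
```

In the notebook itself, the same helper could be called with team_stats_raw and OUTPUT_CSV_PATH after the save step.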
