Cell 1: Import Libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

Cell 2: Load Raw Data

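This cell loads the raw FBref CSV file you created in the last notebook. The file is read with header=None so that every row, including the header row, comes in as data; the first row is then promoted to column names and dropped from the data. As a result, the row index starts at 1 and every value is initially stored as a string.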



In [5]:
# Load the raw data from the CSV file
# Load without header initially to capture the actual header row
df_raw = pd.read_csv('raw_team_data.csv', header=None)

# Set the first row as the header
df_raw.columns = df_raw.iloc[0]

# Drop the first row (which is now the header)
df_raw = df_raw[1:].copy()


print("Successfully loaded raw data. Here are the first 5 rows:")
display(df_raw.head())

Successfully loaded raw data. Here are the first 5 rows:


Unnamed: 0,Rk,Squad,MP,W,D,L,GF,GA,GD,Pts,Pts/MP,xG,xGA,xGD,xGD/90,Last 5,Attendance,Top Team Scorer,Goalkeeper,Notes
1,1,Arsenal,4,3,0,1,9,1,8,9,2.25,6.2,2.4,3.9,0.97,W W L W,60139,Viktor Gyökeres - 3,David Raya,
2,2,Tottenham,4,3,0,1,8,1,7,9,2.25,4.8,4.6,0.2,0.05,W W L W,61164,"Richarlison, Brennan Johnson - 2",Guglielmo Vicario,
3,3,Liverpool,3,3,0,0,8,4,4,9,3.0,5.8,3.3,2.5,0.62,W W W W,60385,"Mohamed Salah, Hugo Ekitike - 2",Alisson,
4,4,Bournemouth,4,3,0,1,6,5,1,9,2.25,6.0,3.5,2.5,0.63,L W W W,11119,Antoine Semenyo - 3,Đorđe Petrović,
5,5,Chelsea,4,2,2,0,9,3,6,8,2.0,7.7,3.7,4.0,1.0,D W W D,39712,"João Pedro, Enzo Fernández... - 2",Robert Sánchez,
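
The header promotion above can be compared with letting pandas read the header directly. The following is a minimal sketch on a made-up two-row sample (`sample_csv` is hypothetical and only stands in for raw_team_data.csv); it shows how the two approaches differ in column types and row index.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample standing in for the real raw_team_data.csv
sample_csv = "Rk,Squad,GF,xG\n1,Arsenal,9,6.2\n2,Tottenham,8,4.8\n"

# Approach used in the cell above: read every row as data, then promote row 0
manual = pd.read_csv(io.StringIO(sample_csv), header=None)
manual.columns = manual.iloc[0]
manual = manual[1:].copy()

# Alternative: let pandas treat the first line as the header
direct = pd.read_csv(io.StringIO(sample_csv), header=0)

# The manual route keeps numbers as strings and starts the index at 1
print("manual GF numeric?", is_numeric_dtype(manual["GF"]))
print("direct GF numeric?", is_numeric_dtype(direct["GF"]))
print("manual index:", manual.index.tolist())
print("direct index:", direct.index.tolist())
```

Either approach yields the same column names; the difference matters later, when values need to be numeric and when rows are matched by index.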


Cell 3: Feature Selection

The raw table contains many columns, but we only need the ones that describe a team's attacking and defensive style. We'll select Goals For (GF) and Goals Against (GA), which record actual results, together with the "Expected Goals" metrics (xG, xGA, and their difference xGD), which estimate the quality of chances created and conceded.

In [6]:
# Select the columns we need for our analysis
# We'll keep 'Squad' for identification and select key performance metrics.
features_to_select = [
    'Squad',
    'GF',      # Goals For: A measure of attacking output
    'GA',      # Goals Against: A measure of defensive solidity
    'xG',      # Expected Goals For: Quality of chances created
    'xGA',     # Expected Goals Against: Quality of chances conceded
    'xGD'      # Expected Goal Difference: Overall performance indicator
]

df_selected = df_raw[features_to_select].copy()

# Remove any rows that have missing values
df_selected.dropna(inplace=True)

print("Selected relevant features for analysis:")
display(df_selected.head())

Selected relevant features for analysis:


Unnamed: 0,Squad,GF,GA,xG,xGA,xGD
1,Arsenal,9,1,6.2,2.4,3.9
2,Tottenham,8,1,4.8,4.6,0.2
3,Liverpool,8,4,5.8,3.3,2.5
4,Bournemouth,6,5,6.0,3.5,2.5
5,Chelsea,9,3,7.7,3.7,4.0
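
When a file is read with header=None, the metric columns hold strings, so dropna() only catches cells that were empty in the CSV. One possible way to make the columns numeric first is sketched below, assuming a small hypothetical frame (`df_demo`) with one unparseable entry; `pd.to_numeric` with errors="coerce" turns such entries into NaN so that dropna() can remove them.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical string-typed frame, similar to what a header=None load produces
df_demo = pd.DataFrame({
    "Squad": ["Arsenal", "Tottenham", "Liverpool"],
    "GF": ["9", "8", "n/a"],     # "n/a" is an invented bad value
    "xG": ["6.2", "4.8", "5.8"],
})

# Convert each metric column; values that cannot be parsed become NaN
numeric_cols = ["GF", "xG"]
for col in numeric_cols:
    df_demo[col] = pd.to_numeric(df_demo[col], errors="coerce")

# Now dropna() removes the row with the unparseable value
df_demo = df_demo.dropna()

print(df_demo.dtypes)
print(df_demo)
```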


Cell 4: Prepare Data for Scaling

Methods that rely on distances or variances (such as K-Means and PCA) are sensitive to the scale of each feature. A feature measured in large units, such as season-long goal totals, would otherwise outweigh one measured in small units, such as a per-90 rate like xGD/90. Scaling fixes this by putting all features on a comparable scale so that each one carries similar weight.
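
To illustrate the effect, here is a small sketch with invented numbers. It pairs a large-scale column (attendance, which appears in the raw table but is not one of the selected features) with a small-scale one (xG) and compares Euclidean distances before and after scaling. The array `X` and the helper `dist` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical teams; columns are [attendance, xG]
X = np.array([
    [60000.0, 6.0],   # team 0
    [60500.0, 1.0],   # team 1: similar attendance, very different xG
    [11000.0, 6.1],   # team 2: very different attendance, similar xG
])


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Unscaled: attendance dominates, so team 1 looks far closer to team 0
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# Scaled: both columns contribute on comparable terms
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(f"raw distances:    0-1 = {raw_01:.1f}, 0-2 = {raw_02:.1f}")
print(f"scaled distances: 0-1 = {scaled_01:.2f}, 0-2 = {scaled_02:.2f}")
```

Before scaling, the xG gap between teams 0 and 1 is almost invisible next to the attendance gap between teams 0 and 2; after scaling, the two gaps are of similar size.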

Here, we'll separate the team names from the numerical data that needs to be scaled.

In [7]:
# Separate the team names (which don't get scaled) from the numerical data
teams = df_selected['Squad']
numerical_data = df_selected.drop('Squad', axis=1)

Cell 5: Scale the Numerical Data

We use the StandardScaler to transform our data. It will rescale each feature to have a mean of 0 and a standard deviation of 1.
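
As a sanity check on what that means, the sketch below reproduces the transformation by hand on the GF and GA values from the five preview rows. Because it uses only those five rows, the numbers will not match the full-table output. StandardScaler divides by the population standard deviation (ddof=0), which is NumPy's default but not the pandas default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# GF and GA for the five teams shown in the preview (not the full table)
X = np.array([
    [9.0, 1.0],
    [8.0, 1.0],
    [8.0, 4.0],
    [6.0, 5.0],
    [9.0, 3.0],
])

# Manual z-score: subtract the column mean, divide by the population std
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std uses ddof=0 by default

# Same transformation via scikit-learn
scaled = StandardScaler().fit_transform(X)

print("matches manual z-score:", np.allclose(manual, scaled))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```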



In [8]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data and transform it
scaled_data = scaler.fit_transform(numerical_data)

# Create a new DataFrame with the scaled data and the original column names
df_scaled = pd.DataFrame(scaled_data, columns=numerical_data.columns)

# Add the team names back to the scaled DataFrame.
# Pass the underlying values rather than the Series: `teams` is indexed from 1
# while df_scaled is indexed from 0, so inserting the Series directly would
# shift every name down one row and leave the first row without a team.
df_scaled.insert(0, 'Squad', teams.to_numpy())

print("Numerical data has been successfully scaled. Here's a preview:")
display(df_scaled.head())

Numerical data has been successfully scaled. Here's a preview:


Unnamed: 0,Squad,GF,GA,xG,xGA,xGD
0,Arsenal,1.833714,-1.407233,0.652753,-1.64535,1.403277
1,Tottenham,1.41217,-1.407233,-0.217584,-0.32907,0.071963
2,Liverpool,1.41217,-0.250603,0.404085,-1.106872,0.899537
3,Bournemouth,0.569084,0.13494,0.528419,-0.98721,0.899537
4,Chelsea,1.833714,-0.636146,1.585258,-0.867548,1.439259
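
When a Series is inserted into a DataFrame, pandas matches rows by index label, not by position. The following sketch uses hypothetical three-row data (`teams_demo`, `scaled_demo`) to show the difference between inserting a Series whose index starts at 1 and inserting its underlying array.

```python
import pandas as pd

# Hypothetical names indexed from 1, and scaled values indexed from 0
teams_demo = pd.Series(["Arsenal", "Tottenham", "Liverpool"], index=[1, 2, 3])
scaled_demo = pd.DataFrame({"GF": [1.8, 1.4, 1.4]})

# Inserting the Series aligns on labels: row 0 has no match, label 3 is lost
aligned = scaled_demo.copy()
aligned.insert(0, "Squad", teams_demo)

# Inserting the raw values assigns them in order, ignoring the index
positional = scaled_demo.copy()
positional.insert(0, "Squad", teams_demo.to_numpy())

print(aligned)
print(positional)
```

Resetting the index on the Series (for example with reset_index(drop=True)) is another way to get positional behaviour, provided both objects have the same number of rows.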


Cell 6: Save Processed Data

Finally, we save our cleaned, selected, and scaled data to a new CSV file. This file will be the input for our next notebook on modeling.

In [9]:
# Define the output path for the processed data
OUTPUT_CSV_PATH = 'processed_team_data.csv'

# Save the scaled DataFrame to a new CSV file
df_scaled.to_csv(OUTPUT_CSV_PATH, index=False)

print(f"✅ Preprocessed data saved successfully to '{OUTPUT_CSV_PATH}'")

✅ Preprocessed data saved successfully to 'processed_team_data.csv'
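
A quick way to confirm the file is usable by the next notebook is to read it back and inspect its columns. The sketch below does this with a hypothetical two-row frame written to a temporary directory, so it does not touch the real output file. It also shows why index=False is passed: without it, the row index is written as an extra unnamed column that reappears as "Unnamed: 0" on load.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_scaled
df_demo = pd.DataFrame({
    "Squad": ["Arsenal", "Chelsea"],
    "GF": [1.83, 1.83],
    "GA": [-1.41, -0.64],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # With index=False the file contains only the data columns
    clean_path = os.path.join(tmp_dir, "processed_team_data.csv")
    df_demo.to_csv(clean_path, index=False)
    df_back = pd.read_csv(clean_path)

    # Without it, the index is saved and comes back as an extra column
    indexed_path = os.path.join(tmp_dir, "with_index.csv")
    df_demo.to_csv(indexed_path)
    df_with_index = pd.read_csv(indexed_path)

print(df_back.columns.tolist())
print(df_with_index.columns.tolist())
```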
