GOALS — Game Outcome and Analytics Learning System

"Given recent player performance trends, what can we expect to happen in the next match?"

GOALS is a multi-stage machine learning pipeline that analyses historical player statistics across the Premier League and La Liga to deliver interpretable forecasts of future match outcomes. Rather than recapping what has already happened on the park, this system steps forward and asks the question every manager, pundit, and supporter truly wants answered — then answers it with data.

Course: CS7180 - Vibe Coding
Team: Amine Kebichi · Nicholas Annunziata
Leagues: Premier League · La Liga (4 seasons each, 2021/22 – 2024/25)

The Pipeline

Three complementary ML tasks working in concert, like a well-drilled midfield unit:

FBref + FotMob Data
        │
        ▼
 Feature Engineering
        │
        ▼
 Position-Aware          ──►  Regression
 Composite Score                  │
        │                         ▼
        │               K-Means Clustering
        │                         │
        ▼                         ▼
 Team Feature  ◄──────────────────┘
  Aggregation
        │
        ▼
 Match Outcome
 Classification
   (W / D / L)

Stage	Task	Goal
Regression	Predict future player composite scores	Quantify individual form heading into the next fixture
Clustering	Discover data-driven player archetypes	Map statistical playstyles to recognisable tactical roles
Classification	Predict match outcomes (Win / Draw / Loss)	Forecast results from aggregated team performance features

Research Questions

How strongly does recent individual player form influence team-level match outcomes?
Which statistical metrics — xG, progressive passes, defensive actions — are the most reliable indicators of what a player will do next?
Do data-driven player archetypes correspond to recognisable tactical roles on the pitch?
How effectively can aggregated player-level predictions be combined to forecast the final result?

Dataset

Sources

Source	Role	Access
FBref	Primary — season-level player stats: xG, progressive passes, pressures, shooting, misc	Scraped via `GOALS_notebook.ipynb`
FotMob	Supplementary — per-match player stats: dribbles, tackles won, diving saves, chances created, player ratings	Scraped via `fotmob_final.ipynb`
StatsBomb Open Data	Validation + contextual match info	External reference

FotMob's proprietary per-match player rating serves as an independent external check on the composite performance scores constructed in this project.

Scope and Scale

Property	Value
Leagues	Premier League (England) · La Liga (Spain)
Clubs	All 20 clubs per league across seasons in scope
Seasons	4 per league (2021/22 – 2024/25)
Matches	~1,520 per league (~380 per season)
Player-match observations	~20,000–30,000 per league
Features per observation	~40 statistical metrics
Labels	Match outcome: Win / Draw / Loss

Feature Categories

Category	Example Metrics
Attacking	Goals, Shots on Target, xG, Dribbles Completed
Playmaking	Assists, Chances Created, xA, Progressive Passes
Passing	Pass Completion Rate, Through Balls, Crosses Completed
Defensive	Tackles Won, Interceptions, Clearances, Aerial Duels Won, Blocks
Goalkeeping	Saves, Diving Saves, Saves Inside Box, High Claims, Acted as Sweeper
Physical	Recoveries, Touches, Dispossessed, Distance Covered
Contextual	Opponent Strength, Home/Away Indicator, Match Importance

Train / Test Split

The split is strictly temporal — random shuffling is explicitly avoided, as it would allow future information to leak into training and produce inflated, misleading results. Football data flows in one direction.

Partition	Seasons	Approx. Matches (per league)	Approx. Player-Match Obs. (per league)
Training	2021/22, 2022/23, 2023/24	~1,140	~15,000–22,500
Testing	2024/25	~380	~5,000–7,500
Total	4 seasons	~1,520	~20,000–30,000
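A minimal sketch of the season-based split described above, assuming each player-match row carries a `season` field in the FotMob folder format. The field name, the helper `temporal_split`, and the toy rows are illustrative and not taken from the project code.

```python
# Hypothetical sketch: partition rows by season label, never by random shuffle.
TRAIN_SEASONS = ("2021_2022", "2022_2023", "2023_2024")
TEST_SEASON = "2024_2025"


def temporal_split(rows):
    """Return (train, test) lists, partitioned purely by season."""
    train = [r for r in rows if r["season"] in TRAIN_SEASONS]
    test = [r for r in rows if r["season"] == TEST_SEASON]
    return train, test


# Toy rows (made-up values) standing in for player-match observations.
rows = [
    {"season": "2021_2022", "player": "A", "score": 0.4},
    {"season": "2023_2024", "player": "A", "score": 0.9},
    {"season": "2024_2025", "player": "A", "score": 0.7},
    {"season": "2022_2023", "player": "B", "score": -0.2},
]
train, test = temporal_split(rows)

# Season labels sort chronologically, so this checks that no training row
# comes from a season at or after the test season.
latest_train = max(r["season"] for r in train)
earliest_test = min(r["season"] for r in test)
```

Because the partition depends only on the season label, re-running it on the same data always yields the same split.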

Within the three training seasons, hyperparameter tuning uses time-series-aware cross-validation — the data is walked forward chronologically, with earlier matchdays training and later matchdays validating in each fold.
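The walk-forward idea can be sketched at season granularity, which is the granularity of the CV folds reported later in this README. The helper `walk_forward_folds` is a hypothetical name. A matchday-level version would follow the same expanding-window pattern with finer time steps.

```python
def walk_forward_folds(seasons):
    """Yield (train_seasons, val_season) pairs with an expanding training window."""
    ordered = sorted(seasons)
    for i in range(1, len(ordered)):
        # Everything before position i trains; the season at position i validates.
        yield ordered[:i], ordered[i]


folds = list(walk_forward_folds(["2021_2022", "2022_2023", "2023_2024"]))
for train_seasons, val_season in folds:
    print(f"train={train_seasons} -> validate={val_season}")
```

Each validation season is strictly later than every season used to train that fold, so tuning never sees future matches.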

Position-Aware Composite Performance Scores

A pivotal design decision: rather than applying a single universal formula to every player on the pitch (which would be like judging a goalkeeper on the same criteria as a striker), the system constructs four position-specific scores that faithfully reflect the distinct tactical contribution expected of each role.

All input metrics are first z-score normalised across the training set before weights are applied.
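As a rough illustration of the normalisation step, the sketch below estimates the mean and standard deviation from training values only and then reuses them on held-out values. The function names and toy numbers are assumptions. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Estimate mean and population std from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against constant columns to avoid division by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def apply_zscore(values, mu, sigma):
    """Standardise values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


train_metric = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy training-season values
test_metric = [3.0, 6.0]                  # toy held-out values

mu, sigma = fit_zscore(train_metric)
z_train = apply_zscore(train_metric, mu, sigma)
z_test = apply_zscore(test_metric, mu, sigma)  # reuses training statistics
```

Fitting on the training set alone keeps test-season information out of the scaling parameters, in line with the temporal split.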

Attacker Score

Attacker performance lives and dies by goal contributions and chance creation.

Score_ATT = 0.25·(G+A) + 0.20·xG + 0.15·xA + 0.15·Dribbles
          + 0.10·Shots + 0.10·ChancesCreated + 0.05·Recoveries

Metric	Source	Weight
Goals + Assists (per 90)	FBref / FotMob	0.25
Expected Goals (xG)	FBref / FotMob	0.20
Expected Assists (xA)	FBref / FotMob	0.15
Successful Dribbles	FotMob	0.15
Total Shots	FotMob	0.10
Chances Created	FotMob	0.10
Ball Recoveries	FotMob	0.05

Midfielder Score

The modern midfielder must do everything — link defence to attack, win the ball back, and still arrive late into the box.

Score_MID = 0.20·ProgPass + 0.20·ChancesCreated + 0.15·xA + 0.15·(G+A)
          + 0.15·TacklesWon + 0.10·Interceptions + 0.05·Recoveries

Metric	Source	Weight
Progressive Passes	FBref	0.20
Chances Created	FotMob	0.20
Expected Assists (xA)	FBref / FotMob	0.15
Goals + Assists (per 90)	FBref / FotMob	0.15
Tackles Won	FotMob	0.15
Interceptions	FBref / FotMob	0.10
Ball Recoveries	FotMob	0.05

Defender Score

A defender's first duty is to defend — and this score holds them to exactly that standard.

Score_DEF = 0.25·TacklesWon + 0.20·AerialDuelsWon + 0.20·Clearances
          + 0.15·Interceptions + 0.10·Blocks + 0.10·ProgPass

Metric	Source	Weight
Tackles Won	FotMob	0.25
Aerial Duels Won	FotMob	0.20
Clearances	FBref / FotMob	0.20
Interceptions	FBref / FotMob	0.15
Blocks	FotMob	0.10
Progressive Passes	FBref	0.10

Goalkeeper Score

Built exclusively from shot-stopping and sweeping metrics — a clean sheet starts here.

Score_GK = 0.30·Saves + 0.25·xGOTFaced + 0.15·DivingSaves
         + 0.15·SavesInsideBox + 0.10·HighClaims + 0.05·SweeperActions

Metric	Source	Weight
Saves	FotMob	0.30
xGoals on Target Faced (xGOT)	FotMob	0.25
Diving Saves	FotMob	0.15
Saves Inside Box	FotMob	0.15
High Claims	FotMob	0.10
Acted as Sweeper	FotMob	0.05

Position labels are sourced from FotMob's per-match lineup data. If a player shifts position between fixtures, the scoring scheme shifts right along with them.
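The four formulas above are all weighted sums over z-scored metrics, so they can be sketched as one lookup table plus one function. The weights are copied from the tables above. The metric key names, the `composite_score` helper, and the toy z-scores are illustrative assumptions.

```python
# Weights copied from the position tables; key names are hypothetical.
WEIGHTS = {
    "ATT": {"goals_assists": 0.25, "xg": 0.20, "xa": 0.15, "dribbles": 0.15,
            "shots": 0.10, "chances_created": 0.10, "recoveries": 0.05},
    "MID": {"prog_passes": 0.20, "chances_created": 0.20, "xa": 0.15,
            "goals_assists": 0.15, "tackles_won": 0.15, "interceptions": 0.10,
            "recoveries": 0.05},
    "DEF": {"tackles_won": 0.25, "aerials_won": 0.20, "clearances": 0.20,
            "interceptions": 0.15, "blocks": 0.10, "prog_passes": 0.10},
    "GK": {"saves": 0.30, "xgot_faced": 0.25, "diving_saves": 0.15,
           "saves_inside_box": 0.15, "high_claims": 0.10,
           "sweeper_actions": 0.05},
}


def composite_score(position, z_metrics):
    """Weighted sum of z-scored metrics using the weights for `position`."""
    return sum(w * z_metrics[name] for name, w in WEIGHTS[position].items())


# Toy z-scores for one player; the same numbers scored under two roles.
z = {"prog_passes": 1.0, "chances_created": 0.5, "xa": 0.0,
     "goals_assists": 0.0, "tackles_won": 2.0, "interceptions": 1.0,
     "recoveries": 0.0, "aerials_won": 1.0, "clearances": 0.0, "blocks": 0.0}

mid_score = composite_score("MID", z)
def_score = composite_score("DEF", z)
```

With these toy inputs the same stat line scores higher under the defender weights than under the midfielder weights, which is the effect of switching the scoring scheme when the position label changes.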

Algorithms and Evaluation

Algorithms

Algorithm	Stage	Purpose
Ridge Regression	Regression	Predict composite scores; L2 regularisation manages multicollinearity between football statistics
Random Forest	Regression + Classification	Captures nonlinear relationships and feature interactions
K-Means	Clustering	Groups players into statistical archetypes
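To illustrate why L2 regularisation helps with correlated inputs, the sketch below fits ordinary least squares and Ridge on two nearly collinear synthetic features. The data, the `alpha` value, and the variable names are assumptions for demonstration only and say nothing about the project's actual hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two synthetic features that are almost copies of each other,
# mimicking the xG / goals correlation.
xg = rng.normal(size=n)
goals = xg + rng.normal(scale=0.05, size=n)
X = np.column_stack([xg, goals])
y = xg + goals + rng.normal(scale=0.5, size=n)  # toy target

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha chosen arbitrarily for the demo

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
```

The Ridge penalty shrinks the coefficient vector, so the weight is shared between the correlated features instead of swinging between them.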

Baseline Models

Task	Baseline
Regression	Mean composite score predictor
Classification	Majority-class predictor; last-match result predictor
Clustering	Random cluster assignment
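The two classification baselines in the table can be sketched in a few lines. The helper names, the default prediction for a team with no history, and the toy labels are all illustrative assumptions.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return the most frequent outcome seen in training."""
    return Counter(train_labels).most_common(1)[0][0]


def last_result_baseline(previous_results, default="W"):
    """Predict a repeat of the team's most recent result (default if none)."""
    return previous_results[-1] if previous_results else default


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train_labels = ["W", "W", "D", "L", "W", "L"]  # toy outcomes
test_labels = ["W", "D", "L", "W"]

majority = majority_class_baseline(train_labels)
majority_acc = accuracy(test_labels, [majority] * len(test_labels))
repeat_pred = last_result_baseline(["W", "L"])
```

Any trained classifier should beat both numbers on the held-out season to justify its added complexity.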

Evaluation Metrics

Task	Metrics
Regression	MSE, RMSE, R²
Classification	Accuracy, Precision, Recall, F1 Score (macro), Confusion Matrix
Clustering	Silhouette Score, Elbow Method (inertia vs. k), PCA Visualisation
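For the clustering metrics, a small sketch on synthetic data shows how inertia (for the elbow plot) and the silhouette score can be collected across candidate values of k. The three toy blobs, the range of k, and the variable names are assumptions. Real player features would replace `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three well-separated synthetic "archetypes" in a 2-D feature space.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(50, 2)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_                         # elbow method input
    silhouettes[k] = silhouette_score(X, model.labels_)  # cluster separation

best_k = max(silhouettes, key=silhouettes.get)
print("k with highest silhouette:", best_k)
```

On real data the two criteria may disagree, which is one reason to inspect the PCA projection alongside them.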

Class weights are applied throughout classification: La Liga's home-win bias (~45%) would otherwise push a naive model toward predicting home wins and neglecting draws and away wins.
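The sketch below reproduces the arithmetic behind scikit-learn's `class_weight='balanced'` heuristic, `n_samples / (n_classes * class_count)`, on a toy label distribution. The 45 / 25 / 30 counts are an assumed illustration based on the approximate percentages quoted in this README, not measured values.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * count), as in scikit-learn."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy distribution: 45 home wins, 25 draws, 30 away wins.
labels = ["H"] * 45 + ["D"] * 25 + ["A"] * 30
weights = balanced_class_weights(labels)

for outcome in ("H", "D", "A"):
    print(outcome, round(weights[outcome], 3))
```

Draws, the rarest class here, receive the largest weight, so misclassifying a draw costs the model more than misclassifying a home win.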

Player-to-Team Aggregation

Individual player predictions are aggregated into team-level features before classification. For each match:

Mean predicted performance of the starting XI
Minutes-weighted performance averages
Offensive vs. defensive contribution totals across the lineup
Distribution of player archetypes within the selected squad

These aggregated features serve as the primary inputs to the match outcome classifier.
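A minimal sketch of the aggregation listed above, assuming each player record holds a predicted score, minutes, a position label, and a cluster id. The field names, the choice to count ATT and MID as offensive and DEF and GK as defensive, and the three-player toy lineup are all assumptions.

```python
from collections import Counter

OFFENSIVE = ("ATT", "MID")  # assumed grouping, not confirmed by the project
DEFENSIVE = ("DEF", "GK")


def aggregate_team(players):
    """Collapse per-player predictions into one team-level feature dict."""
    scores = [p["predicted_score"] for p in players]
    total_minutes = sum(p["minutes"] for p in players)
    archetypes = Counter(p["archetype"] for p in players)
    return {
        "mean_score": sum(scores) / len(scores),
        "minutes_weighted_score": sum(
            p["predicted_score"] * p["minutes"] for p in players
        ) / total_minutes,
        "offensive_total": sum(
            p["predicted_score"] for p in players if p["position"] in OFFENSIVE
        ),
        "defensive_total": sum(
            p["predicted_score"] for p in players if p["position"] in DEFENSIVE
        ),
        "archetype_share": {a: n / len(players) for a, n in archetypes.items()},
    }


lineup = [  # shortened toy lineup, not a full starting XI
    {"predicted_score": 1.0, "minutes": 90, "position": "ATT", "archetype": 0},
    {"predicted_score": 0.0, "minutes": 90, "position": "DEF", "archetype": 1},
    {"predicted_score": 2.0, "minutes": 45, "position": "MID", "archetype": 0},
]
features = aggregate_team(lineup)
```

The resulting dictionary has one entry per listed aggregate and could be flattened into a feature row for the classifier.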

Repository Structure

GOALS/
├── CLAUDE.md                        # Persistent AI session context
├── fotmob_final.ipynb               # FotMob scraper (LEAGUE_ID=47 for Premier League)
├── GOALS_notebook.ipynb             # FBref scraper (data already collected)
├── data/
│   ├── FBref/
│   │   └── premier_league/{season}/ # standard, shooting, misc, goalkeeping, playing_time CSVs
│   └── 47/{season}/                 # FotMob Premier League output
│       ├── raw/                     # Cached match JSON (one file per match_id)
│       └── output/
│           ├── outfield_players.parquet
│           ├── goalkeepers.parquet
│           ├── fixtures.parquet
│           └── player_stats.parquet
└── notebooks/
    ├── 01_data_merge.ipynb          # FBref + FotMob join (fuzzy player name matching)
    ├── 02_eda.ipynb                 # Distributions, correlations, PCA
    ├── 03_feature_engineering.ipynb # Z-score normalisation + composite score construction
    ├── 04_regression.ipynb          # Ridge + Random Forest; time-series CV
    ├── 05_clustering.ipynb          # K-Means archetypes; Silhouette + Elbow
    └── 06_classification.ipynb      # Win/Draw/Loss prediction; class-weighted

FBref season folder names: 2021-2022, 2022-2023, 2023-2024, 2024-2025
FotMob season folder names: 2021_2022, 2022_2023, 2023_2024, 2024_2025
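Because the two sources use different separators in their season folder names, a small helper can keep path construction consistent. `season_folder` and `fbref_to_fotmob` are hypothetical names used only for this sketch.

```python
SEPARATORS = {"fbref": "-", "fotmob": "_"}


def season_folder(start_year, source):
    """Build a season folder name, e.g. 2021 -> '2021-2022' or '2021_2022'."""
    return f"{start_year}{SEPARATORS[source]}{start_year + 1}"


def fbref_to_fotmob(folder_name):
    """Convert an FBref-style folder name to the FotMob style."""
    return folder_name.replace("-", "_")


fbref_names = [season_folder(y, "fbref") for y in range(2021, 2025)]
fotmob_names = [season_folder(y, "fotmob") for y in range(2021, 2025)]
```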

Data Collection

FBref (complete)

FBref data for both the Premier League and La Liga (plus Bundesliga) across all 4 seasons has already been scraped and is stored under data/FBref/.

FotMob (complete)

fotmob_final.ipynb is production-ready with rate-limiting, HMAC auth, retry logic, and idempotent JSON caching. Data for both supported leagues has been scraped across all 4 seasons and is stored under data/{league_id}/.

League	FotMob ID	Data path	Status
Premier League	`47`	`data/47/{season}/output/`	Complete
La Liga	`87`	`data/87/{season}/output/`	Complete

To re-scrape or extend to a new season, open fotmob_final.ipynb and configure Cell 1:

LEAGUE_ID = 47          # 47 = Premier League, 87 = La Liga
SEASON    = '2024/2025' # target season

Each run is fully resumable from cached JSON if interrupted.
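The resumable behaviour follows from caching one JSON file per match and skipping any match that already has a file. The sketch below shows that pattern in isolation. The function `fetch_match`, the `{match_id}.json` file naming, and the stand-in fetcher are assumptions and do not reproduce the notebook's actual request, auth, or retry code.

```python
import json
import tempfile
from pathlib import Path


def fetch_match(match_id, raw_dir, fetch_fn):
    """Return match JSON, calling fetch_fn only if no cached file exists."""
    path = Path(raw_dir) / f"{match_id}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch_fn(match_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


calls = []


def fake_fetch(match_id):
    """Stand-in for a network request; records how often it is called."""
    calls.append(match_id)
    return {"match_id": match_id, "home_goals": 2, "away_goals": 1}


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_match(101, tmp, fake_fetch)   # fetches and writes the cache
    second = fetch_match(101, tmp, fake_fetch)  # served from the cached file
```

Running the same match id twice triggers only one fetch, which is what makes an interrupted scrape safe to restart.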

Training the ML Model

Once all four FotMob seasons are scraped, train the Random Forest classifier from the terminal (activate your venv first):

python -c "from goals_app.services.ml_service import train; train()"

This loads the three training seasons (2021_2022, 2022_2023, 2023_2024), runs walk-forward cross-validation, and saves artifacts to goals_app/ml/artifacts/:

Artifact	Description
`rf_classifier.pkl`	Trained Random Forest classifier
`outfield_scaler.pkl`	StandardScaler fit on outfield training data
`gk_scaler.pkl`	StandardScaler fit on goalkeeper training data
`metrics.json`	CV fold results + confusion matrix

Expected CV performance (Premier League):

CV Fold	Train	Val	Accuracy	Macro F1
1	2021/22	2022/23	90.5%	89.7%
2	2021/22 + 2022/23	2023/24	92.9%	92.7%

Artifacts are git-ignored and must be regenerated locally before running the app. Re-run training any time new season data is scraped.

Expected Challenges

Challenge	Mitigation
No ground-truth performance score	Four position-specific composite scores constructed from domain-motivated weights; sensitivity analysis across three weighting schemes validates robustness
Multicollinearity (xG and goals correlate heavily)	Ridge regularisation (L2); correlation matrix analysis to flag redundant features
Class imbalance (~45% home wins, ~25% draws)	`class_weight='balanced'` throughout; macro F1 evaluation
Sparse minutes for fringe players	Minimum appearance threshold filter; 20-club × 4-season volume absorbs filtering without significant data loss
FBref / FotMob name mismatches	Fuzzy string matching with date-keyed match anchors; unmatched records retained from available source rather than discarded
Promotion and relegation	Clubs included only for seasons in which they were in the relevant top division; insufficient coverage → excluded from analysis
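The name-mismatch mitigation in the table above can be sketched with the standard library. This version strips accents, lowercases, and scores candidates with `difflib.SequenceMatcher`. It assumes the candidate list has already been narrowed to players from the same dated match. The helper names, the 0.85 threshold, and the made-up player names are illustrative.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name):
    """Strip accents, lowercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    target = normalise(name)
    scored = [
        (SequenceMatcher(None, target, normalise(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return candidate if score >= threshold else None


# Made-up names standing in for one source's player list in a single match.
candidates = ["Jose Alvarez", "Jose Alvaro", "Joao Alves"]
matched = best_match("José Álvarez", candidates)
unmatched = best_match("Completely Different", candidates)
```

Returning `None` below the threshold leaves the record unmatched, which fits the stated policy of retaining unmatched records from the available source.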

Timeline

Milestone	Description	Target
1	Data collection and preprocessing (FBref + FotMob)	Feb 21 – Feb 28
2	Exploratory data analysis and feature engineering	Feb 28 – Mar 14
3	Regression and clustering models	Mar 14 – Mar 28
4	Classification model and pipeline integration	Mar 21 – Apr 4
5	Evaluation, refinement, and forward fixture forecasting	Apr 4 – Apr 11
6	Final report and presentation	Apr 11 – Apr 18

Team

Member	Responsibilities
Amine Kebichi	Regression modelling, evaluation framework, report writing
Nicholas Annunziata	Clustering analysis, classification models, visualisation
Both	Data preprocessing, EDA, evaluation, presentation

References

T. Decroos and J. Van Haaren, Soccerdata: A Python package for scraping soccer data, 2023.
Sports Reference LLC, FBref advanced football statistics, fbref.com, 2024.
StatsBomb, StatsBomb Open Data, github.com/statsbomb/open-data, 2023.
