"Given recent player performance trends, what can we expect to happen in the next match?"
GOALS is a multi-stage machine learning pipeline that analyses historical player statistics across the Premier League and La Liga to deliver interpretable forecasts of future match outcomes. Rather than recapping what has already happened on the park, this system steps forward and asks the question every manager, pundit, and supporter truly wants answered — then answers it with data.
Course: CS7180 - Vibe Coding

Team: Amine Kebichi · Nicholas Annunziata

Leagues: Premier League · La Liga, 4 seasons each (2021/22 – 2024/25)
Three complementary ML tasks working in concert, like a well-drilled midfield unit:
    FBref + FotMob Data
            │
            ▼
    Feature Engineering
            │
            ▼
    Position-Aware  ──►  Regression
    Composite Score          │
            │                ▼
            │        K-Means Clustering
            │                │
            ▼                ▼
    Team Feature ◄───────────┘
    Aggregation
            │
            ▼
    Match Outcome
    Classification
      (W / D / L)
| Stage | Task | Goal |
|---|---|---|
| Regression | Predict future player composite scores | Quantify individual form heading into the next fixture |
| Clustering | Discover data-driven player archetypes | Map statistical playstyles to recognisable tactical roles |
| Classification | Predict match outcomes (Win / Draw / Loss) | Forecast results from aggregated team performance features |
- How strongly does recent individual player form influence team-level match outcomes?
- Which statistical metrics — xG, progressive passes, defensive actions — are the most reliable indicators of what a player will do next?
- Do data-driven player archetypes correspond to recognisable tactical roles on the pitch?
- How effectively can aggregated player-level predictions be combined to forecast the final result?
| Source | Role | Access |
|---|---|---|
| FBref | Primary — season-level player stats: xG, progressive passes, pressures, shooting, misc | Scraped via GOALS_notebook.ipynb |
| FotMob | Supplementary — per-match player stats: dribbles, tackles won, diving saves, chances created, player ratings | Scraped via fotmob_final.ipynb |
| StatsBomb Open Data | Validation + contextual match info | External reference |
FotMob's proprietary per-match player rating serves as an independent validation signal for the composite performance scores constructed in this project: it is never used as an input to those scores, so agreement between the two is informative.
| Property | Value |
|---|---|
| Leagues | Premier League (England) · La Liga (Spain) |
| Clubs | All 20 clubs per league across seasons in scope |
| Seasons | 4 per league (2021/22 – 2024/25) |
| Matches | ~1,520 per league (~380 per season) |
| Player-match observations | ~20,000–30,000 per league |
| Features per observation | ~40 statistical metrics |
| Labels | Match outcome: Win / Draw / Loss |
| Category | Example Metrics |
|---|---|
| Attacking | Goals, Shots on Target, xG, Dribbles Completed |
| Playmaking | Assists, Chances Created, xA, Progressive Passes |
| Passing | Pass Completion Rate, Through Balls, Crosses Completed |
| Defensive | Tackles Won, Interceptions, Clearances, Aerial Duels Won, Blocks |
| Goalkeeping | Saves, Diving Saves, Saves Inside Box, High Claims, Acted as Sweeper |
| Physical | Recoveries, Touches, Dispossessed, Distance Covered |
| Contextual | Opponent Strength, Home/Away Indicator, Match Importance |
The split is strictly temporal — random shuffling is explicitly avoided, as it would allow future information to leak into training and produce inflated, misleading results. Football data flows in one direction.
| Partition | Seasons | Approx. Matches | Approx. Player-Match Obs. |
|---|---|---|---|
| Training | 2021/22, 2022/23, 2023/24 | ~1,140 | ~15,000–22,500 |
| Testing | 2024/25 | ~380 | ~5,000–7,500 |
| Total | 4 seasons | ~1,520 | ~20,000–30,000 |
Within the three training seasons, hyperparameter tuning uses time-series-aware cross-validation: the data is walked forward chronologically, so each fold trains on earlier matchdays and validates on later ones.
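The walk-forward idea can be sketched in a few lines. This is a minimal illustration at season granularity, assuming the FotMob folder names as season identifiers; `walk_forward_folds` is a hypothetical helper, not a function from the project code, and the same pattern applies to blocks of matchdays.

```python
from typing import Iterator, List, Tuple

# Training seasons in chronological order (FotMob folder naming).
TRAIN_SEASONS = ["2021_2022", "2022_2023", "2023_2024"]


def walk_forward_folds(seasons: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (train_seasons, validation_season) pairs in chronological order.

    Each fold trains on every season that precedes the validation season,
    so no future match can inform a prediction about the past.
    """
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]


for train, val in walk_forward_folds(TRAIN_SEASONS):
    print(f"train={train} -> validate={val}")
```

With three training seasons this produces two folds, and the 2024/25 test season is never touched during tuning.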
A pivotal design decision: rather than applying a single universal formula to every player on the pitch (which would be like judging a goalkeeper on the same criteria as a striker), the system constructs four position-specific scores that faithfully reflect the distinct tactical contribution expected of each role.
All input metrics are first z-score normalised across the training set before weights are applied.
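To make the "training set only" part concrete, here is a dependency-free sketch of fitting the normalisation statistics on training values and reusing them on held-out values. The helper names and the toy xG numbers are illustrative assumptions, not project code or project data; population standard deviation is used because that is what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_zscore(train_values: List[float]) -> Tuple[float, float]:
    """Return (mean, std) estimated from training-season values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against constant columns, which would otherwise divide by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def apply_zscore(values: List[float], mu: float, sigma: float) -> List[float]:
    """Standardise values using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Toy xG values, made up for illustration.
train_xg = [0.1, 0.3, 0.5, 0.7, 0.9]
test_xg = [0.5, 1.3]

mu, sigma = fit_zscore(train_xg)
train_z = apply_zscore(train_xg, mu, sigma)
test_z = apply_zscore(test_xg, mu, sigma)  # reuses the training mean and std
```

Fitting on the training seasons and only applying to the test season keeps 2024/25 statistics from leaking into the features.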
Attacker performance lives and dies by goal contributions and chance creation.
Score_ATT = 0.25·(G+A) + 0.20·xG + 0.15·xA + 0.15·Dribbles
+ 0.10·Shots + 0.10·ChancesCreated + 0.05·Recoveries
| Metric | Source | Weight |
|---|---|---|
| Goals + Assists (per 90) | FBref / FotMob | 0.25 |
| Expected Goals (xG) | FBref / FotMob | 0.20 |
| Expected Assists (xA) | FBref / FotMob | 0.15 |
| Successful Dribbles | FotMob | 0.15 |
| Total Shots | FotMob | 0.10 |
| Chances Created | FotMob | 0.10 |
| Ball Recoveries | FotMob | 0.05 |
The modern midfielder must do everything — link defence to attack, win the ball back, and still arrive late into the box.
Score_MID = 0.20·ProgPass + 0.20·ChancesCreated + 0.15·xA + 0.15·(G+A)
+ 0.15·TacklesWon + 0.10·Interceptions + 0.05·Recoveries
| Metric | Source | Weight |
|---|---|---|
| Progressive Passes | FBref | 0.20 |
| Chances Created | FotMob | 0.20 |
| Expected Assists (xA) | FBref / FotMob | 0.15 |
| Goals + Assists (per 90) | FBref / FotMob | 0.15 |
| Tackles Won | FotMob | 0.15 |
| Interceptions | FBref / FotMob | 0.10 |
| Ball Recoveries | FotMob | 0.05 |
A defender's first duty is to defend — and this score holds them to exactly that standard.
Score_DEF = 0.25·TacklesWon + 0.20·AerialDuelsWon + 0.20·Clearances
+ 0.15·Interceptions + 0.10·Blocks + 0.10·ProgPass
| Metric | Source | Weight |
|---|---|---|
| Tackles Won | FotMob | 0.25 |
| Aerial Duels Won | FotMob | 0.20 |
| Clearances | FBref / FotMob | 0.20 |
| Interceptions | FBref / FotMob | 0.15 |
| Blocks | FotMob | 0.10 |
| Progressive Passes | FBref | 0.10 |
Built exclusively from shot-stopping and sweeping metrics — a clean sheet starts here.
Score_GK = 0.30·Saves + 0.25·xGOTFaced + 0.15·DivingSaves
+ 0.15·SavesInsideBox + 0.10·HighClaims + 0.05·SweeperActions
| Metric | Source | Weight |
|---|---|---|
| Saves | FotMob | 0.30 |
| xGoals on Target Faced (xGOT) | FotMob | 0.25 |
| Diving Saves | FotMob | 0.15 |
| Saves Inside Box | FotMob | 0.15 |
| High Claims | FotMob | 0.10 |
| Acted as Sweeper | FotMob | 0.05 |
Position labels are sourced from FotMob's per-match lineup data. If a player shifts position between fixtures, the scoring scheme shifts right along with them.
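A compact way to picture the position-aware scoring is a weight table keyed by role. The sketch below copies the weights from the four definitions above; the position keys (`ATT`, `MID`, `DEF`, `GK`), the metric names, and the choice to treat a missing metric as 0.0 (the training average after z-scoring) are assumptions made for illustration.

```python
from typing import Dict

# Weights copied from the four score definitions; each set sums to 1.0.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "ATT": {"G+A": 0.25, "xG": 0.20, "xA": 0.15, "Dribbles": 0.15,
            "Shots": 0.10, "ChancesCreated": 0.10, "Recoveries": 0.05},
    "MID": {"ProgPass": 0.20, "ChancesCreated": 0.20, "xA": 0.15, "G+A": 0.15,
            "TacklesWon": 0.15, "Interceptions": 0.10, "Recoveries": 0.05},
    "DEF": {"TacklesWon": 0.25, "AerialDuelsWon": 0.20, "Clearances": 0.20,
            "Interceptions": 0.15, "Blocks": 0.10, "ProgPass": 0.10},
    "GK": {"Saves": 0.30, "xGOTFaced": 0.25, "DivingSaves": 0.15,
           "SavesInsideBox": 0.15, "HighClaims": 0.10, "SweeperActions": 0.05},
}


def composite_score(position: str, z_metrics: Dict[str, float]) -> float:
    """Weighted sum of z-scored metrics using the weights for `position`."""
    weights = WEIGHTS[position]
    # Assumption: a metric missing from the record counts as 0.0.
    return sum(w * z_metrics.get(name, 0.0) for name, w in weights.items())


# The same z-scored match line, scored under two different roles.
z = {"TacklesWon": 1.0, "ProgPass": 2.0, "Interceptions": 0.5}
as_defender = composite_score("DEF", z)    # 0.25*1.0 + 0.15*0.5 + 0.10*2.0
as_midfielder = composite_score("MID", z)  # 0.20*2.0 + 0.15*1.0 + 0.10*0.5
```

The same performance yields a different score depending on the role played in that fixture, which is the behaviour described above for players who shift position.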
| Algorithm | Stage | Purpose |
|---|---|---|
| Ridge Regression | Regression | Predict composite scores; L2 regularisation manages multicollinearity between football statistics |
| Random Forest | Regression + Classification | Captures nonlinear relationships and feature interactions |
| K-Means | Clustering | Groups players into statistical archetypes |
| Task | Baseline |
|---|---|
| Regression | Mean composite score predictor |
| Classification | Majority-class predictor; last-match result predictor |
| Clustering | Random cluster assignment |
| Task | Metrics |
|---|---|
| Regression | MSE, RMSE, R² |
| Classification | Accuracy, Precision, Recall, F1 Score (macro), Confusion Matrix |
| Clustering | Silhouette Score, Elbow Method (inertia vs. k), PCA Visualisation |
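To show why macro F1 sits alongside accuracy for classification, the following dependency-free sketch computes both for a majority-class predictor. The label encoding (`W`, `D`, `L`) follows this README; the toy labels are invented, and in practice a library such as scikit-learn would normally be used instead of hand-rolled metrics.

```python
from collections import Counter
from typing import List, Sequence

LABELS = ["W", "D", "L"]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of matches whose outcome was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: List[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, where F1 = 2TP / (2TP + FP + FN)."""
    cm = Counter(zip(y_true, y_pred))  # confusion counts keyed by (true, pred)
    scores = []
    for c in labels:
        tp = cm[(c, c)]
        fp = sum(cm[(t, c)] for t in labels if t != c)
        fn = sum(cm[(c, p)] for p in labels if p != c)
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a majority-class baseline that always predicts a win.
y_true = ["W", "W", "W", "D", "L", "L"]
y_pred = ["W"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_macro_f1 = macro_f1(y_true, y_pred)
```

Here the baseline reaches 50% accuracy but a macro F1 of only about 0.22, because the draw and loss classes each contribute an F1 of zero.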
Class weighting is applied throughout classification: La Liga's home-win bias (~45%) would otherwise push a naive model towards predicting home wins and largely ignoring draws and away wins.
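As a rough illustration of what balanced weighting does, the sketch below reproduces the heuristic scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`. The outcome split is a toy one: the 45% and 25% figures echo this README, while the remaining 30% and the helper name are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy outcome distribution from the home side's perspective.
outcomes = ["W"] * 45 + ["D"] * 25 + ["L"] * 30
weights = balanced_class_weights(outcomes)
# Rarer classes receive larger weights, so misclassifying a draw
# costs more than misclassifying a home win.
```

With this split the draw class is weighted roughly 1.33, against roughly 0.74 for home wins.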
Individual player predictions are aggregated into team-level features before classification. For each match:
- Mean predicted performance of the starting XI
- Minutes-weighted performance averages
- Offensive vs. defensive contribution totals across the lineup
- Distribution of player archetypes within the selected squad
These aggregated features serve as the primary inputs to the match outcome classifier.
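The aggregation listed above could look roughly like the following. Everything here is a sketch under stated assumptions: the `PlayerPrediction` fields, the feature names, and in particular the rule that attackers and midfielders count as "offensive" while defenders and goalkeepers count as "defensive" are illustrative choices, not the project's confirmed implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlayerPrediction:
    name: str
    position: str           # assumed labels: "ATT", "MID", "DEF", "GK"
    predicted_score: float  # output of the regression stage
    minutes: float          # minutes used for weighting
    archetype: int          # K-Means cluster id


def team_features(lineup: List[PlayerPrediction]) -> Dict[str, float]:
    """Collapse one lineup's player-level predictions into team-level features."""
    if not lineup:
        raise ValueError("lineup must contain at least one player")
    n = len(lineup)
    total_minutes = sum(p.minutes for p in lineup)
    feats = {
        "mean_score": sum(p.predicted_score for p in lineup) / n,
        "minutes_weighted_score":
            sum(p.predicted_score * p.minutes for p in lineup) / total_minutes,
        # Assumption: ATT and MID are offensive, DEF and GK are defensive.
        "offensive_total":
            sum(p.predicted_score for p in lineup if p.position in {"ATT", "MID"}),
        "defensive_total":
            sum(p.predicted_score for p in lineup if p.position in {"DEF", "GK"}),
    }
    # Share of the lineup belonging to each archetype.
    for cluster_id, count in Counter(p.archetype for p in lineup).items():
        feats[f"archetype_{cluster_id}_share"] = count / n
    return feats


# Toy three-player lineup with invented values.
lineup = [
    PlayerPrediction("Player A", "ATT", 1.2, 90, 0),
    PlayerPrediction("Player B", "MID", 0.6, 45, 0),
    PlayerPrediction("Player C", "DEF", -0.3, 90, 1),
]
features = team_features(lineup)
```

One such feature dictionary per team per match, for example concatenated for the home and away sides, would then form a single row of classifier input.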
GOALS/
├── CLAUDE.md                          # Persistent AI session context
├── fotmob_final.ipynb                 # FotMob scraper (LEAGUE_ID=47 for Premier League)
├── GOALS_notebook.ipynb               # FBref scraper (data already collected)
├── data/
│   ├── FBref/
│   │   └── premier_league/{season}/   # standard, shooting, misc, goalkeeping, playing_time CSVs
│   └── 47/{season}/                   # FotMob Premier League output
│       ├── raw/                       # Cached match JSON (one file per match_id)
│       └── output/
│           ├── outfield_players.parquet
│           ├── goalkeepers.parquet
│           ├── fixtures.parquet
│           └── player_stats.parquet
└── notebooks/
    ├── 01_data_merge.ipynb            # FBref + FotMob join (fuzzy player name matching)
    ├── 02_eda.ipynb                   # Distributions, correlations, PCA
    ├── 03_feature_engineering.ipynb   # Z-score normalisation + composite score construction
    ├── 04_regression.ipynb            # Ridge + Random Forest; time-series CV
    ├── 05_clustering.ipynb            # K-Means archetypes; Silhouette + Elbow
    └── 06_classification.ipynb        # Win/Draw/Loss prediction; class-weighted
FBref season folder names: 2021-2022, 2022-2023, 2023-2024, 2024-2025
FotMob season folder names: 2021_2022, 2022_2023, 2023_2024, 2024_2025
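Because the two sources use different separators, a small conversion helper avoids hard-coding folder names when joining them. `season_folder` is a hypothetical helper written for this README, and it assumes seasons are written as `'2024/2025'`, the format used in the scraper's `SEASON` setting.

```python
def season_folder(season: str, source: str) -> str:
    """Convert a season such as '2024/2025' into a source-specific folder name.

    FBref folders use a hyphen (2024-2025); FotMob folders use an
    underscore (2024_2025).
    """
    separators = {"fbref": "-", "fotmob": "_"}
    start, end = season.split("/")
    return f"{start}{separators[source]}{end}"


fbref_dir = season_folder("2024/2025", "fbref")    # '2024-2025'
fotmob_dir = season_folder("2024/2025", "fotmob")  # '2024_2025'
```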
FBref data for both the Premier League and La Liga (plus Bundesliga) across all 4 seasons has already been scraped and is stored under data/FBref/.
fotmob_final.ipynb is production-ready with rate-limiting, HMAC auth, retry logic, and idempotent JSON caching. Data for both supported leagues has been scraped across all 4 seasons and is stored under data/{league_id}/.
| League | FotMob ID | Data path | Status |
|---|---|---|---|
| Premier League | 47 | data/47/{season}/output/ | Complete |
| La Liga | 87 | data/87/{season}/output/ | Complete |
To re-scrape or extend to a new season, open fotmob_final.ipynb and configure Cell 1:
LEAGUE_ID = 47        # 47 = Premier League, 87 = La Liga
SEASON = '2024/2025'  # target season

Each run is fully resumable from cached JSON if interrupted.
Once all four FotMob seasons are scraped, train the Random Forest classifier from the terminal (activate your venv first):
python -c "from goals_app.services.ml_service import train; train()"

This loads the three training seasons (2021_2022, 2022_2023, 2023_2024), runs walk-forward cross-validation, and saves artifacts to goals_app/ml/artifacts/:
| Artifact | Description |
|---|---|
| rf_classifier.pkl | Trained Random Forest classifier |
| outfield_scaler.pkl | StandardScaler fit on outfield training data |
| gk_scaler.pkl | StandardScaler fit on goalkeeper training data |
| metrics.json | CV fold results + confusion matrix |
Expected CV performance (Premier League):
| CV Fold | Train | Val | Accuracy | Macro F1 |
|---|---|---|---|---|
| 1 | 2021/22 | 2022/23 | 90.5% | 89.7% |
| 2 | 2021/22 + 2022/23 | 2023/24 | 92.9% | 92.7% |
Artifacts are git-ignored and must be regenerated locally before running the app. Re-run training any time new season data is scraped.
| Challenge | Mitigation |
|---|---|
| No ground-truth performance score | Four position-specific composite scores constructed from domain-motivated weights; sensitivity analysis across three weighting schemes validates robustness |
| Multicollinearity (xG and goals correlate heavily) | Ridge regularisation (L2); correlation matrix analysis to flag redundant features |
| Class imbalance (~45% home wins, ~25% draws) | class_weight='balanced' throughout; macro F1 evaluation |
| Sparse minutes for fringe players | Minimum appearance threshold filter; 20-club × 4-season volume absorbs filtering without significant data loss |
| FBref / FotMob name mismatches | Fuzzy string matching with date-keyed match anchors; unmatched records retained from available source rather than discarded |
| Promotion and relegation | Clubs are included only for the seasons in which they played in the relevant top division; clubs with insufficient coverage are excluded from the analysis |
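The name-matching mitigation in the table above can be illustrated with the standard library. This is a sketch only: the project's actual matcher is not shown in this README, `normalise_name` and `match_player` are hypothetical helpers, the 0.8 cutoff is an arbitrary assumption, and the candidate list stands in for "players who appeared in the same date-keyed match".

```python
import difflib
import unicodedata
from typing import List, Optional


def normalise_name(name: str) -> str:
    """Lower-case, strip combining accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(unaccented.lower().split())


def match_player(name: str, candidates: List[str],
                 cutoff: float = 0.8) -> Optional[str]:
    """Return the closest candidate name, or None if nothing clears the cutoff.

    `candidates` should already be restricted to players from the same match,
    which keeps the fuzzy search small and reduces false positives.
    """
    lookup = {normalise_name(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalise_name(name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


same_match_players = ["Luis Díaz", "Mohamed Salah"]
matched = match_player("Luis Diaz", same_match_players)
unmatched = match_player("Erling Haaland", same_match_players)
```

A `None` result corresponds to the "unmatched" case in the table, where the record is kept from whichever source has it rather than being dropped.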
| Milestone | Description | Target |
|---|---|---|
| 1 | Data collection and preprocessing (FBref + FotMob) | Feb 21 – Feb 28 |
| 2 | Exploratory data analysis and feature engineering | Feb 28 – Mar 14 |
| 3 | Regression and clustering models | Mar 14 – Mar 28 |
| 4 | Classification model and pipeline integration | Mar 21 – Apr 4 |
| 5 | Evaluation, refinement, and forward fixture forecasting | Apr 4 – Apr 11 |
| 6 | Final report and presentation | Apr 11 – Apr 18 |
| Member | Responsibilities |
|---|---|
| Amine Kebichi | Regression modelling, evaluation framework, report writing |
| Nicholas Annunziata | Clustering analysis, classification models, visualisation |
| Both | Data preprocessing, EDA, evaluation, presentation |
- T. Decroos and J. Van Haaren, Soccerdata: A Python package for scraping soccer data, 2023.
- Sports Reference LLC, FBref advanced football statistics, fbref.com, 2024.
- StatsBomb, StatsBomb Open Data, github.com/statsbomb/open-data, 2023.