# Top XI: Data-Driven Player Ranking in European Football

This project presents a data-driven approach to identifying the top eleven football players across Europe's top five leagues during the 2024/25 season. The analysis leverages a dataset containing over 300 performance metrics for more than 2,500 players.

The workflow consists of the following key steps:
- Importing and preparing the dataset for analysis
- Filtering players by position (goalkeeper, defender, midfielder, forward)
- Normalizing relevant performance metrics
- Developing a custom scoring algorithm for each position
- Selecting and ranking the top-performing players in each role

All analysis is conducted using Python and SQL, with the support of libraries such as `pandas` and `numpy`, and data processing through SQLite.


## Importing Required Libraries

The following libraries are imported to support data manipulation, database interaction, and visualization:

- `pandas` for data manipulation  
- `sqlite3` for querying the dataset using SQL  
- `matplotlib.pyplot` for basic plotting and visualization  
- `MinMaxScaler` from `sklearn.preprocessing` for normalizing performance metrics

In [60]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler



## Uploading Dataset

The dataset is uploaded directly using the Google Colab file upload utility. This allows manual selection of the SQLite database file from the local machine for use within the Colab environment.

In [3]:
from google.colab import files
uploaded = files.upload()

Saving players_data-2024_2025.csv to players_data-2024_2025 (1).csv


## Loading and Preparing the Dataset

This section reads player performance data from a CSV file and prepares it for analysis. The steps include:

- Computing player age based on the birth year, if available.
- Filtering out players who have played fewer than 15 full 90-minute matches to ensure data reliability.
- Creating a derived metric for goals allowed per 90 minutes (GA_per90).
- Selecting only the relevant statistical columns for further analysis.
- Dropping entries with missing key information.
- Storing the cleaned dataset in a temporary SQLite database for efficient querying.


In [42]:
# Load the dataset
df = pd.read_csv('players_data-2024_2025.csv')

# Calculate player age if birth year is provided
if 'Born' in df.columns:
    df['Age'] = 2025 - df['Born']

# Filter players with at least 15 full 90-minute appearances
# and ensure position information is available
df = df[df['90s'] >= 15].dropna(subset=['Pos'])

# Calculate goals allowed per 90 minutes for goalkeepers, if applicable
if 'GA' in df.columns:
    df['GA_per90'] = df['GA'] / df['90s']

# Define column categories for different types of performance metrics
base_cols = ['Player', 'Nation', 'Age', 'Pos', 'Squad', 'Comp', '90s']
attacking = ['Gls', 'Ast', 'G+A', 'xG', 'xAG', 'npxG', 'G-PK']
defending = ['Tkl', 'TklW', 'Blocks', 'Int', 'Tkl+Int', 'Clr', 'Err']
passing = ['PrgP', 'PrgC', 'KP', 'Cmp%_stats_passing', 'Ast_stats_passing', 'xA', 'PPA']
goalkeeping = ['GA', 'Saves', 'Save%', 'CS', 'CS%', 'PKA', 'PKsv']
possession = ['Touches', 'Carries', 'PrgR', 'Mis', 'Dis']
misc = ['CrdY', 'CrdR', 'PKwon', 'PKcon', 'Recov']

# Compile final list of available columns
cols = base_cols + attacking + defending + passing + goalkeeping + possession + misc

# Retain only those columns that are present in the dataset
df = df[[col for col in cols if col in df.columns]]

# Drop rows with missing values in essential fields
df.dropna(subset=['90s', 'Pos'], inplace=True)

# Store cleaned data in a temporary in-memory SQLite database
conn = sqlite3.connect(':memory:')
df.to_sql('players', conn, index=False, if_exists='replace')

1075

## Goalkeeper Evaluation and Ranking

This section ranks goalkeepers using a normalized weighted scoring system based on core shot-stopping metrics and goal prevention. Only goalkeepers with at least 25 full matches (90s played) are included to ensure statistical significance.

The ranking is based on the following per-90-minute metrics:

- **Shot-stopping performance**: save percentage, saves per 90, penalty saves per 90  
- **Clean sheet contribution**: clean sheets per 90  
- **Goal prevention**: goals allowed per 90 (inverted to reward lower values)

All metrics are normalized using min-max scaling. Goals allowed per 90 is inverted before normalization so that lower values correspond to better performance. A weighted sum of the normalized metrics is calculated, using the following weights:

- Save percentage: 30%  
- Saves per 90: 20%  
- Clean sheets per 90: 15%  
- Penalty saves per 90: 10%  
- Goals allowed per 90 (inverted): 25%

The goalkeeper with the highest overall score is selected as the top performer.

In [56]:
query_gk = """
SELECT
  Player, Squad, Age, "90s",
  ROUND("Save%" , 2) AS Save_pct,
  ROUND(Saves / "90s", 2) AS Saves_per90,
  ROUND(CS / "90s", 2) AS CS_per90,
  ROUND(PKsv / "90s", 2) AS PKsv_per90,
  ROUND(GA / "90s", 2) AS GA_per90
FROM players
WHERE Pos = 'GK' AND "90s" >= 25
"""
gk_df = pd.read_sql(query_gk, conn)

# Define weights
weights_gk = {
    'Save_pct': 0.30,
    'Saves_per90': 0.20,
    'CS_per90': 0.15,
    'PKsv_per90': 0.10,
    'GA_per90': 0.25  # Invert this one
}

# Prepare metrics list
gk_metrics = list(weights_gk.keys())
gk_metrics.remove('GA_per90')
gk_metrics.append('GA_per90_inv')

# Invert GA_per90
norm_gk_df = gk_df.copy()
norm_gk_df['GA_per90_inv'] = norm_gk_df['GA_per90'].max() - norm_gk_df['GA_per90']

# Normalize
scaler = MinMaxScaler()
norm_scaled_gk = pd.DataFrame(
    scaler.fit_transform(norm_gk_df[gk_metrics]),
    columns=gk_metrics
)

# Apply weights
for col in gk_metrics:
    base_col = col.replace('_inv', '')
    norm_scaled_gk[col] = norm_scaled_gk[col] * weights_gk[base_col]

# Final weighted score
norm_gk_df['normalized_weighted_score'] = norm_scaled_gk.sum(axis=1)

# Sort and select top goalkeeper
top_gk_norm = norm_gk_df.sort_values('normalized_weighted_score', ascending=False).head(1)

# Display results
top_gk_norm[['Player', 'Squad', 'Age', '90s', 'Save_pct', 'Saves_per90', 'CS_per90', 'PKsv_per90', 'GA_per90', 'normalized_weighted_score']]

Unnamed: 0,Player,Squad,Age,90s,Save_pct,Saves_per90,CS_per90,PKsv_per90,GA_per90,normalized_weighted_score
45,Đorđe Petrović,Strasbourg,26.0,28.0,80.1,3.75,0.36,0.04,1.14,0.795177


## Defender Evaluation and Ranking

This section ranks defenders using a normalized weighted scoring system tailored to defensive output and ball progression. Only players with at least 24 full matches (90s played) are included to ensure statistical validity.

The ranking is based on the following per-90-minute metrics:

- **Defensive effectiveness**: tackles + interceptions, tackles won, blocks, clearances, recoveries  
- **Progression and distribution**: progressive passes, progressive carries, key passes, passes into the penalty area  
- **Goal prevention**: goals allowed per 90 minutes (inverted so that lower values are rewarded)

Each metric is normalized using min-max scaling to allow for fair comparison across features with different ranges. Goals allowed per 90 is inverted prior to normalization. A weighted sum of these normalized metrics is calculated, with the following weights applied:

- Tackles + interceptions: 25%  
- Tackles won: 10%  
- Blocks: 15%  
- Clearances: 5%  
- Recoveries: 10%  
- Progressive passes: 5%  
- Progressive carries: 5%  
- Key passes: 5%  
- Passes into the penalty area: 5%  
- Goals allowed per 90 (inverted): 15%  

The three defenders with the highest overall weighted scores are selected as the top performers.

In [53]:
query_defenders = """
SELECT
  Player, Squad, Age, "90s",
  ROUND("Tkl+Int" / "90s", 2) AS TklInt_per90,
  ROUND(TklW / "90s", 2) AS TklW_per90,
  ROUND(Blocks / "90s", 2) AS Blocks_per90,
  ROUND(Clr / "90s", 2) AS Clr_per90,
  ROUND(Recov / "90s", 2) AS Recov_per90,
  ROUND(PrgP / "90s", 2) AS PrgP_per90,
  ROUND(PrgC / "90s", 2) AS PrgC_per90,
  ROUND(KP / "90s", 2) AS KP_per90,
  ROUND(PPA / "90s", 2) AS PPA_per90,
  ROUND(GA / "90s", 2) AS GA_per90
FROM players
WHERE Pos = 'DF' AND "90s" >= 24
"""
df_df = pd.read_sql(query_defenders, conn)

# Define weights
weights_def = {
    'TklInt_per90': 0.25,
    'TklW_per90': 0.10,
    'Blocks_per90': 0.15,
    'Clr_per90': 0.05,
    'Recov_per90': 0.10,
    'PrgP_per90': 0.05,
    'PrgC_per90': 0.05,
    'KP_per90': 0.05,
    'PPA_per90': 0.05,
    'GA_per90': 0.15  # Inverted
}

# Prepare for normalization
metrics_def = list(weights_def.keys())
metrics_def.remove('GA_per90')
metrics_def.append('GA_per90_inv')

# Invert GA_per90
norm_df_df = df_df.copy()
norm_df_df['GA_per90_inv'] = norm_df_df['GA_per90'].max() - norm_df_df['GA_per90']

# Normalize
scaler = MinMaxScaler()
norm_scaled_df = pd.DataFrame(
    scaler.fit_transform(norm_df_df[metrics_def]),
    columns=metrics_def
)

# Apply weights
for col in metrics_def:
    base_col = col.replace('_inv', '')
    norm_scaled_df[col] = norm_scaled_df[col] * weights_def[base_col]

# Calculate final score
norm_df_df['normalized_weighted_score'] = norm_scaled_df.sum(axis=1)

# Get top 3 defenders
top_defenders_norm = norm_df_df.sort_values('normalized_weighted_score', ascending=False).head(3)

# Display
top_defenders_norm[['Player', 'Squad', 'Age', '90s', 'normalized_weighted_score'] + list(weights_def.keys())]

  return xp.asarray(numpy.nanmin(X, axis=axis))
  return xp.asarray(numpy.nanmax(X, axis=axis))


Unnamed: 0,Player,Squad,Age,90s,normalized_weighted_score,TklInt_per90,TklW_per90,Blocks_per90,Clr_per90,Recov_per90,PrgP_per90,PrgC_per90,KP_per90,PPA_per90,GA_per90
136,Pedro Porro,Tottenham,26.0,27.4,0.604369,3.54,1.68,2.59,2.92,5.4,5.36,2.3,1.97,2.01,
122,Maximilian Mittelstädt,Stuttgart,28.0,24.7,0.592619,4.7,2.15,1.13,3.56,4.33,6.6,2.15,1.66,2.23,
6,Trent Alexander-Arnold,Liverpool,27.0,24.9,0.587828,4.02,1.97,1.41,2.17,5.06,8.47,1.97,2.01,2.49,


## Midfielder Evaluation and Ranking

This section ranks midfielders using a normalized weighted scoring system designed to reflect offensive output, creative playmaking, and defensive contribution. Only players with at least 25 full matches (90s played) are considered.

The ranking is based on the following per-90-minute metrics:

- **Attacking contribution**: goals per 90, assists per 90  
- **Creative playmaking**: expected assists (xA), expected goal contributions (xAG), key passes, passes into the penalty area  
- **Ball progression**: progressive passes, progressive carries  
- **Defensive work**: recoveries, tackles won, interceptions  

All metrics are normalized using min-max scaling to allow comparability across different statistical ranges. A weighted sum of these normalized values is calculated, with the following weights applied:

- Goals per 90: 15%  
- Assists per 90: 15%  
- Expected assists (xA): 10%  
- Expected goals + assists (xAG): 10%  
- Key passes: 10%  
- Passes into the penalty area: 10%  
- Progressive passes: 5%  
- Progressive carries: 5%  
- Recoveries: 5%  
- Tackles won: 5%  
- Interceptions: 5%  

The four midfielders with the highest overall scores are selected as the top performers.

In [43]:
query_mf_stats = """
SELECT
  Player, Squad, Age, "90s",
  ROUND(Gls / "90s", 2) AS Gls_per90,
  ROUND(Ast / "90s", 2) AS Ast_per90,
  ROUND(xA / "90s", 2) AS xA_per90,
  ROUND(xAG / "90s", 2) AS xAG_per90,
  ROUND(KP / "90s", 2) AS KP_per90,
  ROUND(PrgP / "90s", 2) AS PrgP_per90,
  ROUND(PrgC / "90s", 2) AS PrgC_per90,
  ROUND(PPA / "90s", 2) AS PPA_per90,
  ROUND(Recov / "90s", 2) AS Recov_per90,
  ROUND(TklW / "90s", 2) AS TklW_per90,
  ROUND(Int / "90s", 2) AS Int_per90
FROM players
WHERE (Pos = 'MF' OR Pos = 'MF,FW') AND "90s" >= 25
"""
mf_df = pd.read_sql(query_mf_stats, conn)

# Define weights
weights_mf = {
    'Gls_per90': 0.15,
    'Ast_per90': 0.15,
    'xA_per90': 0.10,
    'xAG_per90': 0.10,
    'KP_per90': 0.10,
    'PPA_per90': 0.10,
    'PrgP_per90': 0.05,
    'PrgC_per90': 0.05,
    'Recov_per90': 0.05,
    'TklW_per90': 0.05,
    'Int_per90': 0.05
}

# Create normalized version
metrics_mf = list(weights_mf.keys())
norm_mf_df = mf_df.copy()

# Normalize
scaler = MinMaxScaler()
norm_scaled_mf = pd.DataFrame(
    scaler.fit_transform(norm_mf_df[metrics_mf]),
    columns=metrics_mf
)

# Apply weights
for col in metrics_mf:
    norm_scaled_mf[col] = norm_scaled_mf[col] * weights_mf[col]

# Final score
norm_mf_df['normalized_weighted_score'] = norm_scaled_mf.sum(axis=1)

# Top 4 midfielders
top_mids_norm = norm_mf_df.sort_values('normalized_weighted_score', ascending=False).head(4)

# Display results
top_mids_norm[['Player', 'Squad', 'Age', '90s', 'normalized_weighted_score'] + metrics_mf]

Unnamed: 0,Player,Squad,Age,90s,normalized_weighted_score,Gls_per90,Ast_per90,xA_per90,xAG_per90,KP_per90,PPA_per90,PrgP_per90,PrgC_per90,Recov_per90,TklW_per90,Int_per90
7,Alex Baena,Villarreal,24.0,25.0,0.686074,0.24,0.28,0.45,0.39,3.36,2.64,6.72,3.16,5.2,0.56,0.84
30,Bruno Fernandes,Manchester Utd,31.0,30.6,0.610832,0.26,0.29,0.23,0.25,2.75,2.39,9.18,2.12,6.27,1.54,0.82
22,Matheus Cunha,Wolves,26.0,26.9,0.571731,0.56,0.22,0.18,0.27,1.9,2.01,4.83,3.9,4.57,0.78,0.63
67,Cole Palmer,Chelsea,23.0,31.5,0.564179,0.44,0.25,0.25,0.3,2.48,1.94,6.16,3.27,2.95,0.51,0.25


## Forward Evaluation and Ranking

This section ranks forwards by separating them into two role-based groups — strikers and wingers — and evaluating them using a normalized weighted scoring system based on key attacking and creative metrics. Only players with at least 25 full matches (90s played) are considered.

### Striker Evaluation

Strikers are assessed on their goal-scoring efficiency and finishing quality, using the following per-90-minute metrics:

- **Scoring output**: goals per 90, goals + assists per 90  
- **Expected contribution**: expected goals (xG), non-penalty xG (npxG)  
- **Finishing quality**: difference between goals and xG (xG_diff)

All metrics are normalized using min-max scaling. A weighted sum is computed with the following weights:

- Goals per 90: 25%  
- Expected goals (xG): 20%  
- Goals + assists per 90: 15%  
- xG difference (Gls - xG): 20%  
- Non-penalty xG: 20%

The top two players with the highest striker scores are selected.

### Winger Evaluation

Wingers are evaluated on creativity, progression, and secondary scoring contributions. The following per-90-minute metrics are used:

- **Creativity and passing**: assists, expected assisted goals (xAG), key passes  
- **Ball carrying and finishing**: progressive carries, goals per 90  

Metrics are normalized and weighted as follows:

- Assists per 90: 20%  
- Expected assisted goals (xAG): 20%  
- Key passes: 20%  
- Progressive carries: 20%  
- Goals per 90: 20%

The top-ranked winger (excluding the top two strikers) is selected to complete the front three.

In [65]:
query_fw_stats = """
SELECT
  Player, Squad, Age, "90s",
  ROUND(Gls / "90s", 2) AS Gls_per90,
  ROUND(Ast / "90s", 2) AS Ast_per90,
  ROUND((Gls + Ast) / "90s", 2) AS GA_per90,
  ROUND(xG / "90s", 2) AS xG_per90,
  ROUND(npxG / "90s", 2) AS npxG_per90,
  ROUND((Gls - xG) / "90s", 2) AS xG_diff_per90,
  ROUND(xAG / "90s", 2) AS xAG_per90,
  ROUND(KP / "90s", 2) AS KP_per90,
  ROUND(PrgC / "90s", 2) AS PrgC_per90
FROM players
WHERE (Pos = 'FW' OR Pos = 'FW,MF') AND "90s" >= 25
"""
fw_df = pd.read_sql(query_fw_stats, conn)

from sklearn.preprocessing import MinMaxScaler

# Striker metric weights
weights_striker = {
    'Gls_per90': 0.25,
    'xG_per90': 0.20,
    'GA_per90': 0.15,
    'xG_diff_per90': 0.20,
    'npxG_per90': 0.20
}

# Normalize striker metrics
striker_cols = list(weights_striker.keys())
norm_fw_df = fw_df.copy()
scaler = MinMaxScaler()
scaled_striker = pd.DataFrame(
    scaler.fit_transform(norm_fw_df[striker_cols]),
    columns=striker_cols
)

# Apply weights
for col in striker_cols:
    scaled_striker[col] *= weights_striker[col]

# Compute striker score
norm_fw_df['striker_score'] = scaled_striker.sum(axis=1)
top_strikers = norm_fw_df.sort_values('striker_score', ascending=False).head(2)
striker_names = top_strikers['Player'].tolist()

# Winger metric weights
weights_winger = {
    'Ast_per90': 0.20,
    'xAG_per90': 0.20,
    'KP_per90': 0.20,
    'PrgC_per90': 0.20,
    'Gls_per90': 0.20
}

# Normalize winger metrics
winger_cols = list(weights_winger.keys())
wingers_df = fw_df[~fw_df['Player'].isin(striker_names)].copy()
scaled_winger = pd.DataFrame(
    MinMaxScaler().fit_transform(wingers_df[winger_cols]),
    columns=winger_cols
)

# Apply weights
for col in winger_cols:
    scaled_winger[col] *= weights_winger[col]

# Compute winger score
wingers_df['winger_score'] = scaled_winger.sum(axis=1)
top_winger = wingers_df.sort_values('winger_score', ascending=False).head(1)

front_3 = pd.concat([top_strikers, top_winger])
front_3[['Player', 'Squad', 'Age', '90s', 'Gls_per90', 'Ast_per90', 'xG_per90', 'xAG_per90', 'KP_per90', 'PrgC_per90']]

Unnamed: 0,Player,Squad,Age,90s,Gls_per90,Ast_per90,xG_per90,xAG_per90,KP_per90,PrgC_per90
24,Harry Kane,Bayern Munich,32.0,25.1,0.96,0.32,0.78,0.21,1.31,1.12
29,Robert Lewandowski,Barcelona,37.0,27.1,0.92,0.07,0.9,0.08,0.7,0.92
44,Raphinha,Barcelona,29.0,27.3,0.55,0.33,0.6,0.43,3.04,3.08


## Final XI: Top Performers of the 2024/25 Season

This project identifies the top 11 players across Europe’s top five leagues using a fully data-driven, position-specific ranking system. Each player's performance was evaluated using normalized per-90-minute metrics relevant to their role, with weighted scoring tailored for goalkeepers, defenders, midfielders, and forwards.

**Minimum eligibility**:  
- At least 25 full matches (goalkeepers, midfielders, forwards)  
- At least 24 full matches (defenders)

**Formation:** 3–4–3  
- 1 Goalkeeper  
- 3 Defenders  
- 4 Midfielders  
- 3 Forwards (2 strikers + 1 winger)

The final lineup includes the highest-ranked players in each role, determined through min-max normalized scoring with carefully tuned weights for each key metric.

This lineup is not based on subjective opinions or popularity — it reflects pure on-pitch statistical performance across the 2024/25 season.

In [66]:
# Utility function to tag position for final display
def tag_position(df, pos):
    df = df.copy()
    df['Position'] = pos
    return df[['Player', 'Squad', 'Age', 'Position']]

# Assign positions
gk_final = tag_position(top_gk_norm.head(1), 'GK')
def_final = tag_position(top_defenders_norm.head(3), 'DF')
mid_final = tag_position(top_mids_norm.head(4), 'MF')
fwd_final = tag_position(front_3.head(3), 'FW')

# Concatenate in correct match order
top_xi_df = pd.concat([gk_final, def_final, mid_final, fwd_final]).reset_index(drop=True)

# Display
top_xi_df

Unnamed: 0,Player,Squad,Age,Position
0,Đorđe Petrović,Strasbourg,26.0,GK
1,Pedro Porro,Tottenham,26.0,DF
2,Maximilian Mittelstädt,Stuttgart,28.0,DF
3,Trent Alexander-Arnold,Liverpool,27.0,DF
4,Alex Baena,Villarreal,24.0,MF
5,Bruno Fernandes,Manchester Utd,31.0,MF
6,Matheus Cunha,Wolves,26.0,MF
7,Cole Palmer,Chelsea,23.0,MF
8,Harry Kane,Bayern Munich,32.0,FW
9,Robert Lewandowski,Barcelona,37.0,FW
