# Exploratory Data Analysis
## Football Alpha Analysis - 2025-26 Season
### Dual-Source Pipeline: FBref + Understat

This notebook explores the merged dataset containing player statistics from Europe's Top 5 leagues.
- **FBref**: 76 columns (Standard, Shooting, Keeper, Playing Time, Misc)
- **Understat**: 10 columns (xG, xA, npxG, xGChain, xGBuildup, shots, key passes, NPG)
- **Computed**: Finishing Alpha, Playmaking Alpha, per-90 metrics
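
The computed columns are produced upstream by `get_data()`, and their exact formulas are not shown in this notebook. As a minimal sketch, assuming the common convention that "alpha" means actual output minus expected output and that `alpha_per90` is finishing alpha per 90 minutes, they could be derived as follows. The toy values and the helper name `add_alpha_metrics` are hypothetical.

```python
import pandas as pd

# Toy rows with made-up values; column names follow this notebook's inventory.
toy = pd.DataFrame({
    "player": ["A", "B", "C"],
    "gls": [10, 3, 0],
    "xg": [7.5, 4.0, 0.5],
    "ast": [4, 6, 1],
    "xag": [5.0, 3.5, 1.0],
    "col_90s": [20.0, 15.0, 0.0],  # minutes played divided by 90
})

def add_alpha_metrics(frame: pd.DataFrame) -> pd.DataFrame:
    """Assumed definitions: alpha = actual output minus expected output."""
    out = frame.copy()
    out["finishing_alpha"] = out["gls"] - out["xg"]
    out["playmaking_alpha"] = out["ast"] - out["xag"]
    # Treat zero playing time as missing so per-90 rates become NaN, not inf.
    nineties = out["col_90s"].where(out["col_90s"] > 0)
    for col in ["gls", "xg", "ast", "xag"]:
        out[f"{col}_per90"] = out[col] / nineties
    # Assumption: alpha_per90 is finishing alpha scaled by playing time.
    out["alpha_per90"] = out["finishing_alpha"] / nineties
    return out

toy_metrics = add_alpha_metrics(toy)
print(toy_metrics[["player", "finishing_alpha", "playmaking_alpha", "gls_per90", "alpha_per90"]])
```

Masking zero-minute rows before dividing keeps players without playing time out of per-90 rankings instead of giving them infinite rates.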

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('../src')
from analysis import get_data

plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', 50)

## 1. Load Data

In [None]:
df = get_data()
print(f"Dataset Shape: {df.shape}")
print(f"Total Players: {len(df)}")
print(f"Total Columns: {len(df.columns)}")
print("\nData Source Breakdown:")
print("  FBref columns: ~76")
print(f"  Players with Understat xG: {df['xg'].notna().sum()}")
print(f"  xG coverage: {df['xg'].notna().mean() * 100:.1f}%")

## 2. Data Overview

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df.describe()

## 3. Column Inventory

Let's categorize all columns by their source and type.

In [None]:
fbref_core = ['player', 'squad', 'comp', 'pos', 'age', 'born', 'mp', 'starts', 'min', 'col_90s']
fbref_offensive = ['gls', 'ast', 'g_a', 'g_pk', 'pk', 'pkatt', 'sh', 'sot', 'sot_pct', 'sh_90', 'g_sh', 'g_sot', 'dist', 'fk']
fbref_keeper = ['ga', 'ga90', 'sota', 'saves', 'save_pct', 'w', 'd', 'l', 'cs', 'cs_pct']
fbref_misc = ['crdy', 'crdr', 'fls', 'off', 'crs', 'recov', 'won', 'lost']
understat = ['xg', 'xag', 'npxg', 'xgchain', 'xgbuildup', 'us_shots', 'us_key_passes', 'us_npg']
computed = ['finishing_alpha', 'playmaking_alpha', 'gls_per90', 'xg_per90', 'ast_per90', 'xag_per90', 'alpha_per90']

print("Column Categories:")
for name, cols in [("FBref Core", fbref_core), ("FBref Offensive", fbref_offensive),
                   ("FBref Keeper", fbref_keeper), ("FBref Misc", fbref_misc),
                   ("Understat xG", understat), ("Computed Alpha", computed)]:
    available = [c for c in cols if c in df.columns]
    print(f"  {name}: {len(available)}/{len(cols)} available")

## 4. Missing Values Analysis

In [None]:
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False).head(20)

In [None]:
plt.figure(figsize=(12, 6))
missing_cols = missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False).head(20)
colors = ['#e74c3c' if p > 50 else '#f39c12' if p > 20 else '#3498db' for p in missing_cols['Percentage']]
plt.barh(missing_cols.index, missing_cols['Percentage'], color=colors)
plt.xlabel('Missing %')
plt.title('Top 20 Columns with Missing Values')
plt.tight_layout()
plt.show()

## 5. League Distribution

In [None]:
league_counts = df['comp'].value_counts()
print(league_counts)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].pie(league_counts.values, labels=league_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Player Distribution by League')

axes[1].bar(league_counts.index, league_counts.values, color=['#e74c3c', '#3498db', '#2ecc71', '#f39c12', '#9b59b6'])
axes[1].set_ylabel('Player Count')
axes[1].set_title('Players per League')
plt.setp(axes[1].get_xticklabels(), rotation=30, ha='right')  # rotate labels on the bar chart only
plt.tight_layout()
plt.show()

## 6. Position Distribution
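
The `get_main_position` helper in the cell below checks for `DF` before `MF` before `FW`, so a multi-position string such as `FW,MF` is labelled `MF` regardless of the order in which the codes are listed. If the first listed code is meant to be the primary position (an assumption about the source data, not something this notebook confirms), a variant like the hypothetical `get_primary_position` below would preserve it.

```python
import pandas as pd

def get_primary_position(pos):
    """Hypothetical alternative: use the first listed code as the main position."""
    if pd.isna(pos):
        return "Unknown"
    first = str(pos).upper().split(",")[0].strip()
    return first if first in {"GK", "DF", "MF", "FW"} else "Unknown"

# Made-up sample values covering hybrid, single, missing, and lowercase cases.
samples = pd.Series(["FW,MF", "MF,FW", "DF", None, "gk"])
primary = samples.apply(get_primary_position).tolist()
print(primary)  # ['FW', 'MF', 'DF', 'Unknown', 'GK']
```

Either mapping is defensible for a coarse distribution plot; the difference matters mainly when forwards and midfielders are compared on goals or assists.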

In [None]:
def get_main_position(pos):
    if pd.isna(pos): return 'Unknown'
    pos = pos.upper()
    if 'GK' in pos: return 'GK'
    elif 'DF' in pos: return 'DF'
    elif 'MF' in pos: return 'MF'
    elif 'FW' in pos: return 'FW'
    return 'Unknown'

df['main_pos'] = df['pos'].apply(get_main_position)
pos_counts = df['main_pos'].value_counts()

plt.figure(figsize=(8, 6))
colors = {'GK': '#9b59b6', 'DF': '#3498db', 'MF': '#2ecc71', 'FW': '#e74c3c', 'Unknown': '#95a5a6'}
plt.bar(pos_counts.index, pos_counts.values, color=[colors.get(p, '#95a5a6') for p in pos_counts.index])
plt.xlabel('Position')
plt.ylabel('Count')
plt.title('Player Distribution by Position')
for i, (pos, count) in enumerate(pos_counts.items()):
    plt.text(i, count + 5, str(count), ha='center', fontweight='bold')
plt.show()

## 7. Age Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['age'].dropna(), bins=20, edgecolor='black', color='#3498db')
axes[0].axvline(df['age'].mean(), color='red', linestyle='--', label=f"Mean: {df['age'].mean():.1f}")
axes[0].axvline(df['age'].median(), color='green', linestyle='--', label=f"Median: {df['age'].median():.1f}")
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Count')
axes[0].set_title('Age Distribution')
axes[0].legend()

df.boxplot(column='age', by='main_pos', ax=axes[1])
axes[1].set_title('Age by Position')
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Age')
plt.suptitle('')
plt.tight_layout()
plt.show()

## 8. Goals & Assists Distribution

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0,0].hist(df['gls'].dropna(), bins=30, edgecolor='black', color='#3498db')
axes[0,0].set_xlabel('Goals')
axes[0,0].set_title('Goals Distribution')

axes[0,1].hist(df['ast'].dropna(), bins=30, edgecolor='black', color='#2ecc71')
axes[0,1].set_xlabel('Assists')
axes[0,1].set_title('Assists Distribution')

df.boxplot(column='gls', by='main_pos', ax=axes[1,0])
axes[1,0].set_title('Goals by Position')
axes[1,0].set_xlabel('Position')

df.boxplot(column='ast', by='main_pos', ax=axes[1,1])
axes[1,1].set_title('Assists by Position')
axes[1,1].set_xlabel('Position')

plt.suptitle('')
plt.tight_layout()
plt.show()

## 9. Understat xG Coverage Analysis

How well does our Understat merge cover the dataset?
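
The merge itself happens upstream in the pipeline and its join keys are not shown here. As a rough sketch of how a match rate can be measured at merge time, assuming a left join on `player` and `squad` (hypothetical keys) and made-up toy tables, pandas' `indicator=True` option flags which rows found a partner:

```python
import pandas as pd

# Hypothetical toy tables standing in for the two sources.
fbref_toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "squad": ["X", "X", "Y", "Z"],
    "comp": ["L1", "L1", "L2", "L2"],
})
understat_toy = pd.DataFrame({
    "player": ["A", "C", "D"],
    "squad": ["X", "Y", "Z"],
    "xg": [5.2, 1.1, 0.4],
})

# A left join keeps every FBref row; `_merge` records whether Understat matched.
merged = fbref_toy.merge(understat_toy, on=["player", "squad"], how="left", indicator=True)
matched = merged["_merge"] == "both"

print(f"Match rate: {matched.mean() * 100:.1f}%")
print("Unmatched rows:")
print(merged.loc[~matched, ["player", "squad", "comp"]])
```

Listing the unmatched rows is usually more useful than the percentage alone, since mismatches tend to come from name spelling or mid-season transfers.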

In [None]:
xg_coverage = df.groupby('comp').agg(
    total=('player', 'count'),
    with_xg=('xg', lambda x: x.notna().sum())
).assign(coverage=lambda x: (x['with_xg'] / x['total'] * 100).round(1))

print("xG Coverage by League:")
print(xg_coverage)
print(f"\nOverall: {df['xg'].notna().sum()}/{len(df)} ({df['xg'].notna().sum()/len(df)*100:.1f}%)")

# xGChain & xGBuildup availability
for col in ['xgchain', 'xgbuildup']:
    if col in df.columns:
        avail = df[col].notna().sum()
        print(f"{col}: {avail} players ({avail/len(df)*100:.1f}%)")

## 10. Correlation Analysis

In [None]:
key_metrics = ['gls', 'ast', 'xg', 'xag', 'npxg', 'finishing_alpha', 'playmaking_alpha', 'col_90s', 'age']
available = [m for m in key_metrics if m in df.columns]
corr_matrix = df[available].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdYlGn', center=0, fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix - Key Metrics')
plt.tight_layout()
plt.show()

## 11. Top Players Overview

In [None]:
print("Top 10 Goal Scorers:")
df.nlargest(10, 'gls')[['player', 'squad', 'comp', 'gls', 'xg', 'finishing_alpha']]

In [None]:
print("Top 10 Assist Providers:")
df.nlargest(10, 'ast')[['player', 'squad', 'comp', 'ast', 'xag', 'playmaking_alpha']]

In [None]:
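from IPython.display import display

print("Top 10 xGChain (Involvement in Goal-Scoring Chains):")
if 'xgchain' in df.columns:
    # An expression inside an `if` block is not auto-rendered by Jupyter, so display it explicitly.
    display(df.dropna(subset=['xgchain']).nlargest(10, 'xgchain')[['player', 'squad', 'comp', 'xgchain', 'xgbuildup', 'gls', 'ast']])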

## 12. Summary Statistics by League

In [None]:
league_summary = df.groupby('comp').agg({
    'player': 'count',
    'gls': ['sum', 'mean'],
    'ast': ['sum', 'mean'],
    'xg': ['sum', 'mean'],
    'finishing_alpha': 'mean',
    'playmaking_alpha': 'mean',
    'age': 'mean'
}).round(2)

league_summary.columns = ['Players', 'Total Goals', 'Avg Goals', 'Total Assists', 'Avg Assists',
                          'Total xG', 'Avg xG', 'Avg Finishing Alpha', 'Avg Playmaking Alpha', 'Avg Age']
league_summary

## Key Findings

1. **Dataset**: Roughly 2,600 players from the Big 5 European leagues (2025-26 season)
2. **Dual-Source**: FBref (76 cols) + Understat (10 cols) merged with 95%+ xG match rate
3. **New Metrics**: xGChain and xGBuildup from Understat show involvement in goal-scoring sequences
4. **Leagues**: Premier League, La Liga, Serie A, Bundesliga, Ligue 1
5. **Goals Distribution**: Heavily right-skewed (most players score few goals)
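6. **Strong Correlation**: xG and actual goals are highly correlated, which is consistent with xG tracking scoring output; because season totals of both also rise with minutes played, per-90 values give a stricter check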
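
Findings 5 and 6 can be backed by numbers rather than read off the plots. The sketch below uses made-up toy values (not results from this dataset) to show one way to do it: sample skewness for the goals distribution, and the Pearson correlation between goals and xG on both season totals and per-90 rates. The variable names are illustrative.

```python
import pandas as pd

# Made-up toy values; replace with the notebook's `df` to get real figures.
toy_findings = pd.DataFrame({
    "gls": [0, 0, 0, 1, 1, 2, 3, 5, 12, 20],
    "xg": [0.2, 0.1, 0.5, 0.8, 1.5, 1.9, 3.4, 4.2, 10.5, 16.0],
    "col_90s": [2, 1, 5, 8, 10, 12, 15, 18, 25, 28],
})

# Positive skewness indicates a long right tail (few high scorers).
goal_skew = toy_findings["gls"].skew()

# Correlation on season totals, which is inflated by shared playing time.
raw_corr = toy_findings["gls"].corr(toy_findings["xg"])

# Correlation on per-90 rates removes the common minutes-played factor.
per90_corr = (toy_findings["gls"] / toy_findings["col_90s"]).corr(
    toy_findings["xg"] / toy_findings["col_90s"]
)

print(f"Goals skewness: {goal_skew:.2f}")
print(f"gls vs xg (totals): {raw_corr:.2f}")
print(f"gls vs xg (per 90): {per90_corr:.2f}")
```

On real data, a minimum-minutes filter before the per-90 comparison would keep low-sample players from dominating the result.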