# Alpha Metrics Analysis
## Understanding Finishing Alpha and Playmaking Alpha
### Dual-Source: FBref Stats + Understat xG

This notebook explains the core "Alpha" metrics, a concept borrowed from quantitative finance, as applied to football analytics.

**Alpha = Actual Performance − Expected Performance**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from adjustText import adjust_text
import sys
sys.path.append('../src')
from analysis import get_data

plt.style.use('seaborn-v0_8-whitegrid')
df = get_data()

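# Position helper: map a position string to its primary position group.
# Assumes FBref-style strings such as "FW,MF", with the primary position listed first;
# the original check order (GK, DF, MF, FW) would label "FW,MF" as 'MF'.
def get_pos(p):
    if pd.isna(p):
        return 'Unknown'
    primary = str(p).upper().split(',')[0]
    for code in ('GK', 'DF', 'MF', 'FW'):
        if code in primary:
            return code
    return 'Unknown'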

df['main_pos'] = df['pos'].apply(get_pos)
print(f"Loaded {len(df)} players | xG coverage: {df['xg'].notna().sum()}/{len(df)}")

## 1. What is Alpha?

In finance, **Alpha (α)** represents the excess return of an investment relative to a benchmark index.

In football analytics, we apply the same concept:

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Finishing Alpha** | Goals − xG | Overperforming (+) or underperforming (−) expected goals |
| **Playmaking Alpha** | Assists − xAG | Overperforming (+) or underperforming (−) expected assists |
| **Alpha per 90** | Finishing Alpha / 90s | Rate-adjusted finishing outperformance |
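
The first two formulas in the table can be sketched on a toy DataFrame. This is a minimal illustration, not the code behind `get_data()`: the rows are invented numbers for hypothetical players, and the column names (`gls`, `xg`, `ast`, `xag`) simply mirror the ones used elsewhere in this notebook.

```python
import pandas as pd

# Invented numbers for three hypothetical players (not real data).
toy = pd.DataFrame({
    'player': ['Player A', 'Player B', 'Player C'],
    'gls': [20, 8, 3],
    'xg': [15.5, 8.0, 5.5],
    'ast': [5, 10, 1],
    'xag': [6.5, 7.0, 1.0],
})

# Alpha = actual performance - expected performance
toy['finishing_alpha'] = toy['gls'] - toy['xg']
toy['playmaking_alpha'] = toy['ast'] - toy['xag']

print(toy[['player', 'finishing_alpha', 'playmaking_alpha']])
```

A positive value means the player produced more than the model expected; a negative value means less. Player B, for example, finishes exactly at expectation but creates three more assists than xAG predicts.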

In [None]:
# Live example
print("Finishing Alpha = Goals - xG")
print("Playmaking Alpha = Assists - xAG")
print()

# Look up a few well-known forwards (first matching row per name)
for name in ['Haaland', 'Mbappé', 'Lewandowski', 'Salah']:
    match = df[df['player'].str.contains(name, case=False, na=False)]
    if len(match) > 0:
        p = match.iloc[0]
        print(f"{p['player']} ({p['squad']})")
        print(f"  Goals: {p['gls']:.0f}, xG: {p['xg']:.1f} → Finishing Alpha: {p['finishing_alpha']:+.2f}")
        if pd.notna(p.get('xag')):
            print(f"  Assists: {p['ast']:.0f}, xAG: {p['xag']:.1f} → Playmaking Alpha: {p['playmaking_alpha']:+.2f}")
        print()

## 2. Finishing Alpha Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

fa = df['finishing_alpha'].dropna()
axes[0].hist(fa, bins=50, edgecolor='black', color='#3498db')
axes[0].axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero (Expected)')
axes[0].axvline(x=fa.mean(), color='green', linestyle='--', linewidth=2, label=f"Mean: {fa.mean():.2f}")
axes[0].set_xlabel('Finishing Alpha')
axes[0].set_ylabel('Count')
axes[0].set_title('Finishing Alpha Distribution')
axes[0].legend()

df.boxplot(column='finishing_alpha', by='main_pos', ax=axes[1])
axes[1].axhline(y=0, color='red', linestyle='--')
axes[1].set_title('Finishing Alpha by Position')
axes[1].set_xlabel('Position')
plt.suptitle('')
plt.tight_layout()
plt.show()

## 3. xG vs Actual Goals Scatter Plot

In [None]:
plt.figure(figsize=(12, 10))
mask = df['xg'].notna() & df['gls'].notna()
plot_df = df[mask]

scatter = plt.scatter(plot_df['xg'], plot_df['gls'], alpha=0.5, 
                      c=plot_df['finishing_alpha'], cmap='RdYlGn', s=50)
plt.colorbar(scatter, label='Finishing Alpha')

max_val = max(plot_df['xg'].max(), plot_df['gls'].max())
plt.plot([0, max_val], [0, max_val], 'k--', linewidth=2, label='Perfect Conversion (Goals = xG)')

# Label outliers
top = plot_df.nlargest(8, 'finishing_alpha')
worst = plot_df.nsmallest(5, 'finishing_alpha')
outliers = pd.concat([top, worst])

texts = [plt.text(row['xg'], row['gls'], row['player'], fontsize=9, fontweight='bold')
         for _, row in outliers.iterrows()]
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray', lw=0.5))

plt.xlabel('Expected Goals (xG)', fontsize=12)
plt.ylabel('Actual Goals', fontsize=12)
plt.title('xG vs Actual Goals - Who Overperforms?', fontsize=14)
plt.legend()
plt.tight_layout()
plt.show()

## 4. Top Clinical Finishers

In [None]:
top_finishers = df.dropna(subset=['finishing_alpha']).nlargest(15, 'finishing_alpha')
top_finishers[['player', 'squad', 'comp', 'gls', 'xg', 'finishing_alpha']].reset_index(drop=True)

In [None]:
plt.figure(figsize=(12, 8))
top15 = df.dropna(subset=['finishing_alpha']).nlargest(15, 'finishing_alpha')
plt.barh(top15['player'], top15['finishing_alpha'], color='#2ecc71')
plt.xlabel('Finishing Alpha (Goals − xG)')
plt.title('Top 15 Clinical Finishers - Overperforming xG')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 5. Worst Finishers

In [None]:
worst15 = df.dropna(subset=['finishing_alpha']).nsmallest(15, 'finishing_alpha')
plt.figure(figsize=(12, 8))
plt.barh(worst15['player'], worst15['finishing_alpha'], color='#e74c3c')
plt.xlabel('Finishing Alpha (Goals − xG)')
plt.title('Bottom 15 Finishers - Underperforming xG')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 6. Playmaking Alpha Analysis

In [None]:
plt.figure(figsize=(12, 10))
mask = df['xag'].notna() & df['ast'].notna()
plot_df = df[mask]

scatter = plt.scatter(plot_df['xag'], plot_df['ast'], alpha=0.5,
                      c=plot_df['playmaking_alpha'], cmap='RdYlGn', s=50)
plt.colorbar(scatter, label='Playmaking Alpha')

max_val = max(plot_df['xag'].max(), plot_df['ast'].max())
plt.plot([0, max_val], [0, max_val], 'k--', linewidth=2, label='Perfect Conversion')

top_playmakers = plot_df.nlargest(10, 'playmaking_alpha')
texts = [plt.text(row['xag'], row['ast'], row['player'], fontsize=9, fontweight='bold')
         for _, row in top_playmakers.iterrows()]
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray', lw=0.5))

plt.xlabel('Expected Assists (xAG)', fontsize=12)
plt.ylabel('Actual Assists', fontsize=12)
plt.title('xAG vs Actual Assists - Who Creates Beyond Expectations?', fontsize=14)
plt.legend()
plt.tight_layout()
plt.show()

## 7. xGChain & xGBuildup - Understat Advanced Metrics

**xGChain**: Total xG of every possession a player is involved in, whether as the shooter, the key passer, or in the buildup.

**xGBuildup**: Same as xGChain, but not counting the player's own shots and key passes.
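
To make the difference concrete, here is a toy sketch under simplifying assumptions: each possession is reduced to the xG of its final shot, the shooter, the key passer, and the other players who touched the ball, and a player earns buildup credit only in possessions where they were neither shooter nor key passer. The data structure and player names are hypothetical; Understat derives these metrics from its own event data.

```python
from collections import defaultdict

# Hypothetical possessions, each ending in a shot with the given xG.
possessions = [
    {'xg': 0.5, 'shooter': 'Striker', 'key_passer': 'Winger', 'others': ['Pivot', 'Fullback']},
    {'xg': 0.25, 'shooter': 'Winger', 'key_passer': 'Pivot', 'others': ['Fullback']},
    {'xg': 0.125, 'shooter': 'Striker', 'key_passer': None, 'others': ['Pivot']},
]

xgchain = defaultdict(float)
xgbuildup = defaultdict(float)

for chain in possessions:
    # Players who took the shot or played the key pass (if there was one).
    finishers = {chain['shooter'], chain['key_passer']} - {None}
    involved = finishers | set(chain['others'])
    for player in involved:
        xgchain[player] += chain['xg']    # every involved player gets chain credit
    for player in involved - finishers:
        xgbuildup[player] += chain['xg']  # buildup credit excludes shooter and key passer

for player in sorted(xgchain):
    print(f"{player}: xGChain={xgchain[player]:.3f}, xGBuildup={xgbuildup[player]:.3f}")
```

In this toy data the fullback never shoots or plays a key pass, so all of their xGChain is also xGBuildup, while the striker has xGChain but no xGBuildup.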

In [None]:
if 'xgchain' in df.columns and 'xgbuildup' in df.columns:
    # Copy so that adding the 'g_a' column below does not trigger SettingWithCopyWarning
    chain_df = df.dropna(subset=['xgchain', 'xgbuildup']).copy()
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # xGChain vs Goals+Assists
    chain_df['g_a'] = chain_df['gls'] + chain_df['ast']
    axes[0].scatter(chain_df['xgchain'], chain_df['g_a'], alpha=0.4, s=30, color='#3498db')
    axes[0].set_xlabel('xGChain')
    axes[0].set_ylabel('Goals + Assists')
    axes[0].set_title('xGChain vs Goal Contributions')
    
    # xGBuildup - pure buildup players
    top_buildup = chain_df.nlargest(15, 'xgbuildup')
    axes[1].barh(top_buildup['player'], top_buildup['xgbuildup'], color='#9b59b6')
    axes[1].set_xlabel('xGBuildup')
    axes[1].set_title('Top 15 Buildup Contributors')
    axes[1].invert_yaxis()
    
    plt.tight_layout()
    plt.show()
else:
    print("xGChain/xGBuildup columns not available")

## 8. League Comparison

In [None]:
league_alpha = df.groupby('comp').agg({
    'finishing_alpha': 'mean',
    'playmaking_alpha': 'mean'
}).round(3)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for i, (col, title) in enumerate([('finishing_alpha', 'Finishing'), ('playmaking_alpha', 'Playmaking')]):
    order = league_alpha.sort_values(col).index
    colors = ['#2ecc71' if league_alpha.loc[l, col] > 0 else '#e74c3c' for l in order]
    axes[i].barh(order, league_alpha.loc[order, col], color=colors)
    axes[i].axvline(x=0, color='black', linewidth=0.5)
    axes[i].set_xlabel(f'Average {title} Alpha')
    axes[i].set_title(f'League {title} Efficiency')

plt.tight_layout()
plt.show()

## 9. Team Efficiency Rankings

In [None]:
team_alpha = df.groupby('squad').agg({
    'finishing_alpha': 'mean',
    'player': 'count'
}).rename(columns={'player': 'num_players'})

team_alpha = team_alpha[team_alpha['num_players'] >= 5]
top_teams = team_alpha.nlargest(15, 'finishing_alpha')

plt.figure(figsize=(12, 8))
colors = ['#2ecc71' if x > 0 else '#e74c3c' for x in top_teams['finishing_alpha']]
plt.barh(top_teams.index, top_teams['finishing_alpha'], color=colors)
plt.xlabel('Average Finishing Alpha')
plt.title('Top 15 Most Clinical Teams (min. 5 players)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 10. Alpha per 90 - Rate-Adjusted Analysis

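Alpha per 90 divides Finishing Alpha by the number of 90-minute units played, revealing who is most clinical *per 90 minutes* regardless of total playing time. The cell below keeps only players with at least five 90s, since rates built on very few minutes are noisy.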
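
A minimal sketch of the rate adjustment, assuming `col_90s` holds minutes played divided by 90 (the usual "90s" convention; this notebook does not show how `alpha_per90` is actually built). The `minutes` column, the players, and the numbers are invented for illustration.

```python
import pandas as pd

# Invented numbers for three hypothetical players (not real data).
toy = pd.DataFrame({
    'player': ['Starter', 'Super-sub', 'Cameo'],
    'minutes': [2700, 900, 90],
    'finishing_alpha': [3.0, 2.0, 0.5],
})

# Assumption: one "90" is 90 minutes played.
toy['col_90s'] = toy['minutes'] / 90
toy['alpha_per90'] = toy['finishing_alpha'] / toy['col_90s']

# Same minimum-playing-time filter as the notebook cell (at least five 90s).
qualified = toy[toy['col_90s'] >= 5]

print(toy[['player', 'col_90s', 'alpha_per90']])
print("Qualified:", qualified['player'].tolist())
```

The player with a single 90 has the highest rate in this toy table, which is exactly the kind of small-sample artifact the minimum threshold is meant to remove.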

In [None]:
if 'alpha_per90' in df.columns:
    qualified = df[(df['col_90s'] >= 5) & df['alpha_per90'].notna()]
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    top10 = qualified.nlargest(10, 'alpha_per90')
    axes[0].barh(top10['player'], top10['alpha_per90'], color='#2ecc71')
    axes[0].set_xlabel('Alpha per 90')
    axes[0].set_title('Top 10 - Finishing Alpha per 90 (min 5 90s)')
    axes[0].invert_yaxis()
    
    worst10 = qualified.nsmallest(10, 'alpha_per90')
    axes[1].barh(worst10['player'], worst10['alpha_per90'], color='#e74c3c')
    axes[1].set_xlabel('Alpha per 90')
    axes[1].set_title('Worst 10 - Finishing Alpha per 90 (min 5 90s)')
    axes[1].invert_yaxis()
    
    plt.tight_layout()
    plt.show()
else:
    print("alpha_per90 column not available")

## Key Insights

1. **Distribution**: Most players cluster around zero alpha, which is consistent with a well-calibrated xG model but also reflects how few shots most players take
2. **Outliers**: Elite finishers consistently overperform xG by 3-8 goals per season
3. **Position Effect**: Forwards have highest variance; midfielders are closer to expectation
4. **xGChain**: Shows total involvement in goal-scoring - not just the final action
5. **xGBuildup**: Identifies "invisible" contributors who facilitate goals without scoring
6. **League Differences**: Each league has distinct finishing/playmaking efficiency profiles
7. **Alpha per 90**: Better metric for comparing players with different playing time
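
Insights 1 and 3 can be checked numerically rather than read off the plots. The following is a minimal sketch, assuming the columns `gls`, `xg`, `finishing_alpha`, and `main_pos` used above; `calibration_summary` is a hypothetical helper, and the toy rows are invented so the example runs on its own.

```python
import pandas as pd

def calibration_summary(frame):
    """Return the aggregate goals-to-xG ratio and the per-position spread of finishing alpha."""
    valid = frame.dropna(subset=['gls', 'xg', 'finishing_alpha'])
    # A ratio close to 1.0 suggests xG is calibrated in aggregate.
    ratio = valid['gls'].sum() / valid['xg'].sum()
    # Sample standard deviation of finishing alpha within each position group.
    spread = valid.groupby('main_pos')['finishing_alpha'].std()
    return ratio, spread

# Invented numbers for four hypothetical players (not real data).
toy = pd.DataFrame({
    'main_pos': ['FW', 'FW', 'MF', 'MF'],
    'gls': [12, 4, 3, 2],
    'xg': [8.0, 8.0, 2.5, 2.5],
})
toy['finishing_alpha'] = toy['gls'] - toy['xg']

ratio, spread = calibration_summary(toy)
print(f"Goals / xG: {ratio:.2f}")
print(spread)
```

Calling `calibration_summary(df)` on the notebook's own DataFrame would give the corresponding figures for the real data.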