{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Player Similarity Analysis\n",
    "## Finding Similar Players Using Machine Learning\n",
    "\n",
    "This notebook explains the player similarity algorithm using cosine similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "import seaborn as sns\n",
    "import sys\n",
    "sys.path.append('../src')\n",
    "from analysis import get_data\n",
    "\n",
    "df = get_data()\n",
    "print(f\"Loaded {len(df)} players\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. What is Cosine Similarity?\n",
    "\n",
    "Cosine similarity measures the angle between two vectors in multi-dimensional space.\n",
    "\n",
    "- **1.0** = Identical direction (most similar)\n",
    "- **0.0** = Perpendicular (no similarity)\n",
    "- **-1.0** = Opposite direction (most dissimilar)\n",
    "\n",
    "Formula: $\\cos(\\theta) = \\frac{A \\cdot B}{||A|| \\times ||B||}$"
   ]
  },
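  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the formula concrete, the next cell is a minimal sketch that evaluates it directly with NumPy. The helper name `cosine_sim` and the three stat vectors are illustrative assumptions, not part of the analysis code. For the same inputs the result should agree with scikit-learn's `cosine_similarity`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def cosine_sim(a, b):\n",
    "    \"\"\"Cosine of the angle between two 1-D vectors (hypothetical helper).\"\"\"\n",
    "    a = np.asarray(a, dtype=float)\n",
    "    b = np.asarray(b, dtype=float)\n",
    "    denom = np.linalg.norm(a) * np.linalg.norm(b)\n",
    "    # The cosine is undefined for a zero vector; return 0.0 by convention here\n",
    "    return float(a @ b / denom) if denom != 0 else 0.0\n",
    "\n",
    "# Illustrative stat vectors: goals, assists, xG\n",
    "scorer = [10, 5, 8]\n",
    "scaled_up_scorer = [20, 10, 16]  # same proportions, twice the volume\n",
    "playmaker = [2, 15, 3]\n",
    "\n",
    "# Only direction matters, so doubling every stat does not change the angle\n",
    "print(f\"scorer vs scaled-up scorer: {cosine_sim(scorer, scaled_up_scorer):.3f}\")\n",
    "print(f\"scorer vs playmaker: {cosine_sim(scorer, playmaker):.3f}\")"
   ]
  },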
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simple example\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# Two vectors representing player stats\n",
    "player_a = np.array([[10, 5, 8]])  # 10 goals, 5 assists, 8 xG\n",
    "player_b = np.array([[12, 6, 9]])  # Similar profile\n",
    "player_c = np.array([[2, 15, 3]])  # Different profile (playmaker)\n",
    "\n",
    "print(f\"Similarity A-B: {cosine_similarity(player_a, player_b)[0][0]:.3f}\")\n",
    "print(f\"Similarity A-C: {cosine_similarity(player_a, player_c)[0][0]:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Position-Based Metrics\n",
    "\n",
    "Different positions require different metrics for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Position-specific metrics\n",
    "FW_METRICS = ['gls', 'ast', 'g_a', 'xg', 'xag', 'npxg', 'g_pk', 'kp', 'xa', 'ppa',\n",
    "              'touches', 'carries', 'prgr', 'mis', 'pkwon']\n",
    "\n",
    "MF_METRICS = ['gls', 'ast', 'g_a', 'xg', 'xag', 'npxg', 'g_pk', 'tkl', 'tklw',\n",
    "              'int', 'tkl_int', 'prgp', 'prgc', 'kp', 'xa', 'ppa', 'touches',\n",
    "              'carries', 'prgr', 'mis', 'dis', 'crdy', 'crdr', 'recov']\n",
    "\n",
    "DF_METRICS = ['tkl', 'tklw', 'blocks', 'int', 'tkl_int', 'clr', 'err', 'prgp',\n",
    "              'prgc', 'touches', 'carries', 'mis', 'dis', 'crdy', 'crdr', 'recov']\n",
    "\n",
    "GK_METRICS = ['ga', 'saves', 'save', 'cs', 'pka', 'pksv']\n",
    "\n",
    "print(f\"Forward metrics: {len(FW_METRICS)}\")\n",
    "print(f\"Midfielder metrics: {len(MF_METRICS)}\")\n",
    "print(f\"Defender metrics: {len(DF_METRICS)}\")\n",
    "print(f\"Goalkeeper metrics: {len(GK_METRICS)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Preprocessing\n",
    "\n",
    "Before calculating similarity, we need to:\n",
    "1. Filter by position (compare apples to apples)\n",
    "2. Handle missing values\n",
    "3. Standardize features (z-score normalization)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_position(pos):\n",
    "    if pd.isna(pos):\n",
    "        return None\n",
    "    pos = pos.upper()\n",
    "    if 'GK' in pos:\n",
    "        return 'GK'\n",
    "    elif 'DF' in pos:\n",
    "        return 'DF'\n",
    "    elif 'MF' in pos:\n",
    "        return 'MF'\n",
    "    elif 'FW' in pos:\n",
    "        return 'FW'\n",
    "    return None\n",
    "\n",
    "df['main_pos'] = df['pos'].apply(get_position)\n",
    "\n",
    "# Example: Filter forwards\n",
    "forwards = df[df['main_pos'] == 'FW'].copy()\n",
    "print(f\"Forwards: {len(forwards)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardization example\n",
    "available_metrics = [m for m in FW_METRICS if m in forwards.columns]\n",
    "print(f\"Available metrics: {len(available_metrics)}\")\n",
    "\n",
    "# Get metrics data\n",
    "metrics_df = forwards[available_metrics].fillna(0)\n",
    "\n",
    "# Before standardization\n",
    "print(\"\\nBefore Standardization:\")\n",
    "print(metrics_df.describe().loc[['mean', 'std']].round(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply StandardScaler\n",
    "scaler = StandardScaler()\n",
    "metrics_scaled = scaler.fit_transform(metrics_df)\n",
    "metrics_scaled_df = pd.DataFrame(metrics_scaled, columns=available_metrics, index=forwards.index)\n",
    "\n",
    "print(\"After Standardization:\")\n",
    "print(metrics_scaled_df.describe().loc[['mean', 'std']].round(2))"
   ]
  },
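  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of standardization on cosine similarity can be seen on a small made-up table. The next cell is a sketch on toy numbers that are not taken from the dataset. Raw counting stats are all non-negative and one large-scale column can dominate, so every pair of players points in roughly the same direction. After z-scoring, each vector describes how a player deviates from the group average, and contrasting profiles can separate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# Toy data (made up): columns are touches, goals, assists\n",
    "toy = np.array([\n",
    "    [900, 20, 3],    # high-volume scorer\n",
    "    [850, 18, 4],    # similar scorer\n",
    "    [1100, 3, 15],   # playmaker\n",
    "    [1000, 2, 12],   # similar playmaker\n",
    "], dtype=float)\n",
    "\n",
    "raw_sim = cosine_similarity(toy)\n",
    "scaled_sim = cosine_similarity(StandardScaler().fit_transform(toy))\n",
    "\n",
    "# Raw: the large 'touches' column dominates, so all pairs look alike\n",
    "print(\"Raw scorer vs playmaker:   \", round(raw_sim[0, 2], 3))\n",
    "# Scaled: vectors are deviations from the column means, so contrasting profiles separate\n",
    "print(\"Scaled scorer vs playmaker:\", round(scaled_sim[0, 2], 3))\n",
    "print(\"Scaled scorer vs scorer:   \", round(scaled_sim[0, 1], 3))"
   ]
  },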
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Calculate Similarity Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate cosine similarity\n",
    "similarity_matrix = cosine_similarity(metrics_scaled)\n",
    "\n",
    "print(f\"Similarity Matrix Shape: {similarity_matrix.shape}\")\n",
    "print(f\"\\nExample similarities (first 5 players):\")\n",
    "sim_df = pd.DataFrame(similarity_matrix[:5, :5], \n",
    "                      index=forwards['player'].iloc[:5], \n",
    "                      columns=forwards['player'].iloc[:5])\n",
    "sim_df.round(3)"
   ]
  },
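  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A pairwise cosine similarity matrix has a few properties that are worth checking before ranking players. The next cell is a sketch on seeded random data (not the real player table) that checks the shape, symmetry and unit diagonal, and then looks up one player's nearest neighbour by masking out the player's own entry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(6, 4))  # 6 made-up players, 4 made-up standardized metrics\n",
    "\n",
    "S = cosine_similarity(X)\n",
    "\n",
    "# One row and one column per player\n",
    "print(\"Shape:\", S.shape)\n",
    "# Similarity is symmetric: sim(i, j) == sim(j, i)\n",
    "print(\"Symmetric:\", np.allclose(S, S.T))\n",
    "# Every non-zero vector has similarity 1 with itself\n",
    "print(\"Unit diagonal:\", np.allclose(np.diag(S), 1.0))\n",
    "\n",
    "# Nearest neighbour of player 0: mask out the player itself, then take the max\n",
    "row = S[0].copy()\n",
    "row[0] = -np.inf\n",
    "print(\"Most similar to player 0:\", int(np.argmax(row)))"
   ]
  },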
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Find Similar Players Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_similar_players(df, player_name, top_n=5):\n",
    "    \"\"\"Find most similar players to a given player\"\"\"\n",
    "    \n",
    "    # Find player\n",
    "    player_row = df[df['player'].str.lower().str.contains(player_name.lower())]\n",
    "    if len(player_row) == 0:\n",
    "        return None, f\"Player '{player_name}' not found\"\n",
    "    \n",
    "    player = player_row.iloc[0]\n",
    "    player_idx = player_row.index[0]\n",
    "    \n",
    "    # Get position\n",
    "    pos = get_position(player['pos'])\n",
    "    \n",
    "    # Select metrics based on position\n",
    "    if pos == 'FW':\n",
    "        metrics = FW_METRICS\n",
    "    elif pos == 'MF':\n",
    "        metrics = MF_METRICS\n",
    "    elif pos == 'DF':\n",
    "        metrics = DF_METRICS\n",
    "    else:\n",
    "        metrics = GK_METRICS\n",
    "    \n",
    "    # Filter same position players\n",
    "    if pos == 'GK':\n",
    "        filtered_df = df[df['main_pos'] == 'GK'].copy()\n",
    "    else:\n",
    "        filtered_df = df[df['main_pos'] != 'GK'].copy()\n",
    "    \n",
    "    # Get available metrics\n",
    "    available = [m for m in metrics if m in filtered_df.columns]\n",
    "    metrics_data = filtered_df[available].fillna(0)\n",
    "    \n",
    "    # Standardize\n",
    "    scaler = StandardScaler()\n",
    "    scaled = scaler.fit_transform(metrics_data)\n",
    "    \n",
    "    # Calculate similarity\n",
    "    sim_matrix = cosine_similarity(scaled)\n",
    "    \n",
    "    # Get player's position in filtered df\n",
    "    player_pos = filtered_df.index.get_loc(player_idx)\n",
    "    similarities = sim_matrix[player_pos]\n",
    "    \n",
    "    # Get top similar (excluding self)\n",
    "    similar_indices = np.argsort(similarities)[::-1][1:top_n+1]\n",
    "    \n",
    "    results = []\n",
    "    for idx in similar_indices:\n",
    "        sim_player = filtered_df.iloc[idx]\n",
    "        results.append({\n",
    "            'player': sim_player['player'],\n",
    "            'squad': sim_player['squad'],\n",
    "            'pos': sim_player['pos'],\n",
    "            'comp': sim_player['comp'],\n",
    "            'similarity': round(similarities[idx] * 100, 1)\n",
    "        })\n",
    "    \n",
    "    return results, player"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Example: Find Players Similar to Erling Haaland"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "similar, original = find_similar_players(df, 'Haaland', top_n=10)\n",
    "\n",
    "print(f\"Players similar to: {original['player']}\")\n",
    "print(f\"Position: {original['pos']} | Team: {original['squad']}\")\n",
    "print(\"\\n\" + \"=\"*70)\n",
    "print(f\"{'Rank':<6} {'Player':<25} {'Team':<20} {'Similarity':<10}\")\n",
    "print(\"=\"*70)\n",
    "\n",
    "for i, s in enumerate(similar, 1):\n",
    "    print(f\"{i:<6} {s['player']:<25} {s['squad']:<20} {s['similarity']}%\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize comparison\n",
    "metrics_to_compare = ['gls', 'ast', 'xg', 'xag', 'finishing_alpha', 'playmaking_alpha']\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 8), subplot_kw=dict(polar=True))\n",
    "\n",
    "angles = np.linspace(0, 2 * np.pi, len(metrics_to_compare), endpoint=False).tolist()\n",
    "angles += angles[:1]\n",
    "\n",
    "# Original player\n",
    "max_vals = {m: df[m].max() for m in metrics_to_compare}\n",
    "orig_vals = [(original[m] / max_vals[m] * 100) if max_vals[m] != 0 else 0 for m in metrics_to_compare]\n",
    "orig_vals += orig_vals[:1]\n",
    "\n",
    "ax.plot(angles, orig_vals, 'o-', linewidth=2, label=original['player'], color='#3498db')\n",
    "ax.fill(angles, orig_vals, alpha=0.25, color='#3498db')\n",
    "\n",
    "# Most similar player\n",
    "sim_player_name = similar[0]['player']\n",
    "sim_player = df[df['player'] == sim_player_name].iloc[0]\n",
    "sim_vals = [(sim_player[m] / max_vals[m] * 100) if max_vals[m] != 0 else 0 for m in metrics_to_compare]\n",
    "sim_vals += sim_vals[:1]\n",
    "\n",
    "ax.plot(angles, sim_vals, 'o-', linewidth=2, label=f\"{sim_player_name} ({similar[0]['similarity']}%)\", color='#e74c3c')\n",
    "ax.fill(angles, sim_vals, alpha=0.25, color='#e74c3c')\n",
    "\n",
    "ax.set_xticks(angles[:-1])\n",
    "ax.set_xticklabels(metrics_to_compare)\n",
    "ax.set_ylim(0, 100)\n",
    "ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))\n",
    "ax.set_title('Player Comparison Radar Chart')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Example: Find Players Similar to Kevin De Bruyne"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "similar_kdb, original_kdb = find_similar_players(df, 'De Bruyne', top_n=10)\n",
    "\n",
    "if similar_kdb:\n",
    "    print(f\"Players similar to: {original_kdb['player']}\")\n",
    "    print(f\"Position: {original_kdb['pos']} | Team: {original_kdb['squad']}\")\n",
    "    print(\"\\n\" + \"=\"*70)\n",
    "    \n",
    "    for i, s in enumerate(similar_kdb, 1):\n",
    "        print(f\"{i:<6} {s['player']:<25} {s['squad']:<20} {s['similarity']}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Insights\n",
    "\n",
    "1. **Cosine Similarity:** Measures angle between player stat vectors\n",
    "2. **Position Filtering:** Only compare players in similar roles\n",
    "3. **Standardization:** Essential for fair comparison across different scales\n",
    "4. **Interpretability:** High similarity (>90%) indicates very similar playing styles\n",
    "5. **Use Cases:** Scouting, transfer targets, replacement analysis"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}