# 01 - Exploratory Data Analysis (EDA)

**Objective**: Understand the distribution and characteristics of the Pokémon Let's Go battle-winner prediction dataset.

**Dataset**: `matchups.parquet`, generated by `build_battle_winner_dataset.py`

**Date**: 2026-01-21

---

## Table of Contents

1. [Loading the Data](#1-loading-the-data)
2. [Dataset Overview](#2-dataset-overview)
3. [Target Variable Analysis](#3-target-variable-analysis)
4. [Pokémon Stat Distributions](#4-pokémon-stat-distributions)
5. [Type Analysis](#5-type-analysis)
6. [Move Analysis](#6-move-analysis)
7. [Correlations](#7-correlations)
8. [Derived Feature Analysis](#8-derived-feature-analysis)
9. [Conclusions](#9-conclusions)

## 1. Loading the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Paths
BASE_DIR = Path('../data/ml/battle_winner')
RAW_DIR = BASE_DIR / 'raw'
PROCESSED_DIR = BASE_DIR / 'processed'

print("📦 Libraries loaded")

In [None]:
# Load the datasets
df_raw = pd.read_parquet(RAW_DIR / 'matchups.parquet')
df_train = pd.read_parquet(PROCESSED_DIR / 'train.parquet')
df_test = pd.read_parquet(PROCESSED_DIR / 'test.parquet')

print(f"✅ Raw dataset loaded: {df_raw.shape}")
print(f"✅ Train set loaded: {df_train.shape}")
print(f"✅ Test set loaded: {df_test.shape}")

## 2. Dataset Overview

In [None]:
# General information
print("=" * 80)
print("📊 GENERAL INFORMATION")
print("=" * 80)
df_raw.info()

In [None]:
# First rows
print("\n" + "=" * 80)
print("👀 FIRST ROWS")
print("=" * 80)
df_raw.head(10)

In [None]:
# Descriptive statistics
print("\n" + "=" * 80)
print("📈 DESCRIPTIVE STATISTICS")
print("=" * 80)
df_raw.describe()

In [None]:
# Columns and dtypes
print("\n" + "=" * 80)
print("🔍 COLUMNS AND DTYPES")
print("=" * 80)
for col in df_raw.columns:
    print(f"{col:30s} | {str(df_raw[col].dtype):15s} | Null: {df_raw[col].isna().sum():6d}")

## 3. Target Variable Analysis

In [None]:
# Target distribution
print("=" * 80)
print("🎯 TARGET VARIABLE DISTRIBUTION: winner")
print("=" * 80)

target_counts = df_raw['winner'].value_counts().sort_index()
target_pct = df_raw['winner'].value_counts(normalize=True).sort_index() * 100

result = pd.DataFrame({
    'Count': target_counts,
    'Percentage': target_pct
})
result.index = ['Pokémon A wins (0)', 'Pokémon B wins (1)']
print(result)
print(f"\n✅ Class balance: {min(target_pct):.1f}% / {max(target_pct):.1f}%")

In [None]:
# Visualize the target
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
target_counts.plot(kind='bar', ax=axes[0], color=['#3498db', '#e74c3c'])
axes[0].set_title('Winner Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Winner')
axes[0].set_ylabel('Number of samples')
axes[0].set_xticklabels(['Pokémon A', 'Pokémon B'], rotation=0)
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
axes[1].pie(target_counts, labels=['Pokémon A', 'Pokémon B'], autopct='%1.1f%%',
            colors=['#3498db', '#e74c3c'], startangle=90)
axes[1].set_title('Class Proportions', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
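Since `train.parquet` and `test.parquet` are loaded alongside the raw data, it can be useful to confirm that the class balance holds in each split as well. The following is a minimal sketch on toy data; the helper `class_balance` is a hypothetical name, not part of the project, and it assumes each frame has a binary `winner` column.

```python
import pandas as pd

def class_balance(frames: dict[str, pd.DataFrame], target: str = "winner") -> pd.DataFrame:
    """Return the share of each target class (in %) for every named DataFrame."""
    shares = {
        name: frame[target].value_counts(normalize=True).sort_index() * 100
        for name, frame in frames.items()
    }
    # Rows = split name, columns = class label; a class absent from a split becomes 0.
    return pd.DataFrame(shares).T.fillna(0.0).round(1)

# Toy frames standing in for df_raw / df_train / df_test.
toy_raw = pd.DataFrame({"winner": [0, 1, 0, 1, 0, 1, 0, 1]})
toy_train = toy_raw.iloc[:6]
toy_test = toy_raw.iloc[6:]

balance = class_balance({"raw": toy_raw, "train": toy_train, "test": toy_test})
print(balance)
```

In the notebook, the same call would presumably take `{"raw": df_raw, "train": df_train, "test": df_test}`; a large gap between rows would suggest that the split was not stratified.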

## 4. Pokémon Stat Distributions

In [None]:
# Stat distributions for Pokémon A
stat_cols_a = ['a_hp', 'a_attack', 'a_defense', 'a_sp_attack', 'a_sp_defense', 'a_speed']

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, col in enumerate(stat_cols_a):
    df_raw[col].plot(kind='hist', bins=30, ax=axes[i], color='#3498db', edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(axis='y', alpha=0.3)

plt.suptitle('Pokémon A Stats', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Stat distributions for Pokémon B
stat_cols_b = ['b_hp', 'b_attack', 'b_defense', 'b_sp_attack', 'b_sp_defense', 'b_speed']

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, col in enumerate(stat_cols_b):
    df_raw[col].plot(kind='hist', bins=30, ax=axes[i], color='#e74c3c', edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(axis='y', alpha=0.3)

plt.suptitle('Pokémon B Stats', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Compare total stats
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of a_total_stats vs b_total_stats
axes[0].hist(df_raw['a_total_stats'], bins=30, alpha=0.5, label='Pokémon A', color='#3498db')
axes[0].hist(df_raw['b_total_stats'], bins=30, alpha=0.5, label='Pokémon B', color='#e74c3c')
axes[0].set_title('Total Stats Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Total Stats')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Scatter plot: total stats A vs B, colored by winner
scatter = axes[1].scatter(df_raw['a_total_stats'], df_raw['b_total_stats'],
                          c=df_raw['winner'], cmap='coolwarm', alpha=0.3, s=5)
axes[1].set_xlabel('Pokémon A Total Stats')
axes[1].set_ylabel('Pokémon B Total Stats')
axes[1].set_title('Stats A vs B (color = winner)', fontsize=14, fontweight='bold')
# Equality line spanning the observed range instead of hard-coded bounds
lims = [df_raw[['a_total_stats', 'b_total_stats']].min().min(),
        df_raw[['a_total_stats', 'b_total_stats']].max().max()]
axes[1].plot(lims, lims, 'k--', alpha=0.5, label='Equal stats')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Winner (0=A, 1=B)')

plt.tight_layout()
plt.show()

## 5. Type Analysis

In [None]:
# Distribution of primary types
print("=" * 80)
print("🔥 PRIMARY TYPE DISTRIBUTION")
print("=" * 80)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Type 1, Pokémon A
type_a_counts = df_raw['a_type_1'].value_counts()
type_a_counts.plot(kind='barh', ax=axes[0], color='#3498db')
axes[0].set_title('Pokémon A Primary Types', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Count')
axes[0].grid(axis='x', alpha=0.3)

# Type 1, Pokémon B
type_b_counts = df_raw['b_type_1'].value_counts()
type_b_counts.plot(kind='barh', ax=axes[1], color='#e74c3c')
axes[1].set_title('Pokémon B Primary Types', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Count')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Win rate by Pokémon A's primary type
print("\n" + "=" * 80)
print("🏆 WIN RATE BY PRIMARY TYPE (Pokémon A)")
print("=" * 80)

win_rate_by_type_a = df_raw.groupby('a_type_1').agg({
    'winner': ['count', lambda x: (x == 0).sum(), 'mean']
}).round(3)
win_rate_by_type_a.columns = ['Total Matchups', 'Wins (A)', 'Win Rate']
win_rate_by_type_a['Win Rate'] = 1 - win_rate_by_type_a['Win Rate']  # Invert: winner=0 means A wins
win_rate_by_type_a = win_rate_by_type_a.sort_values('Win Rate', ascending=False)
print(win_rate_by_type_a)

# Visualization
fig, ax = plt.subplots(figsize=(12, 8))
colors = ['#2ecc71' if wr > 0.5 else '#e74c3c' for wr in win_rate_by_type_a['Win Rate']]
win_rate_by_type_a['Win Rate'].plot(kind='barh', ax=ax, color=colors)
ax.axvline(x=0.5, color='black', linestyle='--', alpha=0.5)
ax.set_title('Win Rate by Primary Type (Pokémon A)', fontsize=14, fontweight='bold')
ax.set_xlabel('Win Rate')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
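The win-rate table above takes the mean of `winner` and then inverts it. An equivalent formulation, sketched here on toy data, first turns the target into an "A wins" boolean so that no inversion is needed. The helper name `win_rate_by` is hypothetical; the sketch only assumes that `winner == 0` means Pokémon A wins, as stated in the notebook.

```python
import pandas as pd

def win_rate_by(df: pd.DataFrame, key: str, target: str = "winner") -> pd.DataFrame:
    """Win rate of Pokémon A per category of `key`, assuming winner == 0 means A wins."""
    a_wins = df[target].eq(0)  # True where A wins
    out = a_wins.groupby(df[key]).agg(total="count", wins_a="sum", win_rate="mean")
    return out.sort_values("win_rate", ascending=False)

# Toy matchups standing in for df_raw.
toy = pd.DataFrame({
    "a_type_1": ["fire", "fire", "water", "water", "water", "grass"],
    "winner":   [0,      0,      0,       1,       1,       1],
})
rates = win_rate_by(toy, "a_type_1")
print(rates)
```

The same helper could be reused for other categorical columns of the dataset, for example `a_move_name`.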

## 6. Move Analysis

In [None]:
# Most frequently selected moves for Pokémon A
print("=" * 80)
print("⚔️ TOP 15 SELECTED MOVES (Pokémon A)")
print("=" * 80)
move_a_counts = df_raw['a_move_name'].value_counts().head(15)
print(move_a_counts)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

move_a_counts.plot(kind='barh', ax=axes[0], color='#3498db')
axes[0].set_title('Top 15 Moves (Pokémon A)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Number of selections')
axes[0].grid(axis='x', alpha=0.3)

move_b_counts = df_raw['b_move_name'].value_counts().head(15)
move_b_counts.plot(kind='barh', ax=axes[1], color='#e74c3c')
axes[1].set_title('Top 15 Moves (Pokémon B)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Number of selections')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Distribution of move power
print("\n" + "=" * 80)
print("💥 MOVE POWER DISTRIBUTION")
print("=" * 80)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Move power A
df_raw['a_move_power'].plot(kind='hist', bins=20, ax=axes[0], color='#3498db',
                             edgecolor='black', alpha=0.7)
axes[0].set_title('Pokémon A Move Power', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Power')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df_raw['a_move_power'].mean(), color='red', linestyle='--', label=f"Mean: {df_raw['a_move_power'].mean():.0f}")
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Move power B
df_raw['b_move_power'].plot(kind='hist', bins=20, ax=axes[1], color='#e74c3c',
                             edgecolor='black', alpha=0.7)
axes[1].set_title('Pokémon B Move Power', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Power')
axes[1].set_ylabel('Frequency')
axes[1].axvline(df_raw['b_move_power'].mean(), color='blue', linestyle='--', label=f"Mean: {df_raw['b_move_power'].mean():.0f}")
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Priority analysis
print("\n" + "=" * 80)
print("⚡ MOVE PRIORITY DISTRIBUTION")
print("=" * 80)

print("\nPokémon A priorities:")
print(df_raw['a_move_priority'].value_counts().sort_index())

print("\nPokémon B priorities:")
print(df_raw['b_move_priority'].value_counts().sort_index())

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df_raw['a_move_priority'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='#3498db')
axes[0].set_title('Pokémon A Move Priority', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Priority')
axes[0].set_ylabel('Frequency')
axes[0].grid(axis='y', alpha=0.3)

df_raw['b_move_priority'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color='#e74c3c')
axes[1].set_title('Pokémon B Move Priority', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Priority')
axes[1].set_ylabel('Frequency')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Correlations

In [None]:
# Correlation matrix for the numeric features
numeric_cols = [
    'a_hp', 'a_attack', 'a_defense', 'a_sp_attack', 'a_sp_defense', 'a_speed',
    'b_hp', 'b_attack', 'b_defense', 'b_sp_attack', 'b_sp_defense', 'b_speed',
    'a_move_power', 'a_move_priority', 'a_move_stab', 'a_move_type_mult',
    'b_move_power', 'b_move_priority', 'b_move_stab', 'b_move_type_mult',
    'speed_diff', 'hp_diff', 'a_total_stats', 'b_total_stats', 'a_moves_first',
    'winner'
]

# Keep only the columns that actually exist
numeric_cols = [col for col in numeric_cols if col in df_raw.columns]

corr_matrix = df_raw[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(18, 15))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8}, ax=ax,
            annot_kws={"size": 8})
ax.set_title('Correlation Matrix of Numeric Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation with the target variable
print("=" * 80)
print("🎯 CORRELATION WITH winner")
print("=" * 80)
target_corr = corr_matrix['winner'].sort_values(ascending=False)
print(target_corr)

fig, ax = plt.subplots(figsize=(10, 8))
target_corr.drop('winner').plot(kind='barh', ax=ax, color='#16a085')
ax.set_title('Feature Correlation with winner', fontsize=14, fontweight='bold')
ax.set_xlabel('Correlation')
ax.set_ylabel('Feature')
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
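Because `winner` is encoded as 0 = A and 1 = B, a positive correlation means the feature favors Pokémon B and a negative one favors Pokémon A. Sorting by signed value therefore splits the strongest features across both ends of the chart. The sketch below, on toy data, ranks features by absolute correlation instead and records which side each one favors. The helper `rank_target_correlations` and the `noise` column are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def rank_target_correlations(df: pd.DataFrame, target: str = "winner") -> pd.DataFrame:
    """Rank numeric features by |Pearson correlation| with a 0/1 target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    table = pd.DataFrame({"corr": corr, "abs_corr": corr.abs()})
    # winner = 1 means B wins, so a positive correlation favours B.
    table["favours"] = table["corr"].map(
        lambda c: "B" if c > 0 else ("A" if c < 0 else "-")
    )
    return table.sort_values("abs_corr", ascending=False)

# Toy data: A tends to win (winner = 0) when speed_diff (A - B) is positive.
toy = pd.DataFrame({
    "speed_diff": [30, 20, 10, -10, -20, -30],
    "noise":      [1, -1, 1, -1, 1, -1],
    "winner":     [0, 0, 0, 1, 1, 1],
})
ranked = rank_target_correlations(toy)
print(ranked)
```

In the notebook, the function could presumably be applied to `df_raw[numeric_cols]`.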

## 8. Derived Feature Analysis

In [None]:
# speed_diff and its impact on the winner
print("=" * 80)
print("⚡ IMPACT OF speed_diff ON THE WINNER")
print("=" * 80)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot of speed_diff by winner
df_raw.boxplot(column='speed_diff', by='winner', ax=axes[0])
axes[0].set_title('Speed Diff by Winner', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Winner (0=A, 1=B)')
axes[0].set_ylabel('Speed Diff (A - B)')
plt.suptitle('')  # Remove the automatic title

# Distribution of a_moves_first by winner
moves_first_win = df_raw.groupby(['a_moves_first', 'winner']).size().unstack(fill_value=0)
moves_first_win.plot(kind='bar', ax=axes[1], color=['#3498db', '#e74c3c'])
axes[1].set_title('Winner by Who Attacks First', fontsize=12, fontweight='bold')
axes[1].set_xlabel('A attacks first (0=No, 1=Yes)')
axes[1].set_ylabel('Number of matchups')
axes[1].legend(['A wins', 'B wins'])
axes[1].set_xticklabels(['No', 'Yes'], rotation=0)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Impact of STAB and the type multiplier
print("\n" + "=" * 80)
print("🔥 IMPACT OF STAB AND TYPE MULTIPLIER")
print("=" * 80)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# STAB A by winner
stab_a_win = df_raw.groupby(['a_move_stab', 'winner']).size().unstack(fill_value=0)
stab_a_win.plot(kind='bar', ax=axes[0, 0], color=['#3498db', '#e74c3c'])
axes[0, 0].set_title('Impact of Pokémon A STAB on Winner', fontsize=12)
axes[0, 0].set_xlabel('STAB A')
axes[0, 0].legend(['A wins', 'B wins'])
axes[0, 0].grid(axis='y', alpha=0.3)

# STAB B by winner
stab_b_win = df_raw.groupby(['b_move_stab', 'winner']).size().unstack(fill_value=0)
stab_b_win.plot(kind='bar', ax=axes[0, 1], color=['#3498db', '#e74c3c'])
axes[0, 1].set_title('Impact of Pokémon B STAB on Winner', fontsize=12)
axes[0, 1].set_xlabel('STAB B')
axes[0, 1].legend(['A wins', 'B wins'])
axes[0, 1].grid(axis='y', alpha=0.3)

# Type multiplier A distribution by winner
df_raw.boxplot(column='a_move_type_mult', by='winner', ax=axes[1, 0])
axes[1, 0].set_title('Type Multiplier A by Winner', fontsize=12)
axes[1, 0].set_xlabel('Winner (0=A, 1=B)')
plt.suptitle('')

# Type multiplier B distribution by winner
df_raw.boxplot(column='b_move_type_mult', by='winner', ax=axes[1, 1])
axes[1, 1].set_title('Type Multiplier B by Winner', fontsize=12)
axes[1, 1].set_xlabel('Winner (0=A, 1=B)')
plt.suptitle('')

plt.tight_layout()
plt.show()

In [None]:
# Missing values
print("=" * 80)
print("❓ MISSING VALUE ANALYSIS")
print("=" * 80)

missing = df_raw.isnull().sum()
missing_pct = (missing / len(df_raw)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing Count', ascending=False)

if missing_df['Missing Count'].sum() == 0:
    print("\n✅ No missing values detected!")
else:
    print(missing_df[missing_df['Missing Count'] > 0])
    print(f"\n⚠️ Total missing values: {missing_df['Missing Count'].sum()}")

## 9. Conclusions

### Key Observations

#### 1. **Class Balance**
- The dataset is well balanced (~50/50) between Pokémon A winning and Pokémon B winning.
- This balance is expected because every matchup is symmetric (A vs. B is B vs. A with the label flipped).
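The symmetry claim can be checked rather than assumed. The sketch below measures the share of rows whose mirrored matchup (B vs. A) is present with the opposite label. The identifier columns `a_id` and `b_id` are hypothetical, since this notebook does not show which columns identify each Pokémon, and the helper `mirrored_fraction` is an illustration only.

```python
import pandas as pd

def mirrored_fraction(df: pd.DataFrame, a_key: str = "a_id", b_key: str = "b_id",
                      target: str = "winner") -> float:
    """Share of rows whose mirrored matchup (B vs A) exists with the flipped label."""
    base = df[[a_key, b_key, target]]
    # Swap the two sides and flip the label: (A, B, w) -> (B, A, 1 - w).
    mirror = (
        base.rename(columns={a_key: b_key, b_key: a_key})
            .assign(**{target: lambda m: 1 - m[target]})
            .drop_duplicates()
    )
    merged = base.merge(mirror, on=[a_key, b_key, target], how="left", indicator=True)
    return float((merged["_merge"] == "both").mean())

# Toy data: the pair (1, 2) is consistent, the pair (1, 3) is not.
toy = pd.DataFrame({
    "a_id":   [1, 2, 1, 3],
    "b_id":   [2, 1, 3, 1],
    "winner": [0, 1, 1, 1],
})
fraction = mirrored_fraction(toy)
print(f"Mirrored fraction: {fraction:.2f}")
```

A value close to 1.0 would support the symmetry explanation. A lower value might mean either that mirrored matchups are missing or that the selected move differs between the two directions.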

#### 2. **Features Most Correlated with the Winner**
- `a_moves_first`: if A attacks first (through speed or move priority), it is more likely to win.
- `speed_diff`: the speed difference directly affects who attacks first.
- `a_move_type_mult` / `b_move_type_mult`: type advantage is crucial.
- `a_total_stats` / `b_total_stats`: total stats reflect overall strength.

Because `winner = 1` means B wins, features that favor A correlate negatively with `winner`.

#### 3. **Impact of Priority**
- Most moves have priority 0 (normal).
- Priority moves (+1, +2) are rare but can turn a battle around.
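The interplay between priority and speed described above can be summarized as a simple rule: the higher priority bracket acts first, and speed only breaks ties within the same bracket. The following is a minimal sketch of that rule, not the logic actually used by `build_battle_winner_dataset.py` (which is not shown here). Resolving speed ties in favor of B is an arbitrary assumption made for illustration.

```python
def a_moves_first(a_speed: int, b_speed: int,
                  a_priority: int = 0, b_priority: int = 0) -> int:
    """Return 1 if Pokémon A acts first, else 0.

    A higher move priority always acts first; speed is only compared
    when both moves share the same priority. Speed ties return 0 here
    (an assumption made for this sketch).
    """
    if a_priority != b_priority:
        return int(a_priority > b_priority)
    return int(a_speed > b_speed)

# A slower Pokémon using a +1 priority move still acts first.
print(a_moves_first(a_speed=50, b_speed=100, a_priority=1))
# With equal priority, the faster Pokémon acts first.
print(a_moves_first(a_speed=50, b_speed=100))
```

This is why `speed_diff` alone cannot fully explain `a_moves_first`: a rare priority move overrides any speed advantage.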

#### 4. **STAB and Type Effectiveness**
- STAB (Same Type Attack Bonus) multiplies a move's damage by 1.5.
- The type multiplier (0× for immunities, otherwise 0.25× to 4×) can completely change the outcome.
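To illustrate how these two multipliers combine, here is a minimal sketch using a deliberately partial type chart. The `TYPE_CHART` dictionary and both helper functions are hypothetical names, and the exact way the dataset computes `a_move_stab` and `a_move_type_mult` is not shown in this notebook. Pairs missing from the chart default to 1.0.

```python
# Partial type chart for illustration only: (move type, defender type) -> multiplier.
TYPE_CHART = {
    ("electric", "water"): 2.0,
    ("electric", "flying"): 2.0,
    ("electric", "grass"): 0.5,
    ("electric", "ground"): 0.0,  # immunity
    ("water", "fire"): 2.0,
    ("water", "grass"): 0.5,
}

def type_multiplier(move_type, defender_types):
    """Multiply the effectiveness against each of the defender's types."""
    mult = 1.0
    for defender_type in defender_types:
        mult *= TYPE_CHART.get((move_type, defender_type), 1.0)
    return mult

def effective_power(power, move_type, attacker_types, defender_types):
    """Base power scaled by STAB (x1.5) and by the type multiplier."""
    stab = 1.5 if move_type in attacker_types else 1.0
    return power * stab * type_multiplier(move_type, defender_types)

# Electric move against a Water/Flying defender: 2.0 * 2.0 = 4.0
print(type_multiplier("electric", ("water", "flying")))
# An immunity cancels everything else: 2.0 * 0.0 = 0.0
print(type_multiplier("electric", ("water", "ground")))
# Power-90 Electric move, Electric attacker, Water defender: 90 * 1.5 * 2.0
print(effective_power(90, "electric", ("electric",), ("water",)))
```

The product of the two factors is what makes a STAB × type-multiplier interaction feature a natural candidate for the next notebook.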

#### 5. **Data Quality**
- ✅ No missing values
- ✅ Consistent types
- ✅ Balanced distribution

---

### Recommendations for Next Steps

1. **Feature Engineering** (notebook 02):
   - Encode the types (one-hot or embeddings)
   - Create ratios (A stats / B stats)
   - Add interaction features (STAB × type_mult)

2. **Modeling** (notebook 03):
   - Start with a Random Forest (robust, interpretable)
   - Try XGBoost/LightGBM for performance
   - Use stratified cross-validation
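The feature-engineering ideas listed above could be prototyped as follows. This is a sketch on toy data, not the content of notebook 02. The new column names (`total_stats_ratio`, `a_stab_x_type`, `b_stab_x_type`) are hypothetical, and the sketch assumes that `a_move_stab` / `b_move_stab` hold a multiplier (1.0 or 1.5) rather than a 0/1 flag.

```python
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a stats ratio, STAB x type interactions, and one-hot primary types."""
    out = df.copy()
    # Ratio of total stats (A / B); values above 1 mean A is stronger overall.
    out["total_stats_ratio"] = out["a_total_stats"] / out["b_total_stats"]
    # Interaction terms: STAB multiplier times type multiplier, per side.
    out["a_stab_x_type"] = out["a_move_stab"] * out["a_move_type_mult"]
    out["b_stab_x_type"] = out["b_move_stab"] * out["b_move_type_mult"]
    # One-hot encode the primary types.
    return pd.get_dummies(out, columns=["a_type_1", "b_type_1"], dtype=int)

# Toy rows standing in for df_raw.
toy = pd.DataFrame({
    "a_total_stats": [500, 300],
    "b_total_stats": [250, 600],
    "a_move_stab": [1.5, 1.0],
    "a_move_type_mult": [2.0, 0.5],
    "b_move_stab": [1.0, 1.5],
    "b_move_type_mult": [1.0, 2.0],
    "a_type_1": ["fire", "water"],
    "b_type_1": ["grass", "fire"],
})
features = add_derived_features(toy)
print(features.columns.tolist())
```

If the STAB columns turn out to be 0/1 flags, the interaction would need to map the flag to 1.0 or 1.5 first. Any encoder should also be fitted on the training split only to avoid leaking test information.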

---

**✅ Dataset validated and ready for feature engineering!**