# 01 - Exploratory Data Analysis (EDA)

**Objective**: Understand the distribution and characteristics of the Pokémon Let's Go battle dataset.

**Dataset**: `battle_samples.parquet`, generated by `build_classification_dataset.py`

**Date**: 2026-01-20

---

## Table of Contents

1. [Loading the Data](#1-loading-the-data)
2. [Dataset Overview](#2-dataset-overview)
3. [Target Variable Analysis](#3-target-variable-analysis)
4. [Distribution of Numeric Features](#4-distribution-of-numeric-features)
5. [Categorical Feature Analysis](#5-categorical-feature-analysis)
6. [Correlations](#6-correlations)
7. [Missing Values](#7-missing-values)
8. [Analysis by Attack Type](#8-analysis-by-attack-type)
9. [Conclusions](#9-conclusions)

## 1. Loading the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Paths
BASE_DIR = Path('../data/ml')
RAW_DIR = BASE_DIR / 'raw'
PROCESSED_DIR = BASE_DIR / 'processed'

print("📦 Libraries loaded")

In [None]:
# Load the datasets
df_raw = pd.read_parquet(RAW_DIR / 'battle_samples.parquet')
df_train = pd.read_parquet(PROCESSED_DIR / 'train.parquet')
df_test = pd.read_parquet(PROCESSED_DIR / 'test.parquet')

print(f"✅ Raw dataset loaded: {df_raw.shape}")
print(f"✅ Train set loaded: {df_train.shape}")
print(f"✅ Test set loaded: {df_test.shape}")

## 2. Dataset Overview

In [None]:
# General information
print("=" * 80)
print("📊 GENERAL INFORMATION")
print("=" * 80)
df_raw.info()

In [None]:
# First rows
print("\n" + "=" * 80)
print("👀 FIRST ROWS")
print("=" * 80)
df_raw.head(10)

In [None]:
# Descriptive statistics
print("\n" + "=" * 80)
print("📈 DESCRIPTIVE STATISTICS")
print("=" * 80)
df_raw.describe()

In [None]:
# Columns and types
print("\n" + "=" * 80)
print("🔍 COLUMNS AND TYPES")
print("=" * 80)
for col in df_raw.columns:
    print(f"{col:30s} | {str(df_raw[col].dtype):15s} | Null: {df_raw[col].isna().sum():6d}")

## 3. Target Variable Analysis

In [None]:
# Target distribution
print("=" * 80)
print("🎯 TARGET VARIABLE DISTRIBUTION: is_effective")
print("=" * 80)

target_counts = df_raw['is_effective'].value_counts().sort_index()
target_pct = df_raw['is_effective'].value_counts(normalize=True).sort_index() * 100

result = pd.DataFrame({
    'Count': target_counts,
    'Percentage': target_pct
})
result.index = ['Not Effective (0)', 'Effective (1)']
print(result)
print(f"\n✅ Class balance: {min(target_pct):.1f}% / {max(target_pct):.1f}%")

In [None]:
# Target visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
target_counts.plot(kind='bar', ax=axes[0], color=['#e74c3c', '#2ecc71'])
axes[0].set_title('Distribution of is_effective', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Number of samples')
axes[0].set_xticklabels(['Not Effective', 'Effective'], rotation=0)
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
axes[1].pie(target_counts, labels=['Not Effective', 'Effective'], autopct='%1.1f%%',
            colors=['#e74c3c', '#2ecc71'], startangle=90)
axes[1].set_title('Class Proportions', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 4. Distribution of Numeric Features

In [None]:
# Distribution of the type multiplier
print("=" * 80)
print("⚡ TYPE_MULTIPLIER DISTRIBUTION")
print("=" * 80)
multiplier_counts = df_raw['type_multiplier'].value_counts().sort_index()
print(multiplier_counts)
print(f"\nUnique values: {sorted(df_raw['type_multiplier'].unique())}")

In [None]:
# Multiplier visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
df_raw['type_multiplier'].plot(kind='hist', bins=20, ax=axes[0], color='#3498db', edgecolor='black')
axes[0].set_title('Distribution of type_multiplier', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Type Multiplier')
axes[0].set_ylabel('Frequency')
axes[0].grid(axis='y', alpha=0.3)

# Bar plot per exact value
multiplier_counts.plot(kind='bar', ax=axes[1], color='#9b59b6')
axes[1].set_title('Count per Multiplier Value', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Multiplier')
axes[1].set_ylabel('Number of samples')
# Rotate the tick labels without re-setting them
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Distribution of the combat stats
stat_cols = ['attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'hp']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for i, col in enumerate(stat_cols):
    df_raw[col].plot(kind='hist', bins=30, ax=axes[i], color='#1abc9c', edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Distribution of move power
fig, ax = plt.subplots(figsize=(12, 5))
df_raw['move_power'].plot(kind='hist', bins=30, ax=ax, color='#e67e22', edgecolor='black', alpha=0.7)
ax.set_title('Distribution of move_power', fontsize=14, fontweight='bold')
ax.set_xlabel('Move power')
ax.set_ylabel('Frequency')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Min power: {df_raw['move_power'].min()}")
print(f"Max power: {df_raw['move_power'].max()}")
print(f"Mean power: {df_raw['move_power'].mean():.2f}")
print(f"Median power: {df_raw['move_power'].median()}")

## 5. Categorical Feature Analysis

In [None]:
# Distribution of move categories
print("=" * 80)
print("🥊 MOVE CATEGORY DISTRIBUTION")
print("=" * 80)
category_counts = df_raw['move_category'].value_counts()
print(category_counts)

fig, ax = plt.subplots(figsize=(10, 5))
category_counts.plot(kind='bar', ax=ax, color='#f39c12')
ax.set_title('Distribution of Move Categories', fontsize=14, fontweight='bold')
ax.set_xlabel('Category')
ax.set_ylabel('Number of samples')
ax.tick_params(axis='x', rotation=45)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Distribution of attacking Pokémon types
print("\n" + "=" * 80)
print("🔥 TOP 10 ATTACKER TYPES")
print("=" * 80)
type_counts = df_raw['attacker_type'].value_counts().head(10)
print(type_counts)

fig, ax = plt.subplots(figsize=(12, 6))
type_counts.plot(kind='barh', ax=ax, color='#e74c3c')
ax.set_title('Top 10 Attacking Pokémon Types', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of samples')
ax.set_ylabel('Type')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Distribution of move types
print("\n" + "=" * 80)
print("⚔️ TOP 10 MOVE TYPES")
print("=" * 80)
move_type_counts = df_raw['move_type'].value_counts().head(10)
print(move_type_counts)

fig, ax = plt.subplots(figsize=(12, 6))
move_type_counts.plot(kind='barh', ax=ax, color='#2ecc71')
ax.set_title('Top 10 Move Types', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of samples')
ax.set_ylabel('Move Type')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Distribution of STAB (Same Type Attack Bonus)
print("\n" + "=" * 80)
print("⚡ STAB DISTRIBUTION")
print("=" * 80)
# Sort by index (False, True) so the hard-coded tick labels below match the bars;
# value_counts() alone orders by frequency, which could swap them.
stab_counts = df_raw['has_stab'].value_counts().sort_index()
print(stab_counts)
print(f"\nSTAB ratio: {stab_counts.get(True, 0) / len(df_raw) * 100:.1f}%")

fig, ax = plt.subplots(figsize=(8, 5))
stab_counts.plot(kind='bar', ax=ax, color=['#95a5a6', '#f1c40f'])
ax.set_title('STAB Distribution (Same Type Attack Bonus)', fontsize=14, fontweight='bold')
ax.set_xlabel('Has STAB')
ax.set_ylabel('Number of samples')
ax.set_xticklabels(['No STAB', 'Has STAB'], rotation=0)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Correlations

In [None]:
# Correlation matrix for the numeric features
numeric_cols = ['attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'hp', 
                'move_power', 'type_multiplier', 'is_effective']

corr_matrix = df_raw[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title('Correlation Matrix of Numeric Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation with the target variable
print("=" * 80)
print("🎯 CORRELATION WITH is_effective")
print("=" * 80)
target_corr = corr_matrix['is_effective'].sort_values(ascending=False)
print(target_corr)

fig, ax = plt.subplots(figsize=(10, 6))
target_corr.drop('is_effective').plot(kind='barh', ax=ax, color='#16a085')
ax.set_title('Correlation of Features with is_effective', fontsize=14, fontweight='bold')
ax.set_xlabel('Correlation')
ax.set_ylabel('Feature')
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

## 7. Missing Values

In [None]:
# Missing value analysis
print("=" * 80)
print("❓ MISSING VALUE ANALYSIS")
print("=" * 80)

missing = df_raw.isnull().sum()
missing_pct = (missing / len(df_raw)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing Count', ascending=False)

print(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("\n✅ No missing values detected!")
else:
    print(f"\n⚠️ Total missing values: {missing_df['Missing Count'].sum()}")

## 8. Analysis by Attack Type

In [None]:
# Effectiveness by move category (physical vs special)
print("=" * 80)
print("🥊 EFFECTIVENESS BY MOVE CATEGORY")
print("=" * 80)

effectiveness_by_category = df_raw.groupby('move_category')['is_effective'].agg(['mean', 'count'])
effectiveness_by_category.columns = ['Effectiveness Rate', 'Sample Count']
effectiveness_by_category['Effectiveness Rate'] = effectiveness_by_category['Effectiveness Rate'] * 100
print(effectiveness_by_category)

fig, ax = plt.subplots(figsize=(10, 5))
effectiveness_by_category['Effectiveness Rate'].plot(kind='bar', ax=ax, color='#8e44ad')
ax.set_title('Effectiveness Rate by Move Category', fontsize=14, fontweight='bold')
ax.set_xlabel('Category')
ax.set_ylabel('Effectiveness Rate (%)')
ax.tick_params(axis='x', rotation=45)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Box plots: stats by effectiveness
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for i, col in enumerate(stat_cols):
    df_raw.boxplot(column=col, by='is_effective', ax=axes[i])
    axes[i].set_title(f'{col} by Effectiveness')
    axes[i].set_xlabel('is_effective')
    axes[i].set_ylabel(col)
    axes[i].set_xticklabels(['Not Effective', 'Effective'])

plt.suptitle('Distribution of Stats by Effectiveness', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Impact of STAB on effectiveness
print("\n" + "=" * 80)
print("⚡ IMPACT OF STAB ON EFFECTIVENESS")
print("=" * 80)

stab_effectiveness = df_raw.groupby('has_stab')['is_effective'].agg(['mean', 'count'])
stab_effectiveness.columns = ['Effectiveness Rate', 'Sample Count']
stab_effectiveness['Effectiveness Rate'] = stab_effectiveness['Effectiveness Rate'] * 100
stab_effectiveness.index = ['No STAB', 'Has STAB']
print(stab_effectiveness)

fig, ax = plt.subplots(figsize=(8, 5))
stab_effectiveness['Effectiveness Rate'].plot(kind='bar', ax=ax, color=['#95a5a6', '#f1c40f'])
ax.set_title('Effectiveness Rate with/without STAB', fontsize=14, fontweight='bold')
ax.set_xlabel('STAB Status')
ax.set_ylabel('Effectiveness Rate (%)')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 9. Conclusions

### Key Observations

#### 1. **Class Balance**
- The dataset is close to evenly balanced (~50/50) between the effective and not-effective classes
- This balance comes from the sampling strategy (keep all effective, sample 15% not effective)
- No class-imbalance problem is expected
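
The sampling strategy described above (keep every effective sample, keep 15% of the not-effective ones) can be sketched as follows. This is a minimal illustration on synthetic data: the `downsample_negatives` helper, its `keep_frac` and `seed` parameters, and the toy class counts are assumptions, not the code from `build_classification_dataset.py`.

```python
import pandas as pd


def downsample_negatives(df, target="is_effective", keep_frac=0.15, seed=42):
    """Keep all positive rows and a random fraction of the negative rows.

    Hypothetical helper: the real dataset builder may sample differently.
    """
    positives = df[df[target] == 1]
    negatives = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    # Shuffle so the two classes are not stored as contiguous blocks.
    return pd.concat([positives, negatives]).sample(frac=1.0, random_state=seed)


# Toy data: 150 effective rows and 1000 not-effective rows.
toy = pd.DataFrame({"is_effective": [1] * 150 + [0] * 1000})
balanced = downsample_negatives(toy)
print(balanced["is_effective"].value_counts().sort_index())
```

With these toy counts, both classes end up with 150 rows, which is the kind of ~50/50 split reported above.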

#### 2. **Type Multiplier**
- Observed values: 0.0, 0.25, 0.5, 1.0, 2.0, 4.0
- This feature is **highly correlated with the target** by design: `is_effective = type_multiplier >= 2`, so the target is a deterministic function of it
- The multiplier will therefore most likely be the most important feature in the model
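
To make this relationship concrete, the sketch below applies the labelling rule quoted above to each multiplier value. It also shows one plausible origin of the six observed values, under the assumption (not stated in this notebook) that the multiplier is the product of two per-type factors taken from {0, 0.5, 1, 2}, as in the standard type chart for dual-type defenders. The names `SINGLE_TYPE_FACTORS` and `is_effective` are illustrative.

```python
from itertools import product

# Assumption: each defending type contributes one of these factors.
SINGLE_TYPE_FACTORS = (0.0, 0.5, 1.0, 2.0)

# All multipliers reachable against a defender with two types.
possible_multipliers = sorted(
    {a * b for a, b in product(SINGLE_TYPE_FACTORS, repeat=2)}
)
print(possible_multipliers)


def is_effective(type_multiplier):
    """Labelling rule quoted in the conclusions: effective means multiplier >= 2."""
    return int(type_multiplier >= 2)


# Label assigned to each possible multiplier value.
labels = {m: is_effective(m) for m in possible_multipliers}
print(labels)
```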

#### 3. **Numeric Features (Stats)**
- Approximately normal distribution for most stats (attack, defense, sp_attack, sp_defense)
- No strong correlation between the stats and is_effective
- The attacking Pokémon's stats are not direct predictors of effectiveness
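
The shape of these distributions can also be checked numerically rather than by eye. The sketch below computes skewness and excess kurtosis with pandas (both are close to 0 for a normal distribution). The synthetic columns and the 0.5 cut-off are illustrative assumptions; in this notebook the same calls would be applied to `df_raw[stat_cols]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for two stat columns: one symmetric, one right-skewed.
toy = pd.DataFrame({
    "attack": rng.normal(loc=80, scale=20, size=5000),
    "hp": rng.exponential(scale=60, size=5000),
})

shape = pd.DataFrame({
    "skew": toy.skew(),             # about 0 for a symmetric distribution
    "excess_kurtosis": toy.kurt(),  # about 0 for a normal distribution
})
# Arbitrary illustrative cut-off for "roughly symmetric".
shape["roughly_symmetric"] = shape["skew"].abs() < 0.5
print(shape.round(2))
```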

#### 4. **Categorical Features**
- Balanced split between physical and special moves
- Good diversity of Pokémon types and move types
- STAB does not appear to have a major impact on effectiveness (weak correlation)
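
As a consistency check on the STAB flag, the sketch below recomputes it from the type columns and compares effectiveness rates per group. It assumes that `has_stab` simply means `move_type == attacker_type` (the dataset builder may handle dual-type attackers differently) and uses a handful of made-up rows.

```python
import pandas as pd

# Made-up rows using the notebook's column names.
toy = pd.DataFrame({
    "attacker_type": ["fire", "fire", "water", "grass", "water", "grass"],
    "move_type":     ["fire", "normal", "water", "poison", "ice", "grass"],
    "is_effective":  [1, 0, 0, 1, 1, 0],
})

# Assumed definition of STAB: the move shares the attacker's type.
toy["has_stab"] = toy["move_type"] == toy["attacker_type"]

# Effectiveness rate (%) and sample count per STAB group (False first, then True).
summary = toy.groupby("has_stab")["is_effective"].agg(rate="mean", n="count")
summary["rate"] *= 100
print(summary)
```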

#### 5. **Data Quality**
- ✅ No missing values
- ✅ Consistent types
- ✅ No extreme outliers
- ✅ Dataset ready for feature engineering
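
The outlier check is not shown in the cells above, so here is one way it could be made explicit. The sketch uses the common 1.5 × IQR rule, which is an assumption (the notebook does not say how outliers were assessed), and the `iqr_outlier_counts` helper and toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(df, cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="n_outliers")


# Toy data: `speed` contains one obviously extreme value.
toy = pd.DataFrame({
    "attack": [50, 55, 60, 65, 70, 75, 80],
    "speed":  [40, 45, 50, 55, 60, 65, 500],
})

outliers = iqr_outlier_counts(toy, ["attack", "speed"])
n_missing = int(toy.isna().sum().sum())
print(outliers)
print("Missing values:", n_missing)
```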

---

### Recommendations for Next Steps

1. **Feature Engineering** (notebook 02):
   - Encode the categorical features (one-hot or label encoding)
   - Create interactions (attack × move_power, type_multiplier × has_stab)
   - Normalize the stats if needed (StandardScaler)

2. **Modeling** (notebook 03):
   - Start with a simple model (Logistic Regression) as a baseline
   - Test Random Forest (works well with encoded categorical features and needs no scaling)
   - Try XGBoost/LightGBM for the best performance

3. **Validation**:
   - Use the existing 80/20 split (train.parquet / test.parquet)
   - K-fold cross-validation on the train set
   - Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
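
For reference, the label-based metrics listed above can be computed directly from the confusion counts. The sketch below is a plain-Python illustration with made-up labels; the `binary_metrics` helper is hypothetical, and ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions for eight test samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```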

---

**✅ Dataset validated and ready for feature engineering!**
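
As a possible starting point for the feature engineering step recommended above, the sketch below builds the two suggested interactions, standardises one stat by hand, and one-hot encodes a categorical column with pandas. The rows and the category values (`physical`, `special`) are made up, and the derived column names are illustrative assumptions.

```python
import pandas as pd

# Made-up rows using the notebook's column names.
toy = pd.DataFrame({
    "attack": [50.0, 80.0, 110.0],
    "move_power": [40, 90, 120],
    "type_multiplier": [0.5, 1.0, 2.0],
    "has_stab": [True, False, True],
    "move_category": ["physical", "special", "physical"],
})

features = toy.copy()

# Interaction features suggested in the recommendations.
features["attack_x_power"] = features["attack"] * features["move_power"]
features["multiplier_x_stab"] = (
    features["type_multiplier"] * features["has_stab"].astype(int)
)

# Standardise a numeric column by hand (zero mean, unit variance, ddof=0).
col = features["attack"]
features["attack_scaled"] = (col - col.mean()) / col.std(ddof=0)

# One-hot encode the categorical column.
features = pd.get_dummies(features, columns=["move_category"])
print(features.columns.tolist())
```

In a real pipeline the mean and standard deviation would be computed on the train set only and then reused on the test set, so that no test information leaks into the scaling.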