# Optimized Modeling

## Lessons from previous versions

- **V2** (all features): local R² 0.80, submission score 0.002
- **V3 all_temporal**: submission score **0.05** (best)
- Other strategies: -0.2

## Strategy

1. Temporal features with **very low** geographic correlation (< 0.15)
2. Simple models: Ridge, Lasso, ElasticNet
3. XGBoost with strong regularization
4. Model ensemble

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import cross_val_score, GroupKFold
from sklearn.cluster import KMeans

import xgboost as xgb
import lightgbm as lgb

print("Imports OK!")

Imports OK!


## Step 1: Load the data

In [2]:
train_df = pd.read_csv("../data/processed/merged_training.csv")
test_df = pd.read_csv("../data/processed/merged_validation.csv")

TARGET_COLS = ['Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']
TARGET_NAMES = ['Alkalinity', 'Conductivity', 'Phosphorus']

print(f"Training : {train_df.shape}")
print(f"Test : {test_df.shape}")

Training : (9319, 79)
Test : (200, 79)


## Step 2: Select the least geographic features

In [3]:
# Columns to exclude
EXCLUDE_COLS = ['Latitude', 'Longitude', 'Sample Date'] + TARGET_COLS

# All candidate features
all_features = [c for c in train_df.columns if c not in EXCLUDE_COLS]

# Absolute Pearson correlation of each numeric feature with latitude
# (longitude is not checked here)
correlations = {}
for col in all_features:
    # Skip non-numeric columns explicitly instead of swallowing every exception
    if not pd.api.types.is_numeric_dtype(train_df[col]):
        continue
    corr = np.abs(train_df[col].corr(train_df['Latitude']))
    if not np.isnan(corr):
        correlations[col] = corr

# Features with correlation < 0.15 (very low)
low_geo_features = [f for f, c in correlations.items() if c < 0.15]

# Features with correlation < 0.10 (extremely low)
very_low_geo_features = [f for f, c in correlations.items() if c < 0.10]

print(f"All features: {len(all_features)}")
print(f"Correlation < 0.15: {len(low_geo_features)}")
print(f"Correlation < 0.10: {len(very_low_geo_features)}")

print("\nFeatures with very low correlation (< 0.10):")
for f in very_low_geo_features:
    print(f"  {f}: {correlations[f]:.3f}")

All features: 73
Correlation < 0.15: 56
Correlation < 0.10: 36

Features with very low correlation (< 0.10):
  blue: 0.003
  green: 0.003
  red: 0.016
  nir: 0.082
  swir16: 0.094
  blue_std: 0.001
  green_std: 0.029
  red_std: 0.010
  swir16_std: 0.006
  swir22_std: 0.002
  NDWI: 0.096
  MNDWI: 0.100
  NDWI_std: 0.082
  MNDWI_std: 0.051
  aet: 0.014
  ppt: 0.009
  ppt_lag1: 0.013
  ppt_lag2: 0.017
  ppt_lag3: 0.008
  ppt_sum4: 0.016
  ppt_mean4: 0.016
  ppt_anomaly: 0.003
  tmin: 0.005
  soil_anomaly: 0.013
  def_lag1: 0.096
  def_lag2: 0.091
  def_lag3: 0.089
  def_anomaly: 0.016
  pdsi: 0.011
  vpd_anomaly: 0.028
  lc_cropland: 0.014
  lc_builtup: 0.039
  lc_bare: 0.096
  lc_wetland: 0.032
  soil_sand: 0.017
  aspect: 0.017
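
The filter above only measures correlation with `Latitude`. A feature that tracks `Longitude` would pass it unnoticed. As a minimal sketch of one possible extension, the helper below keeps the largest absolute correlation against either coordinate. The function name `max_abs_geo_corr` and the synthetic columns are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def max_abs_geo_corr(df, feature_cols, geo_cols=("Latitude", "Longitude")):
    """Return, per numeric feature, the largest |Pearson r| against any geo column."""
    numeric = df[list(feature_cols)].select_dtypes(include="number")
    corr = pd.DataFrame({g: numeric.corrwith(df[g]).abs() for g in geo_cols})
    # Features whose correlation is undefined (e.g. constant columns) are dropped
    return corr.max(axis=1).dropna()

# Synthetic data for illustration only (not the competition dataset)
rng = np.random.default_rng(0)
n = 5000
demo = pd.DataFrame({
    "Latitude": rng.uniform(0, 10, n),
    "Longitude": rng.uniform(0, 10, n),
})
demo["tracks_longitude"] = 2.0 * demo["Longitude"] + rng.normal(0, 0.5, n)
demo["noise"] = rng.normal(0, 1, n)

geo_corr = max_abs_geo_corr(demo, ["tracks_longitude", "noise"])
selected = geo_corr[geo_corr < 0.10].index.tolist()
print(geo_corr.round(3))
print(selected)
```

In this toy data, `tracks_longitude` is nearly uncorrelated with latitude, so a latitude-only filter would keep it, while the two-coordinate check rejects it.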


In [4]:
# =============================================================================
# PREPARE THE DATA
# =============================================================================

def prepare_features(df, feature_cols):
    X = df[feature_cols].copy()

    # Encode water_type if present.
    # Caveat: the encoder is refit on every call, so train and test codes
    # only match when both frames contain the same set of categories.
    if 'water_type' in X.columns:
        le = LabelEncoder()
        X['water_type'] = le.fit_transform(X['water_type'].fillna('unknown'))

    # Convert to numeric
    for col in X.columns:
        if X[col].dtype == 'object':
            X[col] = pd.to_numeric(X[col], errors='coerce')

    # Fill NaN with the medians of the frame being prepared
    # (so the test set is imputed with its own medians, not the training ones)
    X = X.fillna(X.median())

    return X

# Prepare with the very-low-correlation features
X_train = prepare_features(train_df, very_low_geo_features)
X_test = prepare_features(test_df, very_low_geo_features)
y_train = train_df[TARGET_COLS].copy()

print(f"X_train : {X_train.shape}")
print(f"X_test : {X_test.shape}")

# Create clusters for spatial validation
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
train_df['geo_cluster'] = kmeans.fit_predict(train_df[['Latitude', 'Longitude']])
groups = train_df['geo_cluster'].values

X_train : (9319, 36)
X_test : (200, 36)
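
The cell above imputes each frame with its own medians and builds KMeans clusters to serve as spatial groups. A minimal sketch of how the two ideas can be combined without leaking information between folds: put the imputer and scaler inside a scikit-learn pipeline so they are refit on the training folds of each split. The data here is synthetic and the variable names are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the competition dataset)
rng = np.random.default_rng(42)
n = 400
coords = rng.uniform(0, 10, size=(n, 2))      # fake latitude/longitude
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
X[rng.random(X.shape) < 0.05] = np.nan        # inject some missing values

# Spatial groups: each sample gets the id of its KMeans cluster
groups = KMeans(n_clusters=10, random_state=42, n_init=10).fit_predict(coords)

# Imputer and scaler are refit on the training folds of every split
pipeline = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), Ridge(alpha=100)
)

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups):
    # No cluster may appear on both sides of a split
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

scores = cross_val_score(pipeline, X, y, cv=gkf, groups=groups, scoring="r2")
print(f"Fold R2: {np.round(scores, 3)}")
```

Calling `pipeline.fit(X_train, y)` followed by `pipeline.predict(X_test)` would then apply the training medians and scaling to the test set as well.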


## Step 3: Test different models

In [5]:
# =============================================================================
# TEST DIFFERENT MODELS
# =============================================================================

# Standardize for the linear models
# (note: the scaler is fit on the full training set, before the CV splits)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Ridge (alpha=10)': Ridge(alpha=10),
    'Ridge (alpha=100)': Ridge(alpha=100),
    'Lasso (alpha=1)': Lasso(alpha=1),
    'ElasticNet': ElasticNet(alpha=1, l1_ratio=0.5),
    'XGBoost (regularized)': xgb.XGBRegressor(
        n_estimators=50, max_depth=3, learning_rate=0.05,
        reg_alpha=5, reg_lambda=5, subsample=0.7,
        random_state=42, n_jobs=-1, verbosity=0
    ),
    # Note: LightGBM only applies subsample when subsample_freq > 0
    'LightGBM (regularized)': lgb.LGBMRegressor(
        n_estimators=50, max_depth=3, learning_rate=0.05,
        reg_alpha=5, reg_lambda=5, subsample=0.7,
        random_state=42, n_jobs=-1, verbosity=-1
    ),
}

gkf = GroupKFold(n_splits=5)

results = {}

print("Spatial validation with very-low-geo-correlation features")
print("=" * 60)

for model_name, model in models.items():
    print(f"\n{model_name}:")

    scores_by_target = []
    for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
        # Use scaled data for the linear models
        if 'Ridge' in model_name or 'Lasso' in model_name or 'Elastic' in model_name:
            X_use = X_train_scaled
        else:
            X_use = X_train

        scores = cross_val_score(
            model, X_use, y_train[target],
            cv=gkf, groups=groups, scoring='r2'
        )
        scores_by_target.append(scores.mean())
        print(f"  {target_name}: R² = {scores.mean():.4f}")

    mean_score = np.mean(scores_by_target)
    results[model_name] = mean_score
    print(f"  → MEAN: {mean_score:.4f}")

print(f"\n{'=' * 60}")
print("RANKING:")
for name, score in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"  {name}: {score:.4f}")

Spatial validation with very-low-geo-correlation features

Ridge (alpha=10):
  Alkalinity: R² = -0.5994
  Conductivity: R² = -0.5084
  Phosphorus: R² = -1.2971
  → MEAN: -0.8016

Ridge (alpha=100):
  Alkalinity: R² = -0.5715
  Conductivity: R² = -0.4526
  Phosphorus: R² = -1.1020
  → MEAN: -0.7087

Lasso (alpha=1):
  Alkalinity: R² = -0.5383
  Conductivity: R² = -0.4724
  Phosphorus: R² = -0.8243
  → MEAN: -0.6116

ElasticNet:
  Alkalinity: R² = -0.5644
  Conductivity: R² = -0.4962
  Phosphorus: R² = -0.5264
  → MEAN: -0.5290

XGBoost (regularized):
  Alkalinity: R² = -0.6717
  Conductivity: R² = -0.3598
  Phosphorus: R² = -0.4362
  → MEAN: -0.4892

LightGBM (regularized):
  Alkalinity: R² = -0.6669
  Conductivity: R² = -0.3256
  Phosphorus: R² = -0.4944
  → MEAN: -0.4957

RANKING:
  XGBoost (regularized): -0.4892
  LightGBM (regularized): -0.4957
  ElasticNet: -0.5290
  Lasso (alpha=1): -0.6116
  Ridge (alpha=100): -0.7087
  Ridge (alpha=10): -0.8016
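
Every cross-validated R² in this ranking is negative. As a reminder of how to read that, the worked sketch below applies the standard definition R² = 1 - SS_res / SS_tot to made-up numbers: predicting the mean of the evaluated samples scores exactly 0, and any prediction with a larger squared error than that scores below 0. Under grouped validation the held-out cluster can have a different target mean from the training folds, so even a near-constant prediction may land in negative territory. The helper `r2` is written here for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up validation targets
y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline: predict the mean of the validation targets (25) everywhere
mean_pred = np.full_like(y_true, y_true.mean())

# A constant prediction learned elsewhere and off by 15 units (40 everywhere)
shifted_pred = mean_pred + 15.0

print(round(r2(y_true, mean_pred), 4))     # SS_res == SS_tot == 500
print(round(r2(y_true, shifted_pred), 4))  # SS_res = 1400, SS_tot = 500
```

The second case gives 1 - 1400/500 = -1.8, which is the same order of magnitude as the weakest scores in the ranking.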


## Step 4: Try a simple ensemble (mean)

In [6]:
# =============================================================================
# ENSEMBLE: MEAN OF SEVERAL MODELS
# =============================================================================

print("Generating predictions with the ensemble")
print("=" * 60)

# Models used in the ensemble
ensemble_models = [
    ('Ridge', Ridge(alpha=100)),
    ('XGBoost', xgb.XGBRegressor(
        n_estimators=50, max_depth=3, learning_rate=0.05,
        reg_alpha=5, reg_lambda=5, random_state=42, n_jobs=-1, verbosity=0
    )),
    ('LightGBM', lgb.LGBMRegressor(
        n_estimators=50, max_depth=3, learning_rate=0.05,
        reg_alpha=5, reg_lambda=5, random_state=42, n_jobs=-1, verbosity=-1
    )),
]

predictions_ensemble = {target: [] for target in TARGET_COLS}

for model_name, model in ensemble_models:
    print(f"\n{model_name}:")

    for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
        if 'Ridge' in model_name:
            model.fit(X_train_scaled, y_train[target])
            pred = model.predict(X_test_scaled)
        else:
            model.fit(X_train, y_train[target])
            pred = model.predict(X_test)

        pred = np.maximum(pred, 0)  # No negative values
        predictions_ensemble[target].append(pred)
        print(f"  {target_name}: mean={pred.mean():.1f}")

# Average the predictions
final_predictions = {}
print("\nEnsemble (mean):")
for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
    preds = np.array(predictions_ensemble[target])
    final_predictions[target] = np.mean(preds, axis=0)
    print(f"  {target_name}: mean={final_predictions[target].mean():.1f}")

Generating predictions with the ensemble

Ridge:
  Alkalinity: mean=92.7
  Conductivity: mean=361.9
  Phosphorus: mean=22.9

XGBoost:
  Alkalinity: mean=94.3
  Conductivity: mean=312.0
  Phosphorus: mean=25.0

LightGBM:
  Alkalinity: mean=94.0
  Conductivity: mean=311.3
  Phosphorus: mean=26.3

Ensemble (mean):
  Alkalinity: mean=93.7
  Conductivity: mean=328.4
  Phosphorus: mean=24.7
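
The averaging step can be isolated in a small helper. This is a minimal sketch, not the notebook's code: `average_ensemble` is a hypothetical name, the predictions are made up, and the optional `weights` argument is an extension that the notebook does not use (leaving it as `None` gives the plain mean used above).

```python
import numpy as np

def average_ensemble(predictions, weights=None):
    """Average per-model predictions (one row per model) after clipping negatives to 0."""
    stacked = np.clip(np.asarray(predictions, dtype=float), 0.0, None)
    return np.average(stacked, axis=0, weights=weights)

# Made-up predictions from three models for four samples
preds = [
    [90.0, 100.0, -5.0, 80.0],   # e.g. a linear model that can go negative
    [95.0, 110.0, 2.0, 85.0],
    [94.0, 105.0, 4.0, 90.0],
]

simple = average_ensemble(preds)
weighted = average_ensemble(preds, weights=[1, 2, 2])
print(simple)
print(weighted)
```

Clipping each model before averaging, as the notebook does, means a single negative prediction cannot drag the ensemble below the other models' values for that sample.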


## Step 5: Generate the submissions

In [7]:
# =============================================================================
# GENERATE THE SUBMISSIONS
# =============================================================================

template = pd.read_csv("../data/raw/submission_template.csv")

# Submission 1: ensemble
sub_ensemble = template.copy()
for target in TARGET_COLS:
    sub_ensemble[target] = final_predictions[target]
sub_ensemble.to_csv("../data/submission_ensemble_v4.csv", index=False)
print("✅ submission_ensemble_v4.csv")

# Submission 2: Ridge alone
sub_ridge = template.copy()
ridge = Ridge(alpha=100)
for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
    ridge.fit(X_train_scaled, y_train[target])
    pred = np.maximum(ridge.predict(X_test_scaled), 0)
    sub_ridge[target] = pred
sub_ridge.to_csv("../data/submission_ridge_v4.csv", index=False)
print("✅ submission_ridge_v4.csv")

# Submission 3: heavily regularized XGBoost
sub_xgb = template.copy()
for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
    xgb_model = xgb.XGBRegressor(
        n_estimators=30, max_depth=2, learning_rate=0.05,
        reg_alpha=10, reg_lambda=10, subsample=0.6,
        random_state=42, n_jobs=-1, verbosity=0
    )
    xgb_model.fit(X_train, y_train[target])
    pred = np.maximum(xgb_model.predict(X_test), 0)
    sub_xgb[target] = pred
sub_xgb.to_csv("../data/submission_xgb_regularized_v4.csv", index=False)
print("✅ submission_xgb_regularized_v4.csv")

print("\n3 files created in ../data/")

✅ submission_ensemble_v4.csv
✅ submission_ridge_v4.csv
✅ submission_xgb_regularized_v4.csv

3 files created in ../data/
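
Before uploading, it can be worth checking that each file still lines up with the template. The sketch below is one possible set of checks. `validate_submission` is a hypothetical helper, and the template layout used in the demo (two coordinate columns plus the three targets) is an assumption, since the notebook never shows the real template's columns.

```python
import numpy as np
import pandas as pd

def validate_submission(submission, template, target_cols):
    """Raise ValueError if the submission does not line up with the template."""
    if list(submission.columns) != list(template.columns):
        raise ValueError("Column names or order differ from the template")
    if len(submission) != len(template):
        raise ValueError("Row count differs from the template")
    values = submission[target_cols]
    if values.isna().any().any():
        raise ValueError("Submission contains missing predictions")
    if (values < 0).any().any():
        raise ValueError("Submission contains negative predictions")
    return True

target_cols = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]

# Assumed template layout, for illustration only
template = pd.DataFrame({
    "Latitude": [1.0, 2.0],
    "Longitude": [3.0, 4.0],
    **{c: [np.nan, np.nan] for c in target_cols},
})

good = template.copy()
for c in target_cols:
    good[c] = [1.0, 2.0]

print(validate_submission(good, template, target_cols))
```

In the notebook this would be called once per submission frame, for example `validate_submission(sub_ensemble, template, TARGET_COLS)`, just before `to_csv`.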


In [8]:
# Compare the predictions
print("Prediction comparison")
print("=" * 60)

print("\nMean prediction per submission:")
for name, sub in [('Ensemble', sub_ensemble), ('Ridge', sub_ridge), ('XGBoost reg.', sub_xgb)]:
    print(f"\n{name}:")
    for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
        print(f"  {target_name}: {sub[target].mean():.1f}")

# Compare with the true values from the training set
print("\nTrue values (training):")
for target, target_name in zip(TARGET_COLS, TARGET_NAMES):
    print(f"  {target_name}: {y_train[target].mean():.1f}")

Prediction comparison

Mean prediction per submission:

Ensemble:
  Alkalinity: 93.7
  Conductivity: 328.4
  Phosphorus: 24.7

Ridge:
  Alkalinity: 92.7
  Conductivity: 361.9
  Phosphorus: 22.9

XGBoost reg.:
  Alkalinity: 102.2
  Conductivity: 359.9
  Phosphorus: 32.5

True values (training):
  Alkalinity: 119.1
  Conductivity: 485.0
  Phosphorus: 43.5
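
All three submissions predict lower averages than the training targets. To put a number on the gap, the short calculation below divides the ensemble means by the training means, using only the figures printed above. The output alone does not say whether the gap comes from genuinely different test locations or from a model bias.

```python
# Means copied from the comparison output
train_means = {"Alkalinity": 119.1, "Conductivity": 485.0, "Phosphorus": 43.5}
ensemble_means = {"Alkalinity": 93.7, "Conductivity": 328.4, "Phosphorus": 24.7}

# Ratio of the mean ensemble prediction to the mean training target
ratios = {name: ensemble_means[name] / train_means[name] for name in train_means}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The ensemble's mean predictions come out at about 0.79, 0.68, and 0.57 of the training means for alkalinity, conductivity, and phosphorus respectively.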


## Summary

### Strategy

1. **Features**: Only those whose absolute correlation with latitude is below 0.10
2. **Models**: Ridge, XGBoost, and LightGBM, all heavily regularized
3. **Ensemble**: Mean of the 3 models

### Files created

- `submission_ensemble_v4.csv` - Mean of the 3 models
- `submission_ridge_v4.csv` - Ridge alone (linear model)
- `submission_xgb_regularized_v4.csv` - Heavily regularized XGBoost

### To test

Submit the 3 files and compare against the previous version's score of 0.05.