<div style="text-align: center; padding: 25px; margin: 20px 0; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius: 20px; box-shadow: 0 10px 20px rgba(0,0,0,0.15); color: white;">
  <h1 style="color: white; margin-bottom: 15px; font-size: 2.5em;">🚀 Kaggle Playground Series S5E6</h1>
  <h2 style="color: white; margin-bottom: 20px; font-weight: 300;">XGBoost with 10-Fold CV - Fertilizer Prediction</h2>
  <p style="font-size: 18px; margin-bottom: 0; font-style: italic;">Advanced XGBoost modeling pipeline with stratified cross-validation for agricultural fertilizer recommendation</p>
</div>

<div style="padding: 20px; margin: 20px 0; background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); border-radius: 15px; box-shadow: 0 8px 16px rgba(0,0,0,0.1); color: white;">
  <h3 style="color: white; margin-bottom: 15px; text-align: center;">🎯 Competition Overview</h3>
  <p style="font-size: 16px; line-height: 1.6; text-align: center;">This notebook implements a <strong>comprehensive XGBoost modeling solution</strong> for the <strong>Kaggle Playground Series S5E6: Fertilizer Prediction Challenge</strong>. It builds upon <a href="https://www.kaggle.com/code/felixpradoh/ps-s5e6-exploratory-data-anlaysis" style="color: #FFE4E1; text-decoration: underline;">exploratory data analysis (EDA) insights</a> to create a robust, competition-ready model with advanced feature engineering and rigorous validation.</p>
</div>

## 🌱 Agricultural Context & Problem Domain

This challenge focuses on **precision agriculture** and **intelligent fertilizer recommendation systems**. Understanding the agricultural context is crucial for effective modeling:

<div style="display: flex; flex-wrap: wrap; gap: 15px; margin: 20px 0;">

<div style="flex: 1; min-width: 300px; padding: 15px; background: linear-gradient(135deg, #84fab0 0%, #8fd3f4 100%); border-radius: 10px; color: #2c3e50;">
  <h3 style="color: #2c3e50; margin-bottom: 10px;">🌾 Soil Science Insights</h3>
  <ul style="margin: 0; padding-left: 20px;">
    <li><strong>Nutrient balance</strong>: N-P-K ratios determine crop health</li>
    <li><strong>pH optimization</strong>: Affects nutrient availability</li>
    <li><strong>Micronutrient interactions</strong>: Complex relationships between elements</li>
    <li><strong>Soil composition</strong>: Sandy, loamy, clay affect fertilizer needs</li>
  </ul>
</div>

<div style="flex: 1; min-width: 300px; padding: 15px; background: linear-gradient(135deg, #a8edea 0%, #fed6e3 100%); border-radius: 10px; color: #2c3e50;">
  <h3 style="color: #2c3e50; margin-bottom: 10px;">🌡️ Environmental Factors</h3>
  <ul style="margin: 0; padding-left: 20px;">
    <li><strong>Temperature impact</strong>: Affects nutrient uptake rates</li>
    <li><strong>Rainfall patterns</strong>: Influences fertilizer timing</li>
    <li><strong>Crop types</strong>: Different species have unique requirements</li>
    <li><strong>Growth stages</strong>: Fertilizer needs change over time</li>
  </ul>
</div>

</div>

## 🤖 Why XGBoost for This Challenge?

**XGBoost (eXtreme Gradient Boosting)** is a strong choice for this agricultural classification task for several reasons:

<div style="display: flex; flex-wrap: wrap; gap: 15px; margin: 20px 0;">
  
<div style="flex: 1; min-width: 300px; padding: 15px; background: linear-gradient(135deg, #84fab0 0%, #8fd3f4 100%); border-radius: 10px; color: #2c3e50;">
  <h3 style="color: #2c3e50; margin-bottom: 10px;">🏆 Algorithmic Advantages</h3>
  <ul style="margin: 0; padding-left: 20px;">
    <li><strong>Tree-based architecture</strong>: Naturally handles feature interactions</li>
    <li><strong>Built-in regularization</strong>: Prevents overfitting with L1/L2 penalties</li>
    <li><strong>Missing value handling</strong>: Robust to data inconsistencies</li>
    <li><strong>Feature selection</strong>: Automatic relevance weighting</li>
  </ul>
</div>

<div style="flex: 1; min-width: 300px; padding: 15px; background: linear-gradient(135deg, #a8edea 0%, #fed6e3 100%); border-radius: 10px; color: #2c3e50;">
  <h3 style="color: #2c3e50; margin-bottom: 10px;">📊 Multi-class Excellence</h3>
  <ul style="margin: 0; padding-left: 20px;">
    <li><strong>Native multi-class support</strong>: Handles the 7 fertilizer classes in a single model</li>
    <li><strong>Probability outputs</strong>: Essential for MAP@3 ranking optimization</li>
    <li><strong>Class imbalance handling</strong>: Sample weighting for balanced performance</li>
    <li><strong>Probability quality</strong>: Log-loss training yields usable probabilities, though calibration is not guaranteed</li>
  </ul>
</div>

</div>

<div style="display: flex; flex-wrap: wrap; gap: 15px; margin: 20px 0;">

<div style="flex: 1; min-width: 300px; padding: 15px; background: linear-gradient(135deg, #ffecd2 0%, #fcb69f 100%); border-radius: 10px; color: #2c3e50;">
  <h3 style="color: #2c3e50; margin-bottom: 10px;">⚡ Performance & Scalability</h3>
  <ul style="margin: 0; padding-left: 20px;">
    <li><strong>Fast training</strong>: Efficient gradient boosting implementation</li>
    <li><strong>Memory efficiency</strong>: Handles large feature sets</li>
    <li><strong>Early stopping</strong>: Prevents overfitting with validation</li>
    <li><strong>Cross-validation friendly</strong>: Stable performance across splits</li>
  </ul>
</div>

<div style="flex: 1; min-width: 300px; padding: 15px; background: linear-gradient(135deg, #d299c2 0%, #fef9d7 100%); border-radius: 10px; color: #2c3e50;">
  <h3 style="color: #2c3e50; margin-bottom: 10px;">🎯 Competition Benefits</h3>
  <ul style="margin: 0; padding-left: 20px;">
    <li><strong>Ensemble capability</strong>: Multiple models from CV folds</li>
    <li><strong>Hyperparameter sensitivity</strong>: Extensive tuning options</li>
    <li><strong>Proven track record</strong>: Dominant in Kaggle competitions</li>
    <li><strong>MAP@3 optimization</strong>: Probability-based outputs ideal</li>
  </ul>
</div>

</div>

---

<div style="padding: 20px; margin: 20px 0; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius: 15px; box-shadow: 0 8px 16px rgba(0,0,0,0.1); color: white;">
  <h2 style="color: white; margin-bottom: 15px; text-align: left;">🗂️ Notebook Structure</h2>
  <p style="font-size: 16px; text-align: left; margin-bottom: 15px;"><strong>This comprehensive agricultural ML pipeline covers:</strong></p>
  
  <div style="color: white; font-size: 16px; line-height: 1.8; text-align: left; max-width: 600px; margin: 0 auto;">
    <ol style="padding-left: 5px;">
      <li><strong>📚 Library Import & Setup</strong></li>
      <li><strong>📂 Data Loading & Preparation</strong></li>
      <li><strong>🌱 Agricultural Feature Engineering</strong></li>
      <li><strong>🔢 Categorical Variable Encoding</strong></li>
      <li><strong>🎯 Strategic Feature Selection</strong></li>
      <li><strong>🔄 Cross-Validation Setup</strong></li>
      <li><strong>⚙️ XGBoost Configuration</strong></li>
      <li><strong>🏋️ Model Training & Validation</strong></li>
      <li><strong>📊 Performance Evaluation</strong></li>
      <li><strong>🔮 Test Predictions & Submission</strong></li>
      <li><strong>💾 Model Persistence</strong></li>
    </ol>
  </div>
</div>

## 📚 1. Library Import & Setup

**Why?** We import essential libraries for the complete XGBoost modeling pipeline. These libraries provide data manipulation, machine learning algorithms, evaluation metrics, and model persistence capabilities required for competition-level performance.

In [None]:
# Core data manipulation and analysis
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# Machine Learning - Scikit-learn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

# XGBoost
from xgboost import XGBClassifier
from xgboost.callback import EarlyStopping

# Model persistence and metadata
import joblib
import json
from datetime import datetime

# Utilities
import time
from collections import Counter

# Configuration
np.random.seed(513)


## 📏 MAP@K Evaluation Function

**Why?**

MAP@3 (Mean Average Precision at 3) is the official Kaggle competition metric. This function calculates how well our model ranks the correct fertilizer within the top 3 predictions for each sample.

**Function Details:**
- **Input**: Actual fertilizer labels and predicted rankings (top-k)
- **Output**: Score between 0 and 1 (higher is better)
- **Logic**: Rewards correct predictions more heavily when they appear earlier in the ranking
- **Competition Critical**: This implementation follows the competition's MAP@3 definition, where each row has exactly one correct label
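
As a quick sanity check on the scoring logic, here is a minimal worked example. It re-implements the per-row score in a standalone helper (`average_precision_at_k`, a name used only for this illustration) and scores three toy rows where the true label sits at rank 1, at rank 3, and outside the top 3. The labels are placeholders, not competition data.

```python
def average_precision_at_k(actual, ranked, k=3):
    """Score one row: 1/rank of the true label within the top k, else 0."""
    for rank, label in enumerate(ranked[:k], start=1):
        if label == actual:
            return 1.0 / rank
    return 0.0

# Toy labels and rankings (placeholder values for illustration only)
actual = ['A', 'B', 'C']
ranked = [
    ['A', 'B', 'C'],   # true label at rank 1 -> 1.0
    ['A', 'C', 'B'],   # true label at rank 3 -> 1/3
    ['A', 'B', 'D'],   # true label missing   -> 0.0
]

row_scores = [average_precision_at_k(a, r) for a, r in zip(actual, ranked)]
map3 = sum(row_scores) / len(row_scores)
print(row_scores, round(map3, 4))
```

The mean of the three row scores is 4/9, which shows how strongly the metric rewards placing the correct fertilizer first rather than third.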

In [None]:
def mapk(actual, predicted, k=3):
    """Compute mean average precision at k (MAP@k)."""
    def apk(a, p, k):
        score = 0.0
        for i in range(min(k, len(p))):
            if p[i] == a:
                score += 1.0 / (i + 1)
                break  # only the first correct prediction counts
        return score
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])

## 📂 2. Data Loading & Preparation

**Why?**

Load competition datasets and perform initial data separation. EDA insights guide our feature engineering approach.

In [None]:
# Define file paths
data_path = '../data'
train_path = os.path.join(data_path, 'train.csv')
test_path = os.path.join(data_path, 'test.csv')
sample_submission_path = os.path.join(data_path, 'sample_submission.csv')

# Load datasets
print("📂 Loading datasets...")
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
sample_submission = pd.read_csv(sample_submission_path)

# print("✅ Data loaded successfully:")
# print(f"  • Training set: {train_df.shape}")
# print(f"  • Test set: {test_df.shape}")
# print(f"  • Sample submission: {sample_submission.shape}")

# Separate features and target variable
target_column = 'Fertilizer Name'

# Split training data
X_raw = train_df.drop(columns=[target_column])
y_raw = train_df[target_column]
X_test_raw = test_df.copy()

print("✅ Data separation completed:")
print(f"  • Training features: {X_raw.shape}")
print(f"  • Training target: {y_raw.shape}")
print(f"  • Test features: {X_test_raw.shape}")
print(f"  • Target classes: {y_raw.nunique()}")

## 🔬 3. Advanced Feature Engineering

**Why?**

Create sophisticated features that capture agricultural relationships and domain knowledge based on EDA findings.
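
One detail of the binning step is worth a sketch. `pd.cut(..., bins=3)` derives its edges from whatever frame it is given, so calling it separately on train and test can produce different cut points. A possible alternative, shown here under the assumption that sharing edges is desirable, is to compute the edges once on the training values and reuse them. The helper `fit_bin_edges` and the toy temperatures are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_bin_edges(values, n_bins=3):
    """Equal-width edges from training values, with open outer edges."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    edges[0], edges[-1] = -np.inf, np.inf  # unseen extremes fall in the outer bins
    return edges

# Toy temperature readings (illustrative values only)
train_temp = pd.Series([25, 28, 31, 34, 38])
test_temp = pd.Series([24, 30, 40])

labels = ['Low', 'Medium', 'High']
edges = fit_bin_edges(train_temp)

# The same edges are applied to both frames, so categories mean the same thing
train_cat = pd.cut(train_temp, bins=edges, labels=labels)
test_cat = pd.cut(test_temp, bins=edges, labels=labels)
print(list(train_cat), list(test_cat))
```

Opening the outer edges to infinity keeps test values outside the training range from turning into missing categories.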

In [None]:
def create_features(df):
    """
    Create engineered features based on agricultural domain knowledge
    
    Args:
        df: DataFrame with agricultural features
        
    Returns:
        DataFrame with additional engineered features
    """
    df_eng = df.copy()
    
    # NPK Ratios (crucial for agricultural decisions)
    df_eng['N_P_ratio'] = df_eng['Nitrogen'] / (df_eng['Phosphorous'] + 0.001)
    df_eng['N_K_ratio'] = df_eng['Nitrogen'] / (df_eng['Potassium'] + 0.001)
    df_eng['P_K_ratio'] = df_eng['Phosphorous'] / (df_eng['Potassium'] + 0.001)
    
    # Total NPK and NPK Balance
    df_eng['Total_NPK'] = df_eng['Nitrogen'] + df_eng['Phosphorous'] + df_eng['Potassium']
    npk_mean = df_eng[['Nitrogen', 'Phosphorous', 'Potassium']].mean(axis=1)
    df_eng['NPK_Balance'] = df_eng[['Nitrogen', 'Phosphorous', 'Potassium']].std(axis=1) / (npk_mean + 0.001)
    
    # Environmental indices
    df_eng['Temp_Hum_index'] = df_eng['Temparature'] * df_eng['Humidity'] / 100
    df_eng['Moist_Balance'] = df_eng['Moisture'] - df_eng['Humidity']
    df_eng['Environ_Stress'] = np.sqrt((df_eng['Temparature'] - 25)**2 + (df_eng['Humidity'] - 65)**2)
    df_eng['Temp_Moist_inter'] = df_eng['Temparature'] * df_eng['Moisture'] / 100
    
    # Dominant nutrient
    npk_cols = ['Nitrogen', 'Phosphorous', 'Potassium']
    df_eng['Dominant_NPK'] = df_eng[npk_cols].idxmax(axis=1)
    
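    # Categorical binning (bins=3 derives equal-width edges from this DataFrame's own min/max,
    # so train and test can receive slightly different cut points)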
    df_eng['Temp_Cat'] = pd.cut(df_eng['Temparature'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['Hum_Cat'] = pd.cut(df_eng['Humidity'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['N_Level'] = pd.cut(df_eng['Nitrogen'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['K_Level'] = pd.cut(df_eng['Potassium'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['P_Level'] = pd.cut(df_eng['Phosphorous'], bins=3, labels=['Low', 'Medium', 'High'])
    
    # Soil-Crop interaction
    df_eng['Soil_Crop_Combo'] = df_eng['Soil Type'].astype(str) + '_' + df_eng['Crop Type'].astype(str)
    
    return df_eng

# Apply feature engineering
print("🔧 Applying feature engineering...")
X_train_featured = create_features(X_raw)
X_test_featured = create_features(X_test_raw)

# Display new feature names
original_features = set(X_raw.columns)
new_features = [col for col in X_train_featured.columns if col not in original_features]

print(f"✅ Feature engineering completed: {X_raw.shape[1]} → {X_train_featured.shape[1]} features (+{X_train_featured.shape[1] - X_raw.shape[1]})")

## 🔢 4. Categorical Variable Encoding

**Why?**

XGBoost requires numerical inputs, so we convert categorical variables (soil types, crop types, engineered categories) to numerical representations while maintaining consistency between training and test sets.
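
To illustrate why consistency matters, the sketch below fits a single `LabelEncoder` on the union of train and test values, so a category that appears only in the test set still receives a valid code. The soil names are toy values chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns: 'Loamy' appears only in the test set
train_soil = pd.Series(['Clayey', 'Sandy'])
test_soil = pd.Series(['Loamy', 'Sandy'])

encoder = LabelEncoder()
encoder.fit(pd.concat([train_soil, test_soil]).astype(str))

# Classes are sorted alphabetically: Clayey=0, Loamy=1, Sandy=2
train_codes = encoder.transform(train_soil.astype(str)).tolist()
test_codes = encoder.transform(test_soil.astype(str)).tolist()
print(list(encoder.classes_), train_codes, test_codes)
```

Fitting on the training column alone would raise an error when `transform` meets the unseen `'Loamy'` value.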

In [None]:
def encode_categorical_features(X_train, X_test, y_train):
    """
    Encode categorical features using LabelEncoder
    
    Args:
        X_train: Training features
        X_test: Test features  
        y_train: Training target
        
    Returns:
        Tuple of (X_train_encoded, X_test_encoded, y_encoded, encoders_dict)
    """
    
    # Initialize encoders dictionary
    encoders = {}
    
    # Create copies to avoid modifying originals
    X_train_enc = X_train.copy()
    X_test_enc = X_test.copy()
    
    # Identify categorical columns
    categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
    
    print(f"🔢 Encoding categorical features...")
    print(f"Categorical columns found: {categorical_cols}")
    
    # Encode categorical features
    for col in categorical_cols:
        print(f"  • Encoding: {col}")
        
        # Create encoder
        encoder = LabelEncoder()
        
        # Fit on combined training and test data to ensure consistency
        combined_values = pd.concat([X_train[col], X_test[col]]).astype(str)
        encoder.fit(combined_values)
        
        # Transform both datasets
        X_train_enc[col] = encoder.transform(X_train[col].astype(str))
        X_test_enc[col] = encoder.transform(X_test[col].astype(str))
        
        # Store encoder
        encoders[col] = encoder
            
    # Encode target variable
    print(f"\n🎯 Encoding target variable: {target_column}")
    target_encoder = LabelEncoder()
    y_encoded = target_encoder.fit_transform(y_train)
    encoders['target'] = target_encoder
    
    print(f"  • Target classes: {len(target_encoder.classes_)}")
    # print(f"  • Class mapping preview: {dict(zip(target_encoder.classes_[:5], range(5)))}")
    
    return X_train_enc, X_test_enc, y_encoded, encoders

# Apply encoding
X_train_encoded, X_test_encoded, y_encoded, label_encoders = encode_categorical_features(
    X_train_featured, X_test_featured, y_raw
)

print(f"\n✅ Encoding completed:")
print(f"  • Training features: {X_train_encoded.shape}")
print(f"  • Test features: {X_test_encoded.shape}")
print(f"  • Encoded target: {y_encoded.shape}")
print(f"  • Encoders stored: {len(label_encoders)}")

## 🎯 5. Strategic Feature Selection

**Why?**

Implement flexible feature selection based on EDA insights and mutual information analysis for optimal model performance.

**Selection Philosophy:**

- **EDA-driven**: Prioritize features with high mutual information scores
- **Domain-informed**: Include agriculturally meaningful combinations
- **Computational efficiency**: Balance between information gain and training speed
- **Easy experimentation**: Toggle features on/off with simple commenting system
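
The mutual information ranking mentioned above is not computed in this notebook. A minimal sketch of how such a ranking could be produced with scikit-learn's `mutual_info_classif` follows, using synthetic data in which one feature copies the target and the other is random noise. The feature names and data are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)        # toy target with 3 classes
informative = y.copy()                   # feature that fully determines the target
noise = rng.integers(0, 3, size=300)     # feature unrelated to the target
X = np.column_stack([informative, noise])

# Both toy features are discrete, so tell the estimator to treat them that way
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

ranking = sorted(zip(['informative', 'noise'], scores), key=lambda t: t[1], reverse=True)
print(ranking)
```

Features near the top of such a ranking are natural candidates to uncomment first in `features_to_use`.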

In [None]:
# =============================================================================
# FEATURE SELECTION FOR THE MODEL
# =============================================================================

# Feature selection
features_to_use = [
    # Original features
    # 'Temparature',
    # 'Humidity', 
    # 'Moisture',
    # 'Nitrogen',
    # 'Potassium', 
    # 'Phosphorous',
    
    # Engineered features
    'N_P_ratio',
    # 'N_K_ratio',
    # 'P_K_ratio',
    # 'Total_NPK',
    # 'NPK_Balance',
    # 'Temp_Hum_index',
    # 'Moist_Balance',
    # 'Environ_Stress',
    # 'Temp_Moist_inter',
    # 'Temp_Cat',
    # 'Hum_Cat',
    # 'N_Level',
    # 'K_Level',
    # 'P_Level',
    
    # Combinations
    'Soil_Crop_Combo',
    # 'Dominant_NPK',
    
    # Categorical features
    # 'Soil Type',
    'Crop Type',
]

# Validate features
available_features = [f for f in features_to_use if f in X_train_encoded.columns]
missing_features = [f for f in features_to_use if f not in X_train_encoded.columns]

features_to_use = available_features

if missing_features:
    print(f"Missing features: {missing_features}")

print(f"✅ Selected features ({len(features_to_use)}): {features_to_use}")

# Create final datasets
X_final = X_train_encoded[features_to_use].copy()
X_test_final = X_test_encoded[features_to_use].copy()

# print(f"Training: {X_final.shape}, Test: {X_test_final.shape}, Target: {y_encoded.shape}")

## 🔄 6. Stratified 10-Fold Cross-Validation Setup

**Why?**

Stratified 10-fold cross-validation gives a more reliable performance estimate than a single split and makes overfitting easier to detect, because every fold preserves the class proportions of the full training set.
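
The effect of stratification can be seen on a small imbalanced example. The sketch below uses a toy target with an 80/20 class split and five folds (values chosen only to keep the arithmetic easy) and checks the minority share in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy target: 80 samples of class 0 and 20 samples of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of the minority class inside each validation fold
minority_share = [y_toy[val_idx].mean() for _, val_idx in splitter.split(X_toy, y_toy)]
print(minority_share)
```

Every validation fold keeps the same 20% minority share as the full data, which a plain `KFold` split would not guarantee.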

In [None]:
# =============================================================================
# STRATIFIED 10-FOLD CROSS-VALIDATION CONFIGURATION
# =============================================================================

# Cross-validation parameters
N_SPLITS = 10  # 10-fold cross-validation for robust evaluation
RANDOM_STATE = 513
SHUFFLE = True

# Initialize StratifiedKFold to maintain class distribution
skf = StratifiedKFold(
    n_splits=N_SPLITS, 
    shuffle=SHUFFLE, 
    random_state=RANDOM_STATE
)

print(f"🔄 CROSS-VALIDATION CONFIGURATION:")
print(f"  • Number of folds: {N_SPLITS}")
print(f"  • Strategy: Stratified (maintains class proportions)")
print(f"  • Shuffle: {SHUFFLE}")
print(f"  • Random state: {RANDOM_STATE}")

# Analyze class distribution for stratification
print(f"\n📊 Class distribution analysis:")
unique_classes, class_counts = np.unique(y_encoded, return_counts=True)
print(f"  • Total classes: {len(unique_classes)}")
print(f"  • Total samples: {len(y_encoded)}")
print(f"  • Samples per fold: ~{len(y_encoded) // N_SPLITS}")

# Check minimum class size for stratification
min_class_count = min(class_counts)
print(f"  • Minimum class size: {min_class_count}")
if min_class_count < N_SPLITS:
    print(f"  ⚠️ Warning: Smallest class has {min_class_count} samples, less than {N_SPLITS} folds")
    print(f"    Some folds may not contain all classes")
else:
    print(f"  ✅ All classes have sufficient samples for {N_SPLITS}-fold CV")

# Preview fold splits
print(f"\n🔍 Fold size preview:")
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_final, y_encoded)):
    if fold_idx < 3:  # Show first 3 folds
        print(f"  Fold {fold_idx + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
    elif fold_idx == 3:
        print("  ...")
    elif fold_idx == N_SPLITS - 1:  # Show last fold
        print(f"  Fold {fold_idx + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
        break

## ⚙️ 7. XGBoost Hyperparameter Configuration

**Why?**

Hyperparameter choices strongly affect XGBoost's performance on multi-class problems. This section defines a baseline configuration and computes balanced class weights to offset class imbalance.
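
As a small illustration of what "balanced" means here, the sketch below computes class weights for a toy target with a 3:1 imbalance and expands them into per-sample weights. The `'balanced'` mode uses `n_samples / (n_classes * class_count)`; the toy labels are an assumption made for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: class 0 is three times as frequent as class 1
y_toy = np.array([0, 0, 0, 1])
classes = np.unique(y_toy)

# 'balanced' weight per class = n_samples / (n_classes * class_count)
class_weights = compute_class_weight('balanced', classes=classes, y=y_toy)

# Labels are already 0..K-1, so they can index the weight array directly
sample_weights = class_weights[y_toy]
print(class_weights, sample_weights)
```

After weighting, each class contributes the same total weight to the loss, which is the property the per-fold `sample_weight` argument relies on.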

In [None]:
# =============================================================================
# XGBOOST HYPERPARAMETER CONFIGURATION
#
#🎯 EXPERIMENTATION ENCOURAGED!
# These parameters provide a solid baseline, but feel free to experiment!
# Try different learning rates, depths, or regularization for better scores.
# =============================================================================

# Calculate class weights for imbalanced dataset
class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_encoded),
    y=y_encoded
)
class_weight_dict = dict(zip(np.unique(y_encoded), class_weights))

print("⚖️ Class weight calculation:")
print(f"  • Balanced class weights computed for {len(class_weight_dict)} classes")
print(f"  • Weight range: {min(class_weights):.3f} - {max(class_weights):.3f}")

# XGBoost hyperparameters (optimized for multi-class classification)
xgb_params = {
    # Multi-class objective
    'objective': 'multi:softprob',
    'num_class': len(label_encoders['target'].classes_),
    'eval_metric': 'mlogloss',
    
    # Tree structure
    'max_depth': 12,
    'min_child_weight': 3,
    'subsample': 0.86,
    'colsample_bytree': 0.467,
    
    # Learning parameters
    'learning_rate': 0.5,
    'n_estimators': 40,  # Maximum number of boosting rounds
    
    # Regularization
    'reg_alpha': 2.7,  # L1 regularization
    'reg_lambda': 1.4,  # L2 regularization
    'gamma': 0.25,      # Minimum split loss
    'max_delta_step': 4,  # Maximum delta step for tree weights
    
    # Performance
    'random_state': RANDOM_STATE,
    'n_jobs': -1,
    'verbosity': 0,
    
    # 'device': 'cpu',
    'tree_method': 'hist', 

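    # GPU acceleration (XGBoost >= 2.0): uncomment the next line and keep tree_method='hist'
    # 'device': 'cuda',
    # Older releases used 'gpu_id': 0 with 'tree_method': 'gpu_hist' (deprecated since 2.0)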
    
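    # Early stopping patience, passed to the XGBClassifier constructor.
    # With patience (150) above n_estimators (40), training always runs all 40 rounds.
    'early_stopping_rounds': 150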
}

# Early stopping configuration
es = xgb_params['early_stopping_rounds']  # single source of truth for the patience value
eval_metric = 'mlogloss'

print(f"\n🚀 XGBOOST CONFIGURATION:")
print(f"  • Objective: {xgb_params['objective']}")
print(f"  • Number of classes: {xgb_params['num_class']}")
print(f"  • Max depth: {xgb_params['max_depth']}")
print(f"  • Learning rate: {xgb_params['learning_rate']}")
print(f"  • Max estimators: {xgb_params['n_estimators']}")
print(f"  • Early stopping: {xgb_params['early_stopping_rounds']} rounds")
print(f"  • Evaluation metric: {xgb_params['eval_metric']}")
print(f"  • Regularization: L1={xgb_params['reg_alpha']}, L2={xgb_params['reg_lambda']}")
print(f"  • Tree method: {xgb_params.get('tree_method', 'hist')}")
print(f"  • GPU ID: {xgb_params.get('gpu_id', 'N/A')}")
print(f"  • Class balancing: Enabled")


## 🏋️ 8. Model Training with 10-Fold Cross-Validation

**Why?**

Implement the core training pipeline: stratified 10-fold cross-validation with per-fold balanced sample weights and early stopping configured through the model parameters.
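
The MAP@3 calculation inside the training loop depends on turning each row of predicted probabilities into its three most likely class indices. A minimal sketch of that step on a toy probability matrix (values invented for illustration) looks like this:

```python
import numpy as np

# Toy predicted probabilities for 2 samples and 4 classes
proba = np.array([
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.10, 0.30, 0.20],
])

# argsort is ascending: keep the last three columns, then reverse them
# so the most probable class comes first
top3 = np.argsort(proba, axis=1)[:, -3:][:, ::-1]
print(top3.tolist())
```

Each row of `top3` is ordered from most to least probable, which is the ranking format the `mapk` function expects.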


In [None]:
# =============================================================================
# 10-FOLD CROSS-VALIDATION TRAINING
# =============================================================================

def train_xgboost_cv(X, y, features, cv_splitter, params):
    """
    Train XGBoost models using cross-validation
    
    Args:
        X: Feature matrix
        y: Target vector (encoded)
        features: List of feature names to use
        cv_splitter: Cross-validation splitter (StratifiedKFold)
        params: XGBoost parameters (must include 'num_class'; early stopping
            is configured through 'early_stopping_rounds' in this dict)
        
    Returns:
        Dict with trained models, predictions, and metrics
    """
    
    # Initialize storage
    models = {}
    oof_predictions = np.zeros((len(X), params['num_class']))  # Out-of-fold predictions
    cv_scores = []
    feature_importance_list = []
    
    # Read the fold count from the splitter instead of relying on a global
    n_splits = cv_splitter.get_n_splits()
    print(f"🏋️ Starting {n_splits}-Fold Cross-Validation Training...")
    print(f"⏰ Training started at: {time.strftime('%H:%M:%S')}")
    
    # Cross-validation loop
    for fold_idx, (train_idx, val_idx) in enumerate(cv_splitter.split(X, y)):
        
        fold_start_time = time.time()
        print(f"\n📁 FOLD {fold_idx + 1}/{n_splits}")
        print(f"  • Train samples: {len(train_idx)}")
        print(f"  • Validation samples: {len(val_idx)}")
        
        # Split data
        X_train_fold = X.iloc[train_idx][features]
        X_val_fold = X.iloc[val_idx][features]
        y_train_fold = y[train_idx]
        y_val_fold = y[val_idx]
        
        # Calculate sample weights for this fold
        fold_class_weights = compute_class_weight(
            'balanced',
            classes=np.unique(y_train_fold),
            y=y_train_fold
        )
        fold_class_weight_dict = dict(zip(np.unique(y_train_fold), fold_class_weights))
        sample_weights = np.array([fold_class_weight_dict.get(label, 1.0) for label in y_train_fold])
        
        # Initialize model
        model = XGBClassifier(**params)
        
        # Train with early stopping
        model.fit(
            X_train_fold, y_train_fold,
            sample_weight=sample_weights,
            eval_set=[(X_val_fold, y_val_fold)],
            verbose=False
        )
        
        # Predict validation set
        val_pred_proba = model.predict_proba(X_val_fold)
        val_pred_classes = model.predict(X_val_fold)
        
        # Store out-of-fold predictions
        oof_predictions[val_idx] = val_pred_proba
        
        # Calculate fold metrics
        fold_accuracy = accuracy_score(y_val_fold, val_pred_classes)
        
        # Calculate MAP@3 for this fold and get top 3 predictions for each sample
        val_top3_indices = np.argsort(val_pred_proba, axis=1)[:, -3:][:, ::-1]
        
        # Convert to lists for mapk function
        actual_list = y_val_fold.tolist() if hasattr(y_val_fold, 'tolist') else list(y_val_fold)
        predicted_list = val_top3_indices.tolist()
        
        # Calculate MAP@3 using the correct format
        fold_map3 = mapk(actual_list, predicted_list, k=3)
        
        # Store results
        cv_scores.append({
            'fold': fold_idx + 1,
            'accuracy': fold_accuracy,
            'map3': fold_map3,
            'best_iteration': model.best_iteration,
            'train_samples': len(train_idx),
            'val_samples': len(val_idx),
            'training_time': time.time() - fold_start_time
        })
        
        # Store model and feature importance
        models[f'fold_{fold_idx + 1}'] = model
        
        if hasattr(model, 'feature_importances_'):
            importance_df = pd.DataFrame({
                'feature': features,
                'importance': model.feature_importances_,
                'fold': fold_idx + 1
            })
            feature_importance_list.append(importance_df)
        
        fold_time = time.time() - fold_start_time
        print(f"  ✅ Fold completed in {fold_time:.1f}s")
        print(f"  📊 Accuracy: {fold_accuracy:.4f} | MAP@3: {fold_map3:.4f}")
        print(f"  🔄 Best iteration: {model.best_iteration}")
    
    # Calculate overall metrics
    oof_pred_classes = np.argmax(oof_predictions, axis=1)
    overall_accuracy = accuracy_score(y, oof_pred_classes)
    
    # Calculate overall MAP@3 and get top 3 predictions for each sample
    oof_top3_indices = np.argsort(oof_predictions, axis=1)[:, -3:][:, ::-1]
    
    # Convert to lists for mapk function
    actual_list = y.tolist() if hasattr(y, 'tolist') else list(y)
    predicted_list = oof_top3_indices.tolist()
    
    # Calculate MAP@3 using the correct format
    overall_map3 = mapk(actual_list, predicted_list, k=3)
    
    # Combine feature importance across folds
    if feature_importance_list:
        feature_importance_df = pd.concat(feature_importance_list, ignore_index=True)
        feature_importance_summary = feature_importance_df.groupby('feature')['importance'].agg(['mean', 'std']).reset_index()
        feature_importance_summary = feature_importance_summary.sort_values('mean', ascending=False)
    else:
        feature_importance_summary = None
    
    return {
        'models': models,
        'oof_predictions': oof_predictions,
        'cv_scores': cv_scores,
        'overall_accuracy': overall_accuracy,
        'overall_map3': overall_map3,
        'feature_importance': feature_importance_summary
    }

# Execute cross-validation training
print("🚀 Starting model training...")
start_time = time.time()

training_results = train_xgboost_cv(
    X=X_final,
    y=y_encoded,
    features=features_to_use,
    cv_splitter=skf,
    params=xgb_params,
)

total_time = time.time() - start_time

print(f"\n🎉 CROSS-VALIDATION TRAINING COMPLETED!")
print(f"⏰ Training finished at: {time.strftime('%H:%M:%S')}")
print(f"⏱️ Total training time: {total_time:.1f}s ({total_time/60:.1f}min)")
print(f"📊 Overall Accuracy: {training_results['overall_accuracy']:.4f}")
print(f"📊 Overall MAP@3: {training_results['overall_map3']:.4f}")

## 📊 9. Comprehensive Model Evaluation & Analysis

**Why?**

Analyze model performance, stability, and competition readiness through statistical metrics and performance indicators.
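
The stability check in this section is based on the coefficient of variation, that is, the standard deviation of the fold scores divided by their mean. A minimal sketch with three made-up fold scores shows the arithmetic and the 0.05 threshold used in the code:

```python
import pandas as pd

# Made-up MAP@3 scores for three folds (illustration only)
fold_scores = pd.Series([0.30, 0.32, 0.31])

mean_score = fold_scores.mean()
std_score = fold_scores.std()          # pandas uses the sample std (ddof=1)
coef_var = std_score / mean_score

is_stable = coef_var < 0.05            # same threshold as the stability check
print(f"{coef_var:.4f}", is_stable)
```

A low coefficient of variation suggests the score does not depend much on which fold is held out. The 0.05 cut-off is a rule of thumb rather than a formal test.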


In [None]:
# =============================================================================
# CROSS-VALIDATION RESULTS EVALUATION
# =============================================================================

print("📊 CROSS-VALIDATION RESULTS")
print("=" * 60)

# Extract results from training
cv_results_df = pd.DataFrame(training_results['cv_scores'])

# Calculate statistics
accuracy_mean = cv_results_df['accuracy'].mean()
accuracy_std = cv_results_df['accuracy'].std()
map3_mean = cv_results_df['map3'].mean()
map3_std = cv_results_df['map3'].std()

print(f"🎯 FINAL METRICS:")
print(f"  📈 Cross-Validation Accuracy: {accuracy_mean:.4f} ± {accuracy_std:.4f}")
print(f"  📈 Cross-Validation MAP@3:    {map3_mean:.4f} ± {map3_std:.4f}")
print(f"  📈 Out-of-Fold Accuracy:      {training_results['overall_accuracy']:.4f}")
print(f"  📈 Out-of-Fold MAP@3:         {training_results['overall_map3']:.4f}")

# Stability evaluation
accuracy_cv = accuracy_std / accuracy_mean if accuracy_mean > 0 else 0
map3_cv = map3_std / map3_mean if map3_mean > 0 else 0

print(f"\n🔍 STABILITY ANALYSIS:")
print(f"  📊 Coefficient of variation (Accuracy): {accuracy_cv:.3f}")
print(f"  📊 Coefficient of variation (MAP@3):    {map3_cv:.3f}")
print(f"  {'✅ Stable model' if accuracy_cv < 0.05 else '⚠️ Variable model'} (Accuracy CV < 0.05)")
print(f"  {'✅ Stable model' if map3_cv < 0.05 else '⚠️ Variable model'} (MAP@3 CV < 0.05)")

# Training time analysis
avg_fold_time = cv_results_df['training_time'].mean()
print(f"\n⏱️ TRAINING TIMES:")
print(f"  📊 Average time per fold: {avg_fold_time:.1f}s")
print(f"  📊 Total time: {total_time:.1f}s ({total_time/60:.1f}min)")

# Detailed results by fold
print(f"\n📋 DETAILED RESULTS BY FOLD:")
print("Fold  Accuracy   MAP@3    Best_Iter  Time(s)")
print("-" * 50)
for _, row in cv_results_df.iterrows():
    print(f"{row['fold']:2.0f}    {row['accuracy']:.4f}   {row['map3']:.4f}     {row['best_iteration']:4.0f}   {row['training_time']:6.1f}")

print("-" * 50)
print(f"Mean  {accuracy_mean:.4f}   {map3_mean:.4f}     {cv_results_df['best_iteration'].mean():4.0f}   {avg_fold_time:6.1f}")

# Feature importance analysis
if training_results['feature_importance'] is not None:
    print(f"\n🔍 TOP 10 MOST IMPORTANT FEATURES:")
    print("Rank  Feature               Importance")
    print("-" * 40)
    for i, (_, row) in enumerate(training_results['feature_importance'].head(10).iterrows()):
        print(f"{i+1:2d}.   {row['feature']:20} {row['mean']:8.4f}")

## 💾 10. Model Persistence & Competition Submission

**Why?**

Save essential model artifacts and performance metrics for competition submission and reproducibility.
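
The metadata saved in this section records the ensemble method as an average of fold predictions. A minimal sketch of that idea follows, assuming each fold model yields a probability matrix for the test rows and that the submission expects an `id` column plus up to three space-separated labels in `Fertilizer Name`. The class names, probabilities, and ids are placeholders, not the notebook's actual outputs.

```python
import numpy as np
import pandas as pd

# Placeholder class names, in the order used by the target encoder
class_names = np.array(['fert_a', 'fert_b', 'fert_c', 'fert_d'])

# Toy test-set probabilities from two fold models (2 rows, 4 classes)
fold_probas = [
    np.array([[0.10, 0.60, 0.05, 0.25], [0.40, 0.10, 0.30, 0.20]]),
    np.array([[0.20, 0.50, 0.10, 0.20], [0.30, 0.10, 0.50, 0.10]]),
]

# Average the fold predictions, then rank the three most probable classes
mean_proba = np.mean(fold_probas, axis=0)
top3_idx = np.argsort(mean_proba, axis=1)[:, -3:][:, ::-1]

# Join the decoded labels with spaces, one string per test row
top3_labels = [' '.join(class_names[row]) for row in top3_idx]
submission = pd.DataFrame({'id': [0, 1], 'Fertilizer Name': top3_labels})
print(submission)
```

In the real pipeline, `class_names` would come from the fitted target encoder and the ids from the test file.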


In [None]:
# =============================================================================
# FILE SAVING CONFIGURATION
# =============================================================================

# Configure model name based on MAP@3
overall_map3 = training_results['overall_map3']
model_name = f"XGB_{N_SPLITS}CV_MAP@3-{overall_map3:.5f}".replace('.', '')
model_dir = f"../models/XGB/{N_SPLITS}CV/{model_name}"

# Create directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

print(f"📁 MODEL DIRECTORY:")
print(f"  {model_dir}")

# File names: only the essential and recommended competition artifacts
base_filename = model_name
files_to_create = {
    'hparams': f"{base_filename}_hparams.json",                 # ✅ RECOMMENDED - Hyperparameters for reproducibility
    'metrics': f"{base_filename}_metrics.json",                 # ✅ RECOMMENDED - Performance metrics and config
    'submission': f"{base_filename}_submission.csv",            # ✅ ESSENTIAL - Competition submission file
    'submission_info': f"{base_filename}_submission_info.json"  # ✅ RECOMMENDED - Submission metadata
}

print(f"\n📝 FILES TO CREATE:")
for file_type, filename in files_to_create.items():
    print(f"  {file_type:15}: {filename}")



In [None]:
# =============================================================================
# SAVE HYPERPARAMETERS AND METRICS - COMPETITION
# =============================================================================

# Extract metrics from training results
cv_results_df = pd.DataFrame(training_results['cv_scores'])
map3_mean = cv_results_df['map3'].mean()
map3_std = cv_results_df['map3'].std()
accuracy_mean = cv_results_df['accuracy'].mean()
accuracy_std = cv_results_df['accuracy'].std()

# 1. HYPERPARAMETERS DATA
hparams_data = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "optimization_method": "Manual hyperparameter tuning",
    "ensemble_method": "Average of fold predictions",
    
    # Fixed hyperparameters used
    "hyperparameters": xgb_params,
    
    # General configuration
    "features_selected": features_to_use,
    "num_features": len(features_to_use),
    "class_weights_used": True,
    "random_state": RANDOM_STATE,
    "cv_splits": N_SPLITS,
    "total_models": len(training_results['models']),
    "early_stopping_rounds": es
}

# 2. METRICS DATA  
metrics_data = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB",
    "tier": "10_FOLD_CV",
    "target_variable": "Fertilizer Name",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "optimization_method": "Manual hyperparameter tuning",
    
    # Main performance metrics
    "map3_score_cv_mean": float(map3_mean),
    "map3_score_cv_std": float(map3_std),
    "map3_score_oof": float(training_results['overall_map3']),
    "accuracy_cv_mean": float(accuracy_mean),
    "accuracy_cv_std": float(accuracy_std),
    "accuracy_oof": float(training_results['overall_accuracy']),
    
    # Model configuration
    "num_classes": len(label_encoders['target'].classes_),
    "features_used": len(features_to_use),
    "features_list": features_to_use,
    "cv_folds": N_SPLITS,
    "total_models_trained": len(training_results['models']),
    
    # Detailed fold results
    "fold_results": training_results['cv_scores'],
    
    # Stability statistics
    "accuracy_cv_coefficient": float(accuracy_std / accuracy_mean) if accuracy_mean > 0 else 0.0,
    "map3_cv_coefficient": float(map3_std / map3_mean) if map3_mean > 0 else 0.0,
    
    # Training performance
    "training_time_total": float(total_time),
    "training_time_per_fold_avg": float(cv_results_df['training_time'].mean()),
    
    # Hyperparameters used
    "hyperparameters": xgb_params,
    
    # Feature importance summary (lightweight)
    "top_features": training_results['feature_importance'].head(10).to_dict('records') if training_results['feature_importance'] is not None else None,
    
    # Metadata
    "timestamp": datetime.now().isoformat(),
    "kaggle_competition": "playground-series-s5e6",
    "ensemble_method": f"Average of {N_SPLITS}-fold CV models",
    "models_saved": False,  # Models used in-memory only for predictions
    "memory_optimized": True
}

# Save both files
hparams_file = os.path.join(model_dir, files_to_create['hparams'])
metrics_file = os.path.join(model_dir, files_to_create['metrics'])

with open(hparams_file, 'w') as f:
    json.dump(hparams_data, f, indent=2)
    
with open(metrics_file, 'w') as f:
    json.dump(metrics_data, f, indent=2)

print(f"✅ Competition files saved:")
print(f"  📄 Hyperparameters: {files_to_create['hparams']}")
print(f"  📊 Metrics: {files_to_create['metrics']}")
total_kb = (os.path.getsize(hparams_file) + os.path.getsize(metrics_file)) / 1024
print(f"  💾 Size: {total_kb:.1f} KB total")
print(f"  🚀 Competition ready!")

## 🔮 11. Test Predictions & Submission Generation

**Why?**

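Average the class probabilities from the 10 fold models, then write the three most probable fertilizers for each row, in descending order of probability, to the Kaggle submission file. This is the format the MAP@3 metric scores.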
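
To make the metric concrete, here is a minimal sketch of MAP@3 for the single-label case, where each row has exactly one correct class: a row scores 1, 1/2, or 1/3 when the true class is ranked first, second, or third, and 0 otherwise. The function names `top_k_indices` and `map_at_k` and the toy probabilities are hypothetical; this is not necessarily how the notebook's own scoring function is written.

```python
import numpy as np


def top_k_indices(proba: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the k most probable class indices per row, most probable first."""
    return np.argsort(proba, axis=1)[:, -k:][:, ::-1]


def map_at_k(y_true, top_k: np.ndarray) -> float:
    """Mean average precision at k when each row has exactly one true class."""
    hits = top_k == np.asarray(y_true)[:, None]          # (n_rows, k) boolean
    weights = 1.0 / np.arange(1, top_k.shape[1] + 1)     # 1, 1/2, 1/3, ...
    return float((hits * weights).sum(axis=1).mean())


# Toy probabilities for 3 rows and 4 classes, for illustration only
proba = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.40, 0.10, 0.30, 0.20],
    [0.05, 0.15, 0.30, 0.50],
])
y_true = [1, 3, 0]

top3 = top_k_indices(proba)
score = map_at_k(y_true, top3)
print(top3)
print(f"MAP@3 = {score:.4f}")
```

In the toy data the first row is a first-place hit, the second a third-place hit, and the third a miss, so the order of the three names in each submission row matters.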

In [None]:
# =============================================================================
# GENERATE TEST PREDICTIONS AND CREATE KAGGLE SUBMISSION
# =============================================================================

print(f"🔮 Generating test predictions using {len(training_results['models'])}-model ensemble...")

# Generate ensemble predictions using all trained models
test_predictions_all = []
for fold_name, model in training_results['models'].items():
    pred_proba = model.predict_proba(X_test_final)
    test_predictions_all.append(pred_proba)

# Average predictions across all folds
test_predictions_ensemble = np.mean(test_predictions_all, axis=0)

# Get top 3 predictions for each sample (MAP@3 format)
test_top3_indices = np.argsort(test_predictions_ensemble, axis=1)[:, -3:][:, ::-1]

# Convert prediction indices to fertilizer names in one vectorized lookup
# (classes_[i] is the label that inverse_transform returns for index i)
test_top3_names = label_encoders['target'].classes_[test_top3_indices]

# Create submission format (space-separated top 3 fertilizers)
submission_predictions = [' '.join(top3_names) for top3_names in test_top3_names]

# Create final submission DataFrame
submission = pd.DataFrame({
    'id': sample_submission['id'].copy(),
    'Fertilizer Name': submission_predictions
})

# Save submission file
submission_file = os.path.join(model_dir, files_to_create['submission'])
submission.to_csv(submission_file, index=False)

# Create submission metadata
submission_info = {
    "model_type": "XGBClassifier",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "ensemble_method": f"Average of {N_SPLITS}-fold CV models",
    "map3_score_cv": f"{map3_mean:.5f} ± {map3_std:.5f}",
    "map3_score_oof": float(training_results['overall_map3']),
    "submission_file": files_to_create['submission'],
    "num_predictions": len(submission),
    "features_used": len(features_to_use),
    "hyperparameters": xgb_params,
    "timestamp": datetime.now().isoformat(),
    "kaggle_competition": "playground-series-s5e6"
}

# Save submission metadata
submission_info_file = os.path.join(model_dir, files_to_create['submission_info'])
with open(submission_info_file, 'w') as f:
    json.dump(submission_info, f, indent=2)

print(f"✅ KAGGLE SUBMISSION READY")
print(f"  📄 File: {files_to_create['submission']}")
print(f"  📊 Samples: {len(submission):,}")
print(f"  📈 MAP@3 (CV): {map3_mean:.5f} ± {map3_std:.5f}")
print(f"  📈 MAP@3 (OOF): {training_results['overall_map3']:.5f}")
print(f"  ⏱️ Training: {total_time/60:.1f} minutes")
print(f"  🚀 Ready for competition upload!")
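
Before uploading, it can help to sanity-check the submission frame. The sketch below assumes the columns are `id` and `Fertilizer Name`, as in the cell above, and that no class label contains a space, since the labels are space-separated. The helper `check_submission` and the placeholder labels are hypothetical.

```python
import pandas as pd


def check_submission(df: pd.DataFrame, target_col: str = "Fertilizer Name", k: int = 3) -> list:
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df[target_col].isna().any():
        problems.append("missing predictions")
        return problems
    parts = df[target_col].astype(str).str.split(" ")
    if (parts.str.len() != k).any():
        problems.append(f"rows without exactly {k} labels")
    if parts.map(lambda p: len(set(p)) != len(p)).any():
        problems.append("repeated labels within a row")
    return problems


# Placeholder labels, for illustration only
good = pd.DataFrame({"id": [1, 2], "Fertilizer Name": ["A B C", "B C A"]})
bad = pd.DataFrame({"id": [1, 1], "Fertilizer Name": ["A A B", "A B"]})

print(check_submission(good))
print(check_submission(bad))
```

Calling `check_submission(submission)` right after the frame is built would flag these issues before the CSV is uploaded.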



---

<div style="text-align: center; padding: 20px; margin: 20px 0; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius: 15px; box-shadow: 0 8px 16px rgba(0,0,0,0.1); color: white;">
  <h2 style="color: white; margin-bottom: 15px;">🎉 Thank You for Reading! 🎉</h2>
  <p style="font-size: 18px; margin-bottom: 10px;"><strong>Dear Fellow Kagglers & Data Scientists,</strong> 🤝</p>
  <p style="font-size: 16px; line-height: 1.6;">Thank you for taking the time to explore this notebook! I hope you found the analysis insightful and the methodology useful for your own projects.</p>
</div>

### 💬 **Your Feedback Matters!**

- **👍 Upvote** if this notebook was helpful
- **💭 Comment** with your thoughts, suggestions, or questions
- **🔧 Fork & Improve** - I'd love to see your enhancements!
- **📊 Share your results** and insights

---

**🔗 Connect & Collaborate:**
- Follow for more data science content and competition solutions
- Join the discussion in the comments below
- Share this notebook if you found it valuable

**🏆 Keep exploring, keep learning, keep growing!** 🌱

---