# 🚀 Kaggle Playground Series S5E6: XGBoost with 10-Fold CV - Fertilizer Prediction

**Advanced XGBoost modeling pipeline with stratified cross-validation for agricultural fertilizer recommendation**

This notebook implements a **comprehensive XGBoost modeling solution** for the **Kaggle Playground Series S5E6: Fertilizer Prediction Challenge**. It builds upon the exploratory data analysis (EDA) insights to create a robust, competition-ready model with advanced feature engineering and rigorous validation.

---

## 🎯 Competition Overview

**Objective**: Select the optimal fertilizer for different agricultural conditions (weather, soil, crops)

**Problem Type**: Multi-class classification with 22 fertilizer categories

**Dataset**: Agricultural features including:
- 🌡️ Environmental conditions (Temperature, Humidity, Soil Moisture)
- 🧪 Soil nutrients (Nitrogen, Phosphorus, Potassium)
- 🌱 Agricultural context (Soil Type, Crop Type)
- 🎯 Target: Fertilizer Name (22 different fertilizer types)

---

## 📊 Evaluation Metric: MAP@3

Predictions are evaluated using **Mean Average Precision @ 3 (MAP@3)**:

$$\text{MAP@3} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{\min(3,|P_i|)} P(k) \times \mathrm{rel}(k)$$

Where:
- **N** = number of observations
- **P(k)** = precision at cutoff k
- **|P_i|** = number of predictions per observation
- **rel(k)** = indicator function (1 if the item at rank k is correct, 0 otherwise)
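
Because each observation here has exactly one correct fertilizer, the double sum simplifies: $\mathrm{rel}(k)$ equals 1 for at most one rank, and the precision at that rank is $1/k$. Writing $r_i$ for the rank of the true label in the prediction list of observation $i$ (a symbol introduced here only for illustration), the metric can be read as:

$$\text{MAP@3} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbb{1}[r_i \le 3]}{r_i}$$

A correct first guess therefore scores 1, a correct second guess 1/2, a correct third guess 1/3, and anything else 0.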

**Key Points for XGBoost Optimization**:
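- The model must output **class probabilities** so the three most likely fertilizers can be ranked for each observation
- **Correct predictions in higher positions** are rewarded more heavily
- **Probability calibration** matters only insofar as it changes the order of classes within an observation; the absolute probability values do not enter the score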
- **Ensemble methods** can improve ranking stability

---

## 🤖 Why XGBoost for This Challenge?

**XGBoost (eXtreme Gradient Boosting)** is a strong choice for this agricultural classification task for several reasons:

### 🏆 **Algorithmic Advantages**
- **Tree-based architecture**: Naturally handles feature interactions (e.g., soil-crop combinations, NPK ratios)
- **Built-in regularization**: Prevents overfitting with L1/L2 penalties and tree constraints
- **Missing value handling**: Robust to data inconsistencies (though our dataset is complete)
- **Feature selection**: Automatic relevance weighting through tree splitting decisions

### 📊 **Multi-class Excellence**
- **Native multi-class support**: Handles 22 fertilizer classes efficiently with `multi:softprob` objective
- **Probability outputs**: Essential for MAP@3 ranking optimization
- **Class imbalance handling**: Supports sample weighting for balanced performance across fertilizers
- **Calibrated predictions**: Reliable probability estimates for ranking tasks

### ⚡ **Performance & Scalability**
- **Fast training**: Efficient gradient boosting implementation with parallel processing
- **Memory efficiency**: Handles large feature sets without memory issues
- **Early stopping**: Prevents overfitting with validation-based stopping criteria
- **Cross-validation friendly**: Stable performance across different data splits

### 🔧 **Agricultural Domain Fit**
- **Non-linear relationships**: Captures complex agricultural interactions (temperature × humidity, NPK ratios)
- **Feature importance**: Provides interpretable insights for agricultural decision-making
- **Robust to outliers**: Agricultural data often contains natural extremes
- **Categorical handling**: Effectively processes soil types and crop varieties

### 🎯 **Competition-Specific Benefits**
- **Ensemble capability**: Multiple models from CV folds improve prediction stability
- **Hyperparameter flexibility**: Extensive tuning options for performance optimization
- **Proven track record**: Dominant algorithm in Kaggle tabular competitions
- **MAP@3 optimization**: Probability-based outputs ideal for ranking metrics

---

## 🗂️ Notebook Structure

**This comprehensive modeling pipeline covers:**

1. **📚 Library Import & Setup** - Essential ML libraries and utility functions
2. **📂 Data Loading** - Training/test data import (EDA completed separately)
3. **⚙️ Feature Engineering** - Advanced agricultural feature creation based on EDA insights
4. **🔢 Categorical Encoding** - Label encoding for tree-based algorithms
5. **🎯 Feature Selection** - Strategic feature subset selection for optimal performance
6. **🔄 Cross-Validation Setup** - Stratified 10-fold CV configuration for robust validation
7. **⚙️ XGBoost Configuration** - Hyperparameter optimization for multi-class classification
8. **🏋️ Model Training** - 10-fold CV training with early stopping and class balancing
9. **📊 Model Evaluation** - Comprehensive performance analysis and stability assessment
10. **🔮 Test Predictions** - Ensemble prediction generation for submission
11. **💾 Model Persistence** - Complete model artifacts and submission file creation

---

## 🌾 Expected Agricultural Insights Integration

**Based on EDA findings, this XGBoost implementation leverages:**
- **NPK ratio features** - Nutrient balance relationships more important than absolute values
- **Environmental stress indices** - Combined temperature-humidity effects on fertilizer needs
- **Soil-crop interactions** - Context-specific fertilizer requirements
- **Categorical binning** - Interpretable nutrient level groupings for tree splits
- **Class imbalance handling** - Weighted sampling for fair representation of rare fertilizers

**Performance Targets**:
- **Baseline**: MAP@3 > 0.30 (competitive threshold)
- **Target**: MAP@3 > 0.35 (top quartile performance)
- **Stretch**: MAP@3 > 0.40 (leaderboard contention)

**Let's build a world-class fertilizer recommendation system! 🌱🚀**

## 📚 1. Library Import & Setup

**Why?** We import essential libraries for the complete XGBoost modeling pipeline. These libraries provide data manipulation, machine learning algorithms, evaluation metrics, and model persistence capabilities required for competition-level performance.

**Key Components:**

- **Data handling**: pandas, numpy for data manipulation and numerical operations
- **Machine Learning**: XGBoost for gradient boosting, scikit-learn for preprocessing and evaluation
- **Validation**: StratifiedKFold for cross-validation, custom MAP@3 implementation
- **Model persistence**: joblib for saving models, json for metadata storage
- **Utilities**: time for performance monitoring, warnings for clean output

In [None]:
# Core data manipulation and analysis
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# Machine Learning - Scikit-learn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

# XGBoost
from xgboost import XGBClassifier
from xgboost.callback import EarlyStopping

# Model persistence and metadata
import joblib
import json
from datetime import datetime

# Utilities
import time
from collections import Counter

# Configuration
np.random.seed(513)


## 📏 MAP@K Evaluation Function

**Why?**
MAP@3 (Mean Average Precision at 3) is the official Kaggle competition metric. This function calculates how well our model ranks the correct fertilizer within the top 3 predictions for each sample.

**Function Details:**
- **Input**: Actual fertilizer labels and predicted rankings (top-k)
- **Output**: Score between 0 and 1 (higher is better)
- **Logic**: Rewards correct predictions more heavily when they appear earlier in the ranking
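- **Competition Critical**: Each sample has exactly one correct fertilizer, so the per-sample score reduces to 1/rank of the first correct prediction within the top k, consistent with the MAP@3 definition above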

In [None]:
def mapk(actual, predicted, k=3):
    """Compute mean average precision at k (MAP@k)."""
    def apk(a, p, k):
        score = 0.0
        for i in range(min(k, len(p))):
            if p[i] == a:
                score += 1.0 / (i + 1)
                break  # only the first correct prediction counts
        return score
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])
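
As a quick sanity check of the scoring rule, the sketch below uses a vectorized variant of the same logic on made-up labels. The name `mapk_vectorized` and the toy data are illustrative only, and the variant assumes every sample has the same number of predictions.

```python
import numpy as np

def mapk_vectorized(actual, predicted, k=3):
    """MAP@k for single-label targets; assumes a rectangular prediction array."""
    actual = np.asarray(actual).reshape(-1, 1)            # shape (n, 1)
    predicted = np.asarray(predicted)[:, :k]              # shape (n, <=k)
    hits = predicted == actual                            # True where a guess is correct
    first_hit = hits & (np.cumsum(hits, axis=1) == 1)     # keep only the first correct guess
    weights = 1.0 / np.arange(1, predicted.shape[1] + 1)  # 1, 1/2, 1/3, ...
    return float((first_hit * weights).sum(axis=1).mean())

# Toy labels (made up): the true class sits at rank 1, 2, 3, and outside the top 3
actual = [0, 1, 2, 3]
predicted = [[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]]
score = mapk_vectorized(actual, predicted)
print(f"MAP@3 = {score:.4f}")  # (1 + 1/2 + 1/3 + 0) / 4 = 0.4583
```

The four samples contribute 1, 1/2, 1/3, and 0, which is a handy reference when checking any MAP@3 implementation.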

## 📂 2. Data Loading & Preparation

**Why?**  
We load the competition datasets and perform initial data separation. The exploratory data analysis (EDA) has been completed in a separate notebook, providing insights that guide our feature engineering and modeling approach.

**Data Sources:**
- **Training set**: 100,000+ samples with fertilizer labels for model training
- **Test set**: Unlabeled samples for final predictions and submission
- **Sample submission**: Template for properly formatted competition submissions

**Strategy:**
- Clean separation between features (X) and target variable (y)
- Consistent preprocessing pipeline for both training and test data
- Memory-efficient loading for large agricultural datasets
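
One possible way to make loading more memory-efficient is to pass explicit dtypes to `pd.read_csv`. The sketch below is an illustration rather than what this notebook does: the rows are made up, and it assumes the numeric columns hold small integers, which should be verified on the real data first.

```python
import io
import pandas as pd

# Made-up rows that only mimic the column layout used in this notebook
csv_text = """Temparature,Humidity,Moisture,Soil Type,Crop Type,Nitrogen,Potassium,Phosphorous,Fertilizer Name
30,60,40,Sandy,Maize,20,5,10,Urea
28,55,35,Clayey,Paddy,12,8,30,DAP
33,70,50,Sandy,Maize,25,3,12,Urea
"""

numeric_cols = ['Temparature', 'Humidity', 'Moisture', 'Nitrogen', 'Potassium', 'Phosphorous']
category_cols = ['Soil Type', 'Crop Type', 'Fertilizer Name']
# Assumption: small integer ranges, so int16 is wide enough
dtypes = {**{c: 'int16' for c in numeric_cols}, **{c: 'category' for c in category_cols}}

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(default_df.memory_usage(deep=True).sum(), "bytes with default dtypes")
print(compact_df.memory_usage(deep=True).sum(), "bytes with explicit dtypes")
print(compact_df.dtypes)
```

The savings only become meaningful on large files; on a handful of rows the fixed overhead of categorical columns can dominate.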

In [None]:
# Define file paths
data_path = '../data'
train_path = os.path.join(data_path, 'train.csv')
test_path = os.path.join(data_path, 'test.csv')
sample_submission_path = os.path.join(data_path, 'sample_submission.csv')

# Load datasets
print("📂 Loading datasets...")
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
sample_submission = pd.read_csv(sample_submission_path)

print("✅ Data loaded successfully:")
print(f"  • Training set: {train_df.shape}")
print(f"  • Test set: {test_df.shape}")
print(f"  • Sample submission: {sample_submission.shape}")



In [None]:
# Separate features and target variable
target_column = 'Fertilizer Name'

# Split training data
X_raw = train_df.drop(columns=[target_column])
y_raw = train_df[target_column]
X_test_raw = test_df.copy()

print("✅ Data separation completed:")
print(f"  • Training features: {X_raw.shape}")
print(f"  • Training target: {y_raw.shape}")
print(f"  • Test features: {X_test_raw.shape}")
print(f"  • Target classes: {y_raw.nunique()}")

## 🔬 3. Advanced Feature Engineering

**Why?** Based on our EDA findings, we create sophisticated features that capture agricultural relationships and domain knowledge. These engineered features significantly outperformed original features in mutual information analysis.

**Agricultural Domain Features:**

### 🧪 **NPK Ratio Features** (High Impact - EDA validated)

- **N_P_ratio, N_K_ratio, P_K_ratio**: Nutrient balance ratios crucial for plant growth
- **Total_NPK**: Overall soil fertility indicator
- **NPK_Balance**: Nutrient distribution uniformity measure
- **Why important**: Agricultural science shows ratios often more predictive than absolute values

### 🌡️ **Environmental Interaction Features** (Medium-High Impact)

- **Temp_Hum_index**: Combined temperature-humidity stress indicator
- **Moist_Balance**: Soil vs air moisture differential affecting nutrient uptake
- **Environ_Stress**: Distance from optimal growing conditions (25°C, 65% humidity)
- **Temp_Moist_inter**: Temperature-moisture interaction for evaporation effects

### 🏷️ **Categorical Binning** (Medium Impact - Tree-friendly)

- **Temp_Cat, Hum_Cat**: Low/Medium/High environmental categories
- **N_Level, K_Level, P_Level**: Nutrient level categorizations
- **Why effective**: Creates interpretable breakpoints for tree-based models

### 🔗 **Agricultural Context Features** (High Impact)

- **Soil_Crop_Combo**: Captures context-specific fertilizer requirements
- **Dominant_NPK**: Identifies primary nutrient in soil composition
- **Why critical**: Different crops have different needs based on soil type

**Engineering Validation:**

- EDA showed 67% improvement in mutual information scores
- Engineered features dominated top-10 most informative variables
- Agricultural domain knowledge confirmed feature relevance
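
The EDA notebook is not reproduced here, so the following is only a sketch of how such a mutual information comparison could be run with scikit-learn's `mutual_info_classif`. The data is synthetic and built so that the class depends on the N/P ratio; it says nothing about the scores obtained on the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data (not the competition data)
nitrogen = rng.uniform(5, 40, n)
phosphorous = rng.uniform(5, 40, n)
ratio = nitrogen / (phosphorous + 0.001)

# Four classes defined purely by thresholds on the ratio
y = np.digitize(ratio, bins=[0.5, 1.0, 2.0])

X = pd.DataFrame({'Nitrogen': nitrogen, 'Phosphorous': phosphorous, 'N_P_ratio': ratio})
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False))
```

In this constructed case the ratio carries more information about the class than either raw nutrient alone, which is the kind of pattern the engineered features aim to expose.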

In [None]:
def create_features(df):
    """
    Create engineered features based on agricultural domain knowledge
    
    Args:
        df: DataFrame with agricultural features
        
    Returns:
        DataFrame with additional engineered features
    """
    df_eng = df.copy()
    
    # NPK Ratios (crucial for agricultural decisions)
    df_eng['N_P_ratio'] = df_eng['Nitrogen'] / (df_eng['Phosphorous'] + 0.001)
    df_eng['N_K_ratio'] = df_eng['Nitrogen'] / (df_eng['Potassium'] + 0.001)
    df_eng['P_K_ratio'] = df_eng['Phosphorous'] / (df_eng['Potassium'] + 0.001)
    
    # Total NPK and NPK Balance
    df_eng['Total_NPK'] = df_eng['Nitrogen'] + df_eng['Phosphorous'] + df_eng['Potassium']
    npk_mean = df_eng[['Nitrogen', 'Phosphorous', 'Potassium']].mean(axis=1)
    df_eng['NPK_Balance'] = df_eng[['Nitrogen', 'Phosphorous', 'Potassium']].std(axis=1) / (npk_mean + 0.001)
    
    # Environmental indices
    df_eng['Temp_Hum_index'] = df_eng['Temparature'] * df_eng['Humidity'] / 100
    df_eng['Moist_Balance'] = df_eng['Moisture'] - df_eng['Humidity']
    df_eng['Environ_Stress'] = np.sqrt((df_eng['Temparature'] - 25)**2 + (df_eng['Humidity'] - 65)**2)
    df_eng['Temp_Moist_inter'] = df_eng['Temparature'] * df_eng['Moisture'] / 100
    
    # Dominant nutrient
    npk_cols = ['Nitrogen', 'Phosphorous', 'Potassium']
    df_eng['Dominant_NPK'] = df_eng[npk_cols].idxmax(axis=1)
    
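    # Categorical binning (equal-width bins derived from this DataFrame's own
    # min/max, so train and test share bin edges only if their value ranges match)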
    df_eng['Temp_Cat'] = pd.cut(df_eng['Temparature'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['Hum_Cat'] = pd.cut(df_eng['Humidity'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['N_Level'] = pd.cut(df_eng['Nitrogen'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['K_Level'] = pd.cut(df_eng['Potassium'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['P_Level'] = pd.cut(df_eng['Phosphorous'], bins=3, labels=['Low', 'Medium', 'High'])
    
    # Soil-Crop interaction
    df_eng['Soil_Crop_Combo'] = df_eng['Soil Type'].astype(str) + '_' + df_eng['Crop Type'].astype(str)
    
    return df_eng

# Apply feature engineering
print("🔧 Applying feature engineering...")
X_train_featured = create_features(X_raw)
X_test_featured = create_features(X_test_raw)

print(f"✅ Feature engineering completed:")
print(f"  • Original features: {X_raw.shape[1]}")
print(f"  • After feature engineering: {X_train_featured.shape[1]}")
print(f"  • New features added: {X_train_featured.shape[1] - X_raw.shape[1]}")

# Display new feature names
original_features = set(X_raw.columns)
new_features = [col for col in X_train_featured.columns if col not in original_features]
print(f"\n🆕 New engineered features ({len(new_features)}):")
for i, feature in enumerate(new_features, 1):
    print(f"  {i:2d}. {feature}")
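
Since `pd.cut(..., bins=3)` derives its edges from whatever data it receives, the training and test sets can end up with slightly different Low/Medium/High boundaries. A minimal sketch of one alternative is to learn the edges on the training column and reuse them for the test column. The helper `fit_bin_edges` and the sample values are hypothetical, not part of this notebook.

```python
import numpy as np
import pandas as pd

def fit_bin_edges(train_values, n_bins=3):
    """Learn equal-width bin edges from the training column only."""
    _, edges = pd.cut(train_values, bins=n_bins, retbins=True)
    edges = edges.copy()
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    return edges

labels = ['Low', 'Medium', 'High']
train_temp = pd.Series([10, 20, 30, 40])  # made-up training values
test_temp = pd.Series([5, 25, 45])        # made-up test values with a wider range

edges = fit_bin_edges(train_temp)
train_cat = pd.cut(train_temp, bins=edges, labels=labels)
test_cat = pd.cut(test_temp, bins=edges, labels=labels)
print(test_cat.astype(str).tolist())  # ['Low', 'Medium', 'High']
```

Opening the outer edges to infinity keeps out-of-range test values from turning into missing categories.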

## 🔢 4. Categorical Variable Encoding

**Why?**  
Unless its native categorical support is enabled, XGBoost requires numerical inputs, so we convert categorical variables (soil types, crop types, engineered categories) to numerical representations while maintaining consistency between training and test sets.

**Encoding Strategy:**

- **LabelEncoder**: Optimal choice for tree-based algorithms like XGBoost
- **Consistent mapping**: Fit on combined train+test data to ensure same encoding
- **Target label encoding**: Fertilizer names converted to integer labels for classification

**Categorical Variables Processed:**

- **Original features**: Soil Type (10 categories), Crop Type (22 categories)
- **Engineered categories**: Temperature/Humidity/NPK level bins, Soil-Crop combinations
- **Target variable**: 22 fertilizer types → integer labels 0-21

**XGBoost Compatibility:**

- Tree-based algorithms handle label-encoded categoricals naturally
- No need for one-hot encoding (which would create sparse, high-dimensional features)
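- Does not preserve ordinality for the binned levels: `LabelEncoder` assigns codes in sorted label order (High=0, Low=1, Medium=2), which trees can still split on but which carries no Low < Medium < High meaning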
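
One detail worth checking is how `LabelEncoder` orders string labels: it assigns codes by sorted label, not by meaning. The sketch below contrasts that with an ordered pandas categorical, shown here as one possible alternative rather than the approach this notebook takes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = pd.Series(['Low', 'Medium', 'High', 'Low'])

# LabelEncoder assigns codes by sorted label: High=0, Low=1, Medium=2
alphabetical_codes = LabelEncoder().fit_transform(levels)

# An ordered categorical keeps the intended order: Low=0, Medium=1, High=2
ordered = pd.Categorical(levels, categories=['Low', 'Medium', 'High'], ordered=True)
ordinal_codes = ordered.codes

print(alphabetical_codes.tolist())  # [1, 2, 0, 1]
print(ordinal_codes.tolist())       # [0, 1, 2, 0]
```

Tree models can still isolate each level with label-encoded values, but an explicit ordering lets a single split separate "Low" from "Medium and High".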

In [None]:
def encode_categorical_features(X_train, X_test, y_train):
    """
    Encode categorical features using LabelEncoder
    
    Args:
        X_train: Training features
        X_test: Test features  
        y_train: Training target
        
    Returns:
        Tuple of (X_train_encoded, X_test_encoded, y_encoded, encoders_dict)
    """
    
    # Initialize encoders dictionary
    encoders = {}
    
    # Create copies to avoid modifying originals
    X_train_enc = X_train.copy()
    X_test_enc = X_test.copy()
    
    # Identify categorical columns
    categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
    
    print(f"🔢 Encoding categorical features...")
    print(f"Categorical columns found: {categorical_cols}")
    
    # Encode categorical features
    for col in categorical_cols:
        print(f"  • Encoding: {col}")
        
        # Create encoder
        encoder = LabelEncoder()
        
        # Fit on combined training and test data to ensure consistency
        combined_values = pd.concat([X_train[col], X_test[col]]).astype(str)
        encoder.fit(combined_values)
        
        # Transform both datasets
        X_train_enc[col] = encoder.transform(X_train[col].astype(str))
        X_test_enc[col] = encoder.transform(X_test[col].astype(str))
        
        # Store encoder
        encoders[col] = encoder
        
        print(f"    - Classes: {len(encoder.classes_)} | {list(encoder.classes_[:5])}{'...' if len(encoder.classes_) > 5 else ''}")
    
    # Encode target variable
    print(f"\n🎯 Encoding target variable: {target_column}")
    target_encoder = LabelEncoder()
    y_encoded = target_encoder.fit_transform(y_train)
    encoders['target'] = target_encoder
    
    print(f"  • Target classes: {len(target_encoder.classes_)}")
    # print(f"  • Class mapping preview: {dict(zip(target_encoder.classes_[:5], range(5)))}")
    
    return X_train_enc, X_test_enc, y_encoded, encoders

# Apply encoding
X_train_encoded, X_test_encoded, y_encoded, label_encoders = encode_categorical_features(
    X_train_featured, X_test_featured, y_raw
)

print(f"\n✅ Encoding completed:")
print(f"  • Training features: {X_train_encoded.shape}")
print(f"  • Test features: {X_test_encoded.shape}")
print(f"  • Encoded target: {y_encoded.shape}")
print(f"  • Encoders stored: {len(label_encoders)}")

## 🎯 5. Strategic Feature Selection

**Why?** We implement a flexible feature selection system based on EDA insights and mutual information analysis. This allows easy experimentation with different feature combinations to optimize model performance.

**Selection Philosophy:**

- **EDA-driven**: Prioritize features with high mutual information scores
- **Domain-informed**: Include agriculturally meaningful combinations
- **Computational efficiency**: Balance between information gain and training speed
- **Easy experimentation**: Toggle features on/off with simple commenting system

**Feature Categories Available:**

### 🌟 **Tier 1 Features** (Highest Mutual Information)

- Original environmental and NPK features
- Soil-Crop combinations (highest individual MI score)
- Crop Type encoding (essential agricultural context)

### 🔧 **Tier 2 Features** (High Impact Engineered)

- NPK ratios and balance metrics
- Environmental stress indices
- Temperature-humidity interactions

### 📊 **Tier 3 Features** (Complementary)

- Categorical level binnings
- Dominant nutrient indicators
- Secondary environmental features

**Current Configuration:**

- **Compact baseline**: The NPK values, `N_P_ratio`, `Soil_Crop_Combo`, and `Crop Type` are active
- **Ready for enhancement**: The original climate variables, `Soil Type`, and the remaining engineered features are commented out for easy activation
- **Validation built-in**: Automatic checking against available processed features

**Performance Strategy:**

- Start with baseline features for stable foundation
- Incrementally add engineered features based on validation performance
- Monitor for overfitting with high-dimensional feature sets

In [None]:
# =============================================================================
# FEATURE SELECTION FOR THE MODEL
# =============================================================================

features_to_use = [
    # 🌡️ ORIGINAL CLIMATE VARIABLES
    # 'Temparature',
    # 'Humidity', 
    # 'Moisture',
    
    # 🧪 CHEMICAL VARIABLES (NPK)
    'Nitrogen',
    'Potassium', 
    'Phosphorous',
    
    # 📊 ENGINEERED FEATURES - NPK RATIOS (from create_features)
    'N_P_ratio',
    # 'N_K_ratio',
    # 'P_K_ratio',
    # 'Total_NPK',
    # 'NPK_Balance',
    
    # 🌡️ ENGINEERED FEATURES - CLIMATE INDICES (from create_features)
    # 'Temp_Hum_index',
    # 'Moist_Balance',
    # 'Environ_Stress',
    # 'Temp_Moist_inter',
    
    # 🏷️ ENGINEERED FEATURES - CATEGORICAL LEVELS (from create_features, encoded)
    # 'Temp_Cat',
    # 'Hum_Cat',
    # 'N_Level',
    # 'K_Level',
    # 'P_Level',

    # 🔗 ENGINEERED FEATURES - COMBINATIONS (from create_features)
    'Soil_Crop_Combo', # ✅ Encoded during preprocessing
    # 'Dominant_NPK', # ✅ Encoded during preprocessing
    
    # 🔢 ENCODED CATEGORICAL FEATURES (from preprocessing)
    # 'Soil Type',      # ✅ Encoded during preprocessing
    'Crop Type',      # ✅ Encoded during preprocessing
]

# Validate available features against the actual processed dataset
print(f"🔍 Validating features against processed dataset...")
print(f"📊 Available columns in dataset: {list(X_train_encoded.columns)}")

available_features = []
missing_features = []

for feature in features_to_use:
    if feature in X_train_encoded.columns:
        available_features.append(feature)
    else:
        missing_features.append(feature)

# Update final feature list to only include available features
features_to_use = available_features

if missing_features:
    print(f"\n⚠️ Missing features (will be skipped): {missing_features}")

# Display selected features by category
print(f"\n📋 SELECTED FEATURES ({len(features_to_use)} total):")

# Group features by category for better readability
climate_original = [f for f in features_to_use if f in ['Temparature', 'Humidity', 'Moisture']]
npk_original = [f for f in features_to_use if f in ['Nitrogen', 'Potassium', 'Phosphorous']]
npk_ratios = [f for f in features_to_use if any(x in f for x in ['_ratio', 'Total_NPK', 'NPK_Balance'])]
climate_engineered = [f for f in features_to_use if any(x in f for x in ['Temp_Hum', 'Moist_Balance', 'Environ_Stress', 'Temp_Moist'])]
categorical_levels = [f for f in features_to_use if any(x in f for x in ['_Cat', '_Level'])]
combinations = [f for f in features_to_use if any(x in f for x in ['Combo', 'Dominant'])]
encoded_original = [f for f in features_to_use if f in ['Soil Type', 'Crop Type']]

feature_groups = [
    ("🌡️ Original Climate", climate_original),
    ("🧪 Original NPK", npk_original),
    ("📊 NPK Ratios", npk_ratios),
    ("🌡️ Climate Indices", climate_engineered),
    ("🏷️ Categorical Levels", categorical_levels),
    ("🔗 Combinations", combinations),
    ("🔢 Encoded Categories", encoded_original)
]

for group_name, group_features in feature_groups:
    if group_features:
        print(f"\n{group_name} ({len(group_features)}):")
        for i, feature in enumerate(group_features, 1):
            print(f"  {i:2d}. {feature}")

print(f"\n🚀 Ready for model training with {len(features_to_use)} features!")

# Create final training datasets
X_final = X_train_encoded[features_to_use].copy()
X_test_final = X_test_encoded[features_to_use].copy()

print(f"\n✅ Final dataset shapes:")
print(f"  • Training: {X_final.shape}")
print(f"  • Test: {X_test_final.shape}")
print(f"  • Target: {y_encoded.shape}")

## 🔄 6. Stratified 10-Fold Cross-Validation Setup

**Why?**
Robust cross-validation is critical for reliable model evaluation and preventing overfitting. We use stratified 10-fold CV to ensure consistent class distribution across folds while providing stable performance estimates.

**Stratified CV Benefits:**
- **Class preservation**: Each fold maintains the same fertilizer class proportions as the full dataset
- **Reduced variance**: 10 folds provide more stable performance estimates than 5-fold
- **Overfitting detection**: Multiple independent validations reveal model generalization
- **Competition alignment**: Stratified CV scores tend to track leaderboard performance more closely than a single hold-out split

**Configuration Details:**
- **10 folds**: Optimal balance between computational cost and statistical reliability
- **Stratification**: Critical for imbalanced 22-class fertilizer distribution
- **Shuffling**: Randomizes sample order to remove potential temporal/ordering biases
- **Fixed random state**: Ensures reproducible results across runs

**Class Distribution Analysis:**
- **22 fertilizer classes**: Some frequent (>10%), others rare (<1%)
- **Minimum class check**: Validates sufficient samples per class for stratification
- **Fold balance**: Each fold contains representative samples from all classes

**Statistical Robustness:**
- **~90% training, ~10% validation** per fold provides good train/validation balance
- **Out-of-fold predictions**: Enable unbiased performance estimation
- **Stability assessment**: Coefficient of variation across folds indicates model reliability

**Kaggle Competition Alignment:**
- **Local CV scores** should correlate with public/private leaderboard
- **Overfitting detection** through train/validation gap monitoring
- **Ensemble foundation** for combining predictions from multiple folds
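
The two properties described above (every row is validated exactly once, and each fold keeps the class proportions) can be checked directly. The sketch below does so on synthetic, deliberately imbalanced labels. The class shares and array sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(513)

# Synthetic imbalanced labels (made up): 3 classes with roughly 70% / 25% / 5% shares
y = rng.choice(3, size=1000, p=[0.70, 0.25, 0.05])
X = rng.normal(size=(1000, 4))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=513)
val_counts = np.zeros(len(y), dtype=int)
rare_share = []
for train_idx, val_idx in skf.split(X, y):
    val_counts[val_idx] += 1                     # how often each row is held out
    rare_share.append(np.mean(y[val_idx] == 2))  # share of the rarest class in this fold

print("Every row held out exactly once:", bool(np.all(val_counts == 1)))
print("Rare-class share per fold:", np.round(rare_share, 3))
print("Overall rare-class share:", round(float(np.mean(y == 2)), 3))
```

Because each row lands in exactly one validation fold, predictions collected from those folds form a complete out-of-fold matrix for the training set.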

In [None]:
# =============================================================================
# STRATIFIED 10-FOLD CROSS-VALIDATION CONFIGURATION
# =============================================================================

# Cross-validation parameters
N_SPLITS = 10  # 10-fold cross-validation for robust evaluation
RANDOM_STATE = 513
SHUFFLE = True

# Initialize StratifiedKFold to maintain class distribution
skf = StratifiedKFold(
    n_splits=N_SPLITS, 
    shuffle=SHUFFLE, 
    random_state=RANDOM_STATE
)

print(f"🔄 CROSS-VALIDATION CONFIGURATION:")
print(f"  • Number of folds: {N_SPLITS}")
print(f"  • Strategy: Stratified (maintains class proportions)")
print(f"  • Shuffle: {SHUFFLE}")
print(f"  • Random state: {RANDOM_STATE}")

# Analyze class distribution for stratification
print(f"\n📊 Class distribution analysis:")
unique_classes, class_counts = np.unique(y_encoded, return_counts=True)
print(f"  • Total classes: {len(unique_classes)}")
print(f"  • Total samples: {len(y_encoded)}")
print(f"  • Samples per fold: ~{len(y_encoded) // N_SPLITS}")

# Check minimum class size for stratification
min_class_count = min(class_counts)
print(f"  • Minimum class size: {min_class_count}")
if min_class_count < N_SPLITS:
    print(f"  ⚠️ Warning: Smallest class has {min_class_count} samples, less than {N_SPLITS} folds")
    print(f"    Some folds may not contain all classes")
else:
    print(f"  ✅ All classes have sufficient samples for {N_SPLITS}-fold CV")

# Preview fold splits
print(f"\n🔍 Fold size preview:")
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_final, y_encoded)):
    if fold_idx < 3:  # Show first 3 folds
        print(f"  Fold {fold_idx + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
    elif fold_idx == 3:
        print("  ...")
    elif fold_idx == N_SPLITS - 1:  # Show last fold
        print(f"  Fold {fold_idx + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
        break

## ⚙️ 7. XGBoost Hyperparameter Configuration

**Why?**
Optimal hyperparameter selection is crucial for XGBoost performance on multi-class classification tasks. These parameters are tuned specifically for the fertilizer prediction challenge, balancing model complexity, training efficiency, and generalization.

**Multi-Class Classification Setup:**
- **Objective**: `multi:softprob` - outputs class probabilities (essential for MAP@3 ranking)
- **num_class**: 22 - matches the number of fertilizer types
- **eval_metric**: `mlogloss` - standard multi-class loss for probability optimization

**Tree Structure Parameters:**
- **max_depth**: 12 - deep trees that capture complex agricultural interactions, with regularization and early stopping guarding against overfitting
- **min_child_weight**: left at the XGBoost default of 1 (a value of 3 is commented out in the configuration)
- **subsample**: 0.8 - row sampling reduces overfitting, maintains sample diversity
- **colsample_bytree**: 0.8 - feature sampling per tree, reduces feature dependency

**Learning & Regularization:**
- **learning_rate**: 0.03 - low rate that trades longer training for smoother convergence
- **n_estimators**: 3000 - upper bound on boosting rounds; early stopping determines the actual count
- **reg_alpha**: 0.1 - L1 regularization for feature selection
- **reg_lambda**: 1.0 - L2 regularization for weight smoothing
- **gamma**: 0.25 - minimum split loss, prevents unnecessary complexity

**Class Imbalance Handling:**
- **Balanced class weights**: Computed to handle fertilizer frequency imbalances
- **Sample weighting**: Applied per fold to ensure fair representation of rare classes
- **Weight calculation**: Uses scikit-learn's 'balanced' strategy for automatic adjustment
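
To make the 'balanced' strategy concrete, the sketch below reproduces it by hand on a tiny made-up label vector and expands the class weights into per-row sample weights. It relies on the labels being integers 0..n-1, as they are after label encoding.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])  # made-up labels: class 0 is three times as common
classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)

# 'balanced' is n_samples / (n_classes * class_count)
manual = len(y) / (len(classes) * np.bincount(y))

# Per-row weights, the form expected by the sample_weight argument of fit()
sample_weight = weights[y]

print(weights)        # approximately [0.667, 2.0]
print(manual)
print(sample_weight)
```

The rare class receives a weight three times larger, so both classes contribute the same total weight to the loss.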

**Performance & Stability:**
- **Early stopping**: 100 rounds - prevents overfitting while allowing convergence
- **Random state**: Fixed for reproducible results across runs
- **n_jobs**: -1 - utilizes all CPU cores for faster training
- **Verbosity**: 0 - clean output for production environment

**Agricultural Domain Optimization:**
- **Deep trees**: Capture soil-crop-environment interactions
- **Regularization**: Prevents overfitting on agricultural outliers
- **Probability focus**: Optimized for ranking performance in MAP@3
- **Class balance**: Ensures good performance across all fertilizer types

**Competition-Specific Tuning:**
- Parameters optimized for Kaggle's MAP@3 evaluation metric
- Balance between training time and model performance
- Robust to the specific agricultural dataset characteristics
- Designed for ensemble combination across CV folds

In [None]:
# =============================================================================
# XGBOOST HYPERPARAMETER CONFIGURATION
# =============================================================================

# Calculate class weights for imbalanced dataset
class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_encoded),
    y=y_encoded
)
class_weight_dict = dict(zip(np.unique(y_encoded), class_weights))

print("⚖️ Class weight calculation:")
print(f"  • Balanced class weights computed for {len(class_weight_dict)} classes")
print(f"  • Weight range: {min(class_weights):.3f} - {max(class_weights):.3f}")

# XGBoost hyperparameters (optimized for multi-class classification)
xgb_params = {
    # Multi-class objective
    'objective': 'multi:softprob',
    'num_class': len(label_encoders['target'].classes_),
    'eval_metric': 'mlogloss',
    
    # Tree structure
    'max_depth': 12,
    # 'min_child_weight': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    
    # Learning parameters
    'learning_rate': 0.03,
    'n_estimators': 3000,  # High number with early stopping
    
    # Regularization
    'reg_alpha': 0.1,  # L1 regularization
    'reg_lambda': 1.0,  # L2 regularization
    'gamma': 0.25,      # Minimum split loss
    
    # Performance
    'random_state': RANDOM_STATE,
    'n_jobs': -1,
    'verbosity': 0,
    
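    # Early stopping (constructor argument in recent XGBoost; needs eval_set in fit)
    'early_stopping_rounds': 100
}

# Early stopping configuration (kept in sync with xgb_params)
es = xgb_params['early_stopping_rounds']
eval_metric = xgb_params['eval_metric']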

print(f"\n🚀 XGBOOST CONFIGURATION:")
print(f"  • Objective: {xgb_params['objective']}")
print(f"  • Number of classes: {xgb_params['num_class']}")
print(f"  • Max depth: {xgb_params['max_depth']}")
print(f"  • Learning rate: {xgb_params['learning_rate']}")
print(f"  • Max estimators: {xgb_params['n_estimators']}")
print(f"  • Early stopping: {xgb_params['early_stopping_rounds']} rounds")
print(f"  • Evaluation metric: {xgb_params['eval_metric']}")
print(f"  • Regularization: L1={xgb_params['reg_alpha']}, L2={xgb_params['reg_lambda']}")
print(f"  • Class balancing: Enabled")

## 🏋️ 8. Model Training with 10-Fold Cross-Validation

**Why?**
This section implements the core training pipeline using stratified 10-fold cross-validation. Each fold trains an independent XGBoost model with early stopping and class balancing, creating a robust ensemble for final predictions.

**Training Pipeline Components:**

### 🔄 **Cross-Validation Loop**
- **10 independent models**: Each fold creates a separate XGBoost classifier
- **Stratified splits**: Maintain fertilizer class distribution across all folds
- **Early stopping**: Validation-based stopping prevents overfitting per fold
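- **Out-of-fold predictions**: Every training row is scored by a model that never trained on it; because the same validation fold also drives early stopping, the resulting estimate is slightly optimistic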

### ⚖️ **Class Balancing Strategy**
- **Per-fold class weights**: Calculated separately for each training fold
- **Sample weighting**: Applied during XGBoost training to handle imbalanced classes
- **Rare class protection**: Ensures minor fertilizer types receive adequate representation
- **Balanced accuracy**: Prevents model bias toward frequent fertilizer classes
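
The weighting above can be sketched in a few lines. This is a minimal illustration, assuming the "balanced" heuristic `n_samples / (n_classes * class_count)` that scikit-learn documents for `compute_class_weight`; the helper name `balanced_sample_weights` and the toy labels are hypothetical and not part of the notebook.

```python
import numpy as np

def balanced_sample_weights(y):
    """Per-sample weights so that every class contributes equal total weight."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # 'balanced' heuristic: n_samples / (n_classes * class_count)
    class_weights = len(y) / (len(classes) * counts)
    weight_by_class = dict(zip(classes.tolist(), class_weights.tolist()))
    return np.array([weight_by_class[label] for label in y.tolist()])

# Toy fold: class 0 is four times as frequent as class 2
y_fold = np.array([0, 0, 0, 0, 1, 1, 2])
weights = balanced_sample_weights(y_fold)

# Total weight per class is identical, so rare classes are not drowned out
totals = [weights[y_fold == c].sum() for c in (0, 1, 2)]
```

With this scheme the weights always sum to the number of samples, so the overall scale of the loss is unchanged; only the relative influence of each class shifts.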

### 📊 **Performance Monitoring**
- **Dual metrics**: Both accuracy and MAP@3 tracked per fold
- **Training time**: Monitor computational efficiency across folds
- **Best iteration**: Track optimal stopping point for each model
- **Validation scores**: Real-time performance feedback during training

### 🎯 **MAP@3 Calculation Pipeline**
- **Probability extraction**: XGBoost outputs class probabilities for each sample
- **Top-3 ranking**: Sort probabilities to identify highest confidence predictions
- **Format conversion**: Transform predictions to format expected by MAP@3 function
- **Fold-wise evaluation**: Calculate MAP@3 independently for each validation set
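
A minimal sketch of this pipeline for the single-label case follows. It is not the notebook's `mapk` helper, whose definition is not shown in this section; `map_at_3` and the toy probabilities are illustrative assumptions. With exactly one correct class per row, MAP@3 reduces to a credit of 1, 1/2, or 1/3 depending on the rank at which the true class appears.

```python
import numpy as np

def map_at_3(y_true, proba):
    """MAP@3 for single-label targets: 1, 1/2 or 1/3 credit by rank of the true class."""
    y_true = np.asarray(y_true)
    # Indices of the three largest probabilities, best first
    top3 = np.argsort(proba, axis=1)[:, -3:][:, ::-1]
    hits = top3 == y_true[:, None]        # at most one True per row
    credit = hits / np.arange(1, 4)       # rank 1 -> 1, rank 2 -> 1/2, rank 3 -> 1/3
    return credit.sum(axis=1).mean()

proba = np.array([
    [0.70, 0.20, 0.05, 0.05],   # true class 0 is ranked 1st -> credit 1
    [0.10, 0.50, 0.30, 0.10],   # true class 2 is ranked 2nd -> credit 1/2
    [0.40, 0.30, 0.20, 0.10],   # true class 3 is outside the top 3 -> credit 0
])
score = map_at_3([0, 2, 3], proba)   # (1 + 0.5 + 0) / 3
```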

### 🔍 **Feature Importance Tracking**
- **Per-fold importance**: Extract feature importance from each trained model
- **Aggregated insights**: Combine importance scores across all folds
- **Agricultural validation**: Verify that important features align with domain knowledge
- **Stability assessment**: Check consistency of feature rankings across folds
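
One way to make the stability check concrete is sketched below, under the assumption that per-fold importances are available as a `(n_folds, n_features)` array. The function `importance_stability`, the `top_k` parameter, and the numbers are hypothetical illustrations rather than notebook output.

```python
import numpy as np

def importance_stability(importances, feature_names, top_k=2):
    """Summarise per-fold importances: mean, std, and share of folds in the top-k."""
    importances = np.asarray(importances, dtype=float)   # shape (n_folds, n_features)
    mean = importances.mean(axis=0)
    std = importances.std(axis=0, ddof=1)                # sample std across folds
    # For every fold, the indices of its top_k most important features
    top_idx = np.argsort(importances, axis=1)[:, -top_k:]
    top_share = np.array([
        (top_idx == j).any(axis=1).mean() for j in range(len(feature_names))
    ])
    order = np.argsort(mean)[::-1]                       # most important first
    return [(feature_names[j], mean[j], std[j], top_share[j]) for j in order]

names = ["Nitrogen", "Phosphorus", "Potassium", "Humidity"]
per_fold = [
    [0.40, 0.30, 0.20, 0.10],   # made-up values, one row per fold
    [0.38, 0.32, 0.10, 0.20],
    [0.42, 0.28, 0.18, 0.12],
]
summary = importance_stability(per_fold, names)
```

A feature whose top-k share is close to 1.0 and whose standard deviation is small relative to its mean can be treated as consistently important; features that drift in and out of the top-k deserve a closer look.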

### 💾 **Model Persistence**
- **Ensemble storage**: Keep all 10 trained models in memory for final predictions
- **Metadata tracking**: Record training times, best iterations, hyperparameters
- **Reproducibility**: Maintain complete training history for analysis

**Training Outputs:**
- **10 XGBoost models**: One per fold, ready for ensemble prediction
- **Out-of-fold predictions**: Unbiased performance estimation matrix
- **Performance metrics**: Comprehensive accuracy and MAP@3 statistics
- **Feature importance**: Aggregated importance scores across all models
- **Training diagnostics**: Timing, iteration counts, and stability metrics

**Quality Assurance:**
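- **Validation monitoring**: Track validation `mlogloss` each boosting round via `eval_set` (training-set scores are not logged, so train/validation gaps are not measured here)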
- **Class distribution**: Verify balanced representation in each fold
- **Convergence checking**: Ensure early stopping triggers appropriately
- **Memory management**: Efficient handling of large agricultural dataset

**Competition Readiness:**
- **Ensemble foundation**: Multiple models reduce prediction variance
- **MAP@3 tracking**: MAP@3 is computed on every validation fold, while early stopping monitors `mlogloss`
- **Robust validation**: CV scores serve as the local estimate of leaderboard performance
- **Progress monitoring**: Per-fold timing and metric printouts (no explicit error handling)

In [None]:
# =============================================================================
# 10-FOLD CROSS-VALIDATION TRAINING
# =============================================================================

def train_xgboost_cv(X, y, features, cv_splitter, params):
    """
    Train XGBoost models using cross-validation
    
    Args:
        X: Feature matrix (pandas DataFrame)
        y: Target vector (encoded, NumPy array)
        features: List of feature names to use
        cv_splitter: Cross-validation splitter (StratifiedKFold)
        params: XGBoost parameters; 'num_class' sizes the OOF matrix and
            'early_stopping_rounds' sets the early stopping patience
        
    Returns:
        Dict with trained models, predictions, and metrics
    """
    
    # Initialize storage
    models = {}
    oof_predictions = np.zeros((len(X), params['num_class']))  # Out-of-fold predictions
    cv_scores = []
    feature_importance_list = []
    
    print(f"🏋️ Starting {N_SPLITS}-Fold Cross-Validation Training...")
    print(f"⏰ Training started at: {time.strftime('%H:%M:%S')}")
    
    # Cross-validation loop
    for fold_idx, (train_idx, val_idx) in enumerate(cv_splitter.split(X, y)):
        
        fold_start_time = time.time()
        print(f"\n📁 FOLD {fold_idx + 1}/{N_SPLITS}")
        print(f"  • Train samples: {len(train_idx)}")
        print(f"  • Validation samples: {len(val_idx)}")
        
        # Split data
        X_train_fold = X.iloc[train_idx][features]
        X_val_fold = X.iloc[val_idx][features]
        y_train_fold = y[train_idx]
        y_val_fold = y[val_idx]
        
        # Calculate sample weights for this fold
        fold_class_weights = compute_class_weight(
            'balanced',
            classes=np.unique(y_train_fold),
            y=y_train_fold
        )
        fold_class_weight_dict = dict(zip(np.unique(y_train_fold), fold_class_weights))
        sample_weights = np.array([fold_class_weight_dict.get(label, 1.0) for label in y_train_fold])
        
        # Initialize model
        model = XGBClassifier(**params)
        
        # Train with early stopping
        model.fit(
            X_train_fold, y_train_fold,
            sample_weight=sample_weights,
            eval_set=[(X_val_fold, y_val_fold)],
            verbose=False
        )
        
        # Predict validation set
        val_pred_proba = model.predict_proba(X_val_fold)
        val_pred_classes = model.predict(X_val_fold)
        
        # Store out-of-fold predictions
        oof_predictions[val_idx] = val_pred_proba
        
        # Calculate fold metrics
        fold_accuracy = accuracy_score(y_val_fold, val_pred_classes)
        
        # Calculate MAP@3 for this fold
        # Get top 3 predictions for each sample
        val_top3_indices = np.argsort(val_pred_proba, axis=1)[:, -3:][:, ::-1]
        
        # Convert to lists for mapk function
        actual_list = y_val_fold.tolist() if hasattr(y_val_fold, 'tolist') else list(y_val_fold)
        predicted_list = val_top3_indices.tolist()
        
        # Calculate MAP@3 using the correct format
        fold_map3 = mapk(actual_list, predicted_list, k=3)
        
        # Store results
        cv_scores.append({
            'fold': fold_idx + 1,
            'accuracy': fold_accuracy,
            'map3': fold_map3,
            'best_iteration': model.best_iteration,
            'train_samples': len(train_idx),
            'val_samples': len(val_idx),
            'training_time': time.time() - fold_start_time
        })
        
        # Store model and feature importance
        models[f'fold_{fold_idx + 1}'] = model
        
        if hasattr(model, 'feature_importances_'):
            importance_df = pd.DataFrame({
                'feature': features,
                'importance': model.feature_importances_,
                'fold': fold_idx + 1
            })
            feature_importance_list.append(importance_df)
        
        fold_time = time.time() - fold_start_time
        print(f"  ✅ Fold completed in {fold_time:.1f}s")
        print(f"  📊 Accuracy: {fold_accuracy:.4f} | MAP@3: {fold_map3:.4f}")
        print(f"  🔄 Best iteration: {model.best_iteration}")
    
    # Calculate overall metrics
    oof_pred_classes = np.argmax(oof_predictions, axis=1)
    overall_accuracy = accuracy_score(y, oof_pred_classes)
    
    # Calculate overall MAP@3
    # Get top 3 predictions for each sample
    oof_top3_indices = np.argsort(oof_predictions, axis=1)[:, -3:][:, ::-1]
    
    # Convert to lists for mapk function
    actual_list = y.tolist() if hasattr(y, 'tolist') else list(y)
    predicted_list = oof_top3_indices.tolist()
    
    # Calculate MAP@3 using the correct format
    overall_map3 = mapk(actual_list, predicted_list, k=3)
    
    # Combine feature importance across folds
    if feature_importance_list:
        feature_importance_df = pd.concat(feature_importance_list, ignore_index=True)
        feature_importance_summary = feature_importance_df.groupby('feature')['importance'].agg(['mean', 'std']).reset_index()
        feature_importance_summary = feature_importance_summary.sort_values('mean', ascending=False)
    else:
        feature_importance_summary = None
    
    return {
        'models': models,
        'oof_predictions': oof_predictions,
        'cv_scores': cv_scores,
        'overall_accuracy': overall_accuracy,
        'overall_map3': overall_map3,
        'feature_importance': feature_importance_summary
    }

# Execute cross-validation training
print("🚀 Starting model training...")
start_time = time.time()

training_results = train_xgboost_cv(
    X=X_final,
    y=y_encoded,
    features=features_to_use,
    cv_splitter=skf,
    params=xgb_params,
)

total_time = time.time() - start_time

print(f"\n🎉 CROSS-VALIDATION TRAINING COMPLETED!")
print(f"⏰ Training finished at: {time.strftime('%H:%M:%S')}")
print(f"⏱️ Total training time: {total_time:.1f}s ({total_time/60:.1f}min)")
print(f"📊 Overall Accuracy: {training_results['overall_accuracy']:.4f}")
print(f"📊 Overall MAP@3: {training_results['overall_map3']:.4f}")

## 📊 9. Comprehensive Model Evaluation & Analysis

**Why?** Thorough evaluation is essential to understand model performance, stability, and competition readiness. We analyze both statistical metrics and practical performance indicators to validate our XGBoost ensemble.

**Evaluation Framework:**

### 🎯 **Primary Metrics Analysis**

- **Cross-Validation MAP@3**: Mean and standard deviation across 10 folds
- **Out-of-Fold MAP@3**: Unbiased estimate using all training data
- **Cross-Validation Accuracy**: Secondary metric for model validation
- **Metric correlation**: Verify MAP@3 and accuracy alignment
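
The first two bullets report closely related numbers. Since MAP@3 is an average over samples, the out-of-fold score equals the fold scores weighted by fold size, whereas the cross-validation mean weights every fold equally. The sketch below shows the difference with deliberately unequal, made-up fold sizes; `fold_mean_vs_pooled` is a hypothetical helper.

```python
def fold_mean_vs_pooled(fold_scores, fold_sizes):
    """Compare the plain mean of fold scores with the size-weighted (pooled) score."""
    fold_mean = sum(fold_scores) / len(fold_scores)
    pooled = sum(s * n for s, n in zip(fold_scores, fold_sizes)) / sum(fold_sizes)
    return fold_mean, pooled

# Two folds of very different size, for illustration only
fold_mean, pooled = fold_mean_vs_pooled([0.30, 0.36], [100, 300])
```

With stratified 10-fold splits the fold sizes are nearly equal, so the two numbers should agree closely; a visible gap would hint at a bookkeeping problem in the out-of-fold matrix.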

### 📈 **Model Stability Assessment**

- **Coefficient of Variation**: Measures performance consistency across folds
- **Fold-to-fold variance**: Identifies potential overfitting or data leakage
- **Performance distribution**: Analyze best/worst performing folds
- **Stability thresholds**: A coefficient of variation (standard deviation / mean across folds) below 0.05 is treated as a stable, reliable model
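
As a worked example of the stability check, the coefficient of variation is the standard deviation of the fold scores divided by their mean. The sketch below uses the sample standard deviation, which matches the default of pandas `.std()`; the fold scores are invented for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean (0.0 if the mean is not positive)."""
    m = mean(scores)
    return stdev(scores) / m if m > 0 else 0.0

# Made-up fold scores, for illustration only
fold_map3 = [0.330, 0.334, 0.328, 0.332, 0.331]
cv = coefficient_of_variation(fold_map3)
is_stable = cv < 0.05   # same threshold as in the checklist above
```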

### ⏱️ **Training Efficiency Analysis**

- **Per-fold training time**: Monitor computational requirements
- **Total training time**: Assess scalability for larger datasets
- **Early stopping behavior**: Validate convergence patterns
- **Resource utilization**: Memory and CPU usage optimization

### 🔍 **Feature Importance Insights**

- **Aggregated importance**: Average feature importance across all 10 models
- **Importance stability**: Consistency of feature rankings across folds
- **Agricultural validation**: Verify important features match domain knowledge
- **Feature selection guidance**: Identify top performers for future iterations

### 📋 **Detailed Performance Breakdown**

- **Fold-by-fold results**: Individual performance analysis per fold
- **Best iteration tracking**: Optimal stopping points across folds
- **Training diagnostics**: Identify potential training issues
- **Performance trends**: Detect systematic patterns in fold performance

**Competition Readiness Indicators:**

### ✅ **Model Quality Checklist**

- **MAP@3 threshold**: Target performance above competitive baselines
- **Stability verification**: Low coefficient of variation across folds
- **No overfitting signs**: Reasonable train/validation performance gaps
- **Feature consistency**: Important features align with EDA insights

### 🏆 **Performance Benchmarks**

- **Baseline comparison**: Performance vs simple models or random predictions
- **Historical context**: Comparison with typical Kaggle competition scores
- **Improvement potential**: Identify areas for further optimization
- **Ensemble readiness**: Validate individual model quality for ensembling

### 🎲 **Risk Assessment**

- **Overfitting detection**: Large performance variations indicate problems
- **Data leakage check**: Unrealistically high scores may indicate leakage
- **Class balance verification**: Ensure good performance across all fertilizer types
- **Generalization confidence**: CV scores predict leaderboard performance

**Actionable Insights:**

- **Feature engineering guidance**: Which engineered features provide most value
- **Hyperparameter tuning direction**: Areas for potential improvement
- **Model ensemble strategy**: How to combine fold predictions optimally
- **Competition submission readiness**: Confidence in final model performance

**Quality Gates:**

- **MAP@3 > 0.30**: Competitive threshold for leaderboard entry
- **Coefficient of variation < 0.05**: Reliable, consistent model performance across folds
- **Feature importance alignment**: Agricultural domain validation
- **No training anomalies**: Clean, successful training across all folds

In [None]:
# =============================================================================
# CROSS-VALIDATION RESULTS EVALUATION
# =============================================================================

print("📊 CROSS-VALIDATION RESULTS")
print("=" * 60)

# Extract results from training
cv_results_df = pd.DataFrame(training_results['cv_scores'])

# Calculate statistics
accuracy_mean = cv_results_df['accuracy'].mean()
accuracy_std = cv_results_df['accuracy'].std()
map3_mean = cv_results_df['map3'].mean()
map3_std = cv_results_df['map3'].std()

print(f"🎯 FINAL METRICS:")
print(f"  📈 Cross-Validation Accuracy: {accuracy_mean:.4f} ± {accuracy_std:.4f}")
print(f"  📈 Cross-Validation MAP@3:    {map3_mean:.4f} ± {map3_std:.4f}")
print(f"  📈 Out-of-Fold Accuracy:      {training_results['overall_accuracy']:.4f}")
print(f"  📈 Out-of-Fold MAP@3:         {training_results['overall_map3']:.4f}")

# Stability evaluation
accuracy_cv = accuracy_std / accuracy_mean if accuracy_mean > 0 else 0
map3_cv = map3_std / map3_mean if map3_mean > 0 else 0

print(f"\n🔍 STABILITY ANALYSIS:")
print(f"  📊 Coefficient of variation (Accuracy): {accuracy_cv:.3f}")
print(f"  📊 Coefficient of variation (MAP@3):    {map3_cv:.3f}")
print(f"  {'✅ Stable model' if accuracy_cv < 0.05 else '⚠️ Variable model'} (Accuracy CV < 0.05)")
print(f"  {'✅ Stable model' if map3_cv < 0.05 else '⚠️ Variable model'} (MAP@3 CV < 0.05)")

# Training time analysis
avg_fold_time = cv_results_df['training_time'].mean()
print(f"\n⏱️ TRAINING TIMES:")
print(f"  📊 Average time per fold: {avg_fold_time:.1f}s")
print(f"  📊 Total time: {total_time:.1f}s ({total_time/60:.1f}min)")

# Detailed results by fold
print(f"\n📋 DETAILED RESULTS BY FOLD:")
print("Fold  Accuracy   MAP@3    Best_Iter  Time(s)")
print("-" * 50)
for _, row in cv_results_df.iterrows():
    print(f"{row['fold']:2.0f}    {row['accuracy']:.4f}   {row['map3']:.4f}     {row['best_iteration']:4.0f}   {row['training_time']:6.1f}")

print("-" * 50)
print(f"Mean  {accuracy_mean:.4f}   {map3_mean:.4f}     {cv_results_df['best_iteration'].mean():4.0f}   {avg_fold_time:6.1f}")

# Feature importance analysis
if training_results['feature_importance'] is not None:
    print(f"\n🔍 TOP 10 MOST IMPORTANT FEATURES:")
    print("Rank  Feature               Importance")
    print("-" * 40)
    for i, (_, row) in enumerate(training_results['feature_importance'].head(10).iterrows()):
        print(f"{i+1:2d}.   {row['feature']:20} {row['mean']:8.4f}")

## 🔮 11. Test Predictions & Ensemble Generation

**Why?** Generate final predictions for the test set using our trained ensemble of 10 XGBoost models. The ensemble approach combines predictions from all CV folds to create more robust and stable predictions than any single model.

**Ensemble Prediction Strategy:**

- **Model averaging**: Combine probability outputs from all 10 fold models
- **Probability smoothing**: Averaging dampens over-confident outputs from individual fold models, although it does not guarantee calibrated probabilities
- **Variance reduction**: Multiple models reduce prediction noise and improve stability
- **Ranking optimization**: Averaged probabilities can provide more stable MAP@3 rankings

**Test Prediction Pipeline:**

- **Consistent preprocessing**: Apply identical feature engineering and encoding as training
- **All-fold prediction**: Generate predictions from each of the 10 trained models
- **Probability averaging**: Mean ensemble of all model probability outputs
- **Top-3 extraction**: Rank fertilizers by ensemble probability for MAP@3 format

**Output Preparation:**

- **Competition format**: Convert top-3 predictions to space-separated fertilizer names
- **Submission ready**: Generate properly formatted CSV for Kaggle submission
- **Quality validation**: Verify prediction format and completeness
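
The averaging and formatting steps can be sketched as follows. This is a simplified illustration with placeholder class labels and two fold outputs instead of ten; `ensemble_top3` is a hypothetical helper, and it assumes every fold model emits probabilities in the same class order.

```python
import numpy as np

def ensemble_top3(fold_probas, class_names):
    """Average per-fold probability matrices and return 'a b c' strings per row."""
    mean_proba = np.mean(fold_probas, axis=0)                # (n_samples, n_classes)
    top3 = np.argsort(mean_proba, axis=1)[:, -3:][:, ::-1]   # best class first
    names = np.asarray(class_names)[top3]                    # index -> label
    return [" ".join(row) for row in names]

class_names = ["A", "B", "C", "D"]   # placeholder labels
fold_probas = [
    np.array([[0.5, 0.3, 0.1, 0.1], [0.1, 0.2, 0.3, 0.4]]),
    np.array([[0.3, 0.4, 0.2, 0.1], [0.1, 0.1, 0.3, 0.5]]),
]
rows = ensemble_top3(fold_probas, class_names)
```

Note that the first sample is ranked `A` first even though the second fold model preferred `B`; averaging lets the more confident fold outweigh the other.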

## 💾 10. Model Persistence & Competition Submission

**Why?**
Persisting the run's configuration and results makes it reproducible and leaves a complete record of our modeling approach. To keep the output small, only lightweight JSON metadata and the submission CSV are written to disk; the fold models stay in memory and must be retrained to be reused.

**Saved Artifacts:**

### 🤖 **Model Components**
- **Ensemble models**: All 10 trained XGBoost classifiers from CV folds, kept in memory only (not written to disk)
- **Label encoders**: Categorical variable and target variable encoding mappings, used in memory to decode predictions (not written to disk)
- **Feature list**: Exact features used for training, stored in the JSON metadata (for consistent preprocessing)
- **Preprocessing pipeline**: Reapplied to the test set inside the notebook rather than serialized

### 📊 **Performance Metrics**
- **Cross-validation results**: Detailed fold-by-fold performance metrics
- **Overall performance**: Aggregated MAP@3 and accuracy scores
- **Feature importance**: Aggregated importance rankings across all folds
- **Training diagnostics**: Times, iterations, and convergence information

### ⚙️ **Model Configuration**
- **Hyperparameters**: Complete XGBoost parameter set used for training
- **Training configuration**: CV strategy, random seeds, early stopping settings
- **Data processing**: Feature engineering choices and encoding strategies
- **Version information**: Software versions and environment details

### 🎯 **Competition Submission**
- **Formatted predictions**: Top-3 fertilizer recommendations in Kaggle format
- **Submission metadata**: Model description, performance metrics, methodology
- **Submission file**: Ready-to-upload CSV with proper formatting
- **Prediction confidence**: Ensemble probability scores for quality assessment

**File Organization:**
- **Structured directory**: Organized by model type and performance score
- **Unique naming**: Model directories named with MAP@3 score for easy comparison
- **Complete documentation**: JSON metadata files with full model information
- **Reproducible setup**: Everything needed to reproduce results from scratch
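
Metadata dictionaries often pick up NumPy scalars or arrays, and `json.dump` rejects types such as `np.int64` and `np.float32` by default. Whether that happens in this notebook depends on what the metric functions return, so the following is only a defensive sketch; `to_jsonable` and the sample values are hypothetical.

```python
import json
import numpy as np

def to_jsonable(value):
    """Fallback for json.dump: convert NumPy scalars and arrays to plain Python objects."""
    if isinstance(value, np.generic):      # np.int64, np.float32, np.bool_, ...
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")

metadata = {
    "best_iteration": np.int64(412),       # illustrative values only
    "map3": np.float32(0.5),
    "class_counts": np.array([3, 2, 1]),
}
text = json.dumps(metadata, default=to_jsonable, indent=2)
```

Passing the same function as `default=` to the existing `json.dump` calls would leave already-serializable values untouched, because the fallback is only invoked for objects the encoder cannot handle.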

**Quality Assurance:**
- **File verification**: Confirm all artifacts saved successfully
- **Format validation**: Ensure submission file meets Kaggle requirements
- **Metadata completeness**: Comprehensive documentation for future reference
- **Model integrity**: Verify saved models can be loaded and used for prediction
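
A format check along these lines can be sketched as below. It assumes each prediction is a space-separated string and that class labels themselves contain no spaces; `find_invalid_rows` and the placeholder labels are hypothetical.

```python
def find_invalid_rows(predictions, valid_labels, k=3):
    """Return indices of rows that are not exactly k distinct, known labels."""
    valid = set(valid_labels)
    bad = []
    for i, row in enumerate(predictions):
        labels = row.split(" ")
        if len(labels) != k or len(set(labels)) != k or not set(labels) <= valid:
            bad.append(i)
    return bad

valid_labels = ["A", "B", "C", "D"]                 # placeholder labels
predictions = ["A B C", "A A B", "A B", "A B Z"]    # only the first row is valid
bad_rows = find_invalid_rows(predictions, valid_labels)
```

Running such a check on the submission column before writing the CSV catches duplicated labels, short rows, and labels that were never seen in training.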

**Production Readiness:**
- **Deployment package**: Not produced by this run; deploying the ensemble would additionally require serializing the fold models and encoders
- **API integration**: Real-time inference would need those saved models and encoders, which are kept in memory only here
- **Performance tracking**: Baseline metrics for monitoring production performance
- **Version control**: Complete artifact versioning for model management

**Competition Documentation:**
- **Methodology record**: Complete approach documentation for team sharing
- **Performance tracking**: Historical record of model improvements
- **Submission history**: Track multiple submission attempts and results
- **Reproducibility guarantee**: Everything needed to recreate exact results

In [None]:
# =============================================================================
# FILE SAVING CONFIGURATION
# =============================================================================

import os
import json
import joblib
from datetime import datetime

# Configure model name based on MAP@3
overall_map3 = training_results['overall_map3']
model_name = f"XGB_{N_SPLITS}CV_MAP@3-{overall_map3:.5f}".replace('.', '')
model_dir = f"../models/XGB/{N_SPLITS}CV/{model_name}"

# Create directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

print(f"📁 MODEL DIRECTORY:")
print(f"  {model_dir}")

# File name configuration - KAGGLE COMPETITION ESSENTIALS ONLY
base_filename = model_name
files_to_create = {
    'hparams': f"{base_filename}_hparams.json",        # ✅ ESSENTIAL - Hyperparameters for reproducibility
    'metrics': f"{base_filename}_metrics.json",        # ✅ ESSENTIAL - Performance metrics and config
    'submission': f"{base_filename}_submission.csv",   # ✅ ESSENTIAL - Competition submission file
    'submission_info': f"{base_filename}_submission_info.json"  # ✅ ESSENTIAL - Submission metadata
    
    # ❌ REMOVED - Not needed for Kaggle competition:
    # 'metrics_pkl': Heavy pickle file with redundant data
    # 'model_pkl': Heavy model files (ensemble used in-memory only)
    # 'feature_import': Nice-to-have but not essential for submission
}

print(f"\n📝 FILES TO CREATE:")
for file_type, filename in files_to_create.items():
    print(f"  {file_type:15}: {filename}")

In [None]:
# =============================================================================
# SAVE HYPERPARAMETERS
# =============================================================================

# General model information
hparams_data = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "optimization_method": "Manual hyperparameter tuning",
    "ensemble_method": "Average of fold predictions",
    
    # Fixed hyperparameters used
    "hyperparameters": xgb_params,
    
    # General configuration
    "features_selected": features_to_use,
    "num_features": len(features_to_use),
    "class_weights_used": True,
    "random_state": RANDOM_STATE,
    "cv_splits": N_SPLITS,
    "total_models": len(training_results['models']),
    "early_stopping_rounds": es
}

# Save general hyperparameters
hparams_file = os.path.join(model_dir, files_to_create['hparams'])
with open(hparams_file, 'w') as f:
    json.dump(hparams_data, f, indent=2)

print(f"✅ Hyperparameters saved:")
print(f"  📄 General: {files_to_create['hparams']}")



In [None]:
# =============================================================================
# SAVE METRICS - KAGGLE COMPETITION ESSENTIALS ONLY
# =============================================================================

# Extract metrics from training results
cv_results_df = pd.DataFrame(training_results['cv_scores'])
map3_mean = cv_results_df['map3'].mean()
map3_std = cv_results_df['map3'].std()
accuracy_mean = cv_results_df['accuracy'].mean()
accuracy_std = cv_results_df['accuracy'].std()

# Complete metrics data for competition tracking
metrics_data = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB",
    "tier": "10_FOLD_CV",
    "target_variable": "Fertilizer Name",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "optimization_method": "Manual hyperparameter tuning",
    
    # Main performance metrics
    "map3_score_cv_mean": float(map3_mean),
    "map3_score_cv_std": float(map3_std),
    "map3_score_oof": float(training_results['overall_map3']),
    "accuracy_cv_mean": float(accuracy_mean),
    "accuracy_cv_std": float(accuracy_std),
    "accuracy_oof": float(training_results['overall_accuracy']),
    
    # Model configuration
    "num_classes": len(label_encoders['target'].classes_),
    "features_used": len(features_to_use),
    "features_list": features_to_use,
    "cv_folds": N_SPLITS,
    "total_models_trained": len(training_results['models']),
    
    # Detailed fold results
    "fold_results": training_results['cv_scores'],
    
    # Stability statistics
    "accuracy_cv_coefficient": float(accuracy_std / accuracy_mean) if accuracy_mean > 0 else 0.0,
    "map3_cv_coefficient": float(map3_std / map3_mean) if map3_mean > 0 else 0.0,
    
    # Training performance
    "training_time_total": float(total_time),
    "training_time_per_fold_avg": float(cv_results_df['training_time'].mean()),
    
    # Hyperparameters used
    "hyperparameters": xgb_params,
    
    # Feature importance summary (lightweight)
    "top_features": training_results['feature_importance'].head(10).to_dict('records') if training_results['feature_importance'] is not None else None,
    
    # Metadata
    "timestamp": datetime.now().isoformat(),
    "kaggle_competition": "playground-series-s5e6",
    "ensemble_method": "Average of 10-fold CV models",
    "models_saved": False,  # Models used in-memory only for predictions
    "memory_optimized": True
}

# Save comprehensive metrics as JSON (lightweight, portable)
metrics_file = os.path.join(model_dir, files_to_create['metrics'])
with open(metrics_file, 'w') as f:
    json.dump(metrics_data, f, indent=2)

print(f"✅ Metrics saved (KAGGLE ESSENTIALS):")
print(f"  📄 JSON: {files_to_create['metrics']}")
print(f"  💾 Size: {os.path.getsize(metrics_file) / 1024:.1f} KB")
print(f"  🚀 Competition ready!")
print(f"  📊 Contains: Performance metrics, hyperparameters, fold results, feature importance")

# NOTE: Heavy files removed for competition efficiency:
#   ❌ OOF predictions PKL (~10-50 MB)
#   ❌ Model files PKL (~500MB-2GB) 
#   ❌ Feature importance PKL (~1-5 MB)
#   ❌ Label encoders PKL (~1-5 MB)
# 
# ✅ Ensemble predictions generated in-memory using all 10 models
# ✅ All essential information preserved in lightweight JSON format

# =============================================================================
# GENERATE TEST PREDICTIONS AND CREATE KAGGLE SUBMISSION
# =============================================================================

print(f"\n🔮 Generating test predictions using 10-model ensemble...")

# ✅ ENSEMBLE PREDICTION USING ALL 10 TRAINED MODELS (IN-MEMORY)
# Models are NOT saved to disk - used only for prediction generation
test_predictions_all = []
for fold_name, model in training_results['models'].items():
    # Select the same feature columns, in the same order, as used during training
    pred_proba = model.predict_proba(X_test_final[features_to_use])
    test_predictions_all.append(pred_proba)

print(f"  📊 Generated predictions from {len(test_predictions_all)} fold models")

# Average predictions across all folds (ensemble method)
test_predictions_ensemble = np.mean(test_predictions_all, axis=0)

# Get top 3 predictions for each sample (MAP@3 format)
test_top3_indices = np.argsort(test_predictions_ensemble, axis=1)[:, -3:][:, ::-1]

# Convert prediction indices to fertilizer names
# (one vectorized inverse_transform call instead of one call per prediction)
test_top3_names = label_encoders['target'].inverse_transform(
    test_top3_indices.ravel()
).reshape(test_top3_indices.shape)

# Create submission predictions (space-separated top 3 fertilizers)
submission_predictions = [' '.join(top3_names) for top3_names in test_top3_names]

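# Create final submission DataFrame
# NOTE: ids are generated here as 0..n-1; if the test file's ids start elsewhere,
# use its original id column instead so rows match Kaggle's sample submission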
submission = pd.DataFrame({
    'id': range(len(test_predictions_ensemble)),
    'Fertilizer Name': submission_predictions
})

print(f"  ✅ Ensemble predictions completed")
print(f"  📊 Test samples: {len(submission)}")
print(f"  🎯 Format: Top-3 fertilizers per sample (MAP@3)")

# Display sample predictions
print(f"\n📋 SAMPLE PREDICTIONS:")
for i in range(min(5, len(submission))):
    print(f"  Sample {i}: {submission.iloc[i, 1]}")

print(f"\n💡 ENSEMBLE INFO:")
print(f"  🔄 Method: Average probabilities across 10 CV models")
print(f"  🎯 Objective: Maximize MAP@3 score")
print(f"  💾 Models: Used in-memory only (not saved to reduce file size)")
print(f"  🚀 Ready for Kaggle submission!")

# =============================================================================
# SAVE ESSENTIAL KAGGLE COMPETITION FILES
# =============================================================================

# Save submission file
submission_file = os.path.join(model_dir, files_to_create['submission'])
submission.to_csv(submission_file, index=False)

# Enhanced submission information for competition tracking
submission_info = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB", 
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "ensemble_method": "Average of 10-fold CV models",
    
    # Performance metrics
    "map3_score_cv_mean": float(map3_mean),
    "map3_score_cv_std": float(map3_std),
    "map3_score_oof": float(training_results['overall_map3']),
    "accuracy_cv_mean": float(accuracy_mean),
    "accuracy_oof": float(training_results['overall_accuracy']),
    
    # Submission details
    "submission_file": files_to_create['submission'],
    "num_predictions": len(submission),
    "format": "MAP@3 - Top 3 fertilizer names separated by spaces",
    "target_variable": "Fertilizer Name",
    
    # Model details
    "ensemble_models": len(training_results['models']),
    "features_used": len(features_to_use),
    "feature_list": features_to_use,
    
    # Training info
    "total_training_time_minutes": float(total_time / 60),
    "hyperparameters": xgb_params,
    
    # Competition metadata
    "timestamp": datetime.now().isoformat(),
    "kaggle_competition": "playground-series-s5e6",
    "memory_optimized": True,
    "models_saved_to_disk": False
}

# Save submission information
submission_info_file = os.path.join(model_dir, files_to_create['submission_info'])
with open(submission_info_file, 'w') as f:
    json.dump(submission_info, f, indent=2)

print(f"\n✅ KAGGLE ESSENTIAL FILES SAVED:")
print(f"  📄 Submission: {files_to_create['submission']}")
print(f"  📄 Submission info: {files_to_create['submission_info']}")
print(f"  📊 Submission shape: {submission.shape}")

print(f"\n🎯 SAMPLE PREDICTIONS:")
for i in range(min(3, len(submission))):
    print(f"  Sample {i+1}: {submission.iloc[i, 1]}")

print(f"\n💡 SUBMISSION READY FOR KAGGLE!")
print(f"  🎯 Format: Top-3 fertilizer recommendations per sample")
print(f"  📊 Total predictions: {len(submission)}")
print(f"  🔄 Ensemble: 10-fold CV models averaged")
print(f"  💾 File size: Minimal (~MB vs GB for full model save)")


In [None]:
# =============================================================================
# FINAL SUMMARY - KAGGLE COMPETITION FILES
# =============================================================================

print(f"\n💾 KAGGLE COMPETITION FILES SUMMARY")
print("=" * 70)

print(f"📁 DIRECTORY: {model_dir}")
print(f"\n📄 ESSENTIAL FILES SAVED:")

# Calculate total size
total_size = 0
file_details = []

for file_type, filename in files_to_create.items():
    file_path = os.path.join(model_dir, filename)
    if os.path.exists(file_path):
        file_size = os.path.getsize(file_path)
        total_size += file_size
        
        if file_size > 1024*1024:  # > 1MB
            size_str = f"{file_size/(1024*1024):.1f} MB"
        elif file_size > 1024:  # > 1KB
            size_str = f"{file_size/1024:.1f} KB"
        else:
            size_str = f"{file_size} bytes"
        
        # File descriptions
        descriptions = {
            'hparams': "Model hyperparameters & configuration",
            'metrics': "Performance metrics & training details", 
            'submission': "Kaggle submission with top-3 predictions",
            'submission_info': "Submission metadata & model info"
        }
        
        desc = descriptions.get(file_type, "")
        file_details.append((filename, size_str, desc))
        print(f"  ✅ {filename:30} ({size_str:8}) - {desc}")
    else:
        print(f"  ❌ {filename:30} (NOT CREATED)")

# Total size calculation
if total_size > 1024*1024:
    total_size_str = f"{total_size/(1024*1024):.1f} MB"
elif total_size > 1024:
    total_size_str = f"{total_size/1024:.1f} KB"
else:
    total_size_str = f"{total_size} bytes"

print(f"\n📊 TOTAL SIZE: {total_size_str}")

print(f"\n🎯 PERFORMANCE SUMMARY:")
print(f"  📈 MAP@3 (CV):         {map3_mean:.5f} ± {map3_std:.5f}")
print(f"  📈 MAP@3 (OOF):        {training_results['overall_map3']:.5f}")
print(f"  📈 Accuracy (OOF):     {training_results['overall_accuracy']:.5f}")
print(f"  🔄 CV Folds:           {N_SPLITS}")
print(f"  ⏱️ Training Time:      {total_time/60:.1f} minutes")

print(f"\n🤖 MODEL CONFIGURATION:")
print(f"  🔧 Algorithm:          XGBoost Ensemble")
print(f"  📊 Features:           {len(features_to_use)}")
print(f"  🎯 Classes:            {len(label_encoders['target'].classes_)} fertilizers")
print(f"  💾 Memory Strategy:    In-memory ensemble (models not saved)")

print(f"\n🚀 KAGGLE SUBMISSION READY:")
print(f"  📄 File: {files_to_create['submission']}")
print(f"  📊 Predictions: {len(submission)} samples")
print(f"  🎯 Format: MAP@3 (top-3 fertilizers per sample)")
print(f"  💡 Upload this file to Kaggle for competition scoring")

print(f"\n✨ OPTIMIZATION BENEFITS:")
print(f"  💾 File Size: ~{total_size_str} (vs ~500MB-2GB with models)")
print(f"  ⚡ Disk: Models are not serialized, so only small JSON/CSV files are written")
print(f"  🎯 Focus: Only competition-essential files")
print(f"  📁 Clean: No heavy PKL files or model artifacts")

print(f"\n📂 Location: {os.path.abspath(model_dir)}")
print(f"\n🎉 KAGGLE-OPTIMIZED XGBOOST PIPELINE COMPLETED!")
print(f"🚀 Ready for competition submission! 🏆")
