# 🚀 XGBoost with 10-Fold CV - Fertilizer Prediction

> **Objective**: Build a robust XGBoost model with 10-fold cross-validation to predict optimal fertilizers for agricultural conditions.
> 
> **Target Variable**: `Fertilizer Name` (multi-class classification)
> 
> **Strategy**: Complete pipeline from data loading to ensemble prediction with advanced feature engineering
> 
> **Main Metric**: MAP@3 (Mean Average Precision at 3) - Kaggle competition requirement
> 
> **✨ KEY FEATURES**:
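> - 📊 **Complete pipeline**: Data loading, feature engineering, encoding (EDA lives in a separate notebook)
> - 🔬 **Advanced feature engineering**: NPK ratios, environmental indices, categorical binning (computed for every row, but commented out of the feature list in Section 5 by default)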
> - 🔍 **Stratified 10-Fold CV**: Maintains class distribution across folds
> - ⚖️ **Class weight balancing**: Handles imbalanced fertilizer classes
> - 🎯 **Ensemble prediction**: Combines 10 models for robust results
> - ✂️ **Early stopping**: Prevents overfitting with validation monitoring

---

## 🎯 Competition Overview

**Kaggle Playground Series S5E6: Fertilizer Prediction Challenge**

- **Problem**: Select the best fertilizer for different weather, soil conditions, and crops
- **Type**: Multi-class classification
- **Evaluation**: MAP@3 (Mean Average Precision @ 3)

**Dataset Features**:
- 🌡️ Environmental: Temperature, Humidity, Soil Moisture
- 🧪 Chemical: Nitrogen, Phosphorus, Potassium levels
- 🌱 Agricultural: Soil Type, Crop Type
- 🎯 Target: Fertilizer Name (7 fertilizer classes)

## 📚 1. Import Libraries

**Essential libraries for data manipulation, machine learning, and visualization**

In [1]:
# Core data manipulation and analysis
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# Machine Learning - Scikit-learn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

# XGBoost
from xgboost import XGBClassifier
from xgboost.callback import EarlyStopping

# Model persistence and metadata
import joblib
import json
from datetime import datetime

# Utilities
import time
from collections import Counter

# Configuration
np.random.seed(513)


## 📏 MAP@k Function

In [2]:
def mapk(actual, predicted, k=3):
    """Compute mean average precision at k (MAP@k)."""
    def apk(a, p, k):
        score = 0.0
        for i in range(min(k, len(p))):
            if p[i] == a:
                score += 1.0 / (i + 1)
                break  # only the first correct prediction counts
        return score
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])
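
A small worked example can make the metric concrete. The sketch below re-declares an equivalent `mapk` so it runs standalone; the labels and predictions are made up for illustration. Each sample has exactly one true class, so its score is 1, 1/2, or 1/3 depending on the rank of the correct class, and 0 if it is missing from the top 3.

```python
import numpy as np

def mapk(actual, predicted, k=3):
    """MAP@k for single-label targets (equivalent to the notebook's function)."""
    def apk(a, p, k):
        for i in range(min(k, len(p))):
            if p[i] == a:
                return 1.0 / (i + 1)  # reciprocal of the 1-based rank
        return 0.0
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])

# Illustrative data: true class per sample and ranked top-3 guesses
actual = [2, 0, 1, 4]
predicted = [
    [2, 1, 0],  # correct at rank 1 -> 1.0
    [1, 0, 2],  # correct at rank 2 -> 0.5
    [0, 2, 3],  # not in top 3     -> 0.0
    [3, 2, 4],  # correct at rank 3 -> 1/3
]

score = mapk(actual, predicted, k=3)
print(f"MAP@3 = {score:.4f}")
```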

## 📂 2. Data Loading

**Load datasets and prepare for modeling (EDA already completed in separate notebook)**

In [3]:
# Define file paths
data_path = '../data'
train_path = os.path.join(data_path, 'train.csv')
test_path = os.path.join(data_path, 'test.csv')
sample_submission_path = os.path.join(data_path, 'sample_submission.csv')

# Load datasets
print("📂 Loading datasets...")
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
sample_submission = pd.read_csv(sample_submission_path)

print("✅ Data loaded successfully:")
print(f"  • Training set: {train_df.shape}")
print(f"  • Test set: {test_df.shape}")
print(f"  • Sample submission: {sample_submission.shape}")



📂 Loading datasets...
✅ Data loaded successfully:
  • Training set: (750000, 10)
  • Test set: (250000, 9)
  • Sample submission: (250000, 2)


In [4]:
# Separate features and target variable
target_column = 'Fertilizer Name'

# Split training data
X_raw = train_df.drop(columns=[target_column])
y_raw = train_df[target_column]
X_test_raw = test_df.copy()

print("✅ Data separation completed:")
print(f"  • Training features: {X_raw.shape}")
print(f"  • Training target: {y_raw.shape}")
print(f"  • Test features: {X_test_raw.shape}")
print(f"  • Target classes: {y_raw.nunique()}")

✅ Data separation completed:
  • Training features: (750000, 9)
  • Training target: (750000,)
  • Test features: (250000, 9)
  • Target classes: 7


## 🔬 3. Feature Engineering

**Apply advanced feature engineering based on agricultural domain knowledge**

In [5]:
def create_features(df):
    """
    Create engineered features based on agricultural domain knowledge
    
    Args:
        df: DataFrame with agricultural features
        
    Returns:
        DataFrame with additional engineered features
    """
    df_eng = df.copy()
    
    # NPK Ratios (crucial for agricultural decisions)
    df_eng['N_P_ratio'] = df_eng['Nitrogen'] / (df_eng['Phosphorous'] + 0.001)
    df_eng['N_K_ratio'] = df_eng['Nitrogen'] / (df_eng['Potassium'] + 0.001)
    df_eng['P_K_ratio'] = df_eng['Phosphorous'] / (df_eng['Potassium'] + 0.001)
    
    # Total NPK and NPK Balance
    df_eng['Total_NPK'] = df_eng['Nitrogen'] + df_eng['Phosphorous'] + df_eng['Potassium']
    npk_mean = df_eng[['Nitrogen', 'Phosphorous', 'Potassium']].mean(axis=1)
    df_eng['NPK_Balance'] = df_eng[['Nitrogen', 'Phosphorous', 'Potassium']].std(axis=1) / (npk_mean + 0.001)
    
    # Environmental indices
    df_eng['Temp_Hum_index'] = df_eng['Temparature'] * df_eng['Humidity'] / 100
    df_eng['Moist_Balance'] = df_eng['Moisture'] - df_eng['Humidity']
    df_eng['Environ_Stress'] = np.sqrt((df_eng['Temparature'] - 25)**2 + (df_eng['Humidity'] - 65)**2)
    df_eng['Temp_Moist_inter'] = df_eng['Temparature'] * df_eng['Moisture'] / 100
    
    # Dominant nutrient
    npk_cols = ['Nitrogen', 'Phosphorous', 'Potassium']
    df_eng['Dominant_NPK'] = df_eng[npk_cols].idxmax(axis=1)
    
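    # Categorical binning (equal-width bins; pd.cut derives the edges from the
    # min/max of the DataFrame it receives, so train and test edges can differ)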
    df_eng['Temp_Cat'] = pd.cut(df_eng['Temparature'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['Hum_Cat'] = pd.cut(df_eng['Humidity'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['N_Level'] = pd.cut(df_eng['Nitrogen'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['K_Level'] = pd.cut(df_eng['Potassium'], bins=3, labels=['Low', 'Medium', 'High'])
    df_eng['P_Level'] = pd.cut(df_eng['Phosphorous'], bins=3, labels=['Low', 'Medium', 'High'])
    
    # Soil-Crop interaction
    df_eng['Soil_Crop_Combo'] = df_eng['Soil Type'].astype(str) + '_' + df_eng['Crop Type'].astype(str)
    
    return df_eng

# Apply feature engineering
print("🔧 Applying feature engineering...")
X_train_featured = create_features(X_raw)
X_test_featured = create_features(X_test_raw)

print(f"✅ Feature engineering completed:")
print(f"  • Original features: {X_raw.shape[1]}")
print(f"  • After feature engineering: {X_train_featured.shape[1]}")
print(f"  • New features added: {X_train_featured.shape[1] - X_raw.shape[1]}")

# Display new feature names
original_features = set(X_raw.columns)
new_features = [col for col in X_train_featured.columns if col not in original_features]
print(f"\n🆕 New engineered features ({len(new_features)}):")
for i, feature in enumerate(new_features, 1):
    print(f"  {i:2d}. {feature}")

🔧 Applying feature engineering...
✅ Feature engineering completed:
  • Original features: 9
  • After feature engineering: 25
  • New features added: 16

🆕 New engineered features (16):
   1. N_P_ratio
   2. N_K_ratio
   3. P_K_ratio
   4. Total_NPK
   5. NPK_Balance
   6. Temp_Hum_index
   7. Moist_Balance
   8. Environ_Stress
   9. Temp_Moist_inter
  10. Dominant_NPK
  11. Temp_Cat
  12. Hum_Cat
  13. N_Level
  14. K_Level
  15. P_Level
  16. Soil_Crop_Combo
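
Because `create_features` is called separately on the training and test sets, `pd.cut(..., bins=3)` computes its bin edges from each set's own range. One possible way to keep the bins consistent, sketched here under the assumption that the edges should come from the training data only, is to capture the edges with `retbins=True` and reuse them. The helper names `fit_bin_edges` and `apply_bins` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_bin_edges(series, n_bins=3):
    """Learn equal-width bin edges from one Series (hypothetical helper)."""
    _, edges = pd.cut(series, bins=n_bins, retbins=True)
    edges = edges.astype(float)
    # Open the outer edges so unseen values outside the training range still get a bin
    edges[0], edges[-1] = -np.inf, np.inf
    return edges

def apply_bins(series, edges, labels=("Low", "Medium", "High")):
    """Apply previously learned edges to any Series (hypothetical helper)."""
    return pd.cut(series, bins=edges, labels=list(labels))

# Illustrative values, not taken from the competition data
train_temp = pd.Series([25, 28, 31, 34, 38])
test_temp = pd.Series([24, 30, 39])

edges = fit_bin_edges(train_temp)
test_bins = apply_bins(test_temp, edges).tolist()
print(test_bins)
```

With this approach a given temperature falls into the same category regardless of which dataset it appears in.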


## 🔢 4. Label Encoding

**Encode categorical variables for machine learning compatibility**

In [6]:
def encode_categorical_features(X_train, X_test, y_train):
    """
    Encode categorical features using LabelEncoder
    
    Args:
        X_train: Training features
        X_test: Test features  
        y_train: Training target
        
    Returns:
        Tuple of (X_train_encoded, X_test_encoded, y_encoded, encoders_dict)
    """
    
    # Initialize encoders dictionary
    encoders = {}
    
    # Create copies to avoid modifying originals
    X_train_enc = X_train.copy()
    X_test_enc = X_test.copy()
    
    # Identify categorical columns
    categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
    
    print(f"🔢 Encoding categorical features...")
    print(f"Categorical columns found: {categorical_cols}")
    
    # Encode categorical features
    for col in categorical_cols:
        print(f"  • Encoding: {col}")
        
        # Create encoder
        encoder = LabelEncoder()
        
        # Fit on combined training and test data to ensure consistency
        combined_values = pd.concat([X_train[col], X_test[col]]).astype(str)
        encoder.fit(combined_values)
        
        # Transform both datasets
        X_train_enc[col] = encoder.transform(X_train[col].astype(str))
        X_test_enc[col] = encoder.transform(X_test[col].astype(str))
        
        # Store encoder
        encoders[col] = encoder
        
        print(f"    - Classes: {len(encoder.classes_)} | {list(encoder.classes_[:5])}{'...' if len(encoder.classes_) > 5 else ''}")
    
    # Encode target variable
    print(f"\n🎯 Encoding target variable: {target_column}")
    target_encoder = LabelEncoder()
    y_encoded = target_encoder.fit_transform(y_train)
    encoders['target'] = target_encoder
    
    print(f"  • Target classes: {len(target_encoder.classes_)}")
    print(f"  • Class mapping preview: {dict(zip(target_encoder.classes_[:5], range(5)))}")
    
    return X_train_enc, X_test_enc, y_encoded, encoders

# Apply encoding
X_train_encoded, X_test_encoded, y_encoded, label_encoders = encode_categorical_features(
    X_train_featured, X_test_featured, y_raw
)

print(f"\n✅ Encoding completed:")
print(f"  • Training features: {X_train_encoded.shape}")
print(f"  • Test features: {X_test_encoded.shape}")
print(f"  • Encoded target: {y_encoded.shape}")
print(f"  • Encoders stored: {len(label_encoders)}")

🔢 Encoding categorical features...
Categorical columns found: ['Soil Type', 'Crop Type', 'Dominant_NPK', 'Temp_Cat', 'Hum_Cat', 'N_Level', 'K_Level', 'P_Level', 'Soil_Crop_Combo']
  • Encoding: Soil Type
    - Classes: 5 | ['Black', 'Clayey', 'Loamy', 'Red', 'Sandy']
  • Encoding: Crop Type
    - Classes: 11 | ['Barley', 'Cotton', 'Ground Nuts', 'Maize', 'Millets']...
  • Encoding: Dominant_NPK
    - Classes: 3 | ['Nitrogen', 'Phosphorous', 'Potassium']
  • Encoding: Temp_Cat
    - Classes: 3 | ['High', 'Low', 'Medium']
  • Encoding: Hum_Cat
    - Classes: 3 | ['High', 'Low', 'Medium']
  • Encoding: N_Level
    - Classes: 3 | ['High', 'Low', 'Medium']
  • Encoding: K_Level
    - Classes: 3 | ['High', 'Low', 'Medium']
  • Encoding: P_Level
    - Classes: 3 | ['High', 'Low', 'Medium']
  • Encoding: Soil_Crop_Combo
    - Classes: 55 | ['Black_Barley', 'Black_Cotton', 'Black_Ground Nuts', 'Black_Maize', 'Black_Millets']...

🎯 Encoding target variable: Fertilizer Name
  • Target classes: 7
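
The output above shows that `LabelEncoder` sorts classes alphabetically, so the binned levels are encoded as `High=0, Low=1, Medium=2` and lose their natural order. If the ordinal meaning matters, one alternative (an assumption on my part, not what the notebook does) is to use the category codes of an ordered categorical, which is what `pd.cut` returns when given labels. The sample values below are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# An ordered categorical like the one produced by pd.cut(..., labels=[...])
levels = pd.Series(
    pd.Categorical(
        ["Low", "High", "Medium", "Low"],
        categories=["Low", "Medium", "High"],
        ordered=True,
    )
)

# Alphabetical encoding: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(levels.astype(str)).tolist()

# Order-preserving encoding: Low=0, Medium=1, High=2
ordinal = levels.cat.codes.tolist()

print("LabelEncoder:", alphabetical)
print("cat.codes:   ", ordinal)
```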


## 🎯 5. Feature Selection

**Select features for model training with easy toggle options**

In [7]:
# =============================================================================
# FEATURE SELECTION FOR THE MODEL
# =============================================================================

features_to_use = [
    # 🌡️ ORIGINAL CLIMATE VARIABLES
    'Temparature',
    'Humidity', 
    'Moisture',
    
    # 🧪 CHEMICAL VARIABLES (NPK)
    'Nitrogen',
    'Potassium', 
    'Phosphorous',
    
    # 📊 ENGINEERED FEATURES - NPK RATIOS (from create_features)
    # 'N_P_ratio',
    # 'N_K_ratio',
    # 'P_K_ratio',
    # 'Total_NPK',
    # 'NPK_Balance',
    
    # 🌡️ ENGINEERED FEATURES - CLIMATE INDICES (from create_features)
    # 'Temp_Hum_index',
    # 'Moist_Balance',
    # 'Environ_Stress',
    # 'Temp_Moist_inter',
    
    # 🏷️ ENGINEERED FEATURES - CATEGORICAL LEVELS (from create_features, encoded)
    # 'Temp_Cat',
    # 'Hum_Cat',
    # 'N_Level',
    # 'K_Level',
    # 'P_Level',

    # 🔗 ENGINEERED FEATURES - COMBINATIONS (from create_features)
    # 'Soil_Crop_Combo',
    # 'Dominant_NPK',
    
    # 🔢 ENCODED CATEGORICAL FEATURES (from preprocessing)
    'Soil Type',      # ✅ Encoded during preprocessing
    'Crop Type',      # ✅ Encoded during preprocessing
]

# Validate available features against the actual processed dataset
print(f"🔍 Validating features against processed dataset...")
print(f"📊 Available columns in dataset: {list(X_train_encoded.columns)}")

available_features = []
missing_features = []

for feature in features_to_use:
    if feature in X_train_encoded.columns:
        available_features.append(feature)
    else:
        missing_features.append(feature)

# Update final feature list to only include available features
features_to_use = available_features

if missing_features:
    print(f"\n⚠️ Missing features (will be skipped): {missing_features}")

# Display selected features by category
print(f"\n📋 SELECTED FEATURES ({len(features_to_use)} total):")

# Group features by category for better readability
climate_original = [f for f in features_to_use if f in ['Temparature', 'Humidity', 'Moisture']]
npk_original = [f for f in features_to_use if f in ['Nitrogen', 'Potassium', 'Phosphorous']]
npk_ratios = [f for f in features_to_use if any(x in f for x in ['_ratio', 'Total_NPK', 'NPK_Balance'])]
climate_engineered = [f for f in features_to_use if any(x in f for x in ['Temp_Hum', 'Moist_Balance', 'Environ_Stress', 'Temp_Moist'])]
categorical_levels = [f for f in features_to_use if any(x in f for x in ['_Cat', '_Level'])]
combinations = [f for f in features_to_use if any(x in f for x in ['Combo', 'Dominant'])]
encoded_original = [f for f in features_to_use if f in ['Soil Type', 'Crop Type']]

feature_groups = [
    ("🌡️ Original Climate", climate_original),
    ("🧪 Original NPK", npk_original),
    ("📊 NPK Ratios", npk_ratios),
    ("🌡️ Climate Indices", climate_engineered),
    ("🏷️ Categorical Levels", categorical_levels),
    ("🔗 Combinations", combinations),
    ("🔢 Encoded Categories", encoded_original)
]

for group_name, group_features in feature_groups:
    if group_features:
        print(f"\n{group_name} ({len(group_features)}):")
        for i, feature in enumerate(group_features, 1):
            print(f"  {i:2d}. {feature}")

print(f"\n🚀 Ready for model training with {len(features_to_use)} features!")

# Create final training datasets
X_final = X_train_encoded[features_to_use].copy()
X_test_final = X_test_encoded[features_to_use].copy()

print(f"\n✅ Final dataset shapes:")
print(f"  • Training: {X_final.shape}")
print(f"  • Test: {X_test_final.shape}")
print(f"  • Target: {y_encoded.shape}")

🔍 Validating features against processed dataset...
📊 Available columns in dataset: ['id', 'Temparature', 'Humidity', 'Moisture', 'Soil Type', 'Crop Type', 'Nitrogen', 'Potassium', 'Phosphorous', 'N_P_ratio', 'N_K_ratio', 'P_K_ratio', 'Total_NPK', 'NPK_Balance', 'Temp_Hum_index', 'Moist_Balance', 'Environ_Stress', 'Temp_Moist_inter', 'Dominant_NPK', 'Temp_Cat', 'Hum_Cat', 'N_Level', 'K_Level', 'P_Level', 'Soil_Crop_Combo']

📋 SELECTED FEATURES (8 total):

🌡️ Original Climate (3):
   1. Temparature
   2. Humidity
   3. Moisture

🧪 Original NPK (3):
   1. Nitrogen
   2. Potassium
   3. Phosphorous

🔢 Encoded Categories (2):
   1. Soil Type
   2. Crop Type

🚀 Ready for model training with 8 features!

✅ Final dataset shapes:
  • Training: (750000, 8)
  • Test: (250000, 8)
  • Target: (750000,)


## 🔄 6. Cross-Validation Setup

**Configure stratified 10-fold cross-validation to maintain class distribution across folds**

In [8]:
# =============================================================================
# STRATIFIED 10-FOLD CROSS-VALIDATION CONFIGURATION
# =============================================================================

# Cross-validation parameters
N_SPLITS = 10  # 10-fold cross-validation for robust evaluation
RANDOM_STATE = 513
SHUFFLE = True

# Initialize StratifiedKFold to maintain class distribution
skf = StratifiedKFold(
    n_splits=N_SPLITS, 
    shuffle=SHUFFLE, 
    random_state=RANDOM_STATE
)

print(f"🔄 CROSS-VALIDATION CONFIGURATION:")
print(f"  • Number of folds: {N_SPLITS}")
print(f"  • Strategy: Stratified (maintains class proportions)")
print(f"  • Shuffle: {SHUFFLE}")
print(f"  • Random state: {RANDOM_STATE}")

# Analyze class distribution for stratification
print(f"\n📊 Class distribution analysis:")
unique_classes, class_counts = np.unique(y_encoded, return_counts=True)
print(f"  • Total classes: {len(unique_classes)}")
print(f"  • Total samples: {len(y_encoded)}")
print(f"  • Samples per fold: ~{len(y_encoded) // N_SPLITS}")

# Check minimum class size for stratification
min_class_count = min(class_counts)
print(f"  • Minimum class size: {min_class_count}")
if min_class_count < N_SPLITS:
    print(f"  ⚠️ Warning: Smallest class has {min_class_count} samples, less than {N_SPLITS} folds")
    print(f"    Some folds may not contain all classes")
else:
    print(f"  ✅ All classes have sufficient samples for {N_SPLITS}-fold CV")

# Preview fold splits
print(f"\n🔍 Fold size preview:")
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_final, y_encoded)):
    if fold_idx < 3:  # Show first 3 folds
        print(f"  Fold {fold_idx + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
    elif fold_idx == 3:
        print("  ...")
    elif fold_idx == N_SPLITS - 1:  # Show last fold
        print(f"  Fold {fold_idx + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
        break

🔄 CROSS-VALIDATION CONFIGURATION:
  • Number of folds: 10
  • Strategy: Stratified (maintains class proportions)
  • Shuffle: True
  • Random state: 513

📊 Class distribution analysis:
  • Total classes: 7
  • Total samples: 750000
  • Samples per fold: ~75000
  • Minimum class size: 92317
  ✅ All classes have sufficient samples for 10-fold CV

🔍 Fold size preview:
  Fold 1: Train=675000, Val=75000
  Fold 2: Train=675000, Val=75000
  Fold 3: Train=675000, Val=75000
  ...
  Fold 10: Train=675000, Val=75000
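
To see what "maintains class proportions" means in practice, the following sketch runs `StratifiedKFold` on synthetic labels (made up for illustration, not the competition data) and measures how far each validation fold's class mix drifts from the overall mix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, mildly imbalanced labels for three classes
rng = np.random.default_rng(513)
y = rng.choice(3, size=3000, p=[0.5, 0.3, 0.2])
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=513)
overall = np.bincount(y, minlength=3) / len(y)

max_dev = 0.0
for _, val_idx in skf.split(X, y):
    fold_mix = np.bincount(y[val_idx], minlength=3) / len(val_idx)
    max_dev = max(max_dev, float(np.abs(fold_mix - overall).max()))

print(f"Largest per-class deviation across folds: {max_dev:.4f}")
```

The deviation stays close to zero because each class is distributed across folds as evenly as its count allows.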


## ⚙️ 7. XGBoost Configuration

**Set up XGBoost hyperparameters optimized for the fertilizer prediction task**

In [9]:
# =============================================================================
# XGBOOST HYPERPARAMETER CONFIGURATION
# =============================================================================

# Calculate class weights for imbalanced dataset
class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_encoded),
    y=y_encoded
)
class_weight_dict = dict(zip(np.unique(y_encoded), class_weights))

print("⚖️ Class weight calculation:")
print(f"  • Balanced class weights computed for {len(class_weight_dict)} classes")
print(f"  • Weight range: {min(class_weights):.3f} - {max(class_weights):.3f}")

# XGBoost hyperparameters (optimized for multi-class classification)
xgb_params = {
    # Multi-class objective
    'objective': 'multi:softprob',
    'num_class': len(label_encoders['target'].classes_),
    'eval_metric': 'mlogloss',
    
    # Tree structure
    'max_depth': 8,
    'min_child_weight': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    
    # Learning parameters
    'learning_rate': 0.1,
    'n_estimators': 2000,  # High number with early stopping
    
    # Regularization
    'reg_alpha': 0.1,  # L1 regularization
    'reg_lambda': 1.0,  # L2 regularization
    'gamma': 0.1,      # Minimum split loss
    
    # Performance
    'random_state': RANDOM_STATE,
    'n_jobs': -1,
    'verbosity': 0,
    
    # Early stopping: passed to the XGBClassifier constructor via **params;
    # fit() then only needs an eval_set to monitor
    'early_stopping_rounds': 100
}

# Early stopping configuration
es = 100
eval_metric = 'mlogloss'

print(f"\n🚀 XGBOOST CONFIGURATION:")
print(f"  • Objective: {xgb_params['objective']}")
print(f"  • Number of classes: {xgb_params['num_class']}")
print(f"  • Max depth: {xgb_params['max_depth']}")
print(f"  • Learning rate: {xgb_params['learning_rate']}")
print(f"  • Max estimators: {xgb_params['n_estimators']}")
print(f"  • Early stopping: {xgb_params['early_stopping_rounds']} rounds")
print(f"  • Evaluation metric: {xgb_params['eval_metric']}")
print(f"  • Regularization: L1={xgb_params['reg_alpha']}, L2={xgb_params['reg_lambda']}")
print(f"  • Class balancing: Enabled")

⚖️ Class weight calculation:
  • Balanced class weights computed for 7 classes
  • Weight range: 0.936 - 1.161

🚀 XGBOOST CONFIGURATION:
  • Objective: multi:softprob
  • Number of classes: 7
  • Max depth: 8
  • Learning rate: 0.1
  • Max estimators: 2000
  • Early stopping: 100 rounds
  • Evaluation metric: mlogloss
  • Regularization: L1=0.1, L2=1.0
  • Class balancing: Enabled
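
For reference, the `'balanced'` mode of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below checks this on a toy label vector (illustrative data) and also shows a vectorized way to expand class weights into per-sample weights, which is valid only when labels are the integers `0..n_classes-1`, as `LabelEncoder` produces.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class counts are 4, 2, 4
y = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
classes = np.unique(y)

weights = compute_class_weight("balanced", classes=classes, y=y)

# Same result computed by hand: n_samples / (n_classes * count)
manual = len(y) / (len(classes) * np.bincount(y))

# Per-sample weights via integer indexing (labels must be 0..n_classes-1)
sample_weights = weights[y]

print("class weights: ", np.round(weights, 4))
print("sum of weights:", sample_weights.sum())
```

The weight range of 0.936 to 1.161 reported above therefore indicates that the seven classes are only mildly imbalanced.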


## 🏋️ 8. Model Training with 10-Fold Cross-Validation

**Train XGBoost models using stratified 10-fold cross-validation with early stopping**

In [10]:
# =============================================================================
# 10-FOLD CROSS-VALIDATION TRAINING
# =============================================================================

def train_xgboost_cv(X, y, features, cv_splitter, params):
    """
    Train XGBoost models using cross-validation
    
    Args:
        X: Feature matrix
        y: Target vector (encoded)
        features: List of feature names to use
        cv_splitter: Cross-validation splitter (StratifiedKFold)
        params: XGBoost parameters (including 'num_class' and 'early_stopping_rounds')
        
    Returns:
        Dict with trained models, predictions, and metrics
    """
    
    # Initialize storage
    models = {}
    oof_predictions = np.zeros((len(X), params['num_class']))  # Out-of-fold predictions
    cv_scores = []
    feature_importance_list = []
    
    print(f"🏋️ Starting {N_SPLITS}-Fold Cross-Validation Training...")
    print(f"⏰ Training started at: {time.strftime('%H:%M:%S')}")
    
    # Cross-validation loop
    for fold_idx, (train_idx, val_idx) in enumerate(cv_splitter.split(X, y)):
        
        fold_start_time = time.time()
        print(f"\n📁 FOLD {fold_idx + 1}/{N_SPLITS}")
        print(f"  • Train samples: {len(train_idx)}")
        print(f"  • Validation samples: {len(val_idx)}")
        
        # Split data
        X_train_fold = X.iloc[train_idx][features]
        X_val_fold = X.iloc[val_idx][features]
        y_train_fold = y[train_idx]
        y_val_fold = y[val_idx]
        
        # Calculate sample weights for this fold
        fold_class_weights = compute_class_weight(
            'balanced',
            classes=np.unique(y_train_fold),
            y=y_train_fold
        )
        fold_class_weight_dict = dict(zip(np.unique(y_train_fold), fold_class_weights))
        sample_weights = np.array([fold_class_weight_dict.get(label, 1.0) for label in y_train_fold])
        
        # Initialize model
        model = XGBClassifier(**params)
        
        # Train with early stopping
        model.fit(
            X_train_fold, y_train_fold,
            sample_weight=sample_weights,
            eval_set=[(X_val_fold, y_val_fold)],
            verbose=False
        )
        
        # Predict validation set
        val_pred_proba = model.predict_proba(X_val_fold)
        val_pred_classes = model.predict(X_val_fold)
        
        # Store out-of-fold predictions
        oof_predictions[val_idx] = val_pred_proba
        
        # Calculate fold metrics
        fold_accuracy = accuracy_score(y_val_fold, val_pred_classes)
        
        # Calculate MAP@3 for this fold
        # Get top 3 predictions for each sample
        val_top3_indices = np.argsort(val_pred_proba, axis=1)[:, -3:][:, ::-1]
        
        # Convert to lists for mapk function
        actual_list = y_val_fold.tolist() if hasattr(y_val_fold, 'tolist') else list(y_val_fold)
        predicted_list = val_top3_indices.tolist()
        
        # Calculate MAP@3 using the correct format
        fold_map3 = mapk(actual_list, predicted_list, k=3)
        
        # Store results
        cv_scores.append({
            'fold': fold_idx + 1,
            'accuracy': fold_accuracy,
            'map3': fold_map3,
            'best_iteration': model.best_iteration,
            'train_samples': len(train_idx),
            'val_samples': len(val_idx),
            'training_time': time.time() - fold_start_time
        })
        
        # Store model and feature importance
        models[f'fold_{fold_idx + 1}'] = model
        
        if hasattr(model, 'feature_importances_'):
            importance_df = pd.DataFrame({
                'feature': features,
                'importance': model.feature_importances_,
                'fold': fold_idx + 1
            })
            feature_importance_list.append(importance_df)
        
        fold_time = time.time() - fold_start_time
        print(f"  ✅ Fold completed in {fold_time:.1f}s")
        print(f"  📊 Accuracy: {fold_accuracy:.4f} | MAP@3: {fold_map3:.4f}")
        print(f"  🔄 Best iteration: {model.best_iteration}")
    
    # Calculate overall metrics
    oof_pred_classes = np.argmax(oof_predictions, axis=1)
    overall_accuracy = accuracy_score(y, oof_pred_classes)
    
    # Calculate overall MAP@3
    # Get top 3 predictions for each sample
    oof_top3_indices = np.argsort(oof_predictions, axis=1)[:, -3:][:, ::-1]
    
    # Convert to lists for mapk function
    actual_list = y.tolist() if hasattr(y, 'tolist') else list(y)
    predicted_list = oof_top3_indices.tolist()
    
    # Calculate MAP@3 using the correct format
    overall_map3 = mapk(actual_list, predicted_list, k=3)
    
    # Combine feature importance across folds
    if feature_importance_list:
        feature_importance_df = pd.concat(feature_importance_list, ignore_index=True)
        feature_importance_summary = feature_importance_df.groupby('feature')['importance'].agg(['mean', 'std']).reset_index()
        feature_importance_summary = feature_importance_summary.sort_values('mean', ascending=False)
    else:
        feature_importance_summary = None
    
    return {
        'models': models,
        'oof_predictions': oof_predictions,
        'cv_scores': cv_scores,
        'overall_accuracy': overall_accuracy,
        'overall_map3': overall_map3,
        'feature_importance': feature_importance_summary
    }

# Execute cross-validation training
print("🚀 Starting model training...")
start_time = time.time()

training_results = train_xgboost_cv(
    X=X_final,
    y=y_encoded,
    features=features_to_use,
    cv_splitter=skf,
    params=xgb_params,
)

total_time = time.time() - start_time

print(f"\n🎉 CROSS-VALIDATION TRAINING COMPLETED!")
print(f"⏰ Training finished at: {time.strftime('%H:%M:%S')}")
print(f"⏱️ Total training time: {total_time:.1f}s ({total_time/60:.1f}min)")
print(f"📊 Overall Accuracy: {training_results['overall_accuracy']:.4f}")
print(f"📊 Overall MAP@3: {training_results['overall_map3']:.4f}")

🚀 Starting model training...
🏋️ Starting 10-Fold Cross-Validation Training...
⏰ Training started at: 22:31:36

📁 FOLD 1/10
  • Train samples: 675000
  • Validation samples: 75000
  ✅ Fold completed in 526.4s
  📊 Accuracy: 0.2033 | MAP@3: 0.3380
  🔄 Best iteration: 374

📁 FOLD 2/10
  • Train samples: 675000
  • Validation samples: 75000
  ✅ Fold completed in 473.7s
  📊 Accuracy: 0.2014 | MAP@3: 0.3365
  🔄 Best iteration: 329

📁 FOLD 3/10
  • Train samples: 675000
  • Validation samples: 75000
  ✅ Fold completed in 504.8s
  📊 Accuracy: 0.1994 | MAP@3: 0.3349
  🔄 Best iteration: 356

📁 FOLD 4/10
  • Train samples: 675000
  • Validation samples: 75000
  ✅ Fold completed in 470.5s
  📊 Accuracy: 0.1975 | MAP@3: 0.3326
  🔄 Best iteration: 321

📁 FOLD 5/10
  • Train samples: 675000
  • Validation samples: 75000
  ✅ Fold completed in 468.6s
  📊 Accuracy: 0.1983 | MAP@3: 0.3331
  🔄 Best iteration: 322

📁 FOLD 6/10
  • Train samples: 675000
  • Validation samples: 75000
  ✅ Fold completed in 423.
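
The training loop ranks classes with `np.argsort(proba, axis=1)[:, -3:][:, ::-1]`. A short sketch on a made-up probability matrix shows what each slicing step does: `argsort` orders class indices by ascending probability, `[:, -3:]` keeps the three most probable, and `[:, ::-1]` flips them so the best guess comes first.

```python
import numpy as np

# Illustrative probabilities for 2 samples and 4 classes (rows sum to 1)
proba = np.array([
    [0.10, 0.60, 0.30, 0.00],
    [0.40, 0.05, 0.25, 0.30],
])

ascending = np.argsort(proba, axis=1)   # least to most probable class index
top3_ascending = ascending[:, -3:]      # three most probable, still ascending
top3 = top3_ascending[:, ::-1]          # most probable first

print(top3.tolist())
```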

## 📊 9. Model Evaluation

**Complete performance analysis and cross-validation metrics**

In [11]:
# =============================================================================
# CROSS-VALIDATION RESULTS EVALUATION
# =============================================================================

print("📊 CROSS-VALIDATION RESULTS")
print("=" * 60)

# Extract results from training
cv_results_df = pd.DataFrame(training_results['cv_scores'])

# Calculate statistics
accuracy_mean = cv_results_df['accuracy'].mean()
accuracy_std = cv_results_df['accuracy'].std()
map3_mean = cv_results_df['map3'].mean()
map3_std = cv_results_df['map3'].std()

print(f"🎯 FINAL METRICS:")
print(f"  📈 Cross-Validation Accuracy: {accuracy_mean:.4f} ± {accuracy_std:.4f}")
print(f"  📈 Cross-Validation MAP@3:    {map3_mean:.4f} ± {map3_std:.4f}")
print(f"  📈 Out-of-Fold Accuracy:      {training_results['overall_accuracy']:.4f}")
print(f"  📈 Out-of-Fold MAP@3:         {training_results['overall_map3']:.4f}")

# Stability evaluation
accuracy_cv = accuracy_std / accuracy_mean if accuracy_mean > 0 else 0
map3_cv = map3_std / map3_mean if map3_mean > 0 else 0

print(f"\n🔍 STABILITY ANALYSIS:")
print(f"  📊 Coefficient of variation (Accuracy): {accuracy_cv:.3f}")
print(f"  📊 Coefficient of variation (MAP@3):    {map3_cv:.3f}")
print(f"  {'✅ Stable model' if accuracy_cv < 0.05 else '⚠️ Variable model'} (Accuracy CV < 0.05)")
print(f"  {'✅ Stable model' if map3_cv < 0.05 else '⚠️ Variable model'} (MAP@3 CV < 0.05)")

# Training time analysis
avg_fold_time = cv_results_df['training_time'].mean()
print(f"\n⏱️ TRAINING TIMES:")
print(f"  📊 Average time per fold: {avg_fold_time:.1f}s")
print(f"  📊 Total time: {total_time:.1f}s ({total_time/60:.1f}min)")

# Detailed results by fold
print(f"\n📋 DETAILED RESULTS BY FOLD:")
print("Fold  Accuracy   MAP@3    Best_Iter  Time(s)")
print("-" * 50)
for _, row in cv_results_df.iterrows():
    print(f"{row['fold']:2.0f}    {row['accuracy']:.4f}   {row['map3']:.4f}     {row['best_iteration']:4.0f}   {row['training_time']:6.1f}")

print("-" * 50)
print(f"Mean  {accuracy_mean:.4f}   {map3_mean:.4f}     {cv_results_df['best_iteration'].mean():4.0f}   {avg_fold_time:6.1f}")

# Feature importance analysis
if training_results['feature_importance'] is not None:
    print(f"\n🔍 TOP 10 MOST IMPORTANT FEATURES:")
    print("Rank  Feature               Importance")
    print("-" * 40)
    for i, (_, row) in enumerate(training_results['feature_importance'].head(10).iterrows()):
        print(f"{i+1:2d}.   {row['feature']:20} {row['mean']:8.4f}")

📊 CROSS-VALIDATION RESULTS
🎯 FINAL METRICS:
  📈 Cross-Validation Accuracy: 0.1998 ± 0.0020
  📈 Cross-Validation MAP@3:    0.3349 ± 0.0018
  📈 Out-of-Fold Accuracy:      0.1998
  📈 Out-of-Fold MAP@3:         0.3349

🔍 STABILITY ANALYSIS:
  📊 Coefficient of variation (Accuracy): 0.010
  📊 Coefficient of variation (MAP@3):    0.005
  ✅ Stable model (Accuracy CV < 0.05)
  ✅ Stable model (MAP@3 CV < 0.05)

⏱️ TRAINING TIMES:
  📊 Average time per fold: 470.8s
  📊 Total time: 4711.8s (78.5min)

📋 DETAILED RESULTS BY FOLD:
Fold  Accuracy   MAP@3    Best_Iter  Time(s)
--------------------------------------------------
 1    0.2033   0.3380      374    526.4
 2    0.2014   0.3365      329    473.6
 3    0.1994   0.3349      356    504.6
 4    0.1975   0.3326      321    470.4
 5    0.1983   0.3331      322    468.5
 6    0.2000   0.3355      285    423.2
 7    0.1989   0.3340      307    453.3
 8    0.2001   0.3357      320    466.3
 9    0.2016   0.3362      319    465.9
10    0.1969   0.3326  

## 🎯 10. Test Predictions Generation

**Final predictions using ensemble of trained models**
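
The prediction cell itself is not shown here, so the following is only a sketch of how averaging the fold models could look. It assumes the submission expects up to three space-separated class names per row in the `Fertilizer Name` column. The function `ensemble_top3`, the `_StubModel` class, and the class names `A` to `D` are hypothetical stand-ins so the snippet runs without trained models.

```python
import numpy as np
import pandas as pd

def ensemble_top3(models, X_test, class_names, k=3):
    """Average predict_proba over all models and return the top-k names per row."""
    class_names = np.asarray(class_names)
    proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
    topk = np.argsort(proba, axis=1)[:, -k:][:, ::-1]  # most probable first
    return [" ".join(class_names[row]) for row in topk]

class _StubModel:
    """Hypothetical stand-in for a fitted classifier with a fixed output."""
    def __init__(self, proba):
        self._proba = np.asarray(proba, dtype=float)

    def predict_proba(self, X):
        return np.tile(self._proba, (len(X), 1))

# Two stub "fold models" that disagree; averaging gives [0.35, 0.20, 0.375, 0.075]
models = [
    _StubModel([0.50, 0.30, 0.15, 0.05]),
    _StubModel([0.20, 0.10, 0.60, 0.10]),
]
X_test = pd.DataFrame({"id": [1, 2]})

submission = pd.DataFrame({
    "id": X_test["id"],
    "Fertilizer Name": ensemble_top3(models, X_test, ["A", "B", "C", "D"]),
})
print(submission)
```

In the notebook's terms, the inputs would presumably be `training_results['models'].values()`, `X_test_final`, and `label_encoders['target'].classes_`, with `test_df['id']` supplying the identifier column.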

## 💾 11. Model Persistence and Results

**Save trained models, metrics, and create submission file**

In [12]:
# =============================================================================
# FILE SAVING CONFIGURATION
# =============================================================================

import os
import json
import joblib
from datetime import datetime

# Configure model name based on MAP@3
overall_map3 = training_results['overall_map3']
model_name = f"XGB_{N_SPLITS}CV_MAP@3-{overall_map3:.5f}".replace('.', '')
model_dir = f"../models/XGB/{N_SPLITS}CV/{model_name}"

# Create directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

print(f"📁 MODEL DIRECTORY:")
print(f"  {model_dir}")

# File name configuration
base_filename = model_name
files_to_create = {
    'hparams': f"{base_filename}_hparams.json",
    'metrics': f"{base_filename}_metrics.json",
    'metrics_pkl': f"{base_filename}_metrics.pkl",
    'model_pkl': f"{base_filename}_model.pkl",
    'feature_import': f"{base_filename}_feature_importance.csv",
    'submission': f"{base_filename}_submission.csv",
    'submission_info': f"{base_filename}_submission_info.json"
}

print(f"\n📝 FILES TO CREATE:")
for file_type, filename in files_to_create.items():
    print(f"  {file_type:15}: {filename}")

📁 MODEL DIRECTORY:
  ../models/XGB/10CV/XGB_10CV_MAP@3-033492

📝 FILES TO CREATE:
  hparams        : XGB_10CV_MAP@3-033492_hparams.json
  metrics        : XGB_10CV_MAP@3-033492_metrics.json
  metrics_pkl    : XGB_10CV_MAP@3-033492_metrics.pkl
  model_pkl      : XGB_10CV_MAP@3-033492_model.pkl
  feature_import : XGB_10CV_MAP@3-033492_feature_importance.csv
  submission     : XGB_10CV_MAP@3-033492_submission.csv
  submission_info: XGB_10CV_MAP@3-033492_submission_info.json


In [None]:
# =============================================================================
# SAVE HYPERPARAMETERS
# =============================================================================

# General model information
hparams_data = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "optimization_method": "Manual hyperparameter tuning",
    "ensemble_method": "Average of fold predictions",
    
    # Fixed hyperparameters used
    "hyperparameters": xgb_params,
    
    # General configuration
    "features_selected": features_to_use,
    "num_features": len(features_to_use),
    "class_weights_used": True,
    "random_state": RANDOM_STATE,
    "cv_splits": N_SPLITS,
    "total_models": len(training_results['models']),  # count only; model objects are not JSON-serializable
    "early_stopping_rounds": es
}

# Save general hyperparameters
hparams_file = os.path.join(model_dir, files_to_create['hparams'])
with open(hparams_file, 'w') as f:
    json.dump(hparams_data, f, indent=2)

print(f"✅ Hyperparameters saved:")
print(f"  📄 General: {files_to_create['hparams']}")




In [None]:
# =============================================================================
# SAVE METRICS
# =============================================================================

# Extract metrics from training results
cv_results_df = pd.DataFrame(training_results['cv_scores'])
map3_mean = cv_results_df['map3'].mean()
map3_std = cv_results_df['map3'].std()
accuracy_mean = cv_results_df['accuracy'].mean()
accuracy_std = cv_results_df['accuracy'].std()

# Main metrics for JSON
metrics_data = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB",
    "tier": "10_FOLD_CV",
    "target_variable": "Fertilizer Name",
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "optimization_method": "Manual hyperparameter tuning",
    
    # Main metrics
    "map3_score_cv_mean": float(map3_mean),
    "map3_score_cv_std": float(map3_std),
    "map3_score_oof": float(training_results['overall_map3']),
    "accuracy_cv_mean": float(accuracy_mean),
    "accuracy_cv_std": float(accuracy_std),
    "accuracy_oof": float(training_results['overall_accuracy']),
    
    # Model information
    "num_classes": len(label_encoders['target'].classes_),
    "features_used": len(features_to_use),
    "features_list": features_to_use,
    "cv_folds": N_SPLITS,
    "total_models_trained": len(training_results['models']),
    
    # Metrics by fold
    "fold_results": training_results['cv_scores'],
    
    # Stability statistics
    "accuracy_cv_coefficient": float(accuracy_std / accuracy_mean) if accuracy_mean > 0 else 0.0,
    "map3_cv_coefficient": float(map3_std / map3_mean) if map3_mean > 0 else 0.0,
    
    # Times
    "training_time_total": float(total_time),
    "training_time_per_fold_avg": float(cv_results_df['training_time'].mean()),
    
    # Hyperparameters used
    "hyperparameters": xgb_params,
    
    # Metadata
    "timestamp": datetime.now().isoformat(),
    "kaggle_competition": "playground-series-s5e6"
}

# Save JSON metrics
metrics_file = os.path.join(model_dir, files_to_create['metrics'])
with open(metrics_file, 'w') as f:
    json.dump(metrics_data, f, indent=2)

# Complete metrics for PKL (includes complex objects)
metrics_pkl_data = {
    **metrics_data,
    "oof_predictions": training_results['oof_predictions'],
    "trained_models": training_results['models'],
    "feature_importance": training_results['feature_importance'],
    "label_encoders": label_encoders
}

# Save PKL metrics
metrics_pkl_file = os.path.join(model_dir, files_to_create['metrics_pkl'])
joblib.dump(metrics_pkl_data, metrics_pkl_file, compress=3)

print(f"✅ Metrics saved:")
print(f"  📄 JSON: {files_to_create['metrics']}")
print(f"  📄 PKL: {files_to_create['metrics_pkl']}")
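
`json.dump` raises `TypeError` on NumPy arrays and on most NumPy scalars (for example `np.float32` or `np.int64`), which may be present in dictionaries such as `fold_results` or `hyperparameters`. One way to guard against this, sketched here with a hypothetical `to_serializable` helper, is to pass a `default` callback that converts NumPy types to built-in Python types. The payload is illustrative and does not reproduce the notebook's metrics.

```python
import json
import numpy as np

def to_serializable(obj):
    """Fallback for json.dumps: convert NumPy types to built-ins."""
    if isinstance(obj, np.generic):      # NumPy scalars (np.float32, np.int64, ...)
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")

# Illustrative payload only
payload = {
    "map3": np.float32(0.5),
    "best_iteration": np.int64(374),
    "per_class": np.array([1, 2, 3]),
}
text = json.dumps(payload, default=to_serializable, indent=2)
restored = json.loads(text)
print(restored)
```

The same callback can be passed as `json.dump(data, f, indent=2, default=to_serializable)` when writing to a file.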



In [None]:
# =============================================================================
# SAVE TRAINED MODELS AND CREATE SUBMISSION
# =============================================================================

# Save feature importance
if training_results['feature_importance'] is not None:
    feature_importance_file = os.path.join(model_dir, files_to_create['feature_import'])
    training_results['feature_importance'].to_csv(feature_importance_file, index=False)
    print(f"✅ Feature importance saved: {files_to_create['feature_import']}")

# Save the ensemble of trained models
model_data = {
    "ensemble_models": training_results['models'],
    "model_type": "XGBClassifier",
    "cv_folds": N_SPLITS,
    "features_used": features_to_use,
    "hyperparameters": xgb_params,
    "label_encoders": label_encoders,
    "training_info": {
        "map3_cv_mean": float(map3_mean),
        "map3_oof": float(training_results['overall_map3']),
        "timestamp": datetime.now().isoformat()
    }
}

# Save models
model_file = os.path.join(model_dir, files_to_create['model_pkl'])
joblib.dump(model_data, model_file, compress=3)
print(f"✅ Models saved: {files_to_create['model_pkl']}")

# =============================================================================
# GENERATE TEST PREDICTIONS AND CREATE SUBMISSION
# =============================================================================

print(f"\n🔮 Generating test predictions...")

# Generate ensemble predictions
test_predictions_all = []
for fold_name, model in training_results['models'].items():
    pred_proba = model.predict_proba(X_test_final)
    test_predictions_all.append(pred_proba)

# Average predictions (ensemble)
test_predictions_ensemble = np.mean(test_predictions_all, axis=0)

# Get top 3 predictions for each sample
test_top3_indices = np.argsort(test_predictions_ensemble, axis=1)[:, -3:][:, ::-1]

# Convert indices to fertilizer names in one vectorized call
target_encoder = label_encoders['target']
test_top3_names = target_encoder.inverse_transform(
    test_top3_indices.ravel()
).reshape(test_top3_indices.shape)

# Create submission predictions (space-separated top 3)
submission_predictions = [' '.join(names) for names in test_top3_names]

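# Create submission DataFrame
# NOTE: 'id' is filled positionally (0..n-1) here. If the test file has its own
# 'id' column, use those values instead so rows match what Kaggle expects.
submission = pd.DataFrame({
    'id': range(len(test_predictions_ensemble)),
    'Fertilizer Name': submission_predictions
})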

# Save submission file
submission_file = os.path.join(model_dir, files_to_create['submission'])
submission.to_csv(submission_file, index=False)

# Submission information
submission_info = {
    "model_type": "XGBClassifier",
    "model_abbreviation": "XGB", 
    "cv_strategy": f"{N_SPLITS}-Fold Stratified Cross Validation",
    "map3_score_cv_mean": float(map3_mean),
    "map3_score_oof": float(training_results['overall_map3']),
    "submission_file": files_to_create['submission'],
    "num_predictions": len(submission),
    "format": "MAP@3 - Top 3 fertilizer names separated by spaces",
    "target_variable": "Fertilizer Name",
    "ensemble_models": len(training_results['models']),
    "features_used": len(features_to_use),
    "total_training_time_minutes": float(total_time / 60),
    "timestamp": datetime.now().isoformat(),
    "kaggle_competition": "playground-series-s5e6"
}

# Save submission information
submission_info_file = os.path.join(model_dir, files_to_create['submission_info'])
with open(submission_info_file, 'w') as f:
    json.dump(submission_info, f, indent=2)

print(f"✅ Submission created: {files_to_create['submission']}")
print(f"✅ Submission info saved: {files_to_create['submission_info']}")
print(f"📊 Submission shape: {submission.shape}")
print(f"🎯 Sample predictions:")
for i in range(min(3, len(submission))):
    print(f"  {i+1}. {submission.iloc[i, 1]}")
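
Before uploading, it can be worth checking that every row of the submission holds exactly three distinct, known labels. The sketch below does this with the standard library on an in-memory CSV. `validate_submission`, the sample ids, and the label set are illustrative assumptions, and the check assumes that labels contain no spaces.

```python
import csv
import io

def validate_submission(csv_text, valid_labels, target_col="Fertilizer Name"):
    """Return a list of (row_number, problem) tuples; empty means OK."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        labels = row[target_col].split()
        if len(labels) != 3:
            problems.append((row_number, "expected 3 labels"))
        elif len(set(labels)) != 3:
            problems.append((row_number, "duplicate labels"))
        elif not set(labels) <= valid_labels:
            problems.append((row_number, "unknown label"))
    return problems

# Illustrative data only
sample = "id,Fertilizer Name\n0,A B C\n1,A A B\n2,A B\n"
issues = validate_submission(sample, valid_labels={"A", "B", "C"})
print(issues)
```

For a file on disk, the same function can be fed the result of `open(path).read()`; an empty list means every row passed all three checks.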


In [None]:
# =============================================================================
# FINAL SUMMARY OF SAVED FILES
# =============================================================================

print(f"\n💾 FINAL SUMMARY - SAVED FILES")
print("=" * 70)

print(f"📁 DIRECTORY: {model_dir}")
print(f"\n📄 CREATED FILES:")

# Verify and show all created files
for file_type, filename in files_to_create.items():
    file_path = os.path.join(model_dir, filename)
    if os.path.exists(file_path):
        file_size = os.path.getsize(file_path)
        if file_size > 1024*1024:  # > 1MB
            size_str = f"{file_size/(1024*1024):.1f} MB"
        elif file_size > 1024:  # > 1KB
            size_str = f"{file_size/1024:.1f} KB"
        else:
            size_str = f"{file_size} bytes"
        
        print(f"  ✅ {filename:40} ({size_str})")
    else:
        print(f"  ❌ {filename:40} (NOT CREATED)")

print(f"\n🎯 MAIN METRICS:")
print(f"  📊 MAP@3 (CV Mean):    {map3_mean:.5f} ± {map3_std:.5f}")
print(f"  📊 MAP@3 (OOF):        {training_results['overall_map3']:.5f}")
print(f"  📊 Accuracy (OOF):     {training_results['overall_accuracy']:.5f}")

print(f"\n⚙️ MODEL CONFIGURATION:")
print(f"  🤖 Models:            {len(training_results['models'])} (ensemble)")
print(f"  📊 Features:          {len(features_to_use)}")
print(f"  ⏱️ Total time:        {total_time/60:.1f} minutes")

# Show some of the hyperparameters used
print(f"\n🏆 MAIN HYPERPARAMETERS:")
main_params = ['max_depth', 'learning_rate', 'n_estimators', 'reg_alpha', 'reg_lambda']
for param in main_params:
    if param in xgb_params:
        print(f"  {param:15}: {xgb_params[param]}")

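# Only report success if every expected file actually exists
missing_files = [name for name in files_to_create.values()
                 if not os.path.exists(os.path.join(model_dir, name))]
if missing_files:
    print(f"\n⚠️ {len(missing_files)} FILE(S) NOT CREATED: {', '.join(missing_files)}")
else:
    print(f"\n🎉 ALL FILES SAVED SUCCESSFULLY")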
print(f"📂 Location: {os.path.abspath(model_dir)}")
print(f"\n✨ XGBoost 10-FOLD CV MODEL COMPLETED")
print(f"🚀 Final MAP@3 score: {training_results['overall_map3']:.5f}")
print(f"🎯 Submission ready: {files_to_create['submission']}")
print(f"\n📈 WORKFLOW SUMMARY:")
print(f"  1. ✅ Data loaded and processed")
print(f"  2. ✅ Feature engineering applied")
print(f"  3. ✅ Categorical encoding completed")
print(f"  4. ✅ 10-fold CV training completed")
print(f"  5. ✅ Model evaluation finished")
print(f"  6. ✅ Test predictions generated")
print(f"  7. ✅ Files saved and submission created")
