# AgriFrost-AI Tutorial: End-to-End Frost Forecasting

<div align="center">

<img src="../docs/logo/AgriFrost-AI-transparent.png" alt="AgriFrost-AI Logo" width="150"/>

## 🌡️ AgriFrost-AI Complete Workflow Demonstration

**Complete example from data loading to model training to prediction generation**

*F3 Innovate Frost Risk Forecasting Challenge (2025)*

</div>

---

## 📋 Tutorial Contents

This notebook will guide you through the following steps:

1. **Environment Setup and Data Loading**
2. **Data Exploration and Visualization**
3. **Data Processing Pipeline (Cleaning, Feature Engineering, Labels)**
4. **Model Training (LightGBM)**
5. **Model Evaluation and Visualization**
6. **Feature Importance Analysis**
7. **Prediction Generation**

**Estimated Time**: ~30-60 minutes (depending on data size and hardware)

**Requirements**:
- Python 3.10+
- Project dependencies installed (`pip install -r requirements.txt`)
- Data downloaded to `data/raw/frost-risk-forecast-challenge/`



In [None]:
# 1. Import necessary libraries
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add the project root to sys.path (assumes the notebook runs from a directory one level below the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Set plotting style (compatible with different matplotlib versions)
try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    try:
        plt.style.use('seaborn-darkgrid')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print(f"📁 Project root directory: {project_root}")
print(f"🐍 Python version: {sys.version.split()[0]}")



## 1. Data Loading and Exploration

First, let's load the raw data and explore its structure.


In [None]:
# 1.1 Load raw data
from src.data.loaders import DataLoader

data_path = project_root / "data/raw/frost-risk-forecast-challenge/cimis_all_stations.csv.gz"

if not data_path.exists():
    print(f"❌ Data file not found: {data_path}")
    print("Please download the data first (refer to docs/README.md)")
else:
    print(f"📂 Loading data: {data_path}")
    loader = DataLoader()
    df_raw = loader.load_raw_data(data_path)
    print("✅ Data loaded successfully!")
    print(f"   Shape: {df_raw.shape}")
    print(f"   Columns: {len(df_raw.columns)}")
    print(f"   Time range: {df_raw['Date'].min()} to {df_raw['Date'].max()}")
    print(f"   Number of stations: {df_raw['Stn Id'].nunique()}")


In [None]:
# 1.2 View data overview
if 'df_raw' in locals():
    print("📊 Data Overview:")
    print(df_raw.head(10))
    print("\n📋 Data Information:")
    df_raw.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print("\n📈 Descriptive Statistics:")
    print(df_raw.describe())


## 2. Data Visualization

Let's visualize some key patterns and features.


In [None]:
# 2.1 Time series visualization
if 'df_raw' in locals():
    # Convert Date column to datetime
    df_raw['Date'] = pd.to_datetime(df_raw['Date'])
    
    # Select a single station for visualization (e.g., Station 2)
    df_station = df_raw[df_raw['Stn Id'] == 2].copy()
    df_station = df_station.sort_values('Date')
    
    # Take the last 1000 rows for quick visualization
    df_sample = df_station.tail(1000)
    
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    # Temperature time series
    axes[0].plot(df_sample['Date'], df_sample['Air Temp (C)'], label='Air Temperature', linewidth=1)
    axes[0].axhline(y=0, color='r', linestyle='--', label='Frost Threshold (0°C)')
    axes[0].set_xlabel('Date')
    axes[0].set_ylabel('Temperature (°C)')
    axes[0].set_title('Air Temperature Time Series (Station 2, Last 1000 Hours)')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Relative humidity time series
    axes[1].plot(df_sample['Date'], df_sample['Rel Hum (%)'], label='Relative Humidity', color='green', linewidth=1)
    axes[1].set_xlabel('Date')
    axes[1].set_ylabel('Relative Humidity (%)')
    axes[1].set_title('Relative Humidity Time Series (Station 2, Last 1000 Hours)')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("📊 Visualization complete! Showing Station 2's last 1000 hours of data")


In [None]:
# 2.2 Frost event statistics
if 'df_raw' in locals():
    # Identify frost events (≤0°C)
    df_raw['is_frost'] = (df_raw['Air Temp (C)'] <= 0.0).astype(int)
    
    # Statistics of frost events by month
    df_raw['Month'] = pd.to_datetime(df_raw['Date']).dt.month
    frost_by_month = df_raw.groupby('Month')['is_frost'].agg(['sum', 'count', 'mean'])
    frost_by_month.columns = ['Frost Events', 'Total Observations', 'Frost Rate']
    
    print("📊 Monthly frost event statistics:")
    print(frost_by_month)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Frost rate by month
    axes[0].bar(frost_by_month.index, frost_by_month['Frost Rate'] * 100, color='steelblue')
    axes[0].set_xlabel('Month')
    axes[0].set_ylabel('Frost Rate (%)')
    axes[0].set_title('Frost Rate by Month')
    axes[0].set_xticks(range(1, 13))
    axes[0].grid(True, alpha=0.3, axis='y')
    
    # Number of frost events by month
    axes[1].bar(frost_by_month.index, frost_by_month['Frost Events'], color='coral')
    axes[1].set_xlabel('Month')
    axes[1].set_ylabel('Number of Frost Events')
    axes[1].set_title('Total Frost Events by Month')
    axes[1].set_xticks(range(1, 13))
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n❄️ Total frost events: {df_raw['is_frost'].sum():,}")
    print(f"📊 Frost rate: {df_raw['is_frost'].mean()*100:.2f}%")
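
One caveat about the frost flag computed above: in pandas, `NaN <= 0.0` evaluates to `False`, so hours with a missing temperature are counted as non-frost and lower the reported frost rate. Whether this matters depends on how many readings are missing in the raw data, which this tutorial does not report. A minimal sketch on synthetic values that keeps missing readings out of the denominator:

```python
import numpy as np
import pandas as pd

# Synthetic temperatures with one missing reading (not real station data).
temps = pd.Series([-1.0, np.nan, 2.0, 0.0], name="Air Temp (C)")

# Naive flag: the missing reading becomes 0 (non-frost).
naive_flag = (temps <= 0.0).astype(int)
naive_rate = naive_flag.mean()

# Masked flag: the missing reading stays NaN and is skipped by mean().
masked_flag = (temps <= 0.0).astype(float).where(temps.notna())
masked_rate = masked_flag.mean()

print(f"naive rate:  {naive_rate:.3f}")   # 2 frost hours over 4 rows
print(f"masked rate: {masked_rate:.3f}")  # 2 frost hours over 3 valid rows
```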


## 3. Data Processing Pipeline

Now let's use the unified data processing pipeline to clean the data and generate features and labels.
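
To make "labels" concrete: a 12-hour-ahead frost label can be built by shifting each station's temperature series back by the horizon and comparing the shifted value to the threshold. The sketch below is illustrative only. It assumes one row per station per hour with no gaps and uses the raw column names (`Stn Id`, `Date`, `Air Temp (C)`). The helper `make_horizon_labels` and its output column names are hypothetical, and `DataPipeline` may define its labels differently.

```python
import pandas as pd

def make_horizon_labels(df, horizon=12, threshold=0.0):
    """Hypothetical helper: add future-temperature and frost labels per station.

    Assumes one row per station per hour with no missing hours, so that
    shifting by `horizon` rows equals looking `horizon` hours ahead.
    """
    out = df.sort_values(["Stn Id", "Date"]).copy()
    future = out.groupby("Stn Id")["Air Temp (C)"].shift(-horizon)
    out[f"temp_{horizon}h"] = future
    # Keep NaN where the future value is unknown instead of labeling it "no frost".
    out[f"frost_{horizon}h"] = (future <= threshold).astype(float).where(future.notna())
    return out

# Tiny synthetic example: one station, 6 hourly readings, 2-hour horizon.
demo = pd.DataFrame({
    "Stn Id": [2] * 6,
    "Date": pd.Timestamp("2025-01-01") + pd.to_timedelta(list(range(6)), unit="h"),
    "Air Temp (C)": [3.0, 1.0, -0.5, -1.0, 0.5, 2.0],
})
labeled = make_horizon_labels(demo, horizon=2)
print(labeled[["Date", "Air Temp (C)", "temp_2h", "frost_2h"]])
```

The last `horizon` rows of each station have no known future value, so their labels stay `NaN` and would need to be dropped before training.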


In [None]:
# 3.1 Configure data processing pipeline
from src.data import DataPipeline

# Configuration (using Top 175 features)
config = {
    "cleaning": {
        "config_path": str(project_root / "config/data_cleaning.yaml")
    },
    "labels": {
        "threshold": 0.0  # Frost threshold: 0°C
    },
    "feature_engineering": {
        "enabled": True,
        "feature_selection": {
            "method": "top_k",
            "top_k": 175  # Use Top 175 features
        }
    },
    "random_state": 42
}

print("⚙️ Configuring data pipeline...")
pipeline = DataPipeline(config=config)
print("✅ Data pipeline created successfully!")


In [None]:
# 3.2 Process data (using sampling to speed up demonstration)
if 'data_path' in locals() and data_path.exists():
    print("🔄 Starting data processing...")
    print("   ⚠️  Note: For demonstration speed, we use sampled data (100,000 rows)")
    print("   💡 For actual training, remove the sample_size parameter to use the full data")
    
    # Process data (with sampling)
    dataset_bundle = pipeline.run(
        data_path=data_path,
        horizons=[12],  # Only process 12h horizon
        use_feature_engineering=True,
        sample_size=100000,  # Sample 100,000 rows for demonstration
        random_state=42
    )
    
    df_processed = dataset_bundle.data
    print("✅ Data processing complete!")
    print(f"   Processed shape: {df_processed.shape}")
    print(f"   Number of features: {len(dataset_bundle.feature_columns)}")
    print(f"   Number of labels: {len(dataset_bundle.label_columns)}")
    
    # Display feature columns
    print("\n📋 Feature column examples (first 20):")
    for i, feat in enumerate(dataset_bundle.feature_columns[:20]):
        print(f"   {i+1}. {feat}")
    if len(dataset_bundle.feature_columns) > 20:
        print(f"   ... (Total {len(dataset_bundle.feature_columns)} features)")
else:
    print("❌ Data file not found, skipping data processing step")


## 4. Model Training

Now let's train a LightGBM model.


In [None]:
# 4.1 Prepare training data
if 'df_processed' in locals():
    from src.training.data_preparation import prepare_features_and_targets
    from src.evaluation.validators import CrossValidator
    from src.models.registry import get_model_class
    
    # Step 1: Time series split (to avoid data leakage)
    # Note: Must split data first, then prepare features and labels
    print("📊 Performing time series split...")
    train_df, val_df, test_df = CrossValidator.time_split(
        df=df_processed,
        train_ratio=0.7,
        val_ratio=0.15,
        date_col="Date"
    )
    
    print(f"   Training set: {len(train_df)} samples")
    print(f"   Validation set: {len(val_df)} samples")
    print(f"   Test set: {len(test_df)} samples")
    
    # Step 2: Prepare features and labels for each split (12h horizon)
    print("\n🔧 Preparing training set features and labels...")
    X_train, y_frost_train, y_temp_train = prepare_features_and_targets(
        df=train_df,
        horizon=12,
        track="top175_features"
    )
    
    print("🔧 Preparing validation set features and labels...")
    X_val, y_frost_val, y_temp_val = prepare_features_and_targets(
        df=val_df,
        horizon=12,
        track="top175_features"
    )
    
    print("🔧 Preparing test set features and labels...")
    X_test, y_frost_test, y_temp_test = prepare_features_and_targets(
        df=test_df,
        horizon=12,
        track="top175_features"
    )
    
    print("\n✅ Data preparation complete!")
    print(f"   Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
    print(f"   Validation set: {X_val.shape[0]} samples")
    print(f"   Test set: {X_test.shape[0]} samples")
    print(f"   Frost events (training set): {y_frost_train.sum()} ({y_frost_train.mean()*100:.2f}%)")
    print(f"   Average temperature (training set): {y_temp_train.mean():.2f}°C")
else:
    print("❌ Skipping model training (data not processed)")
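
The split above is chronological rather than random because weather observations in neighbouring hours are highly correlated, so a random split would let the model see the near future of its own validation rows. The sketch below shows one way a 70/15/15 split by time could work. `chronological_split` is a hypothetical stand-in written for this illustration, and the actual `CrossValidator.time_split` may behave differently (for example, in how it treats rows that share a timestamp).

```python
import pandas as pd

def chronological_split(df, train_ratio=0.7, val_ratio=0.15, date_col="Date"):
    """Hypothetical sketch: split rows into train/val/test by timestamp order."""
    # Cut on unique timestamps so that rows sharing a timestamp
    # (e.g. different stations) never straddle a boundary.
    times = pd.Series(df[date_col].unique()).sort_values().reset_index(drop=True)
    n_train = round(len(times) * train_ratio)
    n_val = round(len(times) * val_ratio)
    train_end = times.iloc[n_train - 1]
    val_end = times.iloc[n_train + n_val - 1]

    train = df[df[date_col] <= train_end]
    val = df[(df[date_col] > train_end) & (df[date_col] <= val_end)]
    test = df[df[date_col] > val_end]
    return train, val, test

# Synthetic check: 2 stations x 20 hourly timestamps.
hours = pd.Timestamp("2025-01-01") + pd.to_timedelta(list(range(20)), unit="h")
demo = pd.DataFrame({
    "Date": list(hours) * 2,
    "Stn Id": [2] * 20 + [7] * 20,
})
train, val, test = chronological_split(demo)
print(len(train), len(val), len(test))
```

Every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp, so no future information reaches the model during training.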


In [None]:
# 4.2 Train classification model (frost probability prediction)
if 'X_train' in locals():
    print("🤖 Training frost classification model (LightGBM)...")
    
    # Get model class
    ModelClass = get_model_class('lightgbm')
    
    # Create model instance
    frost_model = ModelClass(
        config={
            'task_type': 'classification',
            'model_params': {
                'n_estimators': 100,  # Fewer trees for demonstration, can use more in practice
                'learning_rate': 0.05,
                'max_depth': 7,
                'random_state': 42,
                'verbosity': -1
            }
        }
    )
    
    # Train model
    frost_model.fit(
        X=X_train,
        y=y_frost_train,
        eval_set=[(X_val, y_frost_val)]
    )
    
    print("✅ Classification model training complete!")
    
    # Train regression model (temperature prediction)
    print("🤖 Training temperature regression model (LightGBM)...")
    
    temp_model = ModelClass(
        config={
            'task_type': 'regression',
            'model_params': {
                'n_estimators': 100,
                'learning_rate': 0.05,
                'max_depth': 7,
                'random_state': 42,
                'verbosity': -1
            }
        }
    )
    
    temp_model.fit(
        X=X_train,
        y=y_temp_train,
        eval_set=[(X_val, y_temp_val)]
    )
    
    print("✅ Regression model training complete!")
else:
    print("❌ Skipping model training (data not prepared)")


## 5. Model Evaluation and Visualization

Let's evaluate the model's performance and visualize the results.
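
Two of the metrics reported in this section, the Brier score and the expected calibration error (ECE), measure how trustworthy the predicted probabilities are rather than how well they rank events. As a rough sketch of what they compute (the function names here are hypothetical, and `MetricsCalculator` may use a different binning scheme or bin count for ECE):

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins: weighted mean of |observed rate - mean confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each probability to a bin in [0, n_bins - 1]; p == 1.0 goes in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the share of samples in the bin
    return float(ece)

# Synthetic example, not model output.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.7, 0.9]
brier = brier_score(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
print(f"Brier: {brier:.4f}")
print(f"ECE:   {ece:.4f}")
```

For both metrics, lower is better, and 0 means the probabilities match the outcomes exactly.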


In [None]:
# 5.1 Evaluate classification model
if 'frost_model' in locals():
    from src.evaluation.metrics import MetricsCalculator
    
    # Generate predictions
    y_frost_pred = frost_model.predict(X_test)
    y_frost_proba = frost_model.predict_proba(X_test)
    
    # Calculate metrics
    metrics_calc = MetricsCalculator()
    class_metrics = metrics_calc.calculate_classification_metrics(
        y_true=y_frost_test,
        y_pred=y_frost_pred,
        y_proba=y_frost_proba
    )
    
    print("📊 Classification Model Performance (Test Set):")
    print(f"   ROC-AUC: {class_metrics['roc_auc']:.4f}")
    print(f"   PR-AUC: {class_metrics['pr_auc']:.4f}")
    print(f"   Brier Score: {class_metrics['brier_score']:.4f}")
    print(f"   ECE: {class_metrics['ece']:.4f}")
    print(f"   Accuracy: {class_metrics['accuracy']:.4f}")
    print(f"   Precision: {class_metrics['precision']:.4f}")
    print(f"   Recall: {class_metrics['recall']:.4f}")
    print(f"   F1 Score: {class_metrics['f1_score']:.4f}")
else:
    print("❌ Skipping classification evaluation (model not trained)")
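
Later cells (the ROC curve in 5.3 and the risk tiers in 7.1) use `y_frost_proba` as a 1-D array of frost probabilities. The project's model wrapper may already return that shape. Scikit-learn-style binary classifiers, however, return an `(n_samples, 2)` array from `predict_proba`. If you swap in such a model, a small helper along these lines (hypothetical, not part of the project) keeps the downstream cells working:

```python
import numpy as np

def positive_class_proba(proba):
    """Return a 1-D array of positive-class probabilities.

    Accepts either a 1-D array (already positive-class probabilities)
    or a 2-D (n_samples, 2) array in scikit-learn's predict_proba layout,
    where column 1 is the positive class.
    """
    proba = np.asarray(proba, dtype=float)
    if proba.ndim == 1:
        return proba
    if proba.ndim == 2 and proba.shape[1] == 2:
        return proba[:, 1]
    raise ValueError(f"Unexpected probability shape: {proba.shape}")

two_col = np.array([[0.9, 0.1], [0.2, 0.8]])
one_col = np.array([0.1, 0.8])
print(positive_class_proba(two_col))
print(positive_class_proba(one_col))
```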


In [None]:
# 5.2 Evaluate regression model
if 'temp_model' in locals():
    # Generate predictions
    y_temp_pred = temp_model.predict(X_test)
    
    # Calculate metrics
    reg_metrics = metrics_calc.calculate_regression_metrics(
        y_true=y_temp_test,
        y_pred=y_temp_pred
    )
    
    print("📊 Regression Model Performance (Test Set):")
    print(f"   MAE: {reg_metrics['mae']:.4f}°C")
    print(f"   RMSE: {reg_metrics['rmse']:.4f}°C")
    print(f"   R²: {reg_metrics['r2']:.4f}")
    print(f"   MAPE: {reg_metrics.get('mape', 'N/A')}")
else:
    print("❌ Skipping regression evaluation (model not trained)")


In [None]:
# 5.3 Visualize prediction results
if 'y_temp_pred' in locals() and 'y_frost_proba' in locals():
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Temperature prediction vs true values
    axes[0, 0].scatter(y_temp_test, y_temp_pred, alpha=0.5, s=10)
    axes[0, 0].plot([y_temp_test.min(), y_temp_test.max()], 
                    [y_temp_test.min(), y_temp_test.max()], 
                    'r--', lw=2, label='Perfect Prediction')
    axes[0, 0].set_xlabel('True Temperature (°C)')
    axes[0, 0].set_ylabel('Predicted Temperature (°C)')
    axes[0, 0].set_title(f'Temperature Prediction (R² = {reg_metrics["r2"]:.4f})')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Temperature prediction error distribution
    temp_errors = y_temp_pred - y_temp_test
    axes[0, 1].hist(temp_errors, bins=50, edgecolor='black', alpha=0.7)
    axes[0, 1].axvline(x=0, color='r', linestyle='--', linewidth=2)
    axes[0, 1].set_xlabel('Prediction Error (°C)')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title(f'Temperature Prediction Error Distribution (MAE = {reg_metrics["mae"]:.4f}°C)')
    axes[0, 1].grid(True, alpha=0.3, axis='y')
    
    # 3. ROC curve
    from sklearn.metrics import roc_curve, auc
    fpr, tpr, _ = roc_curve(y_frost_test, y_frost_proba)
    roc_auc = auc(fpr, tpr)
    axes[1, 0].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
    axes[1, 0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
    axes[1, 0].set_xlabel('False Positive Rate')
    axes[1, 0].set_ylabel('True Positive Rate')
    axes[1, 0].set_title('ROC Curve for Frost Classification')
    axes[1, 0].legend(loc="lower right")
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. Frost probability distribution
    axes[1, 1].hist(y_frost_proba[y_frost_test == 0], bins=50, alpha=0.7, label='No Frost', color='blue')
    axes[1, 1].hist(y_frost_proba[y_frost_test == 1], bins=50, alpha=0.7, label='Frost', color='red')
    axes[1, 1].set_xlabel('Predicted Frost Probability')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Frost Probability Distribution')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("✅ Visualization complete!")
else:
    print("❌ Skipping visualization (prediction results not generated)")


## 6. Feature Importance Analysis

Let's examine which features are most important to the model.


In [None]:
# 6.1 Get feature importance
if 'frost_model' in locals() and 'X_train' in locals():
    try:
        # Get feature importance (using LightGBM booster API)
        # Note: LightGBM requires using booster_.feature_importance() method
        feature_importance = frost_model.model.booster_.feature_importance(importance_type='gain')
        importance_df = pd.DataFrame({
            'feature': X_train.columns.tolist(),  # Use training set feature columns
            'importance': feature_importance
        }).sort_values('importance', ascending=False)
        
        # Display Top 20 most important features
        print("🔝 Top 20 Most Important Features:")
        print(importance_df.head(20).to_string(index=False))
        
        # Visualize Top 20 feature importance
        fig, ax = plt.subplots(figsize=(12, 8))
        top_features = importance_df.head(20)
        ax.barh(range(len(top_features)), top_features['importance'].values, color='steelblue')
        ax.set_yticks(range(len(top_features)))
        ax.set_yticklabels(top_features['feature'].values)
        ax.set_xlabel('Feature Importance (Gain)')
        ax.set_title('Top 20 Feature Importance (Frost Classification Model)')
        ax.invert_yaxis()  # Most important at top
        plt.tight_layout()
        plt.show()
        
    except Exception as e:
        print(f"⚠️  Unable to get feature importance: {e}")
        print("   This may be because the model type doesn't support it or model is not properly initialized")
        print("   Trying to use feature_importances_ attribute:")
        try:
            # Fallback method: use feature_importances_ attribute
            feature_importance = frost_model.model.feature_importances_
            importance_df = pd.DataFrame({
                'feature': X_train.columns.tolist(),
                'importance': feature_importance
            }).sort_values('importance', ascending=False)
            print("\n✅ Successfully obtained feature importance using fallback method:")
            print(importance_df.head(20).to_string(index=False))
        except Exception as e2:
            print(f"   Fallback method also failed: {e2}")
else:
    print("❌ Skipping feature importance analysis (model not trained)")


## 7. Generate Predictions

Finally, let's use the trained model to generate new predictions.


In [None]:
# 7.1 Prepare new data for prediction
if 'X_test' in locals() and 'frost_model' in locals() and 'temp_model' in locals():
    # Use a portion of the test set as new data
    new_data = X_test[:100].copy()  # Take first 100 samples
    
    # Generate predictions
    frost_proba_predictions = frost_model.predict_proba(new_data)
    temp_predictions = temp_model.predict(new_data)
    
    # Create prediction results DataFrame
    predictions_df = pd.DataFrame({
        'Frost_Probability': frost_proba_predictions,
        'Temperature_Prediction_C': temp_predictions,
        'Frost_Risk': ['Low' if p < 0.1 else 'Medium' if p < 0.5 else 'High' for p in frost_proba_predictions]
    })
    
    print("📊 Prediction Results Example (First 20):")
    print(predictions_df.head(20).to_string(index=True))
    
    # Statistics of high-risk predictions (same >= 0.5 cut-off as the 'High' tier above)
    high_risk = (predictions_df['Frost_Probability'] >= 0.5).sum()
    print(f"\n⚠️  High-risk predictions (probability >= 0.5): {high_risk} / {len(predictions_df)} ({high_risk/len(predictions_df)*100:.1f}%)")
    
else:
    print("❌ Skipping prediction generation (model or data not available)")
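
The nested conditional that assigns `Frost_Risk` in the cell above can be easier to read and test as a small function. A sketch using the same cut-offs (0.1 and 0.5) follows. The function name is hypothetical, and the cut-offs are simply the values used in this tutorial.

```python
def frost_risk_tier(probability, low_cutoff=0.1, high_cutoff=0.5):
    """Map a frost probability to 'Low', 'Medium', or 'High'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"Probability must be in [0, 1], got {probability}")
    if probability < low_cutoff:
        return "Low"
    if probability < high_cutoff:
        return "Medium"
    return "High"

for p in (0.05, 0.1, 0.49, 0.5, 0.9):
    print(p, frost_risk_tier(p))
```

With this helper, the column could be built as `[frost_risk_tier(p) for p in frost_proba_predictions]`, and a probability of exactly 0.5 falls in the "High" tier.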


In [None]:
# 7.2 Visualize prediction results
if 'predictions_df' in locals():
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Frost probability distribution
    axes[0].hist(predictions_df['Frost_Probability'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0].axvline(x=0.5, color='r', linestyle='--', linewidth=2, label='High Risk Threshold (0.5)')
    axes[0].set_xlabel('Frost Probability')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of Frost Probability Predictions')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3, axis='y')
    
    # Temperature prediction distribution
    axes[1].hist(predictions_df['Temperature_Prediction_C'], bins=30, edgecolor='black', alpha=0.7, color='coral')
    axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2, label='Frost Threshold (0°C)')
    axes[1].set_xlabel('Predicted Temperature (°C)')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('Distribution of Temperature Predictions')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("✅ Prediction visualization complete!")
else:
    print("❌ Skipping visualization (prediction results not available)")


## 8. Summary and Next Steps

Congratulations! You have completed the end-to-end AgriFrost-AI workflow demonstration!

### 📋 Tutorial Summary

✅ **Completed**:
1. Data loading and exploration
2. Data visualization and statistical analysis
3. Data cleaning and feature engineering
4. Model training (classification and regression)
5. Model evaluation and performance analysis
6. Feature importance analysis
7. Prediction generation and visualization

### 🚀 Suggested Next Steps

1. **Try Different Models**:
   - XGBoost: `get_model_class('xgboost')`
   - CatBoost: `get_model_class('catboost')`
   - LSTM: `get_model_class('lstm')` (requires GPU)

2. **Try Different Time Horizons**:
   - Set `horizons=[3, 6, 12, 24]` in `pipeline.run(...)` to process multiple horizons

3. **Try Different Feature Sets**:
   - Full feature set (298 features)
   - Custom feature selection

4. **Spatial Aggregation**:
   - Try Matrix Cell C/D (multi-station features)
   - Try Matrix Cell E (graph neural networks)

5. **Model Tuning**:
   - Use hyperparameter optimization
   - Try different model configurations

### 📚 Further Learning

- 📖 **Quick Start**: `docs/README.md`
- 🏗️ **Implementation Guide**: `docs/IMPLEMENTATION_GUIDE.md` / `docs/IMPLEMENTATION_GUIDE_CN.md`
- 🔬 **Technical Documentation**: `docs/technical/TECHNICAL_DOCUMENTATION.md`
- 🤖 **Model Guide**: `docs/MODELS_GUIDE.md`
- 📊 **Feature Guide**: `docs/features/FEATURE_GUIDE.md`

### 💡 Tips

- Training on the full dataset (remove the `sample_size` parameter) can improve performance over the 100,000-row sample used here
- Increasing `n_estimators` can improve accuracy, at the cost of longer training time
- A GPU can speed up training of the deep learning models
- LOSO (leave-one-station-out) evaluation tests how well the model generalizes to stations it has not seen

---

**Thank you for using AgriFrost-AI!** 🌡️🤖

**Documentation Version**: 1.0  
**Last Updated**: 2025-12-06  
**Author**: Zhengkun LI (TRIC Robotics / UF ABE)

