# 02. End-to-End Training Workflow (XGBoost Only)

This notebook demonstrates how to programmatically train the **Standalone XGBoost** gas fee prediction model using the `src.train` pipeline.

**Core Modules Tested:**
- `src.train.TrainingPipeline`
- `src.models.XGBoostGasFeeModel`

In [1]:
import sys
from pathlib import Path
import logging

# Setup logging to see pipeline output in notebook
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    force=True
)
logger = logging.getLogger(__name__)

# Add project root to path
project_root = Path('..').resolve()
sys.path.insert(0, str(project_root))

from src.train import TrainingPipeline
from src.fetch import BlockDataFetcher as DataFetcher
from src.features import FeatureEngineer

## 0. Dataset Preparation
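Set `GENERATE_NEW_DATA` to `True` to download fresh blocks and rebuild the features. When it is `False`, the existing `features.parquet` is reused, and data is generated only if that file is missing.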

In [2]:
# --- USER CONFIGURATION ---
GENERATE_NEW_DATA = False
N_BLOCKS = 10000
# --------------------------

raw_data_path = project_root / 'data' / 'blocks.csv'
features_path = project_root / 'data' / 'features.parquet'

# Check if data exists
data_exists = features_path.exists()

if GENERATE_NEW_DATA or not data_exists:
    print(f"🔄 Generating/Downloading {N_BLOCKS} blocks... (This may take a while)")
    
    # 1. Fetch Data
    fetcher = DataFetcher()
    df_blocks = fetcher.fetch_blocks(n_blocks=N_BLOCKS)
    fetcher.save_to_csv(df_blocks, str(raw_data_path))
    
    # 2. Engineer Features
    engineer = FeatureEngineer()
    df_blocks = engineer.load_data(str(raw_data_path))
    # Build the model features from the raw block data
    df_features = engineer.engineer_features(df_blocks)
    
    # Save features
    df_features.to_parquet(features_path)
    print(f"✅ Data generation complete. Saved to {features_path}")
else:
    print("✅ Using existing dataset.")
    # This branch only runs when features_path exists, so it can be read directly.
    import pandas as pd
    df = pd.read_parquet(features_path)
    print(f"   Path: {features_path}")
    print(f"   Samples: {len(df)}")

✅ Using existing dataset.
   Path: D:\SKRIPSI\gas-ml\data\features.parquet
   Samples: 9994


## 1. Setup Configuration

In [3]:
config_path = project_root / 'cfg' / 'exp.yaml'
data_path = project_root / 'data' / 'features.parquet'

if not config_path.exists():
    raise FileNotFoundError(f"Config not found at {config_path}")

print(f"Config: {config_path}")
print(f"Data: {data_path}")

Config: D:\SKRIPSI\gas-ml\cfg\exp.yaml
Data: D:\SKRIPSI\gas-ml\data\features.parquet


## 2. Train Standalone XGBoost Model
This model uses simple lag features and does NOT require complex target normalization.
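
As a rough illustration of what "simple lag features" can look like, the sketch below shifts a fee column by a few blocks so each row carries the values from earlier blocks. The column name `base_fee`, the helper `add_lag_features`, and the chosen lags are hypothetical; the features actually produced by `FeatureEngineer.engineer_features` may differ.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, column: str, lags: list[int]) -> pd.DataFrame:
    """Return a copy of df with one shifted copy of `column` per lag.

    Hypothetical helper for illustration only; not part of src.features.
    """
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down, so row i sees the value from block i - lag
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    # The first max(lags) rows have no history, so drop them
    return out.dropna().reset_index(drop=True)


# Toy data standing in for per-block base fees (assumed column name)
blocks = pd.DataFrame({"base_fee": [10.0, 12.0, 11.0, 13.0, 15.0]})
lagged = add_lag_features(blocks, "base_fee", lags=[1, 2])
print(lagged)
```

Because each lag column only looks backwards in time, features built this way do not leak information from future blocks into the current row.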

In [None]:
xgb_output_dir = project_root / 'models' / 'xgboost_notebook'

# --- REFACTORED ROBUST TRAINING CALL ---
# Use the refactored manual training module (src.train_xgb) to reproduce the reported 98% accuracy
from src.train_xgb import train_xgboost_from_config

print("🚀 Starting STANDALONE XGBoost Training (Golden Configuration)...")
print("   - Custom Asymmetric Objective (2.5x penalty)")
print("   - Optimized Data Split (20% Test, matching Inference)")
print("   - Base Score Initialization")

metrics = train_xgboost_from_config(
    cfg_path=str(config_path),
    data_path=str(data_path),
    output_dir=str(xgb_output_dir)
)

print(f"\n✅ XGBoost model saved to: {xgb_output_dir}")
print(f"✅ Achieved R2 Score: {metrics['r2']:.4f}")