# Logistics Training Pipeline (Cleaned)

This notebook trains global LightGBM models to predict shipment **Cost** (fit on `log1p(Cost)`) and **Transit_Days**, plus optional carrier-specific cost models,
applies **per-carrier calibration**, evaluates performance, and writes Streamlit-friendly artifacts:

- `model_artifacts/preprocessor.pkl`  
- `model_artifacts/cost_model.pkl`  
- `model_artifacts/cost_models_grouped.pkl` (optional carrier-specific)  
- `model_artifacts/preprocessor_days.pkl`  
- `model_artifacts/transit_days_model.pkl`  
- `model_artifacts/best_params.json` (includes features + calibration)  
- `model_artifacts/carrier_performance.csv`  
- `model_artifacts/carrier_rmse_analysis.png`

You can configure data sources via environment variables:

- **MySQL (preferred)**: `MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_HOST`, `MYSQL_PORT`, `MYSQL_DB`  
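- **Fallback CSV**: `LOGISTICS_CSV_PATH` (defaults to `logistics_shipments_dataset.csv`; the pipeline reads `Cost`, `Carrier`, `Distance_miles`, `Weight_kg`, `origin_warehouse`, `Destination`, and, when present, `shipment_date` and `Transit_Days`)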
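
Before running, it can help to see which settings come from the environment and which fall back to defaults. The sketch below is a hypothetical pre-flight helper, not part of the pipeline: `describe_config` and `DEFAULTS` are names made up for illustration, and the default values are copied from the fallbacks that `load_dataset()` passes to `os.getenv`.

```python
import os

# Defaults copied from the os.getenv fallbacks in load_dataset().
DEFAULTS = {
    "MYSQL_USER": "root",
    "MYSQL_PASSWORD": "password",
    "MYSQL_HOST": "localhost",
    "MYSQL_PORT": "3306",
    "MYSQL_DB": "sys",
    "LOGISTICS_CSV_PATH": "logistics_shipments_dataset.csv",
}

def describe_config(env=None):
    """Hypothetical helper: report, per variable, whether it is set or defaulted."""
    env = os.environ if env is None else env
    report = {}
    for name, default in DEFAULTS.items():
        if name in env:
            report[name] = ("env", env[name])
        else:
            report[name] = ("default", default)
    return report

# Example with an explicit mapping instead of the real environment.
report = describe_config({"MYSQL_HOST": "db.internal"})
for name, (source, value) in report.items():
    shown = "***" if "PASSWORD" in name else value  # never echo secrets
    print(f"{name:20s} {source:8s} {shown}")
```

Calling `describe_config()` with no argument inspects the real environment instead of the example mapping.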

Run the final cell to execute `main()`.


In [8]:

import os

# Threading/parallelism controls. Set them before importing numpy, scikit-learn,
# and LightGBM so their thread pools pick up the limits at load time.
os.environ["LOKY_MAX_CPU_COUNT"] = "1"
os.environ["JOBLIB_MULTIPROCESSING"] = "0"
os.environ["OMP_NUM_THREADS"] = "1"

import json
import joblib
import logging
from datetime import datetime, UTC

import numpy as np
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import FeatureHasher

import lightgbm as lgb
import optuna
import matplotlib.pyplot as plt
import platform
import sklearn as sk

# -------------------------
# Logging
# -------------------------
load_dotenv("config.env")
os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("logs/logistics_model.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
logger.info(
    "Env versions: Python %s | numpy %s | pandas %s | sklearn %s | lightgbm %s",
    platform.python_version(), np.__version__, pd.__version__, sk.__version__, lgb.__version__
)


2025-09-07 19:17:58,199 - INFO - Env versions: Python 3.13.7 | numpy 2.3.2 | pandas 2.3.1 | sklearn 1.7.1 | lightgbm 4.6.0


## Data cleaning & feature engineering helpers

In [9]:

def handle_missing_data_smart(df: pd.DataFrame) -> pd.DataFrame:
    """Carrier-aware imputation; only rows with a missing Cost target are dropped."""
    df = df.copy()
    initial_count = len(df)
    df = df.dropna(subset=['Cost'])
    logger.info("Removed %d rows with missing Cost", initial_count - len(df))

    critical = ['Distance_miles', 'Weight_kg', 'origin_warehouse', 'Destination', 'Carrier']
    for c in critical:
        if c not in df.columns:
            continue
        miss = df[c].isna()
        if not miss.any():
            continue
        logger.info("Imputing %d missing values for %s", int(miss.sum()), c)
        for carrier in df['Carrier'].dropna().unique():
            m = (df['Carrier'] == carrier) & miss
            if not m.any():
                continue
            if pd.api.types.is_numeric_dtype(df[c]):
                val = df.loc[df['Carrier'] == carrier, c].median()
            else:
                mode = df.loc[df['Carrier'] == carrier, c].mode()
                val = mode.iloc[0] if len(mode) else "Unknown"
            df.loc[m, c] = val

    for c in ['origin_warehouse', 'Destination', 'Carrier']:
        if c in df.columns:
            df[c] = df[c].fillna("Unknown")
    return df

def detect_and_handle_dhl_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Cap extreme DHL costs (3*IQR above Q3)."""
    df = df.copy()
    dhl_mask = (df['Carrier'] == 'DHL')
    if not dhl_mask.any():
        return df
    dhl_df = df.loc[dhl_mask]
    Q1 = dhl_df['Cost'].quantile(0.25)
    Q3 = dhl_df['Cost'].quantile(0.75)
    IQR = Q3 - Q1
    upper = Q3 + 3 * IQR
    extreme = dhl_df[dhl_df['Cost'] > upper]
    if len(extreme):
        logger.info("Capping %d DHL extreme costs at %.2f", len(extreme), upper)
        df.loc[dhl_mask & (df['Cost'] > upper), 'Cost'] = upper
    return df

def analyze_dhl_data_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Log data quality stats for DHL subset."""
    dhl_df = df[df['Carrier'] == 'DHL'].copy()
    logger.info("DHL samples: %d", len(dhl_df))
    for col in ['Distance_miles', 'Weight_kg', 'origin_warehouse', 'Destination']:
        if col in dhl_df.columns:
            miss = int(dhl_df[col].isna().sum())
            pct = (miss / len(dhl_df)) if len(dhl_df) else 0.0
            logger.info("Missing %s: %d (%.1f%%)", col, miss, pct * 100)
    if len(dhl_df):
        logger.info(
            "Distance %.0f-%.0f | Weight %.1f-%.1f | Cost %.0f-%.0f",
            dhl_df['Distance_miles'].min(), dhl_df['Distance_miles'].max(),
            dhl_df['Weight_kg'].min(), dhl_df['Weight_kg'].max(),
            dhl_df['Cost'].min(), dhl_df['Cost'].max()
        )
    return dhl_df

def feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    """Light, leakage-free features."""
    df = df.copy()
    if 'shipment_date' in df.columns:
        dt = pd.to_datetime(df['shipment_date'], errors='coerce')
        df['month'] = dt.dt.month.fillna(0).astype(int)
        df['day_of_week'] = dt.dt.dayofweek.fillna(0).astype(int)
        df['day_of_month'] = dt.dt.day.fillna(0).astype(int)
        df['is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
    else:
        df['month'] = 0
        df['day_of_week'] = 0
        df['day_of_month'] = 0
        df['is_weekend'] = 0
    df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)
    df['route'] = df["origin_warehouse"].astype(str) + "→" + df["Destination"].astype(str)
    df['is_dhl'] = (df['Carrier'] == 'DHL').astype(int)
    df['is_usps'] = (df['Carrier'] == 'USPS').astype(int)
    df['is_ups'] = (df['Carrier'] == 'UPS').astype(int)
    df['is_fedex'] = (df['Carrier'] == 'FedEx').astype(int)
    df['dhl_distance'] = df['is_dhl'] * df['Distance_miles']
    df['usps_distance'] = df['is_usps'] * df['Distance_miles']
    df['dhl_weight'] = df['is_dhl'] * df['Weight_kg']
    df['usps_weight'] = df['is_usps'] * df['Weight_kg']
    df['dhl_weekend'] = ((df['Carrier'] == 'DHL') & (df['is_weekend'] == 1)).astype(int)
    df['dhl_holiday'] = ((df['Carrier'] == 'DHL') & (df['is_holiday_season'] == 1)).astype(int)
    return df

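def enhanced_dhl_features(df: pd.DataFrame) -> pd.DataFrame:
    """Coarse DHL distance and weight tiers (0 for non-DHL rows).

    Values outside the bin edges, including exactly 0, also map to tier 0.
    """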
    df = df.copy()
    dhl = (df['Carrier'] == 'DHL')
    df['dhl_distance_tier'] = 0
    if dhl.any():
        df.loc[dhl, 'dhl_distance_tier'] = pd.cut(
            df.loc[dhl, 'Distance_miles'], bins=[0,500,1000,1500,2000,2500,3000], labels=False
        ).astype('float').fillna(0).astype(int)
    df['dhl_weight_tier'] = 0
    if dhl.any():
        df.loc[dhl, 'dhl_weight_tier'] = pd.cut(
            df.loc[dhl, 'Weight_kg'], bins=[0,10,20,30,40,50,100], labels=False
        ).astype('float').fillna(0).astype(int)
    return df

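def enhanced_dhl_distance_features(df: pd.DataFrame) -> pd.DataFrame:
    """Finer DHL distance tiers, a >2000-mile flag, and squared distance.

    Requires the `is_dhl` column added by feature_engineering().
    """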
    df = df.copy()
    dhl = (df['Carrier'] == 'DHL')
    df['dhl_distance_tier_detailed'] = 0
    if dhl.any():
        df.loc[dhl, 'dhl_distance_tier_detailed'] = pd.cut(
            df.loc[dhl, 'Distance_miles'], bins=[0,300,600,900,1200,1500,1800,2100,2400,3000], labels=False
        ).astype('float').fillna(0).astype(int)
    df['dhl_very_long_distance'] = (dhl & (df['Distance_miles'] > 2000)).astype(int)
    df['dhl_distance_squared'] = df['is_dhl'] * (df['Distance_miles'] ** 2)
    return df

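def enhanced_dhl_weight_features(df: pd.DataFrame) -> pd.DataFrame:
    """Finer DHL weight tiers, heavy-package flags, and a weight x distance term.

    Requires the `is_dhl` column added by feature_engineering().
    """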
    df = df.copy()
    dhl = (df['Carrier'] == 'DHL')
    df['dhl_weight_tier_detailed'] = 0
    if dhl.any():
        df.loc[dhl, 'dhl_weight_tier_detailed'] = pd.cut(
            df.loc[dhl, 'Weight_kg'], bins=[0,5,10,15,20,25,30,35,40,50,100], labels=False
        ).astype('float').fillna(0).astype(int)
    df['dhl_heavy_package'] = (dhl & (df['Weight_kg'] > 30)).astype(int)
    df['dhl_very_heavy_package'] = (dhl & (df['Weight_kg'] > 35)).astype(int)
    df['dhl_weight_distance'] = df['is_dhl'] * df['Weight_kg'] * df['Distance_miles']
    return df

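def add_problem_route_features(df: pd.DataFrame) -> pd.DataFrame:
    """Flag hard-coded problem routes and long-distance DHL shipments.

    Requires `route` and `is_dhl` from feature_engineering(). Note that
    `is_long_distance_dhl` and `dhl_long_distance` hold the same values.
    """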
    df = df.copy()
    problem_routes = [
        'Warehouse_CHI→Houston',
        'Warehouse_HOU→Chicago',
        'Warehouse_MIA→San Francisco',
        'Warehouse_ATL→Denver',
        'Warehouse_NYC→Phoenix'
    ]
    df['is_problem_route'] = df['route'].isin(problem_routes).astype(int)
    df['is_long_distance_dhl'] = ((df['Carrier'] == 'DHL') & (df['Distance_miles'] > 1900)).astype(int)
    df['dhl_long_distance'] = df['is_dhl'] * (df['Distance_miles'] > 1900).astype(int)
    df['problem_route_distance'] = df['is_problem_route'] * df['Distance_miles']
    return df

def enhanced_days_features(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder to mirror earlier structure (no leakage here)."""
    return df.copy()
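
The tier helpers above all pass `pd.cut(..., labels=False)` through `.astype('float').fillna(0).astype(int)`. A small sketch of what that chain does at the bin edges, using the coarse distance edges from `enhanced_dhl_features`; the `tier` function and the open-ended variant are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def tier(values, edges):
    """Same cut -> float -> fillna(0) -> int chain used by the DHL tier helpers."""
    cut = pd.cut(pd.Series(values, dtype="float"), bins=edges, labels=False)
    return cut.astype("float").fillna(0).astype(int).tolist()

edges = [0, 500, 1000, 1500, 2000, 2500, 3000]
miles = [0, 250, 500, 501, 2999, 3500]

# Bins are right-closed, so 0 and 3500 fall outside every bin, become NaN,
# and are then filled with 0, the same code as the first real bin.
as_written = tier(miles, edges)
print(as_written)   # [0, 0, 0, 1, 5, 0]

# Assumption for illustration only: an open-ended last bin keeps very long
# shipments in their own tier instead of folding them into tier 0.
open_ended = tier(miles, edges + [np.inf])
print(open_ended)   # [0, 0, 0, 1, 5, 6]
```

Whether out-of-range shipments should share tier 0 is a modeling choice; the sketch only makes the current behavior visible.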


## Calibration, analysis & prediction helpers

In [10]:

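def calculate_dynamic_calibration(df_context, y_true, y_pred, target_type='cost'):
    """Per-carrier median ratio (cost) or mean error (days).

    y_true and y_pred are NumPy arrays on the original scale, row-aligned with
    df_context. Carriers with fewer than 5 rows get no entry.
    """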
    calibration = {}
    carriers = pd.Series(df_context['Carrier']).fillna("Unknown").values
    for carrier in pd.unique(carriers):
        m = (carriers == carrier)
        if m.sum() < 5:
            continue
        a = y_true[m]; p = y_pred[m]
        if target_type == 'cost':
            with np.errstate(divide='ignore', invalid='ignore'):
                ratios = a / p
                ratios = ratios[np.isfinite(ratios)]
            if len(ratios) > 0:
                calibration[carrier] = {'multiplier': float(np.median(ratios))}
        else:
            calibration[carrier] = {'additive': float(np.mean(a - p))}
    return calibration

def analyze_dhl_errors(df_val, y_true, y_pred, target_type='cost'):
    dhl_mask = (df_val['Carrier'] == 'DHL').to_numpy()
    if not dhl_mask.any():
        logger.info("No DHL samples in validation for analysis")
        return
    err = np.abs(y_true[dhl_mask] - y_pred[dhl_mask])
    logger.info("DHL %s error | mean=%.2f, median=%.2f, max=%.2f",
                target_type, err.mean(), np.median(err), err.max())

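def handle_problematic_routes(predictions, df_context):
    """Scale predictions by fixed, hard-coded factors for known problem routes.

    DHL rows over 1900 miles get a further 0.85 factor, which stacks with any
    route factor already applied.
    """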
    calibrated = np.array(predictions).copy()
    route_corrections = {
        'Warehouse_CHI→Houston': 0.7,
        'Warehouse_HOU→Chicago': 0.8,
        'Warehouse_MIA→San Francisco': 0.85,
        'Warehouse_ATL→Denver': 0.9,
        'Warehouse_NYC→Phoenix': 0.9
    }
    routes = df_context['route'].astype(str).values
    for route, corr in route_corrections.items():
        m = (routes == route)
        if np.any(m):
            calibrated[m] *= corr
    m_long = ((df_context['Carrier'] == 'DHL').to_numpy() & (df_context['Distance_miles'].to_numpy() > 1900))
    if np.any(m_long):
        calibrated[m_long] *= 0.85
    return calibrated

def apply_calibrated_predictions(model, X_pre, df_context, target_type, calibration_factors):
    raw = model.predict(X_pre, num_iteration=getattr(model, "best_iteration", None))
    if target_type == 'cost':
        raw = np.expm1(raw)
    calibrated = np.array(raw).copy()
    carriers = df_context['Carrier'].fillna("Unknown").values
    if calibration_factors:
        for carrier, factors in calibration_factors.items():
            m = (carriers == carrier)
            if not np.any(m):
                continue
            if target_type == 'cost':
                mult = float(factors.get('multiplier', 1.0))
                calibrated[m] *= mult
            else:
                addv = float(factors.get('additive', 0.0))
                calibrated[m] += addv
    if target_type == 'cost':
        calibrated = handle_problematic_routes(calibrated, df_context)
    return calibrated

def predict_with_dhl_specialized_models(models, X_pre, df_meta, target_type='cost'):
    """Route DHL rows to the long-distance, heavy, or normal sub-model.

    Later segments overwrite earlier ones, so a row that is both long-distance
    and heavy gets the heavy model. Non-DHL rows and DHL rows without a usable
    sub-model stay at 0.
    """
    preds = np.zeros(len(df_meta))
    filled = np.zeros(len(df_meta), dtype=bool)  # explicit mask instead of a preds == 0 sentinel
    dhl_mask = (df_meta['Carrier'] == 'DHL').to_numpy()
    segments = [
        ('dhl_long_distance', dhl_mask & (df_meta['Distance_miles'].to_numpy() > 2000)),
        ('dhl_heavy', dhl_mask & (df_meta['Weight_kg'].to_numpy() > 30)),
    ]
    for name, m in segments:
        mdl = models.get(name)
        if mdl is not None and m.any():
            preds[m] = mdl.predict(X_pre[m], num_iteration=getattr(mdl, "best_iteration", None))
            filled |= m
    mdl = models.get('dhl_normal')  # may be None if training was skipped
    m = dhl_mask & ~filled
    if mdl is not None and m.any():
        preds[m] = mdl.predict(X_pre[m], num_iteration=getattr(mdl, "best_iteration", None))
        filled |= m
    if target_type == 'cost':
        preds[filled] = np.expm1(preds[filled])
    return preds

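def predict_with_carrier_specific_models(models, X_pre, df_meta, target_type='cost'):
    """Predict with DHL sub-models and the shared 'normal' model.

    Rows whose carrier is neither DHL nor in the normal-carrier list stay at 0.
    """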
    preds = np.zeros(len(df_meta))
    dhl_models = {k: v for k, v in models.items() if k.startswith('dhl_')}
    if dhl_models:
        dhl_preds = predict_with_dhl_specialized_models(dhl_models, X_pre, df_meta, target_type)
        dhl_mask = (df_meta['Carrier'] == 'DHL').to_numpy()
        preds[dhl_mask] = dhl_preds[dhl_mask]
    if 'normal' in models and models['normal'] is not None:
        normal_carriers = ['UPS', 'FedEx', 'Amazon Logistics', 'LaserShip', 'OnTrac', 'USPS']
        m = (df_meta['Carrier'].isin(normal_carriers)).to_numpy() & (preds == 0)
        if np.any(m):
            yhat = models['normal'].predict(X_pre[m], num_iteration=getattr(models['normal'], "best_iteration", None))
            if target_type == 'cost':
                yhat = np.expm1(yhat)
            preds[m] = yhat
    return preds

def compute_adaptive_weights(global_preds, carrier_preds, y_cost_log, df_meta):
    """Return dict carrier -> w (weight for global); blend as w*global + (1-w)*carrier."""
    weights = {}
    carriers = df_meta['Carrier'].fillna("Unknown").values
    ytrue = np.expm1(y_cost_log.values if hasattr(y_cost_log, 'values') else y_cost_log)
    for carrier in pd.unique(carriers):
        m = (carriers == carrier)
        if m.sum() < 5:
            weights[carrier] = 0.5
            continue
        act = ytrue[m]
        g = global_preds[m]
        c = carrier_preds[m]
        rmse_g = np.sqrt(mean_squared_error(act, g))
        rmse_c = np.sqrt(mean_squared_error(act, c))
        # Inverse-RMSE weights; the epsilon keeps a perfect (zero-RMSE) model
        # from being assigned zero weight
        eps = 1e-12
        inv_g = 1.0 / (rmse_g + eps) if np.isfinite(rmse_g) else 0.0
        inv_c = 1.0 / (rmse_c + eps) if np.isfinite(rmse_c) else 0.0
        denom = inv_g + inv_c
        w = inv_g / denom if denom > 0 else 0.5
        weights[carrier] = float(np.clip(w, 0.0, 1.0))
    return weights
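
To make the two ideas in this cell concrete, here is a minimal NumPy-only sketch of the per-carrier median-ratio multiplier and the inverse-RMSE blend weight. It mirrors the logic of `calculate_dynamic_calibration` and `compute_adaptive_weights` but is a simplified re-statement: the function names and the toy numbers are made up for illustration.

```python
import numpy as np

def median_ratio_multipliers(carriers, y_true, y_pred, min_samples=5):
    """Per-carrier multiplier = median(actual / predicted), skipping tiny groups."""
    out = {}
    for carrier in np.unique(carriers):
        m = carriers == carrier
        if m.sum() < min_samples:
            continue
        with np.errstate(divide="ignore", invalid="ignore"):
            ratios = y_true[m] / y_pred[m]
        ratios = ratios[np.isfinite(ratios)]
        if ratios.size:
            out[str(carrier)] = float(np.median(ratios))
    return out

def global_weight(y_true, global_pred, carrier_pred):
    """Inverse-RMSE weight w for the global model; blend = w*global + (1-w)*carrier."""
    rmse_g = np.sqrt(np.mean((y_true - global_pred) ** 2))
    rmse_c = np.sqrt(np.mean((y_true - carrier_pred) ** 2))
    return (1.0 / rmse_g) / (1.0 / rmse_g + 1.0 / rmse_c)

# Toy data (made up): predictions are 20% too low for DHL and exact for UPS.
carriers = np.array(["DHL"] * 5 + ["UPS"] * 5)
y_true = np.array([120.0, 240.0, 60.0, 180.0, 300.0, 50.0, 70.0, 90.0, 110.0, 130.0])
y_pred = np.concatenate([y_true[:5] / 1.2, y_true[5:]])

mult = median_ratio_multipliers(carriers, y_true, y_pred)
print(mult)  # DHL close to 1.2, UPS equal to 1.0

# A global model with RMSE 10 and a carrier model with RMSE 30 give w = 0.75.
truth = np.zeros(4)
w = global_weight(truth, np.full(4, 10.0), np.full(4, 30.0))
print(w)
```

Multiplying the DHL predictions by the DHL multiplier recovers the actual costs in this toy case; on real data the median only corrects the typical bias, not the spread.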


## Models & transformers

In [11]:

class RouteHashingEncoder(BaseEstimator, TransformerMixin):
    """Hash a single 'route' column via sklearn's FeatureHasher.

    Expects a one-column DataFrame or array-like of strings; returns a dense array.
    """
    def __init__(self, n_features=256):
        self.n_features = n_features
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            strings = X.iloc[:, 0].astype(str).tolist()
        else:
            strings = pd.Series(np.asarray(X).ravel()).astype(str).tolist()
        samples = [[s] for s in strings]  # iterable of iterables of strings
        # Build the stateless hasher here so set_params(n_features=...) takes effect
        hasher = FeatureHasher(n_features=self.n_features, input_type='string')
        return hasher.transform(samples).toarray()

def make_ohe():
    """Version-tolerant OneHotEncoder: `sparse_output` exists from sklearn 1.2; older releases use `sparse`."""
    try:
        return OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    except TypeError:
        return OneHotEncoder(handle_unknown='ignore', sparse=False)

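def cv_lightgbm(X, y, params, scoring, cv=5):
    """K-fold RMSE for LightGBM; X and y must be NumPy arrays.

    scoring='rmse_expm1' scores on the original scale for log1p targets. Each
    fold is also used for early stopping, so the scores lean optimistic.
    """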
    scores = []
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    for tr, va in kf.split(X):
        Xtr, Xva = X[tr], X[va]
        ytr, yva = y[tr], y[va]
        dtr = lgb.Dataset(Xtr, label=ytr)
        dva = lgb.Dataset(Xva, label=yva)
        model = lgb.train(
            params, dtr, valid_sets=[dva],
            num_boost_round=1000,
            callbacks=[lgb.early_stopping(stopping_rounds=50)]
        )
        if scoring == 'rmse_expm1':
            yhat = np.expm1(model.predict(Xva, num_iteration=getattr(model, "best_iteration", None)))
            score = np.sqrt(mean_squared_error(np.expm1(yva), yhat))
        else:
            yhat = model.predict(Xva, num_iteration=getattr(model, "best_iteration", None))
            score = np.sqrt(mean_squared_error(yva, yhat))
        scores.append(score)
    return np.array(scores)

def plot_carrier_rmse(carrier_rmse: pd.DataFrame, output_dir: str):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    df1 = carrier_rmse.sort_values('RMSE_Cost_Calibrated')
    ax1.bar(df1['Carrier'], df1['RMSE_Cost_Calibrated'], alpha=0.7)
    ax1.set_title('Cost RMSE (Calibrated) by Carrier')
    ax1.set_ylabel('RMSE ($)')
    ax1.tick_params(axis='x', rotation=45)
    ax1.grid(axis='y', alpha=0.3)
    df2 = carrier_rmse.sort_values('RMSE_Transit_Days')
    ax2.bar(df2['Carrier'], df2['RMSE_Transit_Days'], alpha=0.7)
    ax2.set_title('Transit Days RMSE by Carrier')
    ax2.set_ylabel('RMSE (Days)')
    ax2.tick_params(axis='x', rotation=45)
    ax2.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    os.makedirs(output_dir, exist_ok=True)
    plt.savefig(os.path.join(output_dir, "carrier_rmse_analysis.png"), dpi=300, bbox_inches="tight")
    plt.close()

def train_advanced_dhl_model(X_train, y_train):
    if len(X_train) < 30:
        logger.warning("Insufficient DHL data for advanced training")
        return None
    # 'min_child_samples' is an alias of 'min_data_in_leaf' in LightGBM; passing
    # both is ambiguous, so only 'min_data_in_leaf' is kept here.
    params = {
        'objective': 'regression', 'metric': 'rmse',
        'learning_rate': 0.005, 'num_leaves': 127, 'max_depth': 12,
        'min_data_in_leaf': 10, 'feature_fraction': 0.6,
        'bagging_fraction': 0.6, 'bagging_freq': 5,
        'lambda_l1': 0.5, 'lambda_l2': 0.5, 'verbose': -1
    }
    dtrain = lgb.Dataset(X_train, label=y_train)
    model = lgb.train(params, dtrain, num_boost_round=1000)
    return model

def train_dhl_specialized_models(X_train, y_train, df_train):
    models = {}
    y_np = y_train.values if hasattr(y_train, 'values') else y_train
    dhl = (df_train['Carrier'] == 'DHL').to_numpy()
    long_m = dhl & (df_train['Distance_miles'].to_numpy() > 2000)
    if long_m.sum() > 10:
        idx = np.where(long_m)[0]
        dtr = lgb.Dataset(X_train[idx], label=y_np[idx])
        models['dhl_long_distance'] = lgb.train({
            'objective': 'regression', 'metric': 'rmse', 'learning_rate': 0.002,
            'num_leaves': 15, 'max_depth': 4, 'min_child_samples': 5, 'feature_fraction': 0.8,
            'lambda_l1': 1.0, 'lambda_l2': 1.0, 'verbose': -1
        }, dtr, num_boost_round=500)
    heavy_m = dhl & (df_train['Weight_kg'].to_numpy() > 30)
    if heavy_m.sum() > 5:
        idx = np.where(heavy_m)[0]
        dtr = lgb.Dataset(X_train[idx], label=y_np[idx])
        models['dhl_heavy'] = lgb.train({
            'objective': 'regression', 'metric': 'rmse', 'learning_rate': 0.005,
            'num_leaves': 20, 'max_depth': 5, 'min_child_samples': 3, 'feature_fraction': 0.7,
            'lambda_l1': 0.8, 'lambda_l2': 0.8, 'verbose': -1
        }, dtr, num_boost_round=300)
    normal_m = dhl & (~long_m) & (~heavy_m)
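    if normal_m.sum() > 20:
        idx = np.where(normal_m)[0]
        normal_model = train_advanced_dhl_model(X_train[idx], y_np[idx])
        if normal_model is not None:  # the helper returns None below 30 samples
            models['dhl_normal'] = normal_model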
    return models

def train_carrier_specific_models(X_train, y_train, X_val, y_val, df_train, df_val, target_type='cost'):
    models = {}
    if target_type == 'cost':
        models.update(train_dhl_specialized_models(X_train, y_train, df_train))
    normal_carriers = ['UPS', 'FedEx', 'Amazon Logistics', 'LaserShip', 'OnTrac', 'USPS']
    norm = df_train['Carrier'].isin(normal_carriers).to_numpy()
    if norm.any():
        Xn = X_train[norm]
        yn = (y_train.values if hasattr(y_train, 'values') else y_train)[norm]
        vn = df_val['Carrier'].isin(normal_carriers).to_numpy()
        Xvn = X_val[vn]
        yvn = (y_val.values if hasattr(y_val, 'values') else y_val)[vn]
        params = {
            'objective': 'regression', 'metric': 'rmse', 'learning_rate': 0.05, 'num_leaves': 31,
            'max_depth': 8, 'min_child_samples': 10, 'feature_fraction': 0.8, 'verbose': -1
        }
        dtr = lgb.Dataset(Xn, label=yn)
        if len(Xvn) > 0:
            dva = lgb.Dataset(Xvn, label=yvn)
            models['normal'] = lgb.train(params, dtr, valid_sets=[dva],
                                         num_boost_round=1000,
                                         callbacks=[lgb.early_stopping(stopping_rounds=30)])
        else:
            models['normal'] = lgb.train(params, dtr, num_boost_round=1000)
    return models
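
`RouteHashingEncoder` relies on the hashing trick, so a short look at what `FeatureHasher` returns for route strings may help. This sketch uses 8 output columns purely for readability (the notebook uses 256) and reuses two route names from the problem-route list.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

routes = ["Warehouse_CHI→Houston", "Warehouse_NYC→Phoenix", "Warehouse_CHI→Houston"]

# Small n_features so the rows are easy to read; the notebook uses 256.
hasher = FeatureHasher(n_features=8, input_type="string")
# Each sample is a list holding one string token, as in RouteHashingEncoder.
dense = hasher.transform([[r] for r in routes]).toarray()

print(dense.shape)                          # (3, 8)
print(np.abs(dense).sum(axis=1))            # one +1 or -1 entry per row
print(np.array_equal(dense[0], dense[2]))   # identical routes hash identically
```

No vocabulary is stored, so unseen routes are handled at prediction time, at the price that two different routes can land in the same column. A larger `n_features` makes such collisions less likely.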


## Data loading (MySQL with CSV fallback)

In [12]:

def load_dataset() -> pd.DataFrame:
    """Try MySQL first; fall back to CSV via LOGISTICS_CSV_PATH; else raise."""
    # MySQL
    try:
        from sqlalchemy.engine import URL  # escapes special characters in credentials
        url = URL.create(
            "mysql+mysqlconnector",
            username=os.getenv('MYSQL_USER', 'root'),
            password=os.getenv('MYSQL_PASSWORD', 'password'),
            host=os.getenv('MYSQL_HOST', 'localhost'),
            port=int(os.getenv('MYSQL_PORT', '3306')),
            database=os.getenv('MYSQL_DB', 'sys'),
        )
        engine = create_engine(url)
        df = pd.read_sql("SELECT * FROM us_logistics;", engine)
        if not df.empty:
            logger.info("Loaded %s rows from MySQL us_logistics", len(df))
            return df
    except Exception as e:
        logger.warning("MySQL read failed: %s", e)
    # CSV
    path = os.getenv("LOGISTICS_CSV_PATH", "logistics_shipments_dataset.csv")
    if os.path.exists(path):
        df = pd.read_csv(path)
        logger.info("Loaded %s rows from CSV %s", len(df), path)
        return df
    raise RuntimeError("Could not load data from MySQL or CSV. Set DB env vars or LOGISTICS_CSV_PATH.")
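
When a database URL is assembled as a plain string, characters such as `@`, `:` or `/` in the password change how the URL is parsed. A minimal sketch of percent-encoding the credentials with the standard library; `build_mysql_url` and the sample credentials are hypothetical and only illustrate the escaping step.

```python
from urllib.parse import quote_plus

def build_mysql_url(user, password, host, port, database,
                    driver="mysql+mysqlconnector"):
    """Hypothetical helper: percent-encode credentials before string assembly."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )

# Made-up credentials containing characters that would otherwise break parsing.
url = build_mysql_url("etl_user", "p@ss:w/rd", "localhost", 3306, "logistics")
print(url)
# mysql+mysqlconnector://etl_user:p%40ss%3Aw%2Frd@localhost:3306/logistics
```

The resulting string can be passed to `create_engine` as usual.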


## Main training pipeline

In [13]:

def main():
    try:
        logger.info("=== START TRAINING PIPELINE ===")
        df = load_dataset()

        # Missing-data handling
        df = handle_missing_data_smart(df)
        # Global IQR trim for Cost
        if "Cost" in df.columns:
            Q1 = df["Cost"].quantile(0.25)
            Q3 = df["Cost"].quantile(0.75)
            IQR = Q3 - Q1
            lb, ub = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
            before = len(df)
            df = df[(df["Cost"] >= lb) & (df["Cost"] <= ub)]
            logger.info("Removed %d global cost outliers", before - len(df))

        df = detect_and_handle_dhl_outliers(df)
        df = feature_engineering(df)
        df = enhanced_dhl_features(df)
        df = enhanced_dhl_distance_features(df)
        df = enhanced_dhl_weight_features(df)
        df = add_problem_route_features(df)
        df = enhanced_days_features(df)

        df['Cost_log'] = np.log1p(df['Cost'])

        # Split
        carrier_counts = df['Carrier'].value_counts()
        stratify_col = df['Carrier'] if (len(carrier_counts) > 1 and carrier_counts.min() > 1) else None

        base_num_cols = [
            'Distance_miles','Weight_kg','month','day_of_week','day_of_month',
            'is_weekend','is_holiday_season',
            'is_dhl','is_usps','is_ups','is_fedex',
            'dhl_distance','usps_distance','dhl_weight','usps_weight',
            'is_problem_route','is_long_distance_dhl','dhl_long_distance','problem_route_distance',
            'dhl_distance_tier','dhl_weight_tier','dhl_weekend','dhl_holiday',
            'dhl_distance_tier_detailed','dhl_very_long_distance','dhl_distance_squared',
            'dhl_weight_tier_detailed','dhl_heavy_package','dhl_very_heavy_package','dhl_weight_distance'
        ]
        base_num_cols = [c for c in base_num_cols if c in df.columns]
        cat_cols_no_route = [c for c in ['Carrier','origin_warehouse','Destination'] if c in df.columns]

        cols_for_split = ['Carrier','origin_warehouse','Destination','route'] + base_num_cols + ['Cost_log']
        if 'Transit_Days' in df.columns: cols_for_split.append('Transit_Days')
        cols_for_split = list(dict.fromkeys([c for c in cols_for_split if c in df.columns]))

        df_split = df[cols_for_split].copy()
        y_cost = df_split['Cost_log']
        y_days = df_split['Transit_Days'] if 'Transit_Days' in df_split.columns else pd.Series(0, index=df_split.index)

        X_train, X_val, y_cost_train, y_cost_val, y_days_train, y_days_val = train_test_split(
            df_split.drop(columns=['Cost_log','Transit_Days'], errors='ignore'),
            y_cost, y_days, test_size=0.2, random_state=42, stratify=stratify_col
        )
        df_train = df.loc[X_train.index].copy()
        df_val = df.loc[X_val.index].copy()

        # Post-split aggregates (train-only)
        cr_counts = df_train.groupby(['Carrier','route'], observed=False).size().rename('carrier_route_count')
        def add_train_only_counts(dframe):
            idx = dframe.set_index(['Carrier','route']).index
            vals = idx.map(cr_counts).astype('float').fillna(0.0).values
            return vals
        X_train['carrier_route_count'] = add_train_only_counts(X_train)
        X_val['carrier_route_count'] = add_train_only_counts(X_val)

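        def add_days_aggregates(d_target):
            # Means come from df_train only, so validation rows never see their own
            # target; training rows do contribute to their own group mean.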
            if 'Transit_Days' not in df_train.columns:
                for c in ['carrier_avg_days','route_avg_days','carrier_route_avg_days']:
                    d_target[c] = 0.0
                return d_target
            carrier_map = df_train.groupby('Carrier', observed=False)['Transit_Days'].mean()
            route_map = df_train.groupby('route', observed=False)['Transit_Days'].mean()
            cr_map = df_train.groupby(['Carrier','route'], observed=False)['Transit_Days'].mean()
            gl = df_train['Transit_Days'].mean()
            d_target['carrier_avg_days'] = d_target['Carrier'].map(carrier_map)
            d_target['route_avg_days'] = d_target['route'].map(route_map)
            d_target['carrier_route_avg_days'] = d_target.set_index(['Carrier','route']).index.map(cr_map)
            for c in ['carrier_avg_days','route_avg_days','carrier_route_avg_days']:
                d_target[c] = d_target[c].astype('float').fillna(gl)
            return d_target
        X_train_days = add_days_aggregates(X_train.copy())
        X_val_days = add_days_aggregates(X_val.copy())

        analyze_dhl_data_quality(df)

        # Preprocessors
        num_cols_cost = base_num_cols + ['carrier_route_count']
        cat_cols_cost = cat_cols_no_route
        num_cols_days = base_num_cols + ['carrier_route_count','carrier_avg_days','route_avg_days','carrier_route_avg_days']
        cat_cols_days = cat_cols_no_route

        num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
        cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')), ('onehot', make_ohe())])

        preprocessor_cost = ColumnTransformer(
            transformers=[
                ('num', num_pipe, num_cols_cost),
                ('cat', cat_pipe, cat_cols_cost),
                ('route_hash', RouteHashingEncoder(n_features=256), ['route'])
            ], remainder='drop'
        )
        preprocessor_days = ColumnTransformer(
            transformers=[
                ('num', Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), num_cols_days),
                ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')), ('onehot', make_ohe())]), cat_cols_days),
                ('route_hash', RouteHashingEncoder(n_features=256), ['route'])
            ], remainder='drop'
        )

        # Dtypes
        for dfX in (X_train, X_val, X_train_days, X_val_days):
            for c in ['Distance_miles','Weight_kg']:
                if c in dfX.columns: dfX[c] = dfX[c].astype('float')
            for c in cat_cols_no_route + ['route']:
                if c in dfX.columns: dfX[c] = dfX[c].fillna("Unknown").astype(str)

        # Fit/transform
        X_train_pre_cost = preprocessor_cost.fit_transform(X_train)
        X_val_pre_cost = preprocessor_cost.transform(X_val)
        X_train_pre_days = preprocessor_days.fit_transform(X_train_days)
        X_val_pre_days = preprocessor_days.transform(X_val_days)

        # Numpy
        def to_dense(a):
            return a.toarray() if hasattr(a, "toarray") else np.asarray(a)

        Xtr_c, Xva_c = to_dense(X_train_pre_cost), to_dense(X_val_pre_cost)
        ytr_c, yva_c = np.asarray(y_cost_train), np.asarray(y_cost_val)

        Xtr_d, Xva_d = to_dense(X_train_pre_days), to_dense(X_val_pre_days)
        ytr_d, yva_d = np.asarray(y_days_train), np.asarray(y_days_val)

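        # Optuna tuning. Early stopping, trial scoring, and the later calibration
        # all use the same validation split, so the best RMSE logged here is optimistic.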
        def get_common_params(trial):
            return {
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
                'num_leaves': trial.suggest_int('num_leaves', 20, 200),
                'max_depth': trial.suggest_int('max_depth', 5, 20),
                'min_child_samples': trial.suggest_int('min_child_samples', 5, 50),
                'feature_fraction': trial.suggest_float('feature_fraction', 0.6, 1.0),
                'bagging_fraction': trial.suggest_float('bagging_fraction', 0.6, 1.0),
                'bagging_freq': trial.suggest_int('bagging_freq', 1, 10),
                'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
                'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
            }

        def objective_cost(trial):
            # Optuna objective: validation RMSE in original cost units (the target is log-transformed)
            params = get_common_params(trial)
            params.update({'objective': 'regression', 'metric': 'rmse', 'verbose': -1, 'random_state': 42})
            dtr = lgb.Dataset(Xtr_c, label=ytr_c)
            dva = lgb.Dataset(Xva_c, label=yva_c)
            mdl = lgb.train(params, dtr, valid_sets=[dva], num_boost_round=1000,
                            callbacks=[lgb.log_evaluation(0), lgb.early_stopping(stopping_rounds=50)])
            # Early stopping monitors RMSE in log space; the returned score is back-transformed with expm1
            yhat = np.expm1(mdl.predict(Xva_c, num_iteration=getattr(mdl, "best_iteration", None)))
            rmse = np.sqrt(mean_squared_error(np.expm1(yva_c), yhat))
            return rmse

        def objective_days(trial):
            # Optuna objective: validation RMSE of transit days (Poisson objective, RMSE metric)
            params = get_common_params(trial)
            params.update({'objective': 'poisson', 'metric': 'rmse', 'verbose': -1, 'random_state': 42})
            dtr = lgb.Dataset(Xtr_d, label=ytr_d)
            dva = lgb.Dataset(Xva_d, label=yva_d)
            mdl = lgb.train(params, dtr, valid_sets=[dva], num_boost_round=1000,
                            callbacks=[lgb.log_evaluation(0), lgb.early_stopping(stopping_rounds=50)])
            yhat = mdl.predict(Xva_d, num_iteration=getattr(mdl, "best_iteration", None))
            rmse = np.sqrt(mean_squared_error(yva_d, yhat))
            return rmse

        logger.info("Tuning with Optuna")
        study_cost = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
        study_cost.optimize(objective_cost, n_trials=30, show_progress_bar=False)
        study_days = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
        study_days.optimize(objective_days, n_trials=30, show_progress_bar=False)
        logger.info("Best RMSE (cost)=%.4f | (days)=%.4f", study_cost.best_value, study_days.best_value)

        best_params_cost = {**study_cost.best_params, 'objective':'regression','metric':'rmse','verbose':-1,'random_state':42}
        best_params_days = {**study_days.best_params, 'objective':'poisson','metric':'rmse','verbose':-1,'random_state':42}

        # Final global models
        mdl_cost = lgb.train(best_params_cost, lgb.Dataset(Xtr_c, label=ytr_c),
                             valid_sets=[lgb.Dataset(Xva_c, label=yva_c)],
                             num_boost_round=1000, callbacks=[lgb.log_evaluation(100), lgb.early_stopping(stopping_rounds=50)])
        mdl_days = lgb.train(best_params_days, lgb.Dataset(Xtr_d, label=ytr_d),
                             valid_sets=[lgb.Dataset(Xva_d, label=yva_d)],
                             num_boost_round=1000, callbacks=[lgb.log_evaluation(100), lgb.early_stopping(stopping_rounds=50)])

        # Carrier-specific cost models
        logger.info("Training carrier-specific models")
        cost_models = train_carrier_specific_models(Xtr_c, ytr_c, Xva_c, yva_c, df_train, df_val, 'cost')

        # Calibration
        logger.info("Calculating calibration factors")
        val_cost_raw = mdl_cost.predict(Xva_c, num_iteration=getattr(mdl_cost,"best_iteration",None))
        val_cost_exp = np.expm1(val_cost_raw)
        cost_cal = calculate_dynamic_calibration(df_val, np.expm1(yva_c), val_cost_exp, 'cost')
        val_days = mdl_days.predict(Xva_d, num_iteration=getattr(mdl_days,"best_iteration",None))
        days_cal = calculate_dynamic_calibration(df_val, yva_d, val_days, 'days')

        # Error analysis
        analyze_dhl_errors(df_val, np.expm1(yva_c), val_cost_exp, 'cost')

        # Adaptive weights (for eval/reporting)
        global_val = val_cost_exp  # reuse the back-transformed global predictions computed for calibration
        carrier_val = predict_with_carrier_specific_models(cost_models, Xva_c, df_val, 'cost')
        # A carrier-specific prediction of 0 is replaced by the global prediction
        carrier_val = np.where(carrier_val == 0, global_val, carrier_val)
        adaptive_w = compute_adaptive_weights(global_val, carrier_val, y_cost_val, df_val)

        # CV
        cv_cost = cv_lightgbm(Xtr_c, ytr_c, best_params_cost, 'rmse_expm1', cv=5)
        cv_days = cv_lightgbm(Xtr_d, ytr_d, best_params_days, 'rmse', cv=5)
        logger.info("CV RMSE Cost: %.4f ± %.4f | Days: %.4f ± %.4f", cv_cost.mean(), cv_cost.std(), cv_days.mean(), cv_days.std())

        # Per-carrier evaluation
        def evaluate_per_carrier(Xc, Xd, ylog, ydays, meta_df, mdl_c, mdl_d, group_models, cal_cost, cal_days, w_adapt):
            """Per-carrier RMSE for calibrated cost, adaptive-blend cost, and transit days."""
            carriers = sorted(meta_df['Carrier'].dropna().unique())
            cost_global = apply_calibrated_predictions(mdl_c, Xc, meta_df, 'cost', cal_cost)
            days_global = apply_calibrated_predictions(mdl_d, Xd, meta_df, 'days', cal_days)
            # Adaptive blend of global and carrier-specific cost predictions (original cost units)
            cost_adapt = np.zeros(len(meta_df))
            gpred = np.expm1(mdl_c.predict(Xc, num_iteration=getattr(mdl_c, "best_iteration", None)))
            cpred = predict_with_carrier_specific_models(group_models, Xc, meta_df, 'cost')
            # Checked once: the set of model keys does not change from row to row
            has_dhl_models = any(k.startswith('dhl_') for k in group_models.keys())
            for i, car in enumerate(meta_df['Carrier'].values):
                if car == 'DHL' and has_dhl_models:
                    # DHL rows take the carrier-specific prediction unblended
                    cost_adapt[i] = cpred[i]
                else:
                    w = w_adapt.get(car, 0.5)  # equal blend when no weight is stored for the carrier
                    c = cpred[i] if cpred[i] != 0 else gpred[i]  # fall back to global when carrier prediction is 0
                    cost_adapt[i] = w * gpred[i] + (1 - w) * c
            rows = []
            for car in carriers:
                m = (meta_df['Carrier'].values == car)
                if not np.any(m):
                    continue
                # Cost RMSEs are reported in original units, hence expm1 on the log target
                rmse_cg = float(np.sqrt(mean_squared_error(np.expm1(ylog.values[m]), cost_global[m])))
                rmse_ca = float(np.sqrt(mean_squared_error(np.expm1(ylog.values[m]), cost_adapt[m])))
                rmse_d  = float(np.sqrt(mean_squared_error(ydays.values[m], days_global[m])))
                rows.append({"Carrier": car, "Samples": int(m.sum()),
                             "RMSE_Cost_Calibrated": rmse_cg,
                             "RMSE_Cost_Adaptive": rmse_ca,
                             "RMSE_Transit_Days": rmse_d})
            return pd.DataFrame(rows)

        carrier_rmse = evaluate_per_carrier(
            Xva_c, Xva_d, y_cost_val, y_days_val, df_val,
            mdl_cost, mdl_days, cost_models, cost_cal, days_cal, adaptive_w
        )

        # Save artifacts
        out = "model_artifacts"
        os.makedirs(out, exist_ok=True)
        joblib.dump(mdl_cost, os.path.join(out, "cost_model.pkl"))
        joblib.dump(mdl_days, os.path.join(out, "transit_days_model.pkl"))
        joblib.dump(preprocessor_cost, os.path.join(out, "preprocessor_cost.pkl"))
        joblib.dump(preprocessor_days, os.path.join(out, "preprocessor_days.pkl"))
        joblib.dump(cost_models, os.path.join(out, "cost_models_grouped.pkl"))
        # Generic name for the Streamlit sidebar (same object as preprocessor_cost.pkl);
        # cost_model.pkl is already written above, so it is not dumped a second time
        joblib.dump(preprocessor_cost, os.path.join(out, "preprocessor.pkl"))

        # best_params.json (features + calibration + metrics)
        best = {
            "cost_params": best_params_cost,
            "days_params": best_params_days,
            "cost_features": {"numerical": list(num_cols_cost), "categorical": list(cat_cols_cost), "route_hash_features": 256},
            "days_features": {"numerical": list(num_cols_days), "categorical": list(cat_cols_days), "route_hash_features": 256},
            "cv_scores": {"cost_mean": float(cv_cost.mean()), "cost_std": float(cv_cost.std()),
                          "days_mean": float(cv_days.mean()), "days_std": float(cv_days.std())},
            "calibration": {"cost": cost_cal, "days": days_cal}
        }
        # validation RMSE summary
        cost_val_cal = apply_calibrated_predictions(mdl_cost, Xva_c, df_val, 'cost', cost_cal)
        cost_val_adapt = np.zeros(len(df_val))
        gpred = val_cost_exp  # global predictions in cost units, already computed for calibration
        cpred = predict_with_carrier_specific_models(cost_models, Xva_c, df_val, 'cost')
        # Checked once: the set of model keys does not change from row to row
        has_dhl_models = any(k.startswith('dhl_') for k in cost_models.keys())
        for i, car in enumerate(df_val['Carrier'].values):
            if car == 'DHL' and has_dhl_models:
                cost_val_adapt[i] = cpred[i]
            else:
                w = adaptive_w.get(car, 0.5)
                c = cpred[i] if cpred[i] != 0 else gpred[i]
                cost_val_adapt[i] = w * gpred[i] + (1 - w) * c
        rmse_cost_cal = float(np.sqrt(mean_squared_error(np.expm1(yva_c), cost_val_cal)))
        rmse_cost_adp = float(np.sqrt(mean_squared_error(np.expm1(yva_c), cost_val_adapt)))
        days_val_cal = apply_calibrated_predictions(mdl_days, Xva_d, df_val, 'days', days_cal)
        rmse_days_cal = float(np.sqrt(mean_squared_error(yva_d, days_val_cal)))
        best["validation_rmse"] = {"cost_calibrated": rmse_cost_cal, "cost_adaptive": rmse_cost_adp, "days_calibrated": rmse_days_cal}

        with open(os.path.join(out, "best_params.json"), "w") as f:
            json.dump(best, f, indent=2)

        carrier_rmse.to_csv(os.path.join(out, "carrier_performance.csv"), index=False)
        plot_carrier_rmse(carrier_rmse, out)

        logger.info("Artifacts saved to %s", out)
        logger.info("=== TRAINING COMPLETED SUCCESSFULLY ===")

        return {
            "artifacts_dir": out,
            "best_params_json": os.path.join(out, "best_params.json"),
            "carrier_performance_csv": os.path.join(out, "carrier_performance.csv"),
            "rmse_plot_png": os.path.join(out, "carrier_rmse_analysis.png"),
            "when": datetime.now(UTC).isoformat().replace("+00:00", "Z")
        }

    except Exception:
        # logger.exception records the full traceback; re-raise so the caller sees the failure
        logger.exception("Error during pipeline execution")
        raise
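
The adaptive blend that appears in `evaluate_per_carrier` and again in the validation summary can be isolated into a small helper. The sketch below is a dependency-free illustration rather than the notebook's implementation: `blend_cost_predictions`, its arguments, and the sample carriers and numbers are hypothetical. It assumes, as the loops above do, that a carrier-specific prediction of 0 means "fall back to the global prediction" and that the weight `w` applies to the global prediction. Which carriers bypass blending is left to the caller through `direct_carriers`.

```python
def blend_cost_predictions(carriers, global_pred, carrier_pred, weights,
                           direct_carriers=frozenset(), default_weight=0.5):
    """Blend global and carrier-specific predictions row by row.

    Hypothetical helper mirroring the blending loop:
    - carriers in `direct_carriers` use the carrier-specific prediction as is;
    - a carrier-specific prediction of 0 falls back to the global prediction;
    - otherwise the result is w * global + (1 - w) * carrier.
    """
    blended = []
    for car, g, c in zip(carriers, global_pred, carrier_pred):
        if car in direct_carriers:
            blended.append(c)
            continue
        w = weights.get(car, default_weight)
        c_eff = c if c != 0 else g
        blended.append(w * g + (1 - w) * c_eff)
    return blended


# Made-up values for illustration only
example = blend_cost_predictions(
    carriers=["UPS", "FedEx", "DHL"],
    global_pred=[100.0, 80.0, 60.0],
    carrier_pred=[120.0, 0.0, 50.0],
    weights={"UPS": 0.25},
    direct_carriers={"DHL"},
)
print(example)
```

Keeping the rule in one function would let the per-carrier evaluation and the validation summary share it instead of repeating the loop.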


## Run

In [14]:

if __name__ == "__main__":
    # Execute the pipeline when running the notebook programmatically
    summary = main()
    print("Summary:", json.dumps(summary, indent=2))
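
A note on reading the output that follows: LightGBM's `valid_0's rmse` lines for the cost study (about 0.15) are measured on the log-transformed target, while the Optuna trial values for the same study (about 23 to 28) are RMSEs after the `np.expm1` back-transformation done in `objective_cost`. The sketch below shows the difference with the standard library only. The helper name `rmse` and all numbers are made up, and it assumes the target was built with `log1p`, which is what the `expm1` calls imply.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up costs in original units and hypothetical predictions
cost_true = [50.0, 120.0, 300.0]
cost_pred = [55.0, 110.0, 330.0]

# The same values in log space, as the model would see them
log_true = [math.log1p(v) for v in cost_true]
log_pred = [math.log1p(v) for v in cost_pred]

rmse_log = rmse(log_true, log_pred)
# Back-transform before scoring, as objective_cost does with np.expm1
rmse_cost = rmse([math.expm1(v) for v in log_true],
                 [math.expm1(v) for v in log_pred])
print(f"log-space RMSE: {rmse_log:.4f} | cost-space RMSE: {rmse_cost:.4f}")
```

The two numbers describe the same predictions on different scales, so they should not be compared with each other.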


2025-09-07 19:18:06,290 - INFO - === START TRAINING PIPELINE ===
2025-09-07 19:18:06,328 - INFO - Loaded 2000 rows from MySQL us_logistics
2025-09-07 19:18:06,331 - INFO - Removed 41 rows with missing Cost
2025-09-07 19:18:06,335 - INFO - Removed 4 global cost outliers
2025-09-07 19:18:06,361 - INFO - DHL samples: 275
2025-09-07 19:18:06,362 - INFO - Missing Distance_miles: 0 (0.0%)
2025-09-07 19:18:06,362 - INFO - Missing Weight_kg: 0 (0.0%)
2025-09-07 19:18:06,362 - INFO - Missing origin_warehouse: 0 (0.0%)
2025-09-07 19:18:06,363 - INFO - Missing Destination: 0 (0.0%)
2025-09-07 19:18:06,363 - INFO - Distance 105-2499 | Weight 3.1-88.7 | Cost 32-464
2025-09-07 19:18:06,388 - INFO - Tuning with Optuna
[I 2025-09-07 19:18:06,388] A new study created in memory with name: no-name-aefcb138-dfd1-49b8-ba1b-fdfbbc03da8b
[I 2025-09-07 19:18:06,506] Trial 0 finished with value: 25.0287761646713 and parameters: {'learning_rate': 0.030710573677773714, 'num_leaves': 192, 'max_depth': 16, 'min_ch

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[357]	valid_0's rmse: 0.153029
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[116]	valid_0's rmse: 0.152512
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:06,767] Trial 2 finished with value: 27.980994137976317 and parameters: {'learning_rate': 0.03647316284911211, 'num_leaves': 72, 'max_depth': 14, 'min_child_samples': 11, 'feature_fraction': 0.7168578594140872, 'bagging_fraction': 0.7465447373174766, 'bagging_freq': 5, 'lambda_l1': 0.11656915613247415, 'lambda_l2': 6.267062696005991e-07}. Best is trial 1 with value: 24.65484374287327.
[I 2025-09-07 19:18:06,847] Trial 3 finished with value: 24.287135410013047 and parameters: {'learning_rate': 0.04666963767236924, 'num_leaves': 127, 'max_depth': 5, 'min_child_samples': 32, 'feature_fraction': 0.6682096494749166, 'bagging_fraction': 0.6260206371941118, 'bagging_freq': 10, 'lambda_l1': 4.905556676028766, 'lambda_l2': 0.1886149587855396}. Best is trial 3 with value: 24.287135410013047.


Early stopping, best iteration is:
[212]	valid_0's rmse: 0.161937
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[636]	valid_0's rmse: 0.155349
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:06,988] Trial 4 finished with value: 24.36299639937509 and parameters: {'learning_rate': 0.02490643969382439, 'num_leaves': 37, 'max_depth': 15, 'min_child_samples': 25, 'feature_fraction': 0.6488152939379115, 'bagging_fraction': 0.798070764044508, 'bagging_freq': 1, 'lambda_l1': 1.5271567592511939, 'lambda_l2': 2.133142332373e-06}. Best is trial 3 with value: 24.287135410013047.
[I 2025-09-07 19:18:07,039] Trial 5 finished with value: 23.611632998211867 and parameters: {'learning_rate': 0.07277150634170934, 'num_leaves': 76, 'max_depth': 13, 'min_child_samples': 30, 'feature_fraction': 0.6739417822102108, 'bagging_fraction': 0.9878338511058234, 'bagging_freq': 8, 'lambda_l1': 2.8542399074977594, 'lambda_l2': 1.1309571585271492}. Best is trial 5 with value: 23.611632998211867.
[I 2025-09-07 19:18:07,123] Trial 6 finished with value: 25.567581223641096 and parameters: {'learning_rate': 0.059963338824126605, 'num_leaves': 186, 'max_depth': 6, 'min_child_samples': 14, 

Early stopping, best iteration is:
[561]	valid_0's rmse: 0.151568
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[225]	valid_0's rmse: 0.148528
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[173]	valid_0's rmse: 0.15158
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:07,263] Trial 7 finished with value: 25.294221663461528 and parameters: {'learning_rate': 0.02911701023242742, 'num_leaves': 70, 'max_depth': 13, 'min_child_samples': 11, 'feature_fraction': 0.9208787923016158, 'bagging_fraction': 0.6298202574719083, 'bagging_freq': 10, 'lambda_l1': 0.08916674715636552, 'lambda_l2': 6.143857495033091e-07}. Best is trial 5 with value: 23.611632998211867.


Early stopping, best iteration is:
[169]	valid_0's rmse: 0.151191
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:07,557] Trial 8 finished with value: 23.62095468745595 and parameters: {'learning_rate': 0.010166803740022877, 'num_leaves': 167, 'max_depth': 16, 'min_child_samples': 38, 'feature_fraction': 0.9085081386743783, 'bagging_fraction': 0.6296178606936361, 'bagging_freq': 4, 'lambda_l1': 1.1036250149900698e-07, 'lambda_l2': 0.5860448217200526}. Best is trial 5 with value: 23.611632998211867.
[I 2025-09-07 19:18:07,625] Trial 9 finished with value: 25.438641989404033 and parameters: {'learning_rate': 0.06470376604234768, 'num_leaves': 79, 'max_depth': 6, 'min_child_samples': 19, 'feature_fraction': 0.7300733288106988, 'bagging_fraction': 0.8918424713352257, 'bagging_freq': 7, 'lambda_l1': 0.9658611176861261, 'lambda_l2': 0.0001778010520878397}. Best is trial 5 with value: 23.611632998211867.
[I 2025-09-07 19:18:07,686] Trial 10 finished with value: 26.05346027521042 and parameters: {'learning_rate': 0.1788820497328707, 'num_leaves': 124, 'max_depth': 10, 'min_child_sample

Early stopping, best iteration is:
[676]	valid_0's rmse: 0.149095
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[152]	valid_0's rmse: 0.154431
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[78]	valid_0's rmse: 0.157359
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[750]	valid_0's rmse: 0.150343


[I 2025-09-07 19:18:09,252] Trial 11 finished with value: 24.538018712983295 and parameters: {'learning_rate': 0.010205410978180531, 'num_leaves': 157, 'max_depth': 18, 'min_child_samples': 39, 'feature_fraction': 0.95621601194762, 'bagging_fraction': 0.9976023763984891, 'bagging_freq': 7, 'lambda_l1': 4.694221200463742e-08, 'lambda_l2': 9.604469751203892}. Best is trial 5 with value: 23.611632998211867.


Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:09,582] Trial 12 finished with value: 23.860262326441315 and parameters: {'learning_rate': 0.010486647476761056, 'num_leaves': 155, 'max_depth': 10, 'min_child_samples': 36, 'feature_fraction': 0.8624306017432583, 'bagging_fraction': 0.8925233663655074, 'bagging_freq': 4, 'lambda_l1': 0.00042371132433695466, 'lambda_l2': 0.0491451363965605}. Best is trial 5 with value: 23.611632998211867.
[I 2025-09-07 19:18:09,657] Trial 13 finished with value: 25.55403540893171 and parameters: {'learning_rate': 0.13379889603180217, 'num_leaves': 94, 'max_depth': 11, 'min_child_samples': 25, 'feature_fraction': 0.8158535596006177, 'bagging_fraction': 0.8873249911216153, 'bagging_freq': 8, 'lambda_l1': 1.3000481036980694e-07, 'lambda_l2': 0.014775475739688154}. Best is trial 5 with value: 23.611632998211867.


Early stopping, best iteration is:
[604]	valid_0's rmse: 0.148845
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[83]	valid_0's rmse: 0.151238
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:09,868] Trial 14 finished with value: 23.377908989710406 and parameters: {'learning_rate': 0.016030182697577208, 'num_leaves': 157, 'max_depth': 17, 'min_child_samples': 50, 'feature_fraction': 0.9994891134515507, 'bagging_fraction': 0.8147657261706673, 'bagging_freq': 6, 'lambda_l1': 8.135462362525248e-06, 'lambda_l2': 3.693544333296998e-05}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[492]	valid_0's rmse: 0.147902
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:10,509] Trial 15 finished with value: 28.65486358384242 and parameters: {'learning_rate': 0.015916913630789346, 'num_leaves': 122, 'max_depth': 20, 'min_child_samples': 5, 'feature_fraction': 0.7531177448189, 'bagging_fraction': 0.8346098620430572, 'bagging_freq': 6, 'lambda_l1': 5.704246908471008e-06, 'lambda_l2': 2.735063947885515e-05}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[360]	valid_0's rmse: 0.165862
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:10,753] Trial 16 finished with value: 24.405008014758035 and parameters: {'learning_rate': 0.018923194380010718, 'num_leaves': 46, 'max_depth': 18, 'min_child_samples': 50, 'feature_fraction': 0.7766748795291956, 'bagging_fraction': 0.9438599500782207, 'bagging_freq': 9, 'lambda_l1': 0.0029472905339895214, 'lambda_l2': 1.314053143275926e-08}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:10,819] Trial 17 finished with value: 25.239182561023025 and parameters: {'learning_rate': 0.10974601331685604, 'num_leaves': 99, 'max_depth': 12, 'min_child_samples': 44, 'feature_fraction': 0.8514142443596113, 'bagging_fraction': 0.8264255180564114, 'bagging_freq': 6, 'lambda_l1': 0.00011809048384728257, 'lambda_l2': 2.9213259776727324e-05}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[494]	valid_0's rmse: 0.151603
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[89]	valid_0's rmse: 0.154652
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:10,935] Trial 18 finished with value: 25.423095627103727 and parameters: {'learning_rate': 0.04873514165664674, 'num_leaves': 140, 'max_depth': 8, 'min_child_samples': 21, 'feature_fraction': 0.6079503762500515, 'bagging_fraction': 0.76500726490148, 'bagging_freq': 8, 'lambda_l1': 9.514769556766346e-05, 'lambda_l2': 0.005914969601806861}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:11,103] Trial 19 finished with value: 23.5200076730802 and parameters: {'learning_rate': 0.018453207518929104, 'num_leaves': 53, 'max_depth': 18, 'min_child_samples': 31, 'feature_fraction': 0.9916975183348293, 'bagging_fraction': 0.9347949395453552, 'bagging_freq': 5, 'lambda_l1': 0.017847429374350895, 'lambda_l2': 2.455918006464903e-08}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[201]	valid_0's rmse: 0.152956
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[240]	valid_0's rmse: 0.146355
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:11,353] Trial 20 finished with value: 23.766047611280367 and parameters: {'learning_rate': 0.015258279902564394, 'num_leaves': 58, 'max_depth': 18, 'min_child_samples': 43, 'feature_fraction': 0.999271196853437, 'bagging_fraction': 0.9354348266823663, 'bagging_freq': 3, 'lambda_l1': 0.008225723319543179, 'lambda_l2': 1.4301670465475771e-08}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[429]	valid_0's rmse: 0.147651
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:11,541] Trial 21 finished with value: 24.098554943375305 and parameters: {'learning_rate': 0.02108068073497902, 'num_leaves': 92, 'max_depth': 17, 'min_child_samples': 31, 'feature_fraction': 0.997710605633679, 'bagging_fraction': 0.9927091738714885, 'bagging_freq': 5, 'lambda_l1': 0.03226296274236699, 'lambda_l2': 8.160822218983402e-08}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:11,689] Trial 22 finished with value: 26.089117676112654 and parameters: {'learning_rate': 0.01349853647886122, 'num_leaves': 55, 'max_depth': 14, 'min_child_samples': 28, 'feature_fraction': 0.9363330586447146, 'bagging_fraction': 0.9209533914828585, 'bagging_freq': 7, 'lambda_l1': 9.489175007246143, 'lambda_l2': 7.929091668734159e-06}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[255]	valid_0's rmse: 0.146775
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[883]	valid_0's rmse: 0.161611
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:11,772] Trial 23 finished with value: 25.144890706399554 and parameters: {'learning_rate': 0.08433216983182996, 'num_leaves': 34, 'max_depth': 19, 'min_child_samples': 19, 'feature_fraction': 0.9674190575821708, 'bagging_fraction': 0.8506401207677025, 'bagging_freq': 6, 'lambda_l1': 2.0800325868110037e-05, 'lambda_l2': 0.00042525755803712}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:11,935] Trial 24 finished with value: 23.873663156571027 and parameters: {'learning_rate': 0.022173167052138832, 'num_leaves': 111, 'max_depth': 16, 'min_child_samples': 34, 'feature_fraction': 0.8911804643585647, 'bagging_fraction': 0.9646998052924811, 'bagging_freq': 9, 'lambda_l1': 0.000769963237725454, 'lambda_l2': 1.4479384412941373e-07}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[70]	valid_0's rmse: 0.150352
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[237]	valid_0's rmse: 0.148467
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:12,035] Trial 25 finished with value: 23.679623846092458 and parameters: {'learning_rate': 0.03900400259774833, 'num_leaves': 84, 'max_depth': 14, 'min_child_samples': 28, 'feature_fraction': 0.9700505870963768, 'bagging_fraction': 0.8691338693930994, 'bagging_freq': 5, 'lambda_l1': 3.7330168354077543e-07, 'lambda_l2': 9.57505662702554e-05}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:12,143] Trial 26 finished with value: 24.79543173978767 and parameters: {'learning_rate': 0.05984352186495856, 'num_leaves': 59, 'max_depth': 17, 'min_child_samples': 40, 'feature_fraction': 0.8696418530804749, 'bagging_fraction': 0.8005534398406551, 'bagging_freq': 3, 'lambda_l1': 0.20307815632470713, 'lambda_l2': 0.0018477414658242184}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[115]	valid_0's rmse: 0.146517
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[177]	valid_0's rmse: 0.151161
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:12,378] Trial 27 finished with value: 23.949026038425355 and parameters: {'learning_rate': 0.0132883491064746, 'num_leaves': 103, 'max_depth': 12, 'min_child_samples': 24, 'feature_fraction': 0.9387776292756179, 'bagging_fraction': 0.9127796583864662, 'bagging_freq': 7, 'lambda_l1': 1.2061871344139381e-08, 'lambda_l2': 4.346089928350894e-06}. Best is trial 14 with value: 23.377908989710406.


Early stopping, best iteration is:
[306]	valid_0's rmse: 0.148674
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:12,649] Trial 28 finished with value: 25.256785568512736 and parameters: {'learning_rate': 0.017710191303869395, 'num_leaves': 20, 'max_depth': 15, 'min_child_samples': 48, 'feature_fraction': 0.7060183365006387, 'bagging_fraction': 0.9634700074335787, 'bagging_freq': 9, 'lambda_l1': 0.015090113643213256, 'lambda_l2': 2.0896963339390293}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:12,773] Trial 29 finished with value: 24.27733404115834 and parameters: {'learning_rate': 0.029136632872615784, 'num_leaves': 196, 'max_depth': 19, 'min_child_samples': 30, 'feature_fraction': 0.8312076808522824, 'bagging_fraction': 0.6957765371853208, 'bagging_freq': 6, 'lambda_l1': 0.36186460501662293, 'lambda_l2': 0.001632914441352874}. Best is trial 14 with value: 23.377908989710406.
[I 2025-09-07 19:18:12,774] A new study created in memory with name: no-name-af2ef94d-d9cc-4246-ade8-995adc780ac9


Early stopping, best iteration is:
[596]	valid_0's rmse: 0.15409
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[253]	valid_0's rmse: 0.15005
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:12,838] Trial 0 finished with value: 1.4229921228863236 and parameters: {'learning_rate': 0.030710573677773714, 'num_leaves': 192, 'max_depth': 16, 'min_child_samples': 32, 'feature_fraction': 0.6624074561769746, 'bagging_fraction': 0.662397808134481, 'bagging_freq': 1, 'lambda_l1': 0.6245760287469887, 'lambda_l2': 0.002570603566117596}. Best is trial 0 with value: 1.4229921228863236.
[I 2025-09-07 19:18:12,874] Trial 1 finished with value: 1.3784799710634779 and parameters: {'learning_rate': 0.08341106432362087, 'num_leaves': 23, 'max_depth': 20, 'min_child_samples': 43, 'feature_fraction': 0.6849356442713105, 'bagging_fraction': 0.6727299868828402, 'bagging_freq': 2, 'lambda_l1': 5.472429642032189e-06, 'lambda_l2': 0.00052821153945323}. Best is trial 1 with value: 1.3784799710634779.
[I 2025-09-07 19:18:12,991] Trial 2 finished with value: 1.4275647385120558 and parameters: {'learning_rate': 0.03647316284911211, 'num_leaves': 72, 'max_depth': 14, 'min_child_sample

Early stopping, best iteration is:
[124]	valid_0's rmse: 1.42299
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[40]	valid_0's rmse: 1.37848
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[85]	valid_0's rmse: 1.42756
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[88]	valid_0's rmse: 1.38026


[I 2025-09-07 19:18:13,121] Trial 4 finished with value: 1.428123607307349 and parameters: {'learning_rate': 0.02490643969382439, 'num_leaves': 37, 'max_depth': 15, 'min_child_samples': 25, 'feature_fraction': 0.6488152939379115, 'bagging_fraction': 0.798070764044508, 'bagging_freq': 1, 'lambda_l1': 1.5271567592511939, 'lambda_l2': 2.133142332373e-06}. Best is trial 1 with value: 1.3784799710634779.
[I 2025-09-07 19:18:13,165] Trial 5 finished with value: 1.3947014225625831 and parameters: {'learning_rate': 0.07277150634170934, 'num_leaves': 76, 'max_depth': 13, 'min_child_samples': 30, 'feature_fraction': 0.6739417822102108, 'bagging_fraction': 0.9878338511058234, 'bagging_freq': 8, 'lambda_l1': 2.8542399074977594, 'lambda_l2': 1.1309571585271492}. Best is trial 1 with value: 1.3784799710634779.
[I 2025-09-07 19:18:13,219] Trial 6 finished with value: 1.4205013521358603 and parameters: {'learning_rate': 0.059963338824126605, 'num_leaves': 186, 'max_depth': 6, 'min_child_samples': 14, 

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[190]	valid_0's rmse: 1.42812
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[45]	valid_0's rmse: 1.3947
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[59]	valid_0's rmse: 1.4205
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:13,338] Trial 7 finished with value: 1.4305309807438975 and parameters: {'learning_rate': 0.02911701023242742, 'num_leaves': 70, 'max_depth': 13, 'min_child_samples': 11, 'feature_fraction': 0.9208787923016158, 'bagging_fraction': 0.6298202574719083, 'bagging_freq': 10, 'lambda_l1': 0.08916674715636552, 'lambda_l2': 6.143857495033091e-07}. Best is trial 1 with value: 1.3784799710634779.
[I 2025-09-07 19:18:13,479] Trial 8 finished with value: 1.379062963099086 and parameters: {'learning_rate': 0.010166803740022877, 'num_leaves': 167, 'max_depth': 16, 'min_child_samples': 38, 'feature_fraction': 0.9085081386743783, 'bagging_fraction': 0.6296178606936361, 'bagging_freq': 4, 'lambda_l1': 1.1036250149900698e-07, 'lambda_l2': 0.5860448217200526}. Best is trial 1 with value: 1.3784799710634779.


Early stopping, best iteration is:
[99]	valid_0's rmse: 1.43053
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[343]	valid_0's rmse: 1.37906
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[54]	valid_0's rmse: 1.46355


[I 2025-09-07 19:18:13,527] Trial 9 finished with value: 1.4635547656158727 and parameters: {'learning_rate': 0.06470376604234768, 'num_leaves': 79, 'max_depth': 6, 'min_child_samples': 19, 'feature_fraction': 0.7300733288106988, 'bagging_fraction': 0.8918424713352257, 'bagging_freq': 7, 'lambda_l1': 0.9658611176861261, 'lambda_l2': 0.0001778010520878397}. Best is trial 1 with value: 1.3784799710634779.
[I 2025-09-07 19:18:13,566] Trial 10 finished with value: 1.4034675784988886 and parameters: {'learning_rate': 0.18168388620005324, 'num_leaves': 21, 'max_depth': 20, 'min_child_samples': 48, 'feature_fraction': 0.8182873120328862, 'bagging_fraction': 0.8760988294276582, 'bagging_freq': 3, 'lambda_l1': 4.32747580263185e-05, 'lambda_l2': 0.00020590009861590298}. Best is trial 1 with value: 1.3784799710634779.
[I 2025-09-07 19:18:13,703] Trial 11 finished with value: 1.377435632839684 and parameters: {'learning_rate': 0.010205410978180531, 'num_leaves': 144, 'max_depth': 20, 'min_child_sa

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[21]	valid_0's rmse: 1.40347
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[314]	valid_0's rmse: 1.37744
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[28]	valid_0's rmse: 1.36576


[I 2025-09-07 19:18:13,742] Trial 12 finished with value: 1.3657561073579552 and parameters: {'learning_rate': 0.1286542361913006, 'num_leaves': 136, 'max_depth': 20, 'min_child_samples': 50, 'feature_fraction': 0.9978494133549762, 'bagging_fraction': 0.7097939893129923, 'bagging_freq': 2, 'lambda_l1': 2.0057322817130592e-08, 'lambda_l2': 0.01121764667975888}. Best is trial 12 with value: 1.3657561073579552.
[I 2025-09-07 19:18:13,777] Trial 13 finished with value: 1.363110560243712 and parameters: {'learning_rate': 0.19762750325534995, 'num_leaves': 137, 'max_depth': 18, 'min_child_samples': 50, 'feature_fraction': 0.9943295733240962, 'bagging_fraction': 0.728131760785334, 'bagging_freq': 3, 'lambda_l1': 1.7497684179511237e-08, 'lambda_l2': 9.269052342453529}. Best is trial 13 with value: 1.363110560243712.
[I 2025-09-07 19:18:13,812] Trial 14 finished with value: 1.3678774325358518 and parameters: {'learning_rate': 0.19245997693077602, 'num_leaves': 110, 'max_depth': 18, 'min_child_s

Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[15]	valid_0's rmse: 1.36311
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[18]	valid_0's rmse: 1.36788
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[27]	valid_0's rmse: 1.37094
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[24]	valid_0's rmse: 1.3675
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[25]	valid_0's rmse: 1.38343
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:14,078] Trial 18 finished with value: 1.436067783260492 and parameters: {'learning_rate': 0.13940127601485922, 'num_leaves': 167, 'max_depth': 18, 'min_child_samples': 5, 'feature_fraction': 0.7810309772863404, 'bagging_fraction': 0.8366070683982902, 'bagging_freq': 2, 'lambda_l1': 3.379967785713623e-07, 'lambda_l2': 1.2602385917713332e-05}. Best is trial 13 with value: 1.363110560243712.
[I 2025-09-07 19:18:14,126] Trial 19 finished with value: 1.385348777988294 and parameters: {'learning_rate': 0.0893942958390587, 'num_leaves': 100, 'max_depth': 11, 'min_child_samples': 45, 'feature_fraction': 0.9499737847758472, 'bagging_fraction': 0.9461214256951717, 'bagging_freq': 5, 'lambda_l1': 5.799449092292789e-05, 'lambda_l2': 0.021610874731711048}. Best is trial 13 with value: 1.363110560243712.
[I 2025-09-07 19:18:14,167] Trial 20 finished with value: 1.3870698304192757 and parameters: {'learning_rate': 0.1592953405890677, 'num_leaves': 132, 'max_depth': 17, 'min_child_

Early stopping, best iteration is:
[24]	valid_0's rmse: 1.43607
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[37]	valid_0's rmse: 1.38535
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[15]	valid_0's rmse: 1.38707
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[27]	valid_0's rmse: 1.35887
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[27]	valid_0's rmse: 1.36689
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:14,283] Trial 23 finished with value: 1.3761945850116022 and parameters: {'learning_rate': 0.1893607498525961, 'num_leaves': 155, 'max_depth': 17, 'min_child_samples': 41, 'feature_fraction': 0.9194159622335092, 'bagging_fraction': 0.8219643567573005, 'bagging_freq': 2, 'lambda_l1': 4.9435567692737205e-08, 'lambda_l2': 1.1178321976911433e-05}. Best is trial 21 with value: 1.3588743427675174.
[I 2025-09-07 19:18:14,366] Trial 24 finished with value: 1.3811717456315415 and parameters: {'learning_rate': 0.018924645553851333, 'num_leaves': 124, 'max_depth': 19, 'min_child_samples': 46, 'feature_fraction': 0.9633978700443572, 'bagging_fraction': 0.7777351385062048, 'bagging_freq': 6, 'lambda_l1': 7.474018108013456e-07, 'lambda_l2': 1.2905995891419694e-05}. Best is trial 21 with value: 1.3588743427675174.
[I 2025-09-07 19:18:14,406] Trial 25 finished with value: 1.3795349276342461 and parameters: {'learning_rate': 0.1520104772729557, 'num_leaves': 117, 'max_depth': 15, 'm

Early stopping, best iteration is:
[16]	valid_0's rmse: 1.37619
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[162]	valid_0's rmse: 1.38117
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[19]	valid_0's rmse: 1.37953
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[72]	valid_0's rmse: 1.37343
Training until validation scores don't improve for 50 rounds


[I 2025-09-07 19:18:14,496] Trial 27 finished with value: 1.3762008896479814 and parameters: {'learning_rate': 0.1049491143066254, 'num_leaves': 143, 'max_depth': 17, 'min_child_samples': 46, 'feature_fraction': 0.9322158548901793, 'bagging_fraction': 0.7583459435013726, 'bagging_freq': 5, 'lambda_l1': 8.580848724117879e-08, 'lambda_l2': 1.157357885453882e-07}. Best is trial 21 with value: 1.3588743427675174.
[I 2025-09-07 19:18:14,542] Trial 28 finished with value: 1.3758017285529167 and parameters: {'learning_rate': 0.07729046632676011, 'num_leaves': 96, 'max_depth': 15, 'min_child_samples': 42, 'feature_fraction': 0.8526333921184073, 'bagging_fraction': 0.7083878875871941, 'bagging_freq': 3, 'lambda_l1': 9.752873201333507e-07, 'lambda_l2': 4.4885788944414506e-05}. Best is trial 21 with value: 1.3588743427675174.
[I 2025-09-07 19:18:14,575] Trial 29 finished with value: 1.369993545585721 and parameters: {'learning_rate': 0.14922007609723206, 'num_leaves': 196, 'max_depth': 17, 'min_c

Early stopping, best iteration is:
[25]	valid_0's rmse: 1.3762
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[47]	valid_0's rmse: 1.3758
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[20]	valid_0's rmse: 1.36999
Training until validation scores don't improve for 50 rounds
[100]	valid_0's rmse: 0.205035
[200]	valid_0's rmse: 0.154863
[300]	valid_0's rmse: 0.148759


2025-09-07 19:18:14,800 - INFO - Training carrier-specific models
2025-09-07 19:18:14,900 - INFO - Calculating calibration factors


[400]	valid_0's rmse: 0.148194
[500]	valid_0's rmse: 0.148012
Early stopping, best iteration is:
[492]	valid_0's rmse: 0.147902
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[27]	valid_0's rmse: 1.35887
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[122]	valid_0's rmse: 0.113437


2025-09-07 19:18:14,912 - INFO - DHL cost error | mean=15.57, median=7.91, max=309.32


Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[456]	valid_0's rmse: 0.169446
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[534]	valid_0's rmse: 0.160388
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[387]	valid_0's rmse: 0.0895428
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[618]	valid_0's rmse: 0.0874412
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[612]	valid_0's rmse: 0.158849


2025-09-07 19:18:15,899 - INFO - CV RMSE Cost: 22.5275 ± 3.7490 | Days: 1.0155 ± 0.0572


Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[67]	valid_0's rmse: 1.06364
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[64]	valid_0's rmse: 0.933507
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[57]	valid_0's rmse: 0.960408
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[44]	valid_0's rmse: 1.07249
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[45]	valid_0's rmse: 1.0477


2025-09-07 19:18:16,217 - INFO - Artifacts saved to model_artifacts
2025-09-07 19:18:16,217 - INFO - === TRAINING COMPLETED SUCCESSFULLY ===


Summary: {
  "artifacts_dir": "model_artifacts",
  "best_params_json": "model_artifacts/best_params.json",
  "carrier_performance_csv": "model_artifacts/carrier_performance.csv",
  "rmse_plot_png": "model_artifacts/carrier_rmse_analysis.png",
  "when": "2025-09-08T02:18:16.218088Z"
}
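
The summary above lists the paths the run reports as written. A minimal sketch of a post-run check follows, confirming that each listed file is actually on disk. The helper name `missing_artifacts` and the rule "keys ending in `_json`, `_csv`, or `_png` are file paths" are assumptions made for illustration, not part of the pipeline itself.

```python
import json
from pathlib import Path


def missing_artifacts(summary: dict) -> list[str]:
    """Return the file paths listed in `summary` that do not exist on disk.

    Hypothetical helper: it treats any key ending in _json, _csv, or _png
    as a file path, which matches the key names printed in the summary.
    """
    path_keys = [k for k in summary if k.endswith(("_json", "_csv", "_png"))]
    return [summary[k] for k in path_keys if not Path(summary[k]).is_file()]


# The summary printed by the run, copied verbatim from the output above.
summary = json.loads("""
{
  "artifacts_dir": "model_artifacts",
  "best_params_json": "model_artifacts/best_params.json",
  "carrier_performance_csv": "model_artifacts/carrier_performance.csv",
  "rmse_plot_png": "model_artifacts/carrier_rmse_analysis.png",
  "when": "2025-09-08T02:18:16.218088Z"
}
""")

# Paths are resolved relative to the current working directory.
missing = missing_artifacts(summary)
print("Missing artifacts:", missing if missing else "none")
```

Run from the notebook's working directory, this should report no missing files after a successful run. The model and preprocessor `.pkl` files are not listed in the summary, so they would need to be checked separately.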
