The V18 Strategy: Training a Model on the Errors (Residual Fitting)
Instead of a simple, hand-coded rule engine, we will train a second model whose only job is to predict the errors of our champion V16 model. Fitting a model to another model's residuals is the same idea that gradient boosting applies at each round.

The process is as follows:

Train the Champion (V16): We train our best model as usual and get its predictions on the training data, which is where the code below measures the errors.

Calculate the Errors (Residuals): We find the difference between the true price and the model's prediction for each item (error = true_price - predicted_price). This represents what the first model failed to learn.

Train an "Error-Corrector" Model: We train a second, simpler model. Its features are the same, but its target is now the error. It learns the patterns in the data that cause our main model to make mistakes.

Create the Final Prediction: The final, corrected prediction is simply: Final Prediction = V16 Prediction + Error-Corrector Prediction.
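
The four steps above can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: both models are one-split regression stumps written in NumPy, the data is synthetic, and `fit_stump` and `predict_stump` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression stump on a 1-D feature.

    Returns (threshold, left_mean, right_mean) minimising squared error.
    """
    best = None
    # Exclude the largest value so the right-hand side is never empty
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
true_price = 5 + 3 * x + rng.normal(0, 0.5, size=200)

# Step 1: train the base model and get its predictions
base_model = fit_stump(x, true_price)
base_pred = predict_stump(base_model, x)

# Step 2: error = true_price - predicted_price
error = true_price - base_pred

# Step 3: train the corrector on the same feature, with the error as target
corrector = fit_stump(x, error)

# Step 4: final prediction = base prediction + predicted error
final_pred = base_pred + predict_stump(corrector, x)

mse_base = np.mean((true_price - base_pred) ** 2)
mse_final = np.mean((true_price - final_pred) ** 2)
print(f"training MSE, base only:        {mse_base:.3f}")
print(f"training MSE, base + corrector: {mse_final:.3f}")
```

Both numbers are measured on the data the models were fitted to, so a lower second value only shows that the mechanics work. It says nothing about performance on held-out rows.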

In [1]:
import pandas as pd
import numpy as np
import re
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

# --- 1. Load Data & V16 Feature Engineering ---
train_df = pd.read_csv('input/train.csv')
train_df = train_df.dropna(subset=['price'])
# Fill missing values first: astype(str) alone would turn NaN into the literal string 'nan'
train_df['catalog_content'] = train_df['catalog_content'].fillna('').astype(str)
train_df['title'] = train_df['catalog_content'].apply(lambda t: (m.group(1).strip() if (m := re.search(r'^item name:\s*(.*)', t, re.I | re.M)) else t))
def extract_quantity(text):
    """Return the first pack/count quantity found in the text, defaulting to 1."""
    text = text.lower()
    patterns = [r'pack of (\d+)', r'(\d+)\s*[-]?pack', r'(\d+)\s*count', r'(\d+)\s*per case', r'(\d+)\s*pk']
    for p in patterns:
        if m := re.search(p, text):
            return int(m.group(1))
    return 1

def extract_all_numerical_features(text):
    """Return a dict mapping feat_<unit> to the first number found directly before that unit."""
    text = text.lower()
    features = {}
    unit_map = {'gb': ['gb'], 'oz': ['oz', 'ounce'], 'inch': ['inch', '"'], 'mp': ['mp'],
                'lbs': ['lb', 'lbs'], 'mah': ['mah'], 'watts': ['watts', 'w']}
    # Note: units are matched without a word boundary, so short ones like 'w' also match "12 white"
    for fn, u in unit_map.items():
        if m := re.search(fr'(\d+\.?\d*)\s*(?:{"|".join(u)})', text):
            features[f'feat_{fn}'] = float(m.group(1))
    return features
numerical_features_df = pd.json_normalize(train_df['catalog_content'].apply(extract_all_numerical_features))
train_df = pd.concat([train_df.reset_index(drop=True), numerical_features_df], axis=1)
train_df['quantity'] = train_df['catalog_content'].apply(extract_quantity)
PREMIUM_KEYWORDS = ['solid wood', 'leather', 'gold', 'oled', '4k', 'stainless steel']
train_df['premium_keyword_count'] = train_df['catalog_content'].apply(lambda t: sum(k in t.lower() for k in PREMIUM_KEYWORDS))
train_df['title_length'] = train_df['title'].str.len().fillna(0)
train_df['content_word_count'] = train_df['catalog_content'].str.split().str.len().fillna(0)
print("V16 features created.")

# --- 2. Create Hold-Out Set ---
numerical_cols = [col for col in train_df.columns if col.startswith('feat_')]
all_engineered_cols = ['quantity', 'premium_keyword_count', 'title_length', 'content_word_count'] + numerical_cols
X = train_df[['catalog_content'] + all_engineered_cols]
y = train_df['price']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
y_train_log = np.log1p(y_train)

# --- 3. Train the Champion V16 Model ---
print("\nTraining champion V16 model...")
best_params = { 'objective': 'regression_l1', 'n_estimators': 761, 'learning_rate': 0.188, 'num_leaves': 41, 'max_depth': 17, 'random_state': 42, 'n_jobs': -1, 'verbose': -1 }
numeric_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0)), ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[
    ('text', TfidfVectorizer(stop_words='english', max_features=40000, ngram_range=(1, 2)), 'catalog_content'),
    ('numeric', numeric_pipeline, all_engineered_cols)
])
pipeline_v16 = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', lgb.LGBMRegressor(**best_params))])
pipeline_v16.fit(X_train, y_train_log)
# Get base predictions on the training data to calculate errors
train_preds_log_v16 = pipeline_v16.predict(X_train)
train_preds_v16 = np.expm1(train_preds_log_v16)
train_preds_v16[train_preds_v16 < 0] = 0

# --- 4. Train the Error-Corrector Model ---
print("Training error-corrector model...")
# The target is the error of the V16 model
errors = y_train - train_preds_v16
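# We use a simpler model (fewer estimators) to avoid overfitting to the errors.
# clone() gives the corrector its own unfitted copy of the preprocessor; passing the
# same object would refit, in place, the transformer that pipeline_v16 relies on.
from sklearn.base import clone
error_corrector_model = Pipeline(steps=[('preprocessor', clone(preprocessor)), ('regressor', lgb.LGBMRegressor(n_estimators=150, random_state=42, n_jobs=-1))])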
error_corrector_model.fit(X_train, errors)

# --- 5. Evaluate the Full V18 System ---
print("\nEvaluating V18 system...")
# Stage 1: Get base predictions from V16 on the validation set
val_preds_log_v16 = pipeline_v16.predict(X_val)
val_preds_v16 = np.expm1(val_preds_log_v16)
val_preds_v16[val_preds_v16 < 0] = 0
# Stage 2: Get error predictions from the corrector model
error_preds = error_corrector_model.predict(X_val)
# Stage 3: Combine the predictions
final_predictions = val_preds_v16 + error_preds
final_predictions[final_predictions < 0] = 0

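def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    # Convert to plain arrays so a pandas Series and a NumPy array combine by position
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = np.abs(y_pred - y_true)
    den = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Rows where both values are 0 contribute 0 instead of dividing by zero
    return np.mean(np.divide(num, den, out=np.zeros_like(num), where=den != 0)) * 100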

v16_smape = smape(y_val, val_preds_v16)
v18_smape = smape(y_val, final_predictions)

print("\n--- Model Performance Comparison ---")
print(f"V16 Model SMAPE (champion): {v16_smape:.4f}")
print(f"V18 Model SMAPE (V16 + Error Correction): {v18_smape:.4f}")

V16 features created.

Training champion V16 model...
Training error-corrector model...

Evaluating V18 system...

--- Model Performance Comparison ---
V16 Model SMAPE (champion): 50.8384
V18 Model SMAPE (V16 + Error Correction): 55.7419
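
In this run the corrected system scores worse than V16 alone (55.7419 vs 50.8384). One possible contributor, not verified on this dataset, is that the residuals are computed on the same rows V16 was trained on. A model that fits its training data closely leaves in-sample residuals that can look very different from the errors it makes on unseen rows. The sketch below illustrates the effect with a deliberately overfit 1-nearest-neighbour predictor on synthetic data; `one_nn_predict` is a hypothetical helper, and nothing here uses the notebook's data or models.

```python
import numpy as np

def one_nn_predict(x_fit, y_fit, x_query):
    """Predict each query with the target of its nearest fitted point (memorises the data)."""
    nearest = np.abs(x_query[:, None] - x_fit[None, :]).argmin(axis=1)
    return y_fit[nearest]

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = 5 + 3 * x + rng.normal(0, 2.0, size=300)

# In-sample residuals: every point is its own nearest neighbour, so nothing is left to learn
in_sample_residuals = y - one_nn_predict(x, y, x)

# Out-of-fold residuals: each row is predicted by a model that never saw it
n_folds = 5
folds = np.arange(len(x)) % n_folds
oof_residuals = np.empty_like(y)
for k in range(n_folds):
    held_out = folds == k
    oof_residuals[held_out] = y[held_out] - one_nn_predict(x[~held_out], y[~held_out], x[held_out])

print(f"mean |residual|, in-sample:   {np.abs(in_sample_residuals).mean():.3f}")
print(f"mean |residual|, out-of-fold: {np.abs(oof_residuals).mean():.3f}")
```

With the notebook's pipeline, scikit-learn's `cross_val_predict` (from `sklearn.model_selection`) is one way to obtain out-of-fold predictions for `X_train` to use as the corrector's target. Whether that would close the gap between V16 and V18 here is untested.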
