# 🚀 CatBoost Training - Thai Phone Number Price Prediction

**Model**: CatBoost Only
**Duration**: 1-2 hours
**Expected R²**: 0.85-0.89

---

## ⚙️ Setup Requirements
1. **GPU**: Turn ON (Settings → Accelerator → GPU P100)
2. **Internet**: Turn ON (Settings → Internet → ON)
3. **Dataset**: Add `number-ml-kaggle-dataset` (contains code + data)

## 📦 Cell 1: Copy Dataset to Working Directory

In [None]:
%%bash
# Copy dataset to working directory
cp -r /kaggle/input/number-ml-kaggle-dataset/* /kaggle/working/
cd /kaggle/working

# Verify files
echo "✅ Files copied:"
ls -lh

echo ""
echo "✅ Data file:"
ls -lh data/raw/
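
The `ls` output above has to be read by eye. As a rough sketch, the same check can be scripted so a missing path is reported explicitly. The list of expected paths is an assumption pieced together from the other cells in this notebook (`train_catboost_only.py`, `src`, `data/raw`), not a manifest shipped with the dataset.

```python
from pathlib import Path

# Assumed layout: these paths come from the cells in this notebook,
# not from an official manifest of the dataset.
EXPECTED_PATHS = ["train_catboost_only.py", "src", "data/raw"]


def find_missing(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


missing = find_missing("/kaggle/working")
if missing:
    print(f"Missing after copy: {missing}")
else:
    print("All expected paths are present.")
```

If anything is reported missing, re-check that the dataset is attached before moving on to the install cell.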

## 📦 Cell 2: Install Dependencies

In [None]:
%%bash
pip install -q optuna catboost lightgbm xgboost scikit-learn==1.5.2

# Verify
echo "✅ Package versions:"
python -c "import catboost; print(f'CatBoost: {catboost.__version__}')"
python -c "import optuna; print(f'Optuna: {optuna.__version__}')"
python -c "import sklearn; print(f'Scikit-learn: {sklearn.__version__}')"
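
The install cell pins `scikit-learn==1.5.2`, so it can help to fail fast if the pin did not take effect. The sketch below compares installed versions against a dictionary of pins using the standard-library `importlib.metadata`. The helper name `check_pins` is made up for this example, and it reads installed package metadata rather than whatever a running kernel has already imported.

```python
from importlib import metadata


def check_pins(pins):
    """Compare installed package versions against exact pins.

    Returns {package: (expected, found)} for every mismatch; found is None
    when the package is not installed.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


# scikit-learn is the only package pinned in the install cell.
problems = check_pins({"scikit-learn": "1.5.2"})
print(problems or "Pinned versions match.")
```

If a mismatch is reported after installing, restarting the kernel and re-running the install cell is a reasonable first step.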

## 🔧 Cell 3: Configure Environment

In [None]:
import os
import sys

# Set base path for Kaggle
os.environ['ML_BASE_PATH'] = '/kaggle/working'
sys.path.insert(0, '/kaggle/working')

# Verify imports
print("🔍 Verifying imports...")
from src.config import BASE_PATH
from src.environment import detect_environment
from src.data_handler import load_and_clean_data

print(f"✅ BASE_PATH: {BASE_PATH}")
print(f"✅ Environment: {detect_environment()}")
print("✅ All imports working!")
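
The cell sets `ML_BASE_PATH` before importing from `src.config`. The order matters if the config module reads the variable at import time. That is an assumption, since `src/config.py` is not shown here. A hypothetical resolver along those lines might look like this; `resolve_base_path` is an illustrative name, not the project's actual function.

```python
import os
from pathlib import Path


def resolve_base_path(env_var="ML_BASE_PATH"):
    """Return the project root: the env var if set, else the current directory."""
    value = os.environ.get(env_var)
    return Path(value) if value else Path.cwd()


# Hypothetical usage mirroring the cell above.
os.environ["ML_BASE_PATH"] = "/kaggle/working"
print(resolve_base_path())
```

With a resolver like this, setting the variable after the import would have no effect on a `BASE_PATH` constant that was already computed.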

## 🚀 Cell 4: Train CatBoost Model

**Training takes roughly 1-2 hours.**

Optuna progress bars show how far the run has progressed.

In [None]:
!python train_catboost_only.py
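
A `!python` line does not raise an error in the notebook when the script exits with a failure, so a crashed run can go unnoticed until the results cell. As a minimal sketch, the script can be launched through `subprocess` instead, which makes the exit code available. The helper `run_script` is a name invented for this example, and the demo uses a stand-in command so it runs anywhere.

```python
import subprocess
import sys


def run_script(args):
    """Run a command, stream its output line by line, and return the exit code."""
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors appear inline
        text=True,
    )
    for line in process.stdout:
        print(line, end="")
    return process.wait()


# Stand-in command so the sketch runs anywhere; in the notebook, pass
# [sys.executable, "train_catboost_only.py"] instead.
exit_code = run_script([sys.executable, "-c", "print('training finished')"])
print(f"Exit code: {exit_code}")
```

A non-zero exit code means the script failed. Because output is read line by line, progress bars that redraw in place may update less smoothly than with `!python`.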

## 📊 Cell 5: Check Results

In [None]:
import joblib
import os

checkpoint_path = 'models/checkpoints/catboost_checkpoint.pkl'

if os.path.exists(checkpoint_path):
    checkpoint = joblib.load(checkpoint_path)
    
    print("="*80)
    print("✅ CATBOOST MODEL LOADED!")
    print("="*80)
    print(f"📊 R² Score:       {checkpoint['r2_score']:.4f}")
    print(f"📊 MAE:            ฿{checkpoint['mae']:.2f}")
    print(f"📊 RMSE:           ฿{checkpoint['rmse']:.2f}")
    print(f"⏱️  Training Time:  {checkpoint['training_time_hours']:.2f} hours")
    print(f"📅 Timestamp:      {checkpoint['timestamp']}")
    print("="*80)
    
    # Check file size
    file_size_mb = os.path.getsize(checkpoint_path) / (1024**2)
    print(f"💾 File size: {file_size_mb:.1f} MB")
    
    # Performance assessment
    r2 = checkpoint['r2_score']
    if r2 >= 0.89:
        print("\n🎉 EXCELLENT! R² ≥ 0.89 - Better than expected!")
    elif r2 >= 0.85:
        print("\n✅ GOOD! R² in target range (0.85-0.89)")
    elif r2 >= 0.80:
        print("\n⚠️  OK! R² ≥ 0.80 - An ensemble may improve this")
    else:
        print("\n❌ LOW! R² below 0.80 - Check logs for issues")
else:
    print(f"❌ Checkpoint not found: {checkpoint_path}")
    print("Training may still be running or failed. Check logs:")
    !ls -lh logs/
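
For reference, the three metrics printed above can be computed by hand as follows. This is a plain-Python sketch with made-up numbers. Whether the checkpoint's values are measured on raw baht prices or on log-transformed prices is not stated in this notebook, so the example takes no position on that.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Made-up prices in baht, for illustration only.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]
r2, mae, rmse = regression_metrics(actual, predicted)
print(f"R2={r2:.4f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
# prints: R2=0.9800  MAE=150.00  RMSE=158.11
```

RMSE penalizes large misses more heavily than MAE, which is why it comes out higher here even though both summarize the same errors.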

## 🧪 Cell 6: Test Prediction

In [None]:
import joblib
import pandas as pd
import numpy as np
from src.features import create_all_features

# Load model
checkpoint = joblib.load('models/checkpoints/catboost_checkpoint.pkl')
model = checkpoint['model']
preprocessor = checkpoint['preprocessor']

# Test numbers
test_numbers = [
    "0899999999",  # Very lucky (all 9s)
    "0812345678",  # Sequential
    "0888888888",  # All 8s (lucky)
    "0811112222",  # Repeating pairs
    "0804040404",  # Pattern
]

print("="*80)
print("🔮 PRICE PREDICTIONS")
print("="*80)

for phone_number in test_numbers:
    df_test = pd.DataFrame({'phone_number': [phone_number]})
    X_test, _, _ = create_all_features(df_test)
    X_test_processed = preprocessor.transform(X_test)
    
    # Predict
    price_log_pred = model.predict(X_test_processed)
    price_pred = np.expm1(price_log_pred[0])
    
    print(f"📞 {phone_number}")
    print(f"   💰 Predicted: ฿{price_pred:,.0f}")
    print()
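
The prediction loop applies `np.expm1` to the model output, which suggests the target was transformed with `log1p` during training. This is an inference from the code above, since the training script is not shown. The round trip can be sketched with the standard-library equivalents; the function names are illustrative.

```python
import math


def to_log_price(price):
    """Training-side transform: log(1 + price)."""
    return math.log1p(price)


def from_log_price(log_price):
    """Inference-side inverse: exp(log_price) - 1."""
    return math.expm1(log_price)


price = 25000.0  # made-up price in baht
log_price = to_log_price(price)
recovered = from_log_price(log_price)
print(f"log1p({price:.0f}) = {log_price:.4f}, expm1 gives back {recovered:.2f}")
```

Skipping the inverse transform would report prices on the log scale, which are typically small numbers around 10 rather than amounts in baht.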

## 📈 Cell 7: View Training Logs (Optional)

In [None]:
%%bash
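# Show the last 100 lines of the most recent log file
echo "📋 Latest training logs:"
printf '=%.0s' {1..80}; echo
latest_log=$(ls -t logs/catboost_*.log 2>/dev/null | head -n 1)
if [ -n "$latest_log" ]; then
    tail -n 100 "$latest_log"
else
    echo "No CatBoost log files found in logs/"
fi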
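
When several log files match the pattern, it helps to pick the newest one explicitly. Here is a Python sketch that selects the most recently modified log and returns its last lines. The directory and file pattern are taken from the cell above, while `tail_latest_log` is a helper name invented for this example.

```python
from collections import deque
from pathlib import Path


def tail_latest_log(log_dir="logs", pattern="catboost_*.log", n=100):
    """Return (path, last n lines) of the most recently modified matching log."""
    candidates = list(Path(log_dir).glob(pattern))
    if not candidates:
        return None, []
    latest = max(candidates, key=lambda path: path.stat().st_mtime)
    with latest.open(encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the final n lines while reading
        lines = list(deque(handle, maxlen=n))
    return latest, lines


latest, lines = tail_latest_log()
if latest is None:
    print("No CatBoost log files found.")
else:
    print(f"Last {len(lines)} lines of {latest}:")
    print("".join(lines), end="")
```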

## 💾 Cell 8: Download Model

Run this cell to get a download link for the trained model

In [None]:
from IPython.display import FileLink, display
import os

checkpoint_path = 'models/checkpoints/catboost_checkpoint.pkl'

if os.path.exists(checkpoint_path):
    file_size_mb = os.path.getsize(checkpoint_path) / (1024**2)
    print(f"💾 Model ready for download ({file_size_mb:.1f} MB)")
    print("Click the link below:")
    display(FileLink(checkpoint_path))
else:
    print(f"❌ File not found: {checkpoint_path}")
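
Large downloads are occasionally truncated, so it can be worth recording a checksum before downloading and comparing it locally afterwards. A minimal sketch using `hashlib` follows; the helper name `sha256_of_file` is made up for this example.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


checkpoint_path = "models/checkpoints/catboost_checkpoint.pkl"
if os.path.exists(checkpoint_path):
    print(f"SHA-256: {sha256_of_file(checkpoint_path)}")
else:
    print(f"File not found: {checkpoint_path}")
```

Running the same function on the downloaded copy should print an identical digest.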

## 🔍 Cell 9: GPU Usage Check (Run during training)

In [None]:
!nvidia-smi
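
The full `nvidia-smi` table is verbose when only utilization and memory matter. As a sketch, its CSV query mode can be called from Python and parsed into numbers. The helper names are invented for this example, and the code returns an empty list when `nvidia-smi` is unavailable or a field cannot be parsed.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one 'util, used, total' CSV line into a dict of integers."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "memory_used_mib": used, "memory_total_mib": total}


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            continue  # blank line or a field reported as unsupported
    return gpus


for index, gpu in enumerate(query_gpus()):
    print(f"GPU {index}: {gpu['util_percent']}% busy, "
          f"{gpu['memory_used_mib']}/{gpu['memory_total_mib']} MiB")
```

Utilization that stays near 0% while training is running may indicate that the job is running on the CPU instead.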

---

## ✅ Success Checklist

- [ ] GPU P100 enabled
- [ ] Dataset copied successfully
- [ ] Dependencies installed
- [ ] Environment configured
- [ ] Training completed (1-2 hours)
- [ ] R² score ≥ 0.85
- [ ] Model checkpoint saved
- [ ] Predictions working
- [ ] Model downloaded

---

## 🎯 Next Steps

- **Option 1**: Train more models (XGBoost, LightGBM, RandomForest)
- **Option 2**: Create an ensemble if you have multiple models
- **Option 3**: Deploy CatBoost directly if R² reaches the 0.85-0.89 target range
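
This notebook does not define what the ensemble in Option 2 looks like. One common and simple choice is a weighted average of each model's predictions, sketched below with made-up numbers. The model names, weights, and the `weighted_average` helper are all hypothetical.

```python
def weighted_average(predictions, weights=None):
    """Blend per-model prediction lists into one list of predictions.

    predictions: {model_name: [prediction, ...]}, all lists the same length.
    weights: {model_name: weight}; defaults to equal weights.
    """
    names = list(predictions)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    length = len(predictions[names[0]])
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total
        for i in range(length)
    ]


# Made-up log-price predictions from two hypothetical models.
blended = weighted_average(
    {"catboost": [10.0, 11.0], "xgboost": [10.4, 10.6]},
    weights={"catboost": 3.0, "xgboost": 1.0},
)
print(blended)
```

If the models predict log prices, blending on the log scale and applying the inverse transform once afterwards keeps the pipeline consistent with the prediction cell.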

---

**Created**: 2025-10-08
**Model**: CatBoost
**Platform**: Kaggle GPU P100
**Ready to train!** 🚀