# XGBoost Model Creation & Export

## Goal
This notebook focuses on **training a production-ready XGBoost model** and **saving it**. 
In a real-world workflow, you separate *experimentation/analysis* (previous notebooks) from *model artifact creation* (this notebook).

Steps:
1. Load the full dataset.
2. Train an XGBoost model, tuning hyperparameters with a small grid search.
3. **Evaluate** the tuned model on a held-out test set.
4. **Serialize (save)** the model to a file.
5. **Verify** that the saved model loads and predicts.

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # For saving the model (pickle wrapper)
import os

## 1. Load Data

In [None]:
# Load data
df = pd.read_csv("../cement_qc_data.csv", skiprows=4)
df.columns = df.columns.str.strip()

# Features & Target
features = ['CaO', 'SiO2', 'Al2O3', 'Fe2O3', 'MgO', 'SO3', 'K2O', 'Na2O', 'P2O5', 'TiO2']
target = 'Res 45 um'

data = df[features + [target]].dropna()
X = data[features]
y = data[target]

# Split (We still split to evaluate performance before saving)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
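
The effect of `skiprows=4` and the header clean-up can be seen on a tiny in-memory file. This is a sketch only: the four preamble lines and the values below are made up for illustration and are not taken from `cement_qc_data.csv`.

```python
import io

import pandas as pd

# Hypothetical layout: 4 preamble lines, then a header with stray spaces.
raw = (
    "Lab report\n"
    "Plant: example\n"
    "Period: example\n"
    "Units: percent\n"
    " CaO , SiO2 ,Res 45 um \n"
    "63.1,20.4,12.5\n"
    "62.8,,13.0\n"
)

df_demo = pd.read_csv(io.StringIO(raw), skiprows=4)
df_demo.columns = df_demo.columns.str.strip()  # ' CaO ' -> 'CaO'

# Fail early with a readable message if an expected column is absent.
required = ["CaO", "SiO2", "Res 45 um"]
missing = [c for c in required if c not in df_demo.columns]
if missing:
    raise KeyError(f"Missing columns: {missing}")

clean = df_demo[required].dropna()  # drops the row with the empty SiO2 value
print(list(df_demo.columns))
print(f"Rows before dropna: {len(df_demo)}, after: {len(clean)}")
```

Without the `str.strip()` step, the lookup `df[features + [target]]` would fail on headers that carry leading or trailing spaces.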

## 2. Train Optimized Model

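We use `GridSearchCV` with 3-fold cross-validation to pick hyperparameters from a small grid, so the saved model is tuned rather than left at defaults. With the default `refit=True`, `best_estimator_` is retrained on all of `X_train` using the best parameter combination.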

In [None]:
# Parameter grid (simplified from experimentation)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100, 200]
}

# random_state is the scikit-learn-style parameter for reproducibility
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

grid_search = GridSearchCV(estimator=xgb_reg, param_grid=param_grid, cv=3, scoring='r2', verbose=1)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

print("Best Params:", grid_search.best_params_)
# best_score_ is the mean cross-validated R2 of the best candidate, not a test-set score
print("Best CV R2:", grid_search.best_score_)
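
For a sense of cost: the grid above has 3 x 2 x 2 = 12 parameter combinations, each fitted on 3 folds. A quick standard-library sketch of that arithmetic (the names `candidates` and `n_fits` are illustrative):

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}
cv_folds = 3

# Every combination of the listed values is one candidate model.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits")
print("First candidate:", candidates[0])
```

`GridSearchCV` then performs one additional fit when it retrains the best candidate on the full training set. Adding a value to any list multiplies the total, so the grid is worth keeping small.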

## 3. Final Evaluation
Check performance on the held-out test set.

In [None]:
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse:.4f}")
print(f"Test R2: {r2:.4f}")
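
To make the two metrics concrete: MSE is the mean of the squared errors, and `R2 = 1 - SS_res / SS_tot`. The sketch below recomputes both by hand on made-up numbers (not results from this dataset) and adds RMSE, which is in the same units as the target.

```python
import math

# Made-up values for illustration only.
y_true = [10.0, 12.0, 14.0, 16.0]
y_hat = [11.0, 12.0, 13.0, 17.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares: error of the model's predictions.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Total sum of squares: error of always predicting the mean.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse_demo = ss_res / n
rmse_demo = math.sqrt(mse_demo)
r2_demo = 1 - ss_res / ss_tot

print(f"MSE:  {mse_demo:.4f}")
print(f"RMSE: {rmse_demo:.4f}")
print(f"R2:   {r2_demo:.4f}")
```

An R2 of 0 means the model does no better than predicting the mean of `y_test`; negative values mean it does worse.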

## 4. Save the Model

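We save the model to a `.pkl` (Python pickle) file using `joblib`. The file holds the entire fitted `XGBRegressor` object: its hyperparameters and the trained trees. Loading it later requires a Python environment with compatible versions of `xgboost` and `scikit-learn`, and pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.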

In [None]:
model_filename = 'xgboost_cement_model.pkl'

# Save
joblib.dump(best_model, model_filename)

print(f"Model saved to {os.path.abspath(model_filename)}")

## 5. Verification: Load and Predict
Let's pretend we are in a new application. We load the file and make a prediction.

In [None]:
# Load
loaded_model = joblib.load(model_filename)

# Create a sample input (using our test data for convenience)
sample_input = X_test.iloc[[0]]

# Predict
prediction = loaded_model.predict(sample_input)

print("Sample Input:")
print(sample_input)
print(f"\nPrediction: {prediction[0]:.4f}")
print(f"Actual: {y_test.iloc[0]:.4f}")
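
In a real application the input would not come from `X_test` but from a new measurement. Since the model was trained on a DataFrame with a fixed column order, one way to guard against missing or misordered fields is a small helper like the one below. `build_feature_row` and the sample values are hypothetical, shown only to illustrate the idea.

```python
FEATURES = ['CaO', 'SiO2', 'Al2O3', 'Fe2O3', 'MgO', 'SO3', 'K2O', 'Na2O', 'P2O5', 'TiO2']


def build_feature_row(measurement, features=FEATURES):
    """Return the values of `measurement` in training-column order."""
    missing = [name for name in features if name not in measurement]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [float(measurement[name]) for name in features]


# Made-up oxide percentages, deliberately given in a different key order.
sample = {'SiO2': 20.5, 'CaO': 63.0, 'Al2O3': 5.1, 'Fe2O3': 3.2, 'MgO': 1.8,
          'SO3': 2.9, 'TiO2': 0.3, 'P2O5': 0.1, 'Na2O': 0.2, 'K2O': 0.6}

row = build_feature_row(sample)
print(row)
```

The row can then be wrapped as `pd.DataFrame([row], columns=FEATURES)` and passed to `loaded_model.predict`, so the loaded model sees the same column names and order it was trained on.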