# **💎Diamond Price Prediction💎**

## Model Development
**Steps involved in Model Building**:
- Setting up features and target
- Build a pipeline of preprocessing (one-hot encoding and standard scaling) and model for six different regressors
- Fit all the models on training data
- Get the mean cross-validation score on the training set for every model (the helper below uses `scoring='r2'` rather than negative root mean squared error)
- Pick the model with the best cross-validation score
- Fit the best model on the training set and get
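
The steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's actual run: the data come from `make_regression`, only two regressors are compared, the pipeline contains just a scaler, and the helper name `mean_cv_rmse` is hypothetical. It scores with scikit-learn's `neg_root_mean_squared_error`, which returns negated RMSE so that larger is always better.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diamonds training set (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "K-Neighbors Regressor": KNeighborsRegressor(n_neighbors=5),
}


def mean_cv_rmse(model, X, y, cv=5):
    """Return the mean cross-validated RMSE of scaler + model (lower is better)."""
    pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return -scores.mean()  # the scorer returns negated RMSE, so flip the sign


cv_rmse = {name: mean_cv_rmse(model, X_demo, y_demo) for name, model in candidates.items()}
best_name = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("Best by CV RMSE:", best_name)
```

Picking the model by cross-validated score keeps the test set out of the selection step, so the final test metrics remain an unbiased check.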

In [3]:
# Basic Libraries
import numpy as np
import pandas as pd

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Modelling
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR

# Metric and Model Selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import warnings
warnings.filterwarnings('ignore')

### Train Test Split

In [36]:
# Load the cleaned diamonds dataset
data_label = pd.read_csv('./data/diamonds-clean.csv')

In [37]:
from sklearn.model_selection import train_test_split

X = data_label.drop(['price'], axis=1)
y = data_label['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [38]:
print(f'Total # of sample in whole dataset: {len(X)}')
print(f'Total # of sample in train dataset: {len(X_train)}')
print(f'Total # of sample in test dataset: {len(X_test)}')

Total # of sample in whole dataset: 51938
Total # of sample in train dataset: 41550
Total # of sample in test dataset: 10388
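
The split sizes printed above can be reproduced by hand. The sketch below assumes that, for a float `test_size` with no `train_size`, `train_test_split` rounds the test count up and gives the remainder to the training set; that assumption matches the printed numbers.

```python
import math

n_samples = 51938  # size of the whole dataset, from the output above
test_size = 0.2

# Assumption: the test count is rounded up and the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 41550 10388
```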


### Encoding & Standardization

In [39]:
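# Create a ColumnTransformer with two transformers: one-hot encoding and standard scaling
# Columns read from a CSV never have dtype "category", so split on numeric vs. non-numeric
num_features = X.select_dtypes(include="number").columns
cat_features = X.select_dtypes(exclude="number").columns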

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

preprocessor

### **Find the Best Model**

**Create the Evaluate Function**

In [40]:
# Define the evaluation function
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2 = r2_score(true, predicted)
    return mae, rmse, r2
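
As a worked example of the three metrics, the sketch below computes them by hand in plain Python for three made-up prices. It mirrors what `evaluate_model` returns but does not call scikit-learn.

```python
import math

true = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

errors = [p - t for t, p in zip(true, predicted)]  # [10, -10, 30]
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))  # root mean squared error

mean_true = sum(true) / len(true)
ss_res = sum(e ** 2 for e in errors)  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in true)  # total sum of squares
r2 = 1 - ss_res / ss_tot  # coefficient of determination

print(round(mae, 4), round(rmse, 4), round(r2, 4))  # 16.6667 19.1485 0.945
```

RMSE is larger than MAE here because squaring weights the single 30-unit miss more heavily than the two 10-unit misses.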

**Create the Helper Function for training and evaluating models**

In [41]:
# Helper function for training and evaluating models
def run_model(model_name, pipeline, param_grid):
    # Setup GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1)

    # Fit model
    grid_search.fit(X_train, y_train)

    # Get the best model
    best_model = grid_search.best_estimator_

    # Evaluate on training set
    y_train_pred = best_model.predict(X_train)
    train_mae, train_rmse, train_r2 = evaluate_model(y_train, y_train_pred)

    # Evaluate on test set
    y_test_pred = best_model.predict(X_test)
    test_mae, test_rmse, test_r2 = evaluate_model(y_test, y_test_pred)

    # Print results
    print(f"{model_name} Best Hyperparameters: {grid_search.best_params_}")
    print(f"Training set performance:\n - MAE: {train_mae:.4f}\n - RMSE: {train_rmse:.4f}\n - R2: {train_r2:.4f}")
    print(f"Test set performance:\n - MAE: {test_mae:.4f}\n - RMSE: {test_rmse:.4f}\n - R2: {test_r2:.4f}")
    print('=' * 50)
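
`run_model` prints training and test metrics, but not the cross-validated score that `GridSearchCV` used to choose the hyperparameters. A fitted search exposes that score as `best_score_`. The sketch below shows this on synthetic data; the names `demo_pipeline`, `demo_grid`, and `demo_search` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", Lasso())])
demo_grid = {"model__alpha": [0.1, 1.0, 10.0]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="r2")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # hyperparameters with the highest mean CV score
print(round(demo_search.best_score_, 4))  # mean 5-fold R2 for those hyperparameters
```

Comparing models on `best_score_` uses only the training data, which leaves the test set for a final check of the chosen model.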

**Define the Model and Hyperparameters**

In [44]:
# Define the models and hyperparameters for each model
models = {
    "Linear Regression": {
        "model": LinearRegression(),
        "params": {
            "model__fit_intercept": [True, False]
        }
    },
    "Lasso": {
        "model": Lasso(),
        "params": {
            "model__alpha": [0.1, 1.0, 10.0]
        }
    },
    "K-Neighbors Regressor": {
        "model": KNeighborsRegressor(),
        "params": {
            "model__n_neighbors": [3, 5, 7]
        }
    },
    "Random Forest Regressor": {
        "model": RandomForestRegressor(),
        "params": {
            "model__n_estimators": [50, 100],
            "model__max_depth": [None, 5, 10],
            "model__max_features": [1.0, 5, 7, 8],  # 1.0 = all features; "auto" was removed in scikit-learn 1.3
        }
    },
    "XGBRegressor": {
        "model": XGBRegressor(),
        "params": {
            "model__n_estimators": [50, 100],
            "model__learning_rate": [0.01, 0.1, 0.3]
        }
    },
    "CatBoost": {
        "model": CatBoostRegressor(verbose=False),
        "params": {
            "model__depth": [6, 8],
            "model__learning_rate": [0.01, 0.1],
            "model__iterations": [100, 200]
        }
    }
}
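
The `model__` prefix in the grids above tells the pipeline which step a parameter belongs to: `<step name>__<parameter>`. A minimal sketch, using a hypothetical two-step `demo_pipeline`:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", Lasso())])

# "model__alpha" means: parameter `alpha` of the step named "model".
demo_pipeline.set_params(model__alpha=10.0)
print(demo_pipeline.named_steps["model"].alpha)  # 10.0

# Every tunable key follows the same <step>__<parameter> pattern.
print("model__alpha" in demo_pipeline.get_params())  # True
```

If the step were named differently in the `Pipeline`, the keys in each `params` dictionary would have to change to match.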


In [45]:
# Run the models
for model_name, model_dict in models.items():
    model = model_dict['model']
    param_grid = model_dict['params']

    # Create pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),  # ColumnTransformer defined above
        ('model', model)
    ])

    # Train and evaluate
    run_model(model_name, pipeline, param_grid)

Linear Regression Best Hyperparameters: {'model__fit_intercept': True}
Training set performance:
 - MAE: 694.2722
 - RMSE: 1041.0746
 - R2: 0.9294
Test set performance:
 - MAE: 699.6570
 - RMSE: 1050.4327
 - R2: 0.9278
Lasso Best Hyperparameters: {'model__alpha': 0.1}
Training set performance:
 - MAE: 693.3417
 - RMSE: 1041.2171
 - R2: 0.9294
Test set performance:
 - MAE: 698.7602
 - RMSE: 1050.5554
 - R2: 0.9278
K-Neighbors Regressor Best Hyperparameters: {'model__n_neighbors': 5}
Training set performance:
 - MAE: 324.3995
 - RMSE: 638.0807
 - R2: 0.9735
Test set performance:
 - MAE: 409.0384
 - RMSE: 784.2999
 - R2: 0.9597
Random Forest Regressor Best Hyperparameters: {'model__max_depth': None, 'model__max_features': 8, 'model__n_estimators': 100}
Training set performance:
 - MAE: 107.7902
 - RMSE: 227.1172
 - R2: 0.9966
Test set performance:
 - MAE: 296.1412
 - RMSE: 619.8238
 - R2: 0.9749
XGBRegressor Best Hyperparameters: {'model__learning_rate': 0.3, 'model__n_estimators': 100}
T

Note:

Based on these results, I want to compare two models: **Random Forest Regression** and **XGBoost Regression**.

- **`RandomForestRegressor`** ➡ Random Forest *has excellent training performance, but the gap between training and test performance (RMSE 227.12 vs. 619.82) suggests some overfitting*. Its *test `R²` is slightly lower* than XGBoost's, and *its test `RMSE` is higher*.

- **`XGBRegressor`** ➡ XGBoost *performs slightly worse on the training set but better on the test set* than Random Forest. Its *test `RMSE` is lower* and its *test `R²` is higher*, indicating **better generalization**.

Based on these observations, I choose **`XGBoost` as the final model.**
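
The train/test gap can be quantified from the metrics printed above. The sketch below uses the ratio of test RMSE to training RMSE as a rough overfitting indicator. This ratio is an illustrative choice, not a standard threshold. XGBoost and CatBoost are omitted because their output is cut off in the results.

```python
# (train RMSE, test RMSE) copied from the printed results above.
rmse = {
    "Linear Regression": (1041.0746, 1050.4327),
    "Lasso": (1041.2171, 1050.5554),
    "K-Neighbors Regressor": (638.0807, 784.2999),
    "Random Forest Regressor": (227.1172, 619.8238),
}

# Ratio of test to train RMSE: values near 1 mean similar error on both sets.
gap = {name: test / train for name, (train, test) in rmse.items()}
for name, ratio in sorted(gap.items(), key=lambda item: item[1]):
    print(f"{name}: test/train RMSE = {ratio:.2f}")
```

Random Forest has the largest ratio of the four, yet also the lowest test RMSE among them, so a large gap alone does not mean a model predicts worse on unseen data.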

### **Save the Best Model**

In [46]:
import pickle

# `preprocessor` is the ColumnTransformer defined above; the XGBRegressor uses the best hyperparameters from the grid search
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(learning_rate=0.3, n_estimators=100))  # Best hyperparameters
])

# Fit the pipeline with the full training data
final_pipeline.fit(X_train, y_train)

# Save the pipeline (model + preprocessor) to a .pkl file
with open('./data/final_model_pipeline.pkl', 'wb') as f:
    pickle.dump(final_pipeline, f)

print("Model and preprocessor saved as final_model_pipeline.pkl")

Model and preprocessor saved as final_model_pipeline.pkl
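
A quick way to confirm that a pickled pipeline survives the round trip is to reload it and compare predictions. The sketch below does this with a small stand-in pipeline (`LinearRegression` instead of `XGBRegressor`) on made-up data, writing to a temporary directory; all names in it are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset and a stand-in pipeline.
X_small = np.array([[0.2], [0.4], [0.6], [0.8]])
y_small = np.array([300.0, 700.0, 1100.0, 1500.0])
small_pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
small_pipeline.fit(X_small, y_small)

# Write to a temporary directory, read back, and compare predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(small_pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = np.allclose(small_pipeline.predict(X_small), restored.predict(X_small))
print("Predictions identical after reload:", same)
```

Pickled estimators are generally only safe to load with the same library versions that created them, and only from trusted sources.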


**Test on New Data**

In [66]:
import pickle

# Load the saved pipeline
with open('./data/final_model_pipeline.pkl', 'rb') as f:
    loaded_pipeline = pickle.load(f)

# New input data as a DataFrame with the same feature columns used in training
new_data = pd.DataFrame({'carat': [0.23, 0.31, 0.75],
                         'cut': ['Ideal', 'Good', 'Ideal'],
                         'color': ['E', 'J', 'D'],
                         'clarity': ['SI2', 'SI2', 'SI2'],
                         'depth': [61.5, 63.3, 62.2],
                         'table': [55.0, 58.0, 55.0],
                         'x': [3.95, 4.34, 5.83],
                         'y': [3.98, 4.35, 5.87],
                         'z': [2.43, 2.75, 3.64],
                         })

# Predict using the loaded pipeline
predictions = loaded_pipeline.predict(new_data)

# Print the predictions
print("Predicted values:", predictions)

Predicted values: [ 314.3677   368.07376 2811.642  ]
