# **💎Diamond Carat Prediction💎**

## **Basic Knowledge**

#### **Context**
This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It has 10 attributes, including `price`, the dataset's usual target; this notebook predicts `carat` instead.


#### **Content**
**Features Description:**
- **`price`** ➡ in US dollars (\$326 -- \$18,823)
- **`carat`** ➡ weight of the diamond (0.2 -- 5.01)
- **`cut`** ➡ quality of the diamond's cut (Fair, Good, Very Good, Premium, and Ideal).
- **`color`** ➡ the color of the diamond, from J (worst) to D (best).
- **`clarity`** ➡ a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- **`x`** ➡ Length in mm (0 -- 10.74).
- **`y`** ➡ Width in mm (0 -- 58.9).
- **`z`** ➡ Depth in mm (0 -- 31.8).
- **`depth`** ➡ total depth percentage = `z / mean(x, y) = 2 * z / (x + y)` (43 -- 79).
- **`table`** ➡ width of top of diamond relative to widest point (43 -- 95).
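
The `depth` definition can be checked against the dimensions directly. The following is a minimal sketch; `depth_percentage` is a hypothetical helper name, and the sample values are the first row of the dataset (recorded depth 61.5). The recomputed value differs slightly from the recorded one, presumably because `x`, `y`, and `z` are rounded to two decimals.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)


# First row of the dataset: x=3.95, y=3.98, z=2.43 (recorded depth: 61.5)
computed = depth_percentage(3.95, 3.98, 2.43)
print(round(computed, 2))  # 61.29
```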

#### **Resource**
- 🔗 [Kaggle - Diamond Price](https://www.kaggle.com/datasets/shivam2503/diamonds)
- 🔗 [tidyverse/ggplot2](https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv)

## **Import Libraries**

In [1]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [2]:
# Basic Libraries
import numpy as np
import pandas as pd

# Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Modelling
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.neighbors import KNeighborsRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR

# Metric and Model Selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import warnings
warnings.filterwarnings('ignore')

## **Loading Data**

In [3]:
data = pd.read_csv('diamonds-clean.csv')
data.head()

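|   | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 3 | 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 4 | 5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |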


In [4]:
data.drop(columns=['Unnamed: 0'], inplace=True)

In [5]:
data.head()

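|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 3 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 4 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |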


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51938 entries, 0 to 51937
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    51938 non-null  float64
 1   cut      51938 non-null  object 
 2   color    51938 non-null  object 
 3   clarity  51938 non-null  object 
 4   depth    51938 non-null  float64
 5   table    51938 non-null  float64
 6   price    51938 non-null  int64  
 7   x        51938 non-null  float64
 8   y        51938 non-null  float64
 9   z        51938 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.0+ MB


## Model Development
**Steps involved in model building**:
- Set up the features and the target (`carat`).
- Build a pipeline of preprocessing (one-hot encoding and standard scaling) and a model for six different regressors.
- Tune each model on the training data with a 5-fold cross-validated grid search scored by R².
- Compare MAE, RMSE, and R² on the training and test sets.
- Pick the model that generalizes best.
- Fit the best model on the training set and save the fitted pipeline.
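
The workflow in this list can be sketched on a small scale. The example below is illustrative only: it uses synthetic data as a stand-in for the diamond features (an assumption, not the real dataset), compares two candidate regressors with 5-fold cross-validation, and scores them with R², the metric the notebook's grid search uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)

candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

cv_scores = {}
for name, estimator in candidates.items():
    # Same shape as the notebook's pipelines: preprocessing step, then model
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    # Mean 5-fold cross-validated R^2 on the (synthetic) training data
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2").mean()

# Pick the candidate with the highest mean cross-validation score
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```

Because the synthetic target is linear in the features, the linear model is expected to win here; on the real data the ranking is different.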



### Train Test Split

In [7]:
# Make copy to avoid changing original data
data_label = data.copy()

In [8]:
# Features are all columns except `carat`, which is the prediction target
X = data_label.drop(['carat'], axis=1)
y = data_label['carat']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
print(f'Total # of sample in whole dataset: {len(X)}')
print(f'Total # of sample in train dataset: {len(X_train)}')
print(f'Total # of sample in test dataset: {len(X_test)}')

Total # of sample in whole dataset: 51938
Total # of sample in train dataset: 41550
Total # of sample in test dataset: 10388
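
The split sizes follow from simple arithmetic. As a sketch, assuming the behavior of `train_test_split` with a fractional `test_size` and no `train_size` (the test count is rounded up and the training set receives the remainder):

```python
import math

n_samples = 51938
test_size = 0.2

# The test share is rounded up; the training set gets the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 41550 10388
```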


### Encoding & Standardization

In [10]:
# Create a ColumnTransformer with two transformers: one-hot encoding for categorical columns, scaling for numeric columns
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

preprocessor

### **Find the Best Model**

**Create the Evaluate Function**

In [11]:
# Define the evaluation function
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2 = r2_score(true, predicted)
    return mae, rmse, r2
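
For reference, the three metrics can also be computed by hand. This sketch uses four true carat values from the dataset and made-up predictions (`y_pred` is hypothetical) to show the formulas behind `mean_absolute_error`, the square root of `mean_squared_error`, and `r2_score`.

```python
import numpy as np

y_true = np.array([0.23, 0.21, 0.29, 0.31])
y_pred = np.array([0.25, 0.20, 0.30, 0.28])  # hypothetical predictions

errors = y_true - y_pred

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))

# Root mean squared error: penalizes large errors more strongly
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```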

**Create the Helper Function for training and evaluating models**

In [12]:
# Helper function for training and evaluating models
def run_model(model_name, pipeline, param_grid):
    # Setup GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1)

    # Fit model
    grid_search.fit(X_train, y_train)

    # Get the best model
    best_model = grid_search.best_estimator_

    # Evaluate on training set
    y_train_pred = best_model.predict(X_train)
    train_mae, train_rmse, train_r2 = evaluate_model(y_train, y_train_pred)

    # Evaluate on test set
    y_test_pred = best_model.predict(X_test)
    test_mae, test_rmse, test_r2 = evaluate_model(y_test, y_test_pred)

    # Print results
    print(f"{model_name} Best Hyperparameters: {grid_search.best_params_}")
    print(f"Training set performance:\n - MAE: {train_mae:.4f}\n - RMSE: {train_rmse:.4f}\n - R2: {train_r2:.4f}")
    print(f"Test set performance:\n - MAE: {test_mae:.4f}\n - RMSE: {test_rmse:.4f}\n - R2: {test_r2:.4f}")
    print('=' * 50)

**Define the Model and Hyperparameters**

In [13]:
# Define the models and hyperparameters for each model
models = {
    "Linear Regression": {
        "model": LinearRegression(),
        "params": {
            "model__fit_intercept": [True, False]
        }
    },
    "Lasso": {
        "model": Lasso(),
        "params": {
            "model__alpha": [0.1, 1.0, 10.0]
        }
    },
    "K-Neighbors Regressor": {
        "model": KNeighborsRegressor(),
        "params": {
            "model__n_neighbors": [3, 5, 7]
        }
    },
    "Random Forest Regressor": {
        "model": RandomForestRegressor(),
        "params": {
            "model__n_estimators": [50, 100],
            "model__max_depth": [None, 5, 10],
            # "auto" was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
            "model__max_features": [1.0, 5, 7, 8],
        }
    },
    "XGBRegressor": {
        "model": XGBRegressor(),
        "params": {
            "model__n_estimators": [50, 100],
            "model__learning_rate": [0.01, 0.1, 0.3]
        }
    },
    "CatBoost": {
        "model": CatBoostRegressor(verbose=False),
        "params": {
            "model__depth": [6, 8],
            "model__learning_rate": [0.01, 0.1],
            "model__iterations": [100, 200]
        }
    }
}
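
The `model__` prefix in these grids is how scikit-learn routes a parameter to a named pipeline step: `<step name>__<parameter>`. A minimal sketch, assuming a two-step pipeline whose final step is named `model` (`demo_pipeline` is an illustrative name):

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", Lasso()),
])

# "model__alpha" means: parameter `alpha` of the step named "model"
demo_pipeline.set_params(model__alpha=0.1)
print(demo_pipeline.named_steps["model"].alpha)  # 0.1
```

`GridSearchCV` applies each grid entry the same way, which is why the keys must match the step name used when the pipeline is built.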


In [14]:
# Run the models
for model_name, model_dict in models.items():
    model = model_dict['model']
    param_grid = model_dict['params']

    # Create pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),  # ColumnTransformer defined above
        ('model', model)
    ])

    # Train and evaluate
    run_model(model_name, pipeline, param_grid)

Linear Regression Best Hyperparameters: {'model__fit_intercept': True}
Training set performance:
 - MAE: 0.0390
 - RMSE: 0.0569
 - R2: 0.9844
Test set performance:
 - MAE: 0.0387
 - RMSE: 0.0563
 - R2: 0.9846
Lasso Best Hyperparameters: {'model__alpha': 0.1}
Training set performance:
 - MAE: 0.0865
 - RMSE: 0.1232
 - R2: 0.9269
Test set performance:
 - MAE: 0.0857
 - RMSE: 0.1214
 - R2: 0.9284
K-Neighbors Regressor Best Hyperparameters: {'model__n_neighbors': 7}
Training set performance:
 - MAE: 0.0240
 - RMSE: 0.0372
 - R2: 0.9933
Test set performance:
 - MAE: 0.0279
 - RMSE: 0.0424
 - R2: 0.9913
Random Forest Regressor Best Hyperparameters: {'model__max_depth': None, 'model__max_features': 8, 'model__n_estimators': 100}
Training set performance:
 - MAE: 0.0025
 - RMSE: 0.0046
 - R2: 0.9999
Test set performance:
 - MAE: 0.0067
 - RMSE: 0.0115
 - R2: 0.9994
XGBRegressor Best Hyperparameters: {'model__learning_rate': 0.1, 'model__n_estimators': 100}
Training set performance:
 - MAE: 0.0

Note:

Based on the results, two models are worth a closer look: **Random Forest Regression** and **XGBoost Regression**.

- **`RandomForestRegressor`** ➡ Random Forest *has excellent training performance, but the gap between training and test performance (RMSE 0.0046 vs. 0.0115) suggests some overfitting*. The *`R²` on the test set is slightly lower* than XGBoost's, and *the `RMSE` is higher*.

- **`XGBRegressor`** ➡ XGBoost *performs slightly worse on the training set but better on the test set* than Random Forest. The *`RMSE` is lower on the test set* and the *`R²` score is higher*, indicating **better generalization**.

Based on these observations, I will choose **`XGBoost` as the final model.**
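
One simple way to quantify the train/test gap discussed in this note is the ratio of test RMSE to training RMSE. The sketch below uses the RMSE values printed in the results above; the XGBoost figures are cut off in that output, so they are left out rather than guessed. The ratio is an illustrative heuristic, not a formal overfitting test.

```python
# (train RMSE, test RMSE) copied from the printed results
reported_rmse = {
    "Linear Regression": (0.0569, 0.0563),
    "K-Neighbors Regressor": (0.0372, 0.0424),
    "Random Forest Regressor": (0.0046, 0.0115),
}

# A ratio near 1 means similar error on both sets; larger values mean a wider gap
gaps = {name: round(test / train, 2) for name, (train, test) in reported_rmse.items()}

for name, ratio in gaps.items():
    print(f"{name}: test/train RMSE ratio = {ratio}")
```

Random Forest shows the widest relative gap of the three, even though its absolute test error is the lowest among them.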

### **Save the Best Model**

In [15]:
import pickle

# Rebuild the pipeline from `preprocessor` and an XGBRegressor with the best hyperparameters found above
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(learning_rate=0.1, n_estimators=100))  # Best hyperparameters
])

# Fit the pipeline with the full training data
final_pipeline.fit(X_train, y_train)

# Save the pipeline (model + preprocessor) to a .pkl file
with open('final_model_pipeline_carat_xgb.pkl', 'wb') as f:
    pickle.dump(final_pipeline, f)

print("Model and preprocessor saved as final_model_pipeline_carat_xgb.pkl")

Model and preprocessor saved as final_model_pipeline_carat_xgb.pkl


In [18]:
data_label.tail(10)

|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 51928 | 0.71 | Premium | E | SI1 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 |
| 51929 | 0.71 | Premium | F | SI1 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 |
| 51930 | 0.7 | Very Good | E | VS2 | 60.5 | 59.0 | 2757 | 5.71 | 5.76 | 3.47 |
| 51931 | 0.7 | Very Good | E | VS2 | 61.2 | 59.0 | 2757 | 5.69 | 5.72 | 3.49 |
| 51932 | 0.72 | Premium | D | SI1 | 62.7 | 59.0 | 2757 | 5.69 | 5.73 | 3.58 |
| 51933 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.5 |
| 51934 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 51935 | 0.7 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 51936 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 51937 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |


In [19]:
import pickle

# Load the saved pipeline
with open('final_model_pipeline_carat_xgb.pkl', 'rb') as f:
    loaded_pipeline = pickle.load(f)

# Assuming `new_data` is the new input data (as a DataFrame)
new_data = pd.DataFrame({'price': [2757, 2756, 335],
                         'cut': ['Ideal', 'Premium', 'Ideal'],
                         'color': ['E', 'D', 'D'],
                         'clarity': ['VS2', 'SI2', 'SI2'],
                         'depth': [59.8, 60.5, 62.2],
                         'table': [62.0, 59.0, 55.0],
                         'x': [5.74, 5.71, 5.83],
                         'y': [5.73, 5.76, 5.87],
                         'z': [3.43, 3.47, 3.64],
                         })

# Predict using the loaded pipeline
predictions = loaded_pipeline.predict(new_data)

# Print the predictions
print("Predicted values:", predictions)

Predicted values: [0.6962865 0.704815  0.7290879]
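
A quick sanity check after saving a pipeline is to confirm that the restored object predicts exactly like the original. The sketch below does this in memory with a small stand-in pipeline on synthetic data (both are assumptions for illustration); the same comparison could be applied to the saved file and `X_test`.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small stand-in pipeline (not the notebook's XGBoost model)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1]

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
demo_pipeline.fit(X_demo, y_demo)

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file
restored = pickle.loads(pickle.dumps(demo_pipeline))

# The restored pipeline should reproduce the original predictions
same = np.allclose(demo_pipeline.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

Note that pickled scikit-learn objects are generally only safe to load with the same library versions that created them.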


In [None]:
data_label.describe()

|   | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 51938.0 | 51938.0 | 51938.0 | 51938.0 | 51938.0 | 51938.0 | 51938.0 |
| mean | 0.783248 | 61.756207 | 57.343101 | 3865.76676 | 5.702206 | 5.70586 | 3.522127 |
| std | 0.455244 | 1.220353 | 2.050718 | 3915.740917 | 1.100802 | 1.093949 | 0.679106 |
| min | 0.2 | 58.0 | 51.6 | 326.0 | 3.73 | 3.68 | 1.07 |
| 25% | 0.4 | 61.1 | 56.0 | 936.0 | 4.7 | 4.71 | 2.9 |
| 50% | 0.7 | 61.8 | 57.0 | 2362.5 | 5.68 | 5.69 | 3.51 |
| 75% | 1.04 | 62.5 | 59.0 | 5280.0 | 6.52 | 6.52 | 4.03 |
| max | 2.3 | 65.5 | 63.5 | 18823.0 | 8.71 | 8.68 | 5.34 |
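
These minimum and maximum values describe the region the model was trained on, so they can serve as a rough guard before predicting on new inputs. The sketch below is one possible check; `training_ranges` and `out_of_range` are hypothetical names, and the bounds are copied from the `describe()` output.

```python
# (min, max) per numeric feature, copied from the describe() output
training_ranges = {
    "depth": (58.0, 65.5),
    "table": (51.6, 63.5),
    "price": (326.0, 18823.0),
    "x": (3.73, 8.71),
    "y": (3.68, 8.68),
    "z": (1.07, 5.34),
}


def out_of_range(row: dict) -> list:
    """Return the names of columns whose value lies outside the training range."""
    return [
        col for col, (low, high) in training_ranges.items()
        if col in row and not (low <= row[col] <= high)
    ]


inside = {"depth": 59.8, "table": 62.0, "price": 2757, "x": 5.74, "y": 5.73, "z": 3.43}
outside = {"depth": 70.0, "table": 62.0, "price": 2757, "x": 5.74, "y": 5.73, "z": 3.43}

print(out_of_range(inside))   # []
print(out_of_range(outside))  # ['depth']
```

A flagged column does not make a prediction invalid, but tree-based models cannot extrapolate beyond the values seen in training, so such predictions deserve extra caution.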
