<a href="https://www.kaggle.com/code/eugniodias/diamonds-price-predictor?scriptVersionId=126924573" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Diamonds price predictor with RMSE = 517 and r2 score = 98%

The dataset contains the following columns:

**price**: price in US dollars (\$326--\$18,823)

**carat**: weight of the diamond (0.2--5.01)

**cut**: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

**color**: diamond colour, from J (worst) to D (best)

**clarity**: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**x**: length in mm (0--10.74)

**y**: width in mm (0--58.9)

**z**: depth in mm (0--31.8)

**depth**: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

**table**: width of top of diamond relative to widest point (43--95)
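
As a quick sanity check of the depth definition above, the percentage can be recomputed from x, y and z. This is a minimal sketch on made-up measurements: the helper name `depth_pct` and the sample values are illustrative and not taken from the dataset. The factor of 100 is an assumption implied by the 43--79 range.

```python
def depth_pct(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)

# Hypothetical stone: 4.0 mm long, 4.0 mm wide, 2.5 mm deep
print(depth_pct(4.0, 4.0, 2.5))  # 62.5
```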

## EDA

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import mean_squared_error, r2_score

import seaborn as sns
import xgboost as xgb

Loading the dataset

In [None]:
data = pd.read_csv('/kaggle/input/diamonds/diamonds.csv')
data.head()

In [None]:
# Drop the first column, which only holds a row index in this CSV
data = data.drop(data.columns[0], axis=1)
data.head()

In [None]:
data.info()

We see that the dataset has 53940 rows and no missing values.

In [None]:
data.describe()

Some rows have x, y or z equal to 0, which is not a valid dimension. We will remove those rows.

In [None]:
# Removing rows where x, y or z is equal to 0
data = data[(data[['x', 'y', 'z']] != 0).all(axis=1)]
data.describe()
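
To see how the row mask works, here is a small sketch on a made-up frame (the `toy` values are hypothetical). A row is kept only if all three dimensions are non-zero.

```python
import pandas as pd

# Made-up measurements; the second and third rows each contain a zero
toy = pd.DataFrame({
    'x': [4.0, 0.0, 5.1],
    'y': [4.1, 3.9, 5.0],
    'z': [2.5, 2.4, 0.0],
})

# True where every one of x, y, z differs from 0
mask = (toy[['x', 'y', 'z']] != 0).all(axis=1)
print(mask.tolist())  # [True, False, False]

toy_clean = toy[mask]
print(toy_clean)
```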

We get the categorical columns.

In [None]:
# Getting the categorical columns
categorical_columns = [col for col in data.columns if data[col].dtype == 'object']
categorical_columns

In [None]:
# Getting the categories of each categorical column
for col in categorical_columns:
    print(col, data[col].unique())

We analyse how many diamonds fall into each category of the categorical columns.

In [None]:
plt.figure(figsize=(5, 5))
sns.countplot(x='cut', data=data)
plt.xlabel('Cut')
plt.ylabel('Count')

In [None]:
plt.figure(figsize=(5, 5))
sns.countplot(x='color', data=data)
plt.xlabel('Color')
plt.ylabel('Count')

In [None]:
plt.figure(figsize=(5, 5))
sns.countplot(x='clarity', data=data)
plt.xlabel('Clarity')
plt.ylabel('Count')

We one-hot encode the categorical columns.

In [None]:
# Doing one-hot encoding for the categorical columns
data = pd.get_dummies(data, columns=categorical_columns)
data.info()
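
Because cut, color and clarity are ordered, an ordinal encoding is a possible alternative to one-hot encoding. The sketch below is not what this notebook does; it only illustrates the idea on a tiny hand-made frame. The `*_order` lists are assumptions built from the column descriptions at the top, assuming the cut values are listed from worst to best, as they are for clarity.

```python
import pandas as pd

# Category orders, worst to best (assumed from the column descriptions)
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Tiny illustrative frame (made-up rows)
toy = pd.DataFrame({
    'cut': ['Ideal', 'Fair', 'Premium'],
    'color': ['D', 'J', 'G'],
    'clarity': ['IF', 'I1', 'VS2'],
})

orders = {'cut': cut_order, 'color': color_order, 'clarity': clarity_order}
encoded = toy.copy()
for col, order in orders.items():
    # Replace each category by its rank in the order (0 = worst)
    encoded[col] = pd.Categorical(toy[col], categories=order, ordered=True).codes

print(encoded)
```

This keeps one column per feature instead of one per category, at the cost of assuming evenly spaced quality levels.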

We now extract the correlation between the price and the other columns.

In [None]:
# Extracting the correlation between price and the other columns
corr = data.corr()['price'].sort_values(ascending=False)
corr

We clearly see that carat, x, y, and z are highly correlated with the price. 

In [None]:
# The four features most correlated with price (index 0 is price itself)
corr_feats = corr.index[1:5]
corr_feats

In [None]:
for feat in corr_feats:
    plt.figure(figsize=(5, 5))
    sns.scatterplot(x=feat, y='price', data=data)
    plt.xlabel(feat)
    plt.ylabel('Price')

We inspect the distributions of those features to identify outliers.

In [None]:
for feat in corr_feats:
    plt.figure(figsize=(5, 5))
    # histplot draws on the current figure, whereas displot creates its own
    sns.histplot(x=feat, data=data)
    plt.xlabel(feat)

In [None]:
for feat in corr_feats:
    plt.figure(figsize=(5, 5))
    sns.boxplot(x=feat, data=data)
    plt.xlabel(feat)

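We'll use the interquartile range (IQR) to remove the outliers: a row is dropped if one of these features lies more than 2 IQRs below the first quartile or above the third quartile.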

In [None]:
# We remove the outliers using the quantile method
for feat in corr_feats:
    q1 = data[feat].quantile(0.25)
    q3 = data[feat].quantile(0.75)
    iqr = q3 - q1
    data = data[(data[feat] >= q1 - 2 * iqr) & (data[feat] <= q3 + 2 * iqr)]

data.info()
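
To make the filtering rule concrete, here is a small sketch of the same fence computation on a made-up series. The helper `iqr_bounds` and the sample values are hypothetical. The multiplier `k=2` mirrors the cell above; 1.5 is the more common convention. Note that the loop above filters sequentially, so each feature's quartiles are computed on rows that survived the previous features.

```python
import pandas as pd

def iqr_bounds(s, k=2):
    """Return the lower and upper fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
low, high = iqr_bounds(s)

# Keep only the values inside the fences
kept = s[(s >= low) & (s <= high)]
print(low, high)
print(kept.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```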

In [None]:
data.shape

## Building the model

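We first run an adversarial validation: a classifier is trained to tell rows of a train split from rows of a test split. If it cannot do better than chance (AUC close to 0.5), the two splits have similar feature distributions.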

In [None]:
# Using random forest to predict the is_train column
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

In [None]:
def adv_validation(data):
    # Randomly split the data and label each row by the split it came from
    data_train, data_test = train_test_split(data, test_size=0.2)
    data_train = data_train.assign(is_train=1)
    data_test = data_test.assign(is_train=0)

    # The classifier must not see the target
    data_adv = pd.concat([data_train, data_test]).drop('price', axis=1)
    features = [col for col in data_adv.columns if col != 'is_train']

    model = RandomForestClassifier(n_estimators=100, max_depth=5)

    skf = StratifiedKFold(n_splits=5, shuffle=True)
    scores = []

    for fold, (train_index, test_index) in enumerate(skf.split(data_adv[features], data_adv['is_train'])):
        X_train, X_test = data_adv[features].iloc[train_index], data_adv[features].iloc[test_index]
        y_train, y_test = data_adv['is_train'].iloc[train_index], data_adv['is_train'].iloc[test_index]

        model.fit(X_train, y_train)
        # Probability that a row belongs to the train split
        preds = model.predict_proba(X_test)[:, 1]

        score = roc_auc_score(y_test, preds)
        print(f"Fold {fold}: AUC score = {score}")
        scores.append(score)

    print(f"Mean AUC score = {np.mean(scores)}")
    return scores


In [None]:
scores = adv_validation(data)

The AUC score lies around 0.5, which shows that the train and test sets have similar distributions. This is expected here, since the split is random.
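
For contrast, the sketch below shows what adversarial validation reports when the two samples do differ. It uses synthetic data only (one feature, with the second sample shifted by three standard deviations); the sizes, the shift and the model settings are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Two synthetic samples of a single feature; the second one is shifted
sample_a = rng.normal(loc=0.0, scale=1.0, size=(500, 1))
sample_b = rng.normal(loc=3.0, scale=1.0, size=(500, 1))

X = np.vstack([sample_a, sample_b])
y = np.array([1] * 500 + [0] * 500)  # 1 = "train", 0 = "test"

clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)

# Mean ROC AUC over 5 stratified folds
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
print(f"AUC with shifted samples = {auc:.3f}")
```

An AUC far above 0.5, as in this synthetic case, would indicate that the classifier can separate the two samples and that their distributions differ.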

We now fit an XGBoost model.

In [None]:
# XGBoost regressor with squared error as the objective (minimizing it also minimizes RMSE)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror')

In [None]:
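# Defining stratified k-fold. Note: price is an integer column, so each distinct
# price is treated as a class and scikit-learn may warn about rare classes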
skf = StratifiedKFold(n_splits=5, shuffle=True)

# Defining the features and the target
features = [col for col in data.columns if col != 'price']
target = 'price'
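
Stratifying directly on `price` treats every distinct price as its own class. One possible alternative, not used in this notebook, is to stratify on quantile bins of the price so that each fold covers the whole price range. The sketch below uses made-up prices; the number of bins and the names `price_bins` and `skf_binned` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Made-up integer prices in the range of the dataset
prices = pd.Series(rng.integers(326, 18824, size=1000), name='price')

# Ten quantile bins used only as stratification labels
price_bins = pd.qcut(prices, q=10, labels=False, duplicates='drop')

skf_binned = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_sizes = []
for fold, (train_idx, test_idx) in enumerate(skf_binned.split(prices.to_frame(), price_bins)):
    test_sizes.append(len(test_idx))
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```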

In [None]:
rmse_list = []
r2_list = []

for fold, (train_index, test_index) in enumerate(skf.split(data[features], data[target])):
    # Split the data
    x_train, x_test = data[features].iloc[train_index], data[features].iloc[test_index]
    y_train, y_test = data[target].iloc[train_index], data[target].iloc[test_index]

    # Train the model
    xgb_model.fit(x_train, y_train)

    # Predict on the held-out fold
    preds = xgb_model.predict(x_test)
    
    # Evaluate the model
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    print(f"Fold {fold}:")
    print(f"RMSE = {rmse}")
    print(f"R2 = {r2}")

    rmse_list.append(rmse)
    r2_list.append(r2)

print()
print(f"Mean RMSE = {np.mean(rmse_list)}")
print(f"Best RMSE = {np.min(rmse_list)}")
print()
print(f"Mean R2 = {np.mean(r2_list)}")
print(f"Best R2 = {np.max(r2_list)}")
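
For reference, the two metrics reported above can be computed by hand. This is a minimal sketch on made-up numbers (the arrays are illustrative, not model output): RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Root mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Coefficient of determination
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"RMSE = {rmse:.4f}")  # RMSE = 0.7071
print(f"R2 = {r2:.2f}")      # R2 = 0.90
```

RMSE is expressed in the units of the target (US dollars here), while R2 is unitless.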