# Kaggle Challenge S3 E8 - ChatGPT to Boost Performance

Challenge: https://www.kaggle.com/competitions/playground-series-s3e8/data  

Discussion from which I took the hint: https://www.kaggle.com/competitions/playground-series-s3e8/discussion/389472  

In this Kaggle challenge I tried using ChatGPT to create new features and improve the model's performance.

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (15, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

import warnings
warnings.simplefilter(action='ignore')

import opendatasets as od

import os 

# Importing Data

In [2]:
url = './playground-series-s3e8/'

train = pd.read_csv(url+'train.csv').drop(columns='id')
test = pd.read_csv(url+'test.csv').drop(columns='id')
original = pd.read_csv('./gemstone-price-prediction/cubic_zirconia.csv').drop(columns='Unnamed: 0')

The original dataset has some missing values in the "depth" feature.  
I obtained my best score by leaving them as they are.

# EDA and Feature Engineering

Cut description:  
Describes the cut quality of the cubic zirconia.   
Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

In [3]:
sorted(train.cut.unique())

['Fair', 'Good', 'Ideal', 'Premium', 'Very Good']

In [4]:
sorted(test.cut.unique()) == sorted(train.cut.unique())

True

Clarity description:  

Clarity refers to the absence of inclusions and blemishes in the cubic zirconia.   
In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3

In [5]:
sorted(train.clarity.unique())

['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2']

In [6]:
sorted(train.clarity.unique()) == sorted(test.clarity.unique())

True

Color description:  
Color of the cubic zirconia, with D being the best and J the worst.

In [7]:
sorted(train.color.unique())

['D', 'E', 'F', 'G', 'H', 'I', 'J']

In [8]:
sorted(train.color.unique()) == sorted(test.color.unique())

True
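The three consistency checks above can also be folded into a single loop. The following is a minimal sketch on toy frames; `toy_train` and `toy_test` and their values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are illustrative only).
toy_train = pd.DataFrame({"cut": ["Fair", "Ideal", "Good"],
                          "color": ["D", "E", "J"],
                          "clarity": ["IF", "SI1", "VS2"]})
toy_test = pd.DataFrame({"cut": ["Ideal", "Good", "Fair"],
                         "color": ["J", "D", "E"],
                         "clarity": ["VS2", "IF", "I1"]})

# For each categorical column, collect levels that appear in only one of the two sets.
mismatches = {}
for col in ["cut", "color", "clarity"]:
    only_train = set(toy_train[col]) - set(toy_test[col])
    only_test = set(toy_test[col]) - set(toy_train[col])
    if only_train or only_test:
        mismatches[col] = {"only_train": only_train, "only_test": only_test}

print(mismatches)
```

An empty dictionary would correspond to the three `True` results obtained above.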

## Feature Engineering Ideas  

- Ordinal encoding of the categorical variables 
- Replacing the x, y, z values that are 0  

Features provided by ChatGPT:  

- Carat weight (in carats): The carat weight of a diamond is the most important factor affecting its price. The formula for carat weight is:  

Carat weight = (x * y * z) / 200  

- Volume (in cubic millimeters): The volume of a diamond can be calculated using its dimensions. The formula for volume is:  

Volume = x * y * z  

- Total depth percentage (in percentage): The total depth percentage is the depth of the diamond as a percentage of its overall height. The formula for total depth percentage is:  

Total depth percentage = (2 * depth) / (x + y)  

- Table size percentage (in percentage): The table size percentage is the size of the table facet as a percentage of the diamond's overall width. The formula for table size percentage is:  

Table size percentage = (table / ((x + y) / 2)) * 100  

- Crown height (in millimeters): The crown height is the height of the diamond's crown above the girdle. The formula for crown height is:  

Crown height = depth * (1 - (table / 100))  

- Pavilion depth (in millimeters): The pavilion depth is the depth of the diamond's pavilion below the girdle. The formula for pavilion depth is:  

Pavilion depth = depth * (1 - (total depth percentage / 100))  

- Crown angle (in degrees): The crown angle is the angle between the girdle plane and the plane of the table facet. The formula for crown angle is:  

Crown angle = arctan((crown height / 2) / (x + y) / 2) * 2  

- Pavilion angle (in degrees): The pavilion angle is the angle between the girdle plane and the pavilion main facet. The formula for pavilion angle is:  

Pavilion angle = arctan(pavilion depth / (x + y) / 2) * 2  
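The formulas listed above can be collected into one helper. This is only a sketch: the function name `add_chatgpt_features` is hypothetical, the formulas are applied literally as written (including left-to-right division), and the toy row reuses the values of the first training row shown later in this notebook.

```python
import numpy as np
import pandas as pd

def add_chatgpt_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the formulas listed above applied as written."""
    out = df.copy()
    xy_sum = out["x"] + out["y"]
    out["carat_weight"] = out["x"] * out["y"] * out["z"] / 200
    out["volume"] = out["x"] * out["y"] * out["z"]
    out["total_depth_percentage"] = 2 * out["depth"] / xy_sum
    out["table_size_percentage"] = out["table"] / (xy_sum / 2) * 100
    out["crown_height"] = out["depth"] * (1 - out["table"] / 100)
    out["pavilion_depth"] = out["depth"] * (1 - out["total_depth_percentage"] / 100)
    # Division is evaluated left to right, exactly as the formulas are typed.
    out["crown_angle"] = np.arctan((out["crown_height"] / 2) / xy_sum / 2) * 2
    out["pavilion_angle"] = np.arctan(out["pavilion_depth"] / xy_sum / 2) * 2
    return out

toy = pd.DataFrame({"x": [7.27], "y": [7.33], "z": [4.55],
                    "depth": [62.2], "table": [58.0]})
features = add_chatgpt_features(toy)
print(features.round(4).T)
```

Keeping the formulas in one function makes it easy to apply exactly the same transformation to the training and test sets.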

### x, y, z

Replacing the values where x, y, or z is 0.  

Assuming that a diamond has a round shape:

x = sqrt((Table/100)^2 / (1 + (tan(arccos(2*(Depth/100)-1)))^2))  

y = x / (Table/100) * sqrt(1 / (1 - (Depth/100) + (Depth/100) / (tan(arccos(2*(Depth/100)-1)))^2))  

z = Depth / sqrt(1 - (x^2 + y^2) / (Depth/100)^2)  

If you have x and y:  

z = ((x + y) / 2) * Depth / (x * y)
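As a possible sanity check on the last formula, here is a hedged comparison. It assumes that depth follows the definition used in the classic diamonds data dictionary, depth (%) = 100 * 2 * z / (x + y), which would give z = depth / 100 * (x + y) / 2. Both function names are illustrative, and the numbers are those of the first training row shown later in this notebook (x = 7.27, y = 7.33, z = 4.55, depth = 62.2).

```python
def z_from_notebook_formula(x: float, y: float, depth: float) -> float:
    # Formula used in this notebook: ((x + y) / 2) * depth / (x * y)
    return ((x + y) / 2) * depth / (x * y)

def z_from_depth_definition(x: float, y: float, depth: float) -> float:
    # Assumption: depth (%) = 100 * z / mean(x, y), so z = depth / 100 * mean(x, y)
    return depth / 100 * (x + y) / 2

x, y, z_recorded, depth = 7.27, 7.33, 4.55, 62.2
print(round(z_from_notebook_formula(x, y, depth), 2))
print(round(z_from_depth_definition(x, y, depth), 2))
```

On this single row the definition-based estimate lands close to the recorded z, while the notebook formula gives a noticeably larger value, so it may be worth comparing both variants on a validation set.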

In [9]:
original[(original.x == 0) | (original.y == 0) | (original.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
5821,0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0,2130
6034,2.02,Premium,H,VS2,62.7,53.0,8.02,7.95,0.0,18207
6215,0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0,2130
10827,2.2,Premium,H,SI1,61.2,59.0,8.42,8.37,0.0,17265
12498,2.18,Premium,H,SI2,59.4,61.0,8.49,8.45,0.0,12631
12689,1.1,Premium,G,SI2,63.0,59.0,6.5,6.47,0.0,3696
17506,1.14,Fair,G,VS1,57.5,67.0,0.0,0.0,0.0,6381
18194,1.01,Premium,H,I1,58.1,59.0,6.66,6.6,0.0,3167
23758,1.12,Premium,G,I1,60.4,59.0,6.71,6.67,0.0,2383


In [10]:
train[(train.x == 0) | (train.y == 0) | (train.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
8750,1.02,Premium,H,SI2,59.4,61.0,6.57,6.53,0.0,4144
39413,2.18,Premium,H,SI2,59.4,60.0,8.46,8.41,0.0,15842
92703,0.71,Good,F,SI1,64.1,60.0,0.0,0.0,0.0,2130
98719,2.17,Premium,H,SI2,60.3,57.0,8.42,8.36,0.0,15923
99624,2.2,Premium,I,SI2,60.1,60.0,8.45,8.41,0.0,11221
117161,2.2,Premium,F,SI2,60.3,58.0,8.49,8.45,0.0,15188
151690,2.18,Premium,I,VS2,61.2,62.0,8.45,8.37,0.0,15701
159429,2.18,Premium,H,SI2,60.8,59.0,8.42,8.38,0.0,13938
170318,0.71,Good,D,VS2,64.1,60.0,0.0,0.0,0.0,910
178000,0.71,Very Good,F,SI2,62.0,60.0,0.0,6.71,0.0,2130


In [11]:
test[(test.x == 0) | (test.y == 0) | (test.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
344,1.11,Premium,E,SI2,63.0,60.0,6.59,6.56,0.0
2460,1.1,Very Good,E,SI2,62.5,60.0,6.52,6.57,0.0
9036,2.18,Premium,G,SI1,62.4,59.0,8.25,8.22,0.0
10693,2.2,Premium,I,VS1,59.4,59.0,8.49,8.45,0.0
50969,2.2,Premium,I,SI1,60.8,58.0,8.42,8.46,0.0
66929,2.2,Premium,I,SI1,61.2,59.0,8.43,8.39,0.0
68406,0.71,Good,G,SI1,64.2,59.0,0.0,0.0,0.0
72930,2.18,Premium,H,SI1,59.1,61.0,8.47,8.41,0.0
73937,1.01,Premium,H,I1,61.1,59.0,6.42,6.46,0.0
74113,0.71,Very Good,F,SI2,64.1,56.0,0.0,0.0,0.0


Python formula:  

round(((train.loc[i, 'x'] + train.loc[i, 'y']) / 2) * train.loc[i, 'depth'] / (train.loc[i, 'x'] * train.loc[i, 'y']),2)

In [12]:
# Writing a loop for the original dataset

for i in original[(original.x == 0) | (original.y == 0) | (original.z == 0)].index: 
    if original.loc[i, 'x'] != 0 and original.loc[i, 'y'] != 0:
        original.loc[i, 'z'] = round(((original.loc[i, 'x'] + original.loc[i, 'y']) / 2) * original.loc[i, 'depth'] / (original.loc[i, 'x'] * original.loc[i, 'y']),2)

In [13]:
# Writing a loop for the training set

for i in train[(train.x == 0) | (train.y == 0) | (train.z == 0)].index: 
    if train.loc[i, 'x'] != 0 and train.loc[i, 'y'] != 0:
        train.loc[i, 'z'] = round(((train.loc[i, 'x'] + train.loc[i, 'y']) / 2) * train.loc[i, 'depth'] / (train.loc[i, 'x'] * train.loc[i, 'y']),2)

In [14]:
# Writing a loop for the test set

for i in test[(test.x == 0) | (test.y == 0) | (test.z == 0)].index: 
    if test.loc[i, 'x'] != 0 and test.loc[i, 'y'] != 0:
        test.loc[i, 'z'] = round(((test.loc[i, 'x'] + test.loc[i, 'y']) / 2) * test.loc[i, 'depth'] / (test.loc[i, 'x'] * test.loc[i, 'y']),2)
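The three loops above do the same thing on different frames, so they could be replaced by one vectorized helper. A minimal sketch, assuming the same formula and rounding; the name `fill_missing_z` and the toy values are illustrative.

```python
import pandas as pd

def fill_missing_z(df: pd.DataFrame) -> pd.DataFrame:
    """Fill z where it is 0 but x and y are known, using the notebook's formula."""
    out = df.copy()
    mask = (out["z"] == 0) & (out["x"] != 0) & (out["y"] != 0)
    x, y, depth = out.loc[mask, "x"], out.loc[mask, "y"], out.loc[mask, "depth"]
    out.loc[mask, "z"] = (((x + y) / 2) * depth / (x * y)).round(2)
    return out

toy = pd.DataFrame({
    "x": [8.02, 0.0, 6.5],
    "y": [7.95, 0.0, 6.47],
    "z": [0.0, 0.0, 4.1],
    "depth": [62.7, 64.1, 63.0],
})
filled = fill_missing_z(toy)
print(filled)
```

Rows where x or y is also 0 are left untouched, exactly as in the loops, and have to be handled separately.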

In [15]:
original[(original.x == 0) | (original.y == 0) | (original.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
5821,0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0,2130
6215,0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0,2130
17506,1.14,Fair,G,VS1,57.5,67.0,0.0,0.0,0.0,6381


In [16]:
train[(train.x == 0) | (train.y == 0) | (train.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
92703,0.71,Good,F,SI1,64.1,60.0,0.0,0.0,0.0,2130
170318,0.71,Good,D,VS2,64.1,60.0,0.0,0.0,0.0,910
178000,0.71,Very Good,F,SI2,62.0,60.0,0.0,6.71,0.0,2130


In [17]:
test[(test.x == 0) | (test.y == 0) | (test.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
68406,0.71,Good,G,SI1,64.2,59.0,0.0,0.0,0.0
74113,0.71,Very Good,F,SI2,64.1,56.0,0.0,0.0,0.0
98515,0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0


In the training and original sets the remaining rows with 0s can simply be dropped, but this is not possible for the test set, because every test row needs a prediction.  

x = sqrt((Table/100)^2 / (1 + (tan(arccos(2*(Depth/100)-1)))^2))  
Python: test.loc[68406, 'x'] = round(np.sqrt(np.square((test.loc[68406, 'table']/100)) / (1 + np.square((np.tan(np.arccos(2*(test.loc[68406, 'depth']/100)-1)))))),2)  

y = x / (Table/100) * sqrt(1 / (1 - (Depth/100) + (Depth/100) / (tan(arccos(2*(Depth/100)-1)))^2))  
Python: test.loc[68406, 'y'] = round(test.loc[68406,'x'] / (test.loc[68406,'table']/100) * np.sqrt(1 / (1 - (test.loc[68406,'depth']/100) + (test.loc[68406,'depth']/100) / np.square(np.tan(np.arccos(2*(test.loc[68406,'depth']/100)-1))))),2)   

Dropping the rows with 0s from the train and original sets.

In [18]:
train.drop(index=train[(train.x == 0) | (train.y == 0) | (train.z == 0)].index, inplace=True)
original.drop(index=original[(original.x == 0) | (original.y == 0) | (original.z == 0)].index, inplace=True)

In [19]:
# Rebuilding x, y and z for the remaining test rows from table and depth

for i in  test[(test.x == 0) | (test.y == 0) | (test.z == 0)].index:
    test.loc[i, 'x'] = round(np.sqrt(np.square((test.loc[i, 'table']/100)) / (1 + np.square((np.tan(np.arccos(2*(test.loc[i, 'depth']/100)-1)))))),2)
    test.loc[i, 'y'] = round(test.loc[i, 'x'] / (test.loc[i, 'table']/100) * np.sqrt(1 / (1 - (test.loc[i, 'depth']/100) + (test.loc[i, 'depth']/100) / np.square(np.tan(np.arccos(2*(test.loc[i, 'depth']/100)-1))))),2)
    test.loc[i, 'z'] = round(((test.loc[i, 'x'] + test.loc[i, 'y']) / 2) * test.loc[i, 'depth'] / (test.loc[i, 'x'] * test.loc[i, 'y']),2)
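The long one-liners are easier to check when the round-shape formulas for x and y stated earlier are written as a small function. This is a sketch only: `round_shape_xy` is a hypothetical name, it implements the two formulas as stated in the text, and the inputs are the table and depth of test row 68406.

```python
import numpy as np

def round_shape_xy(table: float, depth: float) -> tuple[float, float]:
    """Estimate x and y from table and depth (both in percent)."""
    t, d = table / 100, depth / 100
    tan_sq = np.tan(np.arccos(2 * d - 1)) ** 2
    x = np.sqrt(t ** 2 / (1 + tan_sq))
    y = x / t * np.sqrt(1 / (1 - d + d / tan_sq))
    return float(x), float(y)

x_est, y_est = round_shape_xy(table=59.0, depth=64.2)
print(round(x_est, 2), round(y_est, 2))
```

Both estimates come out well below 1, whereas the measured x and y in the rows shown above are roughly between 6 and 8.5, so the scale of these formulas may deserve a second look.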

In [20]:
original[(original.x == 0) | (original.y == 0) | (original.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price


In [21]:
train[(train.x == 0) | (train.y == 0) | (train.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price


In [22]:
test[(test.x == 0) | (test.y == 0) | (test.z == 0)]

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z


No more 0s.

### Adding the Original Dataset

In [23]:
original.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26964 entries, 0 to 26966
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    26964 non-null  float64
 1   cut      26964 non-null  object 
 2   color    26964 non-null  object 
 3   clarity  26964 non-null  object 
 4   depth    26267 non-null  float64
 5   table    26964 non-null  float64
 6   x        26964 non-null  float64
 7   y        26964 non-null  float64
 8   z        26964 non-null  float64
 9   price    26964 non-null  int64  
dtypes: float64(6), int64(1), object(3)
memory usage: 2.3+ MB


In [24]:
train = pd.concat([train, original])
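Note that `pd.concat` keeps each frame's own index, so index labels repeat after stacking. This is harmless for fitting the model, but it matters for later label-based access with `.loc`. A small sketch on toy frames (the names `a` and `b` are illustrative):

```python
import pandas as pd

a = pd.DataFrame({"carat": [0.3, 0.5]})
b = pd.DataFrame({"carat": [0.7, 0.9]})

stacked = pd.concat([a, b])                        # keeps both original indexes
reindexed = pd.concat([a, b], ignore_index=True)   # renumbers the rows from 0

print(stacked.index.tolist())
print(reindexed.index.tolist())
```

Passing `ignore_index=True` is one way to get a unique index if it is needed downstream.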

### Carat Weight  

Carat weight = (x * y * z) / 200

In [25]:
train['carat_weight'] = (train.x * train.y * train.z) / 200  
test['carat_weight'] = (test.x * test.y * test.z) / 200

### Volume  

Volume = x * y * z

In [26]:
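# Note: as run, this multiplies y twice (x * y * y) rather than x * y * z as in the formula above;
# the feature table shown further down reflects this version.
train['volume'] = train.x * train.y * train.y 
test['volume'] = test.x * test.y * test.y 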

### Total Depth Percentage  

Total depth percentage = (2 * depth) / (x + y)

In [27]:
train['total_depth_percentage'] = (2 * train.depth) / (train.x + train.y)
test['total_depth_percentage'] = (2 * test.depth) / (test.x + test.y)

### Table Size Percentage  

Table size percentage = (table / ((x + y) / 2)) * 100

In [28]:
train['table_size_percentage'] = (train.table / ((train.x + train.y) / 2)) * 100
test['table_size_percentage'] = (test.table / ((test.x + test.y) / 2)) * 100

### Crown Height  

Crown height = depth * (1 - (table / 100))  

In [29]:
train['crown_height'] = train.depth * (1 - (train.table / 100))
test['crown_height'] = test.depth * (1 - (test.table / 100))

### Pavilion Depth  

Pavilion depth = depth * (1 - (total depth percentage / 100)) 

In [30]:
train['pavilion_depth'] = train.depth * (1 - (train.total_depth_percentage / 100))
test['pavilion_depth'] = test.depth * (1 - (test.total_depth_percentage / 100))

### Crown Angle  

Crown angle = arctan((crown height / 2) / (x + y) / 2) * 2

In [31]:
train['crown_angle'] = np.arctan((train.crown_height / 2) / (train.x + train.y) / 2) * 2
test['crown_angle'] = np.arctan((test.crown_height / 2) / (test.x + test.y) / 2) * 2

### Pavilion Angle  

Pavilion angle = arctan(pavilion depth / (x + y) / 2) * 2 

In [32]:
train['pavilion_angle'] = np.arctan(train.pavilion_depth / (train.x + train.y) / 2) * 2
test['pavilion_angle'] = np.arctan(test.pavilion_depth / (test.x + test.y) / 2) * 2
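Two details of the angle formulas are worth making explicit. Python evaluates `a / b / 2` from left to right, and `np.arctan` returns radians, while the feature list describes the angles in degrees. The sketch below uses the crown height, x and y of the first training row shown later in this notebook; the variable names are illustrative.

```python
import numpy as np

crown_height, x, y = 26.124, 7.27, 7.33

# Left-to-right division: (crown_height / 2) / (x + y) / 2 == crown_height / (4 * (x + y))
as_written = np.arctan((crown_height / 2) / (x + y) / 2) * 2
explicit = np.arctan(crown_height / (4 * (x + y))) * 2

print(round(as_written, 6))                 # value in radians
print(round(np.degrees(as_written), 2))     # same angle converted to degrees
```

For a tree-based model the unit makes no difference, since the conversion is monotonic, but it matters when reading the values.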

### Ordinal Encoding

#### Cut  

Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.

In [33]:
cut = {'Fair':0, 'Good':1, 'Very Good':2, 'Premium':3, 'Ideal':4}

train['cut'] = train.cut.map(cut)
test['cut'] = test.cut.map(cut)

#### Color  

Color of the cubic zirconia, with D being the best and J the worst.

In [34]:
color = {'J':0, 'I':1, 'H':2, 'G':3, 'F':4, 'E':5, 'D':6}

train['color'] = train.color.map(color) 
test['color'] = test.color.map(color)

#### Clarity  

Clarity refers to the absence of inclusions and blemishes in the cubic zirconia.  
In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3

In [35]:
clarity = {'I3':0, 'I2':1, 'I1':2, 'SI2':3, 'SI1':4, 'VS2':5, 'VS1':6, 'VVS2':7, 'VVS1':8, 'IF':9, 'FL':10}

train['clarity'] = train.clarity.map(clarity)
test['clarity'] = test.clarity.map(clarity)
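One thing to watch with `Series.map` and a dictionary: any label that is not a key silently becomes NaN. A minimal sketch of a check that could be run before overwriting the columns; the toy frame and the misspelled label are made up for illustration.

```python
import pandas as pd

cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}

toy = pd.DataFrame({"cut": ["Ideal", "Very Good", "very good", "Fair"]})
encoded = toy["cut"].map(cut_order)

# Labels that are missing from the dictionary come back as NaN.
unmapped = toy.loc[encoded.isna(), "cut"].unique().tolist()
print(unmapped)
```

An empty list means every category was covered by the mapping.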

# Splitting into X and y 

In [36]:
X = train.drop(columns='price')
y = train.price

In [37]:
X

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,carat_weight,volume,total_depth_percentage,table_size_percentage,crown_height,pavilion_depth,crown_angle,pavilion_angle
0,1.52,3,4,5,62.2,58.0,7.27,7.33,4.55,1.212327,390.609103,8.520548,794.520548,26.124,56.900219,0.841261,2.193322
1,2.03,2,0,3,62.0,58.0,8.06,8.12,5.05,1.652542,531.431264,7.663782,716.934487,26.040,57.248455,0.765059,2.112632
2,0.70,4,3,6,61.2,57.0,5.69,5.73,3.50,0.570565,186.819201,10.718039,998.248687,26.316,54.640560,1.045313,2.349732
3,0.32,4,3,6,61.6,56.0,4.38,4.41,2.71,0.261729,85.182678,14.015927,1274.175199,27.104,52.966189,1.313457,2.500655
4,1.70,3,3,5,62.6,59.0,7.65,7.61,4.77,1.388464,443.027565,8.204456,773.263434,25.666,57.464010,0.796069,2.165135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26962,1.11,3,3,4,62.3,58.0,6.61,6.52,4.09,0.881338,280.993744,9.489718,883.472963,26.166,56.387906,0.924429,2.269922
26963,0.33,4,2,9,61.9,55.0,4.44,4.42,2.74,0.268860,86.741616,13.972912,1241.534989,27.855,53.250767,1.332262,2.499114
26964,0.51,3,5,5,61.7,58.0,5.12,5.15,3.17,0.417933,135.795200,12.015579,1129.503408,25.914,54.286388,1.125544,2.418160
26965,0.27,2,4,7,61.8,56.0,4.19,4.20,2.60,0.228774,73.911600,14.731824,1334.922527,27.192,52.695733,1.361920,2.525033


# Model 

In [39]:
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

Model parameters were obtained from a previous run using RandomizedSearchCV.

In [40]:
model = XGBRegressor(n_jobs=-1, n_estimators=200, min_child_weight=4, max_depth=7, learning_rate=0.03)

model.fit(X, y)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.03, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=7, max_leaves=0, min_child_weight=4,
             missing=nan, monotone_constraints='()', n_estimators=200,
             n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=0, reg_lambda=1, ...)

This is a first setup for the hyperparameter tuning, but it is computationally very expensive.

In [None]:
regr = RandomizedSearchCV(XGBRegressor(n_jobs=-1,random_state=42),{
    'n_estimators':[200, 400, 600],
    'learning_rate':[.001, .01, .1, .2, .3],
    'max_depth': [1,2,3,4,5,6,7,8,9,10],
    'min_child_weight':[1,2,3,4,5,6,7,8,9,10]
}, cv=5, return_train_score=False, scoring='neg_root_mean_squared_error', n_iter=10) 

regr.fit(X,y)

pd.DataFrame(regr.cv_results_).sort_values('rank_test_score').head(3)

The best parameters can then be read from the "regr.best_params_" attribute.
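To see where the computational cost comes from, the search budget can be counted by hand. The sketch below only does the arithmetic for the grid, `n_iter` and `cv` values used above, and assumes the default `refit=True` of `RandomizedSearchCV`.

```python
from math import prod

param_grid = {
    "n_estimators": [200, 400, 600],
    "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
    "max_depth": list(range(1, 11)),
    "min_child_weight": list(range(1, 11)),
}
n_iter, cv = 10, 5

total_combinations = prod(len(values) for values in param_grid.values())
search_fits = n_iter * cv      # one fit per sampled candidate and fold
total_fits = search_fits + 1   # plus the final refit on the full data

print(total_combinations, search_fits, total_fits)
```

So the random search samples only 10 of the 1500 possible combinations, yet it still trains 51 boosted models on the data.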

# Submission

In [41]:
submission = pd.read_csv('./playground-series-s3e8/sample_submission.csv')

In [42]:
submission['price'] = model.predict(test)

In [43]:
submission.to_csv('n_attempt.csv', index=None)
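Before submitting, it can help to confirm that the test columns match the training columns in name and order, and that the predictions contain no missing values. A sketch on toy frames; `X_toy`, `test_toy` and `submission_toy` are illustrative stand-ins for `X`, `test` and `submission`.

```python
import pandas as pd

# Toy stand-ins for the notebook's X, test and submission frames.
X_toy = pd.DataFrame({"carat": [0.3], "cut": [4], "volume": [50.0]})
test_toy = pd.DataFrame({"volume": [60.0, 70.0], "carat": [0.4, 0.5], "cut": [3, 2]})

# Same set of columns? Then reorder the test columns to the training order.
assert set(test_toy.columns) == set(X_toy.columns)
test_aligned = test_toy[X_toy.columns]

submission_toy = pd.DataFrame({"id": [1, 2], "price": [1000.0, 1200.0]})
ok = len(submission_toy) == len(test_aligned) and submission_toy["price"].notna().all()
print(list(test_aligned.columns), bool(ok))
```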

My final score on the test set is 577.89290, an improvement of about 12.5 points over my previous 590.36328.  

In particular, this moved me from around 180th place to my current 54th.  

ChatGPT can be used as a tool to partially fill gaps in domain knowledge.