## Diamond Price Prediction

### Introduction to the Data

**The dataset:** The goal is to predict the `price` of a given diamond (regression analysis).

There are 10 independent variables (including `id`):

* `id` : unique identifier of each diamond
* `carat` : Carat (ct.) refers to the unique unit of weight measurement used exclusively to weigh gemstones and diamonds.
* `cut` : Quality of Diamond Cut
* `color` : Color of Diamond
* `clarity` : Diamond clarity is a measure of the purity and rarity of the stone, graded by the visibility of inclusions and blemishes under 10-power magnification.
* `depth` : The depth of diamond is its height (in millimeters) measured from the culet (bottom tip) to the table (flat, top surface)
* `table` : A diamond's table is the facet which can be seen when the stone is viewed face up.
* `x` : Diamond X dimension
* `y` : Diamond Y dimension
* `z` : Diamond Z dimension

Target variable:
* `price`: Price of the given Diamond.
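
The feature/target split described above can be sketched as follows. This is a minimal illustration, assuming the column names listed in this section; the three rows are invented toy values, not records from the Kaggle file, and the names `toy`, `TARGET`, `X`, and `y` are hypothetical.

```python
import pandas as pd

# Toy rows invented for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "carat": [1.50, 2.00, 0.70],
    "cut": ["Premium", "Very Good", "Ideal"],
    "color": ["F", "J", "G"],
    "clarity": ["VS2", "SI2", "VS1"],
    "depth": [62.0, 62.5, 61.0],
    "table": [58.0, 58.0, 57.0],
    "x": [7.3, 8.1, 5.7],
    "y": [7.3, 8.1, 5.7],
    "z": [4.5, 5.0, 3.5],
    "price": [13000, 13500, 2800],
})

TARGET = "price"

# Independent variables: every column except the target (10 columns, including id).
X = toy.drop(columns=[TARGET])
# Dependent variable: the price to be predicted.
y = toy[TARGET]

print(X.shape, y.shape)
```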

Dataset source:
[https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv](https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv)

In [None]:
import pandas as pd

In [None]:
## Data ingestion step
df=pd.read_csv('data/gemstone.csv')
df.head()

In [None]:
df.isnull().sum()

In [None]:
### No missing values present in the data
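
The "no missing values" observation can also be checked programmatically rather than by reading the output. The sketch below uses a hypothetical helper, `missing_report`, on an invented toy frame; on the real `df` the returned Series would be expected to be empty if the observation above holds.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return missing-value counts, keeping only columns that have at least one."""
    counts = frame.isnull().sum()
    return counts[counts > 0]

# Toy frame with deliberately missing entries (invented for illustration).
toy = pd.DataFrame({
    "carat": [0.3, None, 1.1],
    "cut": ["Ideal", "Good", None],
    "price": [500, 900, 4000],
})

report = missing_report(toy)
print(report)

# A frame without gaps yields an empty report.
print(missing_report(toy.dropna()).empty)
```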

In [None]:
df.info()

In [None]:
df.head()

In [None]:
## Let's drop the id column
df=df.drop(labels=['id'],axis=1)
df.head()

In [None]:
## check for duplicated records
df.duplicated().sum()
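
If the duplicate count were non-zero, one possible follow-up is to drop the repeated rows. The sketch below is an assumption about how that could look, shown on an invented toy frame; the notebook itself only counts duplicates.

```python
import pandas as pd

# Toy frame with one exact duplicate row (invented for illustration).
toy = pd.DataFrame({
    "carat": [0.5, 0.5, 1.0],
    "cut": ["Ideal", "Ideal", "Good"],
    "price": [1500, 1500, 4200],
})

# Count rows that repeat an earlier row exactly.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```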

In [None]:
## segregate numerical and categorical columns

numerical_columns=df.columns[df.dtypes!='object']
categorical_columns=df.columns[df.dtypes=='object']
print("Numerical columns:",numerical_columns)
print('Categorical Columns:',categorical_columns)
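
An alternative to comparing `dtypes` against `'object'` is `DataFrame.select_dtypes`, which does not depend on how string columns happen to be stored. This is a sketch on an invented toy frame, not a replacement the notebook requires.

```python
import pandas as pd

# Toy frame mixing numeric and string columns (invented for illustration).
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.2],
    "cut": ["Ideal", "Good", "Premium"],
    "color": ["E", "G", "J"],
    "price": [500, 2500, 6000],
})

# Numeric columns are selected by kind; everything else is treated as categorical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)
```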

In [None]:
df[categorical_columns].describe()

In [None]:
df['cut'].value_counts()

In [None]:
df['color'].value_counts()

In [None]:
df['clarity'].value_counts()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

## plot the distribution of each numerical column
for col in numerical_columns:
    plt.figure(figsize=(8,6))  # new figure per column so every plot gets this size
    sns.histplot(data=df,x=col,kde=True)
    plt.show()

In [None]:
## Assignment: do the same for categorical data

In [None]:
## count plot of each categorical column
for col in categorical_columns:
    plt.figure(figsize=(8,6))
    sns.countplot(data=df,x=col)
    plt.show()

In [None]:
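## box plot of price by cut, split by clarity
## (a box plot needs a numeric axis, so price is used on y)
## catplot creates its own figure; size it with height/aspect instead of plt.figure
sns.catplot(data = df, x = 'cut', y = 'price', hue = 'clarity', kind = "box", height = 6, aspect = 8/6)
plt.show()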

In [None]:
## correlation
sns.heatmap(df[numerical_columns].corr(),annot=True)

In [None]:
##Currently we will not execute this
## df.drop(labels=['x','y','z'],axis=1)
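
The commented-out drop above is presumably motivated by strong correlations visible in the heatmap. One way to list such pairs instead of reading them off the plot is sketched below; `high_corr_pairs` and the 0.9 threshold are hypothetical choices, and the toy values are invented.

```python
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |corr|) for column pairs at or above the threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value >= threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs

# Toy numeric frame (invented): x grows with carat, table does not.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "x": [4.3, 5.1, 5.7, 6.4, 7.3],
    "table": [57.0, 55.0, 59.0, 56.0, 58.0],
})

pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data this could be called as `high_corr_pairs(df[numerical_columns])` before deciding whether to drop `x`, `y`, and `z`.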

In [None]:
df.head()

In [None]:
## For domain knowledge, see https://www.americangemsociety.org/ags-diamond-grading-system/
df['cut'].unique()

In [None]:
cut_map={"Fair":1,"Good":2,"Very Good":3,"Premium":4,"Ideal":5}

In [None]:
df['clarity'].unique()

In [None]:
clarity_map = {"I1":1,"SI2":2 ,"SI1":3 ,"VS2":4 , "VS1":5 , "VVS2":6 , "VVS1":7 ,"IF":8}

In [None]:
df['color'].unique()

In [None]:
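## Note: D is the best (colorless) grade, so this scale runs best-to-worst, unlike cut_map and clarity_map
color_map = {"D":1 ,"E":2 ,"F":3 , "G":4 ,"H":5 , "I":6, "J":7}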

In [None]:
df['cut']=df['cut'].map(cut_map)
df['clarity'] = df['clarity'].map(clarity_map)
df['color'] = df['color'].map(color_map)
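
`Series.map` silently produces `NaN` for any label missing from the mapping, so a misspelled or unexpected category would go unnoticed in the cell above. The sketch below wraps the mapping in a hypothetical helper, `map_ordinal`, that raises instead; it assumes the column has no missing values, and the toy Series is invented.

```python
import pandas as pd

# Same mapping as in the notebook.
cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

def map_ordinal(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integer ranks, raising if any value has no entry in the mapping."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values in {series.name!r}: {list(unmapped)}")
    return encoded.astype(int)

# Toy column (invented for illustration).
toy_cut = pd.Series(["Ideal", "Fair", "Premium"], name="cut")
encoded_cut = map_ordinal(toy_cut, cut_map)
print(encoded_cut.tolist())
```

On the real data the same helper could be applied as `df['cut'] = map_ordinal(df['cut'], cut_map)`, and likewise for `clarity` and `color`.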

In [None]:
df.head()