# ***Cleaning the Dataset***

In this Jupyter notebook we clean the data and do some feature engineering to prepare a good set of variables for modelling.

In [1]:
import pandas as pd
import numpy as np

from src import cleaning_functions as cf

## Import and explore:

In [2]:
data = pd.read_csv("DATA/train.csv")
data.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,1.14,Ideal,G,VVS2,61.0,56.0,6.74,6.76,4.12,9013
1,1,0.76,Ideal,H,VS2,62.7,57.0,5.86,5.82,3.66,2692
2,2,0.84,Ideal,G,VS1,61.4,56.0,6.04,6.15,3.74,4372
3,3,1.55,Ideal,H,VS1,62.0,57.0,7.37,7.43,4.59,13665
4,4,0.3,Ideal,G,SI2,61.9,57.0,4.28,4.31,2.66,422


In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.dtypes

In [None]:
data.isnull().sum()

# Evaluating collinearity:

In [None]:
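# cut, color and clarity are still strings at this point, so restrict the
# correlation matrix to numeric columns (recent pandas versions raise otherwise)
data.corr(numeric_only=True)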

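x, y and z are strongly correlated with carat. In fact, carat is a unit of weight, so it is roughly proportional to the stone's volume and therefore close to a function of x, y and z. Keeping all four columns would add redundant information, so we drop x, y and z.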
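As a rough check of that claim, the sketch below rebuilds the five rows shown by `data.head()` and compares carat with the bounding-box volume x·y·z. The `volume` and `carat_per_mm3` columns are hypothetical helpers introduced only for this illustration; they are not part of the dataset.

```python
import pandas as pd

# The five rows printed by data.head() above, keeping only the relevant columns.
sample = pd.DataFrame({
    "carat": [1.14, 0.76, 0.84, 1.55, 0.30],
    "x": [6.74, 5.86, 6.04, 7.37, 4.28],
    "y": [6.76, 5.82, 6.15, 7.43, 4.31],
    "z": [4.12, 3.66, 3.74, 4.59, 2.66],
})

# Bounding-box volume in mm^3 (hypothetical helper column).
sample["volume"] = sample["x"] * sample["y"] * sample["z"]

# Carat per unit of volume: nearly constant if carat follows from x, y and z.
sample["carat_per_mm3"] = sample["carat"] / sample["volume"]

print(sample[["carat", "volume", "carat_per_mm3"]])
print(sample["carat"].corr(sample["volume"]))
```

For these five rows the ratio stays close to 0.006 carat per mm³, which is consistent with treating x, y and z as redundant once carat is in the model. Five rows are only an illustration; the same check can be run on the full `data` before dropping the columns.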

In [3]:
data.drop(columns=["x","y","z"], inplace=True)

In [None]:
data.head()

# Handling categorical data:

"cut" is an ordinal feature (its grades have a natural order), so it makes sense to convert it to numbers.

The same applies to "clarity" (see https://www.info-diamond.com/polished/clarity.html for the grading scale)

...and to "color" (https://www.info-diamond.com/polished/color.html):

In [4]:
data = cf.categ(data)
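
The body of `cf.categ` is not shown in this notebook. As a minimal sketch of what an ordinal encoding of this kind could look like, the example below maps each grade to its rank. The grade lists, their ordering, and the name `encode_ordinal` are assumptions made for illustration, and the resulting codes will not necessarily match the ones `cf.categ` returns.

```python
import pandas as pd

# Hypothetical orderings, worst to best. These are assumptions for this sketch,
# not the mapping used by cf.categ.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]


def encode_ordinal(df):
    """Return a copy of df with cut, clarity and color replaced by integer ranks."""
    out = df.copy()
    orders = {"cut": CUT_ORDER, "clarity": CLARITY_ORDER, "color": COLOR_ORDER}
    for col, order in orders.items():
        # Rank 1 is the worst grade, len(order) the best.
        ranks = {label: rank for rank, label in enumerate(order, start=1)}
        out[col] = out[col].map(ranks)
    return out


sample = pd.DataFrame({
    "cut": ["Ideal", "Ideal"],
    "color": ["G", "H"],
    "clarity": ["VVS2", "SI2"],
})
encoded = encode_ordinal(sample)
print(encoded)
```

Any label missing from the lists becomes `NaN` with `Series.map`, so it is worth running `encoded.isnull().sum()` after encoding to catch unexpected grades.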

In [5]:
data.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,price
0,0,1.14,5,77,9,61.0,56.0,9013
1,1,0.76,5,66,7,62.7,57.0,2692
2,2,0.84,5,77,8,61.4,56.0,4372
3,3,1.55,5,66,8,62.0,57.0,13665
4,4,0.3,5,77,5,61.9,57.0,422


In [7]:
data.corr().price

id        -0.004065
carat      0.921128
cut       -0.052115
color     -0.173896
clarity   -0.142408
depth     -0.015052
table      0.127691
price      1.000000
Name: price, dtype: float64

In [None]:
import seaborn as sns
sns.heatmap(data.corr())

# Now our data is ready for ML analysis

Let's export it:

In [11]:
data.to_csv("DATA/clean_data.csv", index=False)
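
The `index=False` argument matters when the file is read back later. The sketch below uses a tiny made-up frame and an in-memory buffer instead of `DATA/clean_data.csv` to show the difference between the two settings.

```python
import io

import pandas as pd

# Tiny illustrative frame; the values are taken from the first two rows above.
df = pd.DataFrame({"carat": [1.14, 0.76], "price": [9013, 2692]})

# to_csv() without a path returns the CSV text, which we read straight back.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'carat', 'price']
print(list(without_index.columns))  # ['carat', 'price']
```

Writing the index would add an extra unnamed column to the cleaned file, which the modelling notebook would then have to drop again.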