**Assignment 1 – End-to-end Machine Learning project**

Data Science Lifecycle

# Step 1: Problem Formulation

The diamond industry is growing at roughly 5% per year, yet the expertise needed to price diamonds accurately remains concentrated among a small number of gemologists, which makes pricing prone to discrepancies and inefficiencies. To support fair commerce and buyer trust, the industry needs a system that can accurately estimate a diamond's price from its physical traits and quality parameters: carat, cut, color, clarity, and dimensions (length, width, and depth).





**Objective:** To create an ML model that uses a diamond's carat, cut, color, clarity, and dimensions (length, width, and depth) to predict its price.

# Step 2: Get the Data

In [None]:
import pandas as pd
import sklearn

data = pd.read_csv('/content/diamonds.csv')

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [None]:
data.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
# Rename columns
data.rename(columns={'depth':'depth_perc'}, inplace=True)
data.rename(columns={'x': 'length', 'y': 'width', 'z': 'depth'}, inplace=True)

In [None]:
data.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth_perc', 'table', 'price',
       'length', 'width', 'depth'],
      dtype='object')

## Take a Quick Look at the Data Structure

In [None]:
data.head()

Unnamed: 0,carat,cut,color,clarity,depth_perc,table,price,length,width,depth
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


### Description of columns


*   price - price in US dollars
*   carat - weight of the diamond
*   cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal)
*   color - diamond colour, from J (worst) to D (best)
*   clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
*   length - length (in mm)
*   width - width (in mm)
*   depth - depth (in mm)
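*   depth_perc - total depth percentage {100 * depth / mean(length, width) = 200 * depth / (length + width)}, where length, width, and depth are the renamed x, y, and z columns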
*   table - width of top of diamond relative to widest point
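
The `depth_perc` definition in the list above can be checked against the sample rows shown earlier. This is a minimal sketch, assuming the stored value is the depth-to-mean-diameter ratio expressed as a percentage (hence the factor of 100); the frame `sample` and the columns `depth_perc_calc` and `abs_diff` are illustrative names, not part of the notebook.

```python
import pandas as pd

# First five rows from data.head(), using the renamed columns
sample = pd.DataFrame({
    "length": [3.95, 3.89, 4.05, 4.20, 4.34],
    "width": [3.98, 3.84, 4.07, 4.23, 4.35],
    "depth": [2.43, 2.31, 2.31, 2.63, 2.75],
    "depth_perc": [61.5, 59.8, 56.9, 62.4, 63.3],
})

# depth / mean(length, width), scaled to a percentage (assumed scaling)
sample["depth_perc_calc"] = (
    100 * 2 * sample["depth"] / (sample["length"] + sample["width"])
)

# Absolute gap between the recomputed and the stored value
sample["abs_diff"] = (sample["depth_perc_calc"] - sample["depth_perc"]).abs()
print(sample[["depth_perc", "depth_perc_calc", "abs_diff"]].round(2))
```

On these five rows the recomputed values stay within about a quarter of a percentage point of the stored ones, which is plausible given that the dimensions are stored to two decimals.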

# Step 3: Data Exploration

## Look at the data types and null values in data

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   carat       53940 non-null  float64
 1   cut         53940 non-null  object 
 2   color       53940 non-null  object 
 3   clarity     53940 non-null  object 
 4   depth_perc  53940 non-null  float64
 5   table       53940 non-null  float64
 6   price       53940 non-null  int64  
 7   length      53940 non-null  float64
 8   width       53940 non-null  float64
 9   depth       53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


#### ***Observation:*** No column has null values, and three columns (cut, color, and clarity) have the object dtype.
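
The same two facts can also be pulled out programmatically rather than read from the `info()` printout. The sketch below uses a small illustrative frame (`sample`) in place of the full dataset; `null_counts` and `non_numeric_cols` are hypothetical names introduced here.

```python
import pandas as pd

# Small illustrative frame with the same kinds of columns as the dataset
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
    "price": [326, 326, 327],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()

# Columns that are not numeric (the object columns in this notebook)
non_numeric_cols = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

print(null_counts)
print(non_numeric_cols)
```

Checking for non-numeric columns instead of matching the dtype name keeps the sketch independent of how a given pandas version labels text columns.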

## Look at the value counts for each 'object' column

In [None]:
data["cut"].value_counts()

cut,count
Ideal,21551
Premium,13791
Very Good,12082
Good,4906
Fair,1610



In [None]:
data["color"].value_counts()

color,count
G,11292
E,9797
F,9542
H,8304
D,6775
I,5422
J,2808


In [None]:
data["clarity"].value_counts()

clarity,count
SI1,13065
VS2,12258
SI2,9194
VS1,8171
VVS2,5066
VVS1,3655
IF,1790
I1,741
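
Raw counts are easier to compare as shares of the whole. The sketch below rebuilds the `cut` counts reported above as a Series (`cut_counts`, an illustrative name) and converts them to proportions; on the full frame, `data["cut"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the data["cut"].value_counts() output
cut_counts = pd.Series(
    {"Ideal": 21551, "Premium": 13791, "Very Good": 12082, "Good": 4906, "Fair": 1610},
    name="count",
)

# Share of the dataset taken by each cut grade
cut_shares = cut_counts / cut_counts.sum()
print(cut_shares.round(3))
```

The classes are clearly imbalanced: Ideal accounts for roughly 40% of the rows, while Fair accounts for about 3%.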


## Distribution of values in non-object columns

In [None]:
data.describe()

Unnamed: 0,carat,depth_perc,table,price,length,width,depth
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


The summary shows that the minimum length, width, and depth are all zero. A diamond cannot have a zero dimension, so these rows are anomalies and are dropped from the dataset.

In [None]:
# Remove rows with zero length, width, or depth

data = data.drop(data[data["length"]==0].index)
data = data.drop(data[data["width"]==0].index)
data = data.drop(data[data["depth"]==0].index)
data.shape

(53920, 10)
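
The shapes reported so far imply that 20 rows were removed (53940 before, 53920 after). To see where the zeros sit before dropping anything, a count per column and a single boolean mask can help. This is a sketch on made-up rows; `sample`, `dims`, `zeros_per_column`, `rows_with_zero`, and `cleaned` are illustrative names.

```python
import pandas as pd

# Illustrative rows; the last two contain at least one zero dimension
sample = pd.DataFrame({
    "length": [3.95, 3.89, 0.0, 6.55],
    "width": [3.98, 3.84, 0.0, 6.48],
    "depth": [2.43, 2.31, 0.0, 0.0],
})

dims = ["length", "width", "depth"]

# How many zeros each dimension column contains
zeros_per_column = (sample[dims] == 0).sum()

# True for rows where at least one dimension is zero
rows_with_zero = (sample[dims] == 0).any(axis=1)

print(zeros_per_column)
print("rows affected:", int(rows_with_zero.sum()))

# Equivalent single-mask form of the three drop calls
cleaned = sample[~rows_with_zero]
print(cleaned.shape)
```

The mask form removes the same rows as the three sequential `drop` calls and keeps the original index labels.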

### Code to save the figures as high-res PNGs

In [None]:
# Helper to save figures as high-resolution PNGs
from pathlib import Path
import matplotlib.pyplot as plt

IMAGES_PATH = Path() / "images" / "end_to_end_project"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
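
A possible way to call the helper is sketched below. To keep the example self-contained it redefines `save_fig`, writes into a temporary directory instead of `images/end_to_end_project`, and selects the non-interactive `Agg` backend; the figure name `price_histogram_sample` and the five prices (taken from the `data.head()` output) are only for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical output directory for this sketch
IMAGES_PATH = Path(tempfile.mkdtemp()) / "images"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # Same logic as the helper above: build the path, tidy the layout, save
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Prices of the first five rows shown by data.head()
prices = [326, 326, 327, 334, 335]

plt.hist(prices, bins=5)
plt.xlabel("price (USD)")
plt.ylabel("count")
save_fig("price_histogram_sample")

saved_file = IMAGES_PATH / "price_histogram_sample.png"
print(saved_file.exists())
```

Calling `save_fig` right after building a plot, and before `plt.show()`, ensures the current figure is the one written to disk.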