# Diamond Price Prediction
The aim of this analysis is to predict the price of diamonds from their characteristics, using the Diamonds dataset from Kaggle. The dataset is described as containing 53,940 observations and 10 variables, although the file loaded in this notebook reports 50,000 rows (see the `df.shape` output under Data Processing). The variables are as follows:


| Column Name | Description |
|:--------|:--------|
| carat | Weight of the diamond |
| cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | Diamond colour, from J (worst) to D (best) |
| clarity | How clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| x | Length in mm |
| y | Width in mm |
| z | Depth in mm |
| depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (range 43–79) |
| table | Width of top of diamond relative to widest point (range 43–95) |
| price | Price in US dollars (range 326–18,823) |
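As a quick check of the `depth` definition in the table, the sketch below recomputes it from `x`, `y`, and `z` for the first row of the dataset preview (x = 3.95, y = 3.98, z = 2.43, recorded depth 61.5). The helper name `depth_percentage` is hypothetical, and the factor of 100 is an assumption that `depth` is stored as a percentage rather than a ratio.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth as a percentage: 100 * 2 * z / (x + y)."""
    return 100.0 * 2.0 * z / (x + y)

# Dimensions (in mm) of the first diamond in the dataset preview.
computed = depth_percentage(3.95, 3.98, 2.43)
recorded = 61.5  # value in the `depth` column for the same row

print(f"computed depth: {computed:.2f}%, recorded depth: {recorded}%")
```

The computed value (about 61.3) is close to, but not identical to, the recorded 61.5. The gap may come from rounding in the published dimensions.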

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [35]:
df = pd.read_csv('diamonds.csv')
df.head()

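| Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|:--------|:--------|:--------|:--------|:--------|:--------|:--------|:--------|:--------|:--------|:--------|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |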


## Data Processing

In [36]:
df.shape

(50000, 10)

In [None]:
# check column dtypes and non-null counts
df.info()
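`df.info()` reports non-null counts per column, so missing values have to be inferred by comparing them with the row count. A more direct check is sketched below on a tiny stand-in frame. The values in `sample` are hypothetical, not rows from `diamonds.csv`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values) with one missing carat.
sample = pd.DataFrame({
    "carat": [0.23, np.nan, 0.31],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 335],
})

missing_per_column = sample.isna().sum()             # NaN count per column
rows_with_missing = sample.isna().any(axis=1).sum()  # rows with at least one NaN

print(missing_per_column)
print("rows with any missing value:", rows_with_missing)
```

On the real data the same two lines can be run with `df` in place of `sample`.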

In [None]:
df.describe()

In [None]:
# value counts for each categorical column
print(df.cut.value_counts(), df.color.value_counts(), df.clarity.value_counts(), sep='\n\n')

In [None]:
df.head(10)

## Exploratory Data Analysis

In [None]:
sns.histplot(df['price'],bins = 20)

In [None]:
sns.histplot(df['carat'],bins=20)

In [None]:
plt.figure(figsize=(5,5))
# take labels from the counts themselves so they cannot get out of order
cut_counts = df['cut'].value_counts()
plt.pie(cut_counts,labels=cut_counts.index,autopct='%1.1f%%')
plt.title('Cut')
plt.show()

In [None]:
plt.figure(figsize=(5,5))
plt.bar(df['color'].value_counts().index,df['color'].value_counts())
plt.ylabel("Number of Diamonds")
plt.xlabel("Color")
plt.show()

In [None]:
plt.figure(figsize=(5,5))
plt.bar(df['clarity'].value_counts().index,df['clarity'].value_counts())
plt.title('Clarity')
plt.ylabel("Number of Diamonds")
plt.xlabel("Clarity")
plt.show()

In [None]:
sns.histplot(df['table'],bins=10)
plt.title('Table')
plt.show()

## Comparing Diamond's Features with Price

In [None]:
sns.barplot(x='cut',y='price',data=df)

In [None]:
sns.barplot(x='color',y='price',data=df)
plt.title('Price vs Color')
plt.show()

In [None]:
sns.barplot(x = 'clarity', y = 'price', data = df)
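By default `sns.barplot` draws the mean of `y` for each category, so the bar heights above are mean prices. The sketch below shows one way to tabulate the same means next to mean carat and group size, which can help when a low grade appears to have a high price. The `sample` frame and its values are hypothetical stand-ins for `df`.

```python
import pandas as pd

# Small hypothetical sample; real values come from diamonds.csv.
sample = pd.DataFrame({
    "clarity": ["I1", "I1", "IF", "IF", "SI1"],
    "carat":   [1.50, 1.20, 0.30, 0.40, 0.70],
    "price":   [4000, 3000, 900, 1100, 2200],
})

# Mean price is what the bar height shows; mean carat and count add context.
summary = (
    sample.groupby("clarity")
    .agg(mean_price=("price", "mean"), mean_carat=("carat", "mean"), n=("price", "size"))
    .sort_values("mean_price", ascending=False)
)
print(summary)
```

In this made-up sample the lowest clarity grade has the highest mean price only because its stones are heavier. Whether the same holds for the real data would need to be checked on `df`.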

## Data Preprocessing 2

In [None]:
# encode the ordinal categorical variables as integers (higher = better grade)
df['cut'] = df['cut'].map({'Ideal':5,'Premium':4,'Very Good':3,'Good':2,'Fair':1})
df['color'] = df['color'].map({'D':7,'E':6,'F':5,'G':4,'H':3,'I':2,'J':1})
df['clarity'] = df['clarity'].map({'IF':8,'VVS1':7,'VVS2':6,'VS1':5,'VS2':4,'SI1':3,'SI2':2,'I1':1})
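`Series.map` with a dictionary silently returns `NaN` for any label that is missing from the dictionary, for example a misspelled or differently cased category. The sketch below wraps the mapping in a check that fails loudly instead. The helper `encode_ordinal` and the constant `CUT_ORDER` are hypothetical names, and the helper assumes the input has no missing values.

```python
import pandas as pd

CUT_ORDER = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

def encode_ordinal(series: pd.Series, mapping: dict) -> pd.Series:
    """Map categories to integer ranks, raising on labels absent from the mapping."""
    encoded = series.map(mapping)
    # Labels that were present in the input but produced NaN after mapping.
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"unmapped categories: {sorted(unmapped)}")
    return encoded.astype(int)  # assumes the input had no missing values

cuts = pd.Series(["Ideal", "Premium", "Good", "Fair"])
encoded_cuts = encode_ordinal(cuts, CUT_ORDER)
print(encoded_cuts.tolist())
```

The same helper could be applied to `color` and `clarity` with their own dictionaries.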

## Correlation

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),annot=True,cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
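A heatmap of the full matrix can be hard to read when the only question is which features track price. One option is to pull out the `price` column of the correlation matrix and sort it by absolute value, as sketched below. The data here is synthetic and built so that `carat` drives `price`, so the ranking it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, size=200)
sample = pd.DataFrame({
    "carat": carat,
    "table": rng.uniform(55, 65, size=200),           # unrelated noise
    "price": 4000 * carat + rng.normal(0, 300, 200),  # synthetic, driven by carat
})

# Correlation of every feature with price, strongest (by magnitude) first.
price_corr = (
    sample.corr()["price"]
    .drop("price")
    .sort_values(key=np.abs, ascending=False)
)
print(price_corr)
```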

### Plotting the relationship between Price and Carat

In [None]:
sns.lineplot(x='carat',y='price',data=df)
plt.title('Carat vs Price')
plt.show()

Price generally increases with carat. Some lower-carat diamonds are nonetheless expensive, which suggests that other factors also affect the price.
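One way to put numbers on the carat trend is to group diamonds into carat bins and compare the median price per bin. The sketch below does this on a handful of hypothetical rows. The bin edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the carat and price columns of df.
sample = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70, 0.90, 1.01, 1.50, 2.00],
    "price": [326, 335, 2500, 3900, 5200, 9800, 15000],
})

bins = [0.0, 0.5, 1.0, 1.5, 2.5]  # right-inclusive carat intervals
sample["carat_bin"] = pd.cut(sample["carat"], bins=bins)
median_by_bin = sample.groupby("carat_bin", observed=True)["price"].median()
print(median_by_bin)
```

A median that rises from bin to bin supports the trend, while a wide price range inside a single bin points to the influence of other features.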

In [None]:
fig, ax = plt.subplots(2,3,figsize=(15,5))
sns.scatterplot(x='x',y='carat',data=df, ax=ax[0,0])
sns.scatterplot(x='y',y='carat',data=df, ax=ax[0,1])
sns.scatterplot(x='z',y='carat',data=df, ax=ax[0,2])
sns.scatterplot(x='x',y='price',data=df, ax=ax[1,0])
sns.scatterplot(x='y',y='price',data=df, ax=ax[1,1])
sns.scatterplot(x='z',y='price',data=df, ax=ax[1,2])
plt.show()

Most diamonds have x values between 4 and 8 mm, y values between 4 and 10 mm, and z values between 2 and 6 mm.

## Model Building
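The model cells in this section use `x_train`, `x_test`, `y_train`, and `y_test`, but the cell that creates them does not appear in this notebook. A minimal sketch of such a split is given below. The helper name `split_features_target`, the 80/20 ratio, the seed, and the decision to drop a leftover `Unnamed: 0` index column are all assumptions, not the original author's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_target(df, target="price", test_size=0.2, seed=42):
    """Separate the target column and return x_train, x_test, y_train, y_test."""
    # Drop the target and, if present, the leftover CSV index column.
    features = df.drop(columns=[target, "Unnamed: 0"], errors="ignore")
    return train_test_split(features, df[target], test_size=test_size, random_state=seed)

# Hypothetical stand-in for the encoded diamonds frame.
sample = pd.DataFrame({
    "carat": [0.2 + 0.1 * i for i in range(10)],
    "cut": [1, 2, 3, 4, 5] * 2,
    "price": [300 + 500 * i for i in range(10)],
})
x_train, x_test, y_train, y_test = split_features_target(sample)
print(x_train.shape, x_test.shape)
```

With the real data, `split_features_target(df)` would be called after the categorical columns have been encoded.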

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
dt

In [None]:
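# train the model (x_train and y_train are assumed to come from a train/test
# split of df into features and price; that cell is not shown in this notebook)
dt.fit(x_train,y_train)
# R^2 score on the training data
dt.score(x_train,y_train)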

In [None]:
#predicting the test set
dt_pred = dt.predict(x_test)

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf

In [None]:
# train the model on the same training split used for the decision tree
rf.fit(x_train,y_train)
# R^2 score on the training data
rf.score(x_train,y_train)

In [None]:
#predicting the test set
rf_pred = rf.predict(x_test)

## Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error,mean_absolute_error
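The evaluation cells below report RMSE, MAE, and the value returned by `score`, which for scikit-learn regressors is the coefficient of determination R². To make the three quantities concrete, the sketch below computes them by hand with NumPy. The helper name `regression_report` and the four example prices are hypothetical.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))  # penalises large errors more
    mae = float(np.mean(np.abs(residuals)))         # average absolute error
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                      # share of variance explained
    return {"rmse": rmse, "mae": mae, "r2": r2}

report = regression_report([300, 500, 1000, 2000], [350, 450, 1100, 1900])
print(report)
```

RMSE and MAE are in the same units as the target (US dollars here), while R² is unitless and equals 1 for perfect predictions.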

### Decision Tree Regressor

In [None]:
# density plot of actual vs predicted prices
# (sns.distplot is deprecated, so sns.kdeplot is used instead)
ax = sns.kdeplot(x=y_test,color='r',label='Actual Value')
sns.kdeplot(x=dt_pred,color='b',label='Fitted Values',ax=ax)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price')
plt.ylabel('Density')
plt.legend()
plt.show()

In [None]:
print('Decision Tree Regressor RMSE:',np.sqrt(mean_squared_error(y_test,dt_pred)))
# score() returns the R^2 coefficient of determination for regressors
print('Decision Tree Regressor R^2:',dt.score(x_test,y_test))
print('Decision Tree Regressor MAE:',mean_absolute_error(y_test,dt_pred))

### Random Forest Regressor

In [None]:
# density plot of actual vs predicted prices
# (sns.distplot is deprecated, so sns.kdeplot is used instead)
ax = sns.kdeplot(x=y_test,color='r',label='Actual Value')
sns.kdeplot(x=rf_pred,color='b',label='Fitted Values',ax=ax)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price')
plt.ylabel('Density')
plt.legend()
plt.show()

In [None]:
print('Random Forest Regressor RMSE:',np.sqrt(mean_squared_error(y_test,rf_pred)))
# score() returns the R^2 coefficient of determination for regressors
print('Random Forest Regressor R^2:',rf.score(x_test,y_test))
print('Random Forest Regressor MAE:',mean_absolute_error(y_test,rf_pred))

## Conclusion

Both models are useful, but the Random Forest Regressor is slightly more accurate on the test set and is therefore the better choice of the two.
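A small gap between two models on a single test split can depend on how the split happened to fall. One way to check whether the ranking is stable is k-fold cross-validation, sketched below. The data is synthetic, and the fold count, tree count, and seeds are assumptions, so the printed scores say nothing about the diamonds dataset itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the diamond features and price.
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 2.0, size=(300, 2))
y = 4000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 300, size=300)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Five R^2 scores per model, one for each held-out fold.
cv_r2 = {name: cross_val_score(m, X, y, cv=5, scoring="r2") for name, m in models.items()}
for name, scores in cv_r2.items():
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If one model wins on most folds and not just on average, the preference for it rests on firmer ground than a single split can give.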

As the data above shows, the price of a diamond increases with its carat. Although J color and I1 clarity are the lowest grades, diamonds with these grades are priced higher than some better-graded diamonds, which could be a result of other factors that affect the price.