<p><img src="https://github.com/brahim0404/Diamond-Data/blob/main/Gif.GIF?raw=true"></p>




# **Diamond Sales Prediction**

In this project, we seek to predict diamond sale prices from specific diamond attributes. The goal is to recognize patterns in the data and use them for targeted sales and marketing, which should generate more sales and profit.

The goals of the project are:

* Perform data preprocessing.
* Visualize the data.
* Predict diamond prices.
* Use a regression model.

While reading the notebook, make sure to read the comments inside the code. They illustrate our thought process throughout this project.
 

![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/pur-break.png)

## **Importing necessary packages**

In [22]:
# Importing packages.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Ignore warnings while executing the code.
import warnings
warnings.filterwarnings("ignore")


![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/blue-break.png)
## **About the Dataset**

Throughout this project, we use Shivam Agrawal's Diamond Dataset from Kaggle. You can learn more about it <a href="https://www.kaggle.com/datasets/shivam2503/diamonds?select=diamonds.csv">here</a>.

The dataset contains the price, quality, and other attributes of almost 54,000 diamonds. It has 10 attributes in total, as follows:

**Carat**: Weight of the diamond (0.2 - 5.01).

**Cut**: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).

**Color**: Diamond color, from J (Worst) to D (Best).

**Clarity**: A measurement of how clear the diamond is (I1 (Worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (Best)).

**Depth**: Total depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y) (43 - 79).

**Table**: The width of the diamond's table expressed as a percentage of its average diameter (43 - 95).

**Price**: The price of the diamond in US dollars ($326 - $18.8K).

**x**: Length in mm (0 - 10.74).

**y**: Width in mm (0 - 58.9).

**z**: Depth in mm (0 - 31.8).
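The depth formula can be checked against the first row of the dataset (x = 3.95, y = 3.98, z = 2.43, recorded depth 61.5). The following is a minimal sketch, not part of the original notebook; the helper name `depth_percent` is our own, and the factor of 100 expresses the ratio as a percentage so that it matches the scale of the `depth` column.

```python
def depth_percent(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)


# Dimensions (in mm) of the first diamond in the dataset.
computed = depth_percent(x=3.95, y=3.98, z=2.43)
recorded = 61.5

print(f"computed depth: {computed:.2f}, recorded depth: {recorded}")
```

The computed value lands close to, but not exactly on, the recorded one. Presumably this is because the published dimensions are rounded to two decimals.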


![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/blue-break.png)
## **Importing Dataset**

In [23]:
# Importing and loading Dataset.
data = pd.read_csv("../input/diamonds/diamonds.csv")
data

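| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 53936 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 53937 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 53938 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 53939 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |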


![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/blue-break.png)
## **Fetching dimensions of the Dataset**

In [24]:
# The raw dataset has 53940 rows and 11 columns.
data.shape

(53940, 11)

The raw dataset has 53940 rows and 11 columns. One of the columns is a redundant index, which leaves the 10 attributes described above.


![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/pur-break.png)
# **DATA PREPROCESSING**

## **Data cleaning**

In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB


The first column, "Unnamed: 0", is just a copy of the row index. We must drop it.
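As an alternative sketch, the redundant column can be avoided at load time by telling `pd.read_csv` to treat the first CSV column as the index (`index_col=0`). The snippet below is not part of the original notebook. It uses a three-row in-memory sample copied from the table above instead of the real file so that it runs on its own, and the names `sample_csv` and `sample` are ours.

```python
import io

import pandas as pd

# Three rows copied from the dataset. The first header field is empty,
# which is why pandas would otherwise name that column "Unnamed: 0".
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
    "3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31\n"
)

# index_col=0 uses the first CSV column as the row index instead of
# loading it as a regular data column.
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.columns.tolist())
print(sample.shape)
```

With this approach the row labels start at 1 rather than 0, because they come from the file. Dropping the column after loading, as the notebook does, keeps the default 0-based index.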


In [26]:
# Drop the "Unnamed: 0" column. Now the dataset looks cleaner.
data = data.drop(["Unnamed: 0"], axis=1)
data.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


The minimum values for "x", "y", and "z" are 0, which is not physically possible. These faulty values represent dimensionless or 2-dimensional diamonds.

We must clean the data further.


In [27]:
# Dropping dimensionless diamonds
data = data.drop(data[data["x"]==0].index)
data = data.drop(data[data["y"]==0].index)
data = data.drop(data[data["z"]==0].index)
data.shape

(53920, 10)

The number of rows dropped from 53940 to 53920 after the cleaning. We lost 20 data points by deleting the dimensionless diamonds.
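The three `drop` calls can also be written as a single boolean mask, which makes it easy to count the faulty rows before removing them. This is a minimal sketch rather than the notebook's own code: the `toy` frame and its values are illustrative stand-ins for the real data, and the names `dims`, `valid`, and `cleaned` are ours.

```python
import pandas as pd

# Toy frame standing in for the diamonds data; the values are illustrative only.
toy = pd.DataFrame(
    {
        "x": [3.95, 0.0, 4.05, 6.55],
        "y": [3.98, 0.0, 4.07, 6.48],
        "z": [2.43, 0.0, 2.31, 0.0],
    }
)

dims = ["x", "y", "z"]

# True for rows in which every dimension is non-zero.
valid = (toy[dims] != 0).all(axis=1)

# Count the faulty rows first, then keep only the valid ones.
print("rows with a zero dimension:", int((~valid).sum()))
cleaned = toy[valid]
print(cleaned.shape)
```

On the toy frame, two of the four rows contain a zero dimension and are removed. The mask gives the same result as dropping on "x", "y", and "z" one after the other.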

In [28]:
# Checking the data types of the columns to prevent errors in computation.
data.dtypes

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object

Data types are correct.

![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/blue-break.png)
## **Pairplot of the Data**



In [None]:
p=sns.pairplot(data)


![purple-divider](https://raw.githubusercontent.com/brahim0404/Diamond-Data/main/pur-break.png)
# **The End**

### **Project participants:**

* Brahim Boussada
* Abdelrahman Bakeer
* Emad Salloum
* Eyad Salloum
* Muhammad Ayaaz
* Aia Mohamed Abdelmagid Elkoumy
* Shiza Khan
* Nouhaila Mezyan