<a href="https://colab.research.google.com/github/cmendesfirmino/diamond_prices_study_project/blob/master/diamond_prices_study_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Predicting Diamond Prices**

**Project Overview**

A jewelry company wants to put in a bid to purchase a large set of diamonds, but is unsure how much it should bid. In this project, you will use the results from a predictive model to make a recommendation on how much the jewelry company should bid for the diamonds.

**Project Details**

A diamond distributor has recently decided to exit the market and has put a set of 3,000 diamonds up for auction. Seeing this as a great opportunity to expand its inventory, a jewelry company has shown interest in making a bid. To decide how much to bid, you will use a large database of diamond prices to build a model that predicts the price of a diamond based on its attributes. Then you will use the results of that model to recommend how much the company should bid.

In [0]:
# Import pandas to work with the dataset
import pandas as pd


In [0]:
# Read the training dataset from the GitHub repo
train_data = pd.read_csv('https://raw.githubusercontent.com/cmendesfirmino/diamond_prices_study_project/master/diamonds.csv')


In [68]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,carat,cut,cut_ord,color,clarity,clarity_ord,price
0,1,0.51,Premium,4,F,VS1,4,1749
1,2,2.25,Fair,1,G,I1,1,7069
2,3,0.7,Very Good,3,E,VS2,5,2757
3,4,0.47,Good,2,F,VS1,4,1243
4,5,0.3,Ideal,5,G,VVS1,7,789


Let's understand the data above. There are 8 fields:

**Unnamed: 0**: the row number.

**carat**: represents the weight of the diamond, and is a numerical variable.

**cut**: represents the quality of the cut of the diamond, and falls into 5 categories: fair, good, very good, premium, and ideal.

**cut_ord**: the cut categories represented as an ordinal variable, 1-5.

**color**: represents the color of the diamond, and is rated D through J, with D being the most colorless (and valuable) and J being the most yellow.

**clarity**: represents the internal purity of the diamond, and falls into 8 categories: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, and IF (in order from least to most pure).

**clarity_ord**: the clarity categories represented as an ordinal variable, 1-8.

**price**: represents the price of the diamond.



In [69]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   50000 non-null  int64  
 1   carat        50000 non-null  float64
 2   cut          50000 non-null  object 
 3   cut_ord      50000 non-null  int64  
 4   color        50000 non-null  object 
 5   clarity      50000 non-null  object 
 6   clarity_ord  50000 non-null  int64  
 7   price        50000 non-null  int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 3.1+ MB


In [70]:
train_data.describe()

Unnamed: 0.1,Unnamed: 0,carat,cut_ord,clarity_ord,price
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,25000.5,0.798597,3.90398,4.1267,3939.1035
std,14433.901067,0.474651,1.117043,1.665564,3995.879832
min,1.0,0.2,1.0,1.0,326.0
25%,12500.75,0.4,3.0,3.0,948.0
50%,25000.5,0.7,4.0,4.0,2402.5
75%,37500.25,1.04,5.0,5.0,5331.0
max,50000.0,5.01,5.0,8.0,18823.0


In [0]:
train_data = train_data[['carat','cut', 'color', 'clarity', 'price']]

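The cell above drops the row-number column, cut_ord, and clarity_ord. The row number is dropped because it carries no information about price. cut_ord and clarity_ord are dropped because they duplicate the information already in cut and clarity, so keeping both versions would give the model perfectly correlated features.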
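As a quick sanity check on that redundancy claim, the sketch below tests whether every `cut` label maps to exactly one `cut_ord` value. It uses only the ten `cut`/`cut_ord` pairs visible in the `head()` outputs of this notebook, not the full dataset, and the variable names are illustrative.

```python
import pandas as pd

# cut / cut_ord pairs copied from the head() outputs shown in this notebook
sample = pd.DataFrame({
    "cut": ["Premium", "Fair", "Very Good", "Good", "Ideal",
            "Premium", "Good", "Very Good", "Ideal", "Ideal"],
    "cut_ord": [4, 1, 3, 2, 5, 4, 2, 3, 5, 5],
})

# Count how many distinct ordinal codes each cut label is paired with
codes_per_label = sample.groupby("cut")["cut_ord"].nunique()

# If every label has exactly one code, cut_ord adds nothing beyond cut
is_redundant = bool((codes_per_label == 1).all())
print(is_redundant)
```

Running the same check on the full `train_data` (before the columns are dropped) would confirm it for all 50,000 rows.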

In [0]:
from sklearn.linear_model import LinearRegression

In [0]:
model = LinearRegression()

In [74]:
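# Put the target column first, then one-hot encode the categorical columns.
# drop_first=True drops the first level of each column to avoid redundant dummies.
train_data = train_data.reindex(['price', 'carat', 'cut', 'color', 'clarity'], axis=1)
train_data_dummy = pd.get_dummies(train_data, columns=['cut', 'color', 'clarity'], drop_first=True)
train_data_dummy.head(10)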



Unnamed: 0,price,carat,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_E,color_F,color_G,color_H,color_I,color_J,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2
0,1749,0.51,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0
1,7069,2.25,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,2757,0.7,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0
3,1243,0.47,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
4,789,0.3,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
5,728,0.33,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,18398,2.01,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0
7,2203,0.51,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
8,15100,1.7,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0
9,1857,0.53,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
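Judging from the columns in the output above, `drop_first=True` removed `cut_Fair`, `color_D`, and `clarity_I1`, so those levels act as the baseline: a diamond in a baseline category has zeros in all dummy columns for that feature. A minimal sketch on a toy frame (the `toy` data is made up for illustration) shows the effect for `cut`:

```python
import pandas as pd

# One row per cut level; toy data, not taken from the real dataset
toy = pd.DataFrame({"cut": ["Fair", "Good", "Ideal", "Premium", "Very Good"]})

# Levels are sorted alphabetically, so the first one ("Fair") is dropped
dummies = pd.get_dummies(toy, columns=["cut"], drop_first=True)
print(list(dummies.columns))

# The Fair row has no column of its own: every dummy is zero
fair_row_total = int(dummies.iloc[0].sum())
print(fair_row_total)
```

Each remaining coefficient is then read as a price difference relative to the baseline level.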


In [0]:
y_train = train_data_dummy['price']
x_train = train_data_dummy.iloc[:,1:]


In [76]:
model.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [77]:
r_sq = model.score(x_train, y_train)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)

coefficient of determination: 0.9162528260544507
intercept: -7382.290586052183
slope: [ 8887.41193964   682.16700303  1017.09019855   889.25666125
   867.07528846  -205.242761    -298.6711949   -498.56000112
  -966.19947487 -1441.42623378 -2321.3535325   5421.78978757
  3570.56218042  2616.87033749  4534.68549294  4217.13996616
  5057.78671531  4953.74109955]


In [81]:
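import statsmodels.api as sm
# Note: sm.OLS does not add an intercept on its own, so this fit has no constant term
est = sm.OLS(y_train, x_train)
est2 = est.fit()
#print(est2.summary())
# Extract the p-value of each coefficient from the summary table
p_values = est2.summary2().tables[1]['P>|t|']
print(p_values)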

carat             0.000000e+00
cut_Good         1.676395e-207
cut_Ideal        1.135530e-128
cut_Premium      4.705136e-152
cut_Very Good    3.050032e-155
color_E          8.020613e-266
color_F           0.000000e+00
color_G           0.000000e+00
color_H           0.000000e+00
color_I           0.000000e+00
color_J           0.000000e+00
clarity_IF        4.078149e-34
clarity_SI1      1.638330e-249
clarity_SI2       0.000000e+00
clarity_VS1       9.598166e-12
clarity_VS2       6.473187e-58
clarity_VVS1      5.927913e-07
clarity_VVS2      4.834734e-03
Name: P>|t|, dtype: float64
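These p-values come from a fit without a constant term, whereas the `LinearRegression` model above fits an intercept. The sketch below uses NumPy least squares as a stand-in for `sm.OLS` (not the notebook's own code) to illustrate how leaving out the column of ones changes the estimated slope. The data are synthetic, with an assumed intercept of 5 and slope of 2.

```python
import numpy as np

# Synthetic data with a known intercept (5) and slope (2); values are hypothetical
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 5.0 + 2.0 * x

# No constant column: the fitted line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# With a leading column of ones, which is what sm.add_constant adds by default
X_const = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("slope without constant:", slope_only[0])
print("intercept and slope with constant:", beta)
```

In statsmodels the equivalent step would be passing `sm.add_constant(x_train)` to `sm.OLS`, which would also add a `const` row to the p-value table.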


In [0]:
test_data = pd.read_csv('https://raw.githubusercontent.com/cmendesfirmino/diamond_prices_study_project/master/new-diamonds.csv')

In [85]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,carat,cut,cut_ord,color,clarity,clarity_ord
0,1,1.22,Premium,4,G,SI1,3
1,2,1.01,Good,2,G,VS2,5
2,3,0.71,Very Good,3,I,VS2,5
3,4,1.01,Ideal,5,D,SI2,2
4,5,0.27,Ideal,5,H,VVS2,6


In [0]:
test_data = test_data[['carat','cut', 'color', 'clarity']]


In [0]:
test_data_dummy = pd.get_dummies(test_data, columns=['cut', 'color', 'clarity'], drop_first=True)
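Encoding the test set separately works here only because it contains the same category levels as the training set. If a level were missing, `get_dummies` would produce different columns (and `drop_first=True` could drop a different baseline), and `model.predict` would no longer line up with the training features. One hedged way to guard against that is to encode all levels and then reindex to the training columns. The column list and toy data below are hypothetical.

```python
import pandas as pd

# Hypothetical feature columns produced from the training set
train_columns = ["carat", "cut_Good", "cut_Ideal", "color_E"]

# Toy test set in which no diamond has color E
test = pd.DataFrame({
    "carat": [0.5, 1.0],
    "cut": ["Good", "Ideal"],
    "color": ["D", "D"],
})

# Encode every level, then keep exactly the training columns in training order;
# columns missing from the test set are filled with 0, extra ones are dropped
test_dummy = pd.get_dummies(test, columns=["cut", "color"])
test_aligned = test_dummy.reindex(columns=train_columns, fill_value=0)
print(list(test_aligned.columns))
```

With the real data, `train_columns` would be `x_train.columns`.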

In [0]:
test_data_dummy['predicted_price'] = model.predict(test_data_dummy)

In [99]:
test_data_dummy.head(2)

Unnamed: 0,carat,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_E,color_F,color_G,color_H,color_I,color_J,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2,predicted_price
0,1.22,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,7421.610821
1,1.01,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,5994.742441
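The first predicted price can be reproduced by hand from the intercept and coefficients printed earlier, which is a useful check on how the dummy columns enter the model. The sketch assumes the coefficients are listed in the same order as the feature columns (carat, then the cut, color, and clarity dummies).

```python
# Intercept and the coefficients relevant to this diamond, copied from the printed model output
intercept = -7382.290586052183
coef = {
    "carat": 8887.41193964,
    "cut_Premium": 889.25666125,
    "color_G": -498.56000112,
    "clarity_SI1": 3570.56218042,
}

# First test row: carat 1.22, cut Premium, color G, clarity SI1
carat = 1.22

# Each active dummy contributes its coefficient once; inactive dummies contribute 0
predicted = (
    intercept
    + coef["carat"] * carat
    + coef["cut_Premium"]
    + coef["color_G"]
    + coef["clarity_SI1"]
)
print(round(predicted, 2))  # 7421.61
```

The result agrees with the `predicted_price` of the first row above.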


In [103]:
# Sum the predicted prices of all diamonds and scale the total by 0.7
test_data_dummy['predicted_price'].sum()*0.7

8230695.689744963
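The final figure is the sum of all predicted prices multiplied by 0.7. The notebook does not say where the 0.7 factor comes from, so the sketch below simply treats it as a parameter. The function name `recommend_bid` is hypothetical, and the example uses only the two predictions shown in `test_data_dummy.head(2)`.

```python
def recommend_bid(predicted_prices, fraction=0.7):
    """Return the bid as a fixed fraction of the total predicted price."""
    return sum(predicted_prices) * fraction


# The two predictions shown in test_data_dummy.head(2)
bid = recommend_bid([7421.610821, 5994.742441])
print(round(bid, 2))  # 9391.45
```

Applied to the full `predicted_price` column, the same calculation gives the recommended bid of about 8,230,695.69.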