## Project Description / Business Task

You work for a leading jewelry retailer that deals extensively in diamonds. Your boss is constantly striving to optimize pricing: prices should maximize profitability while keeping the company competitive in the market. You were hired to model the key factors that influence diamond prices using statistical tools.

## Data Dictionary

● price: price in US dollars 

● carat: weight of the diamond in carats

● cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

● color: diamond colour, from J (worst) to D (best)

● clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

● x: length in mm

● y: width in mm

● z: depth in mm

● depth: total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y)

● table: width of top of diamond relative to widest point 
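
As a quick sanity check on the depth definition, the sketch below recomputes depth from x, y and z for two made-up stones. The DataFrame `sample` and its values are hypothetical, not rows from diamonds.csv, and the factor of 100 assumes the column stores depth as a percentage.

```python
import pandas as pd

# Hypothetical measurements in mm (illustrative values, not taken from diamonds.csv)
sample = pd.DataFrame({"x": [4.0, 6.0], "y": [4.0, 6.2], "z": [2.4, 3.8]})

# Total depth percentage: z divided by the mean of x and y, scaled to percent
sample["depth_check"] = 100 * 2 * sample["z"] / (sample["x"] + sample["y"])

print(sample)
```

On the real data, the recomputed value could be compared with `df["depth"]` using a small tolerance, since the recorded measurements are rounded.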

## Import Libraries

In [1]:
import numpy as np
#from numpy import count_nonzero, median, mean
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import random


# Import the classes directly; a preceding `import datetime` would be shadowed by this line
from datetime import datetime, timedelta, date


#import os
#import zipfile
import scipy
from scipy import stats
from scipy.stats.mstats import normaltest # D'Agostino K^2 Test
from scipy.stats import boxcox
from collections import Counter

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# Set style and font scale in one call; a separate sns.set() would reset the style to its default
sns.set(style='dark', font_scale=1.2)
#sns.set(rc={'figure.figsize':(14,10)})

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Import Data

In [2]:
df = pd.read_csv("diamonds.csv")

## Data Quick Glance

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.dtypes.value_counts()

In [None]:
# Summary statistics for every column (numeric and categorical)
df.describe(include="all")

In [None]:
# Summary statistics for numeric columns only
df.describe(include="number")

In [None]:
# Summary statistics for categorical (object) columns only
df.describe(include="object")

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.isnull().sum()

## Statistics

In [3]:
df["price"].mean()

3862.4168

In [5]:
from scipy import stats

# Null hypothesis: The average price of all diamonds is 3750 USD
# Alternative hypothesis: The average price of all diamonds is not 3750 USD

# Perform a one-sample t-test
t_stat, p_value = stats.ttest_1samp(df['price'], 3750)

print('t-statistic:', t_stat)
print('p-value:', p_value)

t-statistic: 1.998478427057817
p-value: 0.04571888078754048


In [7]:
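# Halve the two-sided p-value to get the one-sided p-value for H1: mean price > 3750 USD
# (valid here because the t-statistic is positive)
print(p_value/2)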

0.02285944039377024
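
Halving the two-sided p-value is equivalent to running the test with a one-sided alternative, provided the t-statistic points in the direction of that alternative. The sketch below shows this on synthetic data (the `prices` array is simulated, not the notebook's `df['price']`), using the `alternative` argument of `scipy.stats.ttest_1samp`, which assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Simulated prices (assumption: a stand-in for df['price'], centred above 3750)
rng = np.random.default_rng(0)
prices = rng.normal(loc=3900, scale=1000, size=500)

# Two-sided test: H1 is "mean != 3750"
t_two, p_two = stats.ttest_1samp(prices, 3750)

# One-sided test: H1 is "mean > 3750"
t_one, p_one = stats.ttest_1samp(prices, 3750, alternative="greater")

print("two-sided p:", p_two)
print("one-sided p:", p_one)
print("half of two-sided p:", p_two / 2)
```

If the t-statistic were negative, the one-sided p-value for "greater" would be 1 - p/2 rather than p/2, which is why passing `alternative` explicitly is the safer habit.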


In [8]:
df2 = pd.read_csv("diamondpremium.csv")

In [9]:
df2

Unnamed: 0,carat,cut,price,price_per_carat
0,1.06,Premium,4689.00,
1,0.80,Premium,3285.00,
2,0.31,Premium,558.00,
3,2.01,Premium,12369.00,
4,1.52,Premium,8897.00,
...,...,...,...,...
12900,,,,
12901,,,,
12902,,,,
12903,,,,


In [10]:
df3 = pd.read_csv("diamondfair.csv")
df3

Unnamed: 0,carat,cut,price,price_per_carat
0,1.03,Fair,4328.00,
1,0.92,Fair,3924.00,
2,0.96,Fair,2539.00,
3,1.43,Fair,6727.00,
4,1.73,Fair,6007.00,
...,...,...,...,...
11742,,,,
11743,,,,
11744,,,,
11745,,,,
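
Both subset files show an empty price_per_carat column. A possible next step, sketched here as an assumption rather than the author's method, is to fill that column as price / carat and compare the two cuts with Welch's two-sample t-test. The frames `premium` and `fair` below are hypothetical and contain only the five rows visible in each output above, so the resulting p-value is illustrative only.

```python
import pandas as pd
from scipy import stats

# First five rows shown in the notebook output for each file
premium = pd.DataFrame({"carat": [1.06, 0.80, 0.31, 2.01, 1.52],
                        "price": [4689.0, 3285.0, 558.0, 12369.0, 8897.0]})
fair = pd.DataFrame({"carat": [1.03, 0.92, 0.96, 1.43, 1.73],
                     "price": [4328.0, 3924.0, 2539.0, 6727.0, 6007.0]})

# Assumed definition: price per carat = price / carat
for frame in (premium, fair):
    frame["price_per_carat"] = frame["price"] / frame["carat"]

# Welch's t-test (equal_var=False) does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(premium["price_per_carat"],
                                  fair["price_per_carat"],
                                  equal_var=False)

print("t-statistic:", t_stat)
print("p-value:", p_value)
```

With the full files, the same calls would run on `df2` and `df3` after dropping the all-empty trailing rows, for example with `dropna(subset=['carat', 'price'])`.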


#### Python code done by Dennis Lam