The goal of this analysis is to predict diamond prices in USD using various regression techniques.
I will use the clichéd Kaggle diamonds dataset: https://www.kaggle.com/datasets/shivam2503/diamonds.
This is a fresh analysis, however, and not a copy of any tutorial or other published notebook.


I started by loading the dataset into a DataFrame, dropping the first column (a row id), and getting a sense of the structure.

In [50]:
import pandas as pd

path_to_csv = 'C:\\users\\ashby\\dsprojects\\applied-modeling-techniques\\' + \
              'multiple-linear-regression\\diamonds.csv'
df = pd.read_csv(path_to_csv)

# The first column is only a row id, so drop it
df = df.drop(columns=df.columns[0])
df.head()

   carat      cut  color  clarity  depth  table  price     x     y     z
0   0.23    Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75
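
As an aside, the id column can also be absorbed at load time instead of being dropped afterwards. The sketch below is only an illustration: it reads a tiny in-memory CSV (the `csv_text` string and the `sample` name are assumptions made up for this example, not the real file) and uses the `index_col` parameter of `pd.read_csv` so that the unnamed first column becomes the index.

```python
import io

import pandas as pd

# A tiny stand-in for diamonds.csv; the blank first header is what pandas
# would otherwise label "Unnamed: 0". Illustrative rows only.
csv_text = """,carat,cut,price
1,0.23,Ideal,326
2,0.21,Premium,326
"""

# index_col=0 tells pandas to use the first column as the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(sample.columns))
```

Whether this is preferable to an explicit drop is a matter of taste; the explicit drop keeps a default 0-based index, while this version keeps the ids from the file as the index.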


I then checked each column for nulls. Because regression techniques generally require numerical inputs, the three categorical columns (cut, color, clarity) also need special handling. As a first step I eyeballed the distribution of values in those columns and found nothing that raised any flags.

In [49]:
print(df.isna().sum())
categorical_columns = ['cut','color','clarity']
for column in categorical_columns:
    print(df[column].value_counts())

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64
SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64



Note: the dataset appears to contain 146 duplicate rows, which need to be removed.

I will need to check all four assumptions of linear regression.  

https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression4.html#:~:text=There%20are%20four%20assumptions%20associated,are%20independent%20of%20each%20other.

- Linearity: The relationship between X and the mean of Y is linear.
- Homoscedasticity: The variance of residual is the same for any value of X.
- Independence: Observations are independent of each other.
- Normality: For any fixed value of X, Y is normally distributed.
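
To make the residual-based assumptions a little more concrete, here is a rough sketch on synthetic data (everything in it, including the made-up `x` and `y`, is an assumption for illustration and not the diamonds data). It fits a one-variable least-squares line with NumPy and then computes a few quick numbers: the residual spread in the lower versus upper half of the fitted values as a crude homoscedasticity check, and the skewness and excess kurtosis of the residuals as a crude normality check.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data that satisfies the assumptions by construction
x = rng.uniform(0, 10, size=500)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=500)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# Homoscedasticity: residual spread should be similar for low and high fitted values
order = np.argsort(fitted)
low, high = residuals[order[:250]], residuals[order[250:]]
spread_ratio = high.std() / low.std()

# Normality: standardized residuals should have skewness and excess kurtosis near 0
z = (residuals - residuals.mean()) / residuals.std()
skewness = (z ** 3).mean()
excess_kurtosis = (z ** 4).mean() - 3

print(beta, spread_ratio, skewness, excess_kurtosis)
```

These numbers are no substitute for residual plots or formal tests, but a spread ratio far from 1 or a large skewness would be an early hint that an assumption is in trouble.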

Independence is an easy one to check off: each observation here is a distinct diamond. Just to be safe, though, duplicate rows were identified and dropped.  


In [61]:
df.duplicated().value_counts()

False    53794
True       146
dtype: int64
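
The cell above only counts the duplicates. A minimal sketch of the drop itself, assuming plain `drop_duplicates` with its default settings is what is wanted, is shown here on a three-row toy frame (the `sample` frame and its rows are illustrative, not taken from the real output):

```python
import pandas as pd

# Toy frame in which the third row repeats the first
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Ideal"],
    "price": [326, 326, 326],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0-based index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

On the real DataFrame the equivalent would be `df = df.drop_duplicates()`, which by the counts above should leave 53794 rows.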

In [41]:
# price is the dependent variable; putting it first keeps it in the top-left
# corner of the correlation matrix for the multicollinearity check in the next cell
df = df[['price', 'cut', 'carat', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z']]
# One-hot encode the categorical columns; drop_first=True drops one level per
# column to avoid the dummy variable trap (perfectly collinear dummies)
df = pd.get_dummies(df, drop_first=True)
# Take absolute values so the correlations can be filtered by magnitude later
corr_matrix = df.corr().abs()
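
For readers who have not used `get_dummies` before, here is a small sketch of what `drop_first=True` does. The `sample` frame is a made-up four-row example, not the real data.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative rows)
sample = pd.DataFrame({
    "price": [326, 327, 337, 2757],
    "cut": ["Ideal", "Good", "Fair", "Ideal"],
})

# Each categorical level becomes its own indicator column, except the first
# level in sorted order ("Fair"), which is dropped and acts as the baseline
encoded = pd.get_dummies(sample, drop_first=True)
print(list(encoded.columns))
# ['price', 'cut_Good', 'cut_Ideal']
```

A row with both indicators equal to zero is a "Fair" diamond, so no information is lost. Applied to the full dataset, this is why the frame ends up with 24 columns: 7 numeric ones plus 4, 6, and 7 indicators for cut, color, and clarity.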

The cell below is purposefully simple, to explain step by step the fix found in the link.  
Project Pro assumes a level of familiarity with numpy that some readers might not have.

In [45]:
# https://www.projectpro.io/recipes/drop-out-highly-correlated-features-in-python
import numpy as np

# Step by step. The names avoid 'x', which is also a column in the dataset.
shape = corr_matrix.shape
# (24, 24)
mask = np.ones(shape)
# 24x24 ndarray of 1s
mask = np.triu(mask, 1)
# the main diagonal and everything below it are zeroed
mask = mask.astype(bool)
# the main diagonal and everything below it are False, everything above is True
upper = corr_matrix.where(mask)
# the correlation matrix again, but NaN on and below the main diagonal

# The same thing as the one-liner from the link
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), 1).astype(bool))
# columns whose absolute correlation with any earlier column exceeds 0.8
[column for column in upper.columns if any(upper[column] > 0.8)]


['carat', 'x', 'y', 'z']
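
To see what this check does on something small, here is a minimal sketch on made-up data. The `toy` frame, its column names, and the `highly_correlated_columns` helper are all assumptions for illustration, not part of the notebook. It also shows a side effect worth keeping in mind: with the target in the first column, a feature is flagged when it correlates strongly with the target, not only with other features. In the output above, carat sits directly after price (get_dummies keeps the untouched numeric columns first and appends the dummies), so price is the only column carat is compared against.

```python
import numpy as np
import pandas as pd


def highly_correlated_columns(frame, threshold=0.8):
    """Columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # True strictly above the main diagonal, False elsewhere
    upper_mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
    upper = corr.where(upper_mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "target": 3 * a + rng.normal(scale=0.1, size=n),  # driven almost entirely by a
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),     # near-duplicate of a
    "noise": rng.normal(size=n),                      # unrelated to everything
})

with_target = highly_correlated_columns(toy)
features_only = highly_correlated_columns(toy.drop(columns="target"))
print(with_target)    # ['a', 'a_copy']
print(features_only)  # ['a_copy']
```

If the aim is only to find predictors that are redundant with each other, running the check on the features alone, as in the second call, may be the safer variant; a strong correlation with the target is usually a reason to keep a feature rather than drop it.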

In [47]:
import statsmodels