
# Sara Gets a Diamond Case

# 1. Getting Ready - Import packages

Import packages we will use:

1. NumPy. The de facto standard library for numerical computing and linear algebra in Python. Info: https://numpy.org/doc/
2. pandas. One of the most commonly used libraries for data engineering. Info: https://pandas.pydata.org
3. statsmodels. Commonly used for basic statistical analysis. Info: https://www.statsmodels.org/stable/index.html
4. Plotly Express. Used here for interactive charts. Info: https://plotly.com/python/plotly-express/


In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly.express as px

# 2. Data Engineering

### Step 2.1: Load and explore the data

Upload the data file "UV6248-XLS-ENG.xls" and read it into a DataFrame using the **pd.read_excel()** function.

Rename the Carat Weight variable to get rid of the space in the name.

Remove records with null values.

In [None]:
df = pd.read_excel("UV6248-XLS-ENG.xls", sheet_name = "Raw Data", skiprows=2)
df = df.rename(columns={'Carat Weight': 'CaratWeight'})
df = df.dropna()

Display the last 7 rows of data using the **tail(7)** command.

In [None]:
df.tail(7)
# You can change how many rows you see by passing a different number to tail(); head() works the same way for the first rows

Unnamed: 0,ID,CaratWeight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
5993,5994,2.03,Very Good,H,VS2,VG,G,GIA,18866.0
5994,5995,0.81,Very Good,D,VVS2,VG,VG,GIA,5423.0
5995,5996,1.03,Ideal,D,SI1,EX,EX,GIA,6250.0
5996,5997,1.0,Very Good,D,SI1,VG,VG,GIA,5328.0
5997,5998,1.02,Ideal,D,SI1,EX,EX,GIA,6157.0
5998,5999,1.27,Signature-Ideal,G,VS1,EX,EX,GIA,11206.0
5999,6000,2.19,Ideal,E,VS1,EX,EX,GIA,30507.0


Print out the count, mean, and standard deviation for all numerical variables, rounded to two decimal places, using the **.describe()** command.

In [None]:
df.describe().loc[['count', 'mean', 'std']].round(2)
#Note that you only get summary stats for the numerical fields, not categories

Unnamed: 0,ID,CaratWeight,Price
count,6000.0,6000.0,6000.0
mean,3000.5,1.33,11791.58
std,1732.2,0.48,10184.35


Check which variables are categorical by listing the data type for all variables using **.dtypes**:

In [None]:
df.dtypes

Unnamed: 0,0
ID,int64
CaratWeight,float64
Cut,object
Color,object
Clarity,object
Polish,object
Symmetry,object
Report,object
Price,float64


If you want to look up the values for each categorical variable, you can use **.unique()**:

In [None]:
print(df.Cut.unique())
print(df.Color.unique())
print(df.Clarity.unique())
print(df.Polish.unique())
print(df.Symmetry.unique())
print(df.Report.unique())

['Ideal' 'Very Good' 'Fair' 'Good' 'Signature-Ideal']
['H' 'E' 'G' 'D' 'F' 'I']
['SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'IF' 'FL']
['VG' 'ID' 'EX' 'G']
['EX' 'ID' 'VG' 'G']
['GIA' 'AGSL']


There are, of course, many other ways to find out which variables are categorical. Here is another option:

In [None]:
#Print a list of categorical variables
categorical_variables = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical Variables:", categorical_variables)

Categorical Variables: ['Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report']


This is bad news :( Some categories contain very few records.

You do not want to use a categorical variable as a predictor when one category has only 4 records and another has more than 1000.

# *Exercise*

* Check the number of observations across categories in all variables.
* For problematic categories, combine the "rare categories" together.

Try this on your own with your teammates. If you do not know how, ask ChatGPT.


In [None]:
#INSERT YOUR CODE HERE:

print(f'Value counts in Clarity are {df.Clarity.value_counts()}')
print(f'Value counts in Color are {df.Color.value_counts()}')
print(f'Value counts in Cut are {df.Cut.value_counts()}')
print(f'Value counts in Polish are {df.Polish.value_counts()}')
print(f'Value counts in Symmetry are {df.Symmetry.value_counts()}')
print(f'Value counts in Report are {df.Report.value_counts()}')

Value counts in Clarity are Clarity
SI1     2059
VS2     1575
VS1     1192
VVS2     666
VVS1     285
IF       219
FL         4
Name: count, dtype: int64
Value counts in Color are Color
G    1501
H    1079
F    1013
I     968
E     778
D     661
Name: count, dtype: int64
Value counts in Cut are Cut
Ideal              2482
Very Good          2428
Good                708
Signature-Ideal     253
Fair                129
Name: count, dtype: int64
Value counts in Polish are Polish
EX    2425
VG    2409
ID     595
G      571
Name: count, dtype: int64
Value counts in Symmetry are Symmetry
VG    2417
EX    2059
G      916
ID     608
Name: count, dtype: int64
Value counts in Report are Report
GIA     5266
AGSL     734
Name: count, dtype: int64


In [None]:
df['Clarity'] = df['Clarity'].replace({'IF':'IF_FL','FL':'IF_FL'})

print(f'Value counts in Clarity are {df.Clarity.value_counts()}')

Value counts in Clarity are Clarity
SI1      2059
VS2      1575
VS1      1192
VVS2      666
VVS1      285
IF_FL     223
Name: count, dtype: int64
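In the cell above, IF and FL were merged by hand because they are neighboring grades. When a variable has many rare levels, a count threshold can do the same job. The sketch below is one possible way to do it on toy data; `lump_rare`, `min_count`, and `other_label` are hypothetical names, not part of the case materials.

```python
import pandas as pd

def lump_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace every category seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rare ones with the shared label.
    return series.where(~series.isin(rare), other_label)

# Toy data standing in for a column such as Clarity (not the case data).
toy = pd.Series(["SI1"] * 6 + ["VS2"] * 5 + ["IF"] * 2 + ["FL"] * 1)
lumped = lump_rare(toy, min_count=3, other_label="IF_FL")
print(lumped.value_counts())
```

A threshold is convenient, but it merges categories blindly. For ordinal grades it is usually safer to check that the merged levels are adjacent on the scale, as IF and FL are.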


## Step 2.2 Visual Exploration of the Data


Explore the impact of the categorical variables on the relationship between carat weight and price.

In [None]:
fig = px.scatter(df,
                 x="CaratWeight",
                 y="Price",
                 color = 'Color',
                 height = 350
                )
fig.show()


Let's explore whether taking the log of X or Y makes the relationship closer to linear.

In [None]:
from math import log
df_log = df.copy()
df_log['Price']=df_log['Price'].transform(log)
df_log['CaratWeight']=df_log['CaratWeight'].transform(log)
print(df_log['Price'].head(10))

fig = px.scatter(df_log,
                 x="CaratWeight",
                 y="Price",
                 color = 'Color',
                 height = 350,
                 labels={'Price':'Log of Price', 'CaratWeight': 'Log of CaratWeight'
                         }
                )
fig.show()

0    8.550435
1    8.151910
2    8.065579
3    8.382518
4    8.061802
5    9.456497
6    8.656433
7    9.254357
8    9.831401
9    8.944550
Name: Price, dtype: float64


## Step 2.3 Create dummy variables for categorical variables using **pd.get_dummies()**:

In [None]:
df = pd.get_dummies(data=df, columns = ['Color', 'Report'], drop_first=True, dtype=int)
df.head() # Check how the data looks now
# Only the nominal variables (Color, Report) are dummy-coded here; Cut, Clarity, Polish and Symmetry are encoded on an ordinal scale in Step 2.4

Unnamed: 0,ID,CaratWeight,Cut,Clarity,Polish,Symmetry,Price,Color_E,Color_F,Color_G,Color_H,Color_I,Report_GIA
0,1,1.1,Ideal,SI1,VG,EX,5169.0,0,0,0,1,0,1
1,2,0.83,Ideal,VS1,ID,ID,3470.0,0,0,0,1,0,0
2,3,0.85,Ideal,SI1,EX,EX,3183.0,0,0,0,1,0,1
3,4,0.91,Ideal,SI1,VG,VG,4370.0,1,0,0,0,0,1
4,5,0.83,Ideal,SI1,EX,EX,3171.0,0,0,1,0,0,1


## Step 2.4 Check for multicollinearity

Look at the correlation matrix to see whether we have potential multicollinearity.

First, encode the ordinal categorical variables on an increasing scale.

In [None]:
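# Map each ordinal grade to an integer (intended: higher number = better grade)
scale_map_cut = {'Fair':1, 'Good':2, 'Very Good':3, 'Ideal':4,  'Signature-Ideal':5}
# Caution: the standard clarity scale, from lowest to highest, is SI1 < VS2 < VS1 < VVS2 < VVS1 < IF < FL.
# The map below does not follow that order; the outputs shown in this notebook were produced with it as written.
scale_map_clarity = {'SI1':1, 'VVS2':2, 'VVS1':3, 'VS2':4, 'VS1':5, 'IF_FL':6}
scale_map_polish = {'G':1, 'VG':2, 'EX':3, 'ID':4}

df["Cut"] = df["Cut"].replace(scale_map_cut).astype(int)
df["Clarity"] = df["Clarity"].replace(scale_map_clarity).astype(int)
df["Polish"] = df["Polish"].replace(scale_map_polish).astype(int)
# Symmetry uses the same grade labels as Polish, so the same map applies
df["Symmetry"] = df["Symmetry"].replace(scale_map_polish).astype(int)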
df.head()


Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`



Unnamed: 0,ID,CaratWeight,Cut,Clarity,Polish,Symmetry,Price,Color_E,Color_F,Color_G,Color_H,Color_I,Report_GIA
0,1,1.1,4,1,2,3,5169.0,0,0,0,1,0,1
1,2,0.83,4,5,4,4,3470.0,0,0,0,1,0,0
2,3,0.85,4,1,3,3,3183.0,0,0,0,1,0,1
3,4,0.91,4,1,2,2,4370.0,1,0,0,0,0,1
4,5,0.83,4,1,3,3,3171.0,0,0,1,0,0,1


In [None]:
print(df.dtypes)
# Correlations among the predictors only (Price and ID are excluded)
df.drop(['Price','ID'], axis=1).corr().style.background_gradient(cmap='coolwarm', axis=None)
# "axis=None" assigns the colors based on the values in the whole matrix
# Other good color maps: 'RdBu_r' and 'PuOr_r'

ID               int64
CaratWeight    float64
Cut              int64
Clarity          int64
Polish           int64
Symmetry         int64
Price          float64
Color_E          int64
Color_F          int64
Color_G          int64
Color_H          int64
Color_I          int64
Report_GIA       int64
dtype: object


Unnamed: 0,CaratWeight,Cut,Clarity,Polish,Symmetry,Color_E,Color_F,Color_G,Color_H,Color_I,Report_GIA
CaratWeight,1.0,0.072943,0.14717,0.051494,0.040413,-0.089155,-0.024898,0.037891,0.045767,0.058235,0.011461
Cut,0.072943,1.0,0.106612,0.46231,0.554999,-0.053341,-0.045228,0.057314,0.004432,0.012034,-0.276354
Clarity,0.14717,0.106612,1.0,0.078497,0.063179,-0.074843,-0.009883,0.1119,-0.021487,0.003519,-0.044425
Polish,0.051494,0.46231,0.078497,1.0,0.720307,-0.046379,-0.036705,0.062356,0.014438,0.01582,-0.567507
Symmetry,0.040413,0.554999,0.063179,0.720307,1.0,-0.067113,-0.028463,0.054716,0.021985,0.024339,-0.566545
Color_E,-0.089155,-0.053341,-0.074843,-0.046379,-0.067113,1.0,-0.173963,-0.222948,-0.18074,-0.169293,0.059315
Color_F,-0.024898,-0.045228,-0.009883,-0.036705,-0.028463,-0.173963,1.0,-0.260326,-0.211042,-0.197675,0.02841
Color_G,0.037891,0.057314,0.1119,0.062356,0.054716,-0.222948,-0.260326,1.0,-0.270468,-0.253338,-0.021583
Color_H,0.045767,0.004432,-0.021487,0.014438,0.021985,-0.18074,-0.211042,-0.270468,1.0,-0.205377,-0.02914
Color_I,0.058235,0.012034,0.003519,0.01582,0.024339,-0.169293,-0.197675,-0.253338,-0.205377,1.0,-0.072709


Compute Variance Inflation Factors (VIF)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors only: drop the target (Price) and the row identifier (ID)
X_vif = df.drop(['Price','ID'], axis=1)

# Calculate the VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X_vif.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i)
                   for i in range(X_vif.shape[1])]

print(vif_data)

        feature        VIF
0   CaratWeight   8.736756
1           Cut  24.102396
2       Clarity   4.207357
3        Polish  22.072559
4      Symmetry  22.064853
5       Color_E   1.980216
6       Color_F   2.296860
7       Color_G   3.046606
8       Color_H   2.413377
9       Color_I   2.256681
10   Report_GIA   6.932628
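It helps to see what a VIF actually is: for predictor $j$, $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. As far as we can tell, statsmodels' `variance_inflation_factor` runs that auxiliary regression on the matrix exactly as given and does not add an intercept, so a column with a mean far from zero can show a large VIF even when it is unrelated to the other columns. The NumPy sketch below illustrates the effect on synthetic data; the `vif` helper and its `with_intercept` flag are our own illustration, not a statsmodels API.

```python
import numpy as np

def vif(X: np.ndarray, with_intercept: bool) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        if with_intercept:
            others = np.column_stack([np.ones(n), others])
            total = np.sum((y - y.mean()) ** 2)   # centered total sum of squares
        else:
            total = np.sum(y ** 2)                # uncentered total sum of squares
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - np.sum(resid ** 2) / total
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
# Two independent synthetic predictors whose means are far from zero.
X_demo = rng.normal(loc=5.0, scale=1.0, size=(2000, 2))
vif_no_const = vif(X_demo, with_intercept=False)
vif_const = vif(X_demo, with_intercept=True)
print("without intercept:", vif_no_const)
print("with intercept:   ", vif_const)
```

The two columns are independent, yet the no-intercept version reports a high VIF for both. Keep this in mind when reading the table above: part of a large value may reflect a non-zero mean rather than a relationship among predictors.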


# *Exercise*

Try different approaches to multicollinearity remediation and check how the VIF numbers change. Later, you will have to choose which approach to use for your further analysis.

* Try mean-centering variables
* Try creating combinations of variables
* Try dropping variables

In [None]:
# Mean-centering
df_c = df.copy()
df_c['CaratWeight'] = df_c['CaratWeight'] - df_c['CaratWeight'].mean()
df_c['Price'] = df_c['Price'] - df_c['Price'].mean()

X_centered = df_c.drop(columns=['Price', 'ID'])

# Check VIF after mean-centering
vif_data = pd.DataFrame()
vif_data["Feature"] = X_centered.columns
vif_data["VIF"] = [variance_inflation_factor(X_centered.values, i) for i in range(X_centered.shape[1])]

print(vif_data)

        Feature        VIF
0   CaratWeight   1.027783
1           Cut  23.782066
2       Clarity   4.146063
3        Polish  21.718894
4      Symmetry  22.043226
5       Color_E   1.979058
6       Color_F   2.271893
7       Color_G   3.000654
8       Color_H   2.362880
9       Color_I   2.202865
10   Report_GIA   6.256686

In [None]:
# Combination
df_com = df.copy()
df_com["Combined_Feature"] = df_com[['Cut', 'Polish', 'Symmetry']].mean(axis=1)
df_com = df_com.drop(columns=['Cut', 'Polish', 'Symmetry'])

X_com = df_com.drop(columns=['Price', 'ID'])

# Check VIF after combining variables
vif_data = pd.DataFrame()
vif_data["Feature"] = X_com.columns
vif_data["VIF"] = [variance_inflation_factor(X_com.values, i) for i in range(X_com.shape[1])]

print(vif_data)

            Feature       VIF
0       CaratWeight  8.716214
1           Clarity  4.192126
2           Color_E  1.972230
3           Color_F  2.290628
4           Color_G  3.038508
5           Color_H  2.408485
6           Color_I  2.253953
7        Report_GIA  5.722819
8  Combined_Feature  9.525065


In [None]:
# Drop Variables
df_drop = df.copy()
df_drop = df_drop.drop(columns=['Polish', 'Symmetry'])

X_drop = df_drop.drop(columns=['Price', 'ID'])

# Check VIF after dropping variables
vif_data = pd.DataFrame()
vif_data["Feature"] = X_drop.columns
vif_data["VIF"] = [variance_inflation_factor(X_drop.values, i) for i in range(X_drop.shape[1])]

print(vif_data)

       Feature        VIF
0  CaratWeight   8.426367
1          Cut  10.244471
2      Clarity   4.174843
3      Color_E   1.907157
4      Color_F   2.208260
5      Color_G   2.917201
6      Color_H   2.321840
7      Color_I   2.185718
8   Report_GIA   5.974793


## Step 2.5: Split the data into X and Y.
The vector of Y ("dependent") variable should contain the Price.
The matrix of X ("independent") variables should contain everything we will use to predict Y


In [None]:
Y = df[['Price']]
Y.head() # it's always a good idea to peek at your output

Unnamed: 0,Price
0,5169.0
1,3470.0
2,3183.0
3,4370.0
4,3171.0


In [None]:
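# Alternatives to try:
#X = df.drop(['Price','ID', 'Report_GIA'], axis=1)
#X = df[['CaratWeight']]
# Here we use X_com, the predictor matrix with the combined Cut/Polish/Symmetry feature
X_com.dtypes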

Unnamed: 0,0
CaratWeight,float64
Clarity,int64
Color_E,int64
Color_F,int64
Color_G,int64
Color_H,int64
Color_I,int64
Report_GIA,int64
Combined_Feature,float64


# 3. Build regression models

## 3.1 Try a linear model

In [None]:
# sm.OLS does not add an intercept by default, so we manually add a constant column to the X matrix and call the result X_const
X_const = sm.add_constant(X_com)

# Fit a linear regression model with vector Y as dependent and matrix X_const as independent
lm = sm.OLS(Y, X_const).fit()

# Display the summary of model results
print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.822
Model:                            OLS   Adj. R-squared:                  0.822
Method:                 Least Squares   F-statistic:                     3082.
Date:                Thu, 30 Jan 2025   Prob (F-statistic):               0.00
Time:                        06:43:00   Log-Likelihood:                -58700.
No. Observations:                6000   AIC:                         1.174e+05
Df Residuals:                    5990   BIC:                         1.175e+05
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const            -1.368e+04    447.919  

### How to interpret the output of lm.summary()?

- *Dep. Variable*: the dependent variable of the regression model.
- *R-squared*: the coefficient of determination, which measures the proportion of the variation in the dependent variable that is explained by the independent variables. A value close to 1 indicates a good fit, while a value close to 0 indicates a poor fit.
- *Adj. R-squared*: the coefficient of determination adjusted for the number of predictors in the model. Unlike R-squared, it does not automatically increase when a predictor is added, so it is the better measure for comparing models with different numbers of predictors.
- *Coefficients*: the estimated coefficients of the regression model, including the constant term. Each coefficient represents the change in the dependent variable for a one-unit change in the predictor variable, holding all other predictors constant:
    - The *std err* column shows the standard error of the coefficient estimate, which measures the precision of the estimate.
    - The *t* column shows the t-statistic for each coefficient, which is the coefficient divided by its standard error. The larger its absolute value, the stronger the evidence that the coefficient differs from 0.
    - The *P>|t|* column shows the p-value for each coefficient. By convention, a p-value less than 0.05 indicates that the coefficient is significantly different from 0.
- *Intercept* (shown as *const*): the constant term of the regression model, which represents the estimated value of the dependent variable when all predictor variables are 0.
- *[0.025 0.975]*: the 95% confidence interval for each coefficient. Intervals constructed this way contain the true value of the coefficient in 95% of repeated samples. If the confidence interval does not contain 0, the coefficient is significantly different from 0 at the 5% level.
- *Omnibus*: the Omnibus test of normality, which tests the assumption that the residuals are normally distributed.
- *Prob(Omnibus)*: the p-value of the Omnibus test. A p-value less than 0.05 indicates that the residuals are **not** normally distributed.
- *Skew*: the skewness of the residuals, which measures the degree of asymmetry of the distribution. A value close to 0 indicates that the residuals are symmetrically distributed.
- *Kurtosis*: the kurtosis of the residuals, which measures how heavy the tails of the distribution are (often described as peakedness). A value close to 3 is consistent with normally distributed residuals.
- *Durbin-Watson*: the Durbin-Watson statistic, which tests for autocorrelation of the residuals. A value close to 2 indicates that the residuals are not autocorrelated.
- *Jarque-Bera*: the Jarque-Bera test of normality, which tests the assumption that the residuals are normally distributed.
- *Prob(JB)*: the p-value of the Jarque-Bera test. A p-value less than 0.05 indicates that the residuals are **not** normally distributed.
- *Cond. No.*: the condition number, which measures the sensitivity of the regression results to small changes in the input data. As a rule of thumb, a value above roughly 20 to 30 suggests that the results may be highly sensitive to changes in the input, often because of multicollinearity.
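To make several of these definitions concrete, the sketch below recomputes the main columns of the summary table by hand with NumPy on synthetic data. It is an illustration of the formulas, not the statsmodels implementation, and the confidence interval uses the normal quantile 1.96 as an approximation (statsmodels uses the t distribution).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)   # synthetic data: true intercept 2, slope 3

X = np.column_stack([np.ones(n), x])   # the column of ones plays the role of sm.add_constant
k = X.shape[1]                         # number of estimated coefficients

beta = np.linalg.solve(X.T @ X, X.T @ y)                        # coef
resid = y - X @ beta
sigma2 = resid @ resid / (n - k)                                # residual variance; n - k is "Df Residuals"
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))     # std err
t_stat = beta / std_err                                         # t

r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)          # R-squared
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)                   # Adj. R-squared

# Approximate 95% confidence interval for each coefficient.
ci = np.column_stack([beta - 1.96 * std_err, beta + 1.96 * std_err])
print(beta, std_err, t_stat, r2, adj_r2, ci, sep="\n")
```

Comparing these numbers with `sm.OLS(y, X).fit().summary()` on the same arrays is a quick way to check your reading of the table.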

### Plot the residuals

In [None]:
# Compute the residuals
results = pd.DataFrame()
results['Price'] = df['Price']
results['prediction_lm'] = lm.fittedvalues
results['residual_lm'] = lm.resid

fig = px.scatter(
    results, x='prediction_lm', y='residual_lm', height = 350,
    labels={'prediction_lm':'Predicted values using the Linear Model after Combination', 'residual_lm':'Residuals'}
)
fig.show()

## 3.2 Fit a log-linear model to the case data

In [None]:
from math import exp, log
Y_log = Y['Price'].transform(log)
log_linear_model = sm.OLS(Y_log, X_const).fit()
print(log_linear_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.927
Method:                 Least Squares   F-statistic:                     8409.
Date:                Thu, 30 Jan 2025   Prob (F-statistic):               0.00
Time:                        06:44:15   Log-Likelihood:                 1376.2
No. Observations:                6000   AIC:                            -2732.
Df Residuals:                    5990   BIC:                            -2665.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                6.9949      0.020  

In [None]:
# Compute residuals
results['prediction_llm'] = log_linear_model.fittedvalues
results['residual_llm'] = log_linear_model.resid

fig = px.scatter(
    results, x='prediction_llm', y='residual_llm', height = 350,
    labels={'prediction_llm':'Predicted values using the Log-Linear Model after Combination', 'residual_llm':'Residuals'}
)
fig.show()


## 3.3 Fit a log-log model to the case data

In [None]:
from math import exp, log

Y_log = Y['Price'].transform(log)
X_log = X_com.copy()
X_log['CaratWeight'] = X_log['CaratWeight'].transform(log)
X_log_const = sm.add_constant(X_log)
log_log_model = sm.OLS(Y_log, X_log_const).fit()

print(log_log_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.949
Method:                 Least Squares   F-statistic:                 1.253e+04
Date:                Thu, 30 Jan 2025   Prob (F-statistic):               0.00
Time:                        06:52:46   Log-Likelihood:                 2500.0
No. Observations:                6000   AIC:                            -4980.
Df Residuals:                    5990   BIC:                            -4913.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                8.3945      0.016  

# *Exercise*

Try fitting a log-log model on your own.
Then create a residual plot.

In [None]:
#INSERT CODE HERE:
# Use separate column names so the log-linear results computed earlier are not overwritten
results['prediction_loglog'] = log_log_model.fittedvalues
results['residual_loglog'] = log_log_model.resid

fig = px.scatter(
    results, x='prediction_loglog', y='residual_loglog', height = 350,
    labels={'prediction_loglog':'Predicted values using the Log-Log Model', 'residual_loglog':'Residuals'}
)
fig.show()
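For reference, the three specifications fitted in this section can be written side by side. Only the carat weight term is shown explicitly; the dots stand for the remaining predictors, which enter all three models unchanged.

$$
\begin{aligned}
\text{Linear:} \quad & \text{Price} = \beta_0 + \beta_1\,\text{CaratWeight} + \dots + \varepsilon \\
\text{Log-linear:} \quad & \ln(\text{Price}) = \beta_0 + \beta_1\,\text{CaratWeight} + \dots + \varepsilon \\
\text{Log-log:} \quad & \ln(\text{Price}) = \beta_0 + \beta_1\,\ln(\text{CaratWeight}) + \dots + \varepsilon
\end{aligned}
$$

The interpretation of $\beta_1$ changes with the specification. In the linear model it is the price change in dollars per additional carat. In the log-linear model, one additional carat multiplies the price by $e^{\beta_1}$. In the log-log model, $\beta_1$ is an elasticity: a 1% increase in carat weight changes the price by approximately $\beta_1$%. Note also that the residuals of the two log models are measured on the log-price scale, so their magnitudes are not directly comparable with the residuals of the linear model.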

# 4. Cross-validation

Which of these models predicts best on data it has not seen? The main machine learning principle for answering this question is cross-validation: split the data into training (80%) and testing (20%) subsets, train on the former, and test on the latter.

In [None]:
# Redefine X and Y for the training and testing data
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_com, Y, test_size=0.2, random_state=42)

#Add a constant to the X's:
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

X_train_const.head()

Unnamed: 0,const,CaratWeight,Clarity,Color_E,Color_F,Color_G,Color_H,Color_I,Report_GIA,Combined_Feature
3897,1.0,1.34,4,0,0,0,1,0,1,2.0
5628,1.0,2.57,4,0,0,0,1,0,1,3.0
1756,1.0,1.01,6,0,0,0,0,0,1,3.333333
2346,1.0,2.09,5,0,1,0,0,0,1,2.666667
2996,1.0,0.9,1,0,0,0,0,1,1,2.333333
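The single 80/20 split above is the simplest form of this idea. A k-fold variant repeats it k times, so that every record is used for testing exactly once, and averages the error. The sketch below shows the mechanics with NumPy on synthetic data, scoring each fold with the mean absolute percentage error used in the sections that follow. The `kfold_mape` helper and the synthetic price formula are assumptions made for illustration, not the case data or a library API.

```python
import numpy as np

def kfold_mape(X: np.ndarray, y: np.ndarray, k: int = 5, seed: int = 42) -> float:
    """Average out-of-fold MAPE (in %) of an OLS model fitted to log(y), over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only; the target is the log of the price.
        beta, *_ = np.linalg.lstsq(X[train_idx], np.log(y[train_idx]), rcond=None)
        pred = np.exp(X[test_idx] @ beta)   # back-transform to the price scale
        scores.append(np.mean(np.abs((y[test_idx] - pred) / y[test_idx])) * 100)
    return float(np.mean(scores))

# Synthetic stand-in: price follows a noisy power law in carat weight (arbitrary parameters).
rng = np.random.default_rng(0)
carat = rng.uniform(0.75, 2.5, size=1000)
price = np.exp(8.0 + 2.0 * np.log(carat) + rng.normal(scale=0.1, size=1000))
X_syn = np.column_stack([np.ones_like(carat), np.log(carat)])
score = kfold_mape(X_syn, price)
print(f"5-fold MAPE = {score:.2f} %")
```

With 6000 records a single split is usually stable enough for a classroom comparison; k-fold becomes more useful when the candidate models score close to each other.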


## 4.1 Check the Linear Model

In [None]:
# Fit a linear regression model to the training data
lm = sm.OLS(Y_train, X_train_const).fit()

# Use the trained model to predict the prices for the testing data. Call the vector of predicted prices Y_pred
Y_pred = lm.predict(X_test_const)
percent_errors = np.abs((Y_test['Price'] - Y_pred) / Y_test['Price']) *100
print("Linear Model MAPE = ", np.mean(percent_errors), "%")

Linear Model MAPE =  29.630641316127026 %
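The metric printed above is the mean absolute percentage error, $\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$. Since the same expression is repeated for each model, it can be wrapped in a small helper. The function name `mape` is our own choice for this sketch.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent. Assumes no true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Tiny worked example: both predictions are off by 10% of the true value.
example = mape([100.0, 200.0], [110.0, 180.0])
print(example)
```

In this notebook the call would look like `mape(Y_test['Price'], Y_pred)`. For the log models, remember to exponentiate the predictions first so that the error is measured on the price scale.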


## 4.2 Check the Log-Linear Model


In [None]:
# Fit a log-linear regression model to the training data
Y_train_log = np.log(Y_train['Price'])
llm = sm.OLS(Y_train_log, X_train_const).fit()

Y_pred_llm = np.exp(llm.predict(X_test_const))
percent_errors = np.abs((Y_test['Price'] - Y_pred_llm) / Y_test['Price']) *100
print("Log-Linear Model MAPE = ", np.mean(percent_errors), "%")

Log-Linear Model MAPE =  14.357497084769017 %


## 4.3 Check the Log-Log Model

In [None]:
# Fit a log-log regression model to the training data
X_train_log = X_train.copy()
X_train_log['CaratWeight'] = np.log(X_train_log['CaratWeight'])
X_train_log_const = sm.add_constant(X_train_log)

# Apply the same transformation to the testing data
X_test_log = X_test.copy()
X_test_log['CaratWeight'] = np.log(X_test_log['CaratWeight'])
X_test_log_const = sm.add_constant(X_test_log)


loglog = sm.OLS(Y_train_log, X_train_log_const).fit()

Y_pred_loglog = np.exp(loglog.predict(X_test_log_const))
percent_errors = np.abs((Y_test['Price'] - Y_pred_loglog) / Y_test['Price']) *100
print("Log-Log Model MAPE = ", np.mean(percent_errors), "%")

Log-Log Model MAPE =  12.020266258270366 %


# 5. How to improve the model further? Push the model further with your team

Recall our visualizations. We observed multiple effects:

1.   Price increases with Carat Weight -- our model already "learned" that: Carat Weight is one of the variables, and its coefficient is positive.
2.   Price increases faster than linearly with Carat Weight -- our model already "learned" that through the log-log transformation.
3.   Diamonds with "better" colors are more expensive -- our model already "somewhat learned" that too: the best ("D") color diamond is in the intercept, and the other Color coefficients are negative. However, our model does not let the slope of the line change based on the color. What can you do? Perhaps try some interactions.
4.   There seems to be a discontinuity around 2 carats, and maybe another around 1 carat. Try splitting the dataset into intervals and fitting a separate model in each interval.
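As a starting point for items 3 and 4, the sketch below builds interaction terms and a weight-segment label on a few toy rows. The column names mirror the encoded data in this notebook, but `LogCarat`, the `LogCarat_x_...` columns, `Segment`, and the cut points at 1 and 2 carats are assumptions for illustration. Whether they improve the test MAPE is for you to check.

```python
import numpy as np
import pandas as pd

# Toy rows with the same column names as the notebook's encoded data (not the case data).
toy = pd.DataFrame({
    "CaratWeight": [0.80, 1.05, 1.50, 2.01, 2.40],
    "Color_E":     [1, 0, 0, 1, 0],
    "Color_F":     [0, 1, 0, 0, 0],
})
toy["LogCarat"] = np.log(toy["CaratWeight"])

# Interaction terms: let the slope on log carat weight differ by color.
for col in ["Color_E", "Color_F"]:
    toy[f"LogCarat_x_{col}"] = toy["LogCarat"] * toy[col]

# Segment label for fitting a separate model in each weight interval.
toy["Segment"] = pd.cut(
    toy["CaratWeight"],
    bins=[0, 1, 2, np.inf],
    right=False,   # intervals are [0, 1), [1, 2), [2, inf)
    labels=["under_1", "1_to_2", "2_plus"],
)
print(toy)
```

With the interaction columns added to X, a single regression can estimate a different carat slope for each color. With the segment label, `groupby("Segment")` gives one subset per interval on which to fit a separate model.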

