# Predicting the price of diamonds
## Phase 2: Statistical Modelling

#### Group Name:

#### Name(s) & ID(s) of Group Members:


## Table of Contents
- [Introduction](#intro)
- [Statistical Modelling](#sm)
- [Critique & Limitations](#cl)
- [Summary & Conclusions](#sc)

## Introduction <a id='intro'></a>

### Phase 1 Summary

A brief yet complete and accurate summary of the work conducted for your Phase 1 report and how it relates to your Phase 2 report.

**Important Phase 2 Note:** Please do NOT include your Phase 1 report or its contents with your Phase 2 submission. You can, however, make changes to your Phase 1 tasks if you need to, and then include ONLY these changes in your Phase 2 report, with a brief explanation of why they were made.

### Report Overview

A complete and accurate overview of the contents of your Phase 2 report. Clarification: A Table of Contents is not a report overview.

### Overview of Methodology

A detailed, complete, and accurate overview of your statistical modelling methodology (which is multiple linear regression).

## Statistical Modelling <a id='sm'></a>

### Model Overview

Overview of full model, including the variables and terms you are using in your regression model.

#### Module Imports

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None) 

%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

df = pd.read_csv('cleaned_diamonds.csv')
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price
0,0.31,Ideal,E,SI1,62.4,57.0,544
1,0.71,Ideal,D,SI2,60.4,56.0,2516
2,0.9,Good,G,VVS2,62.6,58.0,4668
3,0.72,Ideal,D,VS2,60.8,56.0,3179
4,1.42,Ideal,J,VS2,61.7,56.0,7942


### Model Fitting

Details of assumptions check, model selection, plots of residuals, and technical analysis of regression results.

**NOTE:** The second half of [this](https://github.com/vaksakalli/stats_tutorials/blob/master/Regression_Case_Study1_web.ipynb) regression case study ("Statistical Modeling and Performance Evaluation" Section) will be **very helpful** for this Model Fitting section.

Several of our features (`cut`, `color`, and `clarity`) are still categorical. We must first one-hot encode them so that they can be used as regressors in the model.

In [16]:
df_enc = pd.get_dummies(df)
df_enc.sample(10, random_state=405)

Unnamed: 0,carat,depth,table,price,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very Good,color_D,color_E,color_F,color_G,color_H,color_I,color_J,clarity_I1,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2
21045,0.76,63.4,58.0,3517,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
29418,0.39,61.1,60.0,1232,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3112,0.33,61.3,56.0,854,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
20503,0.41,62.3,58.0,876,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
5274,1.55,62.0,60.0,15966,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
53,0.4,61.9,56.0,1570,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
63,1.01,63.8,59.0,4899,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
20355,0.3,62.7,53.0,590,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
17264,0.7,64.3,57.0,2142,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
466,0.55,60.5,59.0,1348,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
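
As a hedged aside, `pd.get_dummies` creates one indicator column per category level by default, so the indicator columns of each variable always sum to 1. The sketch below uses a small made-up frame (`toy`, not the diamonds data) to compare the default encoding with `drop_first=True`, which omits one reference level per variable. `dtype=int` is passed only so that the indicators are 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the encoding (not the diamonds data)
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 0.9, 1.4],
    "cut": ["Ideal", "Good", "Ideal", "Very Good"],
})

# Default: one indicator column per level of `cut`
full = pd.get_dummies(toy, dtype=int)

# drop_first=True: the first level in sorted order ("Good") becomes the reference
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop a reference level depends on whether the model contains an intercept; with an intercept, keeping every level makes the indicator columns redundant.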


Some of our column names also contain spaces. The formula interface of statsmodels cannot parse unquoted column names that contain spaces, so we replace the spaces with underscores:

In [17]:
# Replace spaces in column names with underscores (column order is unchanged).
# This avoids the positional `axis` argument of `drop`, which newer pandas versions reject.
df_enc.columns = df_enc.columns.str.replace(' ', '_')

df_enc.head()

Unnamed: 0,carat,depth,table,price,cut_Fair,cut_Good,cut_Ideal,cut_Premium,cut_Very_Good,color_D,color_E,color_F,color_G,color_H,color_I,color_J,clarity_I1,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2
0,0.31,62.4,57.0,544,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
1,0.71,60.4,56.0,2516,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0.9,62.6,58.0,4668,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
3,0.72,60.8,56.0,3179,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1.42,61.7,56.0,7942,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0


Next, we build the formula string for our initial full model:

In [18]:
formula = 'price ~ ' + ' + '.join(df_enc.drop(columns='price').columns)
print(f'Formula string: {formula}')

Formula string: price ~ carat + depth + table + cut_Fair + cut_Good + cut_Ideal + cut_Premium + cut_Very_Good + color_D + color_E + color_F + color_G + color_H + color_I + color_J + clarity_I1 + clarity_IF + clarity_SI1 + clarity_SI2 + clarity_VS1 + clarity_VS2 + clarity_VVS1 + clarity_VVS2
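
An alternative worth considering, sketched here under the assumption that the original (un-encoded) `df` would be used as the data: the formula interface can encode categorical columns itself when they are wrapped in `C(...)`. With an intercept in the model, patsy's default treatment coding leaves out one reference level per variable. The column names below are copied from the `df.head()` output; the list names `numeric_cols` and `categorical_cols` are illustrative.

```python
# Column names taken from the df.head() output; the list names are illustrative
numeric_cols = ["carat", "depth", "table"]
categorical_cols = ["cut", "color", "clarity"]

# Wrap categorical columns in C(...) so patsy treats them as categorical terms
terms = numeric_cols + [f"C({col})" for col in categorical_cols]
formula_cat = "price ~ " + " + ".join(terms)

print(formula_cat)
```

Such a string could be passed to `smf.ols(formula=formula_cat, data=df)` in place of the manually encoded frame, which would also make the column renaming step unnecessary.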


Now, we construct the initial full model:

In [19]:
model_full = sm.formula.ols(formula=formula, data=df_enc)
model_full_fit = model_full.fit()
print(model_full_fit.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.917
Model:                            OLS   Adj. R-squared:                  0.917
Method:                 Least Squares   F-statistic:                 1.656e+04
Date:                Wed, 20 Oct 2021   Prob (F-statistic):               0.00
Time:                        15:46:38   Log-Likelihood:            -2.5400e+05
No. Observations:               30000   AIC:                         5.080e+05
Df Residuals:                   29979   BIC:                         5.082e+05
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      -518.0115    327.342     -1.582

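The results above deserve a closer look. `R-squared` is 0.917, but `Df Model` is 20 even though the formula contains 23 terms. Because every level of `cut`, `color`, and `clarity` has its own indicator column, the indicators of each variable sum to 1 and duplicate the intercept, so three of the columns are redundant and the individual coefficients are not uniquely determined. **This still needs to be resolved.**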
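
One way to investigate a summary like this is to compare the number of columns in the design matrix with its rank. In the fit above there are 23 regressors plus the intercept (24 columns), while `Df Model: 20` corresponds to a rank of 21, one less per categorical variable. The sketch below reproduces the effect on a made-up frame (`toy`, not the diamonds data); `design_rank` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two categorical variables (not the diamonds data)
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 0.9, 1.4, 0.5, 1.1],
    "cut": ["Ideal", "Good", "Ideal", "Fair", "Good", "Fair"],
    "color": ["D", "E", "D", "E", "E", "D"],
})

def design_rank(frame):
    """Return (number of columns, rank) of the design matrix with an intercept."""
    X = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return X.shape[1], int(np.linalg.matrix_rank(X))

full = pd.get_dummies(toy, dtype=int)                      # every level kept
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # one level dropped per variable

print(design_rank(full))     # more columns than rank: redundant indicators
print(design_rank(reduced))  # columns and rank agree
```

When the two numbers differ, the coefficients of the affected indicator columns are not uniquely determined, so their p-values should not be used for feature selection.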

#### Feature Selection

You can use the code below to perform backward feature selection using p-values ([credit](https://github.com/vaksakalli/stats_tutorials/blob/master/Regression_Case_Study1_web.ipynb)).

In [20]:
## create the patsy model description from formula
patsy_description = patsy.ModelDesc.from_formula(formula)

# initialize feature-selected fit to full model
linreg_fit = model_full_fit

# do backwards elimination using p-values
p_val_cutoff = 0.05

## WARNING 1: The code below assumes that the Intercept term is present in the model.
## WARNING 2: It will work only with main effects and two-way interactions, if any.

print('\nPerforming backwards feature selection using p-values:')

while True:

    # uncomment the line below if you would like to see the regression summary
    # in each step:
    ### print(linreg_fit.summary())

    pval_series = linreg_fit.pvalues.drop(labels='Intercept')
    pval_series = pval_series.sort_values(ascending=False)
    term = pval_series.index[0]
    pval = pval_series.iloc[0]  # positional access; the index holds term names
    if (pval < p_val_cutoff):
        break
    term_components = term.split(':')
    print(f'\nRemoving term "{term}" with p-value {pval:.4}')
    if (len(term_components) == 1): ## this is a main effect term
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0])]))    
    else: ## this is an interaction term
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0]), 
                                                        patsy.EvalFactor(term_components[1])]))    
        
    linreg_fit = smf.ols(formula=patsy_description, data=df_enc).fit()
    
###
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula = patsy_description, data = df_enc).fit()
###
    
#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")


Performing backwards feature selection using p-values:

Removing term "cut_Very_Good" with p-value 0.4031

Removing term "cut_Premium" with p-value 0.05718

***
                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.917
Model:                            OLS   Adj. R-squared:                  0.917
Method:                 Least Squares   F-statistic:                 1.743e+04
Date:                Wed, 20 Oct 2021   Prob (F-statistic):               0.00
Time:                        15:46:38   Log-Likelihood:            -2.5400e+05
No. Observations:               30000   AIC:                         5.080e+05
Df Residuals:                   29980   BIC:                         5.082e+05
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t   
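
The reduced summary reports the same value for `R-squared` and `Adj. R-squared`. A short check of the standard adjusted R-squared formula, using the numbers printed in that summary, suggests why: with 30000 observations and 19 model degrees of freedom, the penalty for the number of regressors is negligible. The function name `adjusted_r2` is illustrative, and the input 0.917 is the rounded value from the summary, so the result is approximate.

```python
def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for a model with an intercept and df_model regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# Values read from the reduced-model summary (R-squared is rounded there)
adj = adjusted_r2(r2=0.917, n_obs=30000, df_model=19)
print(round(adj, 3))
```

For this reason, adjusted R-squared is unlikely to distinguish between the full and reduced models on a data set of this size.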

## Critique & Limitations <a id='cl'></a>

Critique & Limitations of your approach: strengths and weaknesses in detail.

## Summary & Conclusions <a id='sc'></a>

### Project Summary

A comprehensive summary of your entire project (both Phase 1 and Phase 2). That is, what exactly did you do in your project? (Example: I first cleaned the data in such and such ways. And then I applied multiple linear regression techniques in such and such ways. etc).

### Summary of Findings

A comprehensive summary of your findings. That is, what exactly did you find about your particular problem?

### Conclusions

Your detailed conclusions as they relate to your goals and objectives.