# Data Science - Unit 1 Sprint 3 Module 4

## Module Project: Metrics, Bias and Variance

### Learning Objectives

* Interpret your model results using the OLS summary and scikit-learn metrics
* Define and analyze bias in your model
* Define and analyze variance in your model

## Analyzing results from diamonds

Use the seaborn dataset `diamonds` to fit a linear regression model and compute the common metrics used to evaluate how well the model fits the data.

**Task 1** - Load the data
Load the `diamonds` dataset from the `seaborn` package. 

- Assign the resulting DataFrame to a variable called `dia`
- Make sure to import the packages you expect to use for an `ols` linear regression model. 

In [2]:
#Task 1

#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from statsmodels.formula.api import ols


### BEGIN SOLUTION
dia = sns.load_dataset('diamonds')
### END SOLUTION

In [3]:
# Task 1 - Tests

assert isinstance(dia, pd.DataFrame)

### BEGIN HIDDEN TESTS
assert dia.columns[0] == 'carat', "Did you load the correct dataframe?"
### END HIDDEN TESTS

**Task 2** - Conduct EDA on your dataset
- Check for null values. Assign the total number of null values in your dataset to `num_null`

In [4]:
#Task 2
### BEGIN SOLUTION
num_null = dia.isnull().sum().sum()
### END SOLUTION


In [5]:
# Task 2 - Tests

### BEGIN HIDDEN TESTS
assert num_null == 0
### END HIDDEN TESTS

**Task 3** - Visualize your feature distributions

- Use seaborn's `pairplot` to visualize the distributions for all your dataset's features.
- You can access the documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html)
- This task will not be autograded.

**Task 4**

How would you describe the distribution for the `price` feature?

A: Uniform

B: Right skewed

C: Left skewed

D: Normally distributed

Specify your answer in the next code block using `Answer =`. For example, if the correct answer is choice B, type `Answer = 'B'`.

In [6]:
#Task 4

### BEGIN SOLUTION
Answer = "B"
### END SOLUTION

In [7]:
#Task 4 - Test

### BEGIN HIDDEN TESTS
assert Answer == "B", "Check your histogram."
### END HIDDEN TESTS
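
A histogram is the quickest check, but skew can also be confirmed numerically. The sketch below is only an illustration: it uses a synthetic log-normal sample as a stand-in (not the `diamonds` prices), and `sample_skewness` is a hypothetical helper name. In a right-skewed distribution the mean sits above the median and the third standardized moment is positive.

```python
import numpy as np

# Synthetic right-skewed sample (log-normal). This is a stand-in for illustration,
# NOT the actual `price` column.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)


def sample_skewness(values):
    """Third standardized moment; a positive value indicates a longer right tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


skew = sample_skewness(sample)
mean_above_median = sample.mean() > np.median(sample)

print(f"skewness: {skew:.2f}, mean > median: {mean_above_median}")
```

The same two checks could be applied to `dia['price']` once the data is loaded.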

**Task 5** - Check for multicollinearity

- Determine the `pearson` correlation between `carat` and each of the `x`, `y`, and `z` columns.
- Assign the correlations to `x_corr`, `y_corr`, and `z_corr` respectively.

In [8]:
#Task 5

### BEGIN SOLUTION
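# Select the numeric columns explicitly: `cut`, `color`, and `clarity` are categorical,
# and recent pandas versions raise an error if `corr()` receives non-numeric columns.
x_corr, y_corr, z_corr = dia[['carat', 'x', 'y', 'z']].corr().loc['carat', ['x', 'y', 'z']]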
### END SOLUTION


print(x_corr, y_corr, z_corr)

0.9750942267264254 0.9517221990129883 0.9533873805614275


In [9]:
#Task 5 - Test


### BEGIN HIDDEN TESTS
assert round(x_corr, 3) == 0.975
assert round(y_corr, 3) == 0.952
assert round(z_corr, 3) == 0.953
### END HIDDEN TESTS
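
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand and compares it with `np.corrcoef`. The data is synthetic (a made-up weight-like variable and a noisy cube root of it), and `pearson_r` is a hypothetical helper, so the numbers will not match the `diamonds` results.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )


# Synthetic stand-ins: a weight-like variable and a noisy length-like variable.
rng = np.random.default_rng(1)
carat_like = rng.uniform(0.2, 3.0, size=500)
length_like = carat_like ** (1 / 3) + rng.normal(0.0, 0.02, size=500)

r_manual = pearson_r(carat_like, length_like)
r_numpy = np.corrcoef(carat_like, length_like)[0, 1]

print(r_manual, r_numpy)
```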

**Task 6** 


Because these three columns are highly correlated with `carat` (they record the diamond's dimensions, which largely determine its weight), including them alongside `carat` would add mostly redundant information to the model. Drop the three columns and reassign the result to `dia`.

In [10]:
#Task 6

### BEGIN SOLUTION
dia = dia.drop(columns=['x', 'y', 'z'])
### END SOLUTION

In [11]:
#Task 6 - Test


### BEGIN HIDDEN TESTS
assert dia.shape[1] == 7
### END HIDDEN TESTS
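
One way to put a number on how redundant these columns are is the variance inflation factor (VIF), which this module does not require. As a simplifying assumption, the sketch below considers a model with only two predictors (`carat` plus one dimension column), where the VIF reduces to `1 / (1 - r**2)`. The correlations are the values printed in Task 5; `two_predictor_vif` is a hypothetical helper. A full VIF would instead regress each feature on all the others.

```python
# Pairwise correlations with `carat`, copied from the Task 5 output.
pairwise_r = {
    "x": 0.9750942267264254,
    "y": 0.9517221990129883,
    "z": 0.9533873805614275,
}


def two_predictor_vif(r):
    """VIF when the model has exactly two predictors with correlation r (an assumption)."""
    return 1.0 / (1.0 - r ** 2)


vifs = {name: two_predictor_vif(r) for name, r in pairwise_r.items()}

for name, vif in vifs.items():
    print(f"{name}: VIF = {vif:.1f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, and all three columns exceed that range under this simplification.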

**Task 7** - OLS Modeling

- Use `carat` as your independent feature.
- Use `price` as your dependent variable (the target).
- Build an OLS model and review the summary report. Make sure to assign the fitted model to a variable called `model`.

In [12]:
#Task 7

### BEGIN SOLUTION
model = ols('price ~ carat', data=dia).fit()
### END SOLUTION

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.849
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                 3.041e+05
Date:                Fri, 28 Jan 2022   Prob (F-statistic):               0.00
Time:                        13:02:06   Log-Likelihood:            -4.7273e+05
No. Observations:               53940   AIC:                         9.455e+05
Df Residuals:                   53938   BIC:                         9.455e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2256.3606     13.055   -172.830      0.0

In [13]:
#Task 7 - Test


### BEGIN HIDDEN TESTS
assert {'Intercept', 'carat'}.issubset(model.params.index)
### END HIDDEN TESTS
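
With a single feature, the OLS coefficients have a closed form: the slope is the covariance of the feature and the target divided by the variance of the feature, and the intercept makes the line pass through the two means. The sketch below checks this against `np.polyfit` on synthetic data. The "true" intercept and slope used to generate the data are made up for illustration and are not the fitted `diamonds` coefficients.

```python
import numpy as np

# Synthetic data with arbitrary, made-up coefficients (not the diamonds fit).
rng = np.random.default_rng(2)
x = rng.uniform(0.2, 3.0, size=1_000)
y = -2000.0 + 7500.0 * x + rng.normal(0.0, 500.0, size=1_000)

# Closed-form simple linear regression.
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Reference fit: polyfit returns the highest-degree coefficient first.
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

print(intercept, slope)
```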


**Task 8** - Predictions and Residuals 

- Create a new column that includes your model predictions for your features. Name the column `y_pred`
- Calculate the prediction residuals. Assign the values to a column named `residuals`.

In [14]:
#Task 8

### BEGIN SOLUTION
dia['y_pred'] = model.predict()
dia['residuals'] = dia['price'] - dia['y_pred']
### END SOLUTION

In [15]:
#Task 8 - Test

assert dia.shape == (53940, 9), "Have you created the two columns?"

### BEGIN HIDDEN TESTS
assert round(dia['residuals'][0], 2) == 798.38, "Ensure you've properly calculated your residual values."
### END HIDDEN TESTS
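
A quick sanity check on residuals: for an OLS fit that includes an intercept, the residuals average to zero (up to floating-point error). The sketch below demonstrates this on synthetic data with made-up coefficients; the same check could be run on `dia['residuals']`.

```python
import numpy as np

# Synthetic data with arbitrary, made-up coefficients (not the diamonds fit).
rng = np.random.default_rng(3)
x = rng.uniform(0.2, 3.0, size=1_000)
y = -2000.0 + 7500.0 * x + rng.normal(0.0, 500.0, size=1_000)

# Fit a line, predict, and compute residuals as actual minus predicted.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x
residuals = y - y_pred

mean_residual = residuals.mean()
print(mean_residual)
```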

**Task 9** - Metrics

- Determine the values for the **mean absolute error, the mean squared error** and the **root mean squared error** for your previous model. 
- Assign the values as `mae`, `mse`, and `rmse` respectively. 
- *Hint*: We discussed a few methods for this in class. You can refer to this [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) for other metric values.

In [16]:
#Task 9

### BEGIN SOLUTION
mae = metrics.mean_absolute_error(dia['price'], dia['y_pred'])
mse = metrics.mean_squared_error(dia['price'], dia['y_pred'])
rmse = np.sqrt(mse)
### END SOLUTION

print(mae, mse, rmse)

1007.4632473569903 2397955.05001268 1548.5331930613177


In [16]:
#Task 9 - Test

### BEGIN HIDDEN TESTS
assert round(rmse, 2) == 1548.53
### END HIDDEN TESTS
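
To see what the scikit-learn functions compute, the three metrics can be worked out by hand. The sketch below uses four made-up price/prediction pairs (not rows from `diamonds`) and then confirms that the RMSE printed above is the square root of the printed MSE.

```python
import numpy as np

# Four made-up actual/predicted pairs, for illustration only.
y_true = np.array([300.0, 500.0, 1200.0, 4000.0])
y_hat = np.array([250.0, 650.0, 1000.0, 4400.0])

errors = y_true - y_hat                  # 50, -150, 200, -400

mae_manual = np.abs(errors).mean()       # average absolute error
mse_manual = (errors ** 2).mean()        # average squared error
rmse_manual = np.sqrt(mse_manual)        # back in the units of the target

print(mae_manual, mse_manual, rmse_manual)

# Consistency check on the values printed in Task 9.
printed_mse = 2397955.05001268
printed_rmse = 1548.5331930613177
print(np.isclose(np.sqrt(printed_mse), printed_rmse))
```

Because the errors are squared before averaging, the single 400-unit miss dominates the MSE, which is why the RMSE comes out larger than the MAE here.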

**Task 10** - OLS Modeling, Additional Features

- Use `depth`, `table`, and `carat` as your independent features.
- Use `price` as your dependent variable (the target).
- Build an OLS model and review the summary report. Make sure to assign the fitted model to a variable called `model`.

In [17]:
#Task 10

### BEGIN SOLUTION
model = ols('price ~ carat + table + depth', data=dia).fit()
### END SOLUTION

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.854
Model:                            OLS   Adj. R-squared:                  0.854
Method:                 Least Squares   F-statistic:                 1.049e+05
Date:                Thu, 27 Jan 2022   Prob (F-statistic):               0.00
Time:                        15:51:51   Log-Likelihood:            -4.7194e+05
No. Observations:               53940   AIC:                         9.439e+05
Df Residuals:                   53936   BIC:                         9.439e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     1.3e+04    390.918     33.264      0.0

In [18]:
#Task 10 - Test


assert len(model.params.index) == 4, "Make sure the model includes the intercept and all three features."
### BEGIN HIDDEN TESTS
assert {'Intercept', 'carat', 'table', 'depth'}.issubset(model.params.index)
### END HIDDEN TESTS
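
Both summaries report identical R-squared and adjusted R-squared values. Adjusted R-squared penalizes extra features via `1 - (1 - R2) * (n - 1) / (n - p - 1)`, and with 53,940 observations that penalty is negligible. The sketch below plugs in the values from the two summary reports. Note that the R-squared inputs are already rounded to three decimals in the reports, so the results are approximate; `adjusted_r2` is a hypothetical helper.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept and n_features predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


n_obs = 53940  # "No. Observations" in both summaries

# R-squared values as printed (rounded) in the two summary reports.
adj_one_feature = adjusted_r2(0.849, n_obs, n_features=1)
adj_three_features = adjusted_r2(0.854, n_obs, n_features=3)

print(round(adj_one_feature, 3), round(adj_three_features, 3))
```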


**Task 11** - Predictions and Residuals 

- Create a new column that includes your model predictions for your features. Name the column `y_pred`
- Calculate the prediction residuals. Assign the values to a column named `residuals`.


In [19]:
#Task 11

### BEGIN SOLUTION
dia['y_pred'] = model.predict()
dia['residuals'] = dia['price'] - dia['y_pred']
### END SOLUTION

In [20]:
#Task 11 - Test

assert dia.shape == (53940, 9), "Have you created the two columns?"

### BEGIN HIDDEN TESTS
assert round(dia['residuals'][0], 2) == 562.08, "Ensure you've properly calculated your residual values."
### END HIDDEN TESTS

**Task 12** - Predictions and Residuals 

- Create a new column that includes your model predictions for your features. Name the column `y_pred`
- Calculate the prediction residuals. Assign the values to a column named `residuals`.


In [21]:
#Task 12

### BEGIN SOLUTION
dia['y_pred'] = model.predict()
dia['residuals'] = dia['price'] - dia['y_pred']
### END SOLUTION

In [22]:
#Task 12 - Test

assert dia.shape == (53940, 9), "Have you created the two columns?"

### BEGIN HIDDEN TESTS
assert round(dia['residuals'][0], 2) == 562.08, "Ensure you've properly calculated your residual values."
### END HIDDEN TESTS

**Task 13** - Metrics

- Determine the values for the **mean absolute error, the mean squared error** and the **root mean squared error** for your previous model. 
- Assign the values as `mae`, `mse`, and `rmse` respectively. 
- *Hint*: We discussed a few methods for this in class. You can refer to this [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) for other metric values.

In [23]:
#Task 13

### BEGIN SOLUTION
mae = metrics.mean_absolute_error(dia['price'], dia['y_pred'])
mse = metrics.mean_squared_error(dia['price'], dia['y_pred'])
rmse = np.sqrt(mse)
### END SOLUTION


In [24]:
#Task 13 - Test

### BEGIN HIDDEN TESTS
assert round(rmse, 2) == 1526.04
### END HIDDEN TESTS 
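
To put the two models side by side, the sketch below compares the RMSE values asserted in the Task 9 and Task 13 tests. It only does arithmetic on those two rounded numbers; the variable names are illustrative.

```python
# RMSE values taken from the Task 9 and Task 13 tests (rounded to two decimals).
rmse_carat_only = 1548.53
rmse_three_features = 1526.04

absolute_drop = rmse_carat_only - rmse_three_features
relative_drop = absolute_drop / rmse_carat_only

print(f"RMSE drop: {absolute_drop:.2f} ({relative_drop:.2%})")
```

Adding `table` and `depth` lowers the training RMSE by only about one and a half percent, which is consistent with the small change in R-squared between the two summaries (0.849 to 0.854).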