### Codio Activity 8.5: Comparing Complexity and Variance

**Expected Time: 60 Minutes**

**Total Points: 35**

In this activity, you will explore the effect of model complexity on the variance in predictions.  Continuing with the automotive data, you will build models on a subset of 10 vehicles.  You will compare the model error when used on the entire dataset, and investigate how variance changes with model complexity.

#### Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [5]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [6]:
auto = pd.read_csv('data/auto.csv')

In [7]:
auto.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


### The Sample

Below, a sample of ten vehicles is extracted from the data.  This sample forms our **training** data and is then split into `X_train` and `y_train`.  Use this smaller dataset to build your models, then explore their performance using the entire dataset.

In [8]:
X = auto.loc[:,['horsepower']]
y = auto['mpg']
sample = auto.sample(10, random_state = 22)
X_train = sample.loc[:, ['horsepower']]
y_train = sample['mpg']

In [9]:
X_train

|     | horsepower |
|---|---|
| 280 | 88.0 |
| 57 | 80.0 |
| 46 | 100.0 |
| 223 | 110.0 |
| 303 | 90.0 |
| 73 | 140.0 |
| 98 | 100.0 |
| 250 | 105.0 |
| 254 | 100.0 |
| 337 | 110.0 |


In [10]:
y_train

280    22.3
57     25.0
46     19.0
223    17.5
303    28.4
73     13.0
98     18.0
250    19.2
254    20.5
337    23.5
Name: mpg, dtype: float64

In [11]:
X.shape

(392, 1)

[Back to top](#Index:) 

### Problem 1

#### Iterate on Models

**20 Points**

Complete the code below according to the following instructions:

- Assign the values in the `horsepower` column of `auto` to the variable `X` below.
- Assign the values in the `mpg` column of `auto` to the variable `y` below.

Use a `for` loop to loop over the values from one to ten. For each iteration `i`:

- Use `Pipeline` to create a pipeline object. Inside the pipeline, define a tuple whose first element is the string identifier `'quad_features'` and whose second element is an instance of `PolynomialFeatures` of degree `i` with `include_bias = False`. Define a second tuple whose first element is the string identifier `'quad_model'` and whose second element is an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe`.
- Use the `fit` function on `pipe` to train your model on `X_train` and `y_train`.
- Use the `predict` function to generate predictions for `X_train`. Assign the result to `preds`.
- Assign `preds` to the `model_predictions` entry for degree `i`.
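
To make the role of `PolynomialFeatures` in the pipeline concrete, here is a minimal sketch on a toy array. The values and the names `X_toy`, `X_poly`, and `X_poly_bias` are illustrative assumptions, not part of the activity. It shows which columns the transformer generates for a single input feature, with and without the bias column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-column input (hypothetical values, not taken from auto.csv)
X_toy = np.array([[2.0], [3.0]])

# degree=3 without the bias column yields the columns x, x^2, x^3
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X_toy)
print(X_poly)

# With include_bias=True a leading column of ones is added
X_poly_bias = PolynomialFeatures(degree=3, include_bias=True).fit_transform(X_toy)
print(X_poly_bias.shape)
```

Because `LinearRegression` fits its own intercept by default, leaving the bias column out avoids handing the model a redundant constant feature.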

In [16]:
### GRADED

### YOUR SOLUTION HERE
model_predictions = {f'degree_{i}': None for i in range(1, 11)}

def predictions_for_range_of_degrees(X_train, y_train, X_pred, range_start, range_stop):
    """Fit one polynomial pipeline per degree in [range_start, range_stop) and return its predictions on X_pred."""
    predictions = []
    for i in range(range_start, range_stop):
        # polynomial expansion of degree i followed by ordinary least squares
        pipe = Pipeline([
            ('quad_features', PolynomialFeatures(degree=i, include_bias=False)),
            ('quad_model', LinearRegression())
        ])
        # fit the pipeline on the training data
        pipe.fit(X_train, y_train)
        # predict on whichever data was passed as X_pred
        preds = pipe.predict(X_pred)
        # collect the predictions in degree order
        predictions.append(preds)

    return predictions

# degrees 1 through 10, predicting on the ten training rows
predictions = predictions_for_range_of_degrees(X_train, y_train, X_train, 1, 11)

# the dictionary keys are in degree order, so zip pairs each key with its predictions
for key, value in zip(model_predictions.keys(), predictions):
    model_predictions[key] = value
    
# Answer check
model_predictions['degree_1'][:10]

array([23.60120856, 25.25782873, 21.1162783 , 19.04550308, 23.18705352,
       12.83317743, 21.1162783 , 20.08089069, 21.1162783 , 19.04550308])

[Back to top](#Index:) 

### Problem 2

#### DataFrame of Predictions

**5 Points**

Use the `model_predictions` dictionary to create a DataFrame of the 10 models' predictions.  Assign your solution to `pred_df` below as a DataFrame.

In [13]:
### GRADED

### YOUR SOLUTION HERE
pred_df = pd.DataFrame(model_predictions)

# Answer check
print(type(pred_df))
print(pred_df.head())

<class 'pandas.core.frame.DataFrame'>
    degree_1   degree_2   degree_3   degree_4   degree_5   degree_6  \
0  23.601209  23.730040  23.517217  25.640822  24.918036  25.051988   
1  25.257829  25.669836  26.057265  24.755267  24.864116  24.841059   
2  21.116278  20.981922  20.820752  19.496913  19.845537  19.808741   
3  19.045503  18.839933  19.152249  20.457650  20.746899  20.716968   
4  23.187054  23.258556  22.988407  24.670613  25.141244  25.029322   

    degree_7   degree_8   degree_9  degree_10  
0  25.171962  25.269558  25.350124  25.415930  
1  24.826735  24.807184  24.787025  24.766287  
2  19.770362  19.728813  19.686741  19.645249  
3  20.686838  20.660395  20.636748  20.615915  
4  24.952802  24.899243  24.867973  24.855330  


[Back to top](#Index:) 

### Problem 3

#### DataFrame of Errors

**5 Points**

Now, determine the error for each model and create a DataFrame of these errors.  One way to do this is to use your prediction DataFrame's `.subtract` method to subtract `y` from each column.  Assign the DataFrame of errors to `error_df` below.
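
When `.subtract` is given a Series with `axis=0`, pandas matches rows by index label rather than by position. This matters when predictions for a sample are compared with a target Series that covers the full dataset. The sketch below uses made-up numbers, and the names `y_toy`, `pred_toy`, and `sample_index` are illustrative assumptions rather than part of the activity. It shows how a DataFrame built from plain prediction arrays (which gets a fresh 0, 1, ... index) can be paired with the wrong targets, and one possible way to line the labels up.

```python
import pandas as pd

# Hypothetical targets for five rows, indexed 0..4 like a full dataset
y_toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Pretend rows 3 and 1 were sampled; predictions come back as plain values,
# so a DataFrame built from them gets a fresh 0..1 index
sample_index = [3, 1]
pred_toy = pd.DataFrame({'degree_1': [41.0, 19.0]})

# Label alignment: row 0 is compared with y_toy[0], not with y_toy[3],
# and labels present only in y_toy become NaN rows
misaligned = pred_toy.subtract(y_toy, axis=0)
print(misaligned)

# Restoring the sampled labels pairs each prediction with its own target
pred_toy.index = sample_index
aligned = pred_toy.subtract(y_toy.loc[sample_index], axis=0)
print(aligned)
```

If the predictions were made on a sample, it may be worth confirming that the index of the prediction DataFrame matches the sampled row labels before subtracting.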

In [14]:
### GRADED

### YOUR SOLUTION HERE
error_df = pred_df.subtract(y, axis=0)

# Answer check
print(type(error_df))
print(error_df.head())

<class 'pandas.core.frame.DataFrame'>
    degree_1   degree_2   degree_3  degree_4  degree_5  degree_6  degree_7  \
0   5.601209   5.730040   5.517217  7.640822  6.918036  7.051988  7.171962   
1  10.257829  10.669836  11.057265  9.755267  9.864116  9.841059  9.826735   
2   3.116278   2.981922   2.820752  1.496913  1.845537  1.808741  1.770362   
3   3.045503   2.839933   3.152249  4.457650  4.746899  4.716968  4.686838   
4   6.187054   6.258556   5.988407  7.670613  8.141244  8.029322  7.952802   

   degree_8  degree_9  degree_10  
0  7.269558  7.350124   7.415930  
1  9.807184  9.787025   9.766287  
2  1.728813  1.686741   1.645249  
3  4.660395  4.636748   4.615915  
4  7.899243  7.867973   7.855330  


[Back to top](#Index:) 

### Problem 4

#### Mean and Variance of Model Errors

**5 Points**


Using the DataFrame of errors, examine the mean and variance of each model's error.  Which degree of model has the highest variance?  Assign your response as an integer to `highest_var_degree` below.
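
As a minimal sketch of the computation described above, the example below uses a toy error table. Its values and the names `errors_toy`, `variances`, `top_label`, and `top_degree` are illustrative assumptions, not the activity's data. It computes the per-column variance, finds the column with the largest value, and reads the degree from a label that follows the `degree_<i>` pattern.

```python
import pandas as pd

# Hypothetical errors for two models (not the activity's values)
errors_toy = pd.DataFrame({
    'degree_1': [1.0, 2.0, 3.0, 4.0],
    'degree_2': [0.0, 2.0, 4.0, 6.0],
})

# DataFrame.var uses the sample variance (ddof=1) by default,
# so it equals the square of the std reported by describe()
variances = errors_toy.var()

# idxmax returns the column label with the largest variance
top_label = variances.idxmax()

# The label follows the 'degree_<i>' pattern, so the degree is the suffix
top_degree = int(top_label.split('_')[1])
print(variances)
print(top_label, top_degree)
```

Deriving the integer from the label, rather than typing it in by hand, keeps the answer in sync with the data if the errors are recomputed.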

In [15]:
### GRADED

## Exploration of the data statistics
error_stats = error_df.describe()
# Calculate variance and create a DataFrame
var_df = pd.DataFrame(error_df.var()).T
var_df.index = ['var']

# Concatenate the original DataFrame with the variance DataFrame
error_stats = pd.concat([error_stats, var_df])

print(error_stats)

### YOUR SOLUTION HERE
highest_degree_name = error_df.var().idxmax()
print(highest_degree_name)
# Parse the integer degree from a label such as 'degree_3'
highest_var_degree = int(highest_degree_name.split('_')[1])

# Answer check
print(type(highest_var_degree))
print(highest_var_degree)

        degree_1   degree_2   degree_3   degree_4   degree_5   degree_6  \
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000   
mean    5.040000   5.040000   5.040000   5.040000   5.040000   5.040000   
std     3.319434   3.315471   3.360741   3.315497   3.343652   3.327971   
min    -2.166823  -1.778159  -1.878879  -1.996799  -2.001236  -1.988709   
25%     3.348584   3.196424   3.402249   4.574252   3.772297   3.883876   
50%     5.841050   5.812068   5.733226   5.477282   5.796218   5.762855   
75%     6.883972   6.801080   6.612666   7.104845   6.649912   6.741177   
max    10.257829  10.669836  11.057265   9.755267   9.864116   9.841059   
var    11.018642  10.992348  11.294580  10.992521  11.180011  11.075388   

        degree_7   degree_8   degree_9  degree_10  
count  10.000000  10.000000  10.000000  10.000000  
mean    5.040000   5.040000   5.040000   5.040000  
std     3.324002   3.319023   3.316491   3.315892  
min    -2.000648  -2.000498  -2.000356  -