### Codio Activity 8.5: Comparing Complexity and Variance

**Expected Time: 60 Minutes**

**Total Points: 35**

In this activity, you will explore the effect of model complexity on the variance in predictions.  Continuing with the automotive data, you will build models on a subset of 10 vehicles.  You will compare the model error when used on the entire dataset, and investigate how variance changes with model complexity.

#### Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [15]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [16]:
auto = pd.read_csv('data/auto.csv')

In [17]:
auto.head()

|  | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


### The Sample

Below, a sample of ten vehicles is extracted from the data.  This sample forms our **training** data and is subsequently split into `X_train` and `y_train`.  Use this smaller dataset to build your models, then evaluate their performance on the entire dataset.

In [18]:
X = auto.loc[:,['horsepower']]
y = auto['mpg']
sample = auto.sample(10, random_state = 22)
X_train = sample.loc[:, ['horsepower']]
y_train = sample['mpg']

In [19]:
X_train

|  | horsepower |
|---|---|
| 280 | 88.0 |
| 57 | 80.0 |
| 46 | 100.0 |
| 223 | 110.0 |
| 303 | 90.0 |
| 73 | 140.0 |
| 98 | 100.0 |
| 250 | 105.0 |
| 254 | 100.0 |
| 337 | 110.0 |


In [20]:
y_train

280    22.3
57     25.0
46     19.0
223    17.5
303    28.4
73     13.0
98     18.0
250    19.2
254    20.5
337    23.5
Name: mpg, dtype: float64

In [21]:
X.shape

(392, 1)

[Back to top](#Index:) 

### Problem 1

#### Iterate on Models

**20 Points**

Complete the code below according to the following instructions:

- Assign the values in the `horsepower` column of `auto` to the variable `X` below.
- Assign the values in the `mpg` column of `auto` to the variable `y` below.

Use a `for` loop to loop over the values from one to ten. For each iteration `i`:

- Use `Pipeline` to create a pipeline object. Inside the pipeline object, define a tuple where the first element is the string identifier `quad_features` and the second element is an instance of `PolynomialFeatures` of degree `i` with `include_bias = False`. Inside the pipeline, define another tuple where the first element is the string identifier `quad_model` and the second element is an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe`.
- Use the `fit` function on `pipe` to train your model on `X_train` and `y_train`.
- Use the `predict` function to predict the values for `X` (the entire dataset). Assign the result to `preds`.
- Assign each `preds` array to the corresponding `degree_i` key of `model_predictions`.
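To see what the `PolynomialFeatures` step hands to `LinearRegression`, the following minimal sketch transforms a tiny made-up column (the array `toy_X` and its values are illustrative assumptions, not taken from the automotive data). With `include_bias = False`, the transformer emits only the powers of the input and leaves the intercept to the regression step.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy input: one feature, two rows (not from auto.csv)
toy_X = np.array([[2.0], [3.0]])

# degree=3 without the bias column: the columns are x, x^2, x^3
no_bias = PolynomialFeatures(degree=3, include_bias=False).fit_transform(toy_X)

# The default include_bias=True prepends a column of ones
with_bias = PolynomialFeatures(degree=3).fit_transform(toy_X)

print(no_bias)
print(with_bias)
```

Because `LinearRegression` fits its own intercept by default, dropping the bias column avoids a redundant constant feature.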

In [33]:
### GRADED

### YOUR SOLUTION HERE
model_predictions = {f'degree_{i}': None for i in range(1, 11)}

X = auto[['horsepower']]
y = auto['mpg']

def predictions_for_range_of_degrees(X, X_train, y, y_train, range_start, range_stop):
    predictions = []
    # loop over the degrees range_start, ..., range_stop - 1
    for i in range(range_start, range_stop):
        #create pipeline
        pipe = Pipeline([
            ('quad_features', PolynomialFeatures(degree=i, include_bias=False)),
            ('quad_model', LinearRegression())
        ])
        #fit pipeline on training data
        pipe.fit(X_train, y_train)
        #make predictions on all data
        preds = pipe.predict(X)
        # collect this degree's predictions
        predictions.append(preds)
        
    return predictions

predictions = predictions_for_range_of_degrees(X,X_train, y, y_train, 1, 11)

for key, value in zip(model_predictions.keys(), predictions):
    model_predictions[key] = value
    
# Answer check
model_predictions['degree_1'][:10]

array([14.90395265,  7.65623939, 10.76240222, 10.76240222, 12.83317743,
        0.82268118, -3.7330243 , -2.69763669, -4.7684119 ,  2.47930135])

[Back to top](#Index:) 

### Problem 2

#### DataFrame of Predictions

**5 Points**

Use the `model_predictions` dictionary to create a DataFrame of the ten models' predictions.  Assign your solution to `pred_df` below as a DataFrame.
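A dictionary whose values are equal-length arrays maps directly onto DataFrame columns, with the keys becoming the column names. A minimal sketch with made-up numbers (the dictionary `toy_predictions` is hypothetical and only stands in for `model_predictions`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model_predictions: one array per degree
toy_predictions = {
    'degree_1': np.array([1.0, 2.0, 3.0]),
    'degree_2': np.array([1.5, 2.5, 3.5]),
}

# Keys become column names; each array becomes a column
toy_pred_df = pd.DataFrame(toy_predictions)

print(toy_pred_df.shape)
print(list(toy_pred_df.columns))
```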

In [36]:
### GRADED

### YOUR SOLUTION HERE
pred_df = pd.DataFrame(model_predictions)

# Answer check
print(type(pred_df))
print(pred_df.head())

<class 'pandas.core.frame.DataFrame'>
    degree_1   degree_2   degree_3    degree_4     degree_5     degree_6  \
0  14.903953  14.959892  15.704485   32.550328    97.807527   101.886397   
1   7.656239   9.465786   0.931088 -372.035448 -3456.141665 -4370.275875   
2  10.762402  11.618435   9.428697  -61.767623  -516.945175  -606.298593   
3  10.762402  11.618435   9.428697  -61.767623  -516.945175  -606.298593   
4  12.833177  13.221841  13.121121   13.003201    12.998835    13.007347   

      degree_7     degree_8     degree_9    degree_10  
0   103.934543   103.117944    98.288488    87.834730  
1 -5342.443862 -6208.274949 -6618.861218 -5878.338979  
2  -688.570562  -746.836711  -752.164365  -655.409764  
3  -688.570562  -746.836711  -752.164365  -655.409764  
4    12.999361    12.999488    12.999649    12.999760  


[Back to top](#Index:) 

### Problem 3

#### DataFrame of Errors

**5 Points**

Now, determine the error for each model and create a DataFrame of these errors.  One way to do this is to use your prediction DataFrame's `.subtract` method to subtract `y` from each column.  Assign the DataFrame of errors to `error_df` below.
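The `axis` argument matters when subtracting a Series from a DataFrame. In the minimal sketch below, built on made-up numbers (the names `toy_pred_df` and `toy_y` are hypothetical), `axis=0` matches the Series against the row index and subtracts it from every column. The default, `axis='columns'`, would instead try to match the Series index against the column labels.

```python
import pandas as pd

# Hypothetical predictions from two models for three observations
toy_pred_df = pd.DataFrame({
    'degree_1': [10.0, 20.0, 30.0],
    'degree_2': [11.0, 19.0, 33.0],
})

# Hypothetical true values, sharing the same row index 0, 1, 2
toy_y = pd.Series([9.0, 21.0, 30.0])

# axis=0 aligns toy_y with the rows, so each column becomes prediction - truth
toy_error_df = toy_pred_df.subtract(toy_y, axis=0)

print(toy_error_df)
```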

In [40]:
### GRADED

### YOUR SOLUTION HERE
error_df = pred_df.subtract(y, axis=0)

# Answer check
print(type(error_df))
print(error_df.head())

<class 'pandas.core.frame.DataFrame'>
   degree_1  degree_2   degree_3    degree_4     degree_5     degree_6  \
0 -3.096047 -3.040108  -2.295515   14.550328    79.807527    83.886397   
1 -7.343761 -5.534214 -14.068912 -387.035448 -3471.141665 -4385.275875   
2 -7.237598 -6.381565  -8.571303  -79.767623  -534.945175  -624.298593   
3 -5.237598 -4.381565  -6.571303  -77.767623  -532.945175  -622.298593   
4 -4.166823 -3.778159  -3.878879   -3.996799    -4.001165    -3.992653   

      degree_7     degree_8     degree_9    degree_10  
0    85.934543    85.117944    80.288488    69.834730  
1 -5357.443862 -6223.274949 -6633.861218 -5893.338979  
2  -706.570562  -764.836711  -770.164365  -673.409764  
3  -704.570562  -762.836711  -768.164365  -671.409764  
4    -4.000639    -4.000512    -4.000351    -4.000240  


[Back to top](#Index:) 

### Problem 4

#### Mean and Variance of Model Errors

**5 Points**


Using the DataFrame of errors, examine the mean and variance of each model's error.  What degree model has the highest variance?  Assign your response as an integer to `highest_var_degree` below.
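`DataFrame.var()` returns one sample variance per column, and `idxmax()` returns the label of the largest value. The minimal sketch below uses made-up errors (`toy_error_df` is hypothetical) and shows one possible way to turn a `degree_k` label into an integer:

```python
import pandas as pd

# Hypothetical errors for three models over three observations
toy_error_df = pd.DataFrame({
    'degree_1': [-1.0, 0.0, 1.0],
    'degree_2': [-10.0, 0.0, 10.0],
    'degree_3': [-2.0, 0.0, 2.0],
})

# Sample variance (ddof=1) of each column
toy_var = toy_error_df.var()

# Label of the column with the largest variance
top_label = toy_var.idxmax()

# Parse the integer after the underscore in 'degree_k'
top_degree = int(top_label.split('_')[1])

print(toy_var)
print(top_label, top_degree)
```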

In [52]:
### GRADED

## Exploration of the data statistics
error_stats = error_df.describe()
# Calculate variance and create a DataFrame
var_df = pd.DataFrame(error_df.var()).T
var_df.index = ['var']

# Concatenate the original DataFrame with the variance DataFrame
error_stats = pd.concat([error_stats, var_df])

print(error_stats)

### YOUR SOLUTION HERE
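# Column label with the largest error variance, of the form 'degree_k'
highest_degree_name = error_df.var().idxmax()
print(highest_degree_name)
# Parse the integer degree from the label rather than hard-coding it
highest_var_degree = int(highest_degree_name.split('_')[1])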

# Answer check
print(type(highest_var_degree))
print(highest_var_degree)

         degree_1    degree_2    degree_3      degree_4      degree_5  \
count  392.000000  392.000000  392.000000  3.920000e+02  3.920000e+02   
mean    -3.255150   -2.443086   -5.171435 -2.841940e+02 -3.690068e+03   
std      5.253192    4.615428   16.741573  1.158805e+03  1.844016e+04   
min    -21.803800  -18.098490 -124.283091 -9.923586e+03 -1.692258e+05   
25%     -6.347131   -5.170967   -5.571303 -7.040904e+01 -5.403283e+00   
50%     -3.088512   -2.485808   -2.182942 -6.443344e+00  9.109432e-01   
75%      0.416447    0.566694    1.727216  3.516273e-01  3.596177e+01   
max     11.914449   12.695805   22.645621  1.976396e+01  2.180582e+03   
var     27.596028   21.302180  280.280251  1.342830e+06  3.400395e+08   

           degree_6      degree_7      degree_8      degree_9     degree_10  
count  3.920000e+02  3.920000e+02  3.920000e+02  3.920000e+02  3.920000e+02  
mean  -5.859242e+03 -8.936695e+03 -1.287733e+04 -1.666994e+04 -1.641502e+04  
std    3.005527e+04  4.738567e+04  