# Bias-Variance Tradeoff
As model complexity increases, the bias term decreases and the variance term increases. To get an unbiased estimate of how well a model generalizes, we need to perform <b>cross-validation</b>. The training set error alone is too optimistic, because a fitted model has already seen the noise present in the data it was trained on. <br>
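
For reference, the bias and variance terms come from the standard decomposition of the expected squared error at a point $x$. The notation is an assumption added here, not taken from the original: $f$ is the true function, $\hat f$ is the fitted model (random through its training set), and $y = f(x) + \varepsilon$ with noise variance $\sigma^2$.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$
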
The model is <b>overfit</b> (high variance) if the CV error is substantially larger than the training set error. In that case:
- decrease model complexity (decrease max_depth, increase min samples per leaf)
- gather more data

The model is <b>underfit</b> (high bias) if the CV error is roughly equal to the training set error and both are larger than the desired error. In that case:
- increase model complexity (increase max_depth, decrease min samples per leaf)
- gather more relevant features
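
The two rules above can be seen by sweeping tree depth and comparing training error with CV error. This is a minimal sketch on synthetic data (a noisy sine curve, chosen here purely for illustration and not part of the original notebook); the depths tried are likewise arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # cross_val_score returns negated MSE for this scorer, so flip the sign
    cv_mse = -cross_val_score(tree, X, y, cv=10,
                              scoring='neg_mean_squared_error').mean()
    # Training error: fit on all the data and predict the same data
    train_mse = mean_squared_error(y, tree.fit(X, y).predict(X))
    results[depth] = (train_mse, cv_mse)
    print(f'max_depth={depth}: training MSE={train_mse:.3f}, CV MSE={cv_mse:.3f}')
```

A very shallow tree should show training and CV errors that are close to each other and both high (underfit), while the unrestricted tree should drive the training error to zero and leave a clear gap to the CV error (overfit).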

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Lasso, Ridge, LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import mean_squared_error as MSE

sns.set()

In [2]:
auto_mpg = pd.read_csv('datasets/auto_mpg.csv')
auto_mpg.info()
auto_mpg.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     392 non-null    float64
 1   displ   392 non-null    float64
 2   hp      392 non-null    int64  
 3   weight  392 non-null    int64  
 4   accel   392 non-null    float64
 5   origin  392 non-null    object 
 6   size    392 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 21.6+ KB


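|   | mpg | displ | hp | weight | accel | origin | size |
|---|-----|-------|----|--------|-------|--------|------|
| 0 | 18.0 | 250.0 | 88 | 3139 | 14.5 | US | 15.0 |
| 1 | 9.0 | 304.0 | 193 | 4732 | 18.5 | US | 20.0 |
| 2 | 36.1 | 91.0 | 60 | 1800 | 16.4 | Asia | 10.0 |
| 3 | 18.5 | 250.0 | 98 | 3525 | 19.0 | US | 15.0 |
| 4 | 34.3 | 97.0 | 78 | 2188 | 15.8 | Europe | 10.0 |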


In [3]:
auto_mpg.origin = auto_mpg.origin.astype('category')
auto_mpg = pd.get_dummies(auto_mpg, drop_first=True)
auto_mpg.head(3)

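|   | mpg | displ | hp | weight | accel | size | origin_Europe | origin_US |
|---|-----|-------|----|--------|-------|------|---------------|-----------|
| 0 | 18.0 | 250.0 | 88 | 3139 | 14.5 | 15.0 | 0 | 1 |
| 1 | 9.0 | 304.0 | 193 | 4732 | 18.5 | 20.0 | 0 | 1 |
| 2 | 36.1 | 91.0 | 60 | 1800 | 16.4 | 10.0 | 0 | 0 |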


In [7]:
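# Features and target (the target here is the 'size' column)
X, y = auto_mpg.drop('size', axis=1), auto_mpg['size']
# Hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)

# Shallow tree; each leaf must hold at least 15% of the training samples
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=.15)
# 10-fold CV on the training set; the scorer returns negated MSE, so flip the sign
MSE_cv = - cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)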
MSE_cv

array([0.37542183, 0.21961097, 0.76042649, 0.75860843, 0.03275246,
       0.78937535, 0.03044264, 0.022449  , 0.23604757, 0.22881466])

In [8]:
dt.fit(X_train, y_train)
y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)

# CV MSE
print(f'CV MSE: {MSE_cv.mean()}')

# Training set MSE
print(f'Training set MSE: {MSE(y_train, y_pred_train)}')

# Test set MSE
print(f'Test set MSE: {MSE(y_test, y_pred_test)}')

CV MSE: 0.3453949388758532
Training set MSE: 0.3367442799400928
Test set MSE: 1.0440452779234408
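
As a rough illustration, the overfit/underfit rule of thumb from the top of this notebook can be applied to the numbers printed above. The helper `diagnose`, its `rel_gap` tolerance, and both `desired_mse` values are hypothetical, since the notebook does not state a desired error. The rule only looks at the CV and training errors, so the test set MSE is not an input.

```python
def diagnose(cv_mse, train_mse, desired_mse, rel_gap=0.10):
    """Apply the overfit/underfit rule of thumb to three error values.

    rel_gap is a hypothetical tolerance: the CV error counts as 'bigger'
    than the training error only if it exceeds it by more than this fraction.
    """
    if cv_mse > train_mse * (1 + rel_gap):
        return 'overfit'
    if cv_mse > desired_mse and train_mse > desired_mse:
        return 'underfit'
    return 'ok'

# CV and training MSE rounded from the output above;
# the desired_mse values are placeholders, not from the notebook
lenient = diagnose(cv_mse=0.3454, train_mse=0.3367, desired_mse=0.5)
strict = diagnose(cv_mse=0.3454, train_mse=0.3367, desired_mse=0.2)
print(lenient, strict)
```

Because the CV and training errors are within a few percent of each other, the verdict depends entirely on how strict the desired error is.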
