# Chapter 2

Model complexity affects two components of a model's prediction error: bias and variance.

A very simple model is likely to show high bias and low variance. Its predicted values do not change a lot: for a linear regression, for example, the predictions lie on a straight line, so each point changes very little compared to its adjacent points. However, this model is likely to be biased: its predictions will tend to be a long way from most of the actual values.

We say this model underfits the training data.

A complex model will follow the training data very closely, so each prediction will be close to the training data, but each prediction can be quite different from adjacent predictions. The bias will be small, but the variance high.

We say this model overfits the training data.

Neither model is likely to give good predictions on unseen data.
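As a rough illustration of the two extremes, the sketch below fits a very shallow and an unrestricted regression tree to synthetic data. The noisy sine curve, the sample size and the names `mse`, `tree` and `depth` are assumptions made for this example only; they are not part of the notebook's datasets.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a noisy sine curve
rng = np.random.RandomState(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

mse = {}
# max_depth=1 is a single split (very simple); max_depth=None grows the tree fully
for name, depth in [("simple", 1), ("complex", None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, tree.predict(X_train))
    mse_test = mean_squared_error(y_test, tree.predict(X_test))
    mse[name] = (mse_train, mse_test)
    print(f"{name:8s} train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```

The fully grown tree reproduces the training targets exactly, so its training error says nothing about how it behaves on the held-out points, while the single-split tree is too coarse to follow the curve on either set.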

For regression, you can compare the mean squared error (MSE) of the training predictions with that of the test predictions. If the training MSE is much smaller than the test MSE, the model overfits the training data. You should reduce the complexity and train again.
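The comparison described above can be sketched as a loop that retrains a regression tree with progressively smaller `max_depth` and reports both errors. The synthetic quadratic data, the list of depths and the helper name `train_test_mse` are hypothetical choices for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a noisy quadratic
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def train_test_mse(max_depth):
    """Fit a regression tree of the given depth; return (train MSE, test MSE)."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    return (mean_squared_error(y_train, tree.predict(X_train)),
            mean_squared_error(y_test, tree.predict(X_test)))

# Reduce the complexity step by step and watch the two errors
history = []
for depth in (12, 8, 4, 2):
    mse_train, mse_test = train_test_mse(depth)
    history.append((depth, mse_train, mse_test))
    print(f"max_depth={depth:2d}  train MSE={mse_train:.2f}  test MSE={mse_test:.2f}")
```

A large gap between the two columns at high depth is the overfitting signal; how small the gap should be before you stop reducing complexity is a judgment call that depends on the problem.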

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO

In [9]:
from sklearn.tree import DecisionTreeRegressor

# pandas can read a CSV file directly from a URL
Xy = pd.read_csv('https://assets.datacamp.com/production/repositories/1796/datasets/3781d588cf7b04b1e376c7e9dda489b3e6c7465b/auto.csv')

In [3]:
Xy.head()

    mpg  displ   hp  weight  accel  origin  size
0  18.0  250.0   88    3139   14.5      US  15.0
1   9.0  304.0  193    4732   18.5      US  20.0
2  36.1   91.0   60    1800   16.4    Asia  10.0
3  18.5  250.0   98    3525   19.0      US  15.0
4  34.3   97.0   78    2188   15.8  Europe  10.0


In [62]:
Xy = Xy.drop(['origin'], axis=1)

In [63]:
# Use mpg (the first column) as the target; the remaining columns are the features
X = Xy.iloc[:, 1:]
y = Xy.iloc[:,0]

In [64]:
X.head(), y.head()

(   displ   hp  weight  accel  size
 0  250.0   88    3139   14.5  15.0
 1  304.0  193    4732   18.5  20.0
 2   91.0   60    1800   16.4  10.0
 3  250.0   98    3525   19.0  15.0
 4   97.0   78    2188   15.8  10.0,
 0    18.0
 1     9.0
 2    36.1
 3    18.5
 4    34.3
 Name: mpg, dtype: float64)

In [70]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

In [71]:
from sklearn.model_selection import cross_val_score

In [81]:
# Compute the array of 10-fold CV MSEs (negated, because the scorer returns negative MSE)
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10,
                        scoring='neg_mean_squared_error',
                        n_jobs=-1)

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(0.5)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14
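The leading minus sign in the cell above is needed because scikit-learn scorers follow a "higher is better" convention, so `'neg_mean_squared_error'` returns the MSE with its sign flipped. The sketch below shows this on synthetic data; `X_demo`, `y_demo` and the linear relationship used to generate them are assumptions for this example, not the auto dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed for illustration only)
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=2.0, size=200)

tree = DecisionTreeRegressor(max_depth=4, random_state=1)

# One score per fold, each of them zero or negative
neg_mse = cross_val_score(tree, X_demo, y_demo, cv=10,
                          scoring='neg_mean_squared_error')
mse_per_fold = -neg_mse

# Square root of the mean MSE (the summary used in the cell above)
rmse_of_mean_mse = np.sqrt(mse_per_fold.mean())
# Mean of the per-fold RMSEs (a different, never larger, summary)
mean_of_fold_rmse = np.sqrt(mse_per_fold).mean()

print('Per-fold MSE:', np.round(mse_per_fold, 2))
print('RMSE of mean MSE: {:.2f}'.format(rmse_of_mean_mse))
print('Mean of fold RMSEs: {:.2f}'.format(mean_of_fold_rmse))
```

The two summaries are usually close but not identical, so it helps to state which one a reported CV RMSE refers to.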


In [84]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the target values for the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(0.5)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


In [88]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

In [89]:
# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

In [91]:
# This file was downloaded as a zip archive from DataCamp
# Use the preprocessed data: the features are already standardised (mean subtracted, divided by standard deviation)
Xy = pd.read_csv('./indian_liver_patient_preprocessed.csv', index_col=0)
Xy.head()

Unnamed: 0,Age_std,Total_Bilirubin_std,Direct_Bilirubin_std,Alkaline_Phosphotase_std,Alamine_Aminotransferase_std,Aspartate_Aminotransferase_std,Total_Protiens_std,Albumin_std,Albumin_and_Globulin_Ratio_std,Is_male_std,Liver_disease
0,1.247403,-0.42032,-0.495414,-0.42887,-0.355832,-0.319111,0.293722,0.203446,-0.14739,0,1
1,1.062306,1.218936,1.423518,1.675083,-0.093573,-0.035962,0.939655,0.077462,-0.648461,1,1
2,1.062306,0.640375,0.926017,0.816243,-0.115428,-0.146459,0.478274,0.203446,-0.178707,1,1
3,0.815511,-0.372106,-0.388807,-0.449416,-0.36676,-0.312205,0.293722,0.329431,0.16578,1,1
4,1.679294,0.093956,0.179766,-0.395996,-0.295731,-0.177537,0.755102,-0.930414,-1.713237,1,1


In [101]:
from sklearn.model_selection import train_test_split
target_col = 'Liver_disease'
X = Xy.drop([target_col], axis=1)
y = Xy[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=SEED)

In [102]:
from sklearn.metrics import accuracy_score

In [103]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Calculate the accuracy on the test set
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's test set accuracy
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.701
Classification Tree : 0.730


In [104]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Predict the labels of the test set
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.770
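`VotingClassifier` uses `voting='hard'` by default, which means each fitted classifier casts one vote per sample and the majority class wins. A minimal sketch of that rule for binary labels follows; the `preds` array holds made-up predictions, not output from the liver dataset.

```python
import numpy as np

# Hypothetical 0/1 predictions from three classifiers on five samples
preds = np.array([
    [1, 0, 1, 1, 0],   # e.g. logistic regression
    [1, 1, 0, 1, 0],   # e.g. k nearest neighbours
    [0, 0, 1, 1, 1],   # e.g. classification tree
])

# Count the votes for class 1 per sample; class 1 wins with more than half
votes_for_1 = preds.sum(axis=0)
majority = (votes_for_1 > preds.shape[0] / 2).astype(int)

print(majority)
```

With an odd number of binary classifiers there can be no ties, which is one reason three base models are a convenient choice here. The ensemble can do better than each member when the members make their mistakes on different samples.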
