In [1]:
#Complexity, bias and variance

#In the video, you saw how the complexity of a model labeled f^ influences the bias and variance terms of its
#generalization error.

#Which of the following correctly describes the relationship between f^'s complexity and f^'s bias and variance terms?

#Possible Answers

#As the complexity of f^ decreases, the bias term decreases while the variance term increases.

#As the complexity of f^ decreases, both the bias and the variance terms increase.

#As the complexity of f^ increases, the bias term increases while the variance term decreases.

#As the complexity of f^ increases, the bias term decreases while the variance term increases.*

In [2]:
#NOTE: You're now able to relate model complexity to bias and variance!
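
#A minimal sketch of the relationship above, assuming a synthetic noisy sine dataset (not the course data). The names
#X_demo, y_demo and results are hypothetical and used for illustration only. Trees of increasing max_depth stand in for
#models of increasing complexity: the training error shrinks as the tree grows, while the gap to the test error widens.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption, for illustration only)
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Fit trees of increasing complexity and record (train RMSE, test RMSE)
results = {}
for depth in [1, 3, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    rmse_tr = mean_squared_error(y_tr, tree.predict(X_tr)) ** 0.5
    rmse_te = mean_squared_error(y_te, tree.predict(X_te)) ** 0.5
    results[depth] = (rmse_tr, rmse_te)
    print('max_depth={}: train RMSE {:.2f}, test RMSE {:.2f}'.format(depth, rmse_tr, rmse_te))
```

#The unconstrained tree memorizes the training points, so its training error says little about how well it generalizes.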

In [3]:
#Overfitting and underfitting

#In this exercise, you'll visually diagnose whether a model is overfitting or underfitting the training set.

#For this purpose, we have trained two different models A and B on the auto dataset to predict the mpg consumption of a car
#using only the car's displacement (displ) as a feature.

#The following figure shows you scatterplots of mpg versus displ along with lines corresponding to the training set
#predictions of models A and B in red.

<img src='diagnose-problems.jpg'>

In [4]:
#Which of the following statements is true?

#Possible Answers

#A suffers from high bias and overfits the training set.

#A suffers from high variance and underfits the training set.

#B suffers from high bias and underfits the training set.*

#B suffers from high variance and underfits the training set.

In [5]:
#NOTE: Model B is not able to capture the nonlinear dependence of mpg on displ.

In [6]:
#Instantiate the model

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
df_mpg = pd.read_csv('datasets/auto-mpg.csv')
df_mpg = pd.get_dummies(df_mpg)
X = df_mpg.drop('mpg', axis=1)
y = df_mpg['mpg']

#In the following set of exercises, you'll diagnose the bias and variance problems of a regression tree. The regression
#tree you'll define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all
#available features.

#We have already processed the data and loaded the features matrix X and the array y in your workspace. In addition, the
#DecisionTreeRegressor class was imported from sklearn.tree.

# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

In [7]:
#NOTE: In the next exercise, you'll evaluate dt's CV error.
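
#A short sketch of how strongly min_samples_leaf=0.26 constrains the tree, assuming synthetic data (X_demo, y_demo and
#dt_demo are hypothetical names). A float value is read as a fraction of the training samples, so every leaf must hold at
#least 26% of them. Four such leaves would need more than 100% of the data, so the tree can never have more than three
#leaves, whatever max_depth allows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a nonlinear target (an assumption, for illustration only)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Same hyperparameters as dt in the exercise
dt_demo = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=1)
dt_demo.fit(X_demo, y_demo)

# Inspect how large the fitted tree actually is
print('Number of leaves:', dt_demo.get_n_leaves())
print('Depth:', dt_demo.get_depth())
```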

In [8]:
#Evaluate the 10-fold CV error

from sklearn.model_selection import cross_val_score

#In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error (RMSE) achieved by the regression tree dt that
#you instantiated in the previous exercise.

#In addition to dt, the training data including X_train and y_train are available in your workspace. We also imported
#cross_val_score from sklearn.model_selection.

#Note that cross_val_score follows the convention that higher scores are better, so with
#scoring='neg_mean_squared_error' it returns negative MSEs. Its output should be multiplied by negative one to obtain
#the MSEs. The CV RMSE can then be obtained by computing the square root of the average MSE.

# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10,
                                  scoring='neg_mean_squared_error',
                                  n_jobs=-1)

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14


In [9]:
#NOTE: A very good practice is to keep the test set untouched until you are confident about your model's performance. CV is
#a great technique to get an estimate of a model's performance without affecting the test set.
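
#As a hedged aside, scikit-learn (version 0.22 and later) also accepts scoring='neg_root_mean_squared_error', which returns
#one negative RMSE per fold. The sketch below assumes synthetic data from make_regression (X_demo, y_demo and dt_demo are
#hypothetical names) and compares the two ways of summarizing the folds. Averaging the per-fold RMSEs is not the same as
#taking the square root of the average MSE; the former can never be larger than the latter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
dt_demo = DecisionTreeRegressor(max_depth=4, random_state=1)

# Approach used in the exercise: square root of the mean of the fold MSEs
mse_folds = -cross_val_score(dt_demo, X_demo, y_demo, cv=10,
                             scoring='neg_mean_squared_error')
rmse_from_mean_mse = np.sqrt(mse_folds.mean())

# Alternative: mean of the fold RMSEs
rmse_folds = -cross_val_score(dt_demo, X_demo, y_demo, cv=10,
                              scoring='neg_root_mean_squared_error')
mean_of_fold_rmses = rmse_folds.mean()

print('sqrt(mean MSE): {:.2f}'.format(rmse_from_mean_mse))
print('mean(fold RMSE): {:.2f}'.format(mean_of_fold_rmses))
```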

In [10]:
#Evaluate the training error

#You'll now evaluate the training set RMSE achieved by the regression tree dt that you instantiated in a previous exercise.

#In addition to dt, X_train and y_train are available in your workspace.

#Note that in scikit-learn, the MSE of a model can be computed as follows:

#MSE_model = mean_squared_error(y_true, y_predicted)

#where we use the function mean_squared_error from the metrics module and pass it the true labels y_true as a first
#argument, and the predicted labels from the model y_predicted as a second argument.

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train)) ** (1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


In [11]:
#NOTE: Notice how the training error is roughly equal to the 10-fold CV error you obtained in the previous exercise.

In [12]:
#High bias or high variance?

baseline_RMSE = 5.1

#In this exercise you'll diagnose whether the regression tree dt you trained in the previous exercise suffers from a bias
#or a variance problem.

#The training set RMSE (RMSE_train) and the CV RMSE (RMSE_CV) achieved by dt are available in your workspace. In addition,
#we have also loaded a variable called baseline_RMSE which corresponds to the root mean-squared error achieved by the
#regression tree trained with the displ feature only (it is the RMSE achieved by the regression tree trained in chapter 1,
#lesson 3). Here baseline_RMSE serves as the baseline RMSE above which a model is considered to be underfitting and below
#which the model is considered 'good enough'.

#Does dt suffer from a high bias or a high variance problem?

#Possible Answers

#dt suffers from high variance because RMSE_CV is far less than RMSE_train.

#dt suffers from high bias because RMSE_CV ≈ RMSE_train and both scores are greater than baseline_RMSE.*

#dt is a good fit because RMSE_CV ≈ RMSE_train and both scores are smaller than baseline_RMSE.

In [13]:
#NOTE: dt is indeed underfitting the training set as the model is too constrained to capture the nonlinear dependencies
#between features and labels.
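
#The reasoning in this exercise can be sketched as a small helper. The function diagnose_fit and its rel_tol threshold are
#hypothetical: the exercise only says "roughly equal" and "far", so the 10% tolerance is an assumption, not a rule from
#the course. A CV error well above the training error is read as high variance; otherwise, both errors above the
#baseline is read as high bias and both at or below it as a good fit.

```python
def diagnose_fit(rmse_train, rmse_cv, baseline_rmse, rel_tol=0.1):
    """Rough bias/variance diagnosis (hypothetical helper, assumed 10% tolerance)."""
    # CV error clearly larger than training error: the model does not generalize
    if rmse_cv - rmse_train > rel_tol * rmse_train:
        return 'high variance'
    # Errors are comparable: compare them with the baseline
    if rmse_train > baseline_rmse and rmse_cv > baseline_rmse:
        return 'high bias'
    if rmse_train <= baseline_rmse and rmse_cv <= baseline_rmse:
        return 'good fit'
    # One error above and one below the baseline: no clear verdict
    return 'borderline'


# Values reported in the previous exercises
print(diagnose_fit(rmse_train=5.15, rmse_cv=5.14, baseline_rmse=5.1))
```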

In [14]:
#Define the ensemble

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

#In the following set of exercises, you'll work with the Indian Liver Patient Dataset from the UCI Machine learning
#repository.

#In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all
#the features present in the dataset.

#The classes LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier under the alias KNN are available in your
#workspace.

# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

In [15]:
#NOTE: In the next exercise, you will train these classifiers and evaluate their test set accuracy.

In [16]:
#Evaluate individual classifiers

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_liver = pd.read_csv('datasets/indian_liver_patient/indian_liver_patient.csv')
df_liver.dropna(inplace=True)
df_liver = pd.get_dummies(df_liver, drop_first=True)
df_liver['Dataset'] = df_liver['Dataset'].where(df_liver['Dataset'] != 2, 0)
scaler = StandardScaler()
df_liver_preprocessed = df_liver.copy()
df_liver_preprocessed[df_liver.drop(['Dataset', 'Gender_Male'], axis=1).columns] = \
    scaler.fit_transform(df_liver.drop(['Dataset', 'Gender_Male'], axis=1))
df_liver_preprocessed = df_liver_preprocessed[['Age','Total_Bilirubin','Direct_Bilirubin','Alkaline_Phosphotase',
                                               'Alamine_Aminotransferase','Aspartate_Aminotransferase','Total_Protiens',
                                               'Albumin','Albumin_and_Globulin_Ratio','Gender_Male','Dataset']]
col_names = ['Age_std','Total_Bilirubin_std','Direct_Bilirubin_std','Alkaline_Phosphotase_std',
             'Alamine_Aminotransferase_std','Aspartate_Aminotransferase_std','Total_Protiens_std','Albumin_std',
             'Albumin_and_Globulin_Ratio_std','Is_male_std','Liver_disease']
# set_axis no longer accepts inplace in pandas 2.0+, so reassign the result
df_liver_preprocessed = df_liver_preprocessed.set_axis(col_names, axis='columns')
X = df_liver_preprocessed.drop('Liver_disease', axis=1)
y = df_liver_preprocessed['Liver_disease']
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

#In this exercise you'll evaluate the performance of the models in the list classifiers that we defined in the previous
#exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

#The dataset is already loaded and preprocessed for you (numerical features are standardized) and it is split into 70%
#train and 30% test. The features matrices X_train and X_test, as well as the arrays of labels y_train and y_test are
#available in your workspace. In addition, we have loaded the list classifiers from the previous exercise, as well as the
#function accuracy_score() from sklearn.metrics.

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.701
Classification Tree : 0.730


In [17]:
#NOTE: Notice how Logistic Regression achieved the highest accuracy of 75.9%.

In [18]:
#Better performance with a Voting Classifier

#Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list
#classifiers and assigns labels by majority voting.

#X_train, X_test,y_train, y_test, the list classifiers defined in a previous exercise, as well as the function
#accuracy_score from sklearn.metrics are available in your workspace.

# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.770


In [19]:
#NOTE: Notice how the voting classifier achieves a test set accuracy of 77.0%. This value is greater than that achieved by
#LogisticRegression.
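
#To make majority voting concrete, here is a minimal sketch assuming a synthetic binary dataset from make_classification
#(X_demo, y_demo, vc_demo and manual_majority are hypothetical names). With the default voting='hard', each fitted
#estimator casts one vote per sample and the most frequent label wins. With three voters and labels 0/1, that amounts to
#checking whether at least two estimators predicted 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Same three model types as in the exercise; voting='hard' is the default
vc_demo = VotingClassifier(estimators=[
    ('lr', LogisticRegression(random_state=1)),
    ('knn', KNeighborsClassifier(n_neighbors=27)),
    ('dt', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=1)),
])
vc_demo.fit(X_tr, y_tr)

# Collect one row of votes per fitted estimator, then take the majority by hand
votes = np.array([est.predict(X_te) for est in vc_demo.estimators_])
manual_majority = (votes.sum(axis=0) >= 2).astype(int)

print('Manual vote matches vc_demo.predict:',
      np.array_equal(manual_majority, vc_demo.predict(X_te)))
```

#An odd number of voters avoids ties in the binary case, which is one reason three classifiers are a convenient choice.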