# The Bias-Variance Tradeoff

## Generalization Error

The goal of supervised learning is to find a model fhat that best approximates the true function f. fhat can be a logistic regression, a decision tree, a neural network, and so on. When training fhat, the noise in the data should be discarded as much as possible. In the end, fhat should achieve a low predictive error on unseen datasets.

There are two difficulties when approximating f:
Overfitting: fhat fits the noise in the training set.
Underfitting: fhat is not flexible enough to approximate f.

With overfitting, the model's predictive power on unseen datasets is low: the training set error is low but the test set error is high.

With underfitting, the training set error is roughly equal to the test set error, and both are relatively high.

The generalization error of a model tells you how well it generalizes to unseen data. It decomposes into three terms: generalization error = bias^2 + variance + irreducible error.

Irreducible error: the error contribution of noise. No model can remove it.

Bias: tells you how much fhat and f differ on average. High-bias models lead to underfitting.

Variance: tells you how inconsistent fhat is across different training sets.
The fhat of a high-variance model follows the training data points too closely, which leads to overfitting.

The complexity of a model sets its flexibility to approximate the true function f. For example, increasing the maximum tree depth increases the complexity of a decision tree.

As model complexity increases, the bias decreases and the variance increases. The goal is to find the model complexity that achieves the lowest generalization error, that is, a balance between bias and variance. This is the bias-variance trade-off.
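The trade-off can be seen empirically. The sketch below is only an illustration under stated assumptions: the true function (a sine wave), the noise level, and the use of polynomial degree as the complexity knob are all choices made here, not part of the original material. It refits a polynomial on many noisy training sets and estimates bias^2 and variance for each degree.

```python
import numpy as np

def f(x):
    """True function (an assumption chosen for this illustration)."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_sets=200, n_points=30, noise_sd=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)       # fixed inputs; only the noise changes
    preds = np.empty((n_sets, n_points))
    for i in range(n_sets):
        y = f(x) + rng.normal(0.0, noise_sd, size=n_points)  # a fresh noisy training set
        coefs = np.polyfit(x, y, degree)                      # fit fhat
        preds[i] = np.polyval(coefs, x)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f(x)) ** 2)  # distance of the average fhat from f
    variance = np.mean(preds.var(axis=0))       # spread of fhat across training sets
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print("degree {:d}: bias^2 = {:.4f}, variance = {:.4f}".format(d, b, v))
```

In this setup, the estimated bias^2 shrinks as the degree grows while the estimated variance grows, which is the pattern described above.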

## Diagnose bias and variance problems

Given a dataset split into a training set and a test set, fhat can be fit to the training set and its error evaluated on the test set. The generalization error of fhat is roughly approximated by fhat's error on the test set. To obtain a reliable estimate of fhat's performance without touching the test set, use cross-validation (CV).
Once fhat's cross-validation error is computed, compare it with fhat's training set error. If the CV error is greater than the training error, fhat is said to suffer from high variance: it has overfit the training set. To remedy this, try decreasing fhat's complexity.

On the other hand, fhat is said to suffer from high bias if its cross-validation error is roughly equal to the training error but much greater than the desired error. In such a case, fhat underfits the training set. To remedy this, try increasing the model's complexity or gathering more relevant features for the problem.
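The two rules above can be written as a small helper. This is a minimal sketch, not a standard API: the function name `diagnose`, the relative tolerance `tol` used to decide what "roughly equal" and "much greater" mean, and the example numbers are all hypothetical.

```python
def diagnose(train_error, cv_error, desired_error, tol=0.10):
    """Classify a model from its training error and cross-validation error.

    `tol` is a hypothetical relative tolerance: the CV error counts as
    "roughly equal" to the training error when it lies within
    tol * train_error of it.
    """
    roughly_equal = abs(cv_error - train_error) <= tol * train_error
    if cv_error > train_error and not roughly_equal:
        return "high variance"   # overfitting: try decreasing complexity
    if roughly_equal and cv_error > (1 + tol) * desired_error:
        return "high bias"       # underfitting: try more complexity or more features
    return "ok"

# Hypothetical error values, for illustration only
print(diagnose(train_error=2.0, cv_error=6.0, desired_error=4.0))
print(diagnose(train_error=5.0, cv_error=5.1, desired_error=3.0))
print(diagnose(train_error=2.9, cv_error=3.0, desired_error=3.0))
```

In practice the thresholds depend on the problem, so the tolerance should be treated as a judgment call rather than a fixed rule.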



### Instantiate the model

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the data and one-hot encode the categorical columns
cars = pd.read_csv("mpg.csv")
df_origin = pd.get_dummies(cars)
X = df_origin.drop("mpg", axis=1)
y = df_origin["mpg"]

# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# A heavily constrained tree: each leaf must hold at least 26% of the training samples
dt = DecisionTreeRegressor(min_samples_leaf=0.26, max_depth=4, random_state=1)

### Evaluate the 10-fold CV error

cross_val_score reports the MSE only as its negative (scoring="neg_mean_squared_error"), so its output must be multiplied by -1 to obtain the MSEs.

In [9]:
from sklearn.model_selection import cross_val_score
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, scoring="neg_mean_squared_error",
                                 n_jobs=-1)
RMSE_CV = (MSE_CV_scores.mean()) ** 0.5
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14


CV is a great technique to get an estimate of a model's performance without touching the test set.
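To make the mechanics of K-fold CV concrete, here is a hedged sketch that writes the loop out by hand and compares it with cross_val_score. The synthetic data from make_regression is a stand-in chosen here, not the mpg dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the mpg dataset used in this section)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
model = DecisionTreeRegressor(max_depth=4, random_state=1)

fold_mses = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    fold_model = clone(model)                     # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])    # train on the other 9 folds
    y_val_pred = fold_model.predict(X[val_idx])   # predict the held-out fold
    fold_mses.append(mean_squared_error(y[val_idx], y_val_pred))

manual_rmse = np.mean(fold_mses) ** 0.5
library_rmse = (-cross_val_score(model, X, y, cv=10,
                                 scoring="neg_mean_squared_error")).mean() ** 0.5
print("manual CV RMSE:  {:.2f}".format(manual_rmse))
print("library CV RMSE: {:.2f}".format(library_rmse))
```

Every sample is used for validation exactly once, and no sample from a held-out test set is involved at any point.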

### Evaluate the training error

In [11]:
from sklearn.metrics import mean_squared_error
dt.fit(X_train, y_train)
y_pred_train = dt.predict(X_train)
RMSE_train = (mean_squared_error(y_train, y_pred_train))**0.5
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


The training error is roughly equal to the 10-fold CV error obtained above, which indicates that dt suffers from high bias: it underfits the training set.
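One remedy for high bias is to give the model more flexibility. The sketch below is an illustration under assumptions: it uses synthetic data from make_friedman1 rather than the mpg dataset, and the looser min_samples_leaf value of 0.05 and the helper name `train_and_cv_rmse` are arbitrary choices made here.

```python
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the mpg dataset used above)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=1)

def train_and_cv_rmse(model, X, y):
    """Return (training RMSE, 10-fold CV RMSE) for a regressor."""
    cv_mse = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_mean_squared_error")
    model.fit(X, y)
    train_rmse = mean_squared_error(y, model.predict(X)) ** 0.5
    return train_rmse, cv_mse.mean() ** 0.5

# Same depth limit, but the flexible tree is allowed much smaller leaves
constrained = DecisionTreeRegressor(min_samples_leaf=0.26, max_depth=4, random_state=1)
flexible = DecisionTreeRegressor(min_samples_leaf=0.05, max_depth=4, random_state=1)

for name, model in [("constrained", constrained), ("flexible", flexible)]:
    train_rmse, cv_rmse = train_and_cv_rmse(model, X, y)
    print("{:s}: train RMSE {:.2f}, CV RMSE {:.2f}".format(name, train_rmse, cv_rmse))
```

If the gap between training and CV error starts to widen as constraints are relaxed, the model is moving toward the high-variance regime and the constraints should be tightened again.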

## Ensemble Learning

CARTs are sensitive to small variations in the training set. They also suffer from high variance and may overfit if they are trained without constraints. A solution that takes advantage of the flexibility of CARTs while reducing their tendency to memorize noise is ensemble learning.
Ensemble learning trains different models on the same dataset, lets each model make its predictions, and aggregates the predictions of the individual models into a final prediction that is more robust and less prone to errors than that of each individual model. The best results are obtained when the models are skillful in different ways.
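A toy example shows why aggregation helps when models err in different places. The labels and predictions below are hypothetical, made up for illustration, and the aggregation is plain majority (hard) voting for binary labels.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])

# Hypothetical predictions from three models, each wrong on a different sample
preds = np.array([
    [1, 0, 1, 1, 1],   # model A: wrong on the last sample
    [1, 1, 1, 1, 0],   # model B: wrong on the second sample
    [0, 0, 1, 1, 0],   # model C: wrong on the first sample
])

# Hard voting for binary labels: predict 1 when more than half of the models do
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

for name, p in zip("ABC", preds):
    print("model {:s} accuracy: {:.2f}".format(name, np.mean(p == y_true)))
print("majority vote accuracy: {:.2f}".format(np.mean(majority == y_true)))
```

Each model is outvoted on the one sample it gets wrong, so the majority vote is correct everywhere. If all three models made the same mistake, voting would not help.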

In [37]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder

cancer = pd.read_csv("cancer.csv")
X = cancer.drop(["id", "Unnamed: 32", "diagnosis"], axis=1)
y = cancer["diagnosis"]
le = LabelEncoder()
y = le.fit_transform(y)
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                   random_state=1)
lr = LogisticRegression(random_state=1)
knn = KNN()
dt = DecisionTreeClassifier(random_state=1)

classifiers = [("Logistic Regression", lr),
               ("K Nearest Neighbor", knn),
               ("Classification Tree", dt)]

for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("{:s}: {:.3f}".format(clf_name, accuracy_score(y_test, y_pred)))
    
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
print("Voting Classifier: {:.3f}".format(accuracy_score(y_test, y_pred)))

Logistic Regression: 0.936
K Nearest Neighbor: 0.930
Classification Tree: 0.930
Voting Classifier: 0.959


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The voting classifier's accuracy is higher than that achieved by any of the individual models in the ensemble.
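The solver message in the output above suggests scaling the data or raising max_iter for the logistic regression. A hedged sketch of the scaling option follows. It wraps the logistic regression in a pipeline with StandardScaler and uses scikit-learn's bundled breast cancer data as a stand-in for cancer.csv (an assumption; the two files may differ), so the accuracy it prints is not comparable to the numbers above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Bundled breast cancer data, used as a stand-in for cancer.csv (an assumption)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Standardize features before the logistic regression so the solver converges more easily
lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier(random_state=1)

vc = VotingClassifier(estimators=[("Logistic Regression", lr),
                                  ("K Nearest Neighbor", knn),
                                  ("Classification Tree", dt)])
vc.fit(X_train, y_train)
accuracy = accuracy_score(y_test, vc.predict(X_test))
print("Voting Classifier (scaled lr): {:.3f}".format(accuracy))
```

Putting the scaler inside the pipeline means it is fit only on the training data that the ensemble sees, which avoids leaking test set statistics into the model.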



### Define the ensemble

In [47]:
lr = LogisticRegression(random_state=1)
knn = KNN(n_neighbors=27)
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=1)
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]

### Evaluate individual classifiers

In [49]:
liver = pd.read_csv("indian_liver_patient_preprocessed.csv", index_col=0)
X = liver.drop("Liver_disease", axis=1)
y = liver["Liver_disease"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 

for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.759
K Nearest Neighbours : 0.701
Classification Tree : 0.730


### Better performance with a Voting Classifier

In [52]:
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.770
