# Variance
Variance is the extent to which the function learned by a model changes when the model is trained on different training sets. High variance results in overfitting.
### High variance:
- Is typical of more “complex” (more flexible) models, such as decision trees.
- Leads to overfitting (poor test set performance).
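
To make the definition concrete, the sketch below fits the same model on many independently drawn training sets and measures how much its predictions move. The noisy sine data, the helper names `sample_training_set` and `prediction_spread`, and all sizes are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_training_set(n=50):
    """Draw a fresh training set from a hypothetical noisy sine process."""
    X = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)
    return X, y

def prediction_spread(make_model, n_sets=200):
    """Fit one model per training set; return the mean variance of its predictions on a fixed grid."""
    X_grid = np.linspace(0.5, 5.5, 20).reshape(-1, 1)
    preds = np.array([
        make_model().fit(*sample_training_set()).predict(X_grid)
        for _ in range(n_sets)
    ])
    # Variance across training sets at each grid point, averaged over the grid
    return preds.var(axis=0).mean()

tree_spread = prediction_spread(lambda: DecisionTreeRegressor())
linear_spread = prediction_spread(lambda: LinearRegression())
print(f"fully grown tree: {tree_spread:.3f}")
print(f"linear regression: {linear_spread:.3f}")
```

On this synthetic setup the fully grown tree's predictions should vary far more from one training set to the next than the linear model's, which is the behaviour the bullets above describe.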

# Bias
Bias refers to the error that arises when the function learned by a model is too simple for a complex problem, so that it ignores the structural relationship between the predictors and the target.
### High bias:
- Model assumptions fail to explain the relationship between predictors and outcome.
- Is typical of “simpler” (less flexible) models, such as linear regression.
- Leads to underfitting (poor training set performance).
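
As a small illustration of underfitting, the sketch below fits a straight line to data generated from a quadratic relationship and compares its training error with that of a model that includes a squared term. The synthetic data and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical data: the target depends on the square of the predictor
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# A straight line cannot represent the curvature (high bias)
linear = LinearRegression().fit(X, y)
# Adding a squared feature lets the model match the structure of the data
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_train_mse = mean_squared_error(y, linear.predict(X))
quadratic_train_mse = mean_squared_error(y, quadratic.predict(X))
print(f"linear training MSE: {linear_train_mse:.3f}")
print(f"quadratic training MSE: {quadratic_train_mse:.3f}")
```

The linear model's error stays high even on the data it was trained on, because its assumption of a straight-line relationship does not hold here.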

# Trade-off:
Reducing one of the two usually increases the other, which is not desirable.
So can we reduce both bias and variance? Is it achievable?
The answer is yes!
The two error terms do not change linearly with model complexity; hence the prediction error depends on the relative rate of change of the two.
Ensembling and cross-validation are frequently used methods to combat the bias-variance dilemma.
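
One way cross-validation is used in practice is to pick a complexity setting. Cross-validation does not itself change a model's bias or variance; it estimates out-of-sample error so that a reasonable setting can be chosen. The sketch below, on assumed synthetic data with an arbitrary list of depths, scores decision trees of increasing depth with 5-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical noisy sine data, used only for illustration
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Shallow trees tend to underfit; None lets the tree grow fully and tends to overfit
depths = [1, 2, 3, 4, 6, 8, 12, None]
cv_mse = {}
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[depth] = -scores.mean()  # scikit-learn returns negated MSE

best_depth = min(cv_mse, key=cv_mse.get)
for depth, mse in cv_mse.items():
    print(f"max_depth={depth}: CV MSE={mse:.3f}")
print("selected max_depth:", best_depth)
```

On data like this, the cross-validated error is typically lowest at an intermediate depth, between the very shallow tree and the fully grown one.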

In [1]:
import mlxtend

In [2]:
# estimate the bias and variance for a regression model
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# separate into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the model
model = LinearRegression()
# estimate bias and variance
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test, loss='mse', num_rounds=200, random_seed=1)
# summarize results
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)

MSE: 22.418
Bias: 20.744
Variance: 1.674
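
To show what a decomposition of this kind computes for squared error, here is a minimal bootstrap sketch. It is not mlxtend's implementation; the function name `bootstrap_mse_decomposition`, the synthetic data, and the coefficients are assumptions for illustration. Because the bias term is measured against noisy test targets, it also absorbs the irreducible noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def bootstrap_mse_decomposition(model, X_train, y_train, X_test, y_test,
                                num_rounds=200, seed=1):
    """Estimate expected loss, squared bias and variance from bootstrap refits."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    all_preds = np.empty((num_rounds, X_test.shape[0]))
    for r in range(num_rounds):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, drawn with replacement
        all_preds[r] = model.fit(X_train[idx], y_train[idx]).predict(X_test)
    main_pred = all_preds.mean(axis=0)  # average prediction per test point
    expected_loss = ((all_preds - y_test) ** 2).mean()
    bias_sq = ((main_pred - y_test) ** 2).mean()
    variance = ((all_preds - main_pred) ** 2).mean()
    return expected_loss, bias_sq, variance

# Hypothetical linear data with unit-variance noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

loss, bias_sq, variance = bootstrap_mse_decomposition(
    LinearRegression(), X_train, y_train, X_test, y_test)
print(f"MSE: {loss:.3f}  Bias^2: {bias_sq:.3f}  Variance: {variance:.3f}")
```

With squared error, the expected loss computed this way equals the squared bias plus the variance, which matches the pattern in the output above (20.744 + 1.674 = 22.418).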


Example -- Bias-Variance Decomposition of a Decision Tree Classifier


In [3]:
from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split


X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)



tree = DecisionTreeClassifier(random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test, 
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 0.062
Average bias: 0.022
Average variance: 0.040
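
For 0-1 loss the terms are defined differently: the "main prediction" is the most frequent label across refits rather than the mean. The sketch below is one common way to set this up, not necessarily what mlxtend does internally; the function name `bootstrap_01_decomposition` is hypothetical, and scikit-learn's bundled `load_iris` is used so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bootstrap_01_decomposition(model, X_train, y_train, X_test, y_test,
                               num_rounds=200, seed=123):
    """Estimate 0-1 expected loss, bias and variance; labels must be integers 0..K-1."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    all_preds = np.empty((num_rounds, X_test.shape[0]), dtype=int)
    for r in range(num_rounds):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, drawn with replacement
        all_preds[r] = model.fit(X_train[idx], y_train[idx]).predict(X_test)
    # Main prediction: the most frequent predicted label for each test point
    n_classes = all_preds.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
    main_pred = counts.argmax(axis=0)
    expected_loss = (all_preds != y_test).mean()  # average misclassification rate
    bias = (main_pred != y_test).mean()           # main prediction is wrong
    variance = (all_preds != main_pred).mean()    # refit disagrees with main prediction
    return expected_loss, bias, variance

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, shuffle=True, stratify=y)

loss, bias, variance = bootstrap_01_decomposition(
    DecisionTreeClassifier(random_state=123), X_train, y_train, X_test, y_test)
print(f"expected loss: {loss:.3f}  bias: {bias:.3f}  variance: {variance:.3f}")
```

Unlike the squared-error case, the expected 0-1 loss is not in general the exact sum of bias and variance, though under these definitions it cannot exceed that sum.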


For comparison, the bias-variance decomposition of a bagging classifier, which should intuitively have a lower variance than a single decision tree:



In [6]:
from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import iris_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Use the same iris data and split as the single-tree example so the results are comparable
X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

# Define the base model (decision tree)
tree = DecisionTreeClassifier(random_state=123)

# Define the bagging ensemble classifier
# (the `estimator` argument requires scikit-learn 1.2 or newer; older versions use `base_estimator`)
bag = BaggingClassifier(estimator=tree, n_estimators=100, random_state=123)

# Run the decomposition on the ensemble instead of printing placeholder values
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test,
        loss='0-1_loss',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)


Example -- Bias-Variance Decomposition of a Decision Tree Regressor


In [7]:
from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeRegressor
from mlxtend.data import boston_housing_data
from sklearn.model_selection import train_test_split


X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)



tree = DecisionTreeRegressor(random_state=123)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        tree, X_train, y_train, X_test, y_test, 
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 31.536
Average bias: 14.096
Average variance: 17.440


For comparison, the bias-variance decomposition of a bagging regressor is shown below, which should intuitively have a lower variance than a single decision tree:



In [9]:
from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import boston_housing_data
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

# Use the same housing data and split as the single-tree example so the results are comparable
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

# Define the base model (decision tree)
tree = DecisionTreeRegressor(random_state=123)

# Define the bagging ensemble regressor
# (the `estimator` argument requires scikit-learn 1.2 or newer; older versions use `base_estimator`)
bag = BaggingRegressor(estimator=tree, n_estimators=100, random_state=123)

# Run the decomposition on the ensemble instead of printing placeholder values
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        bag, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)
