# Bias-Variance Tradeoff
Assume that we have noisy data, modeled by $f = y + \epsilon$, where $y$ is the true (noise-free) value and $\epsilon \sim \mathcal{N}(0,\sigma^2)$, so that $\mathbb{E}[f] = y$ and $Var(f) = \sigma^2$. Given an estimator $\hat{f}$ that is independent of the noise in $f$ (for example, one fit on a separate sample), the expected squared error can be derived as follows:

$$
\begin{align}
\mathbb{E}\left[\left(\hat{f} - f\right)^2\right] &= \mathbb{E}\left[\hat{f}^2 - 2f\hat{f} + f^2\right]\\
&= \mathbb{E}\left[\hat{f}^2\right] + \mathbb{E}\left[f^2\right] - 2\mathbb{E}\left[f\hat{f}\right] \text{ (by linearity of expectation)}
\end{align}
$$
Now, by definition, $Var(x) = \mathbb{E}\left[x^2\right] - \left(\mathbb{E}\left[x\right]\right)^2$. Substituting this definition into the equation above, and using $\mathbb{E}[f\hat{f}] = \mathbb{E}[f]\,\mathbb{E}[\hat{f}]$ (by independence), we get:
$$
\begin{align}
\mathbb{E}\left[\hat{f}^2\right] + \mathbb{E}\left[f^2\right] - 2\mathbb{E}\left[f\hat{f}\right] &= Var(\hat{f}) + \left(\mathbb{E}[\hat{f}]\right)^2  + Var(f) + \left(\mathbb{E}[f]\right)^2 - 2\mathbb{E}[f]\,\mathbb{E}[\hat{f}] \\
&= Var(\hat{f}) + Var(f) + \left(\mathbb{E}[\hat{f}] - \mathbb{E}[f]\right)^2\\
&= \boxed{\sigma^2 + Var(\hat{f}) + \left(\mathbb{E}[\hat{f}] - y\right)^2}
\end{align}
$$

The first term is the irreducible error: the variance of the noise $\epsilon$ in the data, which no estimator can remove. The second term is the **variance** of the estimator $\hat{f}$, and the final term is the squared **bias** of the estimator. There is an inherent tradeoff between the bias and the variance of an estimator. Generally, more complex estimators (think of high-degree polynomials as an example) have low bias, since they can fit the sampled data very closely. However, that fit changes substantially each time the data are resampled, which means the variance of such an estimator is high.
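The decomposition and the polynomial example can be checked numerically. The sketch below is only an illustration under assumed settings that are not part of the text: the true function is taken to be $\sin(2\pi x)$, the noise standard deviation is 0.3, and the helper name `bias_variance_poly` is hypothetical. It repeatedly resamples noisy training data, fits a polynomial, and compares the measured squared error with the sum of the three terms.

```python
import numpy as np

def bias_variance_poly(degree, sigma=0.3, n_train=20, trials=500, seed=0):
    """Monte Carlo estimate of the error decomposition for a polynomial fit.

    Assumed setup (illustrative only): true function y(x) = sin(2*pi*x),
    a fixed design of n_train points on [0, 1], Gaussian noise with std sigma.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.05, 0.95, 50)
    y_test = np.sin(2 * np.pi * x_test)  # noise-free values y

    fits = np.empty((trials, x_test.size))
    sq_err = np.empty((trials, x_test.size))
    for t in range(trials):
        # resample the noisy training observations f = y + eps
        f_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, sigma, n_train)
        coeffs = np.polyfit(x_train, f_train, degree)
        fits[t] = np.polyval(coeffs, x_test)
        # fresh noisy test observations, independent of the fitted estimator
        f_test = y_test + rng.normal(0.0, sigma, x_test.size)
        sq_err[t] = (fits[t] - f_test) ** 2

    mse = sq_err.mean()                                   # E[(f_hat - f)^2]
    variance = fits.var(axis=0).mean()                    # Var(f_hat)
    bias_sq = ((fits.mean(axis=0) - y_test) ** 2).mean()  # (E[f_hat] - y)^2
    return mse, sigma ** 2, variance, bias_sq

for degree in (1, 5):
    mse, noise, variance, bias_sq = bias_variance_poly(degree)
    print(f"degree {degree}: mse={mse:.3f}, "
          f"noise+var+bias^2={noise + variance + bias_sq:.3f}, "
          f"var={variance:.4f}, bias^2={bias_sq:.4f}")
```

Under these assumptions, the straight line should show the larger squared bias and the degree-5 polynomial the larger variance, while in both cases the measured error should stay close to the sum of the three terms.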

# Activity
We will now see how the bias and variance of an estimator change, using an SVM as the example. Consider the dataset generated by the code below (two Gaussians). We will vary the misclassification cost `C` and observe the resulting changes in the decision boundary that tries to separate the two clusters of points. Intuitively, we can expect that with a higher cost, i.e. a more severe penalty on misclassifications, the boundary fits the sampled data more closely, resulting in lower bias. At the same time, if we resample the data, that boundary is no longer as accurate in separating the points, which reflects higher variance. We plot the bias and the variance of the estimator for a range of costs and take note of the general trends in these values below.
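As a rough sketch of how the two quantities can be estimated from repeated fits: collect the predictions of each resampled model on a fixed set of points, then compare the average prediction with the labels (squared bias) and measure the spread across models (variance). The helper name `bias_variance` and the toy arrays below are hypothetical and only illustrate the bookkeeping.

```python
import numpy as np

def bias_variance(preds, targets):
    """Estimate squared bias and variance from repeated predictions.

    preds:   array of shape (n_points, n_resamples), one column per fitted model
    targets: array of shape (n_points,), the labels of the fixed points
    """
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mean_pred = preds.mean(axis=1)                 # average prediction per point
    bias_sq = np.mean((mean_pred - targets) ** 2)  # squared bias, averaged over points
    variance = np.mean(preds.var(axis=1))          # spread across models, averaged over points
    return bias_sq, variance

# toy example: 3 points, 4 resampled models, labels in {-1, +1}
targets = np.array([-1, 1, 1])
preds = np.array([[-1, -1, -1, -1],
                  [ 1,  1, -1, -1],
                  [-1, -1, -1, -1]])
bias_sq, variance = bias_variance(preds, targets)
print(bias_sq, variance)
```

In the toy example the squared bias is 5/3 and the variance is 1/3. Squaring before averaging keeps errors on the two classes from cancelling each other out.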

In [14]:
import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt

#SVM kernel - linear, poly, rbf, etc.
kernel='rbf'

#Cost values
Cs = [1,2,5,10,20,50,100,200,500,1000]

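#define data
n = 100      #points sampled per class
sub = 10     #number of resampled datasets per cost value
var = 2      #base variance of the Gaussians
mean = 1     #class means sit at +/-(mean, mean)
mean1 = [mean, mean]
cov1 = [[2*var, 0], [0, 2*var]]
mean2 = [-mean, -mean]
cov2 = [[var,0],[0,var]]
g = 8        #half-width of the plotting window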

def plot_boundary(clf,g):
    #adapted from http://scikit-learn.org/stable/modules/svm.html
    h = 0.2
    xx, yy = np.meshgrid(np.arange(-g,g, h),np.arange(-g,g, h))    
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

#sample dataset of two gaussians
x1, y1 = np.random.multivariate_normal(mean1, cov1, n).T
x2, y2 = np.random.multivariate_normal(mean2, cov2, n).T
Xtest = np.concatenate((np.asarray([x1,y1]).T,np.asarray([x2,y2]).T),axis=0)
ytest = np.asarray([-1]*(n) + [1]*(n))

#plot current sample
plt.scatter(x1,y1,color='red')
plt.scatter(x2,y2, color='blue')
plt.show()

biasSquared = np.zeros(sub)
preds = np.zeros((2*n,sub))

b = np.zeros(len(Cs))
v = np.zeros(len(Cs))

for j,C in enumerate(Cs):
    for i in range(sub):
        
        #create data - 2D Gaussians     
        x1, y1 = np.random.multivariate_normal(mean1, cov1, n).T
        x2, y2 = np.random.multivariate_normal(mean2, cov2, n).T
        X = np.concatenate((np.asarray([x1,y1]).T,np.asarray([x2,y2]).T),axis=0)
        y = np.asarray([-1]*(n) + [1]*(n))
        
        #fit SVM
        svc = svm.SVC(kernel=kernel,C=C).fit(X,y)
        preds[:,i] = svc.predict(Xtest)
        
        #plot the decision boundary for the first four resamples
        if i < 4:
            plt.subplot(2,2,i+1)
            plot_boundary(svc,g)
            plt.plot(x1, y1, 'r.', x2, y2, 'b.')
            plt.axis([-g,g,-g,g])

    plt.suptitle('Cost = %i' % (C))
    plt.show()
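    #squared bias: squared gap between the average prediction and the labels
    b[j] = np.mean((np.mean(preds,1) - ytest)**2)
    #variance: spread of the predictions across the resampled fits
    v[j] = np.mean(np.var(preds,1))

plt.subplot(2,1,1)
plt.plot(Cs,b)
plt.title('squared bias')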
plt.subplot(2,1,2)
plt.plot(Cs,v)
plt.title('variance')
plt.show()