## Linear Regression

In many real-world applications we are given a set of data $X$, which we treat as the observed inputs or measurements, and $y$, a *continuous* output or useful metric to infer. Is there a set of latent coefficients/weights $w$ which, when applied to $X$, can infer $y$? This is the fundamental principle of **linear regression**. In this notation, capital letters such as $X$ denote *matrices*, lower-case letters such as $w, y$ denote vectors, and Greek letters denote scalar coefficients. For a single sample, the linear model is:

$$
y=\mathbf{w}^T\mathbf{x}+\eta+\epsilon
$$

where $w$ is our slope/gradient, $x$ is the input, $\eta$ is the intercept and $\epsilon$ is the error.

We will, however, work in matrix notation. To simplify the math we merge the intercept $\eta$ into the weights $w$ and add a bias column of ones to $X$:

$$
\mathbf{X}=\left[\begin{matrix}
   x_{11} & x_{12} & \dots & x_{1m} & 1\\ 
            x_{21} & x_{22} & \dots & x_{2m} & 1\\ 
            \vdots & \vdots & \ddots & \vdots & \vdots \\ 
            x_{n1} & x_{n2} & \dots & x_{nm} & 1\\ 
  \end{matrix} \right], \qquad
  \mathbf{w}=\left[\begin{matrix}
    w_1 \\ w_2 \\ \vdots \\ w_m \\ \eta
  \end{matrix}\right]
$$

where each row $\mathbf{x}_i = (x_{i1}, \dots, x_{im})$ is a vector and $w_1, \dots, w_m$ are scalars. Our new model is therefore $\mathbf{y=Xw}$, with $m+1$ unknowns. To find the best $\bf w$, we minimize the difference between the values generated by $\bf Xw$ and $\bf y$:

$$
\mathbf{e} = \min_{\mathbf{w}} \ \lVert \mathbf{Xw-y} \rVert^2
$$

To minimize this, we calculate the gradient with respect to the weights $\bf w$ (including the intercept):

$$
\nabla_w \mathbf{e} = 2\mathbf{X^T}(\mathbf{Xw-y})
$$
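
As a sanity check on the gradient formula, the sketch below compares it against central finite differences on a small random problem. The data, the sizes and the helper name `objective` are made up for illustration; the bias column is appended last, matching the layout of $\mathbf{X}$ above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hypothetical problem: 20 samples, 3 features plus a bias column of ones
X = np.column_stack((rng.random((20, 3)), np.ones(20)))
y = rng.random(20)
w = rng.random(4)

def objective(w):
    """Squared error ||Xw - y||^2 (hypothetical helper)."""
    r = X @ w - y
    return r @ r

# Analytic gradient: 2 X^T (Xw - y)
grad_analytic = 2 * X.T @ (X @ w - y)

# Central finite differences, perturbing one coordinate at a time
h = 1e-6
grad_numeric = np.array([
    (objective(w + h * e_j) - objective(w - h * e_j)) / (2 * h)
    for e_j in np.eye(len(w))
])

# Largest absolute disagreement between the two gradients
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

The two gradients should agree up to floating-point rounding, since the objective is quadratic in $\bf w$.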

Setting this to zero and rearranging gives:

$$
\begin{aligned}
\mathbf{X^TXw}&=\mathbf{X^Ty} \\
\mathbf{w}&=(\mathbf{X^TX})^{-1}\mathbf{X^Ty}
\end{aligned}
$$
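
The closed form can be evaluated in more than one way. The sketch below, on made-up data with assumed true weights, compares the explicit inverse with `np.linalg.solve` applied to the same system and with `np.linalg.lstsq` applied to $\mathbf{X}$ directly. Solving the system without forming the inverse is generally considered the numerically safer route, although the inverse is fine for the small exercises here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 samples, 2 features, true weights [2, -1], intercept 0.5
X_raw = rng.random((50, 2))
y = X_raw @ np.array([2.0, -1.0]) + 0.5 + rng.normal(0.0, 0.01, 50)

# Append the bias column of ones as the last column
X = np.column_stack((X_raw, np.ones(len(X_raw))))

# Closed form with an explicit inverse
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same equations, solved as a linear system without forming the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working on X directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv, w_solve, w_lstsq)
```

All three estimates should agree closely and land near the assumed weights, with the intercept as the last entry.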

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Task 1.

Write a function `least_squares()` that, given an input matrix $\mathbf{X}_{N,P}$ and output vector $\mathbf{y}_N$, *directly* calculates and returns the best weight coefficients $\mathbf{w}_{P+1}$ (including the intercept). You can use `np.linalg.inv()` to calculate the matrix inverse. Remember to add the bias column to the $X$ matrix. To generate $X$ and $y$, use the provided function `make_regression()`; you may choose to change some of the optional parameters.

In [1]:
def make_regression(n_samples = 100, n_features = 4, n_optimal = 2, bias = 0.0, noise = 0.5, mean_slope = 1.0):
    """
    This function generates an X and y for regression tasks.
    
    Parameters
    -------
    n_samples : int
        the number of samples to generate
    n_features : int
        the number of columns/features
    n_optimal : int
        the number of useful columns/features, must be <= n_features
    bias : double
        the bias/intercept
    noise : double
        standard deviation of the Gaussian noise
    mean_slope : double
        the approximate slope of optimal values
    
    Returns
    -------
    X : matrix (n_samples, n_features)
        input matrix
    y : vector (n_samples)
        output vector
    """
    # create random X matrix
    X = np.random.rand(n_samples,n_features)
    # good features
    w_opt = mean_slope + np.random.rand(n_optimal)*0.1
    # append bad features
    w = np.hstack((w_opt, np.random.rand(n_features - n_optimal)*0.1))
    # add zero-mean Gaussian noise (the bias is added once, below)
    error = np.random.normal(0.0, noise, n_samples)
    # apply y = Xw+b+e
    y = np.dot(X,w) + bias + error
    return X,y

In [None]:
# your code here
def least_squares(X, y):
    n, p = X.shape
    # prepend the bias column of ones, so the intercept is w[0]
    nX = np.column_stack((np.ones(n), X))
    # solve w = (X^T X)^-1 X^T y
    return np.dot(np.linalg.inv(np.dot(nX.T,nX)),np.dot(nX.T,y))


### Task 2.

Plot the predicted values $\hat y$ against the actual values $y$ as a scatterplot. The predictions can be computed as $\mathbf{\hat y} = \mathbf{Xw}$.

In [None]:
# your code here
def ls_predict(X, w, bias_included=False):
    if bias_included:
        return np.dot(X,w)
    else:
        return np.dot(np.column_stack((np.ones(len(X)), X)), w)

np.random.seed(5458392)
X, y = make_regression()

# call method
w = least_squares(X, y)
# predict yp
yp = ls_predict(X, w)

plt.scatter(yp,y)
plt.plot([-.5, 3.], [-.5, 3.], 'k--')
plt.xlabel("predicted values")
plt.ylabel("actual values")
plt.title("Actual against predicted values")

print(w)

### Task 3.

Ordinary least squares is known to be sensitive to strongly skewed *outliers*. To mitigate this we can add a regularizing term to the objective function in the form of the squared $\ell_2$-norm, $\lVert \mathbf{w} \rVert_2^2$:

$$
\mathbf{e} = \min_{\mathbf{w}} \ \lVert \mathbf{Xw-y} \rVert^2 + \lambda \lVert \mathbf{w} \rVert_2^2
$$

where $\lambda$ is a hyperparameter that controls the amount of regularization. Setting the gradient to zero gives the optimal $\bf w$:

$$
\mathbf{w}=(\mathbf{X^TX}+\lambda I)^{-1}\mathbf{X^Ty}
$$

where $I$ refers to the identity matrix.
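
For reference, a short derivation sketch, assuming the penalty is the squared norm $\lambda \lVert \mathbf{w} \rVert_2^2$ (the form for which this closed-form solution holds):

$$
\begin{aligned}
\nabla_w \left( \lVert \mathbf{Xw-y} \rVert^2 + \lambda \lVert \mathbf{w} \rVert_2^2 \right) &= 2\mathbf{X^T}(\mathbf{Xw-y}) + 2\lambda \mathbf{w} = 0 \\
(\mathbf{X^TX}+\lambda I)\mathbf{w} &= \mathbf{X^Ty} \\
\mathbf{w} &= (\mathbf{X^TX}+\lambda I)^{-1}\mathbf{X^Ty}
\end{aligned}
$$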

Write a function `ridge()` that solves this equation *directly* and takes the same parameters as `least_squares()`, plus an additional parameter $\lambda$ with a default value of 1. Plot $\hat y$ against $y$.

In [None]:
# your code here
def ridge(X, y, lamda=1.0):
    n, p = X.shape
    # prepend the bias column of ones
    nX = np.column_stack((np.ones(n), X))
    # solve w = (X^T X + lambda*I)^-1 X^T y (the intercept is regularized too)
    return np.dot(np.linalg.inv(np.dot(nX.T,nX) + lamda*np.eye(p+1)),np.dot(nX.T,y))

w = ridge(X, y, .1)
yp = ls_predict(X, w)

print(w)

plt.scatter(yp, y)
plt.plot([-.5, 3.], [-.5, 3.], 'k--')
plt.xlabel("predicted values")
plt.ylabel("actual values")
plt.show()

### Task 4.

Calculate the **Pearson correlation** between $\hat y$ and $y$. This is calculated as:

$$
P(x,y)=\frac{\sum_{i=1}^{n} (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n (x_i-\bar x)^2} \sqrt{\sum_{i=1}^n(y_i-\bar y)^2}}
$$

where $\bar x$ and $\bar y$ refer to the mean of each respective vector.
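
As a quick check of the formula, the sketch below evaluates it on two made-up vectors and compares the result with NumPy's built-in `np.corrcoef`, which returns the full correlation matrix. The vectors and variable names are hypothetical.

```python
import numpy as np

# Hypothetical vectors for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Direct translation of the formula: centre both vectors first
xc = x - x.mean()
yc = y - y.mean()
r_formula = np.sum(xc * yc) / (np.sqrt(np.sum(xc**2)) * np.sqrt(np.sum(yc**2)))

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_formula, r_numpy)
```

Both values should match, and they should be close to 1 because the made-up data is nearly linear.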

In [None]:
# your code here
def pearson(x, y):
    xm = x.mean()
    ym = y.mean()
    return np.sum((x - xm)*(y - ym)) / (np.sqrt(np.sum((x - xm)**2)) * np.sqrt(np.sum((y - ym)**2)))

p = pearson(yp, y)

plt.scatter(yp,y,label="r={:0.3f}".format(p))
plt.legend()
plt.show()

### Task 5.

In some cases computing the matrix inverse is intractable, meaning we cannot solve for the weights *directly*. Instead, we can take steps in the direction of the *global minimum* through **gradient descent**. The algorithm works as follows:
1. Initialise $\bf w$ uniformly at random and set $k = 0$
1. While $k < K_{max}$ and not converged:
    1. Calculate $\nabla_w \mathbf{e}$
    2. Update $w^{(k+1)}=w^{(k)} - \gamma \nabla_w \mathbf{e}$ and increment $k$
1. Return $\bf w$

where $\gamma$ is the learning rate.
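
The steps above can be sketched as follows. This is one possible reading, not a reference solution: the stopping rule (gradient norm below a tolerance `tol`) is an assumed convergence test, and the function name `gd_sketch`, the noise-free data and `w_true` are made up so that the minimum is known exactly.

```python
import numpy as np

def gd_sketch(X, y, gamma=1e-3, k_max=1000, tol=1e-6, seed=0):
    """Gradient descent on ||Xw - y||^2; X must already contain the bias column."""
    rng = np.random.default_rng(seed)
    w = rng.random(X.shape[1])          # step 1: uniform random start
    for k in range(k_max):              # step 2: at most k_max iterations
        grad = 2 * X.T @ (X @ w - y)    # gradient of the squared error
        if np.linalg.norm(grad) < tol:  # assumed convergence test
            break
        w = w - gamma * grad            # update with learning rate gamma
    return w, k

# Hypothetical noise-free data, bias column appended last
rng = np.random.default_rng(2)
X = np.column_stack((rng.random((100, 2)), np.ones(100)))
w_true = np.array([1.5, -0.5, 0.25])
y = X @ w_true

w_gd, n_steps = gd_sketch(X, y, gamma=1e-3, k_max=20000)
print(w_gd, n_steps)
```

If $\gamma$ is too large the iterates diverge rather than converge, so the learning rate usually has to be tuned to the scale of $\mathbf{X}$.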

Write a function `gradient_descent()` using the gradient from least squares, given $X$, $y$, $\gamma=10^{-3}$ and a maximum number of iterations $K_{max}=10^3$. Save each step and plot the iteration $k$ against each weight $w$ (or the mean) to see how the weights converge.

In [None]:
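# your code here
def gradient_descent(X, y, gamma = .001, n_iter=1000):
    n, P = X.shape
    # prepend the bias column of ones
    nX = np.column_stack(((np.ones(n,)), X))
    # one column of saved weights per iteration
    saved_w = np.empty((P+1, n_iter))
    # initialise the weights uniformly at random
    w = np.random.rand(P+1)
    saved_w[:,0] = w
    for i in range(1,n_iter):
        # gradient of the squared error: 2 X^T (Xw - y)
        dE = np.dot((2*nX.T),(np.dot(nX,w) - y))
        w -= gamma*dE
        saved_w[:,i] = w
    return saved_w, w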

N_iter = 500

sw, w = gradient_descent(X, y, .001, N_iter)

print(X.shape, y.shape, sw.shape, w.shape)

t = np.arange(N_iter)
for i in range(len(w)):
	plt.plot(t,sw[i,:],'--',label=i)
plt.legend()
plt.show()