# Simple Linear Regression

* When there is a single input variable, the method is referred to as simple linear regression. The model is a line with intercept $b_0$ and slope $b_1$:
$$ y = b_0 + b_1 x $$

We can estimate the coefficients as follows:
$$
b_1 = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sum_{i=1}^{n} (x_i - \mu_x)^2}
= \frac{\mathrm{covariance}(x, y)}{\mathrm{variance}(x)}
$$
$$
b_0 = \mu_y - b_1 \mu_x
= \mathrm{mean}(y) - b_1 \cdot \mathrm{mean}(x)
$$


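* Mean       : $$ \mu_x = \frac{1}{n} \sum^{n}_{i=1} x_i $$
* Variance   : Measures the spread of the values around their mean. Here it is computed as the unnormalized sum of squared deviations; the usual factor $\frac{1}{n}$ (or $\frac{1}{n-1}$) cancels in the ratio for $b_1$.
$$ \sum_{i=1}^{n} (x_i - \mu_x)^2 $$
* Covariance : Describes how two variables vary together; correlation is the covariance normalized by the two standard deviations. As with the variance, the normalizing factor is omitted here.
$$ \sum^{n}_{i=1} (x_i - \mu_x)(y_i - \mu_y) $$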

## Quantities

In [6]:
def mean(values):
    return sum(values) / float(len(values))

def variance(values):
    # Unnormalized: sum of squared deviations from the mean (no division by n).
    mean_val = mean(values)
    return sum((val - mean_val) ** 2 for val in values)

def covariance(x, y):
    mean_x = mean(x)
    mean_y = mean(y)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i]-mean_x) * (y[i]-mean_y)
    return covar

In [8]:
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]                              
x = [row[0] for row in dataset]                                                 
y = [row[1] for row in dataset]                                                 
mean_x, mean_y = mean(x), mean(y)                                               
var_x, var_y = variance(x), variance(y)                         
print('x stats: mean=%.3f variance=%.3f' % (mean_x, var_x))                     
print('y stats: mean=%.3f variance=%.3f' % (mean_y, var_y))

covar = covariance(x, y)
print( 'Covariance: %.3f ' % (covar))

x stats: mean=3.000 variance=10.000
y stats: mean=2.800 variance=8.800
Covariance: 8.000 
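
The `variance` and `covariance` helpers return unnormalized sums, without dividing by $n$ or $n-1$. A minimal sketch, reusing the same five points, suggests why this is harmless for the slope: any common normalizing factor cancels in the ratio. The variable names below are illustrative and not part of the notebook.

```python
# Same five (x, y) points as the notebook's dataset.
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
n = len(x)
mu_x, mu_y = sum(x) / n, sum(y) / n

# Unnormalized sums, matching the notebook's covariance() and variance().
s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mu_x) ** 2 for xi in x)

b1_from_sums = s_xy / s_xx
b1_population = (s_xy / n) / (s_xx / n)            # both divided by n
b1_sample = (s_xy / (n - 1)) / (s_xx / (n - 1))    # both divided by n - 1

print(b1_from_sums, b1_population, b1_sample)
```

All three values agree up to floating-point rounding, so the choice of normalization only matters when the variance or covariance is reported on its own.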


## Coefficients

In [10]:
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, y) / variance(x)
    b0 = y_mean - b1*x_mean
    return b0, b1

In [14]:
b0, b1 = coefficients(dataset)
print( ' Coefficients: B0={:.3f}, B1={:.3f}'.format(b0, b1))

 Coefficients: B0=0.400, B1=0.800
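
As a check by hand, the quantities printed earlier (covariance 8, variance of $x$ 10, means 3 and 2.8) reproduce these coefficients:
$$
b_1 = \frac{8}{10} = 0.8, \qquad b_0 = 2.8 - 0.8 \times 3 = 0.4
$$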


## Putting it all together

In [17]:
def simple_linear_regression(train, test):
    predictions = []
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1*row[0]
        predictions.append( yhat )
    return predictions

In [18]:
from Getting_Started import rmse_metric

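def evaluate_algorithm(dataset, algorithm):
    # Build a test set from the training rows with the target blanked out,
    # so the algorithm cannot read the answers. Note that this scores the
    # model on the same rows it was fitted to, not on held-out data.
    test_set = []
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(dataset, test_set)
    print(predicted)
    # Compare the predictions against the true targets.
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse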

In [19]:
rmse = evaluate_algorithm(dataset, simple_linear_regression)
print( ' RMSE: %.3f ' % (rmse))

[1.1999999999999995, 1.9999999999999996, 3.5999999999999996, 2.8, 4.3999999999999995]
 RMSE: 0.693 
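
`rmse_metric` is imported from the separate `Getting_Started` module, whose code is not shown here. Assuming it computes the usual root mean squared error, $\sqrt{\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2}$, a stand-in might look like the sketch below. The name `rmse` is hypothetical, and the original helper is not necessarily written this way.

```python
from math import sqrt

def rmse(actual, predicted):
    # Hypothetical stand-in for rmse_metric: root of the mean squared error.
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

actual = [1, 3, 3, 2, 5]
predicted = [1.2, 2.0, 3.6, 2.8, 4.4]   # rounded predictions from the cell above
print('RMSE: %.3f' % rmse(actual, predicted))   # prints "RMSE: 0.693"
```

Because the model is scored on the rows it was fitted to, this figure is a training error rather than an estimate of performance on unseen data.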


# Multivariate Linear Regression

Given a vector of inputs $X^T = (X_1, \ldots, X_p)$, with $X$ a column vector, we can predict the output $Y$ with the model
$$
\hat{Y} = \hat{\beta}_0 + \sum^{p}_{j=1} X_j \hat{\beta}_j
$$
where $p$ is the number of features and $\hat{\beta}_0$ is the intercept, or **bias**, term.

We can include the constant variable 1 in $X$ so that $\hat{\beta}_0$ is absorbed into the coefficient vector $\hat{\beta}$, and the linear model can be written as an inner product:
$$
\hat{Y} = X^T \hat{\beta}
$$
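
A minimal sketch of this inner-product form in plain Python, assuming coefficients are stored as a list whose first entry is the intercept. The function name `predict` and that storage convention are illustrative choices, not something the notebook defines.

```python
def predict(x, beta):
    # Prepend the constant 1 so that beta[0] acts as the intercept.
    x_aug = [1.0] + list(x)
    return sum(xj * bj for xj, bj in zip(x_aug, beta))

beta = [0.4, 0.8]            # [intercept, slope] from the simple regression above
print(predict([3.0], beta))  # one feature, x = 3
```

With a single feature this reduces to $b_0 + b_1 x$; with more features only the lengths of `x` and `beta` change.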

We can interpret the model as a function over the $p$-dimensional input space, $f(X) = X^T \hat{\beta}$, with gradient $f'(X) = \hat{\beta}$.
$\beta$ (dropping the hat from here on) is thus a vector in input space that points in the steepest uphill direction.

To fit the linear model to a dataset we will use the method of least squares.
We pick the coefficients $\beta$ that **minimize the residual sum of squares**
$$
RSS(\beta ) =\sum^{N}_{i=1} (y_i - x_{i}^{T}\beta )^2
$$

In matrix notation,
$$
RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta),
$$
where $\mathbf{X}$ is an $N \times p$ matrix in which each row is an input vector, and $\mathbf{y}$ is an $N$-vector of outputs for the training set.
Because $RSS(\beta)$ is a quadratic function of $\beta$ that is never negative, its minimum always exists, although the minimizer may not be unique.
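
To make the criterion concrete, here is a small sketch that evaluates the residual sum of squares on the five points from the first section, with each row written as `[1, x]` so the first coefficient is the intercept. The function name `rss` and the nudged coefficients are illustrative assumptions.

```python
def rss(X, y, beta):
    # Sum over rows of (y_i - x_i^T beta)^2.
    total = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(xj * bj for xj, bj in zip(x_i, beta))
        total += (y_i - prediction) ** 2
    return total

# Rows are [1, x] so that beta[0] is the intercept.
X = [[1, 1], [1, 2], [1, 4], [1, 3], [1, 5]]
y = [1, 3, 3, 2, 5]

fitted = rss(X, y, [0.4, 0.8])     # coefficients found in the first section
shifted = rss(X, y, [0.5, 0.8])    # same slope, intercept nudged by 0.1
print(fitted, shifted)
```

The nudged coefficients give a larger value than the fitted ones, which is what minimizing the RSS means in practice.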

Differentiating $RSS(\beta)$ with respect to $\beta$ and setting the result to zero gives the normal equations:
$$
\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0
$$
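
The intermediate steps, sketched with standard matrix calculus: expand the quadratic form, differentiate term by term, and set the gradient to zero.
$$
RSS(\beta) = \mathbf{y}^T\mathbf{y} - 2\beta^T\mathbf{X}^T\mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta
$$
$$
\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\beta = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta)
$$
Setting this gradient to zero and dividing by $-2$ leaves the equation above.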

If $\mathbf{X}^T \mathbf{X}$ is nonsingular, the unique solution is
$$
\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
$$
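
In the from-scratch spirit of the earlier cells, the closed form can be sketched in plain Python. Rather than forming the inverse explicitly, this version solves $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}$ by Gaussian elimination, which gives the same $\hat{\beta}$ when $\mathbf{X}^T\mathbf{X}$ is nonsingular. All helper names (`transpose`, `matmul`, `solve`, `least_squares`) are assumptions made for illustration; a numerical library would normally be used instead.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain triple-loop matrix product, written with zip.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    # Solve a z = b by Gaussian elimination with partial pivoting.
    n = len(a)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (or nearly so)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def least_squares(X, y):
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Rows are [1, x]: the leading 1 carries the intercept.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 3.0], [1.0, 5.0]]
y = [1.0, 3.0, 3.0, 2.0, 5.0]
beta = least_squares(X, y)
print(beta)
```

On the five points from the first section this recovers, up to rounding, the same intercept and slope as the simple linear regression formulas (0.4 and 0.8), as expected when there is one feature plus a constant column.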