# REGRESSION FROM SCRATCH

## APPLY WITH CONTRIVED DATASET

In [None]:
# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

In [None]:
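# Calculate the sum of squared differences of each value from the mean.
# Note: the sum is not divided by the number of values. That normalisation
# cancels in the ratio used to estimate B1, so the unnormalised sum suffices here.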
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

In [None]:
# calculate mean and variance
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]

x = [row[0] for row in dataset]
y = [row[1] for row in dataset]

mean_x, mean_y = mean(x), mean(y)
var_x, var_y = variance(x, mean_x), variance(y, mean_y)

print('x stats: mean=%.3f variance=%.3f' % (mean_x, var_x))
print('y stats: mean=%.3f variance=%.3f' % (mean_y, var_y))

## PART 1: COVARIANCE

The covariance of two groups of numbers describes how those numbers change together. Covariance is closely related to correlation: dividing the covariance by the product of the two standard deviations normalizes it into a correlation value. Covariances among more than two groups of numbers can be collected in a covariance matrix.

In [None]:
# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

In [None]:
mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))
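
To make "change together" concrete, here is a minimal sketch that applies the same sum-of-products definition to two illustrative series. The lists `y_up` and `y_down` are made-up data, not part of the original dataset, and `mean()` and `covariance()` are repeated only so the cell runs on its own.

```python
# Same definitions as above, repeated so this cell is self-contained
def mean(values):
    return sum(values) / float(len(values))

def covariance(x, mean_x, y, mean_y):
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # illustrative data: rises as x rises
y_down = [10, 8, 6, 4, 2]   # illustrative data: falls as x rises

covar_up = covariance(x, mean(x), y_up, mean(y_up))
covar_down = covariance(x, mean(x), y_down, mean(y_down))

print('Rising y:  %.3f' % covar_up)
print('Falling y: %.3f' % covar_down)
```

The sign indicates the direction of the relationship: positive when the two series move in the same direction, negative when they move in opposite directions.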

## PART 2: ESTIMATE COEFFICIENTS

The slope B1 is estimated as the covariance of x and y divided by the variance of x:

~~~
B1 = covariance(x, y) / variance(x)
~~~

We already have functions to calculate covariance() and variance(). Next, we need to estimate a value for B0, also called the intercept as it controls the starting point of the line where it intersects the y-axis.
    
~~~
B0 = mean(y) - ( B1 * mean(x) )
~~~
    
Again, we know how to estimate B1 and we have a function to estimate mean(). We can put all of this together into a function named coefficients() that takes the dataset as an argument and returns the coefficients.

In [None]:
# Calculate coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

In [None]:
b0, b1 = coefficients(dataset)
(b0, b1)
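
As a sanity check, the two formulas can be worked through directly for the contrived dataset. The sketch below recomputes the intermediate quantities inline (it does not rely on the functions defined earlier) and also confirms a consequence of the B0 formula: the fitted line passes through the point (mean(x), mean(y)).

```python
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]

mean_x = sum(x) / float(len(x))
mean_y = sum(y) / float(len(y))

# Unnormalised sums, matching covariance() and variance() above
covar = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)

b1 = covar / var_x
b0 = mean_y - b1 * mean_x

print('B1 = %.3f / %.3f = %.3f' % (covar, var_x, b1))
print('B0 = %.3f - %.3f * %.3f = %.3f' % (mean_y, b1, mean_x, b0))

# The fitted line evaluated at mean(x) equals mean(y)
line_at_mean_x = b0 + b1 * mean_x
print('Line at mean(x): %.3f, mean(y): %.3f' % (line_at_mean_x, mean_y))
```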

## PART 3: MAKE PREDICTIONS

The simple linear regression model is a line defined by coefficients estimated from training data. Once the coefficients are estimated, we can use them to make predictions. The equation to make predictions with a simple linear regression model is as follows:

~~~
y = b0 + b1 * x 
~~~

Below is a function named simple_linear_regression() that implements the prediction equation to make predictions on a test dataset. It also ties together the estimation of the coefficients on training data from the steps above. The coefficients prepared from the training data are used to make predictions on the test data, which are then returned.

In [None]:
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions

Let's pull together everything we have learned and make predictions for our simple contrived dataset. As part of this example, we will also add a function called evaluate_algorithm() to manage the evaluation of the predictions and another function called rmse_metric() to estimate the Root Mean Squared Error of the predictions.

In [None]:
# Example of Standalone Simple Linear Regression
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

# Evaluate regression algorithm on training dataset
def evaluate_algorithm(dataset, algorithm):
    test_set = list()
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    # Fit and predict once, after the full test set has been built
    predicted = algorithm(dataset, test_set)
    print(predicted)
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse

In [None]:
# Test simple linear regression
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
rmse = evaluate_algorithm(dataset, simple_linear_regression)
print('RMSE: %.3f' % (rmse))
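
To see where the RMSE comes from, the sketch below breaks the calculation into per-row squared errors for the same contrived dataset. It recomputes the coefficients inline so the cell runs on its own; the variable names are illustrative.

```python
from math import sqrt

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]

# Coefficients estimated as in PART 2, recomputed here for a standalone cell
mean_x = sum(x) / float(len(x))
mean_y = sum(y) / float(len(y))
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# Predictions and squared errors, row by row
predicted = [b0 + b1 * xi for xi in x]
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, y)]

for xi, a, p, se in zip(x, y, predicted, squared_errors):
    print('x=%d actual=%d predicted=%.3f squared error=%.3f' % (xi, a, p, se))

# RMSE is the square root of the mean squared error
rmse = sqrt(sum(squared_errors) / float(len(y)))
print('RMSE: %.3f' % rmse)
```

Because the model is evaluated on the same rows it was fitted to, this figure describes the fit to the training data rather than performance on unseen data.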

## PART 4: COVARIANCE vs CORRELATION

Covariance is a measure of how changes in one variable are associated with changes in a second variable. Specifically, covariance measures the degree to which two variables are linearly associated. 

Because its magnitude depends on the units and scale of the variables, covariance on its own does not tell you how strong the relationship between the two variables is, while correlation does.

### A comparison of correlation and covariance

Although both the correlation coefficient and the covariance are measures of linear association, they differ in the following ways:
- Correlation coefficients are standardized. Thus, a perfect linear relationship results in a coefficient of 1 (or -1 for a perfectly decreasing relationship).
- Covariance values are not standardized. Thus, the value for a perfect linear relationship depends on the data.

The correlation coefficient is a function of the covariance. The correlation coefficient is equal to the covariance divided by the product of the standard deviations of the variables. Therefore, a positive covariance always results in a positive correlation and a negative covariance always results in a negative correlation.

For example:

In [None]:
x = [1, 2, 3]
y = [4, 6, 10]

mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))

In [None]:
#Now let's change the scale, and multiply both x and y by 10

x = [10, 20, 30]
y = [40, 60, 100]

mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))

Standard deviation measures the typical distance of the values from their mean. We square the difference between each value and the mean, average the squared differences (here dividing by n - 1, the sample estimate), then take the square root to return the units to their original scale. Below is a small function named stdev() that calculates the standard deviation of a list of numbers. You will notice that it calculates the mean itself. It might be more efficient to calculate the mean of a list of numbers once and pass it to the stdev() function as a parameter.

The correlation is the covariance divided by the product of the two standard deviations:

~~~
cor(X,Y) = cov(X,Y) / (sd(X) * sd(Y))
~~~

In [None]:
from math import sqrt

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)
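
The efficiency remark above can be sketched as follows. The function name `stdev_with_mean` and its `avg` parameter are hypothetical and not part of the original notebook; the calculation is otherwise the same as stdev().

```python
from math import sqrt

def mean(values):
    return sum(values) / float(len(values))

# Hypothetical variant of stdev() that takes a precomputed mean,
# so the mean is calculated only once by the caller
def stdev_with_mean(numbers, avg):
    sample_variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(sample_variance)

x = [1, 2, 3]
mean_x = mean(x)            # computed once, reused below
sd_x = stdev_with_mean(x, mean_x)
print('Standard deviation: %.3f' % sd_x)
```

The saving is negligible for short lists but avoids a redundant pass over the data when the mean is already needed elsewhere, as it is in the covariance calculation.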

In [None]:
x = [1, 2, 3]
y = [4, 6, 10]

mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
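# covariance() and variance() are unnormalised sums, so take sqrt(variance(...))
# rather than stdev() (which divides by n - 1) to keep the ratio consistent
correlation = covar / (sqrt(variance(x, mean_x)) * sqrt(variance(y, mean_y)))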
print('Covariance: %.3f' % (covar))
print('Correlation: %.3f' % (correlation))

In [None]:
x = [10, 20, 30]
y = [40, 60, 100]

mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))
correlation = covar / (sqrt(variance(x, mean(x))) * sqrt(variance(y, mean(y)) ))
print('Correlation: %.3f' % (correlation))

Changing the scale should not increase the strength of the relationship, so we adjust by dividing the covariance by the product of the standard deviations of x and y, which is exactly the definition of the correlation coefficient.

In both cases above, the correlation coefficient between x and y is 0.982.
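
The cells above divide an unnormalised covariance by the square roots of unnormalised sums of squares. A minimal sketch, using a hypothetical helper named `sample_correlation`, suggests that the same value results when every term is divided by n - 1 (as stdev() does), because the normalising factors cancel in the ratio. It also repeats the scale check with both lists multiplied by 10.

```python
from math import sqrt

def mean(values):
    return sum(values) / float(len(values))

# Hypothetical helper: correlation built from the sample covariance
# and sample standard deviations (all divided by n - 1)
def sample_correlation(x, y):
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / float(n - 1)
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / float(n - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / float(n - 1))
    return cov / (sd_x * sd_y)

x = [1, 2, 3]
y = [4, 6, 10]

r_small = sample_correlation(x, y)
r_large = sample_correlation([10 * v for v in x], [10 * v for v in y])

print('Correlation (original scale): %.3f' % r_small)
print('Correlation (scaled by 10):   %.3f' % r_large)
```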

In [None]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt 

# read data into a DataFrame
ads = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
ads.head()

In [None]:
ads.corr()

In [None]:
def compute_relationship(x, y):
    mean_x, mean_y = mean(x), mean(y)
    covar = covariance(x, mean_x, y, mean_y)
    correlation = covar / (sqrt(variance(x, mean(x))) * sqrt(variance(y, mean(y)) ))
    return covar, correlation

In [None]:
_, correlation = compute_relationship(ads.TV.values, ads.Radio.values)
print(round(correlation, 6))

_, correlation = compute_relationship(ads.TV.values, ads.Newspaper.values)
print(round(correlation, 6))

_, correlation = compute_relationship(ads.TV.values, ads.Sales.values)
print(round(correlation, 6))


# Calculate coefficients
def coefficients2(x, y):
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

b0, b1 = coefficients2(ads.TV.values, ads.Sales.values)
round(b0, 11), round(b1, 8)

## PART 5: APPLY SCIKIT-LEARN

In [None]:
from sklearn.linear_model import LinearRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics

# create X and y
feature_cols = ['TV']
X = ads[feature_cols]
y = ads.Sales

# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)

# print the coefficients
print (linreg.intercept_)
print (linreg.coef_)