# Linear Regression  
This notebook walks through an example implementation of linear regression in Python.
It is based on Stanford University's Machine Learning course by Andrew Ng.

### Hypothesis Function
The hypothesis function formulates the relationship between the dependent variable and the independent variables. For $n$ independent variables, hypothesis functions have the general form:
$$h_{\theta}(x)  =  \theta_{0} + \theta_{1}x_{1} + \dots + \theta_{n}x_{n}$$  

You can think of the hypothesis function as a mapping from an input $x$ to a predicted output $y$. Consider the following dataset. Let's say we're interested in predicting housing prices. We have a table of houses (each row is a home) with the size of each house in square feet, the number of bedrooms, and the house price. We can use linear regression to _weigh_ and _isolate_ the effect each independent variable has on the dependent variable (house price). In this case, our hypothesis function looks like:
$$h_{\theta}(x)  =  \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2}$$  
$$h_{\theta}(x)  =  \theta_{0} + \theta_{1}\,\text{sqfeet} + \theta_{2}\,\text{rooms}$$  
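
To make the formula concrete, here is a minimal sketch that evaluates the hypothesis for one house, first term by term and then as a dot product. The parameter values are hypothetical, chosen only for illustration, and are not fitted to the dataset. The leading `1.0` in `x` is an assumption of this sketch so that $\theta_{0}$ has something to multiply; the order of entries only has to match between `x` and `theta`.

```python
import numpy as np

# Hypothetical parameter values for illustration only (not fitted).
theta_0, theta_1, theta_2 = 50000.0, 100.0, 10000.0

def hypothesis(sqfeet, rooms):
    """Evaluate h(x) = theta_0 + theta_1 * sqfeet + theta_2 * rooms."""
    return theta_0 + theta_1 * sqfeet + theta_2 * rooms

# Same computation as a dot product: the constant 1.0 pairs with theta_0.
x = np.array([1.0, 2104.0, 3.0])
theta = np.array([theta_0, theta_1, theta_2])
prediction = x.dot(theta)

print(hypothesis(2104.0, 3.0))
print(prediction)
```

Writing the hypothesis as a dot product is what lets the later cells compute predictions for every house at once with a single matrix product.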

Here's what our data actually looks like:

In [182]:
import pandas as pd
import numpy as np

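ex1data2 = 'ex1/ex1data2.txt'
cols = ['sqfeet', 'rooms', 'price']
y_col = ['price']

# column names are supplied explicitly via `names`
df = pd.read_csv(
    filepath_or_buffer=ex1data2,
    delimiter=',',
    names=cols
)
# constant column of ones for the intercept term; it is appended last,
# so the intercept is the last entry of theta in the cells that follow
df['intercept'] = 1
cols.append('intercept')

print(df.shape)
print(df.dtypes)
print(df.describe())
print(df.head(5))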


(47, 4)
sqfeet       int64
rooms        int64
price        int64
intercept    int64
dtype: object
            sqfeet      rooms          price  intercept
count    47.000000  47.000000      47.000000         47
mean   2000.680851   3.170213  340412.659574          1
std     794.702354   0.760982  125039.899586          0
min     852.000000   1.000000  169900.000000          1
25%    1432.000000   3.000000  249900.000000          1
50%    1888.000000   3.000000  299900.000000          1
75%    2269.000000   4.000000  384450.000000          1
max    4478.000000   5.000000  699900.000000          1
   sqfeet  rooms   price  intercept
0    2104      3  399900          1
1    1600      3  329900          1
2    2400      3  369000          1
3    1416      2  232000          1
4    3000      4  539900          1


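Next, let's create the matrices needed for the linear algebra: `y` holds the prices, `X` holds the features plus the intercept column, and `theta` starts at zero.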

In [183]:
# DataFrame.as_matrix was removed in pandas 1.0; to_numpy() replaces it
y = df[y_col].to_numpy()
print('y shape: {}'.format(y.shape))
print('y preview: \n {}'.format(y[:5]))
X = df[[c for c in cols if c not in y_col]].to_numpy()
print('X shape: {}'.format(X.shape))
print('X preview: \n {}'.format(X[:5]))
# one parameter per column of X, initialized to zero
theta = np.zeros((X.shape[1], 1))
print('theta shape: {}'.format(theta.shape))
print('theta preview: \n {}'.format(theta[:5]))

y shape: (47, 1)
y preview: 
 [[399900]
 [329900]
 [369000]
 [232000]
 [539900]]
X shape: (47, 3)
X preview: 
 [[2104    3    1]
 [1600    3    1]
 [2400    3    1]
 [1416    2    1]
 [3000    4    1]]
theta shape: (3, 1)
theta preview: 
 [[ 0.]
 [ 0.]
 [ 0.]]


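Square footage and room count are on very different scales, so we standardize each feature (subtract its mean, divide by its standard deviation) to help gradient descent converge.

In [184]:
def feature_normalize(X):
    """Standardize every column of X except the intercept column of ones."""
    Z = np.zeros(X.shape)
    for i in range(X.shape[1]):
        # bias term handling: leave the intercept column as all ones
        if np.array_equal(X[:,i], np.ones(X.shape[0])):
            Z[:,i] = np.ones(X.shape[0])
        else:
            # z_i = (x_i - mean(x_i)) / std(x_i)
            # np.std defaults to the population standard deviation (ddof=0)
            Z[:,i] = (X[:,i] - np.mean(X[:,i]))/np.std(X[:,i])
    return Z
Z = feature_normalize(X)
print(Z)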

[[  1.31415422e-01  -2.26093368e-01   1.00000000e+00]
 [ -5.09640698e-01  -2.26093368e-01   1.00000000e+00]
 [  5.07908699e-01  -2.26093368e-01   1.00000000e+00]
 [ -7.43677059e-01  -1.55439190e+00   1.00000000e+00]
 [  1.27107075e+00   1.10220517e+00   1.00000000e+00]
 [ -1.99450507e-02   1.10220517e+00   1.00000000e+00]
 [ -5.93588523e-01  -2.26093368e-01   1.00000000e+00]
 [ -7.29685755e-01  -2.26093368e-01   1.00000000e+00]
 [ -7.89466782e-01  -2.26093368e-01   1.00000000e+00]
 [ -6.44465993e-01  -2.26093368e-01   1.00000000e+00]
 [ -7.71822042e-02   1.10220517e+00   1.00000000e+00]
 [ -8.65999486e-04  -2.26093368e-01   1.00000000e+00]
 [ -1.40779041e-01  -2.26093368e-01   1.00000000e+00]
 [  3.15099326e+00   2.43050370e+00   1.00000000e+00]
 [ -9.31923697e-01  -2.26093368e-01   1.00000000e+00]
 [  3.80715024e-01   1.10220517e+00   1.00000000e+00]
 [ -8.65782986e-01  -1.55439190e+00   1.00000000e+00]
 [ -9.72625673e-01  -2.26093368e-01   1.00000000e+00]
 [  7.73743478e-01   1.10220
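
A quick sanity check on standardization is that every normalized feature column ends up with mean 0 and standard deviation 1, while the intercept column stays at 1. The sketch below checks this on just the first four rows of the data shown earlier, so its values will not match the full 47-row `Z` above; the names `X_demo`, `mu`, `sigma`, and `Z_demo` are introduced here for illustration.

```python
import numpy as np

# First four rows of the dataset shown earlier: sqfeet, rooms, intercept.
X_demo = np.array([
    [2104.0, 3.0, 1.0],
    [1600.0, 3.0, 1.0],
    [2400.0, 3.0, 1.0],
    [1416.0, 2.0, 1.0],
])

# Per-column mean and population standard deviation of the two features.
mu = X_demo[:, :2].mean(axis=0)
sigma = X_demo[:, :2].std(axis=0)

# Standardize the feature columns; the intercept column is left untouched.
Z_demo = X_demo.copy()
Z_demo[:, :2] = (X_demo[:, :2] - mu) / sigma

print(Z_demo[:, :2].mean(axis=0))  # approximately 0 for each feature
print(Z_demo[:, :2].std(axis=0))   # approximately 1 for each feature
```

Keeping `mu` and `sigma` around is useful in practice, since any new house would need to be scaled with the same values before a prediction is made from parameters fitted on normalized data.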

From here, we can solve for theta directly with the normal equation, $\theta = (X^{T}X)^{-1}X^{T}y$:

In [197]:
# normal equation: theta = (X'X)^-1 X'y
def normal_equation_linear_regression(y, X):
    return np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
# theta for the normalized features Z, then for the raw features X
print(normal_equation_linear_regression(y, Z))
print(normal_equation_linear_regression(y, X))

[[ 109447.79646964]
 [  -6578.35485416]
 [ 340412.65957447]]
[[   139.21067402]
 [ -8738.01911233]
 [ 89597.9095428 ]]
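
The explicit inverse works for this small dataset, but inverting $X^{T}X$ can be numerically fragile when features are nearly collinear. As a hedged alternative, the sketch below compares the inverse-based formula with `np.linalg.solve` and `np.linalg.lstsq` on synthetic, noiseless data. The data, the names `X_demo`, `y_demo`, and `true_theta`, and the chosen parameter values are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

# Hypothetical synthetic data: 20 examples, 2 features, intercept column last.
rng = np.random.RandomState(0)
features = rng.uniform(0.0, 10.0, size=(20, 2))
X_demo = np.hstack([features, np.ones((20, 1))])
true_theta = np.array([[2.0], [-3.0], [5.0]])
y_demo = X_demo.dot(true_theta)  # noiseless targets, so theta is recoverable

# 1) Normal equation with an explicit inverse, as in the cell above.
theta_inv = np.linalg.inv(X_demo.T.dot(X_demo)).dot(X_demo.T).dot(y_demo)

# 2) Solve the linear system (X'X) theta = X'y without forming the inverse.
theta_solve = np.linalg.solve(X_demo.T.dot(X_demo), X_demo.T.dot(y_demo))

# 3) Least-squares solver applied to X directly.
theta_lstsq = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print(theta_inv.ravel())
print(theta_solve.ravel())
print(theta_lstsq.ravel())
```

All three approaches should agree on well-conditioned data like this; the difference only tends to matter when $X^{T}X$ is close to singular.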


In [186]:
# gradient descent
alpha = 1.0  # placeholder learning rate; reassigned before the descent loop runs
def gradient_descent_linear_regression(y, X, alpha, theta):
    # one batch update: theta = theta - (alpha / m) * X' * (X * theta - y)
    m = X.shape[0]
    return theta - (alpha / m) * np.dot(X.T, np.dot(X, theta) - y)

def cost_linear_regression(y, X, theta):
    # cost = (1 / (2 * m)) * (X * theta - y)' * (X * theta - y)
    # returned as a 1x1 array because the residual is a column vector
    m = X.shape[0]
    residual = np.dot(X, theta) - y
    return (1.0/(2.0*m)) * np.dot(residual.T, residual)
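
In matrix form, the two functions above compute the cost $J(\theta) = \frac{1}{2m}(X\theta - y)^{T}(X\theta - y)$ and the update $\theta := \theta - \frac{\alpha}{m}X^{T}(X\theta - y)$. The following self-contained sketch repeats that update on synthetic, noiseless data to show the cost shrinking and theta approaching the values that generated the data. The data, the learning rate of 0.1, the 500 iterations, and the helper names `gradient_step` and `cost` are assumptions for illustration rather than the notebook's own settings.

```python
import numpy as np

def gradient_step(y, X, alpha, theta):
    """One batch gradient descent update for linear regression."""
    m = X.shape[0]
    return theta - (alpha / m) * X.T.dot(X.dot(theta) - y)

def cost(y, X, theta):
    """Half the mean squared error, returned as a plain float."""
    m = X.shape[0]
    residual = X.dot(theta) - y
    return float(residual.T.dot(residual)[0, 0]) / (2.0 * m)

# Hypothetical data: 50 examples, 2 roughly standardized features, intercept last.
rng = np.random.RandomState(0)
features = rng.normal(size=(50, 2))
X_demo = np.hstack([features, np.ones((50, 1))])
true_theta = np.array([[3.0], [-2.0], [1.0]])
y_demo = X_demo.dot(true_theta)

theta = np.zeros((3, 1))
costs = [cost(y_demo, X_demo, theta)]  # cost at the starting theta
for _ in range(500):
    theta = gradient_step(y_demo, X_demo, 0.1, theta)
    costs.append(cost(y_demo, X_demo, theta))

print(costs[0], costs[-1])
print(theta.ravel())
```

Recording the cost at every step makes it easy to spot a learning rate that is too large (the cost grows) or too small (the cost barely moves).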


In [193]:
# DataFrame.as_matrix was removed in pandas 1.0; to_numpy() replaces it
y = df[y_col].to_numpy()
print('y shape: {}'.format(y.shape))
print('y preview: \n {}'.format(y[:5]))
X = df[[c for c in cols if c not in y_col]].to_numpy()
print('X shape: {}'.format(X.shape))
print('X preview: \n {}'.format(X[:5]))
theta = np.zeros((X.shape[1], 1))
print('theta shape: {}'.format(theta.shape))
print('theta preview: \n {}'.format(theta[:5]))

thetas = []
costs = []
alpha = .079371
# cost at the starting theta, so that costs[i] corresponds to thetas[i]
cost = cost_linear_regression(y, Z, theta)
for _ in range(50):
    thetas.append(theta)
    costs.append(cost)
    # descend on the normalized features Z
    theta = gradient_descent_linear_regression(y, Z, alpha, theta)
    cost = cost_linear_regression(y, Z, theta)
    print(cost)


y shape: (47, 1)
y preview: 
 [[399900]
 [329900]
 [369000]
 [232000]
 [539900]]
X shape: (47, 3)
X preview: 
 [[2104    3    1]
 [1600    3    1]
 [2400    3    1]
 [1416    2    1]
 [3000    4    1]]
theta shape: (3, 1)
theta preview: 
 [[ 0.]
 [ 0.]
 [ 0.]]
[[  5.56986495e+10]]
[[  4.73818892e+10]]
[[  4.03837965e+10]]
[[  3.44902416e+10]]
[[  2.95228714e+10]]
[[  2.53329216e+10]]
[[  2.17961428e+10]]
[[  1.88086313e+10]]
[[  1.62833963e+10]]
[[  1.41475276e+10]]
[[  1.23398547e+10]]
[[  1.08090075e+10]]
[[  9.51180881e+09]]
[[  8.41193716e+09]]
[[  7.47881535e+09]]
[[  6.68668347e+09]]
[[  6.01382550e+09]]
[[  5.44192307e+09]]
[[  4.95551483e+09]]
[[  4.54154367e+09]]
[[  4.18897711e+09]]
[[  3.88848881e+09]]
[[  3.63219093e+09]]
[[  3.41340931e+09]]
[[  3.22649422e+09]]
[[  3.06666112e+09]]
[[  2.92985645e+09]]
[[  2.81264456e+09]]
[[  2.71211234e+09]]
[[  2.62578869e+09]]
[[  2.55157665e+09]]
[[  2.48769592e+09]]
[[  2.43263445e+09]]
[[  2.38510742e+09]]
[[  2.34402259e+09]]
[[  