# Playing with Linear Regression

This uses the same example and the same data as Pierre's notebook, but tries to solve it with basic machine learning.
https://github.com/axiomiety/crashburn/blob/master/jupyter/simple_linear_regression.ipynb

### Import the same datasets

In [8]:
import requests, pandas, io

url='http://www.stat.ufl.edu/~winner/data/brainhead.dat'
data=requests.get(url)
col_names=('gender', 'age_range', 'head_size', 'brain_weight')
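# colspecs are 0-based, half-open [start, end) intervals; this assumes the
# 1-based file columns 8, 16, 21-24 and 29-32 that the original tuples suggest
col_widths=[(7,8),(15,16),(20,24),(28,32)]
df=pandas.read_fwf(io.StringIO(data.text), names=col_names, colspecs=col_widths)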
df.head()


Unnamed: 0,gender,age_range,head_size,brain_weight
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


Before starting, I want to run some sizing tests with a few orders of magnitude more data,
so we build a list of DataFrames, `dfs`.

In [4]:
def churn(d, n):
    # double the DataFrame n times, so the result has len(d) * 2**n rows
    for _ in range(n):
        d = pandas.concat([d, d])  # DataFrame.append was removed in pandas 2.0
    return d

dfs = [df, 
       churn(df,4),
       churn(df,8),
       churn(df,12),
       churn(df,16)]
       
for d in dfs:
    print (d.shape)


(237, 4)
(3792, 4)
(60672, 4)
(970752, 4)
(15532032, 4)


OK, I think these 5 samples are enough.

## Analytical Method 


In [62]:
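from scipy.stats import linregress

for d in dfs:
    # the line magic (%timeit) can be used inside a loop; the cell magic
    # (%%timeit) is meant to be the first line of a cell
    %timeit linregress(d.head_size,d.brain_weight)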

for d in dfs:
    print(linregress(d.head_size,d.brain_weight))


433 µs ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
511 µs ± 72.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.41 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
17.8 ms ± 981 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
446 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
LinregressResult(slope=0.26342933948939939, intercept=325.57342104944235, rvalue=0.79956970925429616, pvalue=5.9576308394065412e-54, stderr=0.012907433440886988)
LinregressResult(slope=0.26342933948939917, intercept=325.57342104944314, rvalue=0.79956970925429593, pvalue=0.0, stderr=0.0032140617796317652)
LinregressResult(slope=0.26342933948939917, intercept=325.57342104944314, rvalue=0.79956970925429582, pvalue=0.0, stderr=0.00080331675985773965)
LinregressResult(slope=0.26342933948939984, intercept=325.57342104944075, rvalue=0.79956970925430071, pvalue=0.0, stderr=0.00020082608673381252)
LinregressResult(slope=0.26342933948942276, inte

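Eyeballing the growth: data sizes {10k, 100k, 1m} -> {6 ms, 60 ms, 1100 ms}. The first step is linear and the second is steeper (about 18x the time for 10x the data), so this is worse than O(n) but well short of O(n^2). I can't run any larger datasets in this notebook: it runs out of memory!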

This gives some support to the common advice that an analytic solver suits samples of up to roughly 10k rows (and we are only using a simple linear model).


### Performance Sidebar
For fun, I wanted to compare the performance of Pierre's solution with the scipy library.

In [None]:
import timeit
def pierre_solver (df):
    brain_weight_sample_mean = df['brain_weight'].mean()
    head_size_sample_mean = df['head_size'].mean()

    b_1 = sum(df.apply(lambda r: (r['head_size']-head_size_sample_mean)*(r['brain_weight']-brain_weight_sample_mean), axis=1))
    b_1 /= sum(df.apply(lambda r: (r['head_size']-head_size_sample_mean)**2, axis=1))

    b_0 = brain_weight_sample_mean-b_1*head_size_sample_mean
    return [b_0, b_1]

#    df['error'] = df.apply(lambda r: r['brain_weight'] - b_0 - b_1*r['head_size'], axis=1)
#    rss = sum(df['error']**2)
#    print(rss)

# prove the solutions are about same:
print (pierre_solver(dfs[0]))
print (pierre_solver(dfs[1]))

for d in dfs:
    # line magic, so it can run once per DataFrame inside the loop
    %timeit r = pierre_solver(d)


[325.57342104944223, 0.26342933948939945]
[325.57342104944132, 0.26342933948939967]
52.7 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
763 ms ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.3 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3min 24s ± 7.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


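A lot slower, for obvious reasons (demo code, unoptimized, etc.): `df.apply(..., axis=1)` runs a Python lambda once per row. It also shows that, with this implementation, solving analytically becomes infeasible once n grows past the thousands: the run on the largest datasets (100k+ rows) crashes or never finishes.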
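
To separate the cost of the formula from the cost of the row-by-row loop, here is a minimal sketch of the same closed-form estimate written with whole-column arithmetic. The function name `vectorized_solver` and the synthetic data are assumptions made for illustration; neither comes from Pierre's notebook, and no timings are claimed for it.

```python
import numpy as np
import pandas as pd

def vectorized_solver(df):
    """Closed-form simple linear regression using column-wise arithmetic."""
    x = df['head_size'].to_numpy(dtype=float)
    y = df['brain_weight'].to_numpy(dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # slope = sum of centered cross-products / sum of squared centered x
    b_1 = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
    # intercept follows from the sample means
    b_0 = y.mean() - b_1 * x.mean()
    return [b_0, b_1]

# Synthetic stand-in data (an assumption, not the brainhead dataset)
rng = np.random.default_rng(0)
head_size = rng.uniform(2700, 4800, size=1000)
brain_weight = 325.0 + 0.26 * head_size + rng.normal(0, 70, size=1000)
demo = pd.DataFrame({'head_size': head_size, 'brain_weight': brain_weight})

b_0, b_1 = vectorized_solver(demo)
print(b_0, b_1)
```

It returns `[b_0, b_1]` in the same order as `pierre_solver`, so the two can be compared directly on `dfs[0]`.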

## Training Method

Things needed to fit a linear regression line of the form f(x) = Ax + B (the hypothesis):  
- Find the optimal A, B values for a cost function J(A,B)  
- Need the partial derivatives of J(A,B) with respect to A and B: dA and dB  
- Iterate A = A - u * dA and B = B - u * dB  
- Can pretty much start at a random point or at 0,0 and do gradient descent  
- Step until the change is close to 0: define the step size (u) and when to stop  

First, a basic gradient descent example for a simple function, f(x) = x^2 - 2x + 1.  
Its derivative is 2x - 2. Start at x=0 and step downhill towards the minimum, which is at x=1, where y=0.

### Gradient Descent Dummy Sample (1)


In [1]:
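def f(x):
    return x**2 - 2*x + 1

def p(f, x, dx=1.0):
    # central-difference estimate of f'(x), exact for a quadratic like f;
    # stands in for scipy.misc.derivative, which recent SciPy releases no longer ship
    return (f(x + dx) - f(x - dx)) / (2*dx)

def grad_descent(f):
    x = 0
    step = 0.1
    error = 0.5
    err_lmt = 0.01

    # loop while error > err_lmt
    while error > err_lmt:
        x = x - step*p(f,x)   # move against the slope
        print('x=',x, 'y=',f(x))
        error = abs(f(x)-0)   # distance from the known minimum value, y=0
    print('error: ',error)
    print('x: ',x)

grad_descent(f)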


x= 0.2 y= 0.64
x= 0.36 y= 0.4096
x= 0.488 y= 0.262144
x= 0.5904 y= 0.16777216
x= 0.67232 y= 0.1073741824
x= 0.737856 y= 0.068719476736
x= 0.7902848 y= 0.043980465111
x= 0.83222784 y= 0.0281474976711
x= 0.865782272 y= 0.0180143985095
x= 0.8926258176 y= 0.0115292150461
x= 0.91410065408 y= 0.00737869762948
error:  0.00737869762948
x:  0.91410065408
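
The printed values follow a simple pattern, which a short derivation makes explicit. Substituting the derivative 2x - 2 and the step of 0.1 into the update rule, with k counting iterations from the start point x_0 = 0:

```latex
\begin{aligned}
x_{k+1} &= x_k - 0.1\,(2x_k - 2) = 0.8\,x_k + 0.2 \\
1 - x_{k+1} &= 0.8\,(1 - x_k) \quad\Rightarrow\quad 1 - x_k = 0.8^{k} \\
f(x_k) &= (x_k - 1)^2 = 0.64^{k}
\end{aligned}
```

So each step shrinks y by a factor of 0.64, and the loop stops at k = 11, the first iteration where 0.64^k falls below 0.01.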


### Putting it together

Training set: x[i] ~ df['head_size']  
Solution set: y[i] ~ df['brain_weight']  
Hypothesis: h(x) = Ax + B  
Cost: P(A,B) = 1/n * sum( (h(x[i]) - y[i])^2 ) over the n rows  

Consider P(A,B) as a contour/3D surface of the cost of every combination of A,B (0x+0, 1x+1, 0x+1, etc...)  
The optimum is where P(A,B) is at its minimum  

Start with the guess A=1, B=1, i.e. h(x) = 1x+1  

To find the optimal P(A,B):  
 * A = A - step * dP/dA  <- current guess minus the step times the partial derivative with respect to A (i.e., movement of the A parameter)  
 * B = B - step * dP/dB  <- current guess minus the step times the partial derivative with respect to B (i.e., movement of the B parameter)  

Repeat until the changes fall below step_limit  
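
For this mean-squared-error cost the two partial derivatives can be written out directly. This is the standard calculus result rather than something taken from the notebook, with i indexing the n training rows:

```latex
\begin{aligned}
P(A,B) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(A x_i + B - y_i\bigr)^2 \\
\frac{\partial P}{\partial A} &= \frac{2}{n}\sum_{i=1}^{n}\bigl(A x_i + B - y_i\bigr)\,x_i \\
\frac{\partial P}{\partial B} &= \frac{2}{n}\sum_{i=1}^{n}\bigl(A x_i + B - y_i\bigr)
\end{aligned}
```

Note that the derivative with respect to A is weighted by x_i, so with head sizes in the thousands it is far larger than the derivative with respect to B.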


In [48]:
def costP(df,A,B):
    # mean squared error of h(x) = A*x + B over the data set
    total = 0
    for _,row in df.iterrows():
        total += (row['head_size']*A + B - row['brain_weight'])**2
    return total / len(df)
    
def partialDeriv(p, v):  # partial deriv of function p w/ respect to v    
    return 0.088 # tbd: placeholder constant, not a real derivative

def grad_descent2():
    guessA = guessB = 1   # h(x) = Ax+B = x+1
    step = 0.05
    step_limit = 0.01 # stop once both updates are smaller than this
    changeA = changeB = 1
    
    # keep stepping while either parameter still moves by more than step_limit
    while (abs(changeA) > step_limit) or (abs(changeB) > step_limit):
        changeA = step * partialDeriv(costP,guessA)
        changeB = step * partialDeriv(costP,guessB)
        guessA = guessA - changeA
        guessB = guessB - changeB
        print (guessA, guessB)
    return guessA,guessB

costP(df, 1,1)


5609738.7088607587
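
One way the `partialDeriv` placeholder could be filled in is sketched below, assuming the analytic partial derivatives of the mean squared error. The names `gradients` and `fit_line`, the `max_iter` cap, the tighter `step_limit` and the synthetic data are all assumptions for illustration. Standardizing x is a swapped-in addition rather than part of the notebook's plan: with raw head sizes in the thousands, a step of 0.05 makes the updates diverge.

```python
import numpy as np

def gradients(x, y, A, B):
    """Partial derivatives of the mean squared error for h(x) = A*x + B."""
    residual = A * x + B - y
    dA = 2.0 * np.mean(residual * x)
    dB = 2.0 * np.mean(residual)
    return dA, dB

def fit_line(x, y, step=0.05, step_limit=1e-9, max_iter=10_000):
    # Standardize x so both parameters live on a comparable scale
    x_mean, x_std = x.mean(), x.std()
    z = (x - x_mean) / x_std

    A = B = 1.0  # initial guess, h(z) = z + 1
    for _ in range(max_iter):
        dA, dB = gradients(z, y, A, B)
        change_A, change_B = step * dA, step * dB
        A -= change_A
        B -= change_B
        # stop once both updates are negligible
        if abs(change_A) < step_limit and abs(change_B) < step_limit:
            break

    # Convert back to the original x scale: y = (A/x_std)*x + (B - A*x_mean/x_std)
    slope = A / x_std
    intercept = B - slope * x_mean
    return slope, intercept

# Synthetic stand-in data (an assumption, not the brainhead dataset)
rng = np.random.default_rng(0)
x = rng.uniform(2700, 4800, size=1000)
y = 325.0 + 0.26 * x + rng.normal(0, 70, size=1000)

slope, intercept = fit_line(x, y)
print(slope, intercept)
```

On the real data the same function could be called as `fit_line(df['head_size'].to_numpy(float), df['brain_weight'].to_numpy(float))` and compared with the `linregress` result.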

In [44]:
row5 = df.head(5)
len(row5)


5