In [1]:
from linear_algebra import dot, Vector

def predict(x: Vector, theta: Vector) -> float:
    """assumes that first element of x is 1"""
    return dot(x, theta)


Further assumptions of the Least Squares Model:

- The first is that the features of the vector X are linearly independent, meaning there is no way to write any one of them as a weighted sum of some of the others. If this assumption fails, it is impossible to correctly estimate theta.

- The second assumption is that the features of X are all uncorrelated with the errors E. If this fails to be the case, our estimate of theta will be systematically wrong.

Pages 191 to 193 have more details on this. There is also more detail in this [article](https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/).
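To see why linearly dependent features make theta impossible to pin down, here is a minimal sketch with a made-up feature layout (not the dataset used later): the third feature is always exactly twice the second, so several different thetas give identical predictions and the data cannot tell them apart. `dot` and `predict` are restated locally only so the snippet runs on its own.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Stand-in for linear_algebra.dot so the snippet is self-contained."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict(x: Vector, theta: Vector) -> float:
    """assumes that first element of x is 1"""
    return dot(x, theta)

# Hypothetical rows [1, f, 2 * f]: the last feature is a multiple of the
# second, so the features are linearly dependent.
dependent_xs = [[1, f, 2 * f] for f in range(5)]

theta_a = [10, 1, 0]    # all the weight on the second feature
theta_b = [10, -1, 1]   # -f + 2f = f, so the predictions are unchanged
theta_c = [10, 3, -1]   # 3f - 2f = f, unchanged again

preds_a = [predict(x, theta_a) for x in dependent_xs]
preds_b = [predict(x, theta_b) for x in dependent_xs]
preds_c = [predict(x, theta_c) for x in dependent_xs]

# Three different thetas, one set of predictions: no fit can choose among them.
assert preds_a == preds_b == preds_c
```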

#### Fitting the model

In [2]:
from typing import List
from linear_algebra import Vector

def error(x: Vector, y: float, theta: Vector) -> float:
    return predict(x, theta) - y

def squared_error(x: Vector, y: float, theta: Vector) -> float:
    return error(x, y, theta) ** 2

x = [1, 2, 3]
y = 30
theta = [4, 4, 4] # so prediction = 4 + 8 + 12 = 24

assert error(x, y, theta) == -6
assert squared_error(x, y, theta) == 36

In [3]:
def sqerror_gradient(x: Vector, y: float, theta: Vector) -> Vector:
    err = error(x, y, theta)
    return [2 * err * x_i for x_i in x]

assert sqerror_gradient(x, y, theta) == [-12, -24, -36]
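The gradient formula follows from the chain rule: the squared error is `(dot(x, theta) - y) ** 2`, so its partial derivative with respect to `theta[j]` is `2 * error * x[j]`. The sketch below checks that numerically with a central difference. `estimate_gradient` and its step size `h` are hypothetical names introduced for this check, and the helpers are restated so the snippet runs on its own.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Stand-in for linear_algebra.dot."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def squared_error(x: Vector, y: float, theta: Vector) -> float:
    return (dot(x, theta) - y) ** 2

def sqerror_gradient(x: Vector, y: float, theta: Vector) -> Vector:
    err = dot(x, theta) - y
    return [2 * err * x_i for x_i in x]

def estimate_gradient(x: Vector, y: float, theta: Vector, h: float = 1e-6) -> Vector:
    """Approximate each partial derivative with a central difference."""
    estimates = []
    for j in range(len(theta)):
        theta_up = [t + (h if i == j else 0) for i, t in enumerate(theta)]
        theta_down = [t - (h if i == j else 0) for i, t in enumerate(theta)]
        estimates.append((squared_error(x, y, theta_up)
                          - squared_error(x, y, theta_down)) / (2 * h))
    return estimates

x, y, theta = [1, 2, 3], 30, [4, 4, 4]      # same values as the cell above
exact = sqerror_gradient(x, y, theta)
approx = estimate_gradient(x, y, theta)

# The analytic and numerical gradients should agree up to rounding error.
assert all(abs(e - a) < 1e-4 for e, a in zip(exact, approx))
```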

Using gradient descent we can now compute the optimal theta. First, let's write a least_squares_fit function that can work with any dataset:


In [9]:
import random
from linear_algebra import vector_mean
from gradient_descent import gradient_step

In [5]:
def least_squares_fit(xs: List[Vector],
                      ys: List[float],
                      learning_rate: float = 0.001,
                      num_steps: int = 1000,
                      batch_size: int = 1) -> Vector:
    """
    Find the theta that minimizes the sum of squared errors assuming the model y = dot(x, theta)
    """
    # start with a random guess
    guess = [random.random() for _ in xs[0]]
    for epoch in range(num_steps):
        for start in range(0, len(xs), batch_size):
            batch_xs = xs[start:start+batch_size]
            batch_ys = ys[start:start+batch_size]

            gradient = vector_mean([sqerror_gradient(x, y, guess)
                                   for x, y in zip(batch_xs, batch_ys)])
            # negative step size: move against the gradient to reduce the error
            guess = gradient_step(guess, gradient, -learning_rate)
        print(f'epoch is {epoch}; current guess is {guess}')
    return guess  
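
As a sanity check on the minibatch loop above, here is a minimal self-contained sketch that runs the same update on made-up, noise-free data where the true theta is known to be [3, 2]. `vector_mean` and `gradient_step` below are stand-ins written for this example; they assume the imported versions compute a component-wise mean and `v + step_size * gradient` respectively, which this notebook does not show. `fit`, `toy_xs`, and `toy_ys` are hypothetical names, and the per-epoch print is dropped to keep the output short.

```python
import random
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def vector_mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean (assumed behaviour of linear_algebra.vector_mean)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move step_size along gradient (assumed behaviour of gradient_descent.gradient_step)."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

def sqerror_gradient(x: Vector, y: float, theta: Vector) -> Vector:
    err = dot(x, theta) - y
    return [2 * err * x_i for x_i in x]

def fit(xs: List[Vector], ys: List[float],
        learning_rate: float = 0.01,
        num_steps: int = 2000,
        batch_size: int = 4) -> Vector:
    """Same minibatch loop as least_squares_fit, without the print."""
    guess = [random.random() for _ in xs[0]]
    for _ in range(num_steps):
        for start in range(0, len(xs), batch_size):
            batch_xs = xs[start:start + batch_size]
            batch_ys = ys[start:start + batch_size]
            gradient = vector_mean([sqerror_gradient(x, y, guess)
                                    for x, y in zip(batch_xs, batch_ys)])
            guess = gradient_step(guess, gradient, -learning_rate)
    return guess

random.seed(0)
toy_xs = [[1, x] for x in range(-5, 6)]      # constant term plus one feature
toy_ys = [3 + 2 * x[1] for x in toy_xs]      # exactly y = 3 + 2x, no noise
theta_hat = fit(toy_xs, toy_ys)

# With noise-free data the fit should recover the true coefficients closely.
assert abs(theta_hat[0] - 3) < 1e-3
assert abs(theta_hat[1] - 2) < 1e-3
```

Because the toy data has no noise, any remaining gap between `theta_hat` and [3, 2] comes from stopping the descent early, not from the data.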

In [6]:
from statistics import daily_minutes_good
from inputs import inputs

In [None]:
random.seed(0)

learning_rate = 0.001

theta = least_squares_fit(inputs, daily_minutes_good, learning_rate, 5000, 25)

In [10]:
# minutes = 30.58 + 0.972 friends - 1.87 work hours + 0.923 phd
assert 30.50 < theta[0] < 30.70
assert 0.96 < theta[1] < 1.00 
assert -1.89 < theta[2] < -1.85
assert 0.91 < theta[3] < 0.94

You should think of the coefficients of the model as representing all-else-being-equal estimates of the impacts of each factor. All else being equal, each additional friend corresponds to an extra minute spent on the site each day. All else being equal, each additional hour in a user's workday corresponds to about two fewer minutes spent on the site each day. All else being equal, having a PhD is associated with spending an extra minute on the site each day.
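
The all-else-being-equal reading can be checked directly: change one feature, hold the others fixed, and the prediction moves by exactly that feature's coefficient. The sketch below uses the rounded coefficients from the comment in the previous cell; the user rows are made up for illustration, and `rounded_theta` is a hypothetical name chosen so the fitted `theta` is left untouched.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict(x: Vector, theta: Vector) -> float:
    """assumes that first element of x is 1"""
    return dot(x, theta)

# [constant, friends, work hours, phd], rounded from the fit above
rounded_theta = [30.58, 0.972, -1.87, 0.923]

user = [1, 10, 8, 0]               # hypothetical: 10 friends, 8 work hours, no PhD
one_more_friend = [1, 11, 8, 0]    # identical except for one extra friend
one_more_hour = [1, 10, 9, 0]      # identical except for one extra work hour

friend_effect = predict(one_more_friend, rounded_theta) - predict(user, rounded_theta)
hour_effect = predict(one_more_hour, rounded_theta) - predict(user, rounded_theta)

# Each difference equals the corresponding coefficient (up to rounding error).
assert abs(friend_effect - 0.972) < 1e-9
assert abs(hour_effect - (-1.87)) < 1e-9
```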

What this doesn't capture is interactions between features. It's possible that the effect of work hours is different for people with many friends than for people with few. One way to handle this is to introduce a new variable that is the product of friends and work hours.

Or it's possible that the more friends you have, the more time you spend on the site up to a point, after which further friends cause you to spend less time on the site. (Perhaps with too many friends the experience is just too overwhelming?) We could try to capture this in our model by adding another variable that's the square of the number of friends.
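
Both ideas, the friends-times-work-hours interaction and the squared number of friends, amount to appending derived columns to each input row. Here is a minimal sketch, assuming rows laid out as [1, friends, work hours, phd] as in the model comment above. `add_interaction_and_square` is a hypothetical helper and the sample rows are made up.

```python
from typing import List

Vector = List[float]

def add_interaction_and_square(x: Vector) -> Vector:
    """Append friends * work_hours and friends ** 2 to a row.

    Assumes x is laid out as [1, friends, work_hours, phd].
    """
    _, friends, work_hours, _ = x
    return x + [friends * work_hours, friends ** 2]

sample_rows = [[1, 49, 4, 0], [1, 10, 8, 1]]     # made-up rows
expanded_rows = [add_interaction_and_square(x) for x in sample_rows]

# Each row gains two columns, so a fitted theta would have six entries.
assert all(len(row) == 6 for row in expanded_rows)
```

The expanded rows could then be passed to least_squares_fit in place of the original inputs. Since the new columns can be much larger than the original ones, a smaller learning rate or rescaled features may be needed for gradient descent to converge.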

Once we start adding variables, we need to worry about whether their coefficients matter. There is no limit to the number of products, logs, squares, and higher powers that can be added.

#### Goodness of fit 

In [13]:
from simple_linear_regression import total_sum_squares

In [14]:
def multiple_r_squared(xs: List[Vector], ys:Vector, theta: Vector) -> float:
    sum_of_squared_errors = sum(error(x, y, theta) ** 2
                                for x, y in zip(xs, ys))
    return 1.0 - sum_of_squared_errors / total_sum_squares(ys)

In [16]:
assert 0.67 < multiple_r_squared(inputs, daily_minutes_good, theta) < 0.68
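
To get a feel for the scale of R squared, here is a small sketch on made-up data that lies exactly on the line y = 1 + 2x. A theta that reproduces the data scores 1, while a theta that always predicts the mean of ys scores 0. `total_sum_squares` is a stand-in that assumes the imported version is the sum of squared deviations from the mean; `line_xs` and `line_ys` are hypothetical names.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def total_sum_squares(ys: Vector) -> float:
    """Sum of squared deviations from the mean (assumed behaviour of the import)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def multiple_r_squared(xs: List[Vector], ys: Vector, theta: Vector) -> float:
    sum_of_squared_errors = sum((dot(x, theta) - y) ** 2
                                for x, y in zip(xs, ys))
    return 1.0 - sum_of_squared_errors / total_sum_squares(ys)

line_xs = [[1, 0], [1, 1], [1, 2], [1, 3]]
line_ys = [1, 3, 5, 7]                      # exactly 1 + 2x

perfect = multiple_r_squared(line_xs, line_ys, [1, 2])     # reproduces every y
mean_only = multiple_r_squared(line_xs, line_ys, [4, 0])   # always predicts mean(ys) = 4

assert perfect == 1.0
assert mean_only == 0.0
```

Read this way, the value between 0.67 and 0.68 asserted above means the fitted model leaves roughly a third of the variation in daily minutes unexplained.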

R squared tends to increase as more variables are added to the model. Because of this, in a multiple regression, we also need to look at the standard errors of the coefficients, which measure how certain we are about our estimates of each theta_i. The regression as a whole may fit our data very well, but if some of the independent variables are correlated (or irrelevant), their coefficients might not mean much.

The typical approach to measuring these errors starts with another assumption: that the errors **εi** are independent normal random variables with mean 0 and some shared (unknown) standard deviation σ. In that case, we (or, more likely, our statistical software) can use some linear algebra to find the standard error of each coefficient. The larger it is, the less sure our model is about that coefficient. Unfortunately, we're not set up to do that kind of linear algebra from scratch.
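
For a rough idea of what that linear algebra looks like, here is a sketch for the smallest case, a model with a constant and a single feature, where the 2x2 matrix inverse can be written out by hand. It relies on the standard textbook result under the normal-errors assumption above: the estimated covariance of theta is σ² (XᵀX)⁻¹, with σ² estimated as the sum of squared errors divided by n minus the number of coefficients. The function name and the data are made up for illustration, and nothing here comes from the notebook's own modules.

```python
import math
from typing import List, Tuple

def coefficient_standard_errors(xs: List[float],
                                ys: List[float]) -> Tuple[List[float], List[float]]:
    """Fit y = theta_0 + theta_1 * x exactly; return (theta, standard_errors)."""
    n = len(xs)
    # Entries of X^T X and X^T y for rows of the form [1, x].
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * s_xx - s_x ** 2

    # theta = (X^T X)^-1 X^T y, with (X^T X)^-1 = [[s_xx, -s_x], [-s_x, n]] / det
    theta_0 = (s_xx * s_y - s_x * s_xy) / det
    theta_1 = (n * s_xy - s_x * s_y) / det

    # Estimate sigma^2 from the residuals; two coefficients were estimated.
    sse = sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys))
    sigma_squared = sse / (n - 2)

    # Standard errors are square roots of the diagonal of sigma^2 (X^T X)^-1.
    se_0 = math.sqrt(sigma_squared * s_xx / det)
    se_1 = math.sqrt(sigma_squared * n / det)
    return [theta_0, theta_1], [se_0, se_1]

demo_xs = [0, 1, 2, 3, 4]
demo_ys = [1.1, 2.9, 5.2, 6.8, 9.0]          # made-up, roughly y = 1 + 2x
demo_theta, demo_se = coefficient_standard_errors(demo_xs, demo_ys)
print(f"theta = {demo_theta}, standard errors = {demo_se}")
```

With more than two coefficients the same formula applies, but inverting XᵀX by hand is no longer practical, which is why this step is usually left to statistical software.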