# Simple and Multiple Linear Regression

## Today we will demonstrate a few concepts

- Linear regression is subject to the same kind of sampling randomness that t-tests are subject to. We will show why you need to be careful when interpreting linear regression coefficients
- Building and evaluating simple regression models
- Building regressions with multiple variables
- Checking that the model assumptions are satisfied

In [None]:
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


### Problem 1: You are given weights for 8 people. You will simulate how much money each of them has in their wallet, with no relationship to their weight. You'll then ask linear regression to tell you if there is a relationship (knowing full well there isn't!)

In [None]:
weight = [160, 185, 190, 200, 205, 220, 235, 280] #this will be your X

###### To prove that you need to be careful interpreting regression coefficients, generate 8 N(60,10) random variables with numpy (one per person), and call that variable money

In [10]:
def make_money(n=len(weight)):
    # Draw one N(mean=60, sd=10) value per person, independent of weight
    return np.random.normal(60, 10, n)

money = make_money()

###### Combine them into a dataframe

In [11]:
df = pd.DataFrame({'weight': weight, 'money': money})

###### Fit the linear model

In [13]:
lm = smf.ols('money ~ weight', data=df).fit()
lm.summary()

###### If you had to write the equation by hand, what would it be? Calculate how much money the model expects you to have if you weigh 100 pounds. Notice that the weight coefficient isn't exactly 0 even though we know weight has no impact on money! This is a super important concept; if you don't understand it, ask a TA!
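
One way to see why the coefficient is never exactly 0 is to repeat the experiment many times. The sketch below is only an illustration, not the worksheet's intended solution: the number of repetitions (`n_sims`) and the fixed seed are assumptions chosen for the example. It redraws money independently of weight, refits the model each time, and records the slope and its p-value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

weight = [160, 185, 190, 200, 205, 220, 235, 280]
rng = np.random.default_rng(0)  # assumed seed, only so the sketch is repeatable

n_sims = 500  # assumed number of repetitions
slopes = np.empty(n_sims)
pvalues = np.empty(n_sims)
for i in range(n_sims):
    # money is drawn independently of weight, so the true slope is 0
    sim = pd.DataFrame({'weight': weight,
                        'money': rng.normal(60, 10, len(weight))})
    fit = smf.ols('money ~ weight', data=sim).fit()
    slopes[i] = fit.params['weight']
    pvalues[i] = fit.pvalues['weight']

false_positive_rate = np.mean(pvalues < 0.05)
print('average slope across simulations:', slopes.mean())
print('share of slopes exactly equal to 0:', np.mean(slopes == 0))
print('share of simulations with p < 0.05:', false_positive_rate)
```

Every individual fit gives a nonzero slope, but the slopes scatter around 0, and only a small share of the fits look "significant" purely by chance. That is the same randomness a t-test is subject to.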

###### What is the p-value for the weight variable? Should you use it? If not, what is the new estimate for how much money someone who is 100 lbs has in their wallet?
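
A minimal sketch of these steps, assuming a dataframe with columns named `weight` and `money` (the seed and the variable names `by_hand`, `by_predict`, and `lm0` are choices made for this illustration). It writes the prediction out by hand, checks it against `predict`, reads off the p-value, and then fits the intercept-only model you fall back to if weight is dropped.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)  # assumed seed for repeatability
df = pd.DataFrame({'weight': [160, 185, 190, 200, 205, 220, 235, 280]})
df['money'] = rng.normal(60, 10, len(df))

lm = smf.ols('money ~ weight', data=df).fit()
b0, b1 = lm.params['Intercept'], lm.params['weight']

# Equation by hand: money = b0 + b1 * weight
by_hand = b0 + b1 * 100
by_predict = lm.predict(pd.DataFrame({'weight': [100]})).iloc[0]
print('predicted money at 100 lbs:', by_hand)

print('p-value for weight:', lm.pvalues['weight'])

# Without weight, the model has only an intercept,
# and the least-squares intercept is the sample mean of money
lm0 = smf.ols('money ~ 1', data=df).fit()
estimate_without_weight = lm0.params['Intercept']
print('estimate without weight:', estimate_without_weight)
```

If the p-value for weight is large, the intercept-only estimate (the plain average of money) is the more defensible answer for someone who weighs 100 lbs, or any other weight.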

### Problem 2: Edwin Hubble once measured the distances of nebulae outside the Milky Way and was surprised to find a relationship between a nebula's distance from the Earth and the velocity with which it was moving away. This was the data that initially supported the idea of the Big Bang. Here you'll analyze the data and see if you come to the same conclusion.

###### Read the data from "../../data/nebula.csv" and make a scatter plot (plot.scatter(x,y)) with velocity on the X axis and distance on the Y axis

###### Fit the regression. According to the Big Bang theory, the distance should just be Time * Velocity (with no intercept). Does that relationship hold here?
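
A sketch of fitting with and without an intercept. The column names in nebula.csv are not given here, so this example uses made-up data with hypothetical columns `velocity` and `distance`; the slope, noise level, and sample size are invented for illustration and are not Hubble's measurements. In a statsmodels formula, adding `- 1` removes the intercept.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)  # assumed seed for repeatability
# Synthetic stand-in: distance proportional to velocity, plus noise
velocity = rng.uniform(-200, 1100, 24)
nebula = pd.DataFrame({'velocity': velocity,
                       'distance': 0.002 * velocity + rng.normal(0, 0.25, 24)})

# Model with an intercept: check whether the intercept is distinguishable from 0
with_intercept = smf.ols('distance ~ velocity', data=nebula).fit()
print(with_intercept.params)
print('intercept p-value:', with_intercept.pvalues['Intercept'])

# Model forced through the origin: distance = Time * velocity
no_intercept = smf.ols('distance ~ velocity - 1', data=nebula).fit()
print(no_intercept.params)
```

If the intercept in the first model is not significantly different from 0, the data are consistent with the no-intercept form, and the single coefficient in the second model plays the role of Time.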

###### Let's see if we can manually calculate the R^2 value. The R^2 is supposed to explain how well our linear regression fits the data. Two main quantities go into an R^2

- Total Sum of Squares (TSS): This is a measure of how much the data varies overall. If this number is big, the data is all over the place; if it is small, the data points are similar. It is calculated by summing the squared differences between each y-value and the overall mean of all the ys
  - First calculate the mean of the ys
  - Subtract it from each of the actual ys
  - Square each difference
  - Sum up the results
- Residual Sum of Squares (RSS): This is effectively a measure of how much variation is left after you fit your model. If you created a great model, the remaining variation should be small. If the model didn't work, the variation will be big
  - Predict what the values of y would be for the Xs that you have
  - Subtract the predicted (or fitted) ys from the actual ys
  - Square that difference
  - Sum up the results
- Now we can calculate the R^2 value:
  - R^2 = (TSS - RSS) / TSS
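
The steps above can be sketched as follows. The data here is synthetic and the column names `x` and `y` are placeholders, so treat this as an illustration of the arithmetic and not as the nebula solution. The model includes an intercept, which is the case where this formula matches the `rsquared` value statsmodels reports.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)  # assumed seed for repeatability
x = np.linspace(0, 10, 30)
data = pd.DataFrame({'x': x, 'y': 2.0 + 0.5 * x + rng.normal(0, 1, 30)})

lm = smf.ols('y ~ x', data=data).fit()

# Total Sum of Squares: squared distance of each y from the mean of y
tss = ((data['y'] - data['y'].mean()) ** 2).sum()

# Residual Sum of Squares: squared distance of each y from its fitted value
fitted = lm.predict(data)
rss = ((data['y'] - fitted) ** 2).sum()

r2_manual = (tss - rss) / tss
print('manual R^2:     ', r2_manual)
print('statsmodels R^2:', lm.rsquared)
```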

#### Total Sum of Squares

###### Calculate the mean of the true Ys (Distance)

###### Square the difference between the real Ys and the mean, and then sum it up

#### Residual Sum of Squares

###### Get the values for what Y would be according to the model (Hint: you can use lm.predict())

###### Square the difference between the actual Ys and the fitted ys and sum it up

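###### Now calculate the R^2. It should be very close to the one reported by statsmodels for a model that includes an intercept (for a model fit without an intercept, statsmodels reports R^2 against the uncentered total sum of squares, so the two will differ)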

## Multiple Regression Analysis

###### Here you will try to understand the relationship between marketing spend and purchases of your product

###### Read in the spend data from "../../data/spend.csv"

###### Plot the data

###### Notice in the plot above that purchases begin to taper off as spend increases. How can we fit this better than with a straight line of purchases against spend?

###### Compare this with a model that doesn't use the squared term. Which one should we use?
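
A sketch of one way to make this comparison. The columns in spend.csv are not shown here, so the example generates synthetic data with hypothetical column names `spend` and `purchases`, and the helper column `spend_sq` is an assumption introduced for the illustration. It fits a straight-line model and a model with a squared term, then compares them on adjusted R^2, AIC, and the p-value of the squared term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)  # assumed seed for repeatability
spend = np.linspace(1, 100, 60)
# Synthetic purchases that rise with spend but taper off at high spend
purchases = 5 + 2.0 * spend - 0.01 * spend ** 2 + rng.normal(0, 4, 60)
marketing = pd.DataFrame({'spend': spend, 'purchases': purchases})
marketing['spend_sq'] = marketing['spend'] ** 2  # explicit squared term

linear = smf.ols('purchases ~ spend', data=marketing).fit()
quad = smf.ols('purchases ~ spend + spend_sq', data=marketing).fit()

print('adjusted R^2 (linear, quadratic):', linear.rsquared_adj, quad.rsquared_adj)
print('AIC (linear, quadratic):', linear.aic, quad.aic)
print('p-value of squared term:', quad.pvalues['spend_sq'])
```

Plain R^2 never goes down when a term is added, so adjusted R^2 or AIC is a fairer basis for choosing. If the squared term is significant and both measures improve, the model with the squared term is the better description of the tapering.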