## Evaluating simple linear regressions on lemonade data with other features

1. Create a dataframe from the csv at https://gist.githubusercontent.com/ryanorsinger/c303a90050d3192773288f7eea97b708/raw/536533b90bb2bf41cea27a2c96a63347cde082a6/lemonade.csv

In [1]:
from pydataset import data
from sklearn.metrics import mean_squared_error
from math import sqrt
from statsmodels.formula.api import ols

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/9867c96ddb56626e9aac94d8e92dabdf/raw/45f9a36a8871ac0e24317704ed0072c9dded1327/lemonade_regression.csv")
df.head()

Unnamed: 0,temperature,rainfall,flyers,sales
0,27.0,2.0,15,10
1,28.9,1.33,15,13
2,34.5,1.33,27,15
3,44.1,1.05,28,17
4,42.4,1.0,33,18


Make a baseline for predicting sales. The mean of sales is a reasonable baseline.

In [3]:
baseline = df.sales.mean()

baseline

25.323287671232876

Create a new dataframe to hold residuals.

In [4]:
residuals = pd.DataFrame()

Calculate the baseline residuals.

In [5]:
residuals['x'] = df.flyers
residuals["y"] = df.sales

residuals["baseline"] = baseline

residuals["baseline_residual"] = residuals.baseline - residuals.y
residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual
0,15,10,25.323288,15.323288
1,15,13,25.323288,12.323288
2,27,15,25.323288,10.323288
3,28,17,25.323288,8.323288
4,33,18,25.323288,7.323288


Use ols from statsmodels to create a simple linear regression (1 independent variable, 1 dependent variable) to predict sales using flyers.

In [6]:
model = ols('sales ~ flyers', data=df).fit()

Use the .predict method of the fitted model to produce predictions for every row, then add these predictions to the residuals dataframe.

In [7]:
residuals["yhat"] = model.predict()
residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat
0,15,10,25.323288,15.323288,14.673754
1,15,13,25.323288,12.323288,14.673754
2,27,15,25.323288,10.323288,19.727926
3,28,17,25.323288,8.323288,20.149107
4,33,18,25.323288,7.323288,22.255013


Calculate that model's residuals.

In [8]:
residuals["yhat_residuals"] = residuals.yhat - residuals.y
residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat,yhat_residuals
0,15,10,25.323288,15.323288,14.673754,4.673754
1,15,13,25.323288,12.323288,14.673754,1.673754
2,27,15,25.323288,10.323288,19.727926,4.727926
3,28,17,25.323288,8.323288,20.149107,3.149107
4,33,18,25.323288,7.323288,22.255013,4.255013


Evaluate that model's performance and answer if the model is significant.

In [9]:
baseline_sse = (residuals.baseline_residual**2).sum()
flyer_model_sse = (residuals.yhat_residuals**2).sum()

In [10]:
if flyer_model_sse < baseline_sse:
    print("Our model beats the baseline")
else:
    print("Our baseline is better than the model.")

print("\nBaseline SSE", baseline_sse)
print("\nModel SSE", flyer_model_sse)

Our model beats the baseline

Baseline SSE 17297.85205479452

Model SSE 6083.326244705024


In [11]:
r2 = model.rsquared
print('R-squared = ', round(r2,3))

R-squared =  0.648
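As a cross-check, the same R-squared can be recovered from the two SSE values printed earlier: for an OLS fit with an intercept, R-squared is 1 minus the model SSE divided by the baseline SSE (the baseline SSE is the total sum of squares). A minimal sketch, with the printed values copied in by hand and `r2_manual` as an illustrative name:

```python
# SSE values copied from the notebook output above
baseline_sse = 17297.85205479452     # total sum of squares (mean baseline)
flyer_model_sse = 6083.326244705024  # sum of squared errors for sales ~ flyers

# R-squared is the share of the baseline error that the model removes
r2_manual = 1 - flyer_model_sse / baseline_sse
print("R-squared = ", round(r2_manual, 3))
```

This should agree with the 0.648 reported by `model.rsquared`.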


In [12]:
f_pval = model.f_pvalue
print("p-value for model significance = ", f_pval)

p-value for model significance =  2.193718738113383e-84


__Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.__

Evaluate that model's performance and answer if the feature is significant.

__The feature is significant: with only one feature, the test of the model is also the test of that feature's coefficient.__
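The underlying identity is that, in a simple linear regression with an intercept, the model F-statistic equals the square of the slope's t-statistic, so both tests give the same p-value. The sketch below checks this with plain NumPy on made-up numbers (the arrays are illustrative, not the lemonade data). With statsmodels, the per-coefficient p-values are also available as `model.pvalues`.

```python
import numpy as np

# Made-up illustrative data (not the lemonade dataset)
x = np.array([15.0, 15.0, 27.0, 28.0, 33.0, 40.0, 45.0, 52.0])
y = np.array([10.0, 13.0, 15.0, 17.0, 18.0, 22.0, 21.0, 27.0])
n = len(x)

# Least-squares slope and intercept
sxx = ((x - x.mean()) ** 2).sum()
slope = ((x - x.mean()) * (y - y.mean())).sum() / sxx
intercept = y.mean() - slope * x.mean()

yhat = intercept + slope * x
sse = ((y - yhat) ** 2).sum()          # unexplained variation
ess = ((yhat - y.mean()) ** 2).sum()   # explained variation
mse = sse / (n - 2)                    # two estimated parameters

f_stat = ess / mse                     # model F-statistic (1 numerator df)
t_stat = slope / np.sqrt(mse / sxx)    # t-statistic for the slope

# The two statistics carry the same information
print(np.isclose(f_stat, t_stat ** 2))
```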

## Repetition Improves Performance!

In the next section of your notebook, perform the steps above with the rainfall column as the model's feature. Does this model beat the baseline? Would you prefer the rainfall model over the flyers model?

In [13]:
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/9867c96ddb56626e9aac94d8e92dabdf/raw/45f9a36a8871ac0e24317704ed0072c9dded1327/lemonade_regression.csv")

baseline = df.sales.mean()

residuals = pd.DataFrame()

residuals['x'] = df.rainfall
residuals["y"] = df.sales

residuals["baseline"] = baseline

residuals["baseline_residual"] = residuals.baseline - residuals.y

model = ols('sales ~ rainfall', data=df).fit()

residuals["yhat"] = model.predict()

residuals["yhat_residuals"] = residuals.yhat - residuals.y

residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat,yhat_residuals
0,2.0,10,25.323288,15.323288,-1.599602,-11.599602
1,1.33,13,25.323288,12.323288,13.773142,0.773142
2,1.33,15,25.323288,10.323288,13.773142,-1.226858
3,1.05,17,25.323288,8.323288,20.197573,3.197573
4,1.0,18,25.323288,7.323288,21.344793,3.344793


In [14]:
baseline_sse = (residuals.baseline_residual**2).sum()
rainfall_model_sse = (residuals.yhat_residuals**2).sum()

In [15]:
if rainfall_model_sse < baseline_sse:
    print("Our model beats the baseline")
else:
    print("Our baseline is better than the model.")

print("\nBaseline SSE", baseline_sse)
print("\nModel SSE", rainfall_model_sse)

Our model beats the baseline

Baseline SSE 17297.85205479452

Model SSE 2998.2371310300655


In [16]:
r2 = model.rsquared

f_pval = model.f_pvalue

print('R-squared = ', round(r2,3))
print("p-value for model significance = ", f_pval)

R-squared =  0.827
p-value for model significance =  3.2988846597381e-140


__Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.__

__The rainfall model is better than the flyers model: its SSE is lower (about 2998 vs. 6083) and its R-squared is higher (0.827 vs. 0.648).__

In the next section of your notebook, perform the steps above with the log_rainfall column as the model's feature. Does this model beat the baseline? Would you prefer the log_rainfall model over the flyers model? Would you prefer the model built with log_rainfall over the rainfall model from before?
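The dataframe loaded above shows no `log_rainfall` column, so it presumably has to be derived first. A minimal sketch, assuming the natural log is intended (the exercise does not say which base) and using only the five rows visible in `df.head()`; `df_sample` is an illustrative name:

```python
import numpy as np
import pandas as pd

# First five rainfall and sales values from df.head() above
df_sample = pd.DataFrame({
    "rainfall": [2.0, 1.33, 1.33, 1.05, 1.0],
    "sales": [10, 13, 15, 17, 18],
})

# Assumption: natural log; swap in np.log10 if base 10 is wanted
df_sample["log_rainfall"] = np.log(df_sample.rainfall)
print(df_sample)
```

On the full dataframe the same step would be `df["log_rainfall"] = np.log(df.rainfall)`, after which `ols('sales ~ log_rainfall', data=df).fit()` follows the earlier pattern.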

In the next section of your notebook, perform the steps above with the temperature column as the model's only feature. Does this model beat the baseline? Would you prefer the rainfall, log_rainfall, or the flyers model?

Which of these 4 single regression models would you want to move forward with?
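Because the same steps repeat for every feature, they can be wrapped in a small helper and looped. The sketch below is one way to do that, not the notebook's own method: `evaluate_feature` is a hypothetical name, it uses `np.polyfit` in place of statsmodels `ols` (both give the least-squares line for a single feature), and the demo dataframe is made up so the block runs on its own.

```python
import numpy as np
import pandas as pd

def evaluate_feature(df, feature, target="sales"):
    """Fit target ~ feature by least squares; return SSE figures and R-squared."""
    x = df[feature].to_numpy(dtype=float)
    y = df[target].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    yhat = intercept + slope * x
    baseline_sse = ((y.mean() - y) ** 2).sum()
    model_sse = ((yhat - y) ** 2).sum()
    return {
        "feature": feature,
        "baseline_sse": baseline_sse,
        "model_sse": model_sse,
        "r2": 1 - model_sse / baseline_sse,
        "beats_baseline": model_sse < baseline_sse,
    }

# Made-up demo data with the same column names as the lemonade dataframe
demo = pd.DataFrame({
    "temperature": [27.0, 28.9, 34.5, 44.1, 42.4, 50.0],
    "rainfall": [2.0, 1.33, 1.33, 1.05, 1.0, 0.9],
    "flyers": [15, 15, 27, 28, 33, 30],
    "sales": [10, 13, 15, 17, 18, 20],
})

results = pd.DataFrame(
    [evaluate_feature(demo, col) for col in ["flyers", "rainfall", "temperature"]]
)
# Lowest model SSE first
print(results.sort_values("model_sse"))
```

Sorting by `model_sse` (or by `r2`, descending) puts the candidate with the best in-sample fit on the first row.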

## Tips dataset

Load the tips dataset from pydataset or seaborn

Define your baseline for "tip". Our goal will be to see if we can make a model that is better than the baseline at predicting tip from total_bill.

Fit a linear regression model (ordinary least squares) and compute yhat, predictions of tip using total_bill. Here is some sample code to get you started:
from statsmodels.formula.api import ols
from pydataset import data

df = data("tips")

model = ols('tip ~ total_bill', data=df).fit()
predictions = model.predict(df)

Calculate the sum of squared errors, explained sum of squares, total sum of squares, mean squared error, and root mean squared error for your model.
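For reference, these quantities relate as follows: SSE is the sum of squared residuals, ESS is the sum of squared differences between the predictions and the mean, TSS equals ESS plus SSE for an OLS fit with an intercept, MSE is SSE divided by n (the convention used by sklearn's `mean_squared_error`), and RMSE is the square root of MSE. A minimal sketch on made-up numbers (not the tips data), using `np.polyfit` as a stand-in for the fitted model:

```python
import numpy as np

# Made-up bills and tips, for illustration only
total_bill = np.array([10.0, 15.5, 20.0, 24.0, 31.0, 48.0])
tip = np.array([1.5, 2.0, 3.5, 3.0, 5.0, 6.0])

# Least-squares line tip = intercept + slope * total_bill
slope, intercept = np.polyfit(total_bill, tip, deg=1)
yhat = intercept + slope * total_bill

sse = ((tip - yhat) ** 2).sum()         # sum of squared errors
ess = ((yhat - tip.mean()) ** 2).sum()  # explained sum of squares
tss = ((tip - tip.mean()) ** 2).sum()   # total sum of squares
mse = sse / len(tip)                    # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error

print(f"SSE={sse:.3f} ESS={ess:.3f} TSS={tss:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```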

Calculate the sum of squared errors, mean squared error, and root mean squared error for the baseline model (i.e. a model that always predicts the average tip amount).

Write python code that compares the sum of squared errors for your model against the sum of squared errors for the baseline model and outputs whether or not your model performs better than the baseline model.

What is the amount of variance explained in your model?

Is your model significantly better than the baseline model?

Plot the residuals for the linear regression model that you made.