# Linear Regression Practice

In this lab we are going to use a 1994 dataset that has detailed prices on items sold at over 400 Burger King, Wendy's, KFC, and Roy Rogers restaurants in New Jersey and Pennsylvania.

Roy Rogers: https://en.wikipedia.org/wiki/Roy_Rogers_Restaurants


The data set has zip-code-level data on the prices of various items and on characteristics of the zip code population.

The idea of this exercise is to see whether fast-food restaurants charge higher prices in areas with a larger concentration of African Americans.

### Data Set Characteristics:  
K. Graddy (1997), "Do Fast-Food Chains Price Discriminate on the Race and Income Characteristics of an Area?" Journal of Business and Economic Statistics 15, 391-401.
http://people.brandeis.edu/~kgraddy/published%20papers/GraddyK_jbes1997.pdf

    :Number of Instances: 410
    
    :Attribute Information
    
    psoda         price of medium soda, 1st wave
    pfries        price of small fries, 1st wave
    pentree       price entree (burger or chicken), 1st wave
    wagest        starting wage, 1st wave
    nmgrs         number of managers, 1st wave
    nregs         number of registers, 1st wave
    hrsopen       hours open, 1st wave
    emp           number of employees, 1st wave
    psoda2        price of medium soda, 2nd wave
    pfries2       price of small fries, 2nd wave
    pentree2      price entree, 2nd wave
    wagest2       starting wage, 2nd wave
    nmgrs2        number of managers, 2nd wave
    nregs2        number of registers, 2nd wave
    hrsopen2      hours open, 2nd wave
    emp2          number of employees, 2nd wave
    compown       =1 if company owned
    chain         BK = 1, KFC = 2, Roy Rogers = 3, Wendy's = 4
    density       population density, town
    crmrte        crime rate, town
    state         NJ = 1, PA = 2
    prpblck       proportion black, zipcode
    prppov        proportion in poverty, zipcode
    prpncar       proportion no car, zipcode
    hseval        median housing value, zipcode
    nstores       number of stores, zipcode
    income        median family income, zipcode
    county        county label
    lpsoda        log(psoda)
    lpfries       log(pfries)
    lhseval       log(hseval)
    lincome       log(income)
    ldensity      log(density)
    NJ            =1 for New Jersey
    BK            =1 if Burger King
    KFC           =1 if Kentucky Fried Chicken
    RR            =1 if Roy Rogers


In [1]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Stats/Regression packages
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

#### 1. Read the data, print the last 5 rows, perform EDA (missing values, data types, describe)

In [3]:
path_to_file = '~/datasets/fast_food_chains/discrim.csv' ## Change it to your path
fast_food = pd.read_csv(path_to_file)



#### 2. The variable psoda has eight missing observations. Replace those missing observations with the average price of soda for the corresponding chain (you will have to find which chain is missing psoda values)
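
One possible approach to group-wise imputation is sketched below. It runs on a tiny made-up DataFrame (the name `toy` and all of its values are assumptions for illustration, not the lab data); only the column names `chain` and `psoda` come from the attribute list above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab data; the values are made up for illustration.
toy = pd.DataFrame({
    "chain": [1, 1, 1, 2, 2, 2],
    "psoda": [1.00, 1.10, np.nan, 0.90, np.nan, 1.00],
})

# Count the missing psoda values within each chain.
missing_by_chain = toy["psoda"].isna().groupby(toy["chain"]).sum()
print(missing_by_chain)

# Mean psoda of each row's chain, aligned with the original rows.
chain_mean = toy.groupby("chain")["psoda"].transform("mean")

# Fill only the missing entries with their chain's mean.
toy["psoda"] = toy["psoda"].fillna(chain_mean)
print(toy)
```

Using `transform("mean")` keeps the result the same length as the original column, so `fillna` can align it row by row without a merge.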

#### 3. In one graph, plot the distribution of 'psoda' for each chain (4 histograms in one graph). Use a different color for each histogram, label each histogram, and place the legend in the 'upper left' corner of your chart. Are there any similarities or differences in their distributions?



#### 4. The variable income has one missing value. First, identify which state (NJ = 1 or PA = 2) this missing value belongs to. Then drop the row that contains it: fast_food.drop(row_number, axis=0), where row_number is the index label of that row
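
As a minimal sketch of this locate-then-drop pattern, the block below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration; only the column names `state` and `income` come from the attribute list).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab data; the values are made up for illustration.
toy = pd.DataFrame({
    "state": [1, 1, 2, 2],
    "income": [44000.0, np.nan, 39000.0, 51000.0],
})

# Select the rows where income is missing and inspect their state.
missing = toy[toy["income"].isna()]
print(missing[["state"]])

# Drop those rows by index label.
toy = toy.drop(missing.index, axis=0)
print(toy.shape)
```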

#### 5. Graph the distribution of psoda for the entire dataset and include a line for the average in your graph.

#### 6. Find the average values of _prpblck_ and _income_ in the sample, along with their standard deviations. Can you infer the units of measurement of these two variables? (Get used to doing this; these are your baseline values.)

### Linear Regressions

#### 7. Consider a model to explain the price of soda, _psoda_, in terms of the proportion of the population that is African American and the median income:
    
    psoda = β0 + β1prpblck + β2income + e
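
The overall workflow for a model of this form might look like the sketch below. It uses synthetic data: the DataFrame `toy`, the sample size, and the coefficients used to generate `psoda` are all made up for illustration, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data; every number here is an assumption for illustration.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "prpblck": rng.uniform(0, 1, n),
    "income": rng.normal(47000, 13000, n),
})
toy["psoda"] = (0.95 + 0.10 * toy["prpblck"] + 1.5e-6 * toy["income"]
                + rng.normal(0, 0.05, n))

# Predictors (2-D) and target (1-D).
X = toy[["prpblck", "income"]]
y = toy["psoda"]
print(X.shape, y.shape)

# Fit ordinary least squares and predict on the same rows.
lm = linear_model.LinearRegression()
lm.fit(X, y)
predictions = lm.predict(X)

print("intercept:", lm.intercept_)
print("coefficients:", dict(zip(X.columns, lm.coef_)))
print("R^2:", lm.score(X, y))
print("MSE:", mean_squared_error(y, predictions))
print("mean of y:", y.mean(), "mean of predictions:", predictions.mean())
```

Because the model includes an intercept, the least-squares residuals sum to zero, so the two printed means agree up to floating-point error.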


In [None]:
# Define your linear model 


# Define the target variable, call it y


# Define your predictors, call them X


# Print the shapes of your y and X



In [None]:
# Fit your model


# Predict your y, call the result predictions
# Print the shape of predictions


In [None]:
## Is there a difference between the means of the actual values (psoda) and your predictions?


In [None]:
## Construct a scatter plot of your model. Put your predicted values on the y axis and the actual y values on the x axis
## Print the mean squared error; see this link:
## http://mste.illinois.edu/patel/amar430/meansquare.html





In [None]:
## What is the coefficient of determination (R^2) of your model? In other words, what share of the variance in psoda does your model explain?


In [None]:
## Print the estimated coefficients of your model.


In [None]:
## What is the intercept of your model?


In [None]:
## Write your results in equation form, include the sample size and R^2
## Interpret the coefficient on prpblck

#### 8. Compare the estimate from question 7 with the simple regression estimate from regressing _psoda_ on _prpblck_. Is the discrimination effect larger or smaller when you control for income by including it among your predictors?
    
    psoda = β0 + β1prpblck + e

In [None]:
# Define your predictors and call them X


# Print the shapes of your y and X


In [None]:
# Fit your model


In [None]:
# Compute the predicted values and call them predictions
# Print the shape of predictions


In [None]:
## Is there a difference between the means of the actual values (psoda) and your predictions?


In [None]:
## Construct a scatter plot of your model. Put your predicted values on the y axis and the actual y values on the x axis
## Print the mean squared error



In [None]:
## What is the coefficient of determination (R^2) of your model?


In [None]:
## Print the estimated coefficients for the linear regression problem


In [None]:
## What is the intercept of your model?


In [None]:
## Write your results in equation form, include the sample size and R^2

#### 9. Now use statsmodels and repeat questions 7 and 8
http://statsmodels.sourceforge.net/devel/example_formulas.html

##### Helpful Notes to Keep in mind:
1. The p-value for a coefficient says nothing about the size of the effect that variable has on your dependent variable. It is possible to have a highly significant result (a very small p-value) for a minuscule effect.
2. A p-value of 5% (or .05) means that, if the variable truly had no effect and your model were specified correctly, there would be only a 5% chance of seeing an estimate at least as extreme as yours. It is evidence against "no effect", not a 95% probability that the effect is real.
3. In simple or multiple linear regression, the size of the coefficient for each independent variable gives you the size of the effect that variable has on your dependent/target (y) variable, and the sign of the coefficient (positive or negative) gives you the direction of the effect. In regression with a single independent/predictor variable, the coefficient tells you how much the dependent variable is expected to increase (if the coefficient is positive) or decrease (if the coefficient is negative) when that independent variable increases by one unit.
4. In regression with multiple independent/predictor variables, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one unit, "_holding all the other independent variables constant_". Remember to keep in mind the units in which your variables are measured.
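
To make the notes above concrete, here is a hedged sketch of reading coefficients and p-values from a statsmodels formula fit. The data are synthetic: `toy`, the sample size, and the generating coefficients are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; every number here is an assumption for illustration.
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "prpblck": rng.uniform(0, 1, n),
    "income": rng.normal(47000, 13000, n),
})
toy["psoda"] = (0.95 + 0.10 * toy["prpblck"] + 1.5e-6 * toy["income"]
                + rng.normal(0, 0.05, n))

# The formula adds an intercept automatically.
results = smf.ols("psoda ~ prpblck + income", data=toy).fit()

print(results.params)    # size and sign of each effect
print(results.pvalues)   # evidence against a zero coefficient
print("R^2:", results.rsquared, "n:", int(results.nobs))
```

Calling `results.summary()` prints the same quantities in a single table, along with standard errors and confidence intervals.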

In [None]:
import statsmodels.formula.api as smf


#### 10. There is one fundamental step that I have purposely left out. Can you guess what this step is and implement it? Use Matplotlib and Seaborn (regplot) to graph them.

### Before you move to the next question, read this article "Why I'm not a fan of R-Squared"
http://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/

### Bonus
#### 11. Report the estimates of the following model (use sklearn):
    log(psoda) = β0 + β1prpblck + β2log(income) + e
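
When the target is in logs, coefficients translate into percentage changes rather than unit changes. The sketch below shows the arithmetic; the helper names and the coefficient values are hypothetical and chosen only for illustration, not estimates from the data.

```python
import numpy as np

def pct_change_from_level(beta, delta):
    """Exact % change in y when a level regressor moves by delta, for a log(y) model."""
    return 100.0 * (np.exp(beta * delta) - 1.0)

def pct_change_from_log(beta, pct_x):
    """Exact % change in y when a logged regressor rises by pct_x percent."""
    return 100.0 * ((1.0 + pct_x / 100.0) ** beta - 1.0)

# Hypothetical coefficients, chosen only for illustration.
beta_prpblck = 0.12   # coefficient on prpblck (a proportion between 0 and 1)
beta_lincome = 0.08   # coefficient on log(income), i.e. an elasticity

# Effect of prpblck rising by 0.20 (20 percentage points).
print(pct_change_from_level(beta_prpblck, 0.20))

# Effect of income rising by 10 percent.
print(pct_change_from_log(beta_lincome, 10.0))
```

For small changes, the common approximations 100·β1·Δprpblck and β2·(% change in income) give nearly the same answers as these exact formulas.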


In [None]:
## If there are missing observations in any of the log variables, would you drop them
## from the dataset? If so, how does this affect your calculations?
## Fill the missing values of lpsoda with 0


In [None]:
## Plot one figure with 2 histograms showing the distributions of psoda and lpsoda
## Are they the same?


In [None]:
## Plot a histogram of the variable log of income (lincome)


In [None]:
## Plot a histogram of the variable income 


In [None]:
## Can you explain what the log is doing to the variables income and price of soda (psoda)?

In [None]:
## In one graph, use seaborn regplot to graph:
## psoda vs lincome
## lpsoda vs income
## lpsoda vs lincome


In [None]:
## graph lpsoda vs income


In [None]:
# graph lpsoda vs lincome


In [None]:
# Define your linear model 


# Define the target variable, call it y



# Define your predictors, call them X


# Print the shapes of your y and X




In [None]:
# Fit your model


# Predict your y, call the result predictions
# Print the shape of predictions


In [None]:
## Graph your predicted and y values


In [None]:
## Print the coefficients and R^2, and interpret your results
## See this link for the interpretation of coefficients in log-transformed regressions:
## http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm



## Do you think this model is more appropriate?

In [None]:
## Now add the variable prppov to the regression in question 11. What happens to the coefficient on prpblck?

# Define your linear model 


# Define the target variable, call it y



# Define your predictors, call them X


# Print the shapes of your y and X


In [None]:
# Fit your model


# Predict your y, call the result predictions
# Print the shape of predictions

## Print the coefficients and R^2, and interpret your results


In [None]:
## Find the correlation between log(income) and prppov. Is it close to what you expected?


In [None]:
## What are your thoughts on this statement:
## "Because log(income) and prppov are so highly correlated, they have no business being
## included in the same regression"
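
One way to put a number on "highly correlated" is the variance inflation factor, which for a model with two regressors reduces to 1 / (1 - r^2), where r is their correlation. The sketch below computes both on synthetic data; `toy` and the relationship between the two variables are made up for illustration and do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data; every number here is an assumption for illustration.
rng = np.random.default_rng(2)
n = 400
lincome = rng.normal(10.7, 0.3, n)
# Made-up relationship: poverty falls as log income rises, plus noise.
prppov = 0.07 - 0.15 * (lincome - 10.7) + rng.normal(0, 0.02, n)
toy = pd.DataFrame({"lincome": lincome, "prppov": prppov})

# Pearson correlation between the two regressors.
r = toy["lincome"].corr(toy["prppov"])

# With exactly two regressors, the VIF of each is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

print("correlation:", r)
print("variance inflation factor:", vif)
```

Correlation between regressors inflates the variance of their coefficient estimates but does not, by itself, bias them, which is worth keeping in mind when judging the statement.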