<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Fast Food Chains and Price Discrimination

_Authors: Kiefer Katovich (SF), Mario Carrillo (SF)_

---

This group lab uses a 1994 dataset with detailed prices for items sold at over 400 Burger King, Wendy's, KFC and [Roy Rogers](https://en.wikipedia.org/wiki/Roy_Rogers_Restaurants) restaurants in New Jersey and Pennsylvania.

It is a restricted version of the dataset used in this publication:

> [K. Graddy (1997), "Do Fast-Food Chains Price Discriminate on the Race and Income Characteristics of an Area?" Journal of Business and Economic Statistics 15, 391-401.](http://people.brandeis.edu/~kgraddy/published%20papers/GraddyK_jbes1997.pdf)

**The goal of this exercise is to evaluate whether fast-food restaurants are using discriminatory pricing.** This is a fairly open-ended prompt, and it is up to you how to quantify pricing discrimination. Two groups of variables can help you investigate this:
- Item prices, which can serve as a metric of discriminatory practices (charging higher prices).
- Area characteristics: the proportion of African American residents (`prpblck`), the proportion in poverty (`prppov`), and the proportion of residents without a car (`prpncar`).
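
One way to start relating the two groups of variables is a correlation table between prices and area characteristics. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from `discrim.csv`); only the column names come from the data dictionary.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real data comes from discrim.csv.
toy = pd.DataFrame({
    "psoda":   [1.05, 0.95, 1.10, 0.99, 1.15],
    "pfries":  [0.90, 0.85, 0.95, 0.88, 0.99],
    "prpblck": [0.10, 0.02, 0.30, 0.05, 0.45],
    "prppov":  [0.05, 0.03, 0.12, 0.04, 0.20],
    "prpncar": [0.08, 0.04, 0.20, 0.06, 0.30],
})

price_cols = ["psoda", "pfries"]
area_cols = ["prpblck", "prppov", "prpncar"]

# Pearson correlation of each price with each area characteristic.
corr = toy.corr().loc[price_cols, area_cols]
print(corr.round(2))
```

A raw correlation does not control for chain, state, or income, so it is a starting point for EDA rather than evidence of discrimination.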

---

### In groups, you should:
1. **Load the data and examine it.**
2. **Perform any required data cleaning.**
3. **Conduct EDA relevant to the goals of the project.** What variables are you interested in for your target(s) and predictors? What kinds of relationships do you see in the data that will inform your analysis?
4. **Formulate your hypotheses and formally define them.** Come up with a plan, based on the prompt and your EDA, for testing your hypotheses.
5. **Construct regression models to test your hypotheses.** What do you find? Do you reject or fail to reject your null hypothesis? What are the limitations and assumptions of your approach?
6. **[Bonus] Cross-validate the results of your regression.** If you achieved positive results, do these hold up during cross-validation or a train/test split?
7. **Prepare a brief (10-minute) presentation on the findings.** Your group will present on the questions you asked, models you built, and findings.
    - Be concise! Only present information that is relevant.
    - Visuals are nice but don't overdo it.
    - Don't just talk about the significance or metrics in your model. Interpret the coefficients. What are the implications?
    - What future directions would you go in from here?
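
For step 4, one possible formalization is sketched below. It is an assumption for illustration, not the required setup: your group may choose a different price, different controls, or a different predictor of interest. Here the quantity of interest is the coefficient on `prpblck` in a linear model of the soda price.

```latex
\text{psoda}_i = \beta_0 + \beta_1\,\text{prpblck}_i + \beta_2\,\text{income}_i + \varepsilon_i,
\qquad H_0:\ \beta_1 = 0 \quad \text{vs.} \quad H_1:\ \beta_1 \neq 0
```

Rejecting $H_0$ at your chosen significance level would indicate an association between price and `prpblck` with income held fixed; on its own it would not establish why prices differ.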


> *Note*: If you are interested, the full dataset is also in the datasets folder under the name `discrim_full.csv`.
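
For the bonus step, a minimal k-fold cross-validation sketch is shown below. It runs on synthetic data so that it is self-contained; `X` and `y` here are stand-ins, not the lab's design matrix and target, and the choice of 5 folds is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the design matrix and target (not the lab data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation; each score is R^2 on the held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```

If the fold-to-fold scores vary widely or sit well below the in-sample R-squared, the in-sample result is probably not stable.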

### Dataset characteristics

    :Number of Instances: 410
    
    :Attribute Information
    
    psoda         price of medium soda
    pfries        price of small fries
    pentree       price of entree (burger or chicken)
    wagest        starting wage
    nmgrs         number of managers
    nregs         number of registers
    hrsopen       hours open
    emp           number of employees
    compown       =1 if company owned
    chain         BK = 1, KFC = 2, Roy Rogers = 3, Wendy's = 4
    density       population density, town
    crmrte        crime rate, town
    state         NJ = 1, PA = 2
    prpblck       proportion black, zipcode
    prppov        proportion in poverty, zipcode
    prpncar       proportion no car, zipcode
    hseval        median housing value, zipcode
    nstores       number of stores, zipcode
    income        median family income, zipcode
    county        county label
    NJ            =1 for New Jersey
    BK            =1 if Burger King
    KFC           =1 if Kentucky Fried Chicken
    RR            =1 if Roy Rogers
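
The `BK`, `KFC` and `RR` columns read as indicator versions of the `chain` code, which would make Wendy's the omitted baseline category. The sketch below rebuilds the indicators from `chain` on a few hypothetical rows, assuming the coding in the table above; `is_wendys` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; chain is coded BK = 1, KFC = 2, Roy Rogers = 3, Wendy's = 4.
toy = pd.DataFrame({"chain": [1, 2, 3, 4, 1]})

# Rebuild the indicator columns from the chain code.
toy["BK"] = (toy["chain"] == 1).astype(int)
toy["KFC"] = (toy["chain"] == 2).astype(int)
toy["RR"] = (toy["chain"] == 3).astype(int)

# Wendy's has no indicator of its own: it is the row where all three are 0.
toy["is_wendys"] = (toy[["BK", "KFC", "RR"]].sum(axis=1) == 0).astype(int)
print(toy)
```

In a regression that includes `BK`, `KFC` and `RR`, each chain coefficient would then be read as a price difference relative to Wendy's.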

In [9]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd
import patsy
import statsmodels.api as sm
# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
# stats/regression packages
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format ='retina'
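# Load the data and drop rows missing any of the three item prices.
food = pd.read_csv('./datasets/discrim.csv')
food.dropna(subset=["psoda", "pfries", "pentree"], how="any", inplace=True)
# Combined price of a soda, fries and an entree at each restaurant.
food["combined_price"] = food["psoda"] + food["pfries"] + food["pentree"]
food.reset_index(inplace=True, drop=True)
# Keep only the modeling columns, then drop rows with any remaining missing values.
food = food[['hrsopen', 'emp', 'density', 'prpblck', 'prppov', 'nstores', 'income', 'NJ', 'BK', 'KFC', 'RR', "combined_price"]]
food = food.dropna()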

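# All chains are kept for this model (265 training rows in the summary below).
# food[food['KFC']] is not a row filter: it treats the 0/1 values as column
# labels and raises a KeyError. A KFC-only filter needs food[food['KFC'] == 1].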
# A: 7030
formula = 'combined_price ~ hrsopen+emp+density+prpblck+prppov+nstores+income+NJ+BK+KFC+RR'
y, X = patsy.dmatrices(formula, data=food, return_type='dataframe')
y = y.values.ravel()

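# train_test_split returns X_train, X_test, y_train, y_test, so x1/y1 are the
# training split (70%) and x2/y2 are the held-out test split (30%).
x1, x2, y1, y2 = train_test_split(X, y, random_state=142, test_size=0.3)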

model = sm.OLS(y1, x1).fit()
model.summary()

                            OLS Regression Results
Dep. Variable:                      y   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.727
Method:                 Least Squares   F-statistic:                     64.99
Date:                Wed, 13 Dec 2017   Prob (F-statistic):           3.04e-67
Time:                        16:04:28   Log-Likelihood:                -69.239
No. Observations:                 265   AIC:                             162.5
Df Residuals:                     253   BIC:                             205.4
Df Model:                          11
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0648      0.217     14.148      0.000       2.638       3.491
hrsopen       -0.0161      0.013     -1.205      0.229      -0.042       0.010
emp           -0.0002      0.003     -0.086      0.931      -0.005       0.005
density     1.601e-05   5.22e-06      3.065      0.002    5.72e-06    2.63e-05
prpblck        0.4783      0.176      2.719      0.007       0.132       0.825
prppov        -1.5585      0.553     -2.819      0.005      -2.647      -0.470
nstores       -0.0144      0.011     -1.264      0.207      -0.037       0.008
income      1.505e-06   2.18e-06      0.692      0.490   -2.78e-06    5.79e-06
NJ             0.1548      0.053      2.940      0.004       0.051       0.259

Omnibus:                      101.240   Durbin-Watson:                   1.874
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              429.917
Skew:                           1.541   Prob(JB):                     4.41e-94
Kurtosis:                       8.426   Cond. No.                     1.42e+06


In [15]:
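# Restrict to KFC restaurants only. Within this subset BK, KFC and RR are
# constant, so the chain dummies add no information (hence Df Model: 8 below).
food = food[food['KFC'] == 1]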
# A: 7030
formula = 'combined_price ~ hrsopen+emp+density+prpblck+prppov+nstores+income+NJ+BK+KFC+RR'
y, X = patsy.dmatrices(formula, data=food, return_type='dataframe')
y = y.values.ravel()

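# train_test_split returns X_train, X_test, y_train, y_test, so x1/y1 are the
# training split (70%) and x2/y2 are the held-out test split (30%).
x1, x2, y1, y2 = train_test_split(X, y, random_state=142, test_size=0.3)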

model = sm.OLS(y1, x1).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.109
Model:                            OLS   Adj. R-squared:                 -0.049
Method:                 Least Squares   F-statistic:                    0.6883
Date:                Wed, 13 Dec 2017   Prob (F-statistic):              0.700
Time:                        16:26:39   Log-Likelihood:                -18.700
No. Observations:                  54   AIC:                             55.40
Df Residuals:                      45   BIC:                             73.30
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.5562      0.385      4.038      0.0