# Which Demographic Variables Help in Predicting Supermarket Revenue? Evidence from Chicago


Research Question: What demographic variables are important for predicting the revenue of supermarkets?

## Data


We work with a dataset of 45 demographic variables for 77 supermarkets in the Chicago area, collected in 1996. By demographic data we mean data that profile a store's customers; examples in our dataset include age, sex, income level, race, employment, homeownership, and level of education.





<img data-src="CorrelationTable.PNG"/>

## Method 


In short, we use an elastic net, which counters overfitting by penalizing the regression with a weighted average of the penalty terms used in the LASSO and in ridge regression.



Consider the general form of a multiple linear regression:


\begin{align*}
    \hat{\boldsymbol{y}} = \mathbf{X}\boldsymbol{\hat{\beta}}.
\end{align*}

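Ordinary least squares estimates $\boldsymbol{\beta}$ by minimizing the sum of squared residuals, which gives the following loss function: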

\begin{align*}
    L(\boldsymbol{\beta}) = (\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta})^T(\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}).
\end{align*}
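
As a minimal sketch of the loss above, the following Python snippet evaluates the sum of squared residuals and finds its minimizer with NumPy. The function names `ols_loss` and `ols_fit` and the synthetic data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def ols_loss(beta, X, y):
    """Sum of squared residuals (y - X beta)^T (y - X beta)."""
    resid = y - X @ beta
    return float(resid @ resid)


def ols_fit(X, y):
    """Least-squares estimate of beta (no penalty, no intercept)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat


# Synthetic illustration only: 77 observations, 5 predictors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(77, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=77)

beta_demo = ols_fit(X_demo, y_demo)
print("OLS coefficients:", np.round(beta_demo, 3))
print("Loss at the OLS solution:", round(ols_loss(beta_demo, X_demo, y_demo), 3))
```

Any other coefficient vector gives a loss at least as large as the one at `beta_demo` on the same data, which is exactly why the unpenalized fit can follow noise in the sample.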


With 45 explanatory variables and only 77 observations, however, minimizing this loss alone is prone to overfitting.


Regularized regression methods add penalty terms to the regression loss so that large coefficients are discouraged and overfitting is reduced. The elastic net is a regularized regression method that combines the following two penalty terms:

\begin{align*}
    L_1(\boldsymbol{\beta}) = \lambda \sum_{i=1}^{p} \vert \beta_i \vert, \hspace{35pt} L_2(\boldsymbol{\beta}) = \lambda \boldsymbol{\beta}^T\boldsymbol{\beta}.
\end{align*}
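
To make the difference between the two penalties concrete, here is a small Python sketch that evaluates both for one coefficient vector. The function names and the example values are assumptions chosen for illustration.

```python
import numpy as np


def l1_penalty(beta, lam):
    """LASSO penalty: lambda times the sum of absolute coefficients."""
    return float(lam * np.sum(np.abs(beta)))


def l2_penalty(beta, lam):
    """Ridge penalty: lambda times the sum of squared coefficients."""
    return float(lam * (beta @ beta))


beta_example = np.array([3.0, -4.0, 0.0])
print(l1_penalty(beta_example, lam=0.5))  # 0.5 * (3 + 4 + 0) = 3.5
print(l2_penalty(beta_example, lam=0.5))  # 0.5 * (9 + 16 + 0) = 12.5
```

The L1 penalty grows linearly in each coefficient's magnitude, while the L2 penalty grows quadratically, so the L2 term punishes large coefficients more heavily and small ones more lightly.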



Zou and Hastie (2005) show that this method works better than the LASSO (which uses only $L_1$) when the independent variables are highly correlated, which makes it a good method for our use case.

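In the elastic net, we minimize the following penalized loss function, written here with the scaling factors $\frac{1}{2n}$ and $\frac{1}{2}$ that the majorization function below assumes.


\begin{align*}
    L(\boldsymbol{\beta}) = \frac{1}{2n}(\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta})^T(\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \alpha \sum_{i=1}^{p} \vert \beta_i \vert + \frac{1}{2}\lambda (1-\alpha) \boldsymbol{\beta}^T\boldsymbol{\beta}.
\end{align*}
Here $\lambda \geq 0$ sets the overall penalty strength and $\alpha \in [0, 1]$ sets the mix between the two penalties: $\alpha = 1$ gives the LASSO and $\alpha = 0$ gives ridge regression. To find the $\boldsymbol{\beta}$ at which $L(\boldsymbol{\beta})$ is minimized, we use the majorize-minimization (MM) algorithm.


Given the current iterate $\boldsymbol{\beta}^{(0)}$, we use the following majorization function, which lies on or above $L(\boldsymbol{\beta})$ everywhere and touches it at $\boldsymbol{\beta} = \boldsymbol{\beta}^{(0)}$ (provided every $\vert \beta_i^{(0)} \vert \geq \epsilon$):
\begin{align*}
    g(\boldsymbol{\beta}, \boldsymbol{\beta}^{(0)})=\frac{1}{2}\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta} - n^{-1}\boldsymbol{\beta}^T\mathbf{X}^T\boldsymbol{y} + c,
\end{align*}


\begin{align*}
    \mathbf{A} = n^{-1}\mathbf{X}^T\mathbf{X} + \lambda(1-\alpha)\mathbf{I} + \lambda \alpha \mathbf{D}, \hspace{35pt}
    \mathbf{D} = \begin{bmatrix}
    \frac{1}{\max(\vert\beta_1^{(0)}\vert, \epsilon)} &0 &0 \\
    0& \ddots & 0\\
    0& 0& \frac{1}{\max(\vert\beta_p^{(0)}\vert, \epsilon)} \\
  \end{bmatrix}, \hspace{35pt}
   c = \frac{1}{2n}\boldsymbol{y}^T\boldsymbol{y} + \frac{1}{2}\lambda\alpha  \sum_{i=1}^{p} \vert \beta_i^{(0)} \vert.
\end{align*}

Here $\epsilon$ is a small positive constant that guards against division by zero. Each MM step minimizes $g$ in closed form by solving $\mathbf{A}\boldsymbol{\beta} = n^{-1}\mathbf{X}^T\boldsymbol{y}$, and the steps are repeated until $L(\boldsymbol{\beta})$ stops decreasing.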
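
The MM iteration can be sketched in a few lines of NumPy. This is a minimal illustration rather than the exact implementation behind the results. It assumes the loss is scaled as $\frac{1}{2n}$ times the residual sum of squares plus $\lambda\alpha\sum_i \vert\beta_i\vert + \frac{1}{2}\lambda(1-\alpha)\boldsymbol{\beta}^T\boldsymbol{\beta}$, so that the ridge term enters $\mathbf{A}$ with weight $\lambda(1-\alpha)$. It also assumes $\lambda > 0$, no intercept, a ridge starting value, and a simple stopping rule. The function names, defaults, and synthetic data are hypothetical.

```python
import numpy as np


def elastic_net_objective(beta, X, y, lam, alpha):
    """(1/2n) RSS + lam*alpha*||beta||_1 + 0.5*lam*(1-alpha)*||beta||_2^2."""
    n = X.shape[0]
    resid = y - X @ beta
    return float(
        resid @ resid / (2 * n)
        + lam * alpha * np.sum(np.abs(beta))
        + 0.5 * lam * (1 - alpha) * (beta @ beta)
    )


def elastic_net_mm(X, y, lam, alpha, eps=1e-8, tol=1e-10, max_iter=1000):
    """Minimize the elastic net objective by repeated majorization (lam > 0)."""
    n, p = X.shape
    XtX = X.T @ X / n
    Xty = X.T @ y / n
    # Assumed starting value: a ridge solution, which exists for any lam > 0.
    beta = np.linalg.solve(XtX + lam * np.eye(p), Xty)
    prev = elastic_net_objective(beta, X, y, lam, alpha)
    for _ in range(max_iter):
        # D holds 1 / max(|beta_i|, eps) for the current iterate.
        D = np.diag(1.0 / np.maximum(np.abs(beta), eps))
        A = XtX + lam * (1 - alpha) * np.eye(p) + lam * alpha * D
        # The quadratic majorizer is minimized where A beta = X^T y / n.
        beta = np.linalg.solve(A, Xty)
        cur = elastic_net_objective(beta, X, y, lam, alpha)
        if prev - cur < tol:
            break
        prev = cur
    return beta


# Synthetic illustration only: two strongly correlated predictors plus noise columns.
rng = np.random.default_rng(0)
z = rng.normal(size=77)
X_demo = np.column_stack(
    [z, z + 0.05 * rng.normal(size=77), rng.normal(size=(77, 3))]
)
y_demo = 2.0 * z + rng.normal(size=77)

beta_demo = elastic_net_mm(X_demo, y_demo, lam=0.1, alpha=0.2)
print(np.round(beta_demo, 3))
```

Because each step solves a ridge-like linear system, coefficients are driven toward zero but are rarely exactly zero; in practice a small threshold is often applied afterwards to decide which variables count as dropped.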


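We choose $\lambda$ and $\alpha$ by $K$-fold cross-validation. For each candidate pair, the model is fitted $K$ times, each time holding out one fold, and the root mean squared error (RMSE) is computed on the $n_k$ held-out observations of fold $k$. We then average the RMSE over the folds and compare the candidate pairs on this average:


\begin{align*}
    RMSE_k = \sqrt{ \frac{1}{n_k} \sum_{i=1}^{n_k} (\hat{y}_i - y_i)^2 }, \hspace{35pt} \overline{RMSE} = \frac{1}{K}\sum_{k=1}^{K} RMSE_k.
\end{align*}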
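
The tuning procedure can be sketched as a grid search over candidate values of $\lambda$ and $\alpha$. The code below is an illustration under several assumptions: the predictors and response are already centered and scaled (no intercept is fitted), the solver is a compact fixed-iteration MM loop for a loss scaled as $\frac{1}{2n}$ times the residual sum of squares plus $\lambda\alpha\sum_i\vert\beta_i\vert + \frac{1}{2}\lambda(1-\alpha)\boldsymbol{\beta}^T\boldsymbol{\beta}$, and the grid values, fold count, and function names are all hypothetical.

```python
from itertools import product

import numpy as np


def fit_elastic_net(X, y, lam, alpha, eps=1e-8, n_iter=200):
    """Compact MM solver; runs a fixed number of iterations for brevity."""
    n, p = X.shape
    XtX, Xty = X.T @ X / n, X.T @ y / n
    beta = np.linalg.solve(XtX + lam * np.eye(p), Xty)  # ridge start, lam > 0
    for _ in range(n_iter):
        D = np.diag(1.0 / np.maximum(np.abs(beta), eps))
        A = XtX + lam * (1 - alpha) * np.eye(p) + lam * alpha * D
        beta = np.linalg.solve(A, Xty)
    return beta


def rmse(y_true, y_pred):
    """Root mean squared error over one set of observations."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def kfold_rmse(X, y, lam, alpha, k=5, seed=0):
    """Average held-out RMSE over k folds for one (lam, alpha) pair."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta = fit_elastic_net(X[train_idx], y[train_idx], lam, alpha)
        scores.append(rmse(y[test_idx], X[test_idx] @ beta))
    return float(np.mean(scores))


def grid_search(X, y, lambdas, alphas, k=5, seed=0):
    """Return the (lam, alpha) pair with the lowest average RMSE, plus all scores."""
    results = {
        (lam, a): kfold_rmse(X, y, lam, a, k=k, seed=seed)
        for lam, a in product(lambdas, alphas)
    }
    best = min(results, key=results.get)
    return best, results


# Synthetic illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(77, 6))
y_demo = X_demo @ np.array([1.5, 0.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=77)

best_pair, all_scores = grid_search(
    X_demo, y_demo, lambdas=[0.01, 0.1, 1.0], alphas=[0.2, 0.5, 0.8]
)
print("Best (lambda, alpha):", best_pair)
```

Using the same `seed` for every candidate pair keeps the fold assignment fixed, so differences in average RMSE reflect the tuning parameters rather than a different split of the data.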





<img data-src="kfold.PNG" />


## Results

We find that $\alpha = 0.2$ and $\lambda = 0.1$ produce the best-fitting model. Because $\alpha$ weights the L1 penalty and $1-\alpha$ weights the L2 penalty, this indicates that in our problem more emphasis should be put on the L2 penalty. This result is consistent with Marquardt and Snee (1975), who found that ridge regression performs best in problems with highly correlated explanatory variables.

<img data-src="CoefficientTable.PNG" />


<img data-src="GLM.ComparePlot.PNG" />