# Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables
## Describing Qualitative Information

A ***binary variable*** takes only the values zero and one and encodes qualitative information such as male or female. It is also called a ***zero-one variable*** and is most commonly known as a ***dummy variable***.

$Remark\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\plim}{plim}
\newcommand{\using}[1]{\stackrel{\mathrm{#1}}{=}}
\newcommand{\ffrac}{\displaystyle \frac}
\newcommand{\asim}{\overset{\text{a}}{\sim}}
\newcommand{\space}{\text{ }}
\newcommand{\bspace}{\;\;\;\;}
\newcommand{\QQQ}{\boxed{?\:}}
\newcommand{\void}{\left.\right.}
\newcommand{\Tran}[1]{{#1}^{\mathrm{T}}}
\newcommand{\d}[1]{\displaystyle{#1}}
\newcommand{\CB}[1]{\left\{ #1 \right\}}
\newcommand{\SB}[1]{\left[ #1 \right]}
\newcommand{\P}[1]{\left( #1 \right)}
\newcommand{\abs}[1]{\left| #1 \right|}
\newcommand{\norm}[1]{\left\| #1 \right\|}
\newcommand{\dd}{\mathrm{d}}
\newcommand{\Exp}{\mathrm{E}}
\newcommand{\RR}{\mathbb{R}}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\NN}{\mathbb{N}}
\newcommand{\ZZ}{\mathbb{Z}}
\newcommand{\QQ}{\mathbb{Q}}
\newcommand{\AcA}{\mathscr{A}}
\newcommand{\FcF}{\mathscr{F}}
\newcommand{\Var}[2][\,\!]{\mathrm{Var}_{#1}\left[#2\right]}
\newcommand{\Avar}[2][\,\!]{\mathrm{Avar}_{#1}\left[#2\right]}
\newcommand{\Cov}[2][\,\!]{\mathrm{Cov}_{#1}\left(#2\right)}
\newcommand{\Corr}[2][\,\!]{\mathrm{Corr}_{#1}\left(#2\right)}
\newcommand{\I}[1]{\mathrm{I}\left( #1 \right)}
\newcommand{\N}[1]{\mathcal{N} \left( #1 \right)}$

>The name of a dummy variable should indicate which outcome is coded as one: $\text{female}$ is a clearer column title than $\text{gender}$.

## A Single Dummy Independent Variable

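Consider a simple model: $\text{wage} = \beta_0 + \delta_0 \cdot\text{female} + \beta_1\cdot\text{educ} + u$. Because $\text{female}$ is a dummy variable, and assuming $\Exp\SB{u \mid \text{female},\text{educ}} = 0$, $\delta_0$ is the difference in expected wage between women and men with the same level of education: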

$$\delta_0 = \Exp\SB{\text{wage} \mid \text{female} = 1,\text{educ}} - \Exp\SB{\text{wage} \mid \text{female} = 0,\text{educ}}$$

This situation can be depicted graphically as an ***intercept shift*** between males and females.

Because $\text{male} + \text{female} = 1$ for every observation, the two dummies are perfectly collinear with the intercept. Including both of them together with an intercept is the ***dummy variable trap***, so only one of them can enter the model.
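A minimal sketch of why the trap arises, using NumPy and a made-up five-person sample (the data and variable names are illustrative assumptions): with an intercept and both dummies, the design matrix loses full column rank.

```python
import numpy as np

# Hypothetical sample: 1 = female, 0 = male (illustrative values only).
female = np.array([1, 0, 0, 1, 1])
male = 1 - female
const = np.ones_like(female)

# Intercept plus BOTH dummies: male + female reproduces the constant column.
X_trap = np.column_stack([const, male, female])
# Intercept plus ONE dummy: no exact linear dependence among the columns.
X_ok = np.column_stack([const, female])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(f"trap: rank {rank_trap} with {X_trap.shape[1]} columns")
print(f"ok:   rank {rank_ok} with {X_ok.shape[1]} columns")
```

When the rank is smaller than the number of columns, the OLS coefficients are not uniquely determined.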

Here $\text{female}$ is the dummy included in the model, so the omitted group, males, is the ***benchmark group***, or ***base group***: the group against which comparisons are made.

$Remark$

>Suppose the **base group** is $\text{male}$ and we test $H_0: \delta_0 = 0$. If the null is rejected, can we say that there is discrimination against women? *Not necessarily!* There might be other productivity characteristics that have not been controlled for and that also affect the wage.

An important application of dummy variables is ***policy analysis***. One special case is ***program evaluation***, where we would like to know the effect of economic or social programs on individuals, firms, neighborhoods, cities, and so on.

The simplest case has two groups of subjects. Those who do not participate form the ***control group***; those who do form the ***treatment group***, or ***experimental group***.

$Remark$

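>The coefficient on the participation dummy does not by itself tell us whether there is a causal effect; for that, participation must be unrelated to the unobserved factors in $u$.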

### Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable Is $\log\P{y}$

Generally, if $\hat\beta_1$ is the coefficient on a dummy variable, say $x_1$, when $\log\P{y}$ is the dependent variable, the *exact **percentage** difference* in the predicted $y$ when $x_1 = 1$ versus $x_1 = 0$ is

$$100\cdot\SB{\exp\P{\hat\beta_1} -1}$$
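As a small numerical illustration, the sketch below compares the exact formula with the rough "$100\cdot\hat\beta_1$" reading. The coefficient value is a hypothetical number chosen for illustration, not an estimate from these notes.

```python
import math

def exact_pct_diff(beta_hat: float) -> float:
    """Exact percentage difference in predicted y when the dummy goes from 0 to 1."""
    return 100.0 * (math.exp(beta_hat) - 1.0)

beta_hat = -0.297                  # hypothetical coefficient on a dummy in a log(y) equation
approx = 100.0 * beta_hat          # rough reading: 100 * beta
exact = exact_pct_diff(beta_hat)   # 100 * [exp(beta) - 1]
print(f"approximate: {approx:.1f}%, exact: {exact:.1f}%")
```

The gap between the two numbers grows with the magnitude of the coefficient, which is why the exact formula matters for large estimates.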

## Using Dummy Variables for Multiple Categories

The general principle for including dummy variables to indicate different groups: if the regression model is to have different intercepts for, say, $g$ groups, we need to include $g-1$ dummy variables in the model along with an intercept.

The intercept for the base group is the overall intercept in the model, and the dummy variable coefficient for a particular group represents the estimated difference in intercepts between that group and the base group. Including one more dummy variable results in the dummy variable trap. An alternative that some people use is to include all $g$ dummy variables and drop the overall intercept. This has two drawbacks:

1. It is more cumbersome to test for differences relative to a base group.
2. The way $R$-squared is computed changes:

$$R^2 = 1-\ffrac{\text{SSR}} {\text{SST}}\stackrel{\text{to}}{\longrightarrow} 1-\ffrac{\text{SSR}} {\text{SST}_0} \equiv R_0^2$$

where $\text{SST}_0 = \sum y_i^2$ is the total sum of squares that is not centered about the mean, and the resulting $R$-squared is called the ***uncentered $R$-squared***. It is rarely a good measure of goodness-of-fit: since $\text{SST}_0 = \text{SST} + n\bar y^2$, $\text{SST}_0$ is much larger than $\text{SST}$ unless $\bar y$ is close to $0$, which inflates $R_0^2$.
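The following sketch, on a made-up sample with $g = 3$ groups (all numbers are illustrative assumptions), regresses $y$ on all three dummies without an intercept and compares the centered and uncentered $R$-squared.

```python
import numpy as np

# Hypothetical data: three groups, two observations each.
group = np.array([0, 0, 1, 1, 2, 2])
y = np.array([10.0, 12.0, 15.0, 17.0, 20.0, 22.0])

# g dummies and no overall intercept: each coefficient is a group mean.
D = (group[:, None] == np.arange(3)).astype(float)
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
ssr = np.sum((y - D @ beta) ** 2)

sst = np.sum((y - y.mean()) ** 2)   # centered total sum of squares
sst0 = np.sum(y ** 2)               # uncentered total sum of squares

r2 = 1 - ssr / sst
r2_0 = 1 - ssr / sst0
print(f"R^2 = {r2:.4f}, uncentered R_0^2 = {r2_0:.4f}")
```

Because the mean of $y$ is far from zero in this sample, the uncentered measure is noticeably closer to one than the centered one, even though the fit is identical.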

### Incorporating Ordinal Information by Using Dummy Variables

Consider an example relating city credit ratings to municipal bond interest rates. The model is

$$\text{MBR} = \beta_0 + \beta_1 \cdot \text{CR} + \text{other factors}$$

Here $\text{CR}$ is an ***ordinal variable*** that takes integer values from $0$ to $4$. A single coefficient $\beta_1$ forces every one-step change in the rating to have the same effect, so we rewrite the model as

$$\text{MBR} = \beta_0 + \delta_1 \cdot \text{CR}_1 + \delta_2 \cdot \text{CR}_2 + \delta_3 \cdot \text{CR}_3 + \delta_4 \cdot \text{CR}_4 + \text{other factors}$$

Here we convert the **ordinal variable** (credit rating) into four **dummy variables**, where $\text{CR}_j = 1$ if $\text{CR} = j$ and $0$ otherwise. All effects are measured relative to the worst rating (the base category, $\text{CR} = 0$).
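A minimal sketch of this encoding in NumPy, using hypothetical ratings for six cities (the data are made up for illustration):

```python
import numpy as np

# Hypothetical credit ratings for six cities, each in {0, 1, 2, 3, 4}.
cr = np.array([0, 3, 1, 4, 2, 3])

# Column j-1 holds CR_j = 1 if CR == j, for j = 1..4.
# CR = 0 is the base category, so it gets no column (a row of zeros).
levels = np.arange(1, 5)
cr_dummies = (cr[:, None] == levels).astype(int)
print(cr_dummies)
```

Each row contains at most a single one, and the all-zero row corresponds to the base category.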

In some cases, the ordinal variable takes on too many values for a dummy variable to be included for each value. Instead, each dummy represents an interval of the ordinal variable's rank, such as $1$ to $10$, $11$ to $20$, and so on.

## Interactions Involving Dummy Variables
### Interactions among Dummy Variables

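An **interaction term** between dummy variables can also be included in the model. For example, interacting $\text{female}$ with a $\text{married}$ dummy allows the effect of marriage to differ between men and women.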

### Allowing for Different Slopes

$$\log\P{\text{wage}} = \beta_0 + \delta_0 \cdot\text{female} + \beta_1 \cdot\text{educ} + \delta_1 \text{female}\cdot\text{educ} + u$$

Then $\beta_0$ is the intercept for men and the corresponding slope is $\beta_1$; while for women, the intercept is $\beta_0+\delta_0$ and the slope is $\beta_1 + \delta_1$.

### Testing for Differences in Regression Functions across Groups

Sometimes we wish to test the null hypothesis that two groups follow the same regression function, against the alternative that the intercept or one or more of the slopes differ across the groups. In the preceding example, $H_0: \delta_1 = 0$ says only that the *return to* $\text{educ}$ is the same for men and women, whereas $H_0:\delta_0 = \delta_1 = 0$ says that the *whole model* is the same for the two groups.

Here's a new example. The overall model is

$$\text{GPA} = \beta_0 + \beta_1\cdot \text{SAT} + \beta_2\cdot\text{other} + u$$

Is it the same for men and women? We recast it as

$$\text{GPA} = \beta_0 + \delta_0\cdot\text{female} + \beta_1\cdot \text{SAT} + \delta_1\cdot \text{female} \cdot \text{SAT} + \beta_2\cdot\text{other} + \delta_2\cdot \text{female} \cdot \text{other} + u$$

Here the $\delta$s are the differences in the intercept and slopes between women and men for the corresponding variables. Thus the null hypothesis is $H_0: \delta_0 = \delta_1 = \delta_2 = 0$. If it is rejected, at least one of the $\delta_j$ is different from $0$, and the model differs between men and women.

***
Further, if we assume the two groups share the same error variance, there is a convenient way to compute the $F$ statistic, called the ***Chow statistic***.

Consider the model $\text{GPA} = \beta_{g,0} + \beta_{g,1}x_1 + \cdots+\beta_{g,k}x_k + u$, where $g=1,2$ indexes the two groups. We would like to test whether the intercept and all slopes are the same across the two groups.

1. Estimate the **unrestricted model** separately for each group, which yields $\text{SSR}_1$ and $\text{SSR}_2$.
2. Compute $\text{SSR}_{ur} = \text{SSR}_1 + \text{SSR}_2$.
3. Estimate the **restricted model** to obtain $\text{SSR}_p$. The restricted model is simpler: pool the two groups and estimate a single equation, $\text{GPA} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$.

$$F = \ffrac{\ffrac{\text{SSR}_p-\text{SSR}_{ur}}{k+1}}{\ffrac{\text{SSR}_{ur}}{n-2\P{k+1}}}$$

where $n=n_1+n_2$ is the *total number* of observations. The statistic is valid only under homoskedasticity. In many cases, however, it is more interesting to allow for an intercept difference between the groups and then to test for slope differences.
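The three steps and the $F$ formula above can be sketched in NumPy as follows. The helper names `ssr` and `chow_statistic` and the simulated data are assumptions made for illustration; each design matrix is expected to contain a constant column, so it has $k+1$ columns.

```python
import numpy as np

def ssr(X: np.ndarray, y: np.ndarray) -> float:
    """Sum of squared OLS residuals; X must already contain a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_statistic(X1, y1, X2, y2) -> float:
    """Chow F statistic for equal intercept and slopes across two groups."""
    k_plus_1 = X1.shape[1]                  # parameters per group
    n = len(y1) + len(y2)                   # total number of observations
    ssr_ur = ssr(X1, y1) + ssr(X2, y2)      # steps 1-2: separate regressions
    ssr_p = ssr(np.vstack([X1, X2]), np.concatenate([y1, y2]))  # step 3: pooled
    return ((ssr_p - ssr_ur) / k_plus_1) / (ssr_ur / (n - 2 * k_plus_1))

# Simulated data (illustrative only): group 2 has a different intercept and slope.
rng = np.random.default_rng(0)
n1, n2 = 60, 50
x1, x2 = rng.normal(size=n1), rng.normal(size=n2)
X1 = np.column_stack([np.ones(n1), x1])
X2 = np.column_stack([np.ones(n2), x2])
y1 = 1.0 + 0.5 * x1 + rng.normal(scale=0.3, size=n1)
y2 = 2.0 + 1.5 * x2 + rng.normal(scale=0.3, size=n2)

F = chow_statistic(X1, y1, X2, y2)
print(f"Chow F = {F:.2f} with ({X1.shape[1]}, {n1 + n2 - 2 * X1.shape[1]}) df")
```

The statistic would then be compared with the critical value of the $F_{k+1,\,n-2(k+1)}$ distribution; that step is omitted here to keep the sketch dependent on NumPy only.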

The material that follows in the text is interesting but is not included in the slides, so it is skipped here.

## A Binary Dependent Variable: The Linear Probability Model

Now the dependent variable is binary, so the $\beta_j$ cannot be interpreted as before, but the model is still useful. We call it the ***linear probability model (LPM)***. If the zero conditional mean assumption $\text{MLR}.4$ holds, $\Exp\SB{u\mid x_1,\dots,x_k} = 0$, then, as always,

$$\Exp\SB{y\mid \mathbf x} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

where $\mathbf x$ is shorthand for all of the explanatory variables. The key point is that, because $y$ takes only the values zero and one, $P\CB{y=1\mid \mathbf x} = \Exp\SB{y\mid \mathbf x}$. Hence the probability of success is

$$p\P{\mathbf x} = P\CB{y=1\mid \mathbf x} = \Exp\SB{y\mid \mathbf x} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

$P\CB{y=1\mid \mathbf x}$ is called the ***response probability***. Besides, $P\CB{y=0\mid \mathbf x} = 1 - P\CB{y=1\mid \mathbf x}$, so both probabilities are linear functions of the $x_j$. Most importantly, we have

$$\beta_j = \ffrac{\partial P\CB{y=1\mid \mathbf x}} {\partial x_j}$$

Then by OLS we have the estimated equation: $\hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_k x_k$, where $\hat y$ is the predicted probability of success. The parameters are interpreted as follows: *holding the other variables fixed, a one-unit increase in $x_j$ changes the probability of success by $\beta_j$*.

Obviously the relation cannot be linear over the whole range of the $x_j$, but the model is still very useful. It usually works well for values of the independent variables that are near the averages in the sample. Now we discuss its **disadvantages**.

$\P{1}$ Predicted Value Exceeds the Boundary

The fitted value $\hat y_i$ can fall below $0$ or above $1$, which makes no sense as a probability. To obtain a prediction of the binary outcome itself, we define the predicted value $\tilde y_i$ as

$$\tilde y_i = \begin{cases}
1, &\text{if }\hat y_i \geq 0.5\\
0, &\text{if }\hat y_i < 0.5
\end{cases}$$

Then we can define a goodness-of-fit measure, the ***percent correctly predicted***, as the percentage of observations for which $\tilde y_i = y_i$.
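A minimal sketch of this measure, assuming the fitted values from an LPM are already available. The function name and the four data points are hypothetical.

```python
import numpy as np

def percent_correctly_predicted(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Percentage of observations whose thresholded prediction equals y."""
    y_tilde = (y_hat >= 0.5).astype(int)   # 1 if fitted value >= 0.5, else 0
    return float(100.0 * np.mean(y_tilde == y))

# Hypothetical outcomes and LPM fitted values; two of them lie outside [0, 1].
y = np.array([1, 0, 1, 0])
y_hat = np.array([1.08, 0.20, 0.40, -0.10])
pcp = percent_correctly_predicted(y, y_hat)
print(f"percent correctly predicted: {pcp:.1f}")
```

Here only the third observation is misclassified, because its fitted value of $0.40$ falls below the threshold although $y = 1$.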

$\P{2}$ Counterintuitive Marginal Probability Effects

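The marginal effect never shrinks: each additional unit of $x_j$ changes the probability by the same amount $\beta_j$, no matter how large $x_j$ already is. That is the least plausible feature of the model.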

$\P{3}$ Heteroskedasticity

The **LPM** violates the homoskedasticity assumption among the Gauss-Markov assumptions. When $y$ is a binary variable, its variance, conditional on $\mathbf x$, is

$$\Var{y \mid \mathbf x} = p\P{\mathbf x} \SB{1-p\P{\mathbf x}}$$

Since $p\P{\mathbf x} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, unless the probability does not depend on any of the independent variables, there *must* be heteroskedasticity. The OLS estimators remain unbiased, but the usual standard errors are no longer valid, so the $t$ and $F$ statistics are not justified even in large samples. In practice, the usual OLS statistics are often not far off and are still commonly used.
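To see the heteroskedasticity concretely, the sketch below evaluates $p\P{\mathbf x}\SB{1-p\P{\mathbf x}}$ at a few values of a single regressor. The coefficients are hypothetical and chosen only so that the probabilities stay inside the unit interval.

```python
import numpy as np

# Hypothetical LPM with one regressor: p(x) = 0.1 + 0.2 * x (illustrative only).
beta0, beta1 = 0.1, 0.2
x = np.array([0.0, 1.0, 2.0, 3.0])

p = beta0 + beta1 * x        # response probability at each x
var_y = p * (1 - p)          # Var(y | x) = p(x) [1 - p(x)]
for xi, pi, vi in zip(x, p, var_y):
    print(f"x = {xi:.0f}: p = {pi:.2f}, Var(y|x) = {vi:.2f}")
```

The conditional variance changes with $x$ and is largest where $p\P{\mathbf x} = 0.5$.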

## More on Policy Analysis and Program Evaluation

For a model with one dummy variable, we have a treatment group and a control group, for which the variable takes the values $1$ and $0$ respectively. Several problems can arise: the members of each group may not be randomly assigned; other factors may be systematically related to the binary independent variable of interest; and there may be ***self-selection***, meaning that individuals choose for themselves which group to join instead of being assigned at random by the researchers.

The consequence can be that $\Exp\SB{u\mid \text{dummy} = 1}\neq\Exp\SB{u\mid \text{dummy} = 0}$, which makes the OLS estimators biased. In that case the explanatory variable is said to be **endogenous**.
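A small simulation can illustrate the bias. Everything in it is an assumption made for illustration: the true effect of $1$, the normal errors, and the selection rule under which individuals with larger $u$ are more likely to participate.

```python
import numpy as np

def dummy_coefficient(d: np.ndarray, y: np.ndarray) -> float:
    """OLS coefficient on d from regressing y on a constant and the dummy d."""
    X = np.column_stack([np.ones(len(d)), d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0                      # assumed treatment effect
u = rng.normal(size=n)                 # unobserved factors affecting y

# Random assignment: participation is independent of u.
d_random = rng.integers(0, 2, size=n)
# Self-selection: individuals with larger u are more likely to participate.
d_self = (u + rng.normal(size=n) > 0).astype(int)

est_random = dummy_coefficient(d_random, 2.0 + true_effect * d_random + u)
est_self = dummy_coefficient(d_self, 2.0 + true_effect * d_self + u)
print(f"random assignment: {est_random:.3f}, self-selection: {est_self:.3f}")
```

Under random assignment the estimate should land close to the assumed effect, whereas under self-selection it also picks up the difference in $\Exp\SB{u\mid \text{dummy}}$ between the groups.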

## Interpreting Regression Results with Discrete Dependent Variables

skipped.
***