# Regression with a Binary Dependent Variable

Probit and logit regression are nonlinear regression models designed for binary dependent variables. 


## The Linear Probability Model

The Linear Probability Model (LPM) is simply the application of ordinary least squares (OLS) to binary outcomes instead of continuous outcomes. 

\begin{equation}
\begin{split}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + u_i
\end{split}
\end{equation} 

Because $Y$ is binary, $E(Y|X_1, X_2, ..., X_k)= Pr(Y = 1 | X_1, X_2, ..., X_k)$ so for the linear probability model,

\begin{equation}
\begin{split}
Pr(Y = 1| X_1, X_2, ..., X_k) &= \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + ... + \beta_k X_{k} 
\end{split}
\end{equation} 

The regression coefficient $\beta_1$ is the change in the probability that $Y=1$ associated with a unit change in $X_1$, holding constant the other regressors, and so forth for $\beta_2, ..., \beta_k$.

### Disadvantage

The main disadvantage of the LPM is its linear functional form. Because probabilities must lie between zero and one, the effect on the probability that $Y=1$ of a given change in $X$ must be nonlinear.

The LPM assumes that the marginal effect of $X$ is constant for all values of $X$, whereas in practice the marginal effect almost always varies with $X$. In extreme cases, this misspecification of the functional form can even produce predicted probabilities that are less than 0 or greater than 1.
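
The out-of-range problem can be seen on a tiny example. The sketch below fits a one-regressor LPM by closed-form OLS in plain Python. The data are hypothetical toy values chosen for illustration, not the employment data used later.

```python
# Toy data (hypothetical, not from the employment data set): a binary outcome
# that switches from 0 to 1 as x increases.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for a model with a single regressor.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# Fitted "probabilities" implied by the linear probability model.
fitted = [intercept + slope * xi for xi in x]

for xi, p in zip(x, fitted):
    flag = "" if 0.0 <= p <= 1.0 else "  <-- outside [0, 1]"
    print(f"x = {xi}: fitted probability = {p:.3f}{flag}")
```

In this toy sample the fitted values at the smallest and largest $x$ fall below 0 and above 1, even though every observed outcome is 0 or 1.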


## Probit Model

The population probit model with multiple regressors is:

\begin{equation}
\begin{split}
Pr(Y = 1| X_1, X_2, ..., X_k) &= \Phi(\beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + ... + \beta_k X_{k})
\end{split}
\end{equation} 

where the dependent variable $Y$ is binary, $\Phi$ is the cumulative standard normal distribution function, and $X_1,X_2$ and so on are regressors. 

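The coefficient $\beta_1$ is the change in the $z$-value arising from a unit change in $X_1$, holding constant $X_2,...,X_k$. The effect on the predicted probability is obtained by evaluating $\Phi$ at the $z$-values before and after the change, so it depends on the starting values of the regressors.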
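
To see why a constant change in the $z$-value does not imply a constant change in probability, consider the following minimal sketch. The coefficients `beta0` and `beta1` are hypothetical values chosen for illustration, and $\Phi$ is computed from the error function in Python's standard library.

```python
import math


def std_normal_cdf(z):
    """Cumulative standard normal distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.5


def probit_prob(x1):
    """Predicted Pr(Y = 1 | X1 = x1) in a one-regressor probit model."""
    return std_normal_cdf(beta0 + beta1 * x1)


# A unit change in X1 always shifts the z-value by beta1, but the implied
# change in probability depends on where the change starts.
effect_mid = probit_prob(2.0) - probit_prob(1.0)   # z moves from -0.5 to 0.0
effect_tail = probit_prob(7.0) - probit_prob(6.0)  # z moves from 2.0 to 2.5

print(f"Change in probability near the middle: {effect_mid:.4f}")
print(f"Change in probability in the tail:     {effect_tail:.4f}")
```

The change near the middle of the distribution is much larger than the change in the tail, which is the nonlinearity that the LPM cannot capture.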

## Logit Regression
The logit model is very similar to the probit model. The only difference is that the cumulative standard normal distribution function $\Phi$ is replaced by the cumulative standard logistic distribution function, denoted $F$. The population logit model of the binary dependent variable $Y$ with multiple regressors is:

\begin{equation}\label{eq:1}
\begin{split}
Pr(Y = 1| X_1, X_2, ..., X_k) &= F(\beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + ... + \beta_k X_{k}) \\
 &= \frac{1}{1+ e^{-(\beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + ... + \beta_k X_{k})}}
\end{split}
\end{equation}
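
As a rough illustration of how the two link functions differ, the sketch below evaluates the logistic function and the standard normal CDF on a small grid of index values. The grid and the function names are assumptions made for this example.

```python
import math


def logistic_cdf(z):
    """Standard logistic CDF, F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def std_normal_cdf(z):
    """Cumulative standard normal distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Both functions map any real index into (0, 1) and equal 0.5 at z = 0,
# but the logistic CDF approaches 0 and 1 more slowly (heavier tails).
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}: logit F(z) = {logistic_cdf(z):.4f}, "
          f"probit Phi(z) = {std_normal_cdf(z):.4f}")
```

Because the logistic distribution has a larger variance than the standard normal, logit coefficients are typically larger in magnitude than probit coefficients estimated on the same data, even when the predicted probabilities are close.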



## Empirical Exercise in Stata

Note: This material is based on the [Companion Website for Stock and Watson's Introduction to Econometrics](https://wps.pearsoned.com/aw_stock_ie_3/178/45691/11696959.cw/index.html).

### Set Stata Magic in Python

In [1]:
%%capture
import os
os.chdir('/Program Files/Stata17/utilities')
from pystata import config
config.init('se')

### Import Data from Web

In [2]:
%%stata
* Open Stata file from web page book
use https://wps.pearsoned.com/wps/media/objects/11422/11696965/data3eu/employment_08_09.dta, clear


. * Open Stata file from web page book
. use https://wps.pearsoned.com/wps/media/objects/11422/11696965/data3eu/employ
> ment_08_09.dta, clear

. 


In [3]:
%%stata
des


Contains data from https://wps.pearsoned.com/wps/media/objects/11422/11696965/d
> ata3eu/employment_08_09.dta
 Observations:         5,412                  
    Variables:            21                  15 Jun 2014 18:57
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
age             byte    %8.0g                 Age
race            byte    %8.0g                 Race
earnwke         double  %10.0g                (e&dp) Earnings per week
employed        float   %9.0g                 
unemployed      float   %9.0g                 
married         float   %9.0g                 
union           float   %9.0g                 
ne_states       float   %9.0g                 
so_states       float   %9.0g                 
ce_states       float   %9.0g                 
we_states      

a. What fraction of workers in the sample were employed in April 2009? Compute a 95% confidence interval for the probability that a worker was employed in April 2009, conditional on being employed in April 2008.

In [4]:
%%stata
tab employed
sum employed
dis r(mean)*100 "% workers were employed in Apr-2009"
local se = r(sd) / (r(N))^0.5 // standard error
dis "standard error " `se'
local ci_l = r(mean) -`se'*1.96 
local ci_u = r(mean) +`se'*1.96 
dis "CI Lower Bound " `ci_l' 
dis "CI Upper Bound " `ci_u'

** We can use the stata $ci$ command
ci means employed


. tab employed

   employed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        674       12.45       12.45
          1 |      4,738       87.55      100.00
------------+-----------------------------------
      Total |      5,412      100.00

. sum employed

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    employed |      5,412    .8754619    .3302249          0          1

. dis r(mean)*100 "% workers were employed in Apr-2009"
87.546194% workers were employed in Apr-2009

. local se = r(sd) / (r(N))^0.5 // standard error

. dis "standard error " `se'
standard error .00448881

. local ci_l = r(mean) -`se'*1.96 

. local ci_u = r(mean) +`se'*1.96 

. dis "CI Lower Bound " `ci_l' 
CI Lower Bound .86666387

. dis "CI Upper Bound " `ci_u'
CI Upper Bound .88426

. 
. ** We can use the stata $ci$ command
. ci means employed

    Variable
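
As a cross-check, the manual interval can be reproduced in Python from the summary statistics printed by `sum employed`. This is only a sketch: the numbers are copied from the displayed (rounded) output, so the last digits may differ slightly from Stata's.

```python
import math

# Summary statistics copied from the `sum employed` output above.
n_obs = 5412
mean = 0.8754619
sd = 0.3302249

# Standard error of the sample mean and a normal-approximation 95% interval,
# mirroring the formulas used in the Stata cell.
se = sd / math.sqrt(n_obs)
z_crit = 1.96
ci_lower = mean - z_crit * se
ci_upper = mean + z_crit * se

print(f"standard error  {se:.8f}")
print(f"CI lower bound  {ci_lower:.6f}")
print(f"CI upper bound  {ci_upper:.6f}")
```

Note that `ci means` uses a critical value from the $t$ distribution rather than 1.96. With more than 5,000 observations the two are nearly identical.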

b. Regress employed on $age$ and $age^2$, using a linear probability model.

In [5]:
%%stata
* gen age squared
gen agesqr = age*age

eststo ols0: reg employed age agesqr, r


. * gen age squared
. gen agesqr = age*age

. 
. eststo ols0: reg employed age agesqr, r

Linear regression                               Number of obs     =      5,412
                                                F(2, 5409)        =      37.50
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0197
                                                Root MSE          =     .32702

------------------------------------------------------------------------------
             |               Robust
    employed | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0282722   .0032812     8.62   0.000     .0218398    .0347047
      agesqr |  -.0003266   .0000388    -8.42   0.000    -.0004026   -.0002506
       _cons |   .3074929   .0668583     4.60   0.000     .1764237    .4385621
---

b.i. Based on this regression, was age a statistically significant determinant of employment in April 2009?

b.ii. Is there evidence of a nonlinear effect of age on the probability of being employed?

In [6]:
%%stata
** Test joint significance of age and age^2
test age agesqr


. ** Test joint significance of age and age^2
. test age agesqr

 ( 1)  age = 0
 ( 2)  agesqr = 0

       F(  2,  5409) =   37.50
            Prob > F =    0.0000

. 


b.iii. Compute the predicted probability of employment for a 20-year-old worker, a 40-year-old worker, and a 60-year-old worker.

In [7]:
%%stata
forvalues i = 20(20)60 {
	scalar age`i' = _b[_cons] + _b[age]*`i' + _b[agesqr]*(`i'^2)
	dis "Predicted Prob. of Employment for a " `i' " year old worker  " age`i'     
}


. forvalues i = 20(20)60 {
  2.         scalar age`i' = _b[_cons] + _b[age]*`i' + _b[agesqr]*(`i'^2)
  3.         dis "Predicted Prob. of Employment for a " `i' " year old worker  
> " age`i'     
  4. }
Predicted Prob. of Employment for a 20 year old worker  .74228415
Predicted Prob. of Employment for a 40 year old worker  .91576846
Predicted Prob. of Employment for a 60 year old worker  .82794584

. 
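
The same predictions can be checked by hand from the coefficient table in (b). The sketch below plugs the displayed LPM coefficients into the regression equation. Because the displayed coefficients are rounded, the results agree with Stata's only to about three decimal places.

```python
# LPM coefficients as displayed in the regression output for part (b).
b_cons = 0.3074929
b_age = 0.0282722
b_agesqr = -0.0003266


def lpm_prob(age):
    """Predicted probability of employment from the linear probability model."""
    return b_cons + b_age * age + b_agesqr * age ** 2


lpm_preds = {age: lpm_prob(age) for age in (20, 40, 60)}

for age, p in lpm_preds.items():
    print(f"Predicted Prob. of Employment for a {age} year old worker  {p:.4f}")
```

The predicted probability rises between ages 20 and 40 and falls again by age 60, which reflects the negative coefficient on `agesqr`.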


c. Regress employed on $age$ and $age^2$, using a probit model. Compute the predicted probability of employment for a 20-year-old worker, a 40-year-old worker, and a 60-year-old worker.

In [8]:
%%stata
eststo prob0: probit employed age agesqr, r

forvalues i = 20(20)60 {
	scalar age`i'p = normal(_b[_cons] + _b[age]*`i' + _b[agesqr]*(`i'^2))
	dis "Predicted Prob. of Employment for a " `i' " year old worker  " age`i'p     
}


. eststo prob0: probit employed age agesqr, r

Iteration 0:   log pseudolikelihood = -2034.2101  
Iteration 1:   log pseudolikelihood = -1987.1391  
Iteration 2:   log pseudolikelihood = -1986.9186  
Iteration 3:   log pseudolikelihood = -1986.9186  

Probit regression                                       Number of obs =  5,412
                                                        Wald chi2(2)  =  92.89
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -1986.9186                       Pseudo R2     = 0.0232

------------------------------------------------------------------------------
             |               Robust
    employed | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .1217233   .0128959     9.44   0.000     .0964477    .1469988
      agesqr |  -.0014125   .0001555    -9.09   0.000    -.0017172   -.0011078

d. Regress employed on $age$ and $age^2$, using a logit model. Compute the predicted probability of employment for a 20-year-old worker, a 40-year-old worker, and a 60-year-old worker.

In [9]:
%%stata
eststo log0: logit employed age agesqr, r

forvalues i = 20(20)60 {
	scalar xb`i' = _b[_cons] + _b[age]*`i' + _b[agesqr]*(`i'^2)
	scalar age`i'l =  1/(1+exp(-xb`i'))
	dis "Predicted Prob. of Employment for a " `i' " year old worker  " age`i'l 
}


. eststo log0: logit employed age agesqr, r

Iteration 0:   log pseudolikelihood = -2034.2101  
Iteration 1:   log pseudolikelihood = -1989.3895  
Iteration 2:   log pseudolikelihood = -1986.4342  
Iteration 3:   log pseudolikelihood = -1986.4295  
Iteration 4:   log pseudolikelihood = -1986.4295  

Logistic regression                                     Number of obs =  5,412
                                                        Wald chi2(2)  =  97.85
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -1986.4295                       Pseudo R2     = 0.0235

------------------------------------------------------------------------------
             |               Robust
    employed | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .2254662   .0234547     9.61   0.000     .1794959    .2714366
      agesqr |  -.0026237   .
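
One way to compare the three models is to ask at what age the predicted probability of employment peaks. In each model the index is quadratic in age, and both $\Phi$ and $F$ are strictly increasing, so the peak occurs where the index peaks, at $-\beta_{age} / (2\beta_{agesqr})$. The sketch below applies this formula to the displayed (rounded) coefficients from (b), (c), and (d).

```python
# Coefficients on age and agesqr as displayed in the three regression outputs.
coefs = {
    "LPM": (0.0282722, -0.0003266),
    "Probit": (0.1217233, -0.0014125),
    "Logit": (0.2254662, -0.0026237),
}


def peak_age(b_age, b_agesqr):
    """Age that maximizes b_age * age + b_agesqr * age**2 when b_agesqr < 0."""
    return -b_age / (2.0 * b_agesqr)


peaks = {model: peak_age(*b) for model, b in coefs.items()}

for model, age in peaks.items():
    print(f"{model:<6} predicted employment probability peaks at age {age:.1f}")
```

Although the coefficients differ substantially in scale across the three models, the implied peak ages are all close to 43.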

f. The data set includes variables measuring the workers’ educational attainment, sex, race, marital status, and weekly earnings in April 2008. Run a regression including these variables and investigate whether the conclusions on the effect of age on employment from (b)-(d) are affected by omitted variable bias.

Generate variables 

In [10]:
%%stata
* Generate dummy variable for race
gen white =race==1

* Generate dummy variables for education
g educ_1 =(educ_lths==1 | educ_hs==1)
g educ_2 =(educ_somecol==1|educ_aa==1)
g educ_3 =educ_bac==1
g educ_4 =educ_adv==1


. * Generate dummy variable for race
. gen white =race==1

. 
. * Generate dummy variables for education
. g educ_1 =(educ_lths==1 | educ_hs==1)

. g educ_2 =(educ_somecol==1|educ_aa==1)

. g educ_3 =educ_bac==1

. g educ_4 =educ_adv==1

. 


Results from (b)-(f) can be summarized in the following tables:

In [19]:
%%stata
gl controls age agesqr white female married earnwke educ_2 educ_3 educ_4
qui: eststo ols:reg employed $controls , r
qui: eststo probit: probit employed $controls , r
qui: eststo logit: logit employed $controls , r
    
esttab ols0 prob0 log0, se star(* 0.10 ** 0.05 *** 0.01) b(%5.3f) ///
nonumbers mtitles("OLS" "Probit" "Logit") ///
title(Regressions with a Binary Dependent Variables: No controls)

esttab ols probit logit, se star(* 0.10 ** 0.05 *** 0.01) b(%5.3f) ///
nonumbers mtitles("OLS" "Probit" "Logit") ///
title(Regressions with a Binary Dependent Variables: Additional Controls)


. gl controls age agesqr white female married earnwke educ_2 educ_3 educ_4

. qui: eststo ols:reg employed $controls , r

. qui: eststo probit: probit employed $controls , r

. qui: eststo logit: logit employed $controls , r

.     
. esttab ols0 prob0 log0, se star(* 0.10 ** 0.05 *** 0.01) b(%5.3f) ///
> nonumbers mtitles("OLS" "Probit" "Logit" "OLS" "Probit" "Logit") ///
> title(Regressions with a Binary Dependent Variables: No controls)

Regressions with a Binary Dependent Variables: No controls
------------------------------------------------------------
                      OLS          Probit           Logit   
------------------------------------------------------------
main                                                        
age                 0.028***        0.122***        0.225***
                  (0.003)         (0.013)         (0.023)   

agesqr             -0.000***       -0.001***       -0.003***
                  (0.000)         (0.000)         (0.000)   

_cons

The results in (a)–(f) were based on the probability of employment. Workers who are not employed can be either (i) unemployed or (ii) out of the labor force. Do the conclusions in (a)–(f) also hold for workers who became unemployed?

In [22]:
%%stata
qui: eststo uols0: reg unemployed age agesqr, r
qui: eststo uprobit0: probit unemployed age agesqr, r
qui: eststo ulogit0: logit unemployed age agesqr, r
qui: eststo uols: reg unemployed $controls , r
qui: eststo uprobit: probit unemployed $controls , r
qui: eststo ulogit: logit unemployed $controls , r

esttab uols0 uprobit0 ulogit0, se star(* 0.10 ** 0.05 *** 0.01) b(%5.3f) ///
nonumbers mtitles("OLS" "Probit" "Logit") ///
title(Regressions with a Binary Dependent Variables)

esttab uols uprobit ulogit, se star(* 0.10 ** 0.05 *** 0.01) b(%5.3f) ///
nonumbers mtitles("OLS" "Probit" "Logit") ///
title(Regressions with a Binary Dependent Variables: Additional Controls)


. qui: eststo uols0: reg unemployed age agesqr, r

. qui: eststo uprobit0: probit unemployed age agesqr, r

. qui: eststo ulogit0: logit unemployed age agesqr, r

. qui: eststo uols: reg unemployed $controls , r

. qui: eststo uprobit: probit unemployed $controls , r

. qui: eststo ulogit: logit unemployed $controls , r

. 
. esttab uols0 uprobit0 ulogit0, se star(* 0.10 ** 0.05 *** 0.01) b(%5.3f) ///
> nonumbers mtitles("OLS" "Probit" "Logit") ///
> title(Regressions with a Binary Dependent Variables)

Regressions with a Binary Dependent Variables
------------------------------------------------------------
                      OLS          Probit           Logit   
------------------------------------------------------------
main                                                        
age                -0.007***       -0.065***       -0.142***
                  (0.002)         (0.017)         (0.037)   

agesqr              0.000***        0.001***        0.002***
                