In [4]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

How to read probit regressions:
1. The probit model gives us the z-score (or index) that we convert to a probability using the normal CDF.
	A.  For Model (2), the equation is: z = -5.185 + (-0.235 × Female) + (0.509 × ln(income)) + (-0.019 × Age) + (0.028 × Age²/100)
2. Calculate the z-score for females
	A. For females, we set Female = 1, and use the mean values for the other variables:
	B. z_female = -5.185 + (-0.235 × 1) + (0.509 × 10.40) + (-0.019 × 38.64) + (0.028 × 15.76) = -5.185 - 0.235 + 5.294 - 0.734 + 0.441= -0.419
3. Calculate the z-score for males
	A. Set Female = 0
	B. z_male = -5.185 + (-0.235 × 0) + (0.509 × 10.40) + (-0.019 × 38.64) + (0.028 × 15.76) = -5.185 + 0 + 5.294 - 0.734 + 0.441 = -0.184
4. Convert z-scores to probabilities using the normal CDF
	A. Fraction of females purchasing = Φ(-0.419) ≈ 0.338 or 33.8%
	B. Fraction of males purchasing = Φ(-0.184) ≈ 0.427 or 42.7%
5. Calculate how much more likely males are to purchase than females
	A. We can express this as a ratio: Male-to-female ratio = 0.427/0.338 ≈ 1.26
	B. This means males are about 1.26 times as likely to purchase a computer as females.
	C. Percentage difference = (0.427 - 0.338)/0.338 × 100% ≈ 26.3% 
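
The five steps above can be reproduced in a few lines. This is a minimal sketch that hard-codes the Model (2) coefficients and sample means quoted in the steps; the variable and function names are illustrative, and `statistics.NormalDist` from the standard library stands in for the normal CDF Φ.

```python
from statistics import NormalDist

# Coefficients from Model (2) and sample means, as quoted in the steps above
b_const, b_female, b_lninc, b_age, b_age2 = -5.185, -0.235, 0.509, -0.019, 0.028
mean_lninc, mean_age, mean_age2 = 10.40, 38.64, 15.76  # last value is Age²/100


def probit_index(female):
    """Linear index z = X'β for a given value of the Female dummy."""
    return (b_const + b_female * female + b_lninc * mean_lninc
            + b_age * mean_age + b_age2 * mean_age2)


normal_cdf = NormalDist().cdf  # standard normal CDF, Φ

z_female = probit_index(1)
z_male = probit_index(0)
p_female = normal_cdf(z_female)
p_male = normal_cdf(z_male)
ratio = p_male / p_female

print(f"z_female = {z_female:.3f}, P(purchase | female) = {p_female:.3f}")
print(f"z_male   = {z_male:.3f}, P(purchase | male)   = {p_male:.3f}")
print(f"male-to-female ratio = {ratio:.2f}")
```

Because Φ is nonlinear, the gap between the two probabilities depends on where the other variables are held; other values for income or age would give a different ratio.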

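A simpler way to compare purchase rates by gender would be to run a linear regression (ordinary least squares, or OLS) with the binary purchase outcome (1 if purchased, 0 if not) as the dependent variable and the female indicator as the only independent variable. This is a linear probability model (LPM), and the coefficient on the female indicator is the raw difference in purchase rates between females and males.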

What is the marginal effect of a proportional change (meaning a 1% increase) in income on the purchase probability in the regression in col. (2), evaluated at mean income? 
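- The marginal effect evaluated at the mean income is 0.509 × 0.01 × φ(X'β)
- We multiply by 0.01 because income enters in logs (a linear-log specification). By the chain rule, ∂P/∂income = 0.509 × φ(X'β) × (1/income), so ΔP ≈ 0.509 × φ(X'β) × (Δincome/income), and a 1% increase means Δincome/income = 0.01
- φ(X'β) is the standard normal density (PDF) evaluated at the linear predictor X'β, computed at the mean values. It is the derivative of the normal CDF Φ, and it rescales the coefficient from index (z-score) units into probability units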
- X'β = -5.185 (constant) + (-0.235 × 0.361) + (0.509 × 10.40) + (-0.019 × 38.64) + (0.028 × 15.76) = -5.185 - 0.085 + 5.294 - 0.734 + 0.441 = -0.269
- φ(-0.269) = (1/√(2π)) × e^(-(-0.269)²/2) ≈ 0.385
- Therefore, at average income, a 1% increase in income leads to a 0.509 × 0.01 × 0.385 × 100 (for percentage terms) ≈ 0.196 percentage point increase in the probability of purchasing a computer.
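
The calculation above can be checked numerically. This is a minimal sketch that assumes the coefficients and sample means quoted in the bullets; the dictionary keys are illustrative names, and `NormalDist().pdf` from the standard library supplies the standard normal density φ.

```python
from statistics import NormalDist

# Coefficients from col. (2) and sample means, as quoted in the text
beta = {"const": -5.185, "female": -0.235, "ln_income": 0.509,
        "age": -0.019, "age2_100": 0.028}
means = {"const": 1.0, "female": 0.361, "ln_income": 10.40,
         "age": 38.64, "age2_100": 15.76}

# Linear predictor X'β at the sample means
xb = sum(beta[k] * means[k] for k in beta)

# Standard normal density at the linear predictor (the probit scale factor)
density = NormalDist().pdf(xb)

# Effect of a 1% rise in income, expressed in percentage points
effect_pp = beta["ln_income"] * 0.01 * density * 100

print(f"X'β = {xb:.3f}, φ(X'β) = {density:.3f}, effect = {effect_pp:.3f} pp")
```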

What is the effect evaluated at an income of 50,000?
- First work out the value of ln(income) when income = 50,000: ln(50,000) ≈ 10.8198, compared with 10.40 at mean income.
- Therefore, we redo the linear predictor to get X'β = -5.185 (constant) + (-0.235 × 0.361) + (0.509 × 10.8198) + (-0.019 × 38.64) + (0.028 × 15.76) ≈ -0.0554
- φ(-0.0554) ≈ 0.398, using the standard normal density as in the mean-income case
- Therefore, for a person with an income of $50,000 and average values of the other variables, a 1% increase in income leads to a 0.509 × 0.01 × 0.398 × 100 (for percentage terms) ≈ 0.203 percentage point increase in the probability of purchasing a computer. The comparison group is another individual with an income of $50,000 who does not receive a 1% increase in income.
- Note we don't need to adjust the 0.509. The coefficient 0.509 is the change in the probit index (the z-score) for a one-unit change in ln(income), so it stays constant across income levels. Only the scale factor φ(X'β) changes with income.
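
To see how the effect moves with income, the same formula 0.509 × 0.01 × φ(X'β) can be wrapped in a function. This is a sketch under the assumption that the other variables stay at the sample means quoted above; `income_effect_pp` is a hypothetical helper name, not part of any library.

```python
import math
from statistics import NormalDist

B_LN_INCOME = 0.509

# Part of X'β that does not depend on income (other variables at their means)
base_index = -5.185 + (-0.235 * 0.361) + (-0.019 * 38.64) + (0.028 * 15.76)


def income_effect_pp(income):
    """Percentage point change in purchase probability from a 1% income rise."""
    xb = base_index + B_LN_INCOME * math.log(income)
    return B_LN_INCOME * 0.01 * NormalDist().pdf(xb) * 100


xb_50k = base_index + B_LN_INCOME * math.log(50_000)
effect_50k = income_effect_pp(50_000)
effect_mean = income_effect_pp(math.exp(10.40))  # income implied by mean ln(income)

print(f"ln(50,000) = {math.log(50_000):.4f}, X'β = {xb_50k:.4f}")
print(f"effect at 50,000      = {effect_50k:.3f} pp")
print(f"effect at mean income = {effect_mean:.3f} pp")
```

The coefficient is the same in both calls; only the density term differs, and it is larger at 50,000 because the linear predictor is closer to zero there.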


In [None]:
# Load the Stata .dta file
file_path = '/Users/danielseymour/Downloads/insurance.dta'
df = pd.read_stata(file_path)

# Display the first few rows to verify it loaded correctly
df.head()

Unnamed: 0,healthy,age,anylim,male,insured,deg_nd,deg_ged,deg_hs,deg_ba,deg_ma,...,married,selfemp,familysz,reg_ne,reg_mw,reg_so,reg_we,race_bl,race_ot,race_wht
0,1.0,31.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,1.0,31.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,1.0,54.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,5.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,1.0,27.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,1.0,39.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


In [2]:
df.columns

Index(['healthy', 'age', 'anylim', 'male', 'insured', 'deg_nd', 'deg_ged',
       'deg_hs', 'deg_ba', 'deg_ma', 'deg_phd', 'deg_oth', 'married',
       'selfemp', 'familysz', 'reg_ne', 'reg_mw', 'reg_so', 'reg_we',
       'race_bl', 'race_ot', 'race_wht'],
      dtype='object')

In [5]:
# Run a LPM
X = sm.add_constant(df['selfemp'])
y = df['insured']

linear_model = sm.OLS(y, X).fit()
print(linear_model.summary())

                            OLS Regression Results                            
Dep. Variable:                insured   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     97.25
Date:                Mon, 10 Mar 2025   Prob (F-statistic):           8.03e-23
Time:                        20:34:34   Log-Likelihood:                -4356.4
No. Observations:                8802   AIC:                             8717.
Df Residuals:                    8800   BIC:                             8731.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8167      0.005    180.896      0.0

## Interpretation

On average, being a self-employed worker is associated with a 12.76 percentage point decrease in the probability of having insurance coverage, compared to those who are not self-employed.

Insurance rate for not self-employed: 81.67%
Insurance rate for self-employed: 81.67% - 12.76% = 68.91%
Raw difference: 12.76 percentage points

This is one of the advantages of the LPM - coefficients can be directly interpreted as percentage point differences.
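
With a single binary regressor, the LPM intercept is the mean outcome of the reference group and the slope is the difference in group means. The sketch below checks this on a small made-up sample (hypothetical data, not the `insurance.dta` observations), using the closed-form OLS slope cov(x, y) / var(x) instead of statsmodels so that it runs on its own.

```python
# Hypothetical toy sample: 1 = self-employed / insured, 0 = otherwise
selfemp = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
insured = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1]


def mean(values):
    return sum(values) / len(values)


# Closed-form OLS for one regressor plus a constant
x_bar, y_bar = mean(selfemp), mean(insured)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(selfemp, insured))
         / sum((x - x_bar) ** 2 for x in selfemp))
intercept = y_bar - slope * x_bar

# Group insurance rates computed directly
rate_not_se = mean([y for x, y in zip(selfemp, insured) if x == 0])
rate_se = mean([y for x, y in zip(selfemp, insured) if x == 1])

print(f"intercept = {intercept:.4f}, rate (not self-employed) = {rate_not_se:.4f}")
print(f"slope     = {slope:.4f}, difference in rates       = {rate_se - rate_not_se:.4f}")
```

On the real data the same check would be `df.groupby('selfemp')['insured'].mean()`, which should line up with the constant and the `selfemp` coefficient in the regression output.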

In [6]:
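# Run a LPM with demographic controls
# Include age, gender, marital status, family size, education, race, and region variables
# Note: all four region dummies and all three race dummies enter alongside the constant.
# If each set is exhaustive, the design matrix is rank-deficient (the summary reports
# Df Model = 16 for 18 regressors); dropping one category per set would avoid this.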
X = sm.add_constant(df[['selfemp', 
                       'age', 'male', 'married', 'familysz',
                       'deg_ged', 'deg_hs', 'deg_ba', 'deg_ma', 'deg_phd', 'deg_oth',
                       'reg_ne', 'reg_mw', 'reg_so', 'reg_we',
                       'race_bl', 'race_ot', 'race_wht']])
y = df['insured']

linear_model_with_controls = sm.OLS(y, X).fit()
print(linear_model_with_controls.summary())

                            OLS Regression Results                            
Dep. Variable:                insured   R-squared:                       0.145
Model:                            OLS   Adj. R-squared:                  0.144
Method:                 Least Squares   F-statistic:                     93.26
Date:                Mon, 10 Mar 2025   Prob (F-statistic):          4.53e-284
Time:                        21:11:29   Log-Likelihood:                -3714.3
No. Observations:                8802   AIC:                             7463.
Df Residuals:                    8785   BIC:                             7583.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2568      0.015     17.648      0.0

## Interpretation

On average, holding fixed demographics, being a self-employed worker is associated with a 17.29 percentage point decrease in the probability of having insurance coverage, compared to those who are not self-employed.
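
One thing worth checking in this specification is whether a full set of category dummies has been included together with the constant (the dummy variable trap). The sketch below uses a few made-up rows (hypothetical, not the `insurance.dta` data) and assumes each person belongs to exactly one region; under that assumption the four region dummies sum to the constant column on every row, which is an exact linear dependence.

```python
# Hypothetical rows: each person belongs to exactly one region
rows = [
    {"reg_ne": 1, "reg_mw": 0, "reg_so": 0, "reg_we": 0},
    {"reg_ne": 0, "reg_mw": 1, "reg_so": 0, "reg_we": 0},
    {"reg_ne": 0, "reg_mw": 0, "reg_so": 1, "reg_we": 0},
    {"reg_ne": 0, "reg_mw": 0, "reg_so": 0, "reg_we": 1},
    {"reg_ne": 0, "reg_mw": 0, "reg_so": 1, "reg_we": 0},
]
region_cols = ["reg_ne", "reg_mw", "reg_so", "reg_we"]
const = [1] * len(rows)

# With every category included, the dummies add up to the constant column
full_sum = [sum(row[c] for c in region_cols) for row in rows]
full_set_equals_const = full_sum == const

# Dropping one category (reg_ne here, an arbitrary reference group)
# means the remaining dummies no longer add up to the constant
kept_cols = region_cols[1:]
reduced_sum = [sum(row[c] for c in kept_cols) for row in rows]
reduced_set_equals_const = reduced_sum == const

print("all region dummies sum to const:", full_set_equals_const)
print("after dropping reg_ne:          ", reduced_set_equals_const)
```

On the real data, a comparable check would be whether `df[['reg_ne', 'reg_mw', 'reg_so', 'reg_we']].sum(axis=1)` equals 1 for every row, and likewise for the three race dummies.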

In [8]:
# Run the probit regression (assuming insured is binary 0/1)
probit_model = sm.Probit(y, X).fit()

# Display the summary of the regression
print(probit_model.summary())

# Odds ratios are a logit concept, so fit a logit on the same data for them.
# (Assumption: logit_model is not defined anywhere above, so it is fitted here.)
logit_model = sm.Logit(y, X).fit()

# Exponentiate the coefficients and their confidence bounds together
conf = logit_model.conf_int()
conf['Odds Ratio'] = logit_model.params
conf.columns = ['2.5%', '97.5%', 'Odds Ratio']

print("\nOdds Ratios:")
print(np.exp(conf))

Optimization terminated successfully.
         Current function value: 0.427126
         Iterations 6
                          Probit Regression Results                           
Dep. Variable:                insured   No. Observations:                 8802
Model:                         Probit   Df Residuals:                     8785
Method:                           MLE   Df Model:                           16
Date:                Mon, 10 Mar 2025   Pseudo R-squ.:                  0.1436
Time:                        21:13:42   Log-Likelihood:                -3759.6
converged:                       True   LL-Null:                       -4390.1
Covariance Type:            nonrobust   LLR p-value:                1.166e-258
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2945        nan        nan        nan         nan         nan
selfemp       -0.6765      0.

In [None]:
# Calculate marginal effects
# get_margeff() defaults to at='overall': the marginal effect is computed for each
# observation and then averaged, which gives the average partial effect (APE).
# The alternative, at='mean', evaluates the effect once at the sample means of the regressors.
# The APE is often preferred because it averages over actual people in the sample
# rather than describing one hypothetical "average" person.
margeff = probit_model.get_margeff()
print(margeff.summary())

       Probit Marginal Effects       
Dep. Variable:                insured
Method:                          dydx
At:                           overall
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
selfemp       -0.1608      0.011    -14.515      0.000      -0.183      -0.139
age            0.0036      0.000      9.425      0.000       0.003       0.004
male          -0.0348      0.008     -4.360      0.000      -0.050      -0.019
married        0.1272      0.009     13.766      0.000       0.109       0.145
familysz      -0.0128      0.003     -4.992      0.000      -0.018      -0.008
deg_ged        0.1008      0.019      5.174      0.000       0.063       0.139
deg_hs         0.1883      0.013     14.749      0.000       0.163       0.213
deg_ba         0.2613      0.016     16.317      0.000       0.230       0.293
deg_ma         0.3120      0.026     12.082      0.000    



The marginal effects in the probit model show how a one-unit change in each independent variable is associated with a change in the probability of being insured, holding all other variables constant.

The Average Partial Effect (APE) of self-employment shows that being self-employed is associated with a 16.08 percentage point decrease in the probability of having insurance, on average, after controlling for other factors in the model.
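
The difference between the average partial effect and the marginal effect at the mean can be made concrete with a small calculation. This is a sketch with made-up probit coefficients and a made-up continuous regressor (none of the numbers come from the model above): the APE averages β·φ(x'β) over observations, while the effect at the mean evaluates β·φ(x'β) once at the average x.

```python
from statistics import NormalDist, mean

normal_pdf = NormalDist().pdf  # standard normal density, φ

# Hypothetical probit coefficients and a tiny hypothetical sample
b_const, b_x = 0.5, -0.7
x_values = [-2.0, -1.0, 0.0, 1.0, 2.0]

# APE: compute the marginal effect for every observation, then average
per_obs_effects = [b_x * normal_pdf(b_const + b_x * x) for x in x_values]
ape = mean(per_obs_effects)

# Marginal effect at the mean: evaluate once at the average x
mem = b_x * normal_pdf(b_const + b_x * mean(x_values))

print(f"APE = {ape:.4f}")
print(f"Marginal effect at the mean = {mem:.4f}")
```

In statsmodels these correspond to `get_margeff(at='overall')` and `get_margeff(at='mean')`. For a binary regressor such as `selfemp`, passing `dummy=True` asks for a discrete 0-to-1 change in the predicted probability instead of a derivative, which may be the more natural reading of the self-employment effect.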

If we include the "healthy" variable on the RHS, it's a bad control for a number of reasons. Each reason below explains why it is misleading to compare self-employed and non-self-employed people who have the same health status.

1. Reverse Causality. Health status might be affected by insurance status. People with insurance tend to access more preventive care and may be diagnosed and treated earlier, potentially improving their health outcomes. The opposite channel, moral hazard making insured people less healthy, is probably not at work here, but it's a fun possibility!

2. Selection Effect
The selection effect refers to how individuals might "select into" or choose self-employment based on their health status.
For example:

- People with chronic health conditions or disabilities might choose self-employment because it offers more flexibility in work hours and environment. 
- They can work from home, take breaks when needed, and avoid commutes.
- Conversely, very healthy individuals might feel more comfortable taking the financial risks associated with self-employment because they don't anticipate high medical expenses.

3. Post-treatment variable: If self-employment affects health (through stress, working conditions, etc.), then controlling for health would remove part of the effect you're trying to measure.

4. Collider bias: If both self-employment and insurance independently affect health status, controlling for health could create a spurious correlation between self-employment and insurance. Self-employment → Health ← Insurance status. When you control for a collider, you can create a spurious (false) association between the original variables. This happens because you're essentially examining the relationship between self-employment and insurance status only within strata of similar health status.
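
A small simulation can illustrate the collider problem. This is a sketch under made-up assumptions: self-employment and insurance are drawn independently, and a hypothetical "healthy" indicator is made to depend on both (the direction and size of these effects are invented purely for illustration). Comparing insurance rates within the healthy group then shows an association that is absent in the full sample.

```python
import random

random.seed(0)
n = 20_000

rows = []
for _ in range(n):
    selfemp = random.random() < 0.5
    insured = random.random() < 0.5  # independent of selfemp by construction
    # Hypothetical collider: health depends on both variables plus noise
    healthy = selfemp + insured + random.gauss(0, 0.5) > 1
    rows.append((selfemp, insured, healthy))


def insured_gap(sample):
    """P(insured | self-employed) minus P(insured | not self-employed)."""
    se = [ins for s, ins, _ in sample if s]
    non_se = [ins for s, ins, _ in sample if not s]
    return sum(se) / len(se) - sum(non_se) / len(non_se)


gap_all = insured_gap(rows)
gap_healthy = insured_gap([r for r in rows if r[2]])

print(f"gap in full sample:       {gap_all:+.3f}")
print(f"gap among the healthy:    {gap_healthy:+.3f}")
```

The full-sample gap should be close to zero because the two variables were generated independently, while the gap within the healthy stratum is clearly negative: conditioning on the collider manufactures an association.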

Reverse causality is about Y causing X when your model assumes X causes Y.
Selection effect is about Z causing both X and Y, creating a spurious correlation.

Reverse Causality:

- Focuses on the direction of causation
- Occurs when your outcome variable (insurance status) affects your control variable (health status)
- Problem: The arrow of causation runs in the reverse direction of what your model assumes
- Example: Insurance → Health, rather than Health → Insurance

Selection Effect:

- Focuses on how people sort themselves into different groups
- Occurs when a third variable (health) influences both your explanatory variable (self-employment) and outcome (insurance)
- Problem: Non-random sorting into treatment groups based on characteristics that also affect outcomes
- Example: Health → Self-employment AND Health → Insurance