# Modeling Absenteeism

## Simple binary choice model: Probit or Logit

### Model specification:
We can specify our model as a binary outcome where 

$$
y_i = \begin{cases}
   1 &\text{if } \text{student } i \text{ is absent} \\
   0 &\text{if } \text{student } i \text{ is present}
\end{cases}
$$

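Our basic model is a latent-variable (index) model:

$$
y_i^* = X_iB + u_i, \qquad y_i = 1 \text{ if } y_i^* > 0 \text{ and } y_i = 0 \text{ otherwise,}
$$

where $y_i^*$ is an unobserved propensity to be absent, $X_i$ is the vector of independent variables (e.g. student age, grade, school characteristics), $B$ is the vector of coefficients we want to estimate, and $u_i$ is an error term. The distribution assumed for $u_i$ determines the model: a standard normal error gives the Probit, a logistic error gives the Logit.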


### Estimation:
We use maximum likelihood estimation to find the value of $B$ that maximizes the probability of observing the $y_i$ given our independent variables.
The Probit model uses a cumulative normal distribution function $F$ to model the probability that $y_i = 1$, i.e.

$$
P(y_i = 1 | X_i, B) = F(X_iB)
$$

where $F$ is the standard normal cdf.
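
For reference, the objective that maximum likelihood estimation maximizes can be sketched for a generic cdf $F$ (the standard normal cdf in the Probit case). This assumes the observations are independent; $n$, the number of observations, is notation introduced here:

$$
\ell(B) = \sum_{i=1}^{n} \left[ y_i \log F(X_iB) + (1 - y_i) \log\left(1 - F(X_iB)\right) \right]
$$

Each term contributes $\log F(X_iB)$ when student $i$ is absent and $\log\left(1 - F(X_iB)\right)$ when student $i$ is present.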

The Logit model uses a logistic function instead of a normal cdf, i.e.

$$
P(y_i = 1 | X_i, B) = \frac{\exp(X_iB)}{1 + \exp(X_iB)}
$$
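
To make the two link functions concrete, here is a minimal sketch that evaluates both probabilities on a few values of the linear index $X_iB$. The index values are hypothetical and chosen only for illustration. `norm.cdf` and `expit` are the SciPy implementations of the standard normal cdf and the logistic function.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

# Hypothetical values of the linear index X_i B (illustration only)
index = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Probit: P(y_i = 1) = F(X_i B), with F the standard normal cdf
probit_p = norm.cdf(index)

# Logit: P(y_i = 1) = exp(X_i B) / (1 + exp(X_i B))
logit_p = expit(index)

for xb, pp, lp in zip(index, probit_p, logit_p):
    print(f"index={xb:+.1f}  probit={pp:.3f}  logit={lp:.3f}")
```

Both models give a probability of 0.5 at an index of zero, but the logistic distribution has heavier tails. The same index value therefore maps to different probabilities away from zero, and Probit and Logit coefficients are not on the same scale.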

### Simulate data

In [9]:
import numpy as np
import pandas as pd
from scipy import optimize

# Set random seed for reproducibility
np.random.seed(42)

# Set the number of observations and the number of independent variables
n_observations = 1000
n_independent_variables = 3

# Generate random data for the independent variables
X = np.random.randn(n_observations, n_independent_variables)

# Generate true coefficients (including intercept)
true_B = np.array([0.5, 0.3, -0.2, 0.1])  # Intercept and three coefficients

# Add intercept term
X_with_intercept = np.column_stack((np.ones(n_observations), X))

# Compute linear predictor
z = np.dot(X_with_intercept, true_B)

# Compute probabilities
p = 1 / (1 + np.exp(-z))

# Generate binary outcomes with correct probability
y = (np.random.rand(n_observations) < p).astype(int)

def log_likelihood(B, X, y):
    # Compute probabilities
    z = np.dot(X, B)
    p = 1 / (1 + np.exp(-z))
    
    # Avoid log(0) with small epsilon
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)
    
    # Compute log-likelihood
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    
    # Return the negative log-likelihood, since optimize.minimize minimizes
    return -ll

# Initial guess for coefficients
B_init = np.zeros(n_independent_variables + 1)

# Maximize the log-likelihood by minimizing its negative
result = optimize.minimize(log_likelihood, B_init, args=(X_with_intercept, y), method='BFGS')

# Save the simulated data
data = pd.DataFrame(np.column_stack((y, X)), columns=["y", "x1", "x2", "x3"])
data.to_csv("logistic_regression_data.csv", index=False)

print("True Coefficients:", true_B)
print("Estimated Coefficients:", result.x)
print("Number of samples:", len(y))
print("Number of 0s:", np.sum(y == 0))
print("Number of 1s:", np.sum(y == 1))

True Coefficients: [ 0.5  0.3 -0.2  0.1]
Estimated Coefficients: [ 0.61650116  0.28666403 -0.25712836  0.05877295]
Number of samples: 1000
Number of 0s: 351
Number of 1s: 649
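
The optimizer above returns point estimates only. As a hedged sketch, standard errors for a Logit fit can be approximated from the inverse of the information matrix $X'WX$, where $W$ is diagonal with entries $p_i(1 - p_i)$. The block below re-creates the simulated data with the same steps and seed so that it is self-contained. The names `neg_log_likelihood`, `neg_score`, `information` and `std_err` are illustrative and not part of the original code. Supplying the analytic gradient is a convenience here, not a requirement.

```python
import numpy as np
from scipy import optimize
from scipy.special import expit

# Re-create the simulated data with the same seed and steps as above
np.random.seed(42)
n = 1000
X = np.column_stack((np.ones(n), np.random.randn(n, 3)))
true_B = np.array([0.5, 0.3, -0.2, 0.1])
y = (np.random.rand(n) < expit(X @ true_B)).astype(int)

def neg_log_likelihood(B, X, y):
    """Negative Logit log-likelihood, using logaddexp for log(1 + exp(z))."""
    z = X @ B
    return np.sum(np.logaddexp(0.0, z) - y * z)

def neg_score(B, X, y):
    """Gradient of the negative log-likelihood: -X'(y - p)."""
    return -X.T @ (y - expit(X @ B))

result = optimize.minimize(
    neg_log_likelihood,
    np.zeros(X.shape[1]),
    args=(X, y),
    jac=neg_score,
    method="BFGS",
)
B_hat = result.x

# Information matrix X' W X evaluated at the estimate, with W = diag(p(1 - p))
p_hat = expit(X @ B_hat)
information = X.T @ (X * (p_hat * (1 - p_hat))[:, None])
std_err = np.sqrt(np.diag(np.linalg.inv(information)))

print("Estimated coefficients:", B_hat)
print("Approximate standard errors:", std_err)
```

Dividing each coefficient by its standard error gives the z statistic that packaged routines report in their summary tables.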


### Estimate

In [12]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit, Probit

# Load data
data = pd.read_csv("data/absence_data.csv")

# Prepare predictors and target
X = data[["age", "grade", "school"]]
X = sm.add_constant(X)  # Add constant term for intercept
y = data["y"]

# Logit Model
logit_model = Logit(y, X).fit()
print(logit_model.summary())

# Probit Model
probit_model = Probit(y, X).fit()
print(probit_model.summary())

Optimization terminated successfully.
         Current function value: 0.270569
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      996
Method:                           MLE   Df Model:                            3
Date:                Thu, 28 Nov 2024   Pseudo R-squ.:                  0.5910
Time:                        18:06:54   Log-Likelihood:                -270.57
converged:                       True   LL-Null:                       -661.56
Covariance Type:            nonrobust   LLR p-value:                3.488e-169
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2154      0.127     -9.556      0.000      -1.465      -0.966
age           -2.0019      0.