# Difference-in-Differences Analysis: Phase 1 CEU Impact on Student Abandonment
**Author:** [elia.matsumoto@fgv.br] (2026)

## Overview
This notebook performs a Difference-in-Differences (DiD) analysis to evaluate the impact of Phase 1 CEUs (opened in 2003-2004) on student abandonment rates in São Paulo municipality (2000-2007).

## Research Question
Do Phase 1 CEUs (Centros Educacionais Unificados) reduce student abandonment rates compared to areas without CEUs?

## Input Files
1. `Data_2a_CEU_Phase1.xlsx`: Phase 1 CEU treatment assignment by subprefecture
2. `Data_2b_EDU_preproc.xlsx`: Preprocessed educational metrics (ABAND, APROV, REPROV)

## Output Files
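1. `Fig_2_EDU_DiD_Phase1.jpeg`: Visualization of treatment/control trends
2. `Data_2_EDU_DiD_Phase1.xlsx`: DiD dataset with treatment indicators
3. `Results_2_EDU_DiD_Phase1.xlsx`: Regression coefficient tables (sheets `Without_covariates` and `With_covariates`)
4. Console output: Two OLS regression summaries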

## Methodology
- **Design**: Difference-in-Differences (DiD)
- **Treatment**: Subprefectures with Phase 1 CEUs (2003-2004 openings)
- **Control**: Subprefectures without Phase 1 CEUs
- **Time Window**: 2000-2007 (4 pre-treatment years, 2000-2003; 4 post-treatment years, 2004-2007)
- **Treatment Year**: 2003

## Key Variables
- **TREAT**: Binary (1=Phase 1 CEU, 0=No Phase 1 CEU)
- **AFTER**: Binary (1 = years after 2003, i.e. 2004-2007; 0 = 2003 or earlier)
- **ABAND**: Student abandonment rates (outcome)
- **APROV/REPROV**: Covariates (approval/repetition rates)
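
A minimal sketch of how the two indicators could be built, assuming a treatment table with columns `SUBPREFECTURE` and `PHASE` (1 for Phase 1 CEU areas) and a panel with `SUBPREFECTURE` and `YEAR`. The data are made up, and merging on the subprefecture name is an illustrative choice that does not depend on row order; it is not necessarily how this notebook assigns treatment.

```python
import pandas as pd

PHASE_YEAR = 2003

# Hypothetical treatment table: one row per subprefecture
ceu = pd.DataFrame({"SUBPREFECTURE": ["A", "B", "C"], "PHASE": [1, 0, 1]})

# Hypothetical panel: one row per subprefecture and year, in arbitrary order
panel = pd.DataFrame({
    "SUBPREFECTURE": ["C", "A", "B", "B", "A", "C"],
    "YEAR":          [2002, 2002, 2002, 2004, 2004, 2004],
})

# Attach the treatment flag by key, so row order does not matter
did = panel.merge(ceu, on="SUBPREFECTURE", how="left", validate="many_to_one")
did["TREAT"] = did["PHASE"].astype(int)
# AFTER is 1 only for years strictly after the treatment year
did["AFTER"] = (did["YEAR"] > PHASE_YEAR).astype(int)

print(did[["SUBPREFECTURE", "YEAR", "TREAT", "AFTER"]])
```

The `validate="many_to_one"` argument makes the merge fail early if a subprefecture appears more than once in the treatment table.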

## Regression Specifications
1. **Basic DiD**: ABAND ~ AFTER + TREAT + AFTER*TREAT
2. **With Controls**: ABAND ~ AFTER + TREAT + AFTER*TREAT + APROV + REPROV

## Visualization
- Annual means for treatment/control groups (2000-2007)
- Linear trends for pre-treatment (2000-2003, red) and post-treatment (2003-2007, blue)
- Vertical line at treatment year (2003)
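- `YEARS_OUT` lists 2014 and 2020 as years to exclude from trend calculations; both fall outside the 2000-2007 window, so no year is actually dropped here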

## Interpretation
- **AFTER*TREAT coefficient**: Treatment effect of Phase 1 CEUs on abandonment
- **Parallel trends**: Visual check of pre-2003 trends for DiD validity
- **Two models**: Robustness check with/without covariates
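
The visual parallel-trends check can be complemented by a simple pre-period regression. The sketch below, on made-up data, regresses the outcome on a year trend, a treatment flag, and their interaction using pre-treatment years only; an interaction slope close to zero is consistent with parallel pre-trends. This check is an illustrative addition, not part of the notebook's pipeline, and it uses classical standard errors that ignore any within-subprefecture correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pre-period panel: 30 units observed in 2000-2003
years = np.tile(np.arange(2000, 2004), 30)
treat = np.repeat(np.arange(30) % 2, 4)   # alternate control/treated units
t = years - 2000                          # years since the start of the window

# Made-up outcome: common downward trend, level shift for treated units,
# and no slope difference between groups (parallel trends hold)
y = 8.0 - 0.5 * t + 1.0 * treat + rng.normal(0, 0.3, t.size)

# Columns: intercept, trend, TREAT, TREAT x trend
X = np.column_stack([np.ones(t.size), t, treat, treat * t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard errors
resid = y - X @ beta
sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print(f"pre-trend slope difference: {beta[3]:.3f} (SE {se[3]:.3f})")
```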

In [None]:
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [None]:
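# Trend line
def Trend_Line(X, Y, YEARS_OUT):
    """Fit a straight line to (X, Y), ignoring any X values listed in YEARS_OUT.

    Returns a DataFrame with columns 'X' and 'Y' holding the fitted line
    evaluated at len(X) evenly spaced points between X.min() and X.max().
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Drop the excluded years before fitting
    keep = ~np.isin(X, YEARS_OUT)
    coeff      = np.polyfit(X[keep], Y[keep], 1)
    trend_line = np.poly1d(coeff)
    # Evaluate the fitted line over the full range of X
    X_trend = np.linspace(X.min(), X.max(), len(X))
    Y_trend = trend_line(X_trend)
    return pd.DataFrame({'X': X_trend, 'Y': Y_trend})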

In [None]:
# Files info
Data_CEU_File  = 'Data_2a_CEU_Phase1.xlsx';
Data_CEU_Sheet = 'Sheet1';
Data_EDU_File  = 'Data_2b_EDU_preproc.xlsx';
Data_EDU_Sheet = 'Data_preproc';
Data_DID_File  = 'Data_2_EDU_DiD_Phase1.xlsx';
Data_DID_Sheet = 'Sheet1';
Fig_File       = 'Fig_2_EDU_DiD_Phase1.jpeg';
Result_File    = 'Results_2_EDU_DiD_Phase1.xlsx';
Result_Sheet1  = "Without_covariates";
Result_Sheet2  = "With_covariates";

In [None]:
# Initialization
PHASE_YEAR = 2003;
YEARS     = np.array(range(2000, 2008));
nyears    = len(YEARS);
IY        = np.where(YEARS == PHASE_YEAR)[0][0];
YEARS_OUT = [2014, 2020];

In [None]:
# Reading Sao Paulo municipality EDU info
rawdata = pd.ExcelFile(Data_CEU_File);
df_CEU  = rawdata.parse(Data_CEU_Sheet);
rawdata = pd.ExcelFile(Data_EDU_File);
df_EDU  = rawdata.parse(Data_EDU_Sheet);

In [None]:
# Select Years
df_filtered = df_EDU[df_EDU['YEAR'].isin(YEARS)]

In [None]:
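# DiD variables
# TREAT repeats the per-subprefecture PHASE flag once per year. This assumes df_filtered
# is sorted by YEAR and lists the subprefectures in the same order as df_CEU within each year.
TREAT = np.tile(df_CEU.PHASE, (nyears,1)).reshape(-1,1);
# AFTER is True for years strictly after PHASE_YEAR (2004-2007)
AFTER = np.array(df_filtered.YEAR > PHASE_YEAR).reshape(-1,1);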

In [None]:
# Output Data File
df_did = pd.DataFrame({
    'SUBPREFECTURE' : df_filtered['SUBPREFECTURE'],
    'TREAT' : TREAT.flatten(),
    'AFTER' : AFTER.flatten(),
    'ABAND' : df_filtered['ABAND'],
    'APROV' : df_filtered['APROV'],
    'REPROV': df_filtered['REPROV'],    
    'YEAR'  : df_filtered['YEAR']
                     })
df_did.to_excel(Data_DID_File, sheet_name=Data_DID_Sheet, index=False);

In [None]:
# ABAND regression without the covariates (APROV, REPROV)
formula = 'ABAND ~ AFTER + TREAT + AFTER * TREAT'
model   = smf.ols(formula=formula, data=df_did)
result   = model.fit()
print(result.summary())
coef_df = pd.DataFrame({
    'Coefficient': result.params,
    'Std. Error': result.bse,
    't-value': result.tvalues,
    'P-value': result.pvalues,
    '95% CI Lower': result.conf_int()[0],
    '95% CI Upper': result.conf_int()[1],
    'R-squared': result.rsquared
                    })
coef_df.to_excel(Result_File, sheet_name=Result_Sheet1, index=True)

In [None]:
# ABAND regression with the covariates (APROV, REPROV)
formula  = 'ABAND ~ AFTER + TREAT + AFTER * TREAT + APROV + REPROV'
model    = smf.ols(formula=formula, data=df_did)
result   = model.fit()
print(result.summary())
coef_df = pd.DataFrame({
    'Coefficient': result.params,
    'Std. Error': result.bse,
    't-value': result.tvalues,
    'P-value': result.pvalues,
    '95% CI Lower': result.conf_int()[0],
    '95% CI Upper': result.conf_int()[1],
    'R-squared': result.rsquared
                    })
with pd.ExcelWriter(
    Result_File,
    engine="openpyxl",
    mode="a",
    if_sheet_exists="replace"
) as writer:
    coef_df.to_excel(writer, sheet_name=Result_Sheet2, index=True)

In [None]:
# Annual ABAND means for the treatment (TREAT = 1) and control (TREAT = 0) groups
treat_means   = df_did[df_did['TREAT'] == True].groupby('YEAR')['ABAND'].mean()
control_means = df_did[df_did['TREAT'] == False].groupby('YEAR')['ABAND'].mean()

In [None]:
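# Trend lines: pre-treatment (2000-2003) and post-treatment (2003-2007) segments.
# .iloc makes the positional slicing explicit, since the Series index holds the years.
treat_before   = Trend_Line(treat_means.iloc[:IY+1].index, treat_means.iloc[:IY+1].values, YEARS_OUT);
treat_after    = Trend_Line(treat_means.iloc[IY:].index, treat_means.iloc[IY:].values, YEARS_OUT);
control_before = Trend_Line(control_means.iloc[:IY+1].index, control_means.iloc[:IY+1].values, YEARS_OUT);
control_after  = Trend_Line(control_means.iloc[IY:].index, control_means.iloc[IY:].values, YEARS_OUT);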

In [None]:
# Plotting mean curves
plt.figure(figsize=(10, 6))
# Plot each curve
plt.plot(treat_means.index, treat_means.values, label='Treatment values', color='black', marker='o', linestyle=':', linewidth=1)
plt.plot(control_means.index, control_means.values, label='Control values', color='black', marker='+', linestyle=':', linewidth=1)
# Plot trend lines
plt.plot(treat_before.X, treat_before.Y, label='Treatment trend before', color='red', marker='o', linestyle='-',linewidth=1.5)
plt.plot(control_before.X, control_before.Y, label='Control trend before', color='red', marker='+', linestyle='-',linewidth=1.5)
plt.plot(treat_after.X, treat_after.Y, label='Treatment trend after', color='blue', marker='o', linestyle='-',linewidth=1.5)
plt.plot(control_after.X, control_after.Y, label='Control trend after', color='blue', marker='+', linestyle='-',linewidth=1.5)
# Mark the treatment year with a vertical dashed line spanning the full axis height
plt.axvline(x=PHASE_YEAR, color='black', linestyle='--', linewidth=2)
# Add axis labels, ticks, and title
plt.xlabel('YEAR', fontsize=12)
plt.xticks(YEARS, YEARS, rotation=45)
plt.ylabel('Mean Value', fontsize=12)
plt.title('Mean Values of ABAND by Year (Phase 1: 2003 & 2004)', fontsize=14, fontweight='bold')
# Add legend
plt.legend(fontsize=11)
# Add grid for better readability
plt.grid(True, alpha=0.3)
# Adjust layout and display
plt.tight_layout()
# Save plot
plt.savefig(Fig_File, format='jpeg', dpi=300)
plt.show()