# CEU Data Preprocessing Script
**Author:** [elia.matsumoto@fgv.br] (2026)

## Overview
This notebook processes CEU (Centros Educacionais Unificados) data for the municipality of São Paulo and creates one phase indicator per subprefecture for subsequent analysis.

## Input File
- `Data_2a_CEU.xlsx` with two sheets:
  - `Subprefecture`: List of São Paulo subprefectures
  - `CEU`: CEU facility data including opening dates

## Output Files
1. `Data_2a_CEU_Phase1.xlsx`: All subprefectures, with `PHASE` set to 1 where a CEU opened **before 2005**
2. `Data_2a_CEU_Phase2.xlsx`: Subprefectures not flagged as Phase 1, with `PHASE` set to 1 where a CEU opened **between 2007 and 2009**

## Phase Definitions
- **Phase 1**: CEU opened **before 2005**
- **Phase 2**: CEU opened **between 2007 and 2009**
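
As a minimal sketch of these definitions, the hypothetical helper `classify_phase` below maps an opening year to a phase number. The function name and the `0` return value for years outside both windows are assumptions made for illustration and are not part of the notebook. Note that the notebook's code cells test only upper bounds (`year < 2005`, then `year < 2010`), so a CEU opened in 2005 or 2006 would count as Phase 2 there, whereas this sketch follows the documented 2007 to 2009 window.

```python
def classify_phase(year: int) -> int:
    """Hypothetical helper: 1 or 2 for the documented phases, 0 otherwise."""
    if year < 2005:
        return 1
    if 2007 <= year <= 2009:
        return 2
    return 0


print([classify_phase(y) for y in (2003, 2006, 2008, 2012)])  # prints [1, 0, 2, 0]
```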

## Processing Steps
1. Load subprefecture and CEU data
2. For each subprefecture, identify all associated CEUs
3. Set the Phase 1 flag if any CEU in the subprefecture opened before 2005
4. Set the Phase 2 flag if any CEU in the subprefecture opened between 2007 and 2009
5. Export each flag to its own Excel file, dropping Phase 1 subprefectures from the Phase 2 file
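
The steps above can be sketched on in-memory data as follows. This is an illustration under stated assumptions, not the notebook's own implementation: the toy rows are invented, the lowercase names (`df_sp`, `df_ceu`, `out1`, `out2`) are hypothetical, the Phase 2 window follows the documented 2007 to 2009 range, and the Excel export is replaced by plain DataFrames so the example runs without any input file.

```python
import pandas as pd

# Toy stand-ins for the two sheets (invented values, for illustration only)
df_sp = pd.DataFrame({"SUBPREFECTURE": ["A", "B", "C"]})
df_ceu = pd.DataFrame({
    "SUBPREFECTURE": ["a", "A", "b"],
    "DATE": pd.to_datetime(["2003-08-01", "2008-03-15", "2008-06-30"]),
})

# Steps 1-2: match CEUs to subprefectures case-insensitively
key_sp = df_sp["SUBPREFECTURE"].str.lower()
key_ceu = df_ceu["SUBPREFECTURE"].str.lower()
year = df_ceu["DATE"].dt.year

# Steps 3-4: "any CEU opened in the window" per subprefecture
in_p1 = (year < 2005).groupby(key_ceu).any()
in_p2 = year.between(2007, 2009).groupby(key_ceu).any()

# Subprefectures without any CEU get a flag of 0
phase1 = in_p1.reindex(key_sp, fill_value=False).to_numpy().astype(int)
phase2 = in_p2.reindex(key_sp, fill_value=False).to_numpy().astype(int)

# Step 5: one table per phase; Phase 1 rows are dropped from the Phase 2 table
out1 = pd.DataFrame({"SUBPREFECTURE": df_sp["SUBPREFECTURE"], "PHASE": phase1})
out2 = pd.DataFrame({"SUBPREFECTURE": df_sp["SUBPREFECTURE"], "PHASE": phase2})
out2 = out2[phase1 == 0].copy()

print(out1)
print(out2)
```

In this toy case subprefecture `A` has CEUs in both windows, so it is flagged in the Phase 1 table and removed from the Phase 2 table, while `C` has no CEU and keeps a flag of 0 in both.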

In [None]:
# Importing libraries (only the ones this notebook uses)
import numpy as np
import pandas as pd

In [None]:
# Files info
Info_SP_File    = 'Data_2a_CEU.xlsx'
Info_SP_Sheet   = 'Subprefecture'
Info_CEU_Sheet  = 'CEU'
Data_CEU_Phase1 = 'Data_2a_CEU_Phase1.xlsx'
Data_CEU_Phase2 = 'Data_2a_CEU_Phase2.xlsx'
Data_CEU_Sheet  = 'Sheet1'

In [None]:
# Reading São Paulo municipality info: subprefecture list and CEU data (name, subprefecture, opening date)
rawdata = pd.ExcelFile(Info_SP_File)
df_SP   = rawdata.parse(Info_SP_Sheet)
df_CEU  = rawdata.parse(Info_CEU_Sheet)
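
Before the flags are computed, it can help to confirm that the two sheets have the columns the later cells rely on (`SUBPREFECTURE` in both, `DATE` in the CEU sheet) and that every CEU points at a listed subprefecture, since unmatched CEUs are otherwise ignored silently. The helper `check_inputs` below is a hypothetical sketch and not part of the original notebook; the toy frames are invented for illustration.

```python
import pandas as pd


def check_inputs(df_sp: pd.DataFrame, df_ceu: pd.DataFrame) -> list:
    """Hypothetical helper: return CEU subprefecture names missing from df_sp."""
    if "SUBPREFECTURE" not in df_sp.columns:
        raise KeyError("Subprefecture sheet lacks a SUBPREFECTURE column")
    for col in ("SUBPREFECTURE", "DATE"):
        if col not in df_ceu.columns:
            raise KeyError(f"CEU sheet lacks a {col} column")
    # The flagging loop reads .year, which requires parsed datetimes
    if not pd.api.types.is_datetime64_any_dtype(df_ceu["DATE"]):
        raise TypeError("DATE must be parsed as datetime")
    known = set(df_sp["SUBPREFECTURE"].str.lower())
    names = set(df_ceu["SUBPREFECTURE"].str.lower())
    return sorted(names - known)


# Toy data (invented): "Z" is not in the subprefecture list
toy_sp = pd.DataFrame({"SUBPREFECTURE": ["A", "B"]})
toy_ceu = pd.DataFrame({
    "SUBPREFECTURE": ["a", "Z"],
    "DATE": pd.to_datetime(["2003-08-01", "2008-03-15"]),
})
unmatched = check_inputs(toy_sp, toy_ceu)
print(unmatched)  # prints ['z']
```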

In [None]:
# Initialization
PHASE1_Year = 2005   # Phase 1: opening year strictly below this value
PHASE2_Year = 2010   # Phase 2: opening year strictly below this value (and not Phase 1)
nSP         = len(df_SP)
PHASE1      = np.zeros((nSP, 1))
PHASE2      = np.zeros((nSP, 1))

In [None]:
# Setting PHASE1 and PHASE2 flags
for i in range(nSP):
    # Opening dates of the CEUs in this subprefecture (case-insensitive name match)
    name  = df_SP['SUBPREFECTURE'][i].lower()
    dates = df_CEU.loc[df_CEU['SUBPREFECTURE'].str.lower() == name, 'DATE']
    for date in dates:
        if date.year < PHASE1_Year:
            PHASE1[i] = 1
        elif date.year < PHASE2_Year:
            # Only the upper bound is tested, so any opening year from
            # PHASE1_Year up to PHASE2_Year - 1 sets the Phase 2 flag
            PHASE2[i] = 1

In [None]:
# PHASE1 Output Data File
df_out = pd.DataFrame({
    'SUBPREFECTURE': df_SP['SUBPREFECTURE'],
    'PHASE': PHASE1.flatten(),
})
df_out.to_excel(Data_CEU_Phase1, sheet_name=Data_CEU_Sheet, index=False)

In [None]:
# PHASE2 Output Data File
df_out = pd.DataFrame({
    'SUBPREFECTURE': df_SP['SUBPREFECTURE'],
    'PHASE': PHASE2.flatten(),
})
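# Keep only subprefectures not in Phase 1; flatten the (nSP, 1) array to get a 1-D row mask
df_out = df_out[PHASE1.flatten() == 0].copy()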
df_out.to_excel(Data_CEU_Phase2, sheet_name=Data_CEU_Sheet, index=False)