### Prerequisites
Assuming a Python virtual environment has already been created and activated, install the following packages to run the code:

In [1]:
!pip install ace_tools



In [2]:
!pip install pandas



In [3]:
import pandas as pd
import zipfile
import io
import os

# Turmas

Published since 2013, this dataset provides semiannual (June and December) records, broken down by school and Regional Education Board (DRE), of enrolled students, classes, operating shifts, and students with special needs, among other information. In addition to regular schooling classes, it also details all complementary activities offered in the Sao Paulo Municipal Education Network. We use the December 2023 dataset.

## Data Processing

This cell reads the raw `turmas_122023.csv`, replaces the `TURNO` codes (1 to 6) with shift names, drops all columns except the ones we need, and writes out a new file `turmas_processed.csv` containing only:

- **DRE**: Diretoria Regional de Educação (Regional Education Board)  
- **CODINEP**: INEP (Anísio Teixeira National Institute for Educational Studies and Research) code  
- **TIPOESC**: School type (e.g., `EMEF`)  
- **NOMESC**: School name  
- **SUBPREF**: Subprefecture (*subprefeitura*) in which the school is located  
- **DISTRITO**: District of the school  
- **CODAMB**: Room (*ambiente*) code  
- **DESCAMB**: Room description (e.g., `LABORATORIO DE INFORMATICA`)  
- **CAPREAL**: Actual capacity of the room  
- **METRAGEM**: Floor area of the room (m²)  
- **MODAL**: Description of the modality (e.g., complementary, pre-school, primary, secondary, etc.)  
- **REDE**: Network (e.g., direct public administration - DIR, partnership with civil organizations - COM)  
- **CODSERIE**: Grade code  
- **DESCSERIE**: Grade description (e.g., 1st grade, 2nd grade, etc.)  
- **TURNO**: Time of the day (e.g., morning, afternoon, evening)  
- **TURMA**: Name of the class (e.g., 1A, 2B, etc.)  
- **VAGOFER**: Number of vacancies offered in the class  
- **MATRIC**: Number of students enrolled in the class  


In [7]:
df = pd.read_csv('datasets/raw_data/turmas_122023.csv', sep=';')

cols_to_keep = [
    'DRE','CODINEP','TIPOESC','NOMESC','SUBPREF',
    'DISTRITO', 'CODAMB','DESCAMB','CAPREAL',
    'METRAGEM', 'MODAL','REDE','CODSERIE','DESCSERIE',
    'TURNO', 'TURMA','VAGOFER','MATRIC'
]

mapping = {1: 'Manhã', 2: 'Intermediário', 3: 'Tarde', 4: 'Vespertino', 5: 'Noite', 6: 'Integral'}
df['TURNO'] = df['TURNO'].replace(mapping)

df_trimmed = df[cols_to_keep]
df_trimmed.to_csv(
    'datasets/turmas_processed.csv',
    sep=';',
    index=False
)
df_trimmed.head()




Unnamed: 0,DRE,CODINEP,TIPOESC,NOMESC,SUBPREF,DISTRITO,CODAMB,DESCAMB,CAPREAL,METRAGEM,MODAL,REDE,CODSERIE,DESCSERIE,TURNO,TURMA,VAGOFER,MATRIC
0,MP,35098760.0,EMEF,"JOSE ERMIRIO DE MORAIS, SEN.",SAO MIGUEL,JARDIM HELENA,421182,LABORATORIO DE INFORMATICA,40.0,48.0,ATCOMP,DIR,20249,ATCOMP,Tarde,XA,15,15
1,MP,35098760.0,EMEF,"JOSE ERMIRIO DE MORAIS, SEN.",SAO MIGUEL,JARDIM HELENA,421182,LABORATORIO DE INFORMATICA,40.0,48.0,ATCOMP,DIR,20249,ATCOMP,Tarde,XB,15,15
2,MP,35098760.0,EMEF,"JOSE ERMIRIO DE MORAIS, SEN.",SAO MIGUEL,JARDIM HELENA,421182,LABORATORIO DE INFORMATICA,40.0,48.0,ATCOMP,DIR,20249,ATCOMP,Tarde,XC,15,16
3,MP,35098760.0,EMEF,"JOSE ERMIRIO DE MORAIS, SEN.",SAO MIGUEL,JARDIM HELENA,421182,LABORATORIO DE INFORMATICA,40.0,48.0,ATCOMP,DIR,20249,ATCOMP,Tarde,XD,15,16
4,MP,35098760.0,EMEF,"JOSE ERMIRIO DE MORAIS, SEN.",SAO MIGUEL,JARDIM HELENA,421182,LABORATORIO DE INFORMATICA,40.0,48.0,ATCOMP,DIR,20249,ATCOMP,Vespertino,XE,15,15
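
The trimming and shift-mapping logic can be tried on a small in-memory frame before running it on the full file. The sketch below uses invented rows and a hypothetical `EXTRA` column; only the `TURNO` mapping is taken from the cell above. Because `replace` leaves codes that are missing from the mapping unchanged, the sketch also lists any values that were not translated.

```python
import pandas as pd

# Toy rows (invented for illustration); EXTRA stands in for the columns that get dropped.
toy = pd.DataFrame({
    'TURMA': ['XA', 'XB', 'XC'],
    'TURNO': [1, 3, 9],          # 9 is deliberately absent from the mapping
    'EXTRA': ['a', 'b', 'c'],
})

# Same code-to-label mapping as in the processing cell.
turno_labels = {1: 'Manhã', 2: 'Intermediário', 3: 'Tarde',
                4: 'Vespertino', 5: 'Noite', 6: 'Integral'}

toy['TURNO'] = toy['TURNO'].replace(turno_labels)

# Keep only the columns of interest, as cols_to_keep does for the real file.
toy_trimmed = toy[['TURMA', 'TURNO']]

# Codes without a label are left untouched by replace; collect them for inspection.
unmapped = [v for v in toy_trimmed['TURNO'].unique()
            if v not in turno_labels.values()]

print(toy_trimmed)
print('Unmapped shift codes:', unmapped)
```

If the list of unmapped codes is not empty on the real data, the mapping probably needs another entry.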


# Parcerias

This dataset provides transparency about Early Childhood Education Units maintained by partner institutions. It lists the Educational Units and the organizations that maintain them, along with the amounts planned for monthly transfers, rent, and property tax (IPTU), and the contracted vacancies for nurseries and other classes. We use the December 2023 dataset.

## Data Processing

This cell reads the raw `parcerias_dezembro_2023.csv`, keeps only the columns we need, maps full DRE names to their short codes, and writes out a new file `parcerias_processed.csv`. It performs two steps:

1. **Select only the required columns**  
  Keeps:  

- **Nº Protocolo**: Protocol number  
- **DRE**: Regional Education Board (mapped to two-letter codes)  
- **OSC Parceira**: Partner Civil Society Organization  
- **CNPJ**: National Registry of Legal Entities (tax ID for organizations)  
- **Código Escola**: School code  
- **Unidade Educacional**: Educational unit  
- **Valor Mensal**: Monthly value granted to the partner organization  
- **Verba Locação**: Rent allowance  
- **Valor Mensal IPTU**: Monthly property tax value  
- **Data Início**: Start date of the partnership  
- **Data Término**: End date of the partnership  

2. **Map DRE names to two-letter codes**  
  The full DRE names are mapped to their short codes (two letters, except `G` for Guaianases) for easier reference. Names not listed in the mapping are left unchanged. The mapping is as follows:

| DRE Full Name                                                    | Code |
|------------------------------------------------------------------|:----:|
| DIRETORIA REGIONAL DE EDUCACAO BUTANTA                           |  BT  |
| DIRETORIA REGIONAL DE EDUCACAO CAMPO LIMPO                       |  CL  |
| DIRETORIA REGIONAL DE EDUCACAO CAPELA DO SOCORRO                 |  CS  |
| DIRETORIA REGIONAL DE EDUCACAO FREGUESIA/BRASILANDIA             |  FB  |
| DIRETORIA REGIONAL DE EDUCACAO GUAIANASES                        |  G   |
| DIRETORIA REGIONAL DE EDUCACAO IPIRANGA                          |  IP  |
| DIRETORIA REGIONAL DE EDUCACAO ITAQUERA                          |  IQ  |
| DIRETORIA REGIONAL DE EDUCACAO JACANA/TREMEMBE                   |  JT  |
| DIRETORIA REGIONAL DE EDUCACAO SAO MIGUEL                        |  MP  |
| DIRETORIA REGIONAL DE EDUCACAO PENHA                             |  PE  |
| DIRETORIA REGIONAL DE EDUCACAO PIRITUBA                          |  PJ  |
| DIRETORIA REGIONAL DE EDUCACAO SANTO AMARO                       |  SA  |
| DIRETORIA REGIONAL DE EDUCACAO SAO MATEUS                        |  SM  |
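
The lookup-with-fallback pattern behind this mapping can be checked on a small Series. This is a minimal sketch: the first two names come from the table above, while the third name is invented to show what happens to an entry the table does not cover.

```python
import pandas as pd

# Two entries copied from the table; the full mapping has thirteen.
dre_codes = {
    'DIRETORIA REGIONAL DE EDUCACAO BUTANTA': 'BT',
    'DIRETORIA REGIONAL DE EDUCACAO PENHA': 'PE',
}

dre = pd.Series([
    'DIRETORIA REGIONAL DE EDUCACAO BUTANTA',
    'DIRETORIA REGIONAL DE EDUCACAO PENHA',
    'DIRETORIA REGIONAL DE EDUCACAO EXEMPLO',   # hypothetical, not in the table
])

mapped = dre.map(dre_codes)                      # names missing from the dict become NaN
unmapped_names = sorted(dre[mapped.isna()].unique())
result = mapped.fillna(dre)                      # keep the original name where no code exists

print(result.tolist())
print('Names without a code:', unmapped_names)
```

Listing the names without a code before calling `fillna` makes it easier to spot spelling variants that would otherwise pass through silently.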


In [5]:
df = pd.read_csv('datasets/raw_data/parcerias_dezembro_2023.csv', sep=';')

cols_to_keep = [
    'Nº Protocolo', 'DRE', 'OSC Parceira',
    'CNPJ', 'Código Escola', 'Unidade Educacional',
    'Valor Mensal', 'Verba Locação', 'Valor Mensal IPTU',
    'Data Início', 'Data Término'
]
df_trimmed = df[cols_to_keep].copy()

dre_map = {
    'DIRETORIA REGIONAL DE EDUCACAO BUTANTA':                              'BT',
    'DIRETORIA REGIONAL DE EDUCACAO CAMPO LIMPO':                          'CL',
    'DIRETORIA REGIONAL DE EDUCACAO CAPELA DO SOCORRO':                    'CS',
    'DIRETORIA REGIONAL DE EDUCACAO FREGUESIA/BRASILANDIA':                'FB',
    'DIRETORIA REGIONAL DE EDUCACAO GUAIANASES':                           'G',
    'DIRETORIA REGIONAL DE EDUCACAO IPIRANGA':                             'IP',
    'DIRETORIA REGIONAL DE EDUCACAO ITAQUERA':                             'IQ',
    'DIRETORIA REGIONAL DE EDUCACAO JACANA/TREMEMBE':                      'JT',
    'DIRETORIA REGIONAL DE EDUCACAO SAO MIGUEL':                           'MP',
    'DIRETORIA REGIONAL DE EDUCACAO PENHA':                                'PE',
    'DIRETORIA REGIONAL DE EDUCACAO PIRITUBA':                             'PJ',
    'DIRETORIA REGIONAL DE EDUCACAO SANTO AMARO':                          'SA',
    'DIRETORIA REGIONAL DE EDUCACAO SAO MATEUS':                           'SM'
}
df_trimmed['DRE'] = df_trimmed['DRE'].map(dre_map).fillna(df_trimmed['DRE'])

df_trimmed.to_csv(
    'datasets/parcerias_processed.csv',
    sep=';',
    index=False
)

df_trimmed.head()


Unnamed: 0,Nº Protocolo,DRE,OSC Parceira,CNPJ,Código Escola,Unidade Educacional,Valor Mensal,Verba Locação,Valor Mensal IPTU,Data Início,Data Término
0,2442017,BT,ASSOCIACAO EDUCACIONAL UIRAPURU,13.932.073/0001-63,400831.0,CEI INDIR - ACUCENA JORGE LIAN,22874555,0,0,28/10/2022,27/10/2027
1,7092017,BT,CENTRO EDUCACIONAL DE DESENVOLV JUVENIL E INFA...,13.222.705/0001-03,400806.0,CEI INDIR - CHACARA DO JOCKEY,2350928,0,0,01/01/2018,31/12/2022
2,1142019,BT,CENTRO DE DESENVOLVIMENTO E AMPARO A PESSOA,13.222.705/0001-03,400870.0,CEI INDIR - CHACARA DO JOCKEY,31658302,0,0,02/08/2019,01/08/2024
3,2510,BT,CENTRO EDUCACIONAL DE DESENVOLV JUVENIL E INFA...,01.309.378/0001-34,400806.0,CEI INDIR - CHACARA DO JOCKEY,16676862,0,0,22/12/2015,21/06/2018
4,5912017,BT,OBRA ASSISTENCIAL JESUS MENINO,01.309.378/0001-34,400414.0,CEI INDIR - COHAB EDUCANDARIO,16278845,0,0,01/01/2023,31/12/2027


# Matriculas

This dataset contains information on enrollments made in the Sao Paulo Municipal Education Network for Early Childhood Education (nurseries and preschools), Elementary Education (1st to 9th grade), High School, Youth and Adult Education (EJA), Projovem, Technical Education, and Special Education. To reduce data interpretation errors and make the files easier to handle, the database is divided into three files: two containing only schooling classes (one for early childhood education and one for the other modalities), and a third containing the remaining classes, i.e., complementary activities.

## Pre-processing

This code will automate the cleaning and re-formatting of each CSV inside `microdados-de-matricula-2023.zip`. It performs the following steps:

1. **Unzip and load**  
   - Extracts every `.csv` file from the ZIP archive  
   - Reads each file into a pandas `DataFrame`  

2. **Select only the required columns**  
   Keeps: 
  - **AN_LETIVO**: School year  
  - **CD_UNIDADE_EDUCACAO**: Education Unit code  
  - **NOME_DISTRITO**: District name  
  - **CD_SETOR**: Sector code  
  - **TIPO_ESCOLA**: School type  
  - **NOME_ESCOLA**: School name  
  - **DRE**: Regional Education Board  
  - **CD_INEP_ESCOLA**: INEP school code  
  - **CD_TURNO**: Shift code  
  - **DESC_TURNO**: Shift description  
  - **CD_SERIE**: Grade code  
  - **DESC_SERIE**: Grade description  
  - **MODALIDADE**: Modality  
  - **NOME_TURMA**: Class name  
  - **DESC_ETAPA_ENSINO**: Education stage description  
  - **DESC_CICLO_ENSINO**: Education cycle description  
  - **DESC_TIPO_TURMA**: Class type description  
  - **CD_ALUNO_SME**: Student code assigned by SME (anonymized)  
  - **ANO_NASC_ALUNO**: Student's birth year  
  - **MES_NASC_ALUNO**: Student's birth month  
  - **CD_SEXO**: Student's gender assigned at birth
  - **DESC_RACA_COR**: Race/color
  - **DESC_PAIS_NASC**: Country of birth
  - **NEE_ALT_HAB**, **DEF__AUTISMO**, ..., **DEF__N_POSSUI**: Flags for special educational needs  
  - **CD_MAT**: Enrollment code  
  - **DT_IN_MAT**: Enrollment start date  
  - **DT_FIM_MAT**: Enrollment end date  
  - **SITUACAO_MAT**: Enrollment status  
  - **MES_SIT_MAT**: Month in which the enrollment status was recorded  
  - **ANO_SIT_MAT**: Year in which the enrollment status was recorded  

3. **Build the `NASC_ALUNO` column**: To preserve anonymity, the dataset does not include the student's exact birth date, only the year (`ANO_NASC_ALUNO`) and month (`MES_NASC_ALUNO`) in separate columns. The day is therefore fixed to `01`.  
- Combines `ANO_NASC_ALUNO` + `MES_NASC_ALUNO` into a date string of the form  
  ```
  01/<MM_NASC>/<YYYY_NASC>
  ```

4. **Build the `DATA_SIT` column**: Likewise, the dataset provides only the month and year of the student's enrollment status, so the day is again fixed to `01`.  
- Combines `ANO_SIT_MAT` + `MES_SIT_MAT` into  
  ```
  01/<MM_SIT>/<YYYY_SIT>
  ```

5. **Generate the `NEE` column**  
- Scans the series of indicator columns (`NEE_ALT_HAB`, `DEF__AUTISMO`, … , `DEF__N_POSSUI`)  
- Maps the first “1” found to its code string (e.g. `NEE_ALT_HAB → "ALT_HAB"`, `DEF__AUTISMO → "AUTISMO"`, etc.)  
- If no flag other than `DEF__N_POSSUI` is set, yields `null`

6. **Map the `DRE` column**
- Maps the full DRE names to their two-letter codes (e.g., `BUTANTA` → `BT`)

7. **Drop raw flags and intermediate date parts**  
- Removes the original `NEE_ALT_HAB` and `DEF__*` flag columns and the separate month/year columns  

8. **Reorder into the final schema**  
AN_LETIVO; CD_UNIDADE_EDUCACAO; NOME_DISTRITO; CD_SETOR; TIPO_ESCOLA; NOME_ESCOLA; DRE; CD_INEP_ESCOLA; CD_TURNO; DESC_TURNO; CD_SERIE; DESC_SERIE; MODALIDADE; NOME_TURMA; DESC_ETAPA_ENSINO; DESC_CICLO_ENSINO; DESC_TIPO_TURMA; CD_ALUNO_SME; NASC_ALUNO; CD_SEXO; DESC_RACA_COR; DESC_PAIS_NASC; NEE; CD_MAT; DT_IN_MAT; DT_FIM_MAT; SITUACAO_MAT; DATA_SIT


9. **Write out each cleaned file**  
- Saves to `datasets/<original_filename>_processed.csv`
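
Steps 3 to 5 above can be sketched at row level in plain Python. The helper names `first_of_month` and `first_flag` and the sample rows are hypothetical, and the flag mapping is only a three-entry subset of the full one. The date helper returns `None` only for the rows whose month or year is missing or non-numeric, which is one possible way to handle incomplete records.

```python
import math

def first_of_month(month, year):
    """Return '01/MM/YYYY', or None when either part is missing or non-numeric."""
    try:
        m, y = float(month), float(year)
    except (TypeError, ValueError):
        return None
    if math.isnan(m) or math.isnan(y):
        return None
    return f"01/{int(m):02d}/{int(y)}"

# Subset of the flag-to-label mapping; dict order defines the priority.
NEE_LABELS = {
    'NEE_ALT_HAB': 'ALT_HAB',
    'DEF__AUTISMO': 'AUTISMO',
    'DEF__N_POSSUI': None,
}

def first_flag(row):
    """Return the label of the first flag set to 1, or None if there is none."""
    for col, label in NEE_LABELS.items():
        if row.get(col) == 1:
            return label
    return None

# Invented rows: one complete record and one with a missing birth month.
rows = [
    {'MES_NASC_ALUNO': 3, 'ANO_NASC_ALUNO': 2015,
     'NEE_ALT_HAB': 0, 'DEF__AUTISMO': 1, 'DEF__N_POSSUI': 0},
    {'MES_NASC_ALUNO': float('nan'), 'ANO_NASC_ALUNO': 2010,
     'NEE_ALT_HAB': 0, 'DEF__AUTISMO': 0, 'DEF__N_POSSUI': 1},
]

results = [
    (first_of_month(r['MES_NASC_ALUNO'], r['ANO_NASC_ALUNO']), first_flag(r))
    for r in rows
]
print(results)
```

With pandas, both helpers could be applied per row through `DataFrame.apply(..., axis=1)`, at the cost of being slower than a vectorized column operation on large files.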

In [None]:
zip_path = 'datasets/raw_data/microdados-de-matricula-2023.zip'
output_dir = 'datasets'
os.makedirs(output_dir, exist_ok=True)

raw_cols = [
    'AN_LETIVO','CD_UNIDADE_EDUCACAO','NOME_DISTRITO','CD_SETOR','TIPO_ESCOLA','NOME_ESCOLA',
    'DRE','CD_INEP_ESCOLA','CD_TURNO','DESC_TURNO','CD_SERIE','DESC_SERIE','MODALIDADE',
    'NOME_TURMA','DESC_ETAPA_ENSINO','DESC_CICLO_ENSINO','DESC_TIPO_TURMA','CD_ALUNO_SME',
    'ANO_NASC_ALUNO','MES_NASC_ALUNO','CD_SEXO','DESC_RACA_COR','DESC_PAIS_NASC',
    'NEE_ALT_HAB','DEF__AUTISMO','DEF__SURDEZ_LEVE','DEF__SURDEZ_SEV','DEF__INTELECT',
    'DEF__MULTIPLA','DEF__CEGUEIRA','DEF__BAIXA_VISAO','DEF__SURDO_CEG','DEF__TRANST_DES_INF',
    'DEF__SINDR_ASPER','DEF__SINDR_RETT','DEF__FIS_N_CADEIR','DEF__FIS_CADEIR','DEF__N_POSSUI',
    'CD_MAT','DT_IN_MAT','DT_FIM_MAT','SITUACAO_MAT','MES_SIT_MAT','ANO_SIT_MAT'
]

final_cols = [
    'AN_LETIVO','CD_UNIDADE_EDUCACAO','NOME_DISTRITO','CD_SETOR','TIPO_ESCOLA','NOME_ESCOLA','DRE',
    'CD_INEP_ESCOLA','CD_TURNO','DESC_TURNO','CD_SERIE','DESC_SERIE','MODALIDADE','NOME_TURMA',
    'DESC_ETAPA_ENSINO','DESC_CICLO_ENSINO','DESC_TIPO_TURMA','CD_ALUNO_SME','NASC_ALUNO','CD_SEXO',
    'DESC_RACA_COR','DESC_PAIS_NASC','NEE','CD_MAT','DT_IN_MAT','DT_FIM_MAT','SITUACAO_MAT','DATA_SIT'
]

nee_map = {
    'NEE_ALT_HAB':       'ALT_HAB',
    'DEF__AUTISMO':      'AUTISMO',
    'DEF__SURDEZ_LEVE':  'SURDEZ_LEVE',
    'DEF__SURDEZ_SEV':   'SURDEZ_SEVERA',
    'DEF__INTELECT':     'INTELECTUAL',
    'DEF__MULTIPLA':     'MULTIPLA',
    'DEF__CEGUEIRA':     'CEGUEIRA',
    'DEF__BAIXA_VISAO':  'BAIXA_VISAO',
    'DEF__SURDO_CEG':    'SURDO_CEGO',
    'DEF__TRANST_DES_INF': 'TRANST_DES_INF',
    'DEF__SINDR_ASPER':  'SINDROME_ASPERGER',
    'DEF__SINDR_RETT':   'SINDROME_RETT',
    'DEF__FIS_N_CADEIR': 'FIS_N_CADEIR',
    'DEF__FIS_CADEIR':   'FIS_CADEIR',
    'DEF__N_POSSUI':     None
}

def compute_nee(row):
    """
    Compute the NEE value based on the row data.
    If multiple NEE columns are set to 1, return the first one found in the order of nee_map.
    """
    for col, val in nee_map.items():
        if row.get(col) == 1:
            return val
    return None

dre_map = {
    'BUTANTA':                              'BT',
    'CAMPO LIMPO':                          'CL',
    'CAPELA DO SOCORRO':                    'CS',
    'FREGUESIA/BRASILANDIA':                'FB',
    'GUAIANASES':                           'G',
    'IPIRANGA':                             'IP',
    'ITAQUERA':                             'IQ',
    'JACANA/TREMEMBE':                      'JT',
    'SAO MIGUEL':                           'MP',
    'PENHA':                                'PE',
    'PIRITUBA':                             'PJ',
    'SANTO AMARO':                          'SA',
    'SAO MATEUS':                           'SM'
}

with zipfile.ZipFile(zip_path, 'r') as z:
    for name in z.namelist():
        if not name.lower().endswith('.csv'):
            continue
        with z.open(name) as f:
            df = pd.read_csv(io.TextIOWrapper(f, encoding='utf-8', errors='replace'), sep=';')
        
        df = df[raw_cols].copy()

        # Fix nome_escola by removing leading/trailing spaces and replacing multiple spaces with a single space
        df['NOME_ESCOLA'] = df['NOME_ESCOLA'].str.strip().str.replace(r'\s+', ' ', regex=True)
        
        # Build 01/MM/YYYY strings; if any value is missing or non-numeric,
        # the whole column falls back to None.
        try:
            df['NASC_ALUNO'] = '01/' + df['MES_NASC_ALUNO'].astype(int).astype(str).str.zfill(2) + '/' + df['ANO_NASC_ALUNO'].astype(int).astype(str)
        except (ValueError, TypeError):
            df['NASC_ALUNO'] = None

        try:
            df['DATA_SIT'] = '01/' + df['MES_SIT_MAT'].astype(int).astype(str).str.zfill(2) + '/' + df['ANO_SIT_MAT'].astype(int).astype(str)
        except (ValueError, TypeError):
            df['DATA_SIT'] = None

        df['NEE'] = df.apply(compute_nee, axis=1)
        df['DRE'] = df['DRE'].map(dre_map).fillna(df['DRE'])

        # Drop the raw flag columns and the separate month/year parts
        drop_cols = list(nee_map.keys()) + ['ANO_NASC_ALUNO','MES_NASC_ALUNO','MES_SIT_MAT','ANO_SIT_MAT']
        df = df.drop(columns=drop_cols)

        df = df[final_cols]

        out_path = os.path.join(output_dir, os.path.splitext(os.path.basename(name))[0] + '_processed.csv')
        df.to_csv(out_path, sep=';', index=False)

# Preview the last file written above. os.listdir(output_dir)[0] is avoided because
# its order is arbitrary and the folder also holds raw_data/ and other processed files.
preview = pd.read_csv(out_path, sep=';')
print(preview)


        AN_LETIVO  CD_UNIDADE_EDUCACAO      NOME_DISTRITO       CD_SETOR  \
0            2023      724690000000000        ARTUR ALVIM  7393130000000   
1            2023      724690000000000        ARTUR ALVIM  7393130000000   
2            2023      724690000000000        ARTUR ALVIM  7393130000000   
3            2023      724690000000000        ARTUR ALVIM  7393130000000   
4            2023      724690000000000        ARTUR ALVIM  7393130000000   
...           ...                  ...                ...            ...   
466038       2023      139510000000000  CIDADE TIRADENTES   204550000000   
466039       2023      139510000000000  CIDADE TIRADENTES   204550000000   
466040       2023      139510000000000  CIDADE TIRADENTES   204550000000   
466041       2023      139510000000000  CIDADE TIRADENTES   204550000000   
466042       2023      139510000000000  CIDADE TIRADENTES   204550000000   

       TIPO_ESCOLA                             NOME_ESCOLA DRE  \
0             MOVA  I