# Building the student enrollments dataset

## Purpose

This script builds a table of **student enrollments** (`enrollments`) by combining:

- User data from Moodle (Parquet files) for 2024 and 2025
- Student information from CSV files for 2024 and 2025
- Grade identifiers from `Grados.csv`

The resulting table is the main source for identifying students in Moodle and Edukrea.
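
The combination described above can be pictured with a small pandas sketch. It is not the `EnrollmentProcessor` implementation, which this notebook does not show; the toy frames, the `doc_hash` join key on the Moodle side, and the helper name `build_enrollments_sketch` are all assumptions made for illustration.

```python
import pandas as pd

def build_enrollments_sketch(moodle_users, students, grados, year):
    """Join Moodle users to students by hashed document, then map grade names to IDs."""
    # Attach each student's grade name to the matching Moodle user.
    # ASSUMPTION: the Moodle side exposes the hashed document as `doc_hash`.
    merged = moodle_users.merge(
        students, left_on="doc_hash", right_on="documento_identificación", how="inner"
    )
    # Translate the grade name into its numeric identifier from the grades table.
    merged = merged.merge(grados, left_on="Grado", right_on="grado", how="left")
    out = merged.rename(columns={"id": "moodle_user_id", "ID": "id_grado"})
    out["year"] = year
    return out[["documento_identificación", "moodle_user_id", "year", "id_grado"]]

# Toy inputs (hypothetical values, not taken from the real files).
moodle_users = pd.DataFrame({"id": [10, 11], "doc_hash": ["aaa", "bbb"]})
students = pd.DataFrame(
    {"documento_identificación": ["aaa", "bbb"], "Grado": ["Primero", "Quinto"]}
)
grados = pd.DataFrame({"ID": [4, 8], "grado": ["Primero", "Quinto"]})

sketch = build_enrollments_sketch(moodle_users, students, grados, year=2024)
print(sketch)
```

An inner join is used in the sketch so that only Moodle users with a matching student record survive; whether the real processor does the same is not stated in this notebook.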

## Resulting dataset: `enrollments`

The DataFrames returned by `create_enrollments_df` have the following structure (the cleaning step at the end of the notebook replaces `id_grado` with the grade name, `grado`):

| Field                      | Description                              |
|----------------------------|------------------------------------------|
| `documento_identificación` | Hash of the student's identity document  |
| `moodle_user_id`           | User ID in Moodle                        |
| `year`                     | Enrollment year                          |
| `edukrea_user_id`          | User ID in Edukrea (2025 only)           |
| `id_grado`                 | Numeric grade identifier (`Grados.csv`)  |

## Required files

- `mdlvf_user.parquet` and `mdlvf_user_info_data.parquet` (Moodle)
- `Estudiantes_*.csv` with:
  - `documento_identificación` (hash)
  - `Grado`
- `Grados.csv` with columns `grado`, `ID`
- `mdl_user.parquet` (users from the Edukrea Moodle instance)


## Technologies

- **DuckDB** for SQL queries over Parquet
- **HashUtility** for anonymization
- **pandas** for data transformation
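
The notebook does not show how `HashUtility` works. As a rough illustration of document anonymization, the sketch below hashes an identifier with salted SHA-256 from the standard library. The function name `hash_document`, the salt, and the choice of SHA-256 are assumptions, not the project's confirmed method.

```python
import hashlib

def hash_document(document_id: str, salt: str = "") -> str:
    """Return a hex digest that stands in for the raw document number."""
    # Normalize so that " 123 " and "123" map to the same hash.
    normalized = document_id.strip()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

# Hypothetical document number and salt, for illustration only.
digest = hash_document(" 1234567890 ", salt="example-salt")
print(len(digest))  # SHA-256 hex digests are 64 characters long
```

Because the same input and salt always give the same digest, hashed documents can still serve as join keys across the yearly files.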


In [1]:
import sys
import os
import pandas as pd

# `__file__` is not defined inside a notebook, so use the working directory.
current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, '..', '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

from scripts.transform.data_cleaning_students import StudentsDataCleaner
from scripts.eda_analyzer import EDAAnalyzer
from scripts.utils.hash_utility import HashUtility

from scripts.transform.enrollment_processor import EnrollmentProcessor

## Grades

In [2]:
grados_df = pd.read_csv('../../data/processed/Grados.csv')
grados_df

Unnamed: 0,ID,grado,nivel,orden
0,1,Prejardín,Preescolar,-3
1,2,Jardín,Preescolar,-2
2,3,Transición,Preescolar,-1
3,4,Primero,Primaria,1
4,5,Segundo,Primaria,2
5,6,Tercero,Primaria,3
6,7,Cuarto,Primaria,4
7,8,Quinto,Primaria,5
8,9,Sexto,Secundaria,6
9,10,Séptimo,Secundaria,7


In [3]:
processor = EnrollmentProcessor(grados_df)

## Processing year 2024

In [4]:
students_2024 = pd.read_csv("../../data/processed/Estudiantes_2024_hashed.csv")
enrollments_2024 = processor.create_enrollments_df(
    parquet_users_path="../../data/processed/parquets/Users/mdlvf_user.parquet",
    parquet_user_info_path="../../data/processed/parquets/Users/mdlvf_user_info_data.parquet",
    students_df=students_2024,
    year=2024,
    edukrea_users_path=None
)
enrollments_2024

Unnamed: 0,documento_identificación,moodle_user_id,year,edukrea_user_id,id_grado
0,8072c79e76f52029c5e9ded53bab22b143909e9315b3fe...,1539,2024,,1
1,37df570f402caf8bf385fc510a7358cc314620c80b8b86...,1540,2024,,1
2,ca3d5bbb8954ed37d08415de7f4cea23e81f839dccb692...,1541,2024,,1
3,08575068f55a622a1105cdafc671b1e21b8117d4bfa4ab...,1542,2024,,2
4,e506c90f7b8e35aaca6ce3fbe8ed3691aadb2b7ec7d0ab...,1543,2024,,2
...,...,...,...,...,...
338,5b223d33ffdac263b065e4a5bd11e7085fa35ccdcbc600...,1904,2024,,11
339,fce0603507f05ffcddd0d9e25ad2d45a3586c2ab495d1f...,1907,2024,,3
340,687eee9dd64b3e6e5df5d3db609f173741e057dd888b66...,1908,2024,,6
341,7eb8260753c7fd510cefffa9d153a4222161d3c049168d...,1909,2024,,3


## Processing year 2025

In [5]:
students_2025 = pd.read_csv("../../data/processed/Estudiantes_imputed_encoded.csv")
enrollments_2025 = processor.create_enrollments_df(
    parquet_users_path="../../data/processed/parquets_2025/Users/mdlvf_user.parquet",
    parquet_user_info_path="../../data/processed/parquets_2025/Users/mdlvf_user_info_data.parquet",
    students_df=students_2025,
    year=2025,
    edukrea_users_path="../../data/processed/Edukrea/Users/mdl_user.parquet",
)
enrollments_2025

Unnamed: 0,documento_identificación,moodle_user_id,year,edukrea_user_id,id_grado
0,945c83d3cff6cc83aaa392a985a62723db2acb618b7d40...,1554,2025,62,4
1,dbda73a31c0eb86e5dff1b940753448583d57bd55c7cdb...,1555,2025,64,4
2,89cbb87fca776cdb9573834f4d02644b904ad3b9330f2a...,1556,2025,65,4
3,d6434f3b731d975ae06275bb9b050864abab62ab4d6dc6...,1557,2025,68,4
4,6dc285f0f81324412fbf72f6ae7325d0e5c39c28f5eee6...,1558,2025,69,4
...,...,...,...,...,...
289,5acf93947f69980541f2ba2990642bf327736590847762...,1997,2025,339,5
290,a5607b9d2bdca49568a8b4c44925d3372790c03df27136...,2000,2025,340,8
291,69a7a496c2e57c7b117869bcc305daf0eed608d6762ec0...,2001,2025,341,11
292,c770616626bb56906040d885ccedae62ff67781227580f...,2003,2025,342,4


In [None]:
# Combine the 2024 and 2025 enrollments into a single DataFrame
enrollments = pd.concat([enrollments_2024, enrollments_2025], ignore_index=True)
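
Before cleaning, it may be worth confirming that combining the two years did not produce repeated student-year pairs. The following is a minimal sketch on toy data; the values and the `find_duplicate_enrollments` helper are hypothetical.

```python
import pandas as pd

def find_duplicate_enrollments(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose (documento_identificación, year) pair appears more than once."""
    keys = ["documento_identificación", "year"]
    return df[df.duplicated(subset=keys, keep=False)]

# Toy frames: student "aaa" appears once in 2024 and twice in 2025.
toy_2024 = pd.DataFrame({"documento_identificación": ["aaa", "bbb"], "year": [2024, 2024]})
toy_2025 = pd.DataFrame({"documento_identificación": ["aaa", "aaa"], "year": [2025, 2025]})
toy = pd.concat([toy_2024, toy_2025], ignore_index=True)

duplicates = find_duplicate_enrollments(toy)
print(duplicates)
```

A student who appears once in each year is expected and is not flagged; only repeats within the same year are returned.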


## Cleaning

We drop the records for the grades Prejardín, Jardín, and Transición because they are not relevant to the analysis.

In [10]:
# Merge enrollments with grados_df, matching id_grado in enrollments to ID in grados_df
enrollments = enrollments.merge(grados_df, left_on='id_grado', right_on='ID', how='left')
enrollments = enrollments.drop(columns=['ID'])
enrollments

Unnamed: 0,documento_identificación,moodle_user_id,year,edukrea_user_id,id_grado,grado_x,nivel_x,orden_x,grado_y,nivel_y,orden_y,grado,nivel,orden
0,8072c79e76f52029c5e9ded53bab22b143909e9315b3fe...,1539,2024,,1,Prejardín,Preescolar,-3,Prejardín,Preescolar,-3,Prejardín,Preescolar,-3
1,37df570f402caf8bf385fc510a7358cc314620c80b8b86...,1540,2024,,1,Prejardín,Preescolar,-3,Prejardín,Preescolar,-3,Prejardín,Preescolar,-3
2,ca3d5bbb8954ed37d08415de7f4cea23e81f839dccb692...,1541,2024,,1,Prejardín,Preescolar,-3,Prejardín,Preescolar,-3,Prejardín,Preescolar,-3
3,08575068f55a622a1105cdafc671b1e21b8117d4bfa4ab...,1542,2024,,2,Jardín,Preescolar,-2,Jardín,Preescolar,-2,Jardín,Preescolar,-2
4,e506c90f7b8e35aaca6ce3fbe8ed3691aadb2b7ec7d0ab...,1543,2024,,2,Jardín,Preescolar,-2,Jardín,Preescolar,-2,Jardín,Preescolar,-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
632,5acf93947f69980541f2ba2990642bf327736590847762...,1997,2025,339,5,Segundo,Primaria,2,Segundo,Primaria,2,Segundo,Primaria,2
633,a5607b9d2bdca49568a8b4c44925d3372790c03df27136...,2000,2025,340,8,Quinto,Primaria,5,Quinto,Primaria,5,Quinto,Primaria,5
634,69a7a496c2e57c7b117869bcc305daf0eed608d6762ec0...,2001,2025,341,11,Octavo,Secundaria,8,Octavo,Secundaria,8,Octavo,Secundaria,8
635,c770616626bb56906040d885ccedae62ff67781227580f...,2003,2025,342,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
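
The suffixed columns in the output above (`grado_x`, `grado_y`, ...) are what pandas produces when the merge cell is executed more than once on the same DataFrame. One way to make the cell safe to re-run, sketched here on toy data, is to drop any grade columns left over from a previous run before merging. The helper name `attach_grades` is hypothetical.

```python
import pandas as pd

def attach_grades(enrollments: pd.DataFrame, grados: pd.DataFrame) -> pd.DataFrame:
    """Left-join grade attributes onto enrollments; safe to call repeatedly."""
    grade_cols = [c for c in grados.columns if c != "ID"]
    # Remove grade columns (and any _x/_y leftovers) from earlier runs.
    stale = [
        c for c in enrollments.columns
        if c in grade_cols
        or (c.endswith(("_x", "_y")) and c.rsplit("_", 1)[0] in grade_cols)
    ]
    clean = enrollments.drop(columns=stale)
    merged = clean.merge(grados, left_on="id_grado", right_on="ID", how="left")
    return merged.drop(columns=["ID"])

# Toy data with hypothetical values.
grados = pd.DataFrame(
    {"ID": [4, 5], "grado": ["Primero", "Segundo"], "nivel": ["Primaria", "Primaria"]}
)
toy = pd.DataFrame({"moodle_user_id": [1561, 1997], "id_grado": [4, 5]})

once = attach_grades(toy, grados)
twice = attach_grades(once, grados)
print(list(twice.columns))
```

Calling the helper a second time yields the same columns as the first call, so re-executing the cell no longer accumulates duplicates.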


In [None]:
# Drop the grades Prejardín, Jardín, and Transición
grados_to_remove = ['Prejardín', 'Jardín', 'Transición']
enrollments = enrollments[~enrollments['grado'].isin(grados_to_remove)]
enrollments = enrollments.reset_index(drop=True)
enrollments

Unnamed: 0,documento_identificación,moodle_user_id,year,edukrea_user_id,id_grado,grado_x,nivel_x,orden_x,grado_y,nivel_y,orden_y,grado,nivel,orden
0,b8bd5170750ee52b2456363db9b3987fe2af7b89842474...,1561,2024,,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
1,a197ee802a1b5e306881a2ea9af366266ac2d1f4643272...,1562,2024,,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
2,13d065def180ab40a3e0acc899c5524b4a437af268c0ec...,1563,2024,,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
3,65778a71ac202c4eada1eb1cd76b30206b0546c7a34b25...,1564,2024,,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
4,b6005ef089582419d97090147411273c47991bd937856d...,1565,2024,,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
589,5acf93947f69980541f2ba2990642bf327736590847762...,1997,2025,339,5,Segundo,Primaria,2,Segundo,Primaria,2,Segundo,Primaria,2
590,a5607b9d2bdca49568a8b4c44925d3372790c03df27136...,2000,2025,340,8,Quinto,Primaria,5,Quinto,Primaria,5,Quinto,Primaria,5
591,69a7a496c2e57c7b117869bcc305daf0eed608d6762ec0...,2001,2025,341,11,Octavo,Secundaria,8,Octavo,Secundaria,8,Octavo,Secundaria,8
592,c770616626bb56906040d885ccedae62ff67781227580f...,2003,2025,342,4,Primero,Primaria,1,Primero,Primaria,1,Primero,Primaria,1
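
In the rows of `Grados.csv` shown earlier, the three removed grades are the ones whose `nivel` is `Preescolar`. Assuming no other grade carries that level, the same filter could be expressed on `nivel` instead of a hard-coded list of names. A sketch on toy data with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({
    "moodle_user_id": [1539, 1542, 1561, 1997],
    "grado": ["Prejardín", "Jardín", "Primero", "Segundo"],
    "nivel": ["Preescolar", "Preescolar", "Primaria", "Primaria"],
})

# Keep every row whose level is not preschool, then renumber the index.
kept = toy[toy["nivel"] != "Preescolar"].reset_index(drop=True)
removed = len(toy) - len(kept)
print(f"removed {removed} rows, kept {len(kept)}")
```

Reporting how many rows were removed gives a quick check against the row counts printed before and after the filter.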


In [14]:
# Select the required columns: documento_identificación, moodle_user_id, year, edukrea_user_id, grado
enrollments = enrollments[['documento_identificación', 'moodle_user_id', 'year', 'edukrea_user_id', 'grado']]
enrollments

Unnamed: 0,documento_identificación,moodle_user_id,year,edukrea_user_id,grado
0,b8bd5170750ee52b2456363db9b3987fe2af7b89842474...,1561,2024,,Primero
1,a197ee802a1b5e306881a2ea9af366266ac2d1f4643272...,1562,2024,,Primero
2,13d065def180ab40a3e0acc899c5524b4a437af268c0ec...,1563,2024,,Primero
3,65778a71ac202c4eada1eb1cd76b30206b0546c7a34b25...,1564,2024,,Primero
4,b6005ef089582419d97090147411273c47991bd937856d...,1565,2024,,Primero
...,...,...,...,...,...
589,5acf93947f69980541f2ba2990642bf327736590847762...,1997,2025,339,Segundo
590,a5607b9d2bdca49568a8b4c44925d3372790c03df27136...,2000,2025,340,Quinto
591,69a7a496c2e57c7b117869bcc305daf0eed608d6762ec0...,2001,2025,341,Octavo
592,c770616626bb56906040d885ccedae62ff67781227580f...,2003,2025,342,Primero


## Save the resulting DataFrame as CSV

In [15]:
enrollments.to_csv("../../data/processed/Enrollments.csv", index=False)
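
The 2024 rows have no `edukrea_user_id`, so the saved CSV contains blanks in that column. If a downstream reader needs integer IDs rather than floats, one option is pandas' nullable `Int64` dtype when loading. The sketch below uses an in-memory buffer and toy values instead of the real file.

```python
import io
import pandas as pd

# Toy frame with one missing Edukrea ID (hypothetical values).
toy = pd.DataFrame({
    "moodle_user_id": [1561, 1997],
    "year": [2024, 2025],
    "edukrea_user_id": pd.array([None, 339], dtype="Int64"),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without the dtype hint, a column containing blanks is read as float64.
restored = pd.read_csv(buffer, dtype={"edukrea_user_id": "Int64"})
print(restored.dtypes)
```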
