<a href="https://colab.research.google.com/github/apchavezr/-Analisis_Grandes_Volumenes_Datos/blob/main/Modelo_PySpark_EvasionEscolar_Predictivo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎓 Predicting school dropout with anonymization and a PySpark model
This notebook extends the previous analysis by building a predictive model of school dropout on simulated data, while following data anonymization principles.

## 🎯 Objective
Build a logistic regression model that predicts school dropout (`evadio`) in Bogotá from variables such as age (`edad`), absences (`ausencias`), and grade repetition (`repitente`).
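
As a reminder of what a logistic regression estimates, the model form can be written out for the three predictors used in this notebook. The coefficient symbols $\beta_0,\dots,\beta_3$ are notation introduced here for illustration; the notebook itself never prints the fitted values.

$$
P(\text{evadio}=1 \mid \mathbf{x}) = \sigma\left(\beta_0 + \beta_1\,\text{edad} + \beta_2\,\text{repitente} + \beta_3\,\text{ausencias}\right), \qquad \sigma(z) = \frac{1}{1+e^{-z}}
$$

A student is predicted to drop out when this probability exceeds the decision threshold, which is 0.5 by default in Spark's `LogisticRegression`.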

In [1]:
# Step 1: Initialize the Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ModeloEvasionEscolar").getOrCreate()

In [2]:
# Step 2: Simulated data with hashed student identifiers
from pyspark.sql.functions import sha2
# Columns: student ID, age, locality, repeated a grade (0/1), absences, dropped out (0/1)
data = [
    ("STU001", 14, "Suba", 0, 5, 0),
    ("STU002", 15, "Usme", 1, 20, 1),
    ("STU003", 13, "Bosa", 0, 2, 0),
    ("STU004", 14, "Kennedy", 1, 18, 1),
    ("STU005", 15, "Chapinero", 0, 0, 0),
    ("STU006", 13, "Engativá", 1, 12, 1),
    ("STU007", 14, "Fontibón", 0, 6, 0),
    ("STU008", 15, "Usaquén", 0, 1, 0)
]
columnas = ["id_estudiante", "edad", "localidad", "repitente", "ausencias", "evadio"]
df = spark.createDataFrame(data, columnas)
# Replace the raw student ID with its SHA-256 digest and drop the original column
df = df.withColumn("id_hash", sha2(df.id_estudiante, 256)).drop("id_estudiante")
df.show(truncate=False)

+----+---------+---------+---------+------+----------------------------------------------------------------+
|edad|localidad|repitente|ausencias|evadio|id_hash                                                         |
+----+---------+---------+---------+------+----------------------------------------------------------------+
|14  |Suba     |0        |5        |0     |ccda4464abcb9c2596058e503666e131f6f55e77f89f685a243ba0fc16fc0ac2|
|15  |Usme     |1        |20       |1     |15ae7f8e6a895344b20e105af4b717292636619817f4be3683fb4d805ac5212c|
|13  |Bosa     |0        |2        |0     |70cbaef19ca9d745093f764b47ccc54654db42f5c9347acd99504365adb2ff01|
|14  |Kennedy  |1        |18       |1     |9d048afbff9b552b1f6eec6ee15c9cf56c6e0af1ee4b85bcc9b6468e8abfc86b|
|15  |Chapinero|0        |0        |0     |9a878dc8cef4d72dc82c125b9dc8dcb3db4690f41f9a45bdfa1a511f08cc58c2|
|13  |Engativá |1        |12       |1     |afa01fe6b2e8ce5fc9ab83e1dd6665a15c7e95b1e09a6a3cbbdad5eef7eb7e34|
|14  |Fontibón |0  
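
The `sha2` step above hides the raw IDs, but identifiers that follow a short, predictable pattern (`STU001`, `STU002`, and so on) can be recovered from an unsalted digest by hashing every candidate and comparing. The sketch below is not part of the original notebook. It uses plain Python and assumes the IDs range over `STU000` to `STU999`; `SECRET_KEY`, `plain_hash`, and `keyed_hash` are hypothetical names used only for illustration. It contrasts an unsalted SHA-256 digest (which should be the same digest Spark's `sha2(col, 256)` returns for a UTF-8 string) with a keyed HMAC-SHA-256 digest.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be kept outside the notebook.
SECRET_KEY = b"replace-with-a-secret-value"

def plain_hash(student_id: str) -> str:
    """Unsalted SHA-256 hex digest of the student ID."""
    return hashlib.sha256(student_id.encode("utf-8")).hexdigest()

def keyed_hash(student_id: str, key: bytes = SECRET_KEY) -> str:
    """HMAC-SHA-256 hex digest; cannot be recomputed without the key."""
    return hmac.new(key, student_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Dictionary attack on the unsalted digest: hash every ID in the assumed range.
lookup = {plain_hash(f"STU{i:03d}"): f"STU{i:03d}" for i in range(1000)}
recovered = lookup[plain_hash("STU002")]
print("Recovered from unsalted hash:", recovered)

# The keyed digest has the same length but does not appear in the lookup table.
protected = keyed_hash("STU002")
print("Keyed digest found in lookup table:", protected in lookup)
```

In Spark, a comparable effect could be obtained by concatenating a secret value before hashing, for example `sha2(concat(lit(secret), df.id_estudiante), 256)`, where `secret` is a value you supply. This is a suggestion, not something the notebook does.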

## 📈 Step 3: Building the model with PySpark ML

In [3]:
# Select the predictor variables and assemble them into a single feature vector
from pyspark.ml.feature import VectorAssembler
ensamblador = VectorAssembler(
    inputCols=["edad", "repitente", "ausencias"],
    outputCol="features"
)
df_modelo = ensamblador.transform(df)
df_modelo.select("features", "evadio").show(truncate=False)

+---------------+------+
|features       |evadio|
+---------------+------+
|[14.0,0.0,5.0] |0     |
|[15.0,1.0,20.0]|1     |
|[13.0,0.0,2.0] |0     |
|[14.0,1.0,18.0]|1     |
|[15.0,0.0,0.0] |0     |
|[13.0,1.0,12.0]|1     |
|[14.0,0.0,6.0] |0     |
|[15.0,0.0,1.0] |0     |
+---------------+------+



In [4]:
# Train the logistic regression model
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="evadio")
modelo = lr.fit(df_modelo)
# Score the same rows used for training (there is no held-out test set here)
predicciones = modelo.transform(df_modelo)
predicciones.select("features", "prediction", "probability", "evadio").show(truncate=False)

+---------------+----------+-------------------------------------------+------+
|features       |prediction|probability                                |evadio|
+---------------+----------+-------------------------------------------+------+
|[14.0,0.0,5.0] |0.0       |[0.9999999925881196,7.4118804427314444E-9] |0     |
|[15.0,1.0,20.0]|1.0       |[2.4775141468262574E-9,0.9999999975224858] |1     |
|[13.0,0.0,2.0] |0.0       |[0.9999999932207237,6.77927625147845E-9]   |0     |
|[14.0,1.0,18.0]|1.0       |[1.1733366568283796E-9,0.9999999988266633] |1     |
|[15.0,0.0,0.0] |0.0       |[0.9999999999899547,1.0045297926808416E-11]|0     |
|[13.0,1.0,12.0]|1.0       |[1.578277732311376E-8,0.9999999842172227]  |1     |
|[14.0,0.0,6.0] |0.0       |[0.9999999828893317,1.711066832665864E-8]  |0     |
|[15.0,0.0,1.0] |0.0       |[0.9999999999768099,2.3190116493765345E-11]|0     |
+---------------+----------+-------------------------------------------+------+
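
Every prediction in the table matches its label, and the probabilities sit very close to 0 or 1. That is expected with this sample: `repitente` equals `evadio` in all eight rows, so the classes are perfectly separable, and the model is scored on the rows it was trained on. The sketch below is not part of the original notebook. It tallies a confusion matrix in plain Python from the `evadio` and `prediction` columns printed above; `confusion_counts` is a hypothetical helper name.

```python
# (label, prediction) pairs copied from the `evadio` and `prediction` columns above
rows = [
    (0, 0.0), (1, 1.0), (0, 0.0), (1, 1.0),
    (0, 0.0), (1, 1.0), (0, 0.0), (0, 0.0),
]

def confusion_counts(pairs):
    """Return (tp, fp, tn, fn) for binary labels and predictions."""
    tp = fp = tn = fn = 0
    for label, prediction in pairs:
        predicted = int(prediction)
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts(rows)
accuracy = (tp + tn) / len(rows)
print(f"TP={tp} FP={fp} TN={tn} FN={fn} accuracy={accuracy}")
```

This is training-set accuracy on eight simulated rows, so it says little about how the model would generalize. With a larger dataset, one option would be to hold out part of the data with `DataFrame.randomSplit` and score the held-out rows with `BinaryClassificationEvaluator` from `pyspark.ml.evaluation`.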



## ✅ Conclusion
- A basic school dropout prediction model was built with PySpark.
- The model uses age, absences, and grade repetition as predictor variables.
- Student identifiers were replaced with SHA-256 hashes to protect the students' identity.