# Data Aggregation

# Goal
Aggregate both datasets (Census and COVID) and join the results.

# Methodology
- Compute several aggregations on the census data, grouped by city.
- Compute several aggregations on the COVID data.
- Join the two resulting DataFrames.
- Save the joined DataFrame to S3.
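
The first three steps can be sketched in plain Python on a handful of made-up records. This only illustrates the data flow and is not the notebook's Spark implementation: the record values, the helper names, the reading of `VA_EE == 1` as "has electricity", and the inner join on `(department, municipality)` are all assumptions. The final write to S3 is omitted.

```python
from collections import defaultdict

# Hypothetical census person records: (U_DPTO, U_MPIO, VA_EE).
# Assumption: VA_EE == 1 means the dwelling has electricity.
census = [(5, 1, 1), (5, 1, 1), (5, 1, 2), (11, 1, 1)]
# Hypothetical COVID case records: (divipola_dpto, divipola_mpio, edad).
covid = [(5, 1, 40), (5, 1, 60), (11, 1, 30)]


def aggregate_census(rows):
    """Share of persons with electricity per city."""
    total, with_ee = defaultdict(int), defaultdict(int)
    for dpto, mpio, va_ee in rows:
        total[(dpto, mpio)] += 1
        with_ee[(dpto, mpio)] += va_ee == 1
    return {city: {"pct_electricity": with_ee[city] / total[city]} for city in total}


def aggregate_covid(rows):
    """Number of cases and mean age per city."""
    ages = defaultdict(list)
    for dpto, mpio, edad in rows:
        ages[(dpto, mpio)].append(edad)
    return {city: {"cases": len(a), "mean_age": sum(a) / len(a)} for city, a in ages.items()}


def join_by_city(left, right):
    """Inner join of two per-city aggregates on the city key."""
    return {city: {**left[city], **right[city]} for city in left.keys() & right.keys()}


joined = join_by_city(aggregate_census(census), aggregate_covid(covid))
```

Each entry of `joined` holds the census and COVID aggregates for one city, which is the shape of the table the pipeline is meant to save.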

### Aggregations
**Persons (Personas)**
- VA_EE: Electricity service (code 1). Percentage of persons with access to electricity.
- VA1_ESTRATO: [0-6, 9] Percentage of persons per socioeconomic stratum (estrato).
- VB_ACU: Water supply (aqueduct) service. Percentage of persons with access to the water supply.
- VF_INTERNET: Internet service.
- HA_TOT_PER: Average number of persons per household.
- P_EDADR: [0-4, 5-9, 10-14, 15-19, ..., >100] Population pyramid by city.
- P_ALFABETA: Literacy rate (percentage).
- PA1_CALIDAD_SERV: Average health service quality [1-4].
- P_NIVEL_ANOSR: Persons by education level (percentage per level) [1-9, 10 None, 99 Not reported, NA].
- PA1_THFC: Number of children living outside Colombia.

**Deceased (Fallecidos)**
- FA2_SEXO_FALL: Percentage of deceased who are male.
- FA3_EDAD_FALL: Age at death. Mean age at death (life expectancy).
- FA2_SEXO_FALL&FA3_EDAD_FALL: Life expectancy by sex.
- VA1_ESTRATO&FA3_EDAD_FALL: Life expectancy by stratum.
- UA_CLASE&FA3_EDAD_FALL: Life expectancy by municipal class (1 - Municipal seat (cabecera), 2 - Populated center, 3 - Dispersed rural, 4 - Remaining rural).

**COVID**
- FA2_SEXO_FALL: Percentage of deceased who are male.
- FA3_EDAD_FALL: Age at death. Mean age at death (life expectancy).
- FA2_SEXO_FALL&FA3_EDAD_FALL: Life expectancy by sex.
- VA1_ESTRATO&FA3_EDAD_FALL: Life expectancy by stratum.
- UA_CLASE&FA3_EDAD_FALL: Life expectancy by municipal class (1 - Municipal seat (cabecera), 2 - Populated center, 3 - Dispersed rural, 4 - Remaining rural).
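
Most items in these lists are either a per-city mean or a per-city percentage breakdown by category (for example, persons per stratum). A minimal sketch of the percentage breakdown, in plain Python with invented rows; the function name and the city codes are hypothetical, and the notebook itself would do this on Spark DataFrames.

```python
from collections import Counter, defaultdict


def percentage_by_category(rows):
    """Map each city to {category: percentage of that city's rows}."""
    counts = defaultdict(Counter)
    for city, category in rows:
        counts[city][category] += 1
    result = {}
    for city, counter in counts.items():
        total = sum(counter.values())
        result[city] = {cat: 100.0 * n / total for cat, n in counter.items()}
    return result


# Hypothetical (city code, VA1_ESTRATO) pairs, one per person.
rows = [("05001", 1), ("05001", 1), ("05001", 2), ("05001", 3), ("11001", 2)]
breakdown = percentage_by_category(rows)
```

The percentages of each city sum to 100, so categories that never occur in a city are simply absent and can be filled with 0 afterwards.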

## Sections
1. [**Requirements**](#Requirements)
2. [**Functions**](#Functions)
3. [**Inputs**](#Inputs)
4. [**Pipeline**](#Pipeline)

# Requirements

In [1]:
#installing packages
sc.install_pypi_package("pandas")
sc.install_pypi_package("boto3")
sc.setCheckpointDir('hdfs:///covid')

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1594932747078_0003,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Collecting pandas
  Using cached https://files.pythonhosted.org/packages/af/f3/683bf2547a3eaeec15b39cef86f61e921b3b187f250fcd2b5c5fb4386369/pandas-1.0.5-cp37-cp37m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.6.1 (from pandas)
  Using cached https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl
Installing collected packages: python-dateutil, pandas
Successfully installed pandas-1.0.5 python-dateutil-2.8.1

Collecting boto3
Collecting botocore<1.18.0,>=1.17.22 (from boto3)
  Using cached https://files.pythonhosted.org/packages/31/f5/e7bc1a13d038b812d1e1dc55b9cb19f6ac86dbc376eb0cd50df5c991ef46/botocore-1.17.22-py2.py3-none-any.whl
Collecting s3transfer<0.4.0,>=0.3.0 (from boto3)
  Using cached https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl
Collecting urllib3<1.26,>=1.20; python_version != "

In [2]:
import time
import os
import boto3
import gc
import sys
import numpy as np
import pandas as pd
import pickle
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import (FloatType, DateType, StructType, StructField, StringType, LongType, 
    IntegerType, ArrayType, BooleanType, DoubleType)
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler, StandardScaler, QuantileDiscretizer
gc.enable()

spark = SparkSession.builder.config("spark.sql.shuffle.partitions", 20).appName("covid").getOrCreate()
print(spark.sparkContext.getConf().get('spark.driver.memory'))
print(spark.sparkContext.getConf().get("spark.sql.shuffle.partitions"))

2048M
20

# Functions

## Loading data

In [3]:
def build_schema_census(source="VIV"):
    """
    Build schema for different sources

    Parameters:
    -----------
    source : str
        Table source may be: "VIV", "PER", "HOG", "FALL", "MGN"

    Return:
    -------
    schema : spark.schema
        Spark schema for loading source table
    """
    if source == "VIV":
        schema = StructType([StructField("TIPO_REG", IntegerType()),
                             StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("U_EDIFICA", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType()),
                             StructField("UVA_ESTATER", IntegerType()),
                             StructField("UVA1_TIPOTER", DoubleType()),
                             StructField("UVA2_CODTER", DoubleType()),
                             StructField("UVA_ESTA_AREAPROT", IntegerType()),
                             StructField("UVA1_COD_AREAPROT", DoubleType()),
                             StructField("UVA_USO_UNIDAD", IntegerType()),
                             StructField("V_TIPO_VIV", DoubleType()),
                             StructField("V_CON_OCUP", DoubleType()),
                             StructField("V_TOT_HOG", DoubleType()),
                             StructField("V_MAT_PARED", DoubleType()),
                             StructField("V_MAT_PISO", DoubleType()),
                             StructField("VA_EE", DoubleType()),
                             StructField("VA1_ESTRATO", DoubleType()),
                             StructField("VB_ACU", DoubleType()),
                             StructField("VC_ALC", DoubleType()),
                             StructField("VD_GAS", DoubleType()),
                             StructField("VE_RECBAS", DoubleType()),
                             StructField("VE1_QSEM", DoubleType()),
                             StructField("VF_INTERNET", DoubleType()),
                             StructField("V_TIPO_SERSA", DoubleType()),
                             StructField("L_TIPO_INST", DoubleType()),
                             StructField("L_EXISTEHOG", DoubleType()),
                             StructField("L_TOT_PERL", DoubleType())
                             ])
    elif source == "HOG":
        schema = StructType([StructField("TIPO_REG", IntegerType()),
                             StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType()),
                             StructField("H_NROHOG", DoubleType()),
                             StructField("H_NRO_CUARTOS", DoubleType()),
                             StructField("H_NRO_DORMIT", DoubleType()),
                             StructField("H_DONDE_PREPALIM", DoubleType()),
                             StructField("H_AGUA_COCIN", DoubleType()),
                             StructField("HA_NRO_FALL", DoubleType()),
                             StructField("HA_TOT_PER", DoubleType())
                             ])
    elif source == "PER":
        schema = StructType([StructField("TIPO_REG", IntegerType()),
                             StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("U_EDIFICA", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType()),
                             StructField("P_NROHOG", DoubleType()),
                             StructField("P_NRO_PER", IntegerType()),
                             StructField("P_SEXO", IntegerType()),
                             StructField("P_EDADR", IntegerType()),
                             StructField("P_PARENTESCOR", DoubleType()),
                             StructField("PA1_GRP_ETNIC", IntegerType()),
                             StructField("PA11_COD_ETNIA", DoubleType()),
                             StructField("PA12_CLAN", DoubleType()),
                             StructField("PA21_COD_VITSA", DoubleType()),
                             StructField("PA22_COD_KUMPA", DoubleType()),
                             StructField("PA_HABLA_LENG", DoubleType()),
                             StructField("PA1_ENTIENDE", DoubleType()),
                             StructField("PB_OTRAS_LENG", DoubleType()),
                             StructField("PB1_QOTRAS_LENG", DoubleType()),
                             StructField("PA_LUG_NAC", IntegerType()),
                             StructField("PA_VIVIA_5ANOS", DoubleType()),
                             StructField("PA_VIVIA_1ANO", DoubleType()),
                             StructField("P_ENFERMO", DoubleType()),
                             StructField("P_QUEHIZO_PPAL", DoubleType()),
                             StructField("PA_LO_ATENDIERON", DoubleType()),
                             StructField("PA1_CALIDAD_SERV", DoubleType()),
                             StructField("CONDICION_FISICA", DoubleType()),
                             StructField("P_ALFABETA", DoubleType()),
                             StructField("PA_ASISTENCIA", DoubleType()),
                             StructField("P_NIVEL_ANOSR", DoubleType()),
                             StructField("P_TRABAJO", DoubleType()),
                             StructField("P_EST_CIVIL", DoubleType()),
                             StructField("PA_HNV", DoubleType()),
                             StructField("PA1_THNV", DoubleType()),
                             StructField("PA2_HNVH", DoubleType()),
                             StructField("PA3_HNVM", DoubleType()),
                             StructField("PA_HNVS", DoubleType()),
                             StructField("PA1_THSV", DoubleType()),
                             StructField("PA2_HSVH", DoubleType()),
                             StructField("PA3_HSVM", DoubleType()),
                             StructField("PA_HFC", DoubleType()),
                             StructField("PA1_THFC", DoubleType()),
                             StructField("PA2_HFCH", DoubleType()),
                             StructField("PA3_HFCM", DoubleType()),
                             StructField("PA_UHNV", DoubleType()),
                             StructField("PA1_MES_UHNV", DoubleType()),
                             StructField("PA2_ANO_UHNV", DoubleType())
                             ])
    elif source == "FALL":
        schema = StructType([StructField("TIPO_REG", IntegerType()),
                             StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType()),
                             StructField("F_NROHOG", IntegerType()),
                             StructField("FA1_NRO_FALL", IntegerType()),
                             StructField("FA2_SEXO_FALL", IntegerType()),
                             StructField("FA3_EDAD_FALL", IntegerType()),
                             StructField("FA4_CERT_DEFUN", IntegerType())
                             ])
    elif source == "MGN":
        schema = StructType([StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("UA1_LOCALIDAD", IntegerType()),
                             StructField("U_SECT_RUR", IntegerType()),
                             StructField("U_SECC_RUR", IntegerType()),
                             StructField("UA2_CPOB", IntegerType()),
                             StructField("U_SECT_URB", IntegerType()),
                             StructField("U_SECC_URB", IntegerType()),
                             StructField("U_MZA", IntegerType()),
                             StructField("U_EDIFICA", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType())
                             ])
    else:
        # Fail loudly: falling through would hit an unbound `schema` below
        raise ValueError("Source not valid. Enter one of the following sources: "
                         "VIV, PER, HOG, FALL, MGN")
    return schema


def build_schema_covid(source="covid"):
    """
    Build schema for different covid sources

    Parameters:
    -----------
    source : str
        Table source may be: "covid", "tests"

    Return:
    -------
    schema : spark.schema
        Spark schema for loading source table
    """
    if source == "covid":
        schema = StructType([StructField("fecha_de_notificaci_n", DateType()),
                             StructField("c_digo_divipola", StringType()),
                             StructField("ciudad_de_ubicaci_n", StringType()),
                             StructField("departamento", StringType()),
                             StructField("atenci_n", StringType()),
                             StructField("edad", IntegerType()),
                             StructField("sexo", StringType()),
                             StructField("tipo", StringType()),
                             StructField("estado", StringType()),
                             StructField("pa_s_de_procedencia", StringType()),
                             StructField("fis", DateType()),
                             StructField("fecha_diagnostico", DateType()),
                             StructField("fecha_recuperado", DateType()),
                             StructField("fecha_reporte_web", DateType()),
                             StructField("tipo_recuperaci_n", StringType()),
                             StructField("codigo_departamento", StringType()),
                             StructField("codigo_pais", StringType()),
                             StructField("pertenencia_etnica", StringType()),
                             StructField("nombre_grupo_etnico", StringType()),
                             StructField("fecha_de_muerte", DateType()),
                             StructField("Asintomatico", IntegerType()),
                             StructField("divipola_dpto", IntegerType()),
                             StructField("divipola_mpio", IntegerType()),
                             StructField("edad_q", IntegerType()),
                             StructField("muerto", BooleanType()),
                             StructField("edad_muerto", IntegerType()),
                             ])
    elif source == "tests":
        schema = StructType([StructField("fecha", DateType()),
                             StructField("acumuladas", DoubleType()),
                             StructField("amazonas", DoubleType()),
                             StructField("antioquia", DoubleType()),
                             StructField("arauca", DoubleType()),
                             StructField("atlantico", DoubleType()),
                             StructField("bogota", DoubleType()),
                             StructField("bolivar", DoubleType()),
                             StructField("boyaca", DoubleType()),
                             StructField("caldas", DoubleType()),
                             StructField("caqueta", DoubleType()),
                             StructField("casanare", DoubleType()),
                             StructField("cauca", DoubleType()),
                             StructField("cesar", DoubleType()),
                             StructField("choco", DoubleType()),
                             StructField("cordoba", DoubleType()),
                             StructField("cundinamarca", DoubleType()),
                             StructField("guainia", DoubleType()),
                             StructField("guajira", DoubleType()),
                             StructField("guaviare", DoubleType()),
                             StructField("huila", DoubleType()),
                             StructField("magdalena", DoubleType()),
                             StructField("meta", DoubleType()),
                             StructField("narino", DoubleType()),
                             StructField("norte_de_santander", DoubleType()),
                             StructField("putumayo", DoubleType()),
                             StructField("quindio", DoubleType()),
                             StructField("risaralda", DoubleType()),
                             StructField("san_andres", DoubleType()),
                             StructField("santander", DoubleType()),
                             StructField("sucre", DoubleType()),
                             StructField("tolima", DoubleType()),
                             StructField("valle_del_cauca", DoubleType()),
                             StructField("vaupes", DoubleType()),
                             StructField("vichada", DoubleType()),
                             StructField("procedencia_desconocida", DoubleType()),
                             StructField("positivas_acumuladas", DoubleType()),
                             StructField("negativas_acumuladas", DoubleType()),
                             StructField("positividad_acumulada", DoubleType()),
                             StructField("indeterminadas", DoubleType()),
                             StructField("barranquilla", DoubleType()),
                             StructField("cartagena", DoubleType()),
                             StructField("santa_marta", DoubleType())
                             ])
    else:
        # Fail loudly: falling through would hit an unbound `schema` below
        raise ValueError("Source not valid. Enter one of the following sources: "
                         "'covid', 'tests'")
    return schema

def build_schema_divipola(source="divipola"):
    """
    Build schema for the DIVIPOLA source

    Parameters:
    -----------
    source : str
        Table source may be: "divipola"

    Return:
    -------
    schema : spark.schema
        Spark schema for loading source table
    """
    if source == "divipola":
        schema = StructType([StructField("cod_depto", IntegerType()),
                             StructField("cod_mpio", IntegerType()),
                             StructField("dpto", StringType()),
                             StructField("nom_mpio", StringType()),
                             StructField("tipo_municipio", StringType())
                             ])
    else:
        # Fail loudly: falling through would hit an unbound `schema` below
        raise ValueError("Source not valid. Enter the following source: 'divipola'")
    return schema

def build_schema_complete(source="vivienda"):
    """
    Build schema for the complete (merged) census tables

    Parameters:
    -----------
    source : str
        Table source may be: "fallecidos", "personas"

    Return:
    -------
    schema : spark.schema
        Spark schema for loading source table
    """
    if source == "fallecidos":
        schema = StructType([StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("U_EDIFICA", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType()),
                             StructField("F_NROHOG", IntegerType()),
                             StructField("FA1_NRO_FALL", IntegerType()),
                             StructField("FA2_SEXO_FALL", IntegerType()),
                             StructField("FA3_EDAD_FALL", IntegerType()),
                             StructField("FA4_CERT_DEFUN", IntegerType()),
                             StructField("UVA_USO_UNIDAD", IntegerType()),
                             StructField("V_TIPO_VIV", DoubleType()),
                             StructField("V_CON_OCUP", DoubleType()),
                             StructField("V_TOT_HOG", DoubleType()),
                             StructField("V_MAT_PARED", DoubleType()),
                             StructField("V_MAT_PISO", DoubleType()),
                             StructField("VA_EE", DoubleType()),
                             StructField("VA1_ESTRATO", DoubleType()),
                             StructField("VB_ACU", DoubleType()),
                             StructField("VC_ALC", DoubleType()),
                             StructField("VD_GAS", DoubleType()),
                             StructField("VE_RECBAS", DoubleType()),
                             StructField("VE1_QSEM", DoubleType()),
                             StructField("VF_INTERNET", DoubleType()),
                             StructField("V_TIPO_SERSA", DoubleType()),
                             StructField("L_TIPO_INST", DoubleType()),
                             StructField("L_EXISTEHOG", DoubleType()),
                             StructField("L_TOT_PERL", DoubleType()),
                             StructField("H_NRO_CUARTOS_H", DoubleType()),
                             StructField("H_NRO_DORMIT_H", DoubleType()),
                             StructField("H_DONDE_PREPALIM_H", DoubleType()),
                             StructField("H_AGUA_COCIN_H", DoubleType()),
                             StructField("HA_NRO_FALL_H", DoubleType()),
                             StructField("HA_TOT_PER_H", DoubleType()),
                             StructField("UA1_LOCALIDAD", IntegerType()),
                             StructField("U_SECT_RUR", IntegerType()),
                             StructField("U_SECC_RUR", IntegerType()),
                             StructField("UA2_CPOB", IntegerType()),
                             StructField("U_SECT_URB", IntegerType()),
                             StructField("U_SECC_URB", IntegerType()),
                             StructField("U_MZA", IntegerType()),
                             StructField("dpto", StringType()),
                             StructField("nom_mpio", StringType()),
                             StructField("tipo_municipio", StringType())])
    elif source == "personas":
        schema = StructType([StructField("U_DPTO", IntegerType()),
                             StructField("U_MPIO", IntegerType()),
                             StructField("UA_CLASE", IntegerType()),
                             StructField("U_EDIFICA", IntegerType()),
                             StructField("COD_ENCUESTAS", IntegerType()),
                             StructField("U_VIVIENDA", IntegerType()),
                             StructField("P_NROHOG", IntegerType()),
                             StructField("P_NRO_PER", IntegerType()),
                             StructField("P_SEXO", IntegerType()),
                             StructField("P_EDADR", IntegerType()),
                             StructField("P_PARENTESCOR", DoubleType()),
                             StructField("PA_LUG_NAC", IntegerType()),
                             StructField("PA_VIVIA_5ANOS", DoubleType()),
                             StructField("PA_VIVIA_1ANO", DoubleType()),
                             StructField("P_ENFERMO", DoubleType()),
                             StructField("P_QUEHIZO_PPAL", DoubleType()),
                             StructField("PA_LO_ATENDIERON", DoubleType()),
                             StructField("PA1_CALIDAD_SERV", DoubleType()),
                             StructField("CONDICION_FISICA", DoubleType()),
                             StructField("P_ALFABETA", DoubleType()),
                             StructField("PA_ASISTENCIA", DoubleType()),
                             StructField("P_NIVEL_ANOSR", DoubleType()),
                             StructField("P_TRABAJO", DoubleType()),
                             StructField("P_EST_CIVIL", DoubleType()),
                             StructField("PA_HNV", DoubleType()),
                             StructField("PA1_THNV", DoubleType()),
                             StructField("PA2_HNVH", DoubleType()),
                             StructField("PA3_HNVM", DoubleType()),
                             StructField("PA_HNVS", DoubleType()),
                             StructField("PA1_THSV", DoubleType()),
                             StructField("PA2_HSVH", DoubleType()),
                             StructField("PA3_HSVM", DoubleType()),
                             StructField("PA_HFC", DoubleType()),
                             StructField("PA1_THFC", DoubleType()),
                             StructField("PA2_HFCH", DoubleType()),
                             StructField("PA3_HFCM", DoubleType()),
                             StructField("UVA_USO_UNIDAD", IntegerType()),
                             StructField("V_TIPO_VIV", DoubleType()),
                             StructField("V_CON_OCUP", DoubleType()),
                             StructField("V_TOT_HOG", DoubleType()),
                             StructField("V_MAT_PARED", DoubleType()),
                             StructField("V_MAT_PISO", DoubleType()),
                             StructField("VA_EE", DoubleType()),
                             StructField("VA1_ESTRATO", DoubleType()),
                             StructField("VB_ACU", DoubleType()),
                             StructField("VC_ALC", DoubleType()),
                             StructField("VD_GAS", DoubleType()),
                             StructField("VE_RECBAS", DoubleType()),
                             StructField("VE1_QSEM", DoubleType()),
                             StructField("VF_INTERNET", DoubleType()),
                             StructField("V_TIPO_SERSA", DoubleType()),
                             StructField("L_TIPO_INST", DoubleType()),
                             StructField("L_EXISTEHOG", DoubleType()),
                             StructField("L_TOT_PERL", DoubleType()),
                             StructField("H_NRO_CUARTOS_H", DoubleType()),
                             StructField("H_NRO_DORMIT_H", DoubleType()),
                             StructField("H_DONDE_PREPALIM_H", DoubleType()),
                             StructField("H_AGUA_COCIN_H", DoubleType()),
                             StructField("HA_NRO_FALL_H", DoubleType()),
                             StructField("HA_TOT_PER_H", DoubleType()),
                             StructField("UA1_LOCALIDAD", IntegerType()),
                             StructField("U_SECT_RUR", IntegerType()),
                             StructField("U_SECC_RUR", IntegerType()),
                             StructField("UA2_CPOB", IntegerType()),
                             StructField("U_SECT_URB", IntegerType()),
                             StructField("U_SECC_URB", IntegerType()),
                             StructField("U_MZA", IntegerType()),
                             StructField("dpto", StringType()),
                             StructField("nom_mpio", StringType()),
                             StructField("tipo_municipio", StringType())])
    else:
        # Fail loudly: falling through would hit an unbound `schema` below
        raise ValueError("Source not valid. Enter one of the following sources: "
                         "fallecidos, personas")
    return schema

def get_censo_paths(bucket_s3, directory_key):
    """
    Get dictionary of census data for each department
    
    Parameters:
    -----------
    bucket_s3 : s3.Bucket
        Boto3 Bucket object
    directory_key : path
        Directory key in S3
    
    Return:
    -------
    dict_paths_departments : dict
        Dictionary with the data path for each department
    """
    # Checked in this order; the first table name found in the file name wins
    table_names = ("MGN", "FALL", "HOG", "VIV", "PER")
    dict_paths_departments = {}
    for object_summary in bucket_s3.objects.filter(Prefix=directory_key):
        name = object_summary.key
        if not name.endswith(".CSV"):
            continue
        list_paths = name.split("/")
        # Department is the second "_"-separated token of the third path component
        department = list_paths[2].split("_")[1]
        table = next((t for t in table_names if t in list_paths[-1]), None)
        if table is not None:
            dict_paths_departments.setdefault(department, {})[table] = os.path.join(
                f"s3a://{bucket_s3.name}", name)
    return dict_paths_departments
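
The key layout that `get_censo_paths` expects is easiest to see on a single key. The sketch below reproduces only the parsing step in plain Python, with no S3 access. The helper name `parse_census_key` and the example keys are hypothetical and chosen only to match the indexing used above (third path component for the department, last component for the table).

```python
TABLES = ("MGN", "FALL", "HOG", "VIV", "PER")


def parse_census_key(key):
    """Return (department, table) for a census CSV key, or None if it does not match."""
    if not key.endswith(".CSV"):  # the check is case-sensitive, as in get_censo_paths
        return None
    parts = key.split("/")
    # Department: second "_"-separated token of the third path component
    department = parts[2].split("_")[1]
    # Table: first known table name contained in the file name
    table = next((t for t in TABLES if t in parts[-1]), None)
    return (department, table) if table else None


# Hypothetical key layout: <prefix>/<subdir>/<code>_<department>_.../<file>.CSV
example_key = "raw/censo/05_Antioquia_CSV/CNPV2018_1VIV_A2_05.CSV"
parsed = parse_census_key(example_key)
skipped = parse_census_key("raw/censo/05_Antioquia_CSV/readme.txt")
```

Because the table is found by substring matching, a file name containing more than one of the table names resolves to whichever comes first in the checked order.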

## Preprocessing

In [4]:
def add_suffix_to_cols(df, suffix):
    """Append `suffix` to the name of every column of `df`."""
    for col in df.columns:
        df = df.withColumnRenamed(col, col + suffix)
    return df

def isfloat(value):
    """Return True if `value` can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def add_prefix_to_cols(df, prefix, exclude_cols):
    """
    Prepend `prefix` to every column of `df` that is not in `exclude_cols`.
    Float-like names such as "1.0" are first reduced to their integer form ("1").
    """
    columns_to_prefix = [col for col in df.columns if col not in exclude_cols]
    for col in columns_to_prefix:
        if not col.isdigit() and isfloat(col):
            new_name = prefix + str(int(float(col)))
        else:
            new_name = prefix + col
        df = df.withColumnRenamed(col, new_name)
    return df

def fillna_0(df, exclude_cols):
    """Replace nulls with 0 in every column of `df` that is not in `exclude_cols`."""
    columns_to_fill = [col for col in df.columns if col not in exclude_cols]
    df = df.fillna(0, subset=columns_to_fill)
    return df
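
The renaming rule in `add_prefix_to_cols` can be checked without Spark by applying it to plain strings. The sketch below mirrors that rule for a single column name. `prefixed_name` and the `estrato_` prefix are hypothetical, and the float-like names are assumed to come from grouping or pivoting on a `DoubleType` column such as `VA1_ESTRATO`.

```python
def is_float(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def prefixed_name(col, prefix):
    """Mirror the renaming rule of add_prefix_to_cols for one column name."""
    if col.isdigit():
        return prefix + col
    if is_float(col):
        # "2.0" -> "2": drop the decimal part before prefixing
        return prefix + str(int(float(col)))
    return prefix + col


names = [prefixed_name(c, "estrato_") for c in ["1", "2.0", "9.0", "null"]]
```

One caveat of this rule: a name such as `"nan"` or `"inf"` parses as a float but cannot be converted to an integer, so the conversion raises an exception instead of falling back to the plain prefix.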

# Inputs

In [5]:
metadata = {"CENSO": {"VIVIENDA": {"useful_columns": ['U_DPTO', 'U_MPIO', 'UA_CLASE', 'U_EDIFICA',
                                                      'COD_ENCUESTAS', 'U_VIVIENDA', 'UVA_USO_UNIDAD',
                                                      'V_TIPO_VIV', 'V_CON_OCUP', 'V_TOT_HOG',
                                                      'V_MAT_PARED', 'V_MAT_PISO', 'VA_EE', 'VA1_ESTRATO', 'VB_ACU', 'VC_ALC',
                                                      'VD_GAS', 'VE_RECBAS', 'VE1_QSEM', 'VF_INTERNET', 'V_TIPO_SERSA',
                                                      'L_TIPO_INST', 'L_EXISTEHOG', 'L_TOT_PERL']
                                   },
                      "HOGAR": {"useful_columns": ['U_DPTO', 'U_MPIO', 'UA_CLASE', 'COD_ENCUESTAS',
                                                   'U_VIVIENDA', 'H_NROHOG', 'H_NRO_CUARTOS', 'H_NRO_DORMIT',
                                                   'H_DONDE_PREPALIM', 'H_AGUA_COCIN', 'HA_NRO_FALL', 'HA_TOT_PER']},
                      "PERSONAS": {"useful_columns": ['U_DPTO', 'U_MPIO', 'UA_CLASE', 'U_EDIFICA',
                                                      'COD_ENCUESTAS', 'U_VIVIENDA', 'P_NROHOG', 'P_NRO_PER', 'P_SEXO',
                                                      'P_EDADR', 'P_PARENTESCOR', 'PA_LUG_NAC',
                                                      'PA_VIVIA_5ANOS', 'PA_VIVIA_1ANO', 'P_ENFERMO', 'P_QUEHIZO_PPAL',
                                                      'PA_LO_ATENDIERON', 'PA1_CALIDAD_SERV', 'CONDICION_FISICA',
                                                      'P_ALFABETA', 'PA_ASISTENCIA', 'P_NIVEL_ANOSR', 'P_TRABAJO',
                                                      'P_EST_CIVIL', 'PA_HNV', 'PA1_THNV', 'PA2_HNVH', 'PA3_HNVM', 'PA_HNVS',
                                                      'PA1_THSV', 'PA2_HSVH', 'PA3_HSVM', 'PA_HFC', 'PA1_THFC', 'PA2_HFCH',
                                                      'PA3_HFCM']},
                      "FALLECIDOS": {"useful_columns": ['U_DPTO', 'U_MPIO', 'UA_CLASE', 'COD_ENCUESTAS',
                                                        'U_VIVIENDA', 'F_NROHOG', 'FA1_NRO_FALL', 'FA2_SEXO_FALL',
                                                        'FA3_EDAD_FALL', 'FA4_CERT_DEFUN']},
                      "GEOREFERENCIACION": {"useful_columns": ['U_DPTO', 'U_MPIO', 'UA_CLASE', 'UA1_LOCALIDAD', 'U_SECT_RUR',
                                                               'U_SECC_RUR', 'UA2_CPOB', 'U_SECT_URB', 'U_SECC_URB', 'U_MZA',
                                                               'U_EDIFICA', 'COD_ENCUESTAS', 'U_VIVIENDA']},
                      "DIVIPOLA": {"useful_columns": ['cod_depto', 'cod_mpio', 'dpto', 'nom_mpio', 'tipo_municipio']}
                      },
            }

bucket='censo-covid'
s3_resource = boto3.resource('s3')
bucket_s3 = s3_resource.Bucket(bucket)
show = True

**Paths**

In [6]:
censo_covid_bucket_s3 = f"s3a://{bucket}"

raw_data_path = os.path.join(censo_covid_bucket_s3, "raw-data")
censo_data_path = os.path.join(raw_data_path, "censo")
covid_tests_path = os.path.join(raw_data_path, "covid-tests.csv")
covid_path = os.path.join(raw_data_path, "covid.csv")
divipola_path = os.path.join(raw_data_path, "divipola.csv")

dict_paths_departments = get_censo_paths(bucket_s3, directory_key=os.path.join("raw-data", "censo"))

final_data_path = os.path.join(censo_covid_bucket_s3, "final-data")
complete_personas_path = os.path.join(final_data_path, "complete_personas")
complete_fallecidos_path = os.path.join(final_data_path, "complete_fallecidos")

aggregates_personas_path = os.path.join(final_data_path, "aggregates_personas")
aggregates_fallecidos_path = os.path.join(final_data_path, "aggregates_fallecidos")
aggregates_covid_path = os.path.join(final_data_path, "aggregates_covid")

join_personas_covid_path = os.path.join(final_data_path, "join_personas_covid")
join_fallecidos_covid_path = os.path.join(final_data_path, "join_fallecidos_covid")
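
The paths above are built with `os.path.join`, which uses the path separator of the host operating system. On a Linux driver that is `/`, which is what S3 URIs need. A sketch that does not depend on the host could use `posixpath.join` instead; `s3_join` is a hypothetical helper, and only the bucket name is taken from the notebook.

```python
import posixpath


def s3_join(*parts):
    """Hypothetical helper: join S3 URI parts with '/' on any OS."""
    return posixpath.join(*parts)


bucket = "censo-covid"
root = f"s3a://{bucket}"
censo_data_path = s3_join(root, "raw-data", "censo")
aggregates_covid_path = s3_join(root, "final-data", "aggregates_covid")
print(censo_data_path)
print(aggregates_covid_path)
```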

# Pipeline

### Load data

In [7]:
complete_personas_data = spark.read.option("header", "true").csv(complete_personas_path, 
                                  schema=build_schema_complete(source="personas"))
if show:
    complete_personas_data.limit(4).toPandas().T
    print("Length: ", complete_personas_data.count())

                           0             1             2             3
U_DPTO                    11            11            11            11
U_MPIO                     1             1             1             1
UA_CLASE                   1             1             1             1
U_EDIFICA                  1             1             1             1
COD_ENCUESTAS          63612         63612         63612         64570
...                      ...           ...           ...           ...
U_SECC_URB                 3             3             3             3
U_MZA                     13            13            13            21
nom_mpio        BOGOTÁ. D.C.  BOGOTÁ. D.C.  BOGOTÁ. D.C.  BOGOTÁ. D.C.
tipo_municipio     Municipio     Municipio     Municipio     Municipio
dpto            BOGOTÁ. D.C.  BOGOTÁ. D.C.  BOGOTÁ. D.C.  BOGOTÁ. D.C.

[70 rows x 4 columns]
Length:  20386281

In [8]:
complete_fallecidos = spark.read.option("header", "true").csv(complete_fallecidos_path, 
                                  schema=build_schema_complete(source="fallecidos"))
if show:
    complete_fallecidos.limit(4).toPandas().T
    print("Length: ", complete_fallecidos.count())

                            0          1          2          3
U_DPTO                      5          5          5          5
U_MPIO                      1          1          1          1
UA_CLASE                    1          1          1          1
U_EDIFICA                   1          1          1          1
COD_ENCUESTAS          156689     752560     804981    1002889
U_VIVIENDA                 12          4         68        109
F_NROHOG                    1          1          1          1
FA1_NRO_FALL                1          1          1          1
FA2_SEXO_FALL               2          2          2          2
FA3_EDAD_FALL              88         30          0         53
FA4_CERT_DEFUN              1          1          2          1
UVA_USO_UNIDAD              1          1          1          1
V_TIPO_VIV                  2          1          2          2
V_CON_OCUP                  1          1          1          1
V_TOT_HOG                   1          1          1    

In [9]:
covid_data = spark.read.option("header", "true").csv(covid_path, 
                                  schema=build_schema_covid(source="covid"))
if show:
    covid_data.limit(4).toPandas().T
    print("Length: ", covid_data.count())

                                 0  ...            3
fecha_de_notificaci_n   2020-03-02  ...   2020-03-09
c_digo_divipola              11001  ...        05001
ciudad_de_ubicaci_n    Bogotá D.C.  ...     Medellín
departamento           Bogotá D.C.  ...    Antioquia
atenci_n                Recuperado  ...   Recuperado
edad                            19  ...           55
sexo                             F  ...            M
tipo                     Importado  ...  Relacionado
estado                        Leve  ...         Leve
pa_s_de_procedencia         ITALIA  ...          nan
fis                     2020-02-27  ...   2020-03-06
fecha_diagnostico       2020-03-06  ...   2020-03-11
fecha_recuperado        2020-03-13  ...   2020-03-26
fecha_reporte_web       2020-03-06  ...   2020-03-11
tipo_recuperaci_n              PCR  ...          PCR
codigo_departamento             11  ...            5
codigo_pais                    380  ...          nan
pertenencia_etnica            Otro  ...       

In [10]:
divipola_data = spark.read.option("header", "true").csv(divipola_path, 
                                  schema=build_schema_divipola(source="divipola"))
if show:
    divipola_data.limit(4).toPandas().T
    print("Length: ", divipola_data.count())

                        0          1          2           3
cod_depto               5          5          5           5
cod_mpio                1          2          4          21
dpto            ANTIOQUIA  ANTIOQUIA  ANTIOQUIA   ANTIOQUIA
nom_mpio         MEDELLÍN  ABEJORRAL   ABRIAQUÍ  ALEJANDRÍA
tipo_municipio  Municipio  Municipio  Municipio   Municipio
Length:  1121

### Processing

**Personas**

In [11]:
number_of_people_by_education_level = complete_personas_data.groupby("dpto", "nom_mpio").pivot("P_NIVEL_ANOSR")\
                                            .agg(F.count("P_NIVEL_ANOSR"))
number_of_people_by_education_level = number_of_people_by_education_level.drop("null")
number_of_people_by_education_level = add_prefix_to_cols(number_of_people_by_education_level, 
                                                         prefix="P_NIVEL_ANOSR_",
                                                         exclude_cols=["dpto", "nom_mpio"])
number_of_people_by_education_level = fillna_0(number_of_people_by_education_level, 
                                               exclude_cols=["dpto", "nom_mpio"])
if show:
    number_of_people_by_education_level.limit(5).toPandas().T
    print("Length: ", number_of_people_by_education_level.count())

                                 0                       1  ...         3       4
dpto                     ANTIOQUIA               ANTIOQUIA  ...    BOYACÁ  BOYACÁ
nom_mpio          VIGÍA DEL FUERTE  SAN JOSÉ DE LA MONTAÑA  ...  ARCABUCO   CHITA
P_NIVEL_ANOSR_1                263                      88  ...       145     251
P_NIVEL_ANOSR_2               2709                    1143  ...      2288    3523
P_NIVEL_ANOSR_3               1340                     561  ...       913    1310
P_NIVEL_ANOSR_4                993                     542  ...       863     836
P_NIVEL_ANOSR_5                235                      50  ...       132     123
P_NIVEL_ANOSR_6                 31                      18  ...         5      91
P_NIVEL_ANOSR_7                285                     128  ...       200      87
P_NIVEL_ANOSR_8                214                      66  ...       261     108
P_NIVEL_ANOSR_9                 59                      31  ...        66      85
P_NIVEL_ANOSR_10
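
The `groupby(...).pivot(...).agg(F.count(...))` pattern used here produces one row per municipality and one column per distinct value of the pivoted field. A plain-Python sketch of the same counting, on made-up rows, may help when reading the output. `pivot_count` is a hypothetical helper; it skips missing values up front, whereas the notebook drops the resulting `null` column afterwards.

```python
from collections import defaultdict


def pivot_count(rows, key_fields, pivot_field):
    """Count rows per key and per pivot value, skipping missing pivot values."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        value = row[pivot_field]
        if value is None:
            continue
        key = tuple(row[f] for f in key_fields)
        table[key][value] += 1
    return {key: dict(counts) for key, counts in table.items()}


# Made-up rows, for illustration only.
rows = [
    {"dpto": "A", "nom_mpio": "X", "P_NIVEL_ANOSR": 1},
    {"dpto": "A", "nom_mpio": "X", "P_NIVEL_ANOSR": 2},
    {"dpto": "A", "nom_mpio": "X", "P_NIVEL_ANOSR": 2},
    {"dpto": "A", "nom_mpio": "Y", "P_NIVEL_ANOSR": None},
    {"dpto": "A", "nom_mpio": "Y", "P_NIVEL_ANOSR": 1},
]
counts = pivot_count(rows, ["dpto", "nom_mpio"], "P_NIVEL_ANOSR")
print(counts)
```

A value that never occurs in a municipality is simply absent from its dictionary; in the Spark version the corresponding cell is null, which is why `fillna_0` is applied after the pivot.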

In [12]:
number_of_people_by_age = complete_personas_data.groupby("dpto", "nom_mpio").pivot("P_EDADR")\
                                            .agg(F.count("P_EDADR"))
number_of_people_by_age = number_of_people_by_age.drop("null")
number_of_people_by_age = add_prefix_to_cols(number_of_people_by_age, 
                                             prefix="P_EDADR_",
                                             exclude_cols=["dpto", "nom_mpio"])
number_of_people_by_age = fillna_0(number_of_people_by_age, 
                                   exclude_cols=["dpto", "nom_mpio"])
if show:
    number_of_people_by_age.limit(5).toPandas().T
    print("Length: ", number_of_people_by_age.count())

                    0          1  ...          3             4
dpto        ANTIOQUIA  ANTIOQUIA  ...  ATLÁNTICO     ATLÁNTICO
nom_mpio      BRICEÑO      PEQUE  ...    BARANOA  SABANAGRANDE
P_EDADR_1         444        577  ...       4702          2670
P_EDADR_2         593        630  ...       4981          2613
P_EDADR_3         728        735  ...       5295          2779
P_EDADR_4         599        770  ...       5508          2967
P_EDADR_5         511        606  ...       5067          2902
P_EDADR_6         445        515  ...       4926          2683
P_EDADR_7         438        467  ...       4555          2434
P_EDADR_8         416        471  ...       4393          2328
P_EDADR_9         327        410  ...       3624          1980
P_EDADR_10        301        405  ...       3582          1852
P_EDADR_11        303        352  ...       3624          1765
P_EDADR_12        262        272  ...       3280          1480
P_EDADR_13        183        233  ...       2363       

In [13]:
number_of_people_per_estrato = complete_personas_data.groupby("dpto", "nom_mpio").pivot("VA1_ESTRATO")\
                                            .agg(F.count("VA1_ESTRATO"))
number_of_people_per_estrato = number_of_people_per_estrato.drop("null")
number_of_people_per_estrato = add_prefix_to_cols(number_of_people_per_estrato, 
                                                  prefix="VA1_ESTRATO_",
                                                  exclude_cols=["dpto", "nom_mpio"])
number_of_people_per_estrato = fillna_0(number_of_people_per_estrato, 
                                        exclude_cols=["dpto", "nom_mpio"])
if show:
    number_of_people_per_estrato.limit(5).toPandas().T
    print("Length: ", number_of_people_per_estrato.count())

                                    0       1  ...          3                 4
dpto                        ANTIOQUIA  BOYACÁ  ...  ANTIOQUIA         ANTIOQUIA
nom_mpio       SAN JOSÉ DE LA MONTAÑA    PAYA  ...      ANDES  VIGÍA DEL FUERTE
VA1_ESTRATO_0                      21      13  ...        291               540
VA1_ESTRATO_1                     708     221  ...      10015              6594
VA1_ESTRATO_2                    2071     438  ...      23857                22
VA1_ESTRATO_3                     122       2  ...       2252                14
VA1_ESTRATO_4                      15       7  ...        105                 1
VA1_ESTRATO_5                       5       0  ...         11                 0
VA1_ESTRATO_6                       0       0  ...         11                 0
VA1_ESTRATO_9                       5       8  ...         38                 4

[10 rows x 5 columns]
Length:  374
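
The stated goal for `VA1_ESTRATO` is a percentage of people per stratum, while the table above holds raw counts. A minimal sketch of the conversion, using the counts printed above for San José de la Montaña. Dividing by the sum of the stratum columns rather than by `Number_of_people` is an assumption.

```python
# Counts copied from the output above (column 0, San José de la Montaña).
estrato_counts = {0: 21, 1: 708, 2: 2071, 3: 122, 4: 15, 5: 5, 6: 0, 9: 5}

total = sum(estrato_counts.values())
estrato_pct = {k: 100 * v / total for k, v in estrato_counts.items()}

for estrato, pct in estrato_pct.items():
    print(f"VA1_ESTRATO_{estrato}: {pct:.1f}%")
```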

In [14]:
aggregates_by_city_personas = complete_personas_data.groupby("dpto", "nom_mpio")\
            .agg(F.count(F.col("U_MPIO")).alias("Number_of_people"), 
                 F.avg(F.col("HA_TOT_PER_H")).alias("Avg_Number_people_per_home"), 
                 F.sum(F.when(F.col("P_SEXO")==1, 1)).alias("Number_of_males"), 
                 F.sum(F.when(F.col("P_SEXO")==2, 1)).alias("Number_of_females"), 
                 F.sum(F.when(F.col("VA_EE")==1, 1)).alias("Number_of_people_with_electricity"),
                 F.sum(F.when(F.col("VA_EE")==2, 1)).alias("Number_of_people_without_electricity"),
                 F.sum(F.when(F.col("VB_ACU")==1, 1)).alias("Number_of_people_with_water_access"),
                 F.sum(F.when(F.col("VB_ACU")==2, 1)).alias("Number_of_people_without_water_access"), 
                 F.sum(F.when(F.col("VF_INTERNET")==1, 1)).alias("Number_of_people_with_internet_access"), 
                 F.sum(F.when(F.col("VF_INTERNET")==2, 1)).alias("Number_of_people_without_internet_access"),
                 F.sum(F.when(F.col("P_ALFABETA")==1, 1)).alias("Number_of_literate_people"), 
                 F.sum(F.when(F.col("P_ALFABETA")==2, 1)).alias("Number_of_non_literate_people"),
                 F.sum(F.when(F.col("PA1_CALIDAD_SERV")==1, 1)).alias("Really_Good_health_service"),
                 F.sum(F.when(F.col("PA1_CALIDAD_SERV")==2, 1)).alias("Good_health_service"),
                 F.sum(F.when(F.col("PA1_CALIDAD_SERV")==3, 1)).alias("Bad_health_service"),
                 F.sum(F.when(F.col("PA1_CALIDAD_SERV")==4, 1)).alias("Really_Bad_health_service"), 
                 F.sum(F.col("PA1_THFC")).alias("Number_of_sons_out_of_country"))\
            .orderBy(F.col("Number_of_people").desc())
if show:
    aggregates_by_city_personas.limit(5).toPandas().T
    print("Length: ", aggregates_by_city_personas.count())

                                                     0  ...          4
dpto                                      BOGOTÁ. D.C.  ...  ATLÁNTICO
nom_mpio                                  BOGOTÁ. D.C.  ...    SOLEDAD
Number_of_people                               7181469  ...     535984
Avg_Number_people_per_home                     3.62483  ...    4.56214
Number_of_males                                3433586  ...     260921
Number_of_females                              3747883  ...     275063
Number_of_people_with_electricity              7132834  ...     533101
Number_of_people_without_electricity             16706  ...       2538
Number_of_people_with_water_access             7109875  ...     527848
Number_of_people_without_water_access            39665  ...       7791
Number_of_people_with_internet_access          5498193  ...     270974
Number_of_people_without_internet_access       1593575  ...     262953
Number_of_literate_people                      6484469  ...     471469
Number

In [15]:
aggregates_by_city_personas = aggregates_by_city_personas.join(number_of_people_per_estrato,
                                                               on=["dpto", "nom_mpio"],
                                                               how="left")
aggregates_by_city_personas = aggregates_by_city_personas.join(number_of_people_by_education_level,
                                                               on=["dpto", "nom_mpio"],
                                                               how="left")
aggregates_by_city_personas = aggregates_by_city_personas.join(number_of_people_by_age,
                                                               on=["dpto", "nom_mpio"],
                                                               how="left")
if show:
    aggregates_by_city_personas.limit(5).toPandas().T
    print("Length: ", aggregates_by_city_personas.count())

                                                  0  ...                       4
dpto                                      ANTIOQUIA  ...               ANTIOQUIA
nom_mpio                                      ANDES  ...  SAN JOSÉ DE LA MONTAÑA
Number_of_people                              38144  ...                    2952
Avg_Number_people_per_home                  3.88865  ...                 4.05759
Number_of_males                               19777  ...                    1424
Number_of_females                             18367  ...                    1528
Number_of_people_with_electricity             36580  ...                    2947
Number_of_people_without_electricity            447  ...                       5
Number_of_people_with_water_access            24586  ...                    2265
Number_of_people_without_water_access         12441  ...                     687
Number_of_people_with_internet_access          7550  ...                     729
Number_of_people_without_int

In [16]:
aggregates_by_city_personas.repartition(1).write.partitionBy('dpto')\
    .mode('overwrite').option("header","true").csv(aggregates_personas_path)

**Fallecidos**

In [17]:
aggregates_fallecidos = complete_fallecidos.groupby("dpto", "nom_mpio")\
            .agg(F.count(F.col("U_MPIO")).alias("Number_of_people_fallecidos"), 
                 F.sum(F.when(F.col("FA2_SEXO_FALL")==1, 1)).alias("Number_of_dead_males"), 
                 F.sum(F.when(F.col("FA2_SEXO_FALL")==2, 1)).alias("Number_of_dead_females"), 
                 F.avg(F.col("FA3_EDAD_FALL")).alias("Avg_Age_w_0s"),
                 F.avg(F.when(F.col("FA3_EDAD_FALL")>30, F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age"), 
                 F.avg(F.when((F.col("FA2_SEXO_FALL")==1)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Male"), 
                 F.avg(F.when((F.col("FA2_SEXO_FALL")==2)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Female"), 
                 F.avg(F.when((F.col("VA1_ESTRATO")==1)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Estrato1"), 
                 F.avg(F.when((F.col("VA1_ESTRATO")==2)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Estrato2"), 
                 F.avg(F.when((F.col("VA1_ESTRATO")==3)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Estrato3"), 
                 F.avg(F.when((F.col("VA1_ESTRATO")==4)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Estrato4"), 
                 F.avg(F.when((F.col("VA1_ESTRATO")==5)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Estrato5"), 
                 F.avg(F.when((F.col("VA1_ESTRATO")==6)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Estrato6"), 
                 F.sum(F.when((F.col("VA1_ESTRATO")==1)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Estrato1"), 
                 F.sum(F.when((F.col("VA1_ESTRATO")==2)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Estrato2"), 
                 F.sum(F.when((F.col("VA1_ESTRATO")==3)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Estrato3"), 
                 F.sum(F.when((F.col("VA1_ESTRATO")==4)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Estrato4"), 
                 F.sum(F.when((F.col("VA1_ESTRATO")==5)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Estrato5"), 
                 F.sum(F.when((F.col("VA1_ESTRATO")==6)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Estrato6"), 
                 F.avg(F.when((F.col("UA_CLASE")==1)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Cabecera"), 
                 F.avg(F.when((F.col("UA_CLASE")==2)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Centro_Poblado"), 
                 F.avg(F.when((F.col("UA_CLASE")==3)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Rural_Disperso"), 
                 F.avg(F.when((F.col("UA_CLASE")==4)&(F.col("FA3_EDAD_FALL")>30),
                              F.col("FA3_EDAD_FALL"))).alias("Avg_Death_Age_Resto_Rural"), 
                 F.sum(F.when((F.col("UA_CLASE")==1)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Cabecera"), 
                 F.sum(F.when((F.col("UA_CLASE")==2)&(F.col("FA3_EDAD_FALL")>30), 
                              1)).alias("Number_of_dead_Centro_Poblado"), 
                 F.sum(F.when((F.col("UA_CLASE")==3)&(F.col("FA3_EDAD_FALL")>30),
                              1)).alias("Number_of_dead_Rural_Disperso"), 
                 F.sum(F.when((F.col("UA_CLASE")==4)&(F.col("FA3_EDAD_FALL")>30),
                              1)).alias("Number_of_dead_Resto_Rural"),)\
            .orderBy(F.col("Number_of_people_fallecidos").desc())
if show:
    aggregates_fallecidos.limit(3).toPandas().T
    print("Length: ", aggregates_fallecidos.count())
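
The per-class expressions in this cell follow a single pattern, so they could be generated rather than typed out. The sketch below only builds the (alias, column, code) specifications in plain Python; turning each one into `F.sum(F.when(...))` or `F.avg(F.when(...))` is left out so that it runs without Spark. `UA_CLASE_LABELS` and `build_specs` are hypothetical names; the codes follow the class list given in the introduction.

```python
# UA_CLASE codes and labels, as listed in the introduction.
UA_CLASE_LABELS = {
    1: "Cabecera",
    2: "Centro_Poblado",
    3: "Rural_Disperso",
    4: "Resto_Rural",
}


def build_specs(column, labels, alias_prefix):
    """Return one (alias, column, code) tuple per category code."""
    return [(f"{alias_prefix}{label}", column, code)
            for code, label in sorted(labels.items())]


count_specs = build_specs("UA_CLASE", UA_CLASE_LABELS, "Number_of_dead_")
avg_specs = build_specs("UA_CLASE", UA_CLASE_LABELS, "Avg_Death_Age_")
for alias, column, code in count_specs:
    print(f"{alias}: {column} == {code}")
```

Because each alias is paired with its code in one place, every label is guaranteed to filter on its own class code.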

                                          0          1             2
dpto                           BOGOTÁ. D.C.  ANTIOQUIA     ATLÁNTICO
nom_mpio                       BOGOTÁ. D.C.   MEDELLÍN  BARRANQUILLA
Number_of_people_fallecidos           26929      11382          4172
Number_of_dead_males                  14416       6353          2268
Number_of_dead_females                12482       5021          1899
Avg_Age_w_0s                        64.5836    62.5208       67.8588
Avg_Death_Age                        74.138    73.0997        74.553
Avg_Death_Age_Male                  70.9326     69.653       70.4102
Avg_Death_Age_Female                74.9669    75.3202       76.5482
Avg_Death_Age_Estrato1              70.1874    69.1686       68.9175
Avg_Death_Age_Estrato2              72.8915    71.1722        72.172
Avg_Death_Age_Estrato3              74.9028    74.1777       77.8544
Avg_Death_Age_Estrato4              78.7467     77.984       77.6998
Avg_Death_Age_Estrato5            
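
`F.when` without `otherwise` yields null for non-matching rows, and `F.avg` ignores nulls, so `Avg_Death_Age` averages only deaths above age 30, whereas `Avg_Age_w_0s` averages every recorded age. A plain-Python sketch of the difference, using the four `FA3_EDAD_FALL` values shown in the `complete_fallecidos` preview earlier in the pipeline; `avg_ignoring_none` is a hypothetical helper.

```python
def avg_ignoring_none(values):
    """Mean of the non-None values, or None when there are none."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None


ages = [88, 30, 0, 53]  # FA3_EDAD_FALL values from the preview

avg_all = avg_ignoring_none(ages)
# Mirrors F.when(age > 30, age): non-matching rows become None.
avg_over_30 = avg_ignoring_none([a if a > 30 else None for a in ages])

print(avg_all, avg_over_30)
```

The comparison is strict, so a death at exactly age 30 is excluded from the filtered average.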

In [18]:
aggregates_fallecidos.repartition(1).write.partitionBy('dpto')\
    .mode('overwrite').option("header","true").csv(aggregates_fallecidos_path)

**Covid**

In [19]:
covid_data = covid_data.join(divipola_data,
                             (covid_data.divipola_dpto==divipola_data.cod_depto)&\
                             (covid_data.divipola_mpio==divipola_data.cod_mpio),
                             how="left")

In [20]:
number_of_people_per_atencion = covid_data.groupby("dpto", "nom_mpio").pivot("atenci_n")\
                                            .agg(F.count("atenci_n"))
number_of_people_per_atencion = number_of_people_per_atencion.drop("null")
number_of_people_per_atencion = add_prefix_to_cols(number_of_people_per_atencion, 
                                                  prefix="atenci_n_",
                                                  exclude_cols=["dpto", "nom_mpio"])
number_of_people_per_atencion = fillna_0(number_of_people_per_atencion, 
                                        exclude_cols=["dpto", "nom_mpio"])
if show:
    number_of_people_per_atencion.limit(5).toPandas().T
    print("Length: ", number_of_people_per_atencion.count())

                             0          1  ...                    3       4
dpto                     CHOCÓ  MAGDALENA  ...      VALLE DEL CAUCA   CESAR
nom_mpio               ISTMINA    CIÉNAGA  ...  GUADALAJARA DE BUGA  LA PAZ
atenci_n_Casa               21        259  ...                   93       7
atenci_n_Fallecido           3         83  ...                    2       0
atenci_n_Hospital            3         48  ...                   10       2
atenci_n_Hospital UCI        0          6  ...                    0       0
atenci_n_N/A                 0          1  ...                    1       1
atenci_n_Recuperado         32        304  ...                   45      28

[8 rows x 5 columns]
Length:  750

In [21]:
number_of_people_per_estado = covid_data.groupby("dpto", "nom_mpio").pivot("estado")\
                                            .agg(F.count("estado"))
number_of_people_per_estado = number_of_people_per_estado.drop("null")
number_of_people_per_estado = add_prefix_to_cols(number_of_people_per_estado, 
                                                  prefix="estado_",
                                                  exclude_cols=["dpto", "nom_mpio"])
number_of_people_per_estado = fillna_0(number_of_people_per_estado, 
                                        exclude_cols=["dpto", "nom_mpio"])
if show:
    number_of_people_per_estado.limit(5).toPandas().T
    print("Length: ", number_of_people_per_estado.count())

                                      0  ...                 4
dpto                 NORTE DE SANTANDER  ...         ANTIOQUIA
nom_mpio                          OCAÑA  ...  VIGÍA DEL FUERTE
estado_Asintomático                   3  ...                 1
estado_Fallecido                      6  ...                 2
estado_Grave                          1  ...                 0
estado_Leve                          19  ...                17
estado_Moderado                       3  ...                 1
estado_N/A                            0  ...                 0

[8 rows x 5 columns]
Length:  750

In [22]:
number_of_people_per_edad_q = covid_data.groupby("dpto", "nom_mpio").pivot("edad_q")\
                                            .agg(F.count("edad_q"))
number_of_people_per_edad_q = number_of_people_per_edad_q.drop("null")
number_of_people_per_edad_q = add_prefix_to_cols(number_of_people_per_edad_q, 
                                                  prefix="edad_q_",
                                                  exclude_cols=["dpto", "nom_mpio"])
number_of_people_per_edad_q = fillna_0(number_of_people_per_edad_q, 
                                        exclude_cols=["dpto", "nom_mpio"])
if show:
    number_of_people_per_edad_q.limit(5).toPandas().T
    print("Length: ", number_of_people_per_edad_q.count())

                      0          1        2       3           4
dpto          ATLÁNTICO  ATLÁNTICO    CHOCÓ  ARAUCA  LA GUAJIRA
nom_mpio   SABANAGRANDE    BARANOA  ISTMINA  ARAUCA   HATONUEVO
edad_q_1             34         56        2      12           1
edad_q_2             73        127        7      11           1
edad_q_3             40         72        3       2           0
edad_q_4             30         61        5       1           0
edad_q_5             51        104        9       9           0
edad_q_6              0         10        0       0           0
edad_q_7             49         82        4      22           0
edad_q_8             56        111        8       9           1
edad_q_9             53         87        8       4           0
edad_q_10            18         33        3       0           0
edad_q_11            24         36        1       0           0
edad_q_12            60         76        4       3           0
edad_q_13            28         57      

In [23]:
aggregates_covid = covid_data.groupby("dpto", "nom_mpio")\
            .agg(F.count(F.col("nom_mpio")).alias("Number_of_people_covid"),
                 F.sum(F.when(F.col("sexo")=="M", 1)).alias("Number_of_males_covid"), 
                 F.sum(F.when(F.col("sexo")=="F", 1)).alias("Number_of_females_covid"), 
                 F.sum(F.when(F.col("muerto")=="True", 1)).alias("Number_of_deaths"),
                 F.sum(F.when(F.col("muerto")=="False", 1)).alias("Number_of_non_deaths"),
                 F.sum(F.col("edad_muerto")).alias("Sum_of_dead_ages"))\
            .orderBy(F.col("Number_of_people_covid").desc())
if show:
    aggregates_covid.limit(5).toPandas().T
    print("Length: ", aggregates_covid.count())

                                    0  ...          4
dpto                     BOGOTÁ. D.C.  ...  ATLÁNTICO
nom_mpio                 BOGOTÁ. D.C.  ...    SOLEDAD
Number_of_people_covid          55056  ...       8845
Number_of_males_covid           27708  ...       4790
Number_of_females_covid         27348  ...       4055
Number_of_deaths                 1339  ...        488
Number_of_non_deaths            53717  ...       8357
Sum_of_dead_ages                91969  ...      31981

[8 rows x 5 columns]
Length:  750
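
`Sum_of_dead_ages` and `Number_of_deaths` together give the mean age at death per municipality. A minimal sketch with the two municipalities visible in the output above, assuming `edad_muerto` holds an age only for deceased cases:

```python
# (Sum_of_dead_ages, Number_of_deaths) copied from the output above.
covid_rows = {
    "BOGOTÁ. D.C.": (91969, 1339),
    "SOLEDAD": (31981, 488),
}

# Guard against municipalities with no recorded deaths.
mean_death_age = {
    city: (age_sum / deaths if deaths else None)
    for city, (age_sum, deaths) in covid_rows.items()
}

for city, age in mean_death_age.items():
    print(city, None if age is None else round(age, 2))
```

Keeping the sum and the count as separate columns, rather than a precomputed average, also allows the means to be recombined correctly if municipalities are later grouped by department.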

In [24]:
aggregates_covid = aggregates_covid.join(number_of_people_per_edad_q,
                                         on=["dpto", "nom_mpio"],
                                         how="left")
aggregates_covid = aggregates_covid.join(number_of_people_per_estado,
                                         on=["dpto", "nom_mpio"],
                                         how="left")
aggregates_covid = aggregates_covid.join(number_of_people_per_atencion,
                                         on=["dpto", "nom_mpio"],
                                         how="left")
aggregates_covid = aggregates_covid.orderBy(F.col("Number_of_people_covid").desc())
if show:
    aggregates_covid.limit(5).toPandas().T
    print("Length: ", aggregates_covid.count())

                                    0  ...          4
dpto                     BOGOTÁ. D.C.  ...  ATLÁNTICO
nom_mpio                 BOGOTÁ. D.C.  ...    SOLEDAD
Number_of_people_covid          55056  ...       8845
Number_of_males_covid           27708  ...       4790
Number_of_females_covid         27348  ...       4055
Number_of_deaths                 1339  ...        488
Number_of_non_deaths            53717  ...       8357
Sum_of_dead_ages                91969  ...      31981
edad_q_1                         2395  ...        264
edad_q_2                         6449  ...       1309
edad_q_3                         4075  ...        624
edad_q_4                         3479  ...        559
edad_q_5                         6818  ...       1124
edad_q_6                          441  ...         69
edad_q_7                         5222  ...        727
edad_q_8                         5666  ...       1040
edad_q_9                         4871  ...        801
edad_q_10                   

In [25]:
aggregates_covid.repartition(1).write\
    .mode('overwrite').option("header","true").csv(aggregates_covid_path)

### Joins

**Personas&Covid**

In [26]:
joined_personas = aggregates_by_city_personas.join(aggregates_covid, 
                                                   on=["dpto", "nom_mpio"],
                                                   how="inner")
if show:
    joined_personas.limit(3).toPandas().T
    print("Length: ", joined_personas.count())

                                    0          1          2
dpto                        ANTIOQUIA  ANTIOQUIA  ANTIOQUIA
nom_mpio                        ANDES    BRICEÑO  CONCORDIA
Number_of_people                38144       5946      16095
Avg_Number_people_per_home    3.88865    3.93206    3.91303
Number_of_males                 19777       3071       8022
...                               ...        ...        ...
atenci_n_Fallecido                  0          0          0
atenci_n_Hospital                   1          0          0
atenci_n_Hospital UCI               0          0          0
atenci_n_N/A                        0          0          0
atenci_n_Recuperado                 1          1          0

[98 rows x 3 columns]
Length:  236
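
The COVID aggregates hold 750 municipalities, but this inner join returns 236 rows, so some (`dpto`, `nom_mpio`) pairs had no exact match on the other side. The following is a minimal diagnostic sketch in plain Python, assuming the key pairs have been collected to the driver as tuples. The helpers `normalize_key` and `unmatched_keys` are hypothetical, the census spelling in the example is invented for illustration, and the normalization rule (strip accents and punctuation) is only an assumption about why names might differ between the two datasets.

```python
import unicodedata


def normalize_key(name):
    """Upper-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.upper())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.split())


def unmatched_keys(left_keys, right_keys):
    """Return the keys of left_keys with no normalized match in right_keys."""
    right = {tuple(normalize_key(part) for part in key) for key in right_keys}
    return [
        key for key in left_keys
        if tuple(normalize_key(part) for part in key) not in right
    ]


# Illustrative keys: the COVID spellings come from the output above,
# the census spelling is a made-up variant.
covid_keys = [("BOGOTÁ. D.C.", "BOGOTÁ. D.C."), ("ATLÁNTICO", "SOLEDAD")]
census_keys = [("Bogotá, D.C.", "Bogotá, D.C.")]

missing = unmatched_keys(covid_keys, census_keys)
print(missing)
```

In Spark the same check can be expressed as a join with `how="left_anti"`, which returns the left-side rows that have no match. Stripping accents also folds Ñ into N, so distinct names could collide under this rule.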

In [27]:
joined_personas.repartition(1).write\
    .mode('overwrite').option("header","true").csv(join_personas_covid_path)

**Fallecidos&Covid**

In [28]:
joined_fallecidos = aggregates_fallecidos.join(aggregates_covid,
                                               on=["dpto", "nom_mpio"],
                                               how="inner")
if show:
    joined_fallecidos.limit(3).toPandas().T
    print("Length: ", joined_fallecidos.count())

                                        0             1                    2
dpto                         BOGOTÁ. D.C.     ATLÁNTICO              BOLÍVAR
nom_mpio                     BOGOTÁ. D.C.  BARRANQUILLA  CARTAGENA DE INDIAS
Number_of_people_fallecidos         26929          4172                 2690
Number_of_dead_males                14416          2268                 1529
Number_of_dead_females              12482          1899                 1158
...                                   ...           ...                  ...
atenci_n_Fallecido                   1287          1191                  394
atenci_n_Hospital                    3018           898                  231
atenci_n_Hospital UCI                 158           131                   51
atenci_n_N/A                           89            56                   17
atenci_n_Recuperado                 19668          9836                 6901

[68 rows x 3 columns]
Length:  236

In [29]:
joined_fallecidos.repartition(1).write\
    .mode('overwrite').option("header","true").csv(join_fallecidos_covid_path)
