# Data Preprocessing for Water Analysis in Mexico City

* Date : Aug 2025
* Author : Axel Daniel Malváez Flores
* Description : This notebook contains the data preprocessing steps for the water analysis project in Mexico City. It includes loading, cleaning, and transforming various datasets related to water consumption, property data, and population data.

### Libraries

In [11]:
import pandas as pd
import numpy as np
import geopandas as gpd

import json
import warnings

warnings.filterwarnings("ignore")

### Constants

In [12]:
# WATER DATASETS

# Water consumption for all of 2019, by neighborhood
watConsPath="../raw-data/water/consumo/consumo_agua_historico_2019.csv"
# Municipalities with drought data and higher/lower probability of drought
droughtPath="../raw-data/water/sequia/Municipios_con_ sequia.xlsx"
# Water Reports in Mexico City by neighborhood.
reports2224Path="../raw-data/water/reportes/reportes_agua_2024_01.csv"
reportsHistory="../raw-data/water/reportes/reportes_agua_hist.csv"
# Water (hydric) feasibility
hidFeasPath="../raw-data/water/factibilidad/factibilidad-hdrica.json"
# Consumption every two months by neighborhood
habPath=("../raw-data/water/consumo/consumo-habitacional-promedio-"
           "bimestral-de-agua-por-colonia-m3.json")    

# PROPERTY DATASETS

# SHF housing price index for the area
indexSHFPath="../raw-data/property-population/SHF/indice_SHF.csv"
# Population Growth Rate
growthRateAlcPath=("../raw-data/property-population/CrecimientoPoblacional/"
                     "poblacion_total_tasa_crecimiento_alcaldia_1.2.csv")
# Households per neighborhood (density)
housesColPath=("../raw-data/property-population/"
                "Hogares por colonia/hogares_colonia.shp")
# Housing concentration
densPath = ("../raw-data/property-population/alta_concentracion/"
            "zonas_vivienda.shp")

### Data Reading and Preprocessing

In [13]:
# ---------------------
# WATER DATA
# ---------------------

# Water Consumption 2019
watCons19 = pd.read_csv(watConsPath)
watCons19.drop_duplicates(inplace=True)

# Drought MEX CITY
drought = pd.read_excel(droughtPath, sheet_name='MONITOR - SEMAFORO - USO EFIC')
cldCols = drought.columns.to_series().where(
    ~drought.columns.str.contains('^Unnamed'), '')
drought.columns = [x + '/' + y if x != '' else y 
                    for x, y in zip(cldCols, drought.iloc[0].astype(str))]
drought = drought.iloc[1:].reset_index().iloc[:,1:]

# Water Reports in Mexico City by neighborhood.
reports2224 = pd.read_csv(reports2224Path)
reportsHist = pd.read_csv(reportsHistory)

# Water (hydric) feasibility
with open(hidFeasPath, "r", encoding="utf-8") as f:
    feasFeat = json.load(f)
feasibility = gpd.GeoDataFrame.from_features(feasFeat["features"])

# Consumption every two months by neighborhood
with open(habPath, "r", encoding="utf-8") as f:
    habConsFeats = json.load(f)
habCons = gpd.GeoDataFrame.from_features(habConsFeats['features'])

# ---------------------
# PROPERTY DATA
# ---------------------

indexSHF = pd.read_csv(indexSHFPath, encoding='iso-8859-1', delimiter=';')
data_tasa_crecimiento_alcaldia = pd.read_csv(growthRateAlcPath,encoding='utf-8')
data_hogares_col = gpd.read_file(housesColPath) 
data_concentracion = gpd.read_file(densPath)
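
The drought sheet is read with a header that spans two rows, which is why the column names are rebuilt from the first data row. A minimal sketch of that flattening on a toy frame follows; the toy column names and values are illustrative assumptions, not taken from the real workbook.

```python
import pandas as pd

# Toy frame mimicking a sheet whose real header spans two rows:
# pandas reads the first row as column names (blank cells become "Unnamed: N")
# and leaves the second header row as the first data row.
raw = pd.DataFrame(
    [["CVE_MUN", "2003-01-31", "2003-02-28"],
     [2, "PRE-ALERTA", "VERDE"]],
    columns=["Unnamed: 0", "MAGNITUD", "MAGNITUD.1"],
)

# Blank out the placeholder names and keep the real group labels.
top = raw.columns.to_series().where(~raw.columns.str.contains("^Unnamed"), "")

# Join group label and second-row label with "/";
# columns without a group keep only the second-row label.
raw.columns = [f"{t}/{b}" if t else b
               for t, b in zip(top, raw.iloc[0].astype(str))]

# Drop the row that was consumed as a header.
flat = raw.iloc[1:].reset_index(drop=True)
print(list(flat.columns))
```

The `.1` suffix that pandas appends to repeated header names survives this step; the notebook strips it later with a regular expression when it builds the `CATEGORY` column.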

### Processing Datasets

#### Water Consumption Data 2019

Bimonthly information on water supply at the block level, based on billing for metered and average consumption.

In [14]:
# Date 
watCons19['fecha_referencia'] = pd.to_datetime(watCons19['fecha_referencia'],
                                               format='ISO8601')

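# Drop rows duplicated on the identifying columns
# (calling drop_duplicates(inplace=True) on a column selection only modifies a temporary copy)
watCons19 = watCons19.drop_duplicates(
    subset=['fecha_referencia', 'anio', 'bimestre', 'indice_des', 'colonia',
            'alcaldia', 'latitud', 'longitud'])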

# Derive the number of properties per category (total / average consumption)
inDom = watCons19['consumo_total_dom'] / watCons19['consumo_prom_dom']
watCons19['inmuebles_domesticos'] = inDom

inNoDom = watCons19['consumo_total_no_dom'] / watCons19['consumo_prom_no_dom']
watCons19['inmuebles_no_domesticos'] = inNoDom

inMix = watCons19['consumo_total_mixto'] / watCons19['consumo_prom_mixto']
watCons19['inmuebles_mixtos'] = inMix

# Replace NaN values (e.g. from 0/0) with 0 in the derived columns
derivedCols = ['inmuebles_domesticos', 'inmuebles_no_domesticos',
               'inmuebles_mixtos']
watCons19[derivedCols] = watCons19[derivedCols].fillna(0)

# Data structure and column ordering (copy to avoid writing into a view)
watCons19 = watCons19[['fecha_referencia', 'consumo_total',
                       'inmuebles_domesticos', 'consumo_total_dom',
                       'consumo_prom_dom', 'inmuebles_no_domesticos',
                       'consumo_total_no_dom', 'consumo_prom_no_dom',
                       'inmuebles_mixtos', 'consumo_total_mixto',
                       'consumo_prom_mixto', 'indice_des', 'colonia',
                       'alcaldia']].copy()

# Total number of properties, using the NaN-filled columns so that a
# missing category counts as 0 instead of turning the whole sum into NaN
watCons19['total_inmuebles'] = (watCons19['inmuebles_domesticos']
                                + watCons19['inmuebles_no_domesticos']
                                + watCons19['inmuebles_mixtos'])
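
The property counts above are recovered by dividing total consumption by average consumption. The sketch below shows that arithmetic on a toy frame and how zero or missing averages behave; the numbers are made up, and the extra handling of infinite values is an assumption that the notebook itself does not apply.

```python
import numpy as np
import pandas as pd

# Toy data: one normal row, one 0/0 row, one row with a missing average.
toy = pd.DataFrame({
    "consumo_total_dom": [120.0, 0.0, 45.0],
    "consumo_prom_dom": [12.0, 0.0, np.nan],
})

# total / mean = number of billed properties
ratio = toy["consumo_total_dom"] / toy["consumo_prom_dom"]

# 0/0 yields NaN and x/0 yields inf; here both are treated as "no properties".
toy["inmuebles_domesticos"] = (
    ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
)
print(toy["inmuebles_domesticos"].tolist())
```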

* OUTPUT

In [15]:
# watCons19.to_csv("data/consumoAgua19.csv")

#### Feasibility Data

| Water Feasibility Grade       | Description / Meaning                                    | Color       |
| ----------------------------- | -------------------------------------------------------- | ----------- |
| High feasibility 🌟           | Areas with high capacity to mitigate scarcity risk       | 🟩 Green    |
| Medium-high feasibility ⚠️    | Areas with good capacity, but with some limitations      | 🟨 Yellow   |
| Medium-low feasibility 🔶     | Areas with limited capacity to mitigate risk             | 🟧 Orange   |
| Low feasibility ❌            | Areas with little or no capacity to mitigate risk        | 🟥 Red      |
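
The four grades in the table form an ordered scale, which can be encoded as an ordered categorical so that sorting and comparisons respect the ranking. This is only a sketch: the column name `grado` and the short labels are hypothetical, since the actual field name and label spelling in the feasibility dataset are not shown here.

```python
import pandas as pd

# Hypothetical labels, ordered from lowest to highest feasibility.
levels = ["Baja", "Media-baja", "Media-alta", "Alta"]
# Colors follow the table: green, yellow, orange, red.
colors = {"Alta": "green", "Media-alta": "yellow",
          "Media-baja": "orange", "Baja": "red"}

# Toy frame; "grado" is an assumed column name.
toy = pd.DataFrame({"grado": ["Alta", "Baja", "Media-alta"]})
toy["grado"] = pd.Categorical(toy["grado"], categories=levels, ordered=True)
toy["color"] = toy["grado"].astype(str).map(colors)

# Ordered categoricals sort by rank and support comparisons.
print(toy.sort_values("grado"))
print((toy["grado"] >= "Media-alta").tolist())
```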

In [None]:
# Deleting last value since it contains an invalid coordinate
feasibility = feasibility.iloc[:-1,:]
feasibility.rename(columns={'NOMBRE':'colonia', 'DELEGACIO' : 'alcaldia'}, inplace=True)

#### Residential Consumption Data (no preprocessing required)

Data dictionary:

**SUM_cons_t**   
Sum of total water consumption per neighborhood (m3) per two-month period

**MEAN_cons_**   
Average total consumption per neighborhood (m3) per two-month period

**VIV2010**   
Housing units, 2010 census

**PROMVIVCON**   
Average total water consumption of the neighborhood per number of housing units

**C_PROMVIVC**   
Field reclassified into quantiles with 5 ranges based on `PROMVIVCON`, where the value 5 corresponds to very high water consumption and the value 1 to low consumption

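The `C_PROMVIVC` field is described as a five-range quantile reclassification of `PROMVIVCON`. One way such a field could be reproduced is with `pd.qcut`, as sketched below on made-up values; the exact method used by the data publisher is not documented here, so this is an assumption.

```python
import pandas as pd

# Made-up per-dwelling consumption averages for ten neighborhoods.
toy = pd.DataFrame({
    "PROMVIVCON": [3.0, 8.0, 12.0, 15.0, 21.0, 26.0, 30.0, 34.0, 41.0, 55.0]
})

# Split into 5 equal-frequency bins labeled 1 (low) to 5 (very high).
toy["C_PROMVIVC"] = pd.qcut(
    toy["PROMVIVCON"], q=5, labels=[1, 2, 3, 4, 5]
).astype(int)
print(toy)
```

With heavily tied values `pd.qcut` can fail on duplicate bin edges; passing `duplicates="drop"` (and dropping the explicit labels) is one way to handle that case.
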


#### Probability of Scarcity

In [None]:
# Load CDMX alcaldías GeoJSON
with open('../raw-data/water/sequia/limite-de-las-alcaldias.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
alcaldias_cdmx = gpd.GeoDataFrame.from_features(data['features'])
alcaldias_cdmx = alcaldias_cdmx.set_crs(epsg=4326)
geojson = alcaldias_cdmx.__geo_interface__

# Step 1: Identify id_vars (the static columns)
id_vars = ['Region de lluvias y fecha de solicitud de corroboración', 'CVE_CONCATENADA', 'CVE_ENT', 'CVE_MUN', 'NOMBRE_MUN', 'ENTIDAD',
           'ORG_CUENCA', 'CLV_OC', 'CON_CUENCA', 'CVE_CONC']

# Step 2: Melt all other columns
df_melted = drought.melt(id_vars=id_vars, var_name='variable', value_name='value')

# Create category and date columns
category = []
date = []
for i in df_melted['variable'].astype(str):
    tup = i.split('/')
    category.append(tup[0])
    date.append(tup[1])
    
df_melted['CATEGORY'] = category
df_melted['CATEGORY'] = df_melted['CATEGORY'].str.replace(r'\.\d+$', '', regex=True)

df_melted['DATE'] = date
df_melted['DATE'] = pd.to_datetime(df_melted['DATE'])
df_melted['MONTH'] = df_melted['DATE'].dt.month_name()
df_melted['YEAR'] = df_melted['DATE'].dt.year
df_melted['DATE'] = df_melted['DATE'].dt.strftime('%Y-%m')

# Filter to Mexico City, keep only the MAGNITUD category, and attach the alcaldía geometry
mxc_drought = df_melted[(df_melted['ENTIDAD'] == 'Ciudad de México') & (df_melted['CATEGORY'] == 'MAGNITUD')]
mxc_drought = pd.merge(mxc_drought, alcaldias_cdmx[['CVE_MUN', 'geometry']], on='CVE_MUN', how='left')

color_map = {
    'SIN SEQUIA': '#f0f0f0',     # Very light gray / near white
    'PRE-ALERTA': '#1a9850',     # Dark green
    'VERDE': '#a6d96a',          # Soft green
    'AMARILLO': '#ffffbf',       # Pale yellow
    'NARANJA': '#fdae61',        # Orange
    'ROJO': '#d73027'            # Red
}

# mxc_drought already contains only MAGNITUD rows, so no extra filter is needed
mxc_drought['color'] = mxc_drought['value'].map(color_map)
mxc_drought = gpd.GeoDataFrame(mxc_drought, geometry='geometry')
# mxc_drought["geometry"] = mxc_drought["geometry"].simplify(tolerance=0.0005, preserve_topology=True)

# Ordinal encoding of the drought magnitude, from SIN SEQUIA (1) to ROJO (6)
number_map = {
    'SIN SEQUIA': 1,
    'PRE-ALERTA': 2,
    'VERDE': 3,
    'AMARILLO': 4,
    'NARANJA': 5,
    'ROJO': 6
}

# mxc_drought already contains only MAGNITUD rows, so no extra filter is needed
mxc_drought['VALUE'] = mxc_drought['value'].map(number_map)
mxc_drought['DATE'] = pd.to_datetime(mxc_drought['DATE'])

mxc_drought.head()

|   | Region de lluvias y fecha de solicitud de corroboración | CVE_CONCATENADA | CVE_ENT | CVE_MUN | NOMBRE_MUN | ENTIDAD | ORG_CUENCA | CLV_OC | CON_CUENCA | CVE_CONC | variable | value | CATEGORY | DATE | MONTH | YEAR | geometry | color | VALUE |
| - | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Región A - Primeros Quince días hábiles de cad... | 9002 | 9 | 2 | Azcapotzalco | Ciudad de México | Aguas del Valle de México | XIII | Valle de Mexico | 26 | MAGNITUD/2003-01-31 00:00:00 | PRE-ALERTA | MAGNITUD | 2003-01 | January | 2003 | POLYGON ((-99.18231 19.50748, -99.18229 19.507... | #1a9850 | 2 |
| 1 | Región A - Primeros Quince días hábiles de cad... | 9003 | 9 | 3 | Coyoacán | Ciudad de México | Aguas del Valle de México | XIII | Valle de Mexico | 26 | MAGNITUD/2003-01-31 00:00:00 | PRE-ALERTA | MAGNITUD | 2003-01 | January | 2003 | POLYGON ((-99.13427 19.35654, -99.13397 19.356... | #1a9850 | 2 |
| 2 | Región A - Primeros Quince días hábiles de cad... | 9004 | 9 | 4 | Cuajimalpa de Morelos | Ciudad de México | Aguas del Valle de México | XIII | Valle de Mexico | 26 | MAGNITUD/2003-01-31 00:00:00 | PRE-ALERTA | MAGNITUD | 2003-01 | January | 2003 | POLYGON ((-99.25738 19.40112, -99.25698 19.400... | #1a9850 | 2 |
| 3 | Región A - Primeros Quince días hábiles de cad... | 9005 | 9 | 5 | Gustavo A. Madero | Ciudad de México | Aguas del Valle de México | XIII | Valle de Mexico | 26 | MAGNITUD/2003-01-31 00:00:00 | PRE-ALERTA | MAGNITUD | 2003-01 | January | 2003 | POLYGON ((-99.11124 19.56150, -99.11485 19.557... | #1a9850 | 2 |
| 4 | Región A - Primeros Quince días hábiles de cad... | 9006 | 9 | 6 | Iztacalco | Ciudad de México | Aguas del Valle de México | XIII | Valle de Mexico | 26 | MAGNITUD/2003-01-31 00:00:00 | PRE-ALERTA | MAGNITUD | 2003-01 | January | 2003 | POLYGON ((-99.05751 19.40673, -99.05753 19.406... | #1a9850 | 2 |
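
The cell above reshapes the wide drought table into one row per municipality and month, then splits the `variable` column into a category and a date. A compact, vectorized variant of that reshaping is sketched here on a toy frame; the toy values are illustrative and the real sheet has many more identifier columns.

```python
import pandas as pd

# Toy wide table: one identifier column plus one column per category/date.
wide = pd.DataFrame({
    "CVE_MUN": [2, 3],
    "MAGNITUD/2003-01-31 00:00:00": ["PRE-ALERTA", "VERDE"],
    "MAGNITUD.1/2003-02-28 00:00:00": ["VERDE", "AMARILLO"],
})

# Wide to long: one row per (municipality, original column).
tidy = wide.melt(id_vars=["CVE_MUN"], var_name="variable", value_name="value")

# Split "CATEGORY/DATE" once, without a Python loop.
parts = tidy["variable"].str.split("/", n=1, expand=True)
# Remove the ".1", ".2", ... suffix pandas adds to repeated header names.
tidy["CATEGORY"] = parts[0].str.replace(r"\.\d+$", "", regex=True)
tidy["DATE"] = pd.to_datetime(parts[1])

# Ordinal encoding, using a subset of the notebook's number_map.
tidy["VALUE"] = tidy["value"].map({"PRE-ALERTA": 2, "VERDE": 3, "AMARILLO": 4})
print(tidy[["CVE_MUN", "CATEGORY", "DATE", "VALUE"]])
```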


In [26]:
mxc_drought['VALUE']

0       2
1       2
2       2
3       2
4       2
       ..
4027    5
4028    5
4029    5
4030    5
4031    5
Name: VALUE, Length: 4032, dtype: int64
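
The `int64` dtype in the output above indicates that every label found a match in `number_map`, because an unmatched label would produce NaN and force a float dtype. A small check along those lines is sketched below; the label `DESCONOCIDO` is a made-up value used only to trigger the failure case.

```python
import pandas as pd

number_map = {"SIN SEQUIA": 1, "PRE-ALERTA": 2, "VERDE": 3,
              "AMARILLO": 4, "NARANJA": 5, "ROJO": 6}

# Toy labels; "DESCONOCIDO" is deliberately absent from the mapping.
labels = pd.Series(["VERDE", "ROJO", "DESCONOCIDO"])
mapped = labels.map(number_map)

# Labels missing from the dict come back as NaN, which turns the dtype into float.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
print(mapped.dtype)
```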