## Pareto Municipality

The dengue data is provided at the municipality level, the smallest administrative unit in Brazil.
Brazil is a large country, and the dataset covers 5570 municipalities. Forecasting each municipality
separately would be computationally expensive and of limited use, since many of them report very few cases.
The goal of this notebook is therefore to select a subset of municipalities that accounts for most of the cases.

#### Notes on Methodology

The Mosqlimate sprint objective is to predict at the state level. Therefore, the selection of municipalities must
be made within each state (UF).

Cases are much more concentrated in some states than in others. Hence, the number of municipalities selected
must differ by UF.

- DF has a single municipality, so nothing has to be done.
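
To make the idea of "concentration" concrete, one option is a Gini coefficient over the case totals of a state's municipalities. The sketch below is a plain-Python stand-in written for illustration: `gini_coefficient`, the state names, and the case counts are all hypothetical, and it does not use the `inequality` package or the notebook's data.

```python
def gini_coefficient(counts):
    """Gini coefficient of non-negative counts: 0 means evenly spread,
    values close to 1 mean cases are concentrated in few municipalities."""
    values = sorted(counts)  # ascending order
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum, with ranks starting at 1 for the smallest value
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical case totals per municipality for two made-up states
cases_by_state = {
    "even_state": [500, 500, 500, 500],
    "concentrated_state": [0, 0, 0, 2000],
}
gini_by_state = {uf: gini_coefficient(c) for uf, c in cases_by_state.items()}
print(gini_by_state)
```

A state with a higher coefficient would need fewer municipalities to cover the same share of cases, which is why a single fixed count per UF would not work well.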

In [17]:
import numpy as np
import pandas as pd
import polars as pl
pl.Config.set_tbl_rows(30)
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
from lets_plot import *
LetsPlot.setup_html()
from matplotlib import pyplot as plt


from inequality import gini

In [3]:
data = pl.read_parquet('../data/dataset_complete_dengue_municipality.parquet')

In [75]:
# Total cases per municipality, plus the total for its state (UF)
gdf = data.group_by(['geocode', 'uf']).agg(pl.col('casos').sum().alias('casos'))
gdf = gdf.join(gdf.group_by(['uf']).agg(pl.col('casos').sum().alias('casos_uf')), on=['uf'])
gdf = gdf.with_columns(
    np.log1p(pl.col('casos')).alias('log_casos'),
    np.log1p(pl.col('casos_uf')).alias('log_casos_uf'),
    # Share of the state's cases (a fraction between 0 and 1)
    (pl.col('casos') / pl.col('casos_uf')).alias('casos_percent_uf'),
)
gdf = gdf.with_columns((pl.col('log_casos') / pl.col('log_casos_uf')).alias('log_casos_percent_uf'))

# Rank municipalities within each state; rank 1 has the largest share
gdf = gdf.with_columns(
    pl.col("casos_percent_uf")
    .rank("ordinal", descending=True)
    .over(["uf"])
    .alias("rank_within_uf")
)

# Cumulative share within each state, from the largest municipality down
gdf = gdf.sort(['uf', 'casos_percent_uf'], descending=True).with_columns(
    pl.col("casos_percent_uf")
    .cum_sum()
    .over(["uf"])
    .alias("cumulative_share")
)

gdf


geocode,uf,casos,casos_uf,log_casos,log_casos_uf,casos_percent_uf,log_casos_percent_uf,rank_within_uf,cumulative_share
i64,str,i64,i64,f64,f64,f64,f64,u32,f64
1721000,"""TO""",51706,126949,10.853348,11.751549,0.407297,0.923568,1,0.407297
1702109,"""TO""",14370,126949,9.572968,11.751549,0.113195,0.814613,2,0.520492
1718204,"""TO""",5152,126949,8.547334,11.751549,0.040583,0.727337,3,0.561076
1716109,"""TO""",4995,126949,8.516393,11.751549,0.039347,0.724704,4,0.600422
1709500,"""TO""",4238,126949,8.352083,11.751549,0.033383,0.710722,5,0.633806
1707009,"""TO""",2675,126949,7.892078,11.751549,0.021071,0.671578,6,0.654877
1705508,"""TO""",2403,126949,7.784889,11.751549,0.018929,0.662456,7,0.673806
1702000,"""TO""",2207,126949,7.699842,11.751549,0.017385,0.655219,8,0.691191
1721208,"""TO""",2200,126949,7.696667,11.751549,0.01733,0.654949,9,0.708521
1708205,"""TO""",2122,126949,7.660585,11.751549,0.016715,0.651879,10,0.725236


In [125]:
def filter_by_casos_percent_uf(df: pl.DataFrame, threshold: float = 0.8) -> pl.DataFrame:
    """
    Filters the input DataFrame to retain, within each state (UF), the top-ranked municipalities
    whose cumulative share of dengue cases reaches a specified threshold. The municipality that
    crosses the threshold is retained as well.

    Parameters:
    ----------
    df : pl.DataFrame
        Input Polars DataFrame containing dengue case data at the municipality level.
        Must include the following columns:
        - 'uf': State code.
        - 'casos_percent_uf': Share of the state's cases contributed by each municipality (fraction, 0 to 1).
        - 'rank_within_uf': Rank of the municipality within its state (1 = largest share).

    threshold : float, optional (default=0.8)
        The cumulative share threshold (fraction, 0 to 1). Municipalities are retained in rank order
        until the cumulative share of cases reaches this value.

    Returns:
    -------
    pl.DataFrame
        A filtered Polars DataFrame containing only the municipalities that meet the cumulative share
        criteria within each state, with an added 'cum_sum' column.
    """

    df = (
        df.sort(["uf", "rank_within_uf"])
        .with_columns([
            # Cumulative share per state, in rank order
            pl.col("casos_percent_uf").cum_sum().over("uf").alias("cum_sum")
        ])
        .with_columns([
            # Flag: cum_sum < threshold (still below the threshold)
            (pl.col("cum_sum") < threshold).alias("keep_flag"),
        ])
        .with_columns([
            # Shift the flag down one row per state so the municipality that crosses
            # the threshold is also kept; the top-ranked row is always kept
            pl.col("keep_flag").shift(1).fill_null(True).over("uf").alias("keep_flag")
        ])
        .filter(pl.col("keep_flag"))
        .drop(["keep_flag"])
    )
    return df

pareto = filter_by_casos_percent_uf(gdf,0.70)



In [128]:
pareto.group_by('uf').agg(pl.col('casos_percent_uf').sum(),pl.len()).sort(by='len')

geocode,uf,casos,casos_uf,log_casos,log_casos_uf,casos_percent_uf,log_casos_percent_uf,rank_within_uf,cumulative_share,cum_sum
i64,str,i64,i64,f64,f64,f64,f64,u32,f64,f64
1200401,"""AC""",77649,166131,11.259967,12.020538,0.467396,0.936727,1,0.467396,0.467396
1200203,"""AC""",48919,166131,10.797942,12.020538,0.29446,0.898291,2,0.761857,0.761857
2704302,"""AL""",96183,249907,11.474018,12.428848,0.384875,0.923176,1,0.384875,0.384875
2700300,"""AL""",53365,249907,10.884929,12.428848,0.213539,0.875779,2,0.598415,0.598415
2706307,"""AL""",14312,249907,9.568923,12.428848,0.057269,0.769896,3,0.655684,0.655684
2708006,"""AL""",6694,249907,8.809116,12.428848,0.026786,0.708764,4,0.68247,0.68247
2709301,"""AL""",6313,249907,8.750525,12.428848,0.025261,0.70405,5,0.707731,0.707731
1302603,"""AM""",100872,160360,11.521618,11.985183,0.629035,0.961322,1,0.629035,0.629035
1304203,"""AM""",6104,160360,8.716863,11.985183,0.038064,0.727303,2,0.667099,0.667099
1303809,"""AM""",4658,160360,8.446556,11.985183,0.029047,0.70475,3,0.696146,0.696146
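
The retention rule in `filter_by_casos_percent_uf` can be restated in plain Python to check which rows survive. The sketch below is an illustrative restatement, not the notebook's implementation: `keep_until_threshold` is a hypothetical helper, and the shares are the ten TO values from the `casos_percent_uf` column of the first output table.

```python
def keep_until_threshold(shares, threshold=0.8):
    """Keep shares (sorted in descending order) while the cumulative sum of the
    preceding rows is below the threshold; the first row is always kept."""
    kept = []
    cumulative = 0.0
    for position, share in enumerate(shares):
        # Stop once the rows already kept have reached the threshold
        if position > 0 and cumulative >= threshold:
            break
        kept.append(share)
        cumulative += share
    return kept


# Shares of the ten largest TO municipalities, copied from the output table
to_shares = [0.407297, 0.113195, 0.040583, 0.039347, 0.033383,
             0.021071, 0.018929, 0.017385, 0.01733, 0.016715]

kept = keep_until_threshold(to_shares, threshold=0.70)
print(len(kept), round(sum(kept), 6))
```

With a threshold of 0.70, nine of the ten rows are kept: the eighth brings the cumulative share to about 0.691, so the ninth (which crosses 0.70) is still included, and the tenth is dropped. This matches the AC rows in the output, where the second municipality is kept even though it pushes the cumulative share to 0.76.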


In [137]:
top_municipality = gdf.filter(pl.col('rank_within_uf')==1)

In [139]:
top_municipality.write_parquet("../data/top_municipalities.parquet")