<a href="https://colab.research.google.com/github/acoiman/pdt/blob/main/asthma_mortality/notebooks/colab/03_Asthma_Mortality_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Exploratory Data Analysis (EDA) of Asthma Mortality

The main goal of this Exploratory Data Analysis (EDA) of asthma mortality is to discover patterns, trends, and associations between variables that will guide the methodology used to accept or reject our research hypotheses.
Our asthma mortality EDA includes:

* Descriptive statistics
* Temporal analysis
* Demographic analysis
* Geospatial analysis
* Outlier detection

## 🤖 Load libraries

The libraries required for the analysis are loaded below.

In [None]:
# dataframe libraries
import pandas as pd

# geospatial libraries
import geopandas as gpd
import pysal
import mapclassify

# numpy
import numpy as np

# plot libraries
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import plotly.express as px
from matplotlib.ticker import MaxNLocator

# picture libraries
from IPython.display import Image
import imageio.v2 as imageio

# other libraries
from scipy.stats import poisson
import os
from itables import init_notebook_mode

## 💾 Load dataset

The dataset resulting from the preprocessing will be loaded.

In [None]:
# Change directory to the work folder (the Docker container starts in /home/jovyan/)
%cd work

In [None]:
# Load the cleaned asthma mortality dataset from 2001 to 2022
df = pd.read_csv('pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_02.csv',  dtype={'IDDPTO': str})

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df.head()

## 📈 Descriptive statistics

In this EDA, descriptive statistics will allow us to systematically summarize and contextualize asthma mortality data to discover useful preliminary information about its temporal, demographic, and spatial behavior.

### Temporal trends
Analyzing temporal trends shows how asthma mortality evolves over time and helps us identify patterns.

#### Number of deaths from asthma in Argentina (2001-2022)

In [None]:
# Grouping the data by year and summing the 'CANTIDAD' column
df_mpa = df.groupby('ANIO')['CANTIDAD'].sum().reset_index()

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df_mpa.head()

In [None]:
# Create figure
plt.figure(figsize=(12, 6))

# Create line plot
plt.plot(df_mpa['ANIO'], df_mpa['CANTIDAD'],
         marker='o',
         linestyle='-',
         color='#2c7bb6',
         linewidth=2.5,
         markersize=8)

# Formatting
plt.title('Number of deaths from asthma in Argentina (2001-2022)', fontsize=14, pad=20)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of deaths', fontsize=12)
plt.grid(True, alpha=0.3)
plt.xticks(df_mpa['ANIO'][::2], rotation=45)  # Show every other year
plt.xlim(2000.5, 2022.5)  # Add buffer on x-axis

# Improve tick formatting
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(2))
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(1))

plt.tight_layout()
plt.show();

#### Number of deaths by sex and age group

In [None]:
# Load the cleaned asthma mortality dataset from 2001 to 2022
df = pd.read_csv('pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_02.csv', dtype={'IDDPTO': str})

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df.head()

In [None]:
# Map the 'SEXO' column values to their corresponding labels using the sex_map dictionary
sex_map = {1: 'Masculino', 2: 'Femenino'}
df['SEXO'] = df['SEXO'].map(sex_map)

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df.head()

In [None]:
# Custom order of age groups
age_order = ['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75']

In [None]:
# Convert the 'GRUPEDAD' column to a categorical type with a specific order
df['GRUPEDAD'] = pd.Categorical(df['GRUPEDAD'], categories=age_order, ordered=True)

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df.head()

In [None]:
# Group the DataFrame by 'GRUPEDAD', 'SEXO', and 'ANIO', summing the 'CANTIDAD' column
df_agg = df.groupby(['GRUPEDAD', 'SEXO', 'ANIO'], as_index=False, observed=True)['CANTIDAD'].sum()

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df_agg.head()

We will create line graphs showing annual asthma deaths by sex and age group (2001–2022).

In [None]:
# Set style
sns.set(style="whitegrid")

# Create the FacetGrid: one plot per sex
g = sns.FacetGrid(df_agg, col="SEXO", height=5, aspect=1.4, sharey=False)

# Plot lines: deaths over year, color by age_group
g.map_dataframe(sns.lineplot, x="ANIO", y="CANTIDAD", hue="GRUPEDAD", marker="o")

# Extract legend handles and labels
handles, labels = g.axes[0][0].get_legend_handles_labels()

# Define your custom order
custom_order = ['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75']

# Sort handles and labels based on custom order
sorted_pairs = sorted(zip(labels, handles), key=lambda x: custom_order.index(x[0]))
sorted_labels, sorted_handles = zip(*sorted_pairs)


# Add legend
g.fig.legend(
    handles=sorted_handles,
    labels=sorted_labels,
    title="Age group",
    loc='center left',
    bbox_to_anchor=(0.91, 0.67),
    borderaxespad=0.
)

# For the subplot titles, change Femenino to Female and Masculino to Male
g.set_titles(col_template="{col_name}")
for ax in g.axes.flat:
    if ax.get_title() == "Masculino":
        ax.set_title("Male")
    elif ax.get_title() == "Femenino":
        ax.set_title("Female")

g.fig.suptitle("Annual deaths from asthma by sex and age group (2001–2022)", fontsize=16, y=1.05)

# Force integer ticks on y-axis
for ax in g.axes.flat:
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))

# change x and y labels
g.set_xlabels("Year")
g.set_ylabels("Number of deaths")

# Ensure that subplots fit into the figure area properly
plt.tight_layout()

plt.show();

### Demographic information

Demographic EDA will allow us to discover disparities in asthma mortality according to age and sex.

#### Deaths by sex and age group

In [None]:
# Load the cleaned asthma mortality dataset from 2001 to 2022
df = pd.read_csv('pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_02.csv', dtype={'IDDPTO': str})

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df.head()

In [None]:
# Map the 'SEXO' column values to their corresponding labels using the sex_map dictionary
sex_map = {1: 'Masculino', 2: 'Femenino'}
df['SEXO'] = df['SEXO'].map(sex_map)

In [None]:
# Grouping the data by 'SEXO' (sex) and summing up the 'CANTIDAD' (quantity) column
deaths_by_sex = df.groupby('SEXO')['CANTIDAD'].sum().reset_index()

# Change Femenino to Female and Masculino to Male
deaths_by_sex['SEXO'] = deaths_by_sex['SEXO'].replace({'Femenino': 'Female', 'Masculino': 'Male'})

# Grouping the data by 'GRUPEDAD' (age group) and summing up the 'CANTIDAD' (quantity) column
deaths_by_age_group = df.groupby('GRUPEDAD')['CANTIDAD'].sum().reset_index()

In [None]:
# Sorting DataFrame by Age Group
age_order = ['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75']
deaths_by_age_group['GRUPEDAD'] = pd.Categorical(deaths_by_age_group['GRUPEDAD'], categories=age_order, ordered=True)
deaths_by_age_group = deaths_by_age_group.sort_values('GRUPEDAD')

Create a pie chart showing the distribution of deaths by sex, and a bar chart showing deaths by age group.

In [None]:
# Set style
sns.set(style="whitegrid")

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# pie chart
axes[0].pie(
    deaths_by_sex['CANTIDAD'],
    labels=deaths_by_sex['SEXO'],
    autopct='%1.1f%%',
    startangle=90,
    colors=['#1f77b4', '#d62728']
)
axes[0].set_title('Distribution of deaths by sex (2001-2022)')

# bar chart
sns.barplot(
    data=deaths_by_age_group,
    x='GRUPEDAD',
    y='CANTIDAD',
    ax=axes[1],
    palette='viridis',
    hue='GRUPEDAD',
    legend=False,
    order=age_order
)

# Add values above the bars
# for p in axes[1].patches:
#     axes[1].annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
#                    ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
#                    textcoords='offset points')

# set titles and labels
axes[1].set_title('Deaths by age group (2001-2022)')
axes[1].set_xlabel('Age group')
axes[1].set_ylabel('Number of deaths')
axes[1].tick_params(axis='x', rotation=45)

# final adjustments
plt.tight_layout()
plt.show();

#### Age pyramid by sex – Deaths from asthma

In [None]:
# load data from a CSV file containing asthma mortality data from 2001 to 2022
df = pd.read_csv('pdt/asthma_mortality/data/csv/def_asma_2001_2022_clean_02.csv', dtype={'IDDPTO': str})

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=False)
df.head()

In [None]:
# Map the 'SEXO' column values to their corresponding labels using the sex_map dictionary
sex_map = {1: 'Masculino', 2: 'Femenino'}
df['SEXO'] = df['SEXO'].map(sex_map)

In [None]:
# group data by age group ('GRUPEDAD') and gender ('SEXO'), and calculate the total count ('CANTIDAD')
pyramid_df = df.groupby(['GRUPEDAD', 'SEXO'])['CANTIDAD'].sum().reset_index()

In [None]:
# Sorting DataFrame (pyramid_df ) by Age Group
age_order = ['0-4', '5-19', '20-44', '45-64', '65-74', '>= 75']
pyramid_df['GRUPEDAD'] = pd.Categorical(pyramid_df['GRUPEDAD'], categories=age_order, ordered=True)
pyramid_df = pyramid_df.sort_values('GRUPEDAD')

In [None]:
# Convert male deaths to negative values to get a pyramid effect
pyramid_df['CANTIDAD'] = np.where(
    pyramid_df['SEXO'] == 'Masculino', -pyramid_df['CANTIDAD'], pyramid_df['CANTIDAD']
)

Create a horizontal bar chart showing the age pyramid by sex for deaths from asthma.

In [None]:
# Create horizontal bar chart
plt.figure(figsize=(10, 6))
for sex, color in zip(['Masculino', 'Femenino'], ['#1f77b4', '#d62728']):
    subset = pyramid_df[pyramid_df['SEXO'] == sex]
    # Change labels here before plotting
    label = "Male" if sex == "Masculino" else "Female"
    plt.barh(subset['GRUPEDAD'], subset['CANTIDAD'], color=color, label=label)

# Aesthetics of the graph
plt.axvline(0, color='black', linewidth=0.5)
plt.xlabel('Number of deaths')
plt.ylabel('Age group')
plt.title('Age pyramid by sex – Deaths from asthma')
plt.legend(loc='lower right')

# Make all x-tick labels positive
xticks = plt.xticks()[0]  # Get current x-tick locations
plt.xticks(xticks, [abs(int(x)) for x in xticks])  # Replace labels with absolute values

# final adjustments
plt.tight_layout()
plt.show()

### Geospatial patterns

Exploratory Spatial Data Analysis (ESDA) will allow us to detect regional (departmental) and temporal variations in asthma mortality.

#### Mortality map by department (choropleth map)

We will create faceted choropleth maps showing the asthma mortality rate per 100,000 inhabitants for each department and for each year of the study period (2001-2022).

In [None]:
# Load  `gdf`  from a shapefile containing asthma mortality data from 2001 to 2022.
gdf = gpd.read_file("pdt/asthma_mortality/data/shp/tma_2001_2022.shp", encoding='utf-8')

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=True)
gdf.head()

In [None]:
# Determine the number of rows in the GeoDataFrame.
len(gdf)

We will use [Pysal](https://pysal.org/)'s [mapclassify](https://pysal.org/mapclassify/index.html) library to determine the best classifier for the choropleth map.

We will use the map classifier with the lowest ADCM (Absolute Deviation Around the Class Median). In mapclassify, ADCM measures how well a classifier fits the data: it adds up the absolute deviations of each observation from the median of its assigned class, so a lower value indicates a better fit.
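As a rough illustration of what this measure captures, the sketch below computes the sum of absolute deviations around class medians by hand in plain Python. It reflects our understanding of what the `adcm` attribute reports, not mapclassify's actual implementation. The function name `adcm`, the toy `rates` list, and both sets of class bounds are hypothetical and are not taken from the dataset.

```python
from bisect import bisect_left
from statistics import median


def adcm(values, upper_bounds):
    """Sum of absolute deviations of each value from its class median.

    `upper_bounds` are the sorted, inclusive upper limits of the classes.
    """
    classes = {}
    for v in values:
        # Index of the first class whose upper bound is >= v
        classes.setdefault(bisect_left(upper_bounds, v), []).append(v)

    total = 0.0
    for members in classes.values():
        m = median(members)
        total += sum(abs(v - m) for v in members)
    return total


# Hypothetical rates, not taken from the dataset
rates = [0, 0, 1, 2, 3, 10, 12, 50]
fit_a = adcm(rates, [2, 12, 50])  # three classes
fit_b = adcm(rates, [3, 50])      # two classes
print(fit_a, fit_b)
```

In this toy case the three-class scheme yields the smaller value, which is the sense in which one classifier fits the data better than another.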

In [None]:
# Select columns corresponding to normalized mortality rate  from 2001 to 2022
selected_data_2001_2022 = gdf.loc[:,["CA_2001", "CA_2002", "CA_2003", "CA_2004", "CA_2005", "CA_2006", "CA_2007", "CA_2008", "CA_2009", "CA_2010",
                                     "CA_2011", "CA_2012", "CA_2013", "CA_2014", "CA_2015", "CA_2016", "CA_2017", "CA_2018", "CA_2019", "CA_2020",
                                     "CA_2021", "CA_2022"]]

In [None]:
# Classify the data into 5 quantile groups
q5 = mapclassify.Quantiles(selected_data_2001_2022, k=5)
q5

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data_2001_2022, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data_2001_2022)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data_2001_2022, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data_2001_2022)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data_2001_2022, k=5)
fj5

ADCM (Absolute Deviation Around the Class Median) visualization

In [None]:
# Group the classifier objects
class5 = q5, ei5, ht, mb5, msd, fj5
# Collect the ADCM of each classifier
fits = np.array([c.adcm for c in class5])
# Build a DataFrame with the ADCM score and the name of each classifier
adcms = pd.DataFrame({
    "ADCM": fits,
    "Classifier": [c.name for c in class5],
})
# Plot the ADCM of each classifier as a horizontal bar chart
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue="Classifier", legend=False
)

Two classifiers have the lowest ADCM: FisherJenks and HeadTailBreaks. We will select FisherJenks as the classifier to create the choropleth maps.

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

In [None]:
# insert 0 at 0 position
bins.insert(0, 0.0)
bins

In [None]:
# round and transform to integer
bins = [int(round(num, 0)) for num in bins]
# Raise the last break to 193 so the maximum rate (192.31) falls inside the top class
bins[-1]=193
bins

We will create a function that applies a custom classification scheme based on the FisherJenks breaks.

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf[colname], bins)

    classification.plot(
        gdf,  # GeoDataFrame containing the data to be plotted
        legend=True,  # Enable the legend for the plot
        legend_kwds={
            "fmt": "{:.0f}",  # Format the legend labels as integers
            "loc": "upper right",  # Position the legend in the upper right corner
            "bbox_to_anchor": (1.1, 0.4),  # Adjust the legend's position
            "fontsize": 8,  # Set the font size of the legend
            "labels": ["0", "0-2", "2-7", "7-22", "22-71", "71-193"],  # Use the custom legend labels
        },
        axis_on=False,  # Disable the axis display
        border_color='black',  # Set the border color of the plot
        cmap="viridis_r",  # Use the reversed Viridis colormap
        ax=axes[row, col]  # Specify the subplot to draw the plot on
    )

    # Set the title for the current axis
    axes[row, col].set_title(title)
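To make the custom scheme easier to read, here is a small sketch of how a rate maps to a legend label. It assumes that `UserDefined` treats each bin as an inclusive upper bound, which is our understanding of mapclassify's behavior. The helper `class_label` and the sample rates are hypothetical. The bins and labels are the ones used in this notebook.

```python
from bisect import bisect_left

# Upper bounds and legend labels used in this notebook
bins = [0, 2, 7, 22, 71, 193]
labels = ["0", "0-2", "2-7", "7-22", "22-71", "71-193"]


def class_label(rate):
    """Return the legend label of the class a rate falls into.

    Assumes each bin is an inclusive upper bound; rates above the
    last bin are not handled in this sketch.
    """
    return labels[bisect_left(bins, rate)]


# Hypothetical sample rates, plus the maximum observed rate (192.31)
for rate in [0.0, 1.5, 2.0, 6.9, 192.31]:
    print(rate, "->", class_label(rate))
```

Under this reading, only departments with a rate of exactly 0 fall into the first class, which is why it is labeled "0".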

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Normalized Asthma Mortality Rate 2001-2022', fontsize=14, y=1)

# Plot one map per year, filling the grid row by row (5 maps per row)
years = range(2001, 2023)
for i, year in enumerate(years):
    maptma(f"CA_{year}", str(year), i // 5, i % 5)

# Turn off the three unused axes at the end of the last row
for ax in axes.flat[len(years):]:
    ax.axis('off')

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot

Creating an animated GIF showing the mortality rate by year

In [None]:
# This function generates a map for a specific year and saves it as an image for use in a GIF.
def plot_map_for_gif(colname, year, bins, gdf, output_folder="pdt/asthma_mortality/data/images/gif"):
    # Create a figure and axis for the plot
    fig, ax = plt.subplots(figsize=(4, 4))

    # Classify the data using user-defined bins
    classification = mapclassify.UserDefined(gdf[colname], bins)

    # Plot the classified data with custom legend and styling
    classification.plot(
        gdf,
        legend=True,
        legend_kwds={
            "fmt": "{:.0f}",  # Format legend labels as integers
            "loc": "upper right",  # Place legend in the upper right
            "bbox_to_anchor": (1.1, 0.4),  # Adjust legend position
            "fontsize": 8,  # Set legend font size
            "labels": ["0", "0-2", "2-7", "7-22", "22-71", "71-193"]  # Custom labels for bins
        },
        axis_on=False,  # Turn off axis
        border_color='black',  # Set border color
        cmap="viridis_r",  # Use reversed Viridis colormap
        ax=ax  # Plot on the created axis
    )

    # Set the title for the map
    ax.set_title(f"Normalized Asthma Mortality Rate - {year}", fontsize=12)
    # Turn off axis display
    ax.axis('off')

    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Define the file path for saving the image
    filepath = os.path.join(output_folder, f"asthma_{year}.png")
    # Save the plot as a PNG file
    plt.savefig(filepath, bbox_inches="tight", dpi=150)
    # Close the plot to free up memory
    plt.close()

In [None]:
bins = [0, 2, 7, 22, 71, 193]  # Define the bin ranges for categorizing data
years = list(range(2001, 2023))  # Create a list of years from 2001 to 2022

for year in years:  # Iterate through each year in the list
    colname = f"CA_{year}"  # Generate a column name based on the year
    plot_map_for_gif(colname, year, bins, gdf)  # Call the function to plot the map for the given year and bins


In [None]:
def create_gif_from_maps(image_folder="pdt/asthma_mortality/data/images/gif", gif_name="pdt/asthma_mortality/data/images/gif/asthma_mortality.gif"):
    # Initialize an empty list to store images
    images = []
    # Loop through the years 2001 to 2022
    for year in range(2001, 2023):
        # Construct the filename for each year's image
        filename = f"asthma_{year}.png"
        # Create the full file path by joining the folder and filename
        filepath = os.path.join(image_folder, filename)
        # Read the image and append it to the images list
        images.append(imageio.imread(filepath))

    # Save the images as an animated GIF with 1 frame per second
    imageio.mimsave(gif_name, images, fps=1)


In [None]:
# apply create_gif_from_maps function
create_gif_from_maps()

In [None]:
# Displaying an image of asthma mortality
Image(filename="pdt/asthma_mortality/data/images/gif/asthma_mortality.gif")

#### Departments with persistent asthma mortality rate (> 0) from 2001 to 2022

In [None]:
# Load  `gdf`  from a shapefile containing asthma mortality data from 2001 to 2022
gdf = gpd.read_file("pdt/asthma_mortality/data/shp/tma_2001_2022.shp", encoding='utf-8')

## 🐕‍🦺 Outlier Detection

ESDA revealed departments with unusually high or low asthma mortality rates. In this section we will focus on detecting and removing outliers using statistical methods.

### Load dataset

In [None]:
# Load  `gdf`  from a shapefile containing only asthma mortality rate from 2001 to 2022
gdf = gpd.read_file("pdt/asthma_mortality/data/shp/tma_2001_2022.shp", encoding='utf-8')

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=True)
gdf.head()

### Statistical summary and distribution analysis

In [None]:
# Filter asthma mortality rate columns
mortality_cols = [col for col in gdf.columns if col.startswith('CA_')]
mortality_df = gdf[mortality_cols].copy()

In [None]:
# Summary statistics and initial distribution plots
init_notebook_mode(all_interactive=False)
mortality_df.describe().T

In [None]:
# Plot boxplots for each year
plt.figure(figsize=(16, 6))
sns.boxplot(data=mortality_df, orient='h')
plt.title("Boxplots of Normalized Asthma Mortality Rates (2001–2022)")
plt.show()

We observed that the asthma mortality rate in the Lihuel Calel department was unusually high in 2001 (192 deaths per 100,000 inhabitants). To investigate potential environmental causes, we searched online for relevant events in the area, but found no significant or widely documented environmental incidents in Lihuel Calel during 2001. According to the national census, the department's population in 2001 was only 520 inhabitants, a 12.16% decrease from the 1991 census, which reported 592 inhabitants. The elevated rate is therefore primarily an effect of the very small population, which amplifies the impact of even a few recorded deaths: a single death among 520 inhabitants already corresponds to about 192 deaths per 100,000.
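The figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only. The value `deaths_2001 = 1` is an assumption inferred from the reported rate of 192.31 and the census population, not a figure read from the dataset.

```python
population_1991 = 592  # census figure quoted above
population_2001 = 520  # census figure quoted above
deaths_2001 = 1        # assumption: inferred from the rate, not read from the data

# Crude rate per 100,000 inhabitants
rate_per_100k = deaths_2001 / population_2001 * 100_000

# Relative population decline between the two censuses
decline_pct = (population_1991 - population_2001) / population_1991 * 100

print(f"Rate: {rate_per_100k:.2f} per 100,000")
print(f"Population decline 1991-2001: {decline_pct:.2f}%")

# A single additional death would double the rate
print(f"Rate with two deaths: {2 / population_2001 * 100_000:.2f} per 100,000")
```

With such a small denominator, the rate can only move in steps of roughly 192 per 100,000, so it says little about the underlying risk.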

In [None]:
# Filter the row for the Lihuel Calel department (rate of 192.31 in 2001)
init_notebook_mode(all_interactive=True)
filtered_gdf = gdf[np.isclose(gdf['CA_2001'], 192.31, atol=0.01)]
filtered_gdf

### Outlier detection  and removal

This effect is known in epidemiology as the small counts problem. It refers to the statistical challenges that arise when analyzing rare events or diseases in small populations, which lead to unstable or unreliable estimates of rates, risks, or associations<sup>1</sup>.


The limitations of small population sizes and low death numbers significantly impact the accuracy of asthma mortality rate estimates by introducing variability and potential biases in the data. Small populations may not capture the full spectrum of mortality events, leading to unreliable age-specific mortality rates. Additionally, low death counts can result in statistical fluctuations, making it difficult to discern true trends and patterns in asthma-related mortality, ultimately compromising the validity of the estimates derived from such data<sup>2,3</sup>.

To deal with outliers arising from departments with small populations and low death counts, we adopted the following solution, based on the guidelines in<sup>4</sup>:

* Flag as statistically unstable those departments with population < 5,000, deaths < 5, and relative CI (Confidence Interval) width > 2.0, where the relative width is the CI width divided by the rate.

* Set the normalized mortality rate of those departments to 0.
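As a minimal illustration of this rule for a single department, the sketch below applies the three conditions in plain Python. `poisson_ppf` is a dependency-free stand-in for `scipy.stats.poisson.ppf`, and `is_unstable` is a hypothetical helper, not part of the notebook's pipeline. The first example uses the Lihuel Calel population with one assumed death. The other two departments are invented.

```python
import math


def poisson_ppf(q, mu):
    """Smallest integer k with P(X <= k) >= q for X ~ Poisson(mu)."""
    k, pmf = 0, math.exp(-mu)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mu / k
        cdf += pmf
    return k


def is_unstable(population, deaths,
                pop_limit=5000, death_limit=5, width_limit=2.0):
    """Apply the three flagging conditions to one department-year."""
    if deaths == 0:
        return False  # rate is 0, so the relative width is undefined
    rate = deaths / population * 100_000
    ci_low = poisson_ppf(0.025, deaths) / population * 100_000
    ci_high = poisson_ppf(0.975, deaths) / population * 100_000
    rel_width = (ci_high - ci_low) / rate
    return (population < pop_limit
            and deaths < death_limit
            and rel_width > width_limit)


print(is_unstable(520, 1))       # Lihuel Calel population, one assumed death
print(is_unstable(4000, 4))      # hypothetical small department
print(is_unstable(250_000, 40))  # hypothetical large department
```

Because the population cancels out of the relative width, that condition depends only on the death count. The population threshold is what restricts the rule to small departments.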

To flag departments according to these conditions, we will execute the following function:

In [None]:
def flag_and_correct_unstable_rates_efficient(gdf, start_year=2001, end_year=2022):
    all_ci_dfs = []  # List to collect CI and flag DataFrames for each year

    # Loop through each year in the specified range
    for year in range(start_year, end_year + 1):
        pop_col    = f"A_{year}"    # Population column for the year
        deaths_col = f"C_{year}"    # Deaths count column for the year
        rate_col   = f"CA_{year}"   # Mortality rate column for the year

        # Skip processing if any of the required columns are missing
        if not {pop_col, deaths_col, rate_col}.issubset(gdf.columns):
            continue  # Move to next year if columns are not found

        # Extract the relevant series for calculations
        pop    = gdf[pop_col]       # Population data
        deaths = gdf[deaths_col]    # Death counts
        rate   = gdf[rate_col]      # Mortality rate

        # Compute the 95% Poisson confidence interval for deaths
        ci_low   = poisson.ppf(0.025, deaths) / pop * 100000   # Lower bound per 100k
        ci_high  = poisson.ppf(0.975, deaths) / pop * 100000   # Upper bound per 100k
        ci_width = ci_high - ci_low                            # Absolute CI width
        rel_w    = ci_width / rate.replace(0, np.nan)          # Relative CI width (NaN where the rate is 0, avoiding divide-by-zero)

        # Flag as unstable when all conditions are met:
        # 1) population < 5000
        # 2) deaths < 5
        # 3) relative CI width > 2.0
        unstable = (pop < 5000) & (deaths < 5) & (rel_w > 2.0)

        # Overwrite the original rate with zero for unstable entries
        gdf.loc[unstable, rate_col] = 0

        # Create a temporary DataFrame with CI and flag columns
        ci_df = pd.DataFrame({
            f"ci_low_{year}"      : ci_low,
            f"ci_high_{year}"     : ci_high,
            f"ci_width_{year}"    : ci_width,
            f"rel_ci_width_{year}": rel_w,
            f"unstable_{year}"    : unstable
        }, index=gdf.index)

        # Append this year's CI DataFrame to the list
        all_ci_dfs.append(ci_df)

        # Log the result
        num_unstable = ci_df[f'unstable_{year}'].sum()
        print(f"Year {year}: {num_unstable} unstable departments flagged.")

    # After looping, concatenate all CI/flag DataFrames alongside the original GeoDataFrame
    if all_ci_dfs:
        gdf = pd.concat([gdf] + all_ci_dfs, axis=1)

    return gdf  # Return the updated GeoDataFrame with CI and flag columns added


In [None]:
# apply the function
gdf2 = flag_and_correct_unstable_rates_efficient(gdf)

The number of departments flagged as statistically unstable per year is as follows:

* Year 2001: 4 unstable departments flagged.
* Year 2002: 0 unstable departments flagged.
* Year 2003: 5 unstable departments flagged.
* Year 2004: 0 unstable departments flagged.
* Year 2005: 1 unstable departments flagged.
* Year 2006: 3 unstable departments flagged.
* Year 2007: 1 unstable departments flagged.
* Year 2008: 1 unstable departments flagged.
* Year 2009: 3 unstable departments flagged.
* Year 2010: 0 unstable departments flagged.
* Year 2011: 0 unstable departments flagged.
* Year 2012: 0 unstable departments flagged.
* Year 2013: 2 unstable departments flagged.
* Year 2014: 1 unstable departments flagged.
* Year 2015: 2 unstable departments flagged.
* Year 2016: 0 unstable departments flagged.
* Year 2017: 1 unstable departments flagged.
* Year 2018: 0 unstable departments flagged.
* Year 2019: 1 unstable departments flagged.
* Year 2020: 0 unstable departments flagged.
* Year 2021: 0 unstable departments flagged.
* Year 2022: 0 unstable departments flagged.




In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=True)
gdf2.head()

In [None]:
# select the same columns as the original gdf
gdf2 = gdf2[list(gdf.columns)]

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=True)
gdf2.head()

### Visual Confirmation

In [None]:
# Filter asthma mortality rate columns
mortality_cols = [col for col in gdf2.columns if col.startswith('CA_')]
mortality_df = gdf2[mortality_cols].copy()

In [None]:
# Plot boxplots for each year
plt.figure(figsize=(16, 6))
sns.boxplot(data=mortality_df, orient='h')
plt.title("Boxplots of Normalized Asthma Mortality Rates (2001–2022)")
plt.show()

In [None]:
# export gdf as a shapefile
gdf2.to_file("pdt/asthma_mortality/data/shp/tma_2001_2022_2.shp", encoding='utf-8')

## References


1. Andresen EM, Diehr PH, Luke DA. Public Health Surveillance of Low-Frequency Populations. Annual Review of Public Health. 2004;25:25-52. doi:10.1146/annurev.publhealth.25.101802.123111

2. Kostaki A, Zafeiris KN. Dealing with limitations of empirical mortality data in small populations. Communications in Statistics: Case Studies, Data Analysis and Applications. 2019;5(1):39-45. doi:10.1080/23737484.2019.1578706

3. Berrill WT. Is the death rate from asthma exaggerated? Evidence from west Cumbria. BMJ. 1993;306(6871):193-194. doi:10.1136/bmj.306.6871.193

4. Washington State Department of Health. Guidelines for using confidence intervals for public health assessment. Published online 2012. https://doh.wa.gov/sites/default/files/legacy/Documents/1500/ConfIntGuide.pdf




