<a href="https://colab.research.google.com/github/acoiman/pdt/blob/main/asthma_mortality/notebooks/R/09.Asthma_Mortality_EDA_Predictor_Variables.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Exploratory Data Analysis (EDA) of Predictor Variables

In this notebook we perform an exploratory data analysis (EDA) of the predictor variables to better understand their temporal and spatial dynamics and their potential association with the asthma mortality rate in Argentina. The analysis begins by loading and preparing the data and checking its quality (missing values and internal consistency). It then explores the statistical behavior of each predictor through distributional analysis, yearly trends, treemaps, and spatial visualizations such as departmental choropleth maps.

Special attention is given to critical variables, including PM₂.₅, normalized burned areas (NBA), population density (PD), land-use transitions (NAGRT, NNWVT, NBUT), and derived interactions like PD×PM₂.₅, providing insight into their geographic patterns and evolution over time.

The notebook further evaluates the relationship between predictors and the asthma mortality rate (CA) through correlation analysis, interaction terms, lagged effects, and multicollinearity diagnostics using correlation matrices and VIF. Finally, dimensionality reduction techniques such as PCA or clustering are applied to identify underlying structures in the predictor space. Together, these steps establish a rigorous analytical foundation for subsequent modeling and inference on environmental determinants of the asthma mortality rate.

## 🤖 Load libraries

The libraries required for the analysis are loaded below.

In [None]:
# dataframe libraries
import pandas as pd
import numpy as np

# plot libraries
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.patches import Patch
import seaborn as sns
import squarify

# statistical libraries
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# geospatial libraries
import geopandas as gpd
import mapclassify

# other libraries
from datetime import datetime
from itables import init_notebook_mode, show
import os

In [None]:
# change directory to the work folder (at startup, the Docker container opens in /home/jovyan/)
%cd work

In [None]:
# Set the PROJ_LIB path
os.environ['PROJ_LIB'] = "/opt/conda/envs/gds/share/proj"

## 💾 Load and reduce data

In [None]:
# Load data
gdf = gpd.read_file("pdt/asthma_mortality/data/gpkg/data.gpkg")

In [None]:
# visualize geo data.frame
init_notebook_mode(all_interactive=True)
gdf.head()

In [None]:
# inspect any nan in columns
init_notebook_mode(all_interactive=True)
gdf.isna().sum()

In [None]:
# Reshape df to ts long format
years = range(2001, 2023)
records = []

In [None]:
for _, row in gdf.iterrows():
    iddpto = row["IDDPTO"]
    geometry = row["geometry"]
    for year in years:
        records.append({
            "IDDPTO": iddpto,
            "YEAR": year,
            "CA": row.get(f"CA_{year}", np.nan),
            "PM25": row.get(f"PM25_{year}", np.nan),
            "NBA": row.get(f"NBA_{year}", np.nan),
            "PD": row.get(f"PD_{year}", np.nan),
            "PDPM25": row.get(f"PDPM25_{year}", np.nan),
            "NAGRT": row.get(f"NAGRT_{year}", np.nan),
            "NNWVT": row.get(f"NNWVT_{year}", np.nan),
            "NBUT": row.get(f"NBUT_{year}", np.nan),
            "ELEV": row.get(f"ELEV_{year}", np.nan),
            "geometry": geometry # Add geometry
            })

In [None]:
# create a new DataFrame from the list of records
panel_gdf = pd.DataFrame(records)

In [None]:
# Sort and reset index
panel_gdf = panel_gdf.sort_values(by=["IDDPTO", "YEAR"]).reset_index(drop=True)
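
The row-by-row loop above can also be expressed with pandas' built-in reshaping. The following is a minimal sketch on a made-up two-department table, assuming the same `<VARIABLE>_<YEAR>` column naming; the toy values and the names `wide` and `long_df` are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy wide table (made-up values) that mimics the <VARIABLE>_<YEAR> naming
wide = pd.DataFrame({
    "IDDPTO": ["A", "B"],
    "CA_2001": [1.0, 2.0], "CA_2002": [1.5, 2.5],
    "PD_2001": [10.0, 20.0], "PD_2002": [11.0, 21.0],
    "PDPM25_2001": [100.0, 200.0], "PDPM25_2002": [110.0, 210.0],
})

# wide_to_long splits each column name into a stub and a numeric suffix,
# producing one row per (IDDPTO, YEAR) pair
long_df = (
    pd.wide_to_long(wide, stubnames=["CA", "PD", "PDPM25"],
                    i="IDDPTO", j="YEAR", sep="_")
    .reset_index()
    .sort_values(["IDDPTO", "YEAR"])
    .reset_index(drop=True)
)
print(long_df)
```

Because the separator is part of the match, the `PD` stub does not capture the `PDPM25_*` columns. On a full table this approach avoids `iterrows`, which tends to be slow on large frames.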

In [None]:
# visualize the first rows
init_notebook_mode(all_interactive=False)
panel_gdf.head()

In [None]:
# visualize the number of rows and columns of the data.frame
panel_gdf.shape

## 🧹 Data Quality

In this section we will inspect missing values per column, check the distribution of each feature, and verify the temporal consistency of the dataset.

### Inspect missing values

In [None]:
# drop geometry column
df = panel_gdf.drop(columns=["geometry"])

In [None]:
# check missing values
print(df.isna().sum())

In [None]:
# check number of duplicated rows
print(df.duplicated().sum())
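
Beyond missing values and duplicated rows, temporal consistency can be checked by confirming that every department has exactly one record per year. A minimal sketch on a made-up panel follows; the toy data and the names `panel`, `expected_years`, and `missing_years` are illustrative assumptions, not objects from this notebook.

```python
import pandas as pd

# Made-up panel: department B has no record for 2002
panel = pd.DataFrame({
    "IDDPTO": ["A", "A", "A", "B", "B"],
    "YEAR": [2001, 2002, 2003, 2001, 2003],
})
expected_years = set(range(2001, 2004))

# Count repeated (department, year) keys
dup_keys = int(panel.duplicated(subset=["IDDPTO", "YEAR"]).sum())

# For each department, list the expected years that are absent
missing_years = {}
for dept, grp in panel.groupby("IDDPTO"):
    gap = sorted(expected_years - set(grp["YEAR"]))
    if gap:
        missing_years[dept] = gap

print("Duplicated keys:", dup_keys)
print("Missing years per department:", missing_years)
```

With the real data, `expected_years` would be `set(range(2001, 2023))`, and an empty `missing_years` dictionary would indicate a balanced panel.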

### Verify data consistency

In [None]:
# Features to check (excluding ID, geometry)
features = ['CA', 'PM25', 'NBA', 'PD', 'PDPM25', 'NAGRT', 'NNWVT', 'NBUT', 'ELEV']

# Total zero counts per feature
zero_counts = (df[features] == 0).sum().sort_values(ascending=False)
print("Total Zero Values per Feature")
print(zero_counts)


In [None]:
# Percentage of zero counts per feature
total_records = len(df)
zero_percentages = round((zero_counts / total_records) * 100, 2)
print("Percentage of Zero Values per Feature ")
print(zero_percentages.sort_values(ascending=False))

In [None]:
# Heatmap visualization
# Calculate zero counts per feature per year
zero_counts_yearly = df.groupby('YEAR')[features].apply(lambda x: (x == 0).sum())

plt.figure(figsize=(12, 6))
sns.heatmap(zero_counts_yearly.T, annot=True, fmt="d", cmap="Reds")
plt.title("Zero Value Counts per Feature per Year")
plt.xlabel("Year")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

## 🔮 Predictor Variable Exploration

In this section we will explore the distributions, correlations, and spatio-temporal trends of the predictor variables.

### Visualization of distributions of each variable

With the following code, we will generate histograms with KDE curves to visualize the distribution shapes (e.g., normal, skewed, multimodal). In addition, we will create boxplots to examine the spread of the data and identify potential outliers.

In [None]:
# Select only numeric variables for distribution check
numeric_vars = ['CA', 'PM25', 'NBA', 'PD', 'PDPM25', 'NAGRT', 'NNWVT', 'NBUT', 'ELEV']

In [None]:
# Plot histograms + KDE
n_cols = 3
n_rows = (len(numeric_vars) + n_cols - 1) // n_cols

plt.figure(figsize=(16, n_rows * 4))

for i, col in enumerate(numeric_vars, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.histplot(df[col], kde=True, bins=30, color="steelblue")
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()


In [None]:
# show boxplots for skewed distributions
plt.figure(figsize=(16, n_rows * 3))
for i, col in enumerate(numeric_vars, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.boxplot(x=df[col], color="orange")
    plt.title(f"Boxplot of {col}")
    plt.xlabel(col)

plt.tight_layout()
plt.show()
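
The boxplots flag points beyond the whiskers as potential outliers. As a rough numeric counterpart, the sketch below counts values outside the conventional 1.5 × IQR fences. The helper `iqr_outlier_mask` and the sample values are hypothetical, and the 1.5 multiplier is the common boxplot default rather than a threshold stated in this notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean array marking values outside the k * IQR fences."""
    x = np.asarray(values, dtype=float)
    # nanpercentile ignores missing values when computing the quartiles
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Made-up sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
mask = iqr_outlier_mask(sample)
print(int(mask.sum()), "potential outlier(s)")
```

Applied per column, for example `iqr_outlier_mask(df["PD"]).sum()`, this would give a count to read alongside each boxplot.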

### Yearly trends of predictor variables

#### Plotting yearly trends for each predictor variable

In [None]:
# visualize the data.frame
df.head()

In [None]:
# copy dataframe
df_ts = df.copy()

In [None]:
# Ensure YEAR is in datetime format and extract the year number
df_ts['YEAR_DT'] = pd.to_datetime(df_ts['YEAR'], format='%Y')
df_ts['year_num'] = df_ts['YEAR_DT'].dt.year

In [None]:
df_ts.head()

In [None]:
# Define predictor variables
predictors = ['PM25', 'NBA', 'PD', 'PDPM25', 'NAGRT', 'NNWVT', 'NBUT', 'ELEV']

In [None]:
# Plot yearly trends for each predictor variable in separate subplots
n_predictors = len(predictors)  # Get the number of predictor variables
n_cols = 3  # Set the number of columns for subplots
n_rows = (n_predictors + n_cols - 1) // n_cols  # Calculate the number of rows needed for subplots

plt.figure(figsize=(15, n_rows * 5))  # Create a figure with a specified size

for i, var in enumerate(predictors, 1):  # Loop through each predictor variable
    plt.subplot(n_rows, n_cols, i)  # Create a subplot for the current variable
    yearly_mean = df_ts.groupby('year_num')[var].mean()  # Calculate yearly mean for the variable
    plt.plot(yearly_mean.index, yearly_mean.values, marker='o', linestyle='-')  # Plot the yearly trend
    plt.title(f"Yearly Trend of {var}")  # Set the title for the subplot
    plt.xlabel("Year")  # Label the x-axis
    plt.ylabel("Mean Value")  # Label the y-axis
    plt.grid(True)  # Enable grid for better visualization

plt.tight_layout()  # Adjust layout to prevent overlapping
plt.show()  # Display the plots


#### Creating treemaps of predictor variables by year

In [None]:
# Create subplots (adjust number of rows/columns as needed)
n_vars = len(predictors)
n_cols = 4
n_rows = (n_vars + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 10))
axes = axes.flatten()

for i, var in enumerate(predictors):
    ax = axes[i]

    # Compute yearly mean for each predictor
    yearly_mean = df_ts.groupby('year_num')[var].mean().reset_index()

    # Prepare labels and sizes for the treemap
    sizes = yearly_mean[var].values
    labels = [f"{year}\n{val:.2f}" for year, val in zip(yearly_mean['year_num'], yearly_mean[var])]

    # Create treemap
    squarify.plot(
        sizes=sizes,
        label=labels,
        color=plt.cm.RdYlBu(sizes / max(sizes)),
        alpha=0.8,
        ax=ax
    )

    ax.set_title(f"{var} - Mean Value by Year", fontsize=11)
    ax.axis('off')

# Remove any empty subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.suptitle("Treemaps of Predictor Variables (Mean Value per Year)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### Spatial patterns of predictor variables

In this section, we will create faceted choropleth maps of the predictor variables, one map per study year (2001-2022). We will use the [mapclassify](https://pysal.org/mapclassify/index.html) library from [PySAL](https://pysal.org/) to choose a classifier for each variable by comparing ADCM (Absolute Deviation around Class Medians) scores. ADCM measures how well a classifier fits the data: each observation's absolute deviation from the median of its assigned class is summed over all observations, so a lower ADCM indicates a better fit.
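
As a rough illustration of the fit measure described above, the sketch below computes an ADCM-style score by hand using only the standard library. It assumes that class breaks are inclusive upper bounds and that the deviations are summed over all observations, which is our understanding of how mapclassify's `adcm` attribute is computed. The function `adcm_score` and the data are hypothetical.

```python
from bisect import bisect_left
from statistics import median

def adcm_score(values, upper_bounds):
    """Sum of absolute deviations of each value from its class median."""
    # Class k holds the values in (upper_bounds[k-1], upper_bounds[k]]
    classes = {}
    for v in values:
        classes.setdefault(bisect_left(upper_bounds, v), []).append(v)
    total = 0.0
    for members in classes.values():
        center = median(members)
        total += sum(abs(v - center) for v in members)
    return total

values = [1, 2, 3, 10, 11, 12]           # made-up data with two obvious groups
good_fit = adcm_score(values, [3, 12])   # breaks that respect the two groups
poor_fit = adcm_score(values, [2, 12])   # breaks that split the first group
print(good_fit, poor_fit)
```

The breaks that keep each natural group together yield the smaller score, which is the sense in which a lower value indicates a better classification.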

#### Particulate Matter < 2.5 µm (PM2.5)

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "PM25_"
pm25_columns = [col for col in gdf_cl.columns if col.startswith("PM25_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,pm25_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of PM2.5


Two classifiers have the lowest ADCM: FisherJenks and HeadTailBreaks. We'll select FisherJenks as the classifier to create the choropleth maps.

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

We will create a function to set a custom classification scheme based on the FisherJenks bins.

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.2, 0.4),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels": ["4.51-10.43", "10.43-13.56", "13.56-17.85", "17.85-24.82", "24.82-48.35"]  # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Annual mean concentration (µg/m³) by departments of $PM_{2.5}$ - 2001-2022', fontsize=14, y=1)

# row 0
maptma("PM25_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("PM25_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("PM25_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("PM25_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("PM25_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("PM25_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("PM25_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("PM25_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("PM25_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("PM25_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("PM25_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("PM25_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("PM25_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("PM25_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("PM25_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("PM25_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("PM25_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("PM25_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("PM25_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("PM25_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("PM25_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("PM25_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot
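
The 22 explicit calls above follow a regular pattern, so the grid positions could also be generated in a loop. Below is a small sketch of the index arithmetic, assuming the same 5-column layout. The names `positions` and `unused` are illustrative, and the `maptma` call is shown only as a comment because it needs the figure and data defined earlier.

```python
years = range(2001, 2023)
n_cols = 5
n_rows = -(-len(years) // n_cols)  # ceiling division: 22 maps need 5 rows

# Map each year to its (row, col) cell, filling the grid row by row
positions = {year: divmod(i, n_cols) for i, year in enumerate(years)}

# Cells left without a map, whose axes would be switched off
used = set(positions.values())
unused = [(r, c) for r in range(n_rows) for c in range(n_cols) if (r, c) not in used]

for year, (row, col) in positions.items():
    # In the notebook this is where the map would be drawn, e.g.:
    # maptma(f"PM25_{year}", str(year), row, col)
    pass

print(positions[2022], unused)
```

The same loop could serve every variable by changing only the column prefix.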

#### NBA (Normalized Burned Areas)

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "NBA_"
nba_columns = [col for col in gdf_cl.columns if col.startswith("NBA_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,nba_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of NBA


Two classifiers have the lowest ADCM: FisherJenks and HeadTailBreaks. We'll select FisherJenks as the classifier to create the choropleth maps.

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

We will include a zero (0) value at the first position of the bin list, since in some departments and years no wildfires were detected by satellites.

In [None]:
# insert 0 at 0 position
bins.insert(0, 0)
bins
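
To see what the extra break does, the sketch below assigns values to classes with and without a leading 0. It assumes that breaks act as inclusive upper bounds, as we understand `UserDefined` to treat them. The break values and the helper `classify` are made up for illustration.

```python
from bisect import bisect_left

def classify(value, upper_bounds):
    """Index of the first break that is greater than or equal to value."""
    return bisect_left(upper_bounds, value)

bins_example = [5.0, 20.0, 80.0]       # made-up breaks standing in for Fisher-Jenks output
bins_with_zero = [0] + bins_example    # same breaks with a dedicated zero class

# Without the extra break, "no fire" and "small fire" share class 0
print(classify(0.0, bins_example), classify(3.2, bins_example))

# With the extra break, zero gets its own class and 3.2 moves to class 1
print(classify(0.0, bins_with_zero), classify(3.2, bins_with_zero))
```

Separating the zero class keeps departments without detected burned area visually distinct from those with small burned areas.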

We will build the legend labels and then create a function to set a custom classification scheme based on the FisherJenks bins.

In [None]:
# Extract intervals from bins
intervals = [f"{bins[i]:.2f}-{bins[i+1]:.2f}" for i in range(len(bins)-1)]
intervals

In [None]:
# add '0.00' at position 0 to the list of intervals
intervals.insert(0, '0.00')
intervals
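
The two labeling steps above could be wrapped in a small helper so that the same logic is reused for other variables. This is a sketch only: `interval_labels` is a hypothetical function and the break values are made up.

```python
def interval_labels(breaks, zero_label=None):
    """Build 'low-high' legend labels from consecutive breaks."""
    labels = [f"{lo:.2f}-{hi:.2f}" for lo, hi in zip(breaks[:-1], breaks[1:])]
    if zero_label is not None:
        # Extra label for the dedicated zero class
        labels.insert(0, zero_label)
    return labels

example_breaks = [0, 5.0, 20.0, 80.0]  # made-up breaks, already including the leading 0
print(interval_labels(example_breaks, zero_label="0.00"))
```

The helper returns one label per class, which is the number of entries the legend expects when the break list includes the leading 0.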

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.30, 0.4),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels":  intervals # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Normalized Burned Area (km$^2$) per 1000 km$^2$ by departments - 2001-2022', fontsize=14, y=1)

# row 0
maptma("NBA_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("NBA_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("NBA_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("NBA_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("NBA_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("NBA_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("NBA_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("NBA_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("NBA_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("NBA_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("NBA_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("NBA_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("NBA_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("NBA_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("NBA_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("NBA_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("NBA_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("NBA_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("NBA_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("NBA_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("NBA_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("NBA_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot

#### Population density (PD)

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "PD_"
pd_columns = [col for col in gdf_cl.columns if col.startswith("PD_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,pd_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of PD


We observed that the Fisher-Jenks and Head/Tail Breaks classifiers produced the lowest ADCM values. However, in this case, they do not provide a clear visual spatial pattern for the PD variable. Therefore, we will use the Quantiles classifier instead.

In [None]:
# Convert the bins to a list for further processing
bins = q4.bins.tolist()
bins
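
To see why quantile breaks can give a more readable map for a strongly skewed variable, the sketch below compares class counts under quantile and equal-interval breaks on made-up values. It uses only the standard library and treats break values as inclusive upper bounds. It illustrates the idea and is not a reimplementation of mapclassify.

```python
from bisect import bisect_left
from statistics import quantiles

def class_counts(values, upper_bounds):
    """Number of values falling in each class defined by inclusive upper bounds."""
    counts = [0] * len(upper_bounds)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts

data = [1, 2, 3, 4, 5, 6, 7, 1000]  # made-up, heavily right-skewed values

# Quantile breaks: three quartile cut points plus the maximum as the last bound
quantile_bounds = quantiles(data, n=4, method="inclusive") + [max(data)]

# Equal-interval breaks: four classes of identical width
width = (max(data) - min(data)) / 4
equal_bounds = [min(data) + width * k for k in range(1, 4)] + [max(data)]

counts_quantile = class_counts(data, quantile_bounds)
counts_equal = class_counts(data, equal_bounds)
print(counts_quantile, counts_equal)
```

Quantile breaks spread the observations evenly across classes, whereas equal intervals place almost every value in the first class when one extreme value stretches the range.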

In [None]:
# add 0.02 to bin for labeling
bins_label  = bins.copy()
bins_label.insert(0, 0.02)

# Extract intervals from bins
intervals = [f"{bins_label[i]:.2f}-{bins_label[i+1]:.2f}" for i in range(len(bins_label)-1)]
intervals

We will create a function to set a custom classification scheme based on the Quantiles bins.

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.3, 0.4),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels": intervals # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Population Density (hab/km$^2$) by departments - 2001-2022', fontsize=14, y=1)

# row 0
maptma("PD_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("PD_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("PD_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("PD_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("PD_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("PD_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("PD_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("PD_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("PD_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("PD_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("PD_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("PD_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("PD_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("PD_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("PD_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("PD_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("PD_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("PD_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("PD_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("PD_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("PD_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("PD_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot

#### PDPM25 (Population Density × Particulate Matter < 2.5 µm (PM2.5))

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "PDPM25_"
pdpm25_columns = [col for col in gdf_cl.columns if col.startswith("PDPM25_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,pdpm25_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of PDPM25


We observed that the Fisher-Jenks and Head/Tail Breaks classifiers produced the lowest ADCM values. However, in this case, they do not provide a clear visual spatial pattern for the PDPM25 variable. Therefore, we will use the Quantiles classifier instead.

In [None]:
# Convert the bins to a list for further processing
bins = q4.bins.tolist()
bins

In [None]:
# add 0.19 to bin for labeling
bins_label  = bins.copy()
bins_label.insert(0, 0.19)

# Extract intervals from bins
intervals = [f"{bins_label[i]:.2f}-{bins_label[i+1]:.2f}" for i in range(len(bins_label)-1)]
intervals

We will create a function to set a custom classification scheme based on the Quantiles bins.

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.35, 0.3),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels": intervals # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Population Density X $PM_{2.5}$ by departments - 2001-2022', fontsize=14, y=1)

# row 0
maptma("PDPM25_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("PDPM25_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("PDPM25_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("PDPM25_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("PDPM25_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("PDPM25_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("PDPM25_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("PDPM25_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("PDPM25_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("PDPM25_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("PDPM25_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("PDPM25_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("PDPM25_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("PDPM25_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("PDPM25_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("PDPM25_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("PDPM25_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("PDPM25_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("PDPM25_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("PDPM25_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("PDPM25_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("PDPM25_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot

#### NAGRT (Normalized Agricultural and Livestock Transition areas)

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "NAGRT_"
nagrt_columns = [col for col in gdf_cl.columns if col.startswith("NAGRT_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,nagrt_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of NAGRT


Two classifiers have the lowest ADCM (absolute deviation around class medians): FisherJenks and HeadTailBreaks. We'll select FisherJenks as the classifier to create the choropleth maps.
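
To make the selection criterion concrete, the following is a minimal sketch of how ADCM can be computed by hand. It assumes mapclassify's convention that each bin value is the closed upper bound of its class; the function name `adcm` and the toy values are illustrative and are not part of the notebook's data.

```python
import numpy as np

def adcm(values, upper_bounds):
    """Sum of absolute deviations of each value from the median of its class."""
    values = np.asarray(values, dtype=float).ravel()
    # Class index = position of the first upper bound that is >= the value
    labels = np.searchsorted(upper_bounds, values, side="left")
    total = 0.0
    for k in np.unique(labels):
        members = values[labels == k]
        total += np.abs(members - np.median(members)).sum()
    return total

values = [1, 2, 3, 10, 11, 12, 50]                # toy data with three natural groups
tight = adcm(values, upper_bounds=[3, 12, 50])    # breaks follow the natural gaps
loose = adcm(values, upper_bounds=[10, 11, 50])   # breaks cut across the groups
print(tight, loose)
```

A lower ADCM means the classes are internally more homogeneous, which is why the classifier with the smallest bar in the plot above is preferred.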

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

In [None]:
# insert 0 at 0 position
bins.insert(0, 0)
bins

In [None]:
# Extract intervals from bins
intervals = [f"{bins[i]:.2f}-{bins[i+1]:.2f}" for i in range(len(bins)-1)]
intervals

In [None]:
# add '0.00' at position 0 to the list of intervals
intervals.insert(0, '0.00')
intervals
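
The label construction in the previous cells can be wrapped in a small helper. This is only a sketch: `interval_labels` is a hypothetical name, and the bin values in the example are made up. The extra leading `'0.00'` label corresponds to the class that `UserDefined` creates for values at or below the first bin (here 0).

```python
def interval_labels(upper_bounds, lower=0.0, zero_class=True):
    """Build 'lo-hi' legend labels from class upper bounds without mutating the input."""
    edges = [lower] + list(upper_bounds)
    labels = [f"{lo:.2f}-{hi:.2f}" for lo, hi in zip(edges[:-1], edges[1:])]
    if zero_class:
        # One extra label for the class that holds values <= lower
        labels.insert(0, f"{lower:.2f}")
    return labels

labels = interval_labels([1.5, 4.0, 9.25])  # illustrative bins, not real Fisher-Jenks output
print(labels)
```

Because the helper copies the bounds into a new list, the original `bins` list stays unchanged and can be reused for the classification itself.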

We will create a function to set a custom classification scheme based on FisherJenks

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.25, 0.4),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels": intervals  # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Normalized Agricultural and Livestock Transition areas (km$^2$) per 1000 km$^2$ by departments - 2001-2022', fontsize=14, y=1)

# row 0
maptma("NAGRT_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("NAGRT_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("NAGRT_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("NAGRT_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("NAGRT_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("NAGRT_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("NAGRT_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("NAGRT_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("NAGRT_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("NAGRT_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("NAGRT_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("NAGRT_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("NAGRT_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("NAGRT_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("NAGRT_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("NAGRT_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("NAGRT_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("NAGRT_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("NAGRT_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("NAGRT_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("NAGRT_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("NAGRT_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot
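
The 22 explicit `maptma` calls follow a regular pattern, so they could also be generated in a loop. The sketch below assumes that columns are named `PREFIX_YEAR` and that the grid is filled row by row; `grid_calls` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def grid_calls(prefix, years, ncols=5):
    """Return (column_name, title, row, col) for each year, filling the grid row by row."""
    calls = []
    for i, year in enumerate(years):
        row, col = divmod(i, ncols)
        calls.append((f"{prefix}_{year}", str(year), row, col))
    return calls

calls = grid_calls("NAGRT", range(2001, 2023))
print(calls[0])
print(calls[-1])

# In the notebook, the calls could then be applied like this:
# for colname, title, row, col in calls:
#     maptma(colname, title, row, col)
# for ax in axes.ravel()[len(calls):]:
#     ax.axis('off')  # hide the unused cells of the 5x5 grid
```

The same helper would work for the other predictors by changing the prefix, which also reduces the risk of copy-paste errors in column names.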

#### NNWVT (Normalized Natural Wooded Vegetation Transitions areas)

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "NNWVT_"
nnwvt_columns = [col for col in gdf_cl.columns if col.startswith("NNWVT_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,nnwvt_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of NNWVT

Two classifiers have the lowest ADCM: FisherJenks and HeadTailBreaks. We'll select FisherJenks as the classifier to create the choropleth maps.

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

In [None]:
# insert 0 at 0 position
bins.insert(0, 0)
bins

In [None]:
# Extract intervals from bins
intervals = [f"{bins[i]:.2f}-{bins[i+1]:.2f}" for i in range(len(bins)-1)]
intervals

In [None]:
# add '0.00' at position 0 to the list of intervals
intervals.insert(0, '0.00')
intervals

We will create a function to set a custom classification scheme based on FisherJenks

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.25, 0.4),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels": intervals  # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Normalized Natural Wooded Vegetation Transitions areas (km$^2$) per 1000 km$^2$ by departments - 2001-2022', fontsize=14, y=1)

# row 0
maptma("NNWVT_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("NNWVT_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("NNWVT_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("NNWVT_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("NNWVT_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("NNWVT_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("NNWVT_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("NNWVT_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("NNWVT_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("NNWVT_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("NNWVT_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("NNWVT_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("NNWVT_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("NNWVT_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("NNWVT_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("NNWVT_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("NNWVT_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("NNWVT_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("NNWVT_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("NNWVT_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("NNWVT_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("NNWVT_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot

#### NBUT (Normalized Built-up Transitions areas)

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Get a list of column names that start with "NBUT_"
nbut_columns = [col for col in gdf_cl.columns if col.startswith("NBUT_")]

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,nbut_columns]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of NBUT

Two classifiers have the lowest ADCM: FisherJenks and HeadTailBreaks. We'll select FisherJenks as the classifier to create the choropleth maps.

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

In [None]:
# add 0.02 to bin for labeling
bins_label  = bins.copy()
bins_label.insert(0, 0.02)

# Extract intervals from bins
intervals = [f"{bins_label[i]:.2f}-{bins_label[i+1]:.2f}" for i in range(len(bins_label)-1)]
intervals

We will create a function to set a custom classification scheme based on FisherJenks

In [None]:
def maptma(colname, title, row, col):
    # Create a custom classification using UserDefined
    classification = mapclassify.UserDefined(gdf_cl[colname], bins)

    classification.plot(
          gdf,  # GeoDataFrame containing the data to be plotted
          legend=True,  # Enable the legend for the plot
          legend_kwds={
                "fmt": "{:.0f}",  # Format the legend labels as integers
                # "loc": "upper right",  # Position the legend in the upper right corner
                "bbox_to_anchor": (1.2, 0.4),  # Adjust the legend's position
                "fontsize": 8,  # Set the font size of the legend
                "labels": intervals  # Use the custom legend labels
          },
          axis_on=False,  # Disable the axis display
          border_color='black',  # Set the border color of the plot
          cmap="viridis_r",  # Use the reversed Viridis colormap
          ax=axes[row, col]  # Specify the subplot to draw the plot on
     )

    # Set the title for the current axis
    axes[row, col].set_title(title)

In [None]:
# Create a 5x5 grid of subplots with a figure size of 20x20
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
# Add a title to the entire figure with specific font size and position
plt.suptitle('Normalized Built-up Transitions areas by departments - 2001-2022', fontsize=14, y=1)

# row 0
maptma("NBUT_2001", "2001", 0, 0)  # Map data for 2001 in row 0, column 0
maptma("NBUT_2002", "2002", 0, 1)  # Map data for 2002 in row 0, column 1
maptma("NBUT_2003", "2003", 0, 2)  # Map data for 2003 in row 0, column 2
maptma("NBUT_2004", "2004", 0, 3)  # Map data for 2004 in row 0, column 3
maptma("NBUT_2005", "2005", 0, 4)  # Map data for 2005 in row 0, column 4

# row 1
maptma("NBUT_2006", "2006", 1, 0)  # Map data for 2006 in row 1, column 0
maptma("NBUT_2007", "2007", 1, 1)  # Map data for 2007 in row 1, column 1
maptma("NBUT_2008", "2008", 1, 2)  # Map data for 2008 in row 1, column 2
maptma("NBUT_2009", "2009", 1, 3)  # Map data for 2009 in row 1, column 3
maptma("NBUT_2010", "2010", 1, 4)  # Map data for 2010 in row 1, column 4

# row 2
maptma("NBUT_2011", "2011", 2, 0)  # Map data for 2011 in row 2, column 0
maptma("NBUT_2012", "2012", 2, 1)  # Map data for 2012 in row 2, column 1
maptma("NBUT_2013", "2013", 2, 2)  # Map data for 2013 in row 2, column 2
maptma("NBUT_2014", "2014", 2, 3)  # Map data for 2014 in row 2, column 3
maptma("NBUT_2015", "2015", 2, 4)  # Map data for 2015 in row 2, column 4

# row 3
maptma("NBUT_2016", "2016", 3, 0)  # Map data for 2016 in row 3, column 0
maptma("NBUT_2017", "2017", 3, 1)  # Map data for 2017 in row 3, column 1
maptma("NBUT_2018", "2018", 3, 2)  # Map data for 2018 in row 3, column 2
maptma("NBUT_2019", "2019", 3, 3)  # Map data for 2019 in row 3, column 3
maptma("NBUT_2020", "2020", 3, 4)  # Map data for 2020 in row 3, column 4

# row 4
maptma("NBUT_2021", "2021", 4, 0)  # Map data for 2021 in row 4, column 0
maptma("NBUT_2022", "2022", 4, 1)  # Map data for 2022 in row 4, column 1

axes[4,2].axis('off')  # Turn off axis for row 4, column 2
axes[4,3].axis('off')  # Turn off axis for row 4, column 3
axes[4,4].axis('off')  # Turn off axis for row 4, column 4

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show();  # Display the plot

#### ELEV (Elevation)

In this section, we will map the mean elevation for each department for the year 2001 only, as elevation values remain constant throughout the study period.

In [None]:
# create a copy of  the geodataframe
gdf_cl = gdf.copy()

In [None]:
# Select data to analyze
selected_data = gdf_cl.loc[:,"ELEV_2001"]

In [None]:
# Classify the data into 4 quantile groups
q4 = mapclassify.Quantiles(selected_data, k=4)
q4

In [None]:
# Equal Interval Classification
ei5 = mapclassify.EqualInterval(selected_data, k=5)
ei5

In [None]:
# Classify the data into groups based on the head/tail breaks algorithm
ht = mapclassify.HeadTailBreaks(selected_data)
ht

In [None]:
# MaximumBreaks classification method
mb5 = mapclassify.MaximumBreaks(selected_data, k=5)
mb5

In [None]:
# Apply the Standard Deviation and Mean classification method to the selected data.
msd = mapclassify.StdMean(selected_data)
msd

In [None]:
# Apply Fisher-Jenks classification with 5 classes
fj5 = mapclassify.FisherJenks(selected_data, k=5)
fj5

In [None]:
# Bunch classifier objects
class5 = q4, ei5, ht, mb5, msd, fj5
# Collect ADCM for each classifier
fits = np.array([c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms["classifier"] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ["ADCM", "Classifier"]
ax = sns.barplot(
    y="Classifier", x="ADCM", data=adcms, hue= adcms["Classifier"],  legend=False
)

##### Create choropleth maps of Elevation


Two classifiers have the lowest ADCM: FisherJenks and HeadTailBreaks. We'll select FisherJenks as the classifier to create the choropleth maps.

In [None]:
# Convert the bins to a list for further processing
bins = fj5.bins.tolist()
bins

In [None]:
# add 2.08 to bin for labeling
bins_label  = bins.copy()
bins_label.insert(0, 2.08)

# Extract intervals from bins
intervals = [f"{bins_label[i]:.2f}-{bins_label[i+1]:.2f}" for i in range(len(bins_label)-1)]
intervals

In [None]:
# Create a custom classification using UserDefined for actual values
classi_2001 = mapclassify.UserDefined(selected_data, bins)

In [None]:
# Create a single-plot figure
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
plt.suptitle('Mean Elevation (m) by departments - 2001', fontsize=20, y=0.95)

# Plot 2001 data
classi_2001.plot(
    gdf_cl,
    legend=False,  # Custom legend
    axis_on=False,
    border_color='black',
    cmap="viridis_r",
    ax=ax
)

# Custom bin labels and colors
bin_labels = intervals

n_bins = len(bin_labels)
cmap = mpl.colormaps.get_cmap("viridis_r").resampled(n_bins)
colors = [mpl.colors.to_hex(cmap(i)) for i in range(cmap.N)]

# Create legend patches for bins
bin_patches = [Patch(facecolor=color, edgecolor='black', label=label)
               for color, label in zip(colors, bin_labels)]

# Display legend
ax.legend(handles=bin_patches, loc='upper right', bbox_to_anchor=(1.1, 0.25), fontsize=10)

plt.tight_layout()
plt.show()

### Correlation of predictor variables with target variable

#### Scatterplots + Linear Fits for Each Predictor

In [None]:
predictors = ["PM25", "NBA", "PD", "PDPM25", "NAGRT", "NNWVT", "NBUT", "ELEV"]

plt.figure(figsize=(16, 12))

for i, var in enumerate(predictors, 1):
    plt.subplot(3, 3, i)
    sns.regplot(data=df_ts, x=var, y="CA", scatter_kws={'alpha':0.3}, line_kws={"color":"red"})
    plt.title(f"CA vs {var}")
    plt.xlabel(var)
    plt.ylabel("CA")

plt.tight_layout()
plt.show()


#### Pearson & Spearman Correlations (CA vs predictors)

In [None]:
corr_results = {}

for var in predictors:
    pearson = df_ts["CA"].corr(df_ts[var], method="pearson")
    spearman = df_ts["CA"].corr(df_ts[var], method="spearman")
    corr_results[var] = {"Pearson": pearson, "Spearman": spearman}

corr_df = pd.DataFrame(corr_results).T
corr_df
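
Pearson measures linear association, while Spearman measures monotonic association, so the two can diverge when a relationship is curved. The sketch below illustrates this on synthetic data (not the notebook's data). It computes Spearman as the Pearson correlation of the ranks, which is the definition of the coefficient and avoids extra dependencies.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 3.0)  # strictly increasing but strongly curved

pearson = x.corr(y, method="pearson")
# Spearman's rho is the Pearson correlation of the ranked values
spearman = x.rank().corr(y.rank(), method="pearson")

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A Spearman value clearly larger in magnitude than the Pearson value for a predictor would hint at a monotonic but nonlinear relationship with CA.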

#### Pretty heatmap of correlations

In [None]:
plt.figure(figsize=(6, 6))
sns.heatmap(corr_df, annot=True, cmap="coolwarm", center=0, fmt=".3f")
plt.title("Correlation of CA with Predictors")
plt.show()


## Interaction & Derived Features

In [None]:
# code here

### Check if PDPM25 is stronger than PM25

In [None]:
# copy dataframe
df_ts = df.copy()

In [None]:
# Correlation analysis
corr_pm25 = df_ts['CA'].corr(df_ts['PM25'])
corr_pdpm25 = df_ts['CA'].corr(df_ts['PDPM25'])

In [None]:
print("Correlation with CA (NAMR):")
print(f"  PM25:    {corr_pm25:.3f}")
print(f"  PDPM25:  {corr_pdpm25:.3f}")

In [None]:
# Simple regression comparison (R²)
import statsmodels.api as sm  # harmless if statsmodels was already imported earlier in the notebook

# PM25 model
X1 = sm.add_constant(df_ts['PM25'])
model_pm25 = sm.OLS(df_ts['CA'], X1, missing='drop').fit()

# PDPM25 model
X2 = sm.add_constant(df_ts['PDPM25'])
model_pdpm25 = sm.OLS(df_ts['CA'], X2, missing='drop').fit()

In [None]:
print("\nR-squared comparison:")
print(f"  PM25 model R²:    {model_pm25.rsquared:.3f}")
print(f"  PDPM25 model R²:  {model_pdpm25.rsquared:.3f}")

In [None]:
# Visualization
plt.figure(figsize=(10,4))

plt.subplot(1,2,1)
sns.regplot(x='PM25', y='CA', data=df_ts, scatter_kws={'alpha':0.4}, color='royalblue')
plt.title(f"PM25 vs CA (r={corr_pm25:.2f})")

plt.subplot(1,2,2)
sns.regplot(x='PDPM25', y='CA', data=df_ts, scatter_kws={'alpha':0.4}, color='darkorange')
plt.title(f"PDPM25 vs CA (r={corr_pdpm25:.2f})")

plt.tight_layout()
plt.show()

**Interpretation**

* Both variables have an extremely low correlation with the asthma mortality rate; |r| < 0.1 usually indicates no meaningful linear relationship.
* R² = 0.003, so PM₂.₅ explains about 0.3% of the variance in NAMR, and PDPM25 explains almost none (R² ≈ 0). Multiplying PM₂.₅ by population density (PDPM25) therefore did not increase predictive power; it may even have diluted the signal.
* The association between asthma mortality and PM₂.₅ might be nonlinear or confounded by other spatial and temporal factors (e.g., climate, health access, or socioeconomic structure). PDPM25 might also introduce multicollinearity with PD or PM₂.₅, weakening its independent explanatory power.
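
The R² values and the correlations reported above are two views of the same quantity: in a simple regression with one predictor and an intercept, R² equals the squared Pearson correlation. The following sketch checks this on synthetic data (the variables `x` and `y` are simulated, not taken from the notebook).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.05 * x + rng.normal(size=500)  # weak linear signal buried in noise

r = np.corrcoef(x, y)[0, 1]

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r_squared = 1.0 - residuals.var() / y.var()

print(f"r^2 = {r**2:.6f}, R^2 = {r_squared:.6f}")
```

This is consistent with the numbers above: a correlation of about −0.054 squared gives roughly 0.003.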





### Lag effects

In [None]:
# copy df as df_ts
df_ts = df.copy()

In [None]:
# Define predictor variables and lags to test
predictors = ['PM25', 'NBA', 'PD', 'PDPM25', 'NAGRT', 'NNWVT', 'NBUT']
lags = [1, 2]  # lag 1 and lag 2 years

In [None]:
# Create lagged versions per department
df_lagged = df_ts.copy()

for var in predictors:
    for lag in lags:
        df_lagged[f'{var}_lag{lag}'] = df_lagged.groupby('IDDPTO')[var].shift(lag)
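
Note that `groupby(...).shift(lag)` shifts by row position, not by calendar year, so the lags are only correct if the rows of each department are ordered chronologically. The sketch below shows the idea on a toy panel; the `YEAR` column name is an assumption, since the notebook's year column is not shown in this section.

```python
import pandas as pd

panel = pd.DataFrame({
    "IDDPTO": ["A", "A", "A", "B", "B", "B"],
    "YEAR":   [2003, 2001, 2002, 2001, 2002, 2003],  # deliberately unsorted for A
    "PM25":   [13.0, 11.0, 12.0, 21.0, 22.0, 23.0],
})

# Sort within each department before shifting so that lag 1 means "previous year"
panel = panel.sort_values(["IDDPTO", "YEAR"]).reset_index(drop=True)
panel["PM25_lag1"] = panel.groupby("IDDPTO")["PM25"].shift(1)
print(panel)
```

The first year of each department has no previous observation, so its lagged value is missing; `Series.corr` ignores such pairs when computing the correlations.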

In [None]:
# Compute correlations of CA with current and lagged predictors ---
corrs = {}

for var in predictors:
    corrs[var] = {
        'r_current': df_lagged['CA'].corr(df_lagged[var]),
        'r_lag1': df_lagged['CA'].corr(df_lagged[f'{var}_lag1']),
        'r_lag2': df_lagged['CA'].corr(df_lagged[f'{var}_lag2'])
    }

corr_df = pd.DataFrame(corrs).T.round(3)
print("📈 Correlation of CA with predictors and their lags:")
print(corr_df)

In [None]:
# Visualize lag correlations as a heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(corr_df, annot=True, cmap="coolwarm", center=0)
plt.title("Lagged Correlation of Predictors with Asthma Mortality (CA)")
plt.ylabel("Predictor")
plt.xlabel("Lag type")
plt.tight_layout()
plt.show()

**Interpretation of result**

| Predictor                             |  r_current |   r_lag1   |   r_lag2   | 🧭 Interpretation |
| ------------------------------------- | :--------: | :--------: | :--------: | ----------------- |
| **PM₂.₅**                             |   −0.054   |   −0.058   |   −0.046   | Weak, slightly stronger at lag-1 → possible short delayed effect but negligible magnitude. |
| **NBA (Burned area)**                 |   +0.011   |   −0.002   |   −0.009   | No clear linear relationship; effects may be local or nonlinear. |
| **PD (Population density)**           |   −0.004   |   −0.004   |   −0.003   | Essentially no correlation; consistent across time. |
| **PD×PM₂.₅**                          |    0.000   |   −0.001   |   +0.001   | Confirms PDPM25 adds no value beyond PM₂.₅. |
| **NAGRT (Agro-livestock transition)** | **+0.112** | **+0.109** | **+0.102** | Only variable with modest positive correlation; stable through time. Suggests link between agricultural land conversion and asthma mortality. |
| **NNWVT (Natural wooded transition)** |   −0.013   |   −0.018   |   −0.019   | Weak negative trend; may indicate protective vegetation cover. |
| **NBUT (Built-up transitions)**       |   +0.043   |   +0.043   |   +0.039   | Very weak positive relation, possibly urbanization signal. |


## Multicollinearity & Dimension Reduction

### Correlation Matrix

In [None]:
panel_gdf.columns

In [None]:
predictors = ['PM25', 'NBA', 'PD', 'PDPM25', 'NAGRT', 'NNWVT','NBUT', 'ELEV']  # example
X = panel_gdf[predictors].copy()

In [None]:
#Correlation Matrix

corr_matrix = X.corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title("Correlation Matrix of Predictors")
plt.show()
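
Besides reading the heatmap, the strongly correlated predictor pairs can be listed programmatically. This is a minimal sketch: `strong_pairs` is a hypothetical helper, the 0.8 threshold is a common rule of thumb rather than a value used in this notebook, and the demo data are simulated.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.8):
    """List (var_i, var_j, r) for the upper triangle where |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

rng = np.random.default_rng(1)
demo = pd.DataFrame({"PD": rng.lognormal(size=300), "NBA": rng.normal(size=300)})
demo["PDx2"] = 2.0 * demo["PD"] + rng.normal(scale=0.01, size=300)  # near copy of PD

pairs = strong_pairs(demo.corr())
print(pairs)
```

In the notebook, the same function could be applied to `corr_matrix` to flag candidates for removal before modeling.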

###  VIF Calculation

In [None]:
# code here
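
As a possible starting point for this cell, the sketch below computes the variance inflation factor, VIF_j = 1 / (1 − R_j²), using the identity that the VIFs are the diagonal of the inverse correlation matrix of the predictors. The helper name `vif_table` and the simulated data are assumptions for illustration; `statsmodels.stats.outliers_influence.variance_inflation_factor` is an alternative if an explicit constant column is added to the design matrix.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    X = X.dropna()
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns, name="VIF")

rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "a": rng.normal(size=1000),
    "b": rng.normal(size=1000),
    "d": rng.normal(size=1000),  # independent of everything else
})
demo["c"] = demo["a"] + demo["b"] + rng.normal(scale=0.1, size=1000)  # nearly a + b

vifs = vif_table(demo)
print(vifs.round(2))
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity. In the notebook, the function could be called as `vif_table(X)` on the predictor table defined above.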

### Apply PCA or clustering

In [None]:
# code here
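
One way this cell could be filled in is a PCA on the standardized predictors. The sketch below uses a singular value decomposition with NumPy only; the function name `pca_explained_variance` and the simulated matrix are illustrative assumptions, and real data would first need rows with missing values removed.

```python
import numpy as np

def pca_explained_variance(X):
    """Standardize columns, then return explained-variance ratios and component loadings."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of vt are the principal axes; singular values come sorted in descending order
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return explained, vt

rng = np.random.default_rng(7)
base = rng.normal(size=(500, 1))
X_demo = np.hstack([
    base,
    base + rng.normal(scale=0.1, size=(500, 1)),  # nearly a duplicate of the first column
    rng.normal(size=(500, 1)),                    # independent column
])

ratios, components = pca_explained_variance(X_demo)
print(ratios.round(3))
```

Standardizing first matters here because the predictors are on very different scales (for example, elevation in meters versus normalized transition areas), and unscaled PCA would be dominated by the variable with the largest variance.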

**Pending**

In 03.Asthma_Mortality_EDA.ipynb

🔥 Heatmaps: CA across departments vs years (matrix form).

📊 Space-time cube idea: Department vs year vs CA (3D visualization optional)