In [None]:
# python libraries
import os
import json
import requests
import urllib.parse
from pathlib import Path
import subprocess
import tempfile
import shutil
import pprint as pp
import time
import re
from zipfile import ZipFile
import random
from typing import Optional, List, Dict, Tuple, Any

# Geospatial & Data Handling
import pandas as pd
import geopandas as gpd
import geodatasets # For geospatial datasets
import duckdb
import h3
import pyarrow.parquet as pq
import pyarrow as pa
import xarray as xr # For ND-Array section
import pystac_client # For STAC section
from shapely.geometry import Point, Polygon, MultiPolygon
from shapely import wkt

# Visualization
import matplotlib.pyplot as plt
import pydeck as pdk
import folium
import lonboard

# Presentation/Notebook Specific
# from IPython.display import display, Markdown, Latex
from IPython.display import display
from IPython.display import clear_output
from tqdm import tqdm
import ipywidgets as widgets
from jupyter_bbox_widget import BBoxWidget
from ipywidgets import Layout, interact

# data
import datahugger
import sciencebasepy
from seedir import seedir

# Import refactored utility functions
from utils.fetch_and_preprocess import (
    fetch_dataset_files, 
    filter_gdf_duplicates, 
    process_vector_geoms, 
    geom_db_consolidate_dataset,
    ddb_filter_duplicates
)
from utils.visualizations import (
    format_dataset_info,
    create_map_buttons,
    create_folium_cluster_map,
    create_folium_choropleth,
    create_folium_heatmap,
    create_pydeck_scatterplot,
    create_pydeck_polygons,
    create_pydeck_heatmap
)

from utils.st_context_processing import (
    add_h3_index_to_pv_labels,
    ddb_alter_table_add_h3,
    ddb_save_div_matches,
    ddb_save_subtype_geoms,
    get_duckdb_connection,
    group_pv_by_h3_cells,
    spatial_join_stac_items_with_h3,
    create_h3_stac_fetch_plan,
    fetch_overture_maps_theme,
    spatial_join_pv_overture_duckdb
)

from dotenv import load_dotenv
load_dotenv()

print("Libraries imported.")

# Leveraging Hierarchical Spatial Clustering and DGGS for Planetary-scale surveys of Photovoltaic Solar Panel Arrays

*CCOM6050: Analysis and Design of Algorithms*  
**Alejandro Vega Nogales**  
*Data Scientist @ Maxar Puerto Rico*   
*CCOM MS Student*

## Outline

1.  **Introduction & Background (Earth Observation)**
    * Earth Observation (EO) 
    * Remote Sensing (RS)
    * Geospatial Data
    * Thesis Topic 
2.  **Data & Methodology**
    * Open, Published PV Solar Panel Location Datasets
    * Geospatial Data Handling 
        - Overture
        - H3
    * Cloud-Native Geospatial Stack
        - GeoParquet 
        - DuckDB
        - STAC Collections
        - Xarray & ND-Arrays
        - Virtualization & Virtual Datasets
3.  **Algorithms Topic: Hierarchical Spatial Clustering**
    * Relevant Algorithms & Literature
        - Minimum Spanning Trees (1)
        - DGGS (2)
        - Unbounded Parallelism (3)
    * Application with H3 for PV Cluster Analysis
    * Minimizing STAC queries for multi-sensor and multi-temporal data
4.  **Preliminary Findings & Next Steps**
    * Preliminary dataset description
    * Testing clustering with preliminary dataset and H3 (results in report)
    * Testing STAC query performance improvements
    * Scaling to the Cloud (Thesis)
5.  **Conclusion**

## Earth Observation (EO): Fundamentals and Background


### What is Earth Observation (EO)?

- Gathering information about Earth's physical, chemical, and biological systems via remote sensing technologies.
- Sensors on satellites, aircraft, drones, etc.
- Key characteristics: Spatial, Temporal, and Spectral Resolution.

<figure style="text-align: right">
<img src="report/assets/figures/schmitt_et_al_fig1_geospatial_data.png" style="width:auto; height:40%;">
<figcaption align = "center"> Illustration of different RS sources, imagery types, and imaging details </figcaption>
</figure>

### Sensor Modalities

- **Optical:** Captures visible and near-infrared light (e.g., satellite imagery).
  - Panchromatic (grayscale), True Color (RGB), Multispectral (4-15 bands), Hyperspectral (100+ bands).
- **Radar (SAR):** Active sensor, penetrates clouds, measures surface properties and elevation.
- **Thermal:** Detects heat emitted/reflected.

### EO Data Complexities

- **Spatial Resolution (GSD):** Size of ground area covered by one pixel.
- **Temporal Resolution:** Time between observations of the same location.
- **Spectral Resolution:** Number and width of electromagnetic bands captured.
- **Challenges:** Clouds, atmospheric distortion, data volume, coordinate systems.
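
To make the GSD bullet concrete, the sketch below estimates how many pixels a PV installation covers at a few ground sample distances. The footprint areas and GSD values are illustrative assumptions, not figures taken from the datasets used in this notebook.

```python
# Illustrative only: the footprint areas and GSD values below are assumptions.

def pixels_covered(area_m2: float, gsd_m: float) -> float:
    """Approximate number of pixels covering a footprint of `area_m2`
    when each pixel spans gsd_m x gsd_m metres on the ground."""
    if gsd_m <= 0:
        raise ValueError("gsd_m must be positive")
    return area_m2 / (gsd_m ** 2)

footprints_m2 = {"rooftop array": 40.0, "solar farm": 250_000.0}
gsds_m = {"0.5 m": 0.5, "10 m": 10.0, "30 m": 30.0}

for name, area in footprints_m2.items():
    for label, gsd in gsds_m.items():
        print(f"{name:>14} @ {label:<6}: {pixels_covered(area, gsd):>12,.1f} px")
```

Under these assumptions a 40 m² rooftop array covers less than one pixel at 10 m GSD, which is one reason small installations typically need higher-resolution imagery than utility-scale solar farms.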

### Geospatial Data Types

- **Raster:**
  - Grid-based data (pixels).
  - Represents continuous phenomena (e.g., elevation, temperature, imagery).
  - Cell values store attribute information.
- **Vector:**
  - Coordinate-based data.
  - Represents discrete features (e.g., points, lines, polygons).
  - Examples: Roads (lines), buildings (polygons), PV panels (points/polygons).
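
As a rough illustration of the raster/vector distinction, the toy sketch below stores the same kind of information both ways using only built-in Python types. The grid values, cell size, and polygon coordinates are made up for the example; real workflows would use libraries such as Xarray and GeoPandas.

```python
# Toy stand-ins: a raster is a grid of cell values, a vector feature is a list of coordinates.

# Raster: 3 x 4 grid where each cell stores an attribute (here a hypothetical "PV present" mask).
raster = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
cell_size_m = 10.0  # assumed pixel size in metres
raster_area_m2 = sum(map(sum, raster)) * cell_size_m ** 2

# Vector: a polygon stored as an ordered ring of (x, y) vertices, in metres.
polygon = [(0.0, 0.0), (30.0, 0.0), (30.0, 20.0), (0.0, 20.0)]

def shoelace_area(ring):
    """Planar polygon area via the shoelace formula (ring does not need to be closed)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

print(raster_area_m2)          # area implied by counting flagged cells
print(shoelace_area(polygon))  # area computed from the polygon's coordinates
```

The raster answer depends on the cell size, while the vector answer depends only on the vertex coordinates.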

### Proposed Thesis Topic: 

#### Leveraging _planetary-scale_ Datasets of PV Solar Panel locations to train Computer Vision models that enable Nation-scale PV Spatio-Temporal Surveys

## Data & Methodology

### Open, Published PV Solar Panel Locations

- Aggregated multiple open, published datasets of PV locations worldwide.
- Sources include Zenodo, Figshare, GitHub, ScienceBase.
- Variety of formats (CSV, *GeoJSON* [ideal], Shapefile, etc.).
- Goal: Create a consolidated, deduplicated dataset for analysis.

Here we list the dataset titles of publications alongside their first author, DOI links, and their number of labels:
- **"Distributed solar photovoltaic array location and extent dataset for remote sensing object identification"** - K. Bradbury, 2016 | [paper DOI](https://doi.org/10.1038/sdata.2016.106) | [dataset DOI](https://doi.org/10.6084/m9.figshare.3385780.v4) | polygon annotations for 19,433 PV modules in 4 cities in California, USA
- "A solar panel dataset of very high resolution satellite imagery to support the Sustainable Development Goals" - C. Clark et al, 2023 | [paper DOI](https://doi.org/10.1038/s41597-023-02539-8) | [dataset DOI](https://doi.org/10.6084/m9.figshare.22081091.v3) | 2,542 object labels (per spatial resolution)
- **"A harmonised, high-coverage, open dataset of solar photovoltaic installations in the UK" - D. Stowell et al, 2020** | [paper DOI](https://doi.org/10.1038/s41597-020-00739-0) | [dataset DOI](https://zenodo.org/records/4059881) | 265,418 data points (over 255,000 are stand-alone installations, 1,067 solar farms, and rest are subcomponents within solar farms)
- "Georectified polygon database of ground-mounted large-scale solar photovoltaic sites in the United States" - K. S. Fujita et al, 2023 | [paper DOI](https://doi.org/10.1038/s41597-023-02644-8) | [dataset DOI](https://www.sciencebase.gov/catalog/item/6671c479d34e84915adb7536) | 4,186 data points
- "Vectorized solar photovoltaic installation dataset across China in 2015 and 2020" - J. Liu et al, 2024 | [paper DOI](https://doi.org/10.1038/s41597-024-04356-z) | [dataset link](https://github.com/qingfengxitu/ChinaPV) | 3,356 PV labels (inspect quality!)
- "Multi-resolution dataset for photovoltaic panel segmentation from satellite and aerial imagery" - H. Jiang, 2021 | [paper DOI](https://doi.org/10.5194/essd-13-5389-2021) | [dataset DOI](https://doi.org/10.5281/zenodo.5171712) | 3,716 samples of PV data points
- "A crowdsourced dataset of aerial images with annotated solar photovoltaic arrays and installation metadata" - G. Kasmi, 2023 | [paper DOI](https://doi.org/10.1038/s41597-023-01951-4) | [dataset DOI](https://doi.org/10.5281/zenodo.6865878) | > 28K points of PV installations; 13K+ segmentation masks for PV arrays; metadata for 8K+ installations
- **"An Artificial Intelligence Dataset for Solar Energy Locations in India"** - A. Ortiz, 2022 | [paper DOI](https://doi.org/10.1038/s41597-022-01499-9) | [dataset link 1](https://researchlabwuopendata.blob.core.windows.net/solar-farms/solar_farms_india_2021.geojson) or [dataset link 2](https://raw.githubusercontent.com/microsoft/solar-farms-mapping/refs/heads/main/data/solar_farms_india_2021_merged_simplified.geojson) | 117 geo-referenced points of solar installations across India
- "GloSoFarID: Global multispectral dataset for Solar Farm IDentification in satellite imagery" - Z. Yang, 2024 | [paper DOI](https://doi.org/10.48550/arXiv.2404.05180) | [dataset link](https://github.com/yzyly1992/GloSoFarID/tree/main/data_coordinates) | 6,793 PV samples across 3 years (double counting of samples)
- **"A global inventory of photovoltaic solar energy generating units" - L. Kruitwagen et al, 2021** | [paper DOI](https://doi.org/10.1038/s41586-021-03957-7) | [dataset DOI](https://doi.org/10.5281/zenodo.5005867) | 50,426 for training, cross-validation, and testing; 68,661 predicted polygon labels 
- **"Harmonised global datasets of wind and solar farm locations and power" - S. Dunnett et al, 2020** | [paper DOI](https://doi.org/10.1038/s41597-020-0469-8) | [dataset DOI](https://doi.org/10.6084/m9.figshare.11310269.v6) | 35,272 PV installations

In [None]:
# load environment variables
load_dotenv()
DATASET_DIR = Path(os.getenv('DATA_PATH'))
# read dataset metadata from json file
with open('dataset_metadata.json', 'r') as f:
    dataset_metadata = json.load(f)

dataset_choices = [
    'global_harmonized_large_solar_farms_2020',
    'global_pv_inventory_sent2_2024',
    'global_pv_inventory_sent2_spot_2021',
    'fra_west_eur_pv_installations_2023',
    'ind_pv_solar_farms_2022',
    'usa_cali_usgs_pv_2016',
    'chn_med_res_pv_2024',
    'usa_eia_large_scale_pv_2023',
    'uk_crowdsourced_pv_2020',
    'deu_maxar_vhr_2023'   
]

In [None]:
# Initialize a list to store selected datasets
# mostly gen by github copilot with Claude 3.7 model
selected_datasets = dataset_choices.copy()

# Create an accordion to display selected datasets with centered layout
dataset_accordion = widgets.Accordion(
    children=[widgets.HTML(format_dataset_info(ds)) for ds in selected_datasets],
    layout=Layout(width='50%', margin='0 auto')
)
for i, ds in enumerate(selected_datasets):
    dataset_accordion.set_title(i, ds)

# Define a function to add or remove datasets
def manage_datasets(action, dataset=None):
    global selected_datasets, dataset_accordion
    
    if action == 'add' and dataset and dataset not in selected_datasets:
        selected_datasets.append(dataset)
    elif action == 'remove' and dataset and dataset in selected_datasets:
        selected_datasets.remove(dataset)
    
    # Update the accordion with current selections
    dataset_accordion.children = [widgets.HTML(format_dataset_info(ds)) for ds in selected_datasets]
    for i, ds in enumerate(selected_datasets):
        dataset_accordion.set_title(i, ds)
    
    print(f"Currently selected datasets: {len(selected_datasets)}")

# Create dropdown for available datasets
dataset_dropdown = widgets.Dropdown(
    options=list(dataset_metadata.keys()),
    description='Dataset:',
    disabled=False,
    layout=Layout(width='70%', margin='20px auto')
)

# Create buttons for actions
add_button = widgets.Button(description="Add Dataset", button_style='success')
remove_button = widgets.Button(description="Remove Dataset", button_style='danger')

# Define button click handlers
def on_add_clicked(b):
    manage_datasets('add', dataset_dropdown.value)

def on_remove_clicked(b):
    manage_datasets('remove', dataset_dropdown.value)

# Link buttons to handlers
add_button.on_click(on_add_clicked)
remove_button.on_click(on_remove_clicked)

### Dataset Selection Interface
#### Use the dropdown and buttons below to customize which solar panel datasets will be fetched and processed.
- Select a dataset from the dropdown:
    - Click "Add Dataset" to include it in processing
    - Click "Remove Dataset" to exclude it
- View the metadata table for each selected dataset by clicking on its row in the list

In [None]:
# Display the widgets
display(widgets.HBox([dataset_dropdown, add_button, remove_button]))
display(dataset_accordion)

## Data Fusion: Geospatial Data Context and Handling

- Focus on tools optimized for scalable cloud environments.
- **Goal:** Process and analyze large geospatial datasets efficiently, leveraging cloud storage and compute.

### Overture Maps: Adding Geospatial Context

<!-- From their [Division theme guide](https://docs.overturemaps.org/guides/divisions/) and their [brief blog on the history of the project](https://overturemaps.org/blog/2025/overture-maps-foundation-making-open-data-the-winning-choice/): -->

Overture Maps is a collaborative project that aims to create high-quality, open map datasets for the entire world:
- The project is a collaboration between several organizations, including Meta, Amazon Web Services (AWS), and Microsoft.
- Overture distributes its open datasets as GeoParquet files, which can be accessed through a CLI or API, or downloaded directly from [their S3 buckets](https://docs.overturemaps.org/guides/divisions/#data-access-and-retrieval).

The Overture divisions theme:
- Has three feature types (division, **division_area**, and division_boundary) and contains more than 5.45 million point, line, and polygon representations of human settlements, such as countries, regions, states, cities, and even neighborhoods.
- Is derived from a conflation of OpenStreetMap data and geoBoundaries data.
- Is **used here as contextual layers** (e.g. dividing our data by continent, country, etc.) to enrich PV data.
- Provides, through its `division_area` feature type, a **hierarchical structure of administrative boundaries**, including countries, states, and cities.

<figure style="text-align: center">
<img src="https://docs.overturemaps.org/assets/images/divisions-admin0-admin1-coverage-ff1a8d4c6d68c88047b34d1f9c9109be.png" style="width:65%; height:auto;">
<figcaption align = "center"> Overture divisions data, styled by subtype: countries in purple, region boundaries as green lines. </figcaption>
</figure>

## Cloud-Native Geospatial Stack

- Focus on tools optimized for scalable cloud environments.
- **Goal:** Process and analyze large geospatial datasets efficiently, with tools that scale to leverage cloud storage and compute.

### GeoParquet: Cloud-Optimized Vector Data
<div style="max-width: 80%; margin: 0 auto; padding-left: 1em; padding-right: 1em; text-align: justify;">
<h4 style="text-align: left">GeoParquet: Intro</h4>

<p>GeoParquet is <a href="https://geoparquet.org/">an incubating Open Geospatial Consortium (OGC) standard</a> that adds compatible geospatial <a href="https://docs.safe.com/fme/html/FME-Form-Documentation/FME-ReadersWriters/geoparquet/Geometry-Support.htm">geometry types</a> (Point, Line, Polygon, etc.) to the mature and widely adopted <a href="https://parquet.apache.org/">Apache Parquet format</a>, a columnar storage file format commonly used in big data processing, modern data engineering pipelines, and analytics. This is analogous to how the GeoTIFF raster format adds geospatial metadata to the longstanding TIFF standard. GeoParquet is designed to be a simple and efficient way to store geospatial <em>vector</em> data in a columnar format, and to remain compatible with existing Parquet tools and libraries to enable Cloud <em>Data Warehouse</em> Interoperability.</p>
</div>

<figure style="text-align: center">
<img src="https://miro.medium.com/v2/resize:fit:1400/1*QEQJjtnDb3JQ2xqhzARZZw.png" style="width:70%; height:auto;">
<figcaption align = "center"> Visualization of the layout of a Parquet file </figcaption>
</figure>

<div style="max-width: 80%; margin: 0 auto; padding-left: 1em; padding-right: 1em; text-align: justify;">
<h4 style="text-align: left">GeoParquet: Internal Layout</h4>

<p>Parquet files are organized into a set of file chunks called "row groups". Row groups are logical groups of columns with the same number of rows. Each of these columns is stored as a "column chunk", a contiguous block of data for that column. The schema must be consistent across row groups, i.e. the data types and number of columns must be the same for every row group. The GeoParquet standard adds relevant metadata such as the geometry's Coordinate Reference System (CRS), additional metadata for geometry columns, and <a href="https://medium.com/radiant-earth-insights/geoparquet-1-1-coming-soon-9b72c900fbf2">support for spatial indexing in v1.1</a>.</p>
</div>

<figure style="text-align: center">
<img src="https://guide.cloudnativegeo.org/images/geoparquet_layout.png" style="width:40%; height:auto;">
<figcaption align = "center"> GeoParquet has the same layout with additional metadata </figcaption>
</figure>

<!-- GeoParquet is only the latest in a long line of cloud-native file formats  -->

<div style="max-width: 77%; margin: 0 auto; padding-left: 1em; padding-right: 1em; text-align: justify;">
<h4 style="text-align: left">GeoParquet: Features & Performance</h4>


- Efficient storage and compression: 
    - Internally compressed by default, and can be configured to optimize decompression (time) or storage size (space)
    - columnar format is more efficient for filtering on columns which is common in analytical workloads and results in better compression ratios vs row-based formats
- Scalability and Efficient data access:
    - Spatial indexing, spatial partitioning, and other optimizations enable:
        - spatial joins and containment operations like intersection, within, overlaps, etc. (ST_*)
        - [spatial predicate pushdowns](https://medium.com/radiant-earth-insights/geoparquet-1-1-coming-soon-9b72c900fbf2)
            - can significantly speed up spatial queries over the network by **applying filters at the storage level**
            - greatly reduce data movement if applied correctly
- Optimized for *read-heavy workflows*: 
    - Parquet files are immutable once written, which favors cheap reads and efficient filtering and grouping
    - Popular choice for storing large datasets using *modern cloud-centric DBMS architectures* like data lakes and data warehouses.
    - Designed for analytical workloads that require fast reads and complex queries (but not transactions and frequent updates)
        - ideal for OLAP (Online Analytical Processing) and BI (Business Intelligence) workloads
        - these revolve around historical and aggregated data that don't require high-frequency updates
- Cloud-native format: Optimized for object storage (s3, gcs, abfs, etc.)
    - **designed to be highly compressed**, which reduces storage and data transfer costs and improves RW performance
    - integrates into existing ecosystem of cloud data pipelines and workflows that have been built around the parquet format
    - Broad and fast adoption across the data engineering and geospatial ecosystems

</div>
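
The predicate pushdown idea above can be sketched with a toy model in plain Python. This is not the Parquet or GeoParquet reader API; it only mimics how a reader can use per-row-group bounding-box statistics (such as those derived from a GeoParquet 1.1 bbox covering column) to skip row groups without fetching their data. The `RowGroup` class, the coordinates, and the region labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class RowGroup:
    bbox: BBox                          # summary statistics stored as file metadata
    points: List[Tuple[float, float]]   # the actual column data (expensive to fetch remotely)

def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned bounding boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def query(row_groups: List[RowGroup], query_bbox: BBox):
    """Return the points inside query_bbox and the number of row groups actually read."""
    hits, groups_read = [], 0
    for rg in row_groups:
        if not intersects(rg.bbox, query_bbox):
            continue  # pruned using metadata only: no data bytes are read
        groups_read += 1
        hits += [p for p in rg.points
                 if query_bbox[0] <= p[0] <= query_bbox[2]
                 and query_bbox[1] <= p[1] <= query_bbox[3]]
    return hits, groups_read

# Hypothetical, spatially partitioned row groups (lon, lat).
row_groups = [
    RowGroup((-125.0, 32.0, -114.0, 42.0), [(-122.4, 37.8), (-118.2, 34.1)]),  # California-like
    RowGroup((-8.0, 50.0, 2.0, 59.0), [(-0.1, 51.5), (-3.2, 55.9)]),           # UK-like
    RowGroup((68.0, 8.0, 97.0, 35.0), [(77.2, 28.6), (72.9, 19.1)]),           # India-like
]

hits, groups_read = query(row_groups, (-5.0, 50.0, 1.0, 53.0))
print(hits, groups_read)
```

In this sketch only one of the three row groups is read for the query window. The benefit depends on the data being spatially sorted or partitioned so that row-group bounding boxes are small and rarely overlap.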

- **Benefits of H3 for Spatial Analysis:**
  - *Fast Joins & Aggregation:* Quickly combine data across datasets based on cell ID.
  - *Efficient Neighborhood Queries:* Hexagons have uniform adjacency and H3 provides a built-in Grid Traversal API with distance metrics.
  - *Hierarchical Structure:* Easy aggregation/disaggregation across resolutions (parent/child cells).
  - *Optimized Grid Traversal:* Useful for spatial algorithms.
  - **Foundation for spatial indexing and clustering in this work.**

### DuckDB: In-Process SQL OLAP RDBMS

From their ["Why DuckDB?" page](https://duckdb.org/why_duckdb.html):

DuckDB is an **in-process analytical data management system (OLAP RDBMS)**. Unlike traditional client-server databases (like PostgreSQL or MySQL), DuckDB runs directly within the host process (e.g., our Python script or Jupyter kernel), similar to SQLite. However, unlike SQLite which is optimized for transactional workloads (OLTP), DuckDB is specifically designed for **analytical queries (OLAP)** involving complex, long-running queries over potentially huge datasets, typical in big data analytics and scientific computing workflows.

Key benefits for our workflow include:
-   **Simplicity & Portability:** Easy installation (`pip install duckdb`) and no external dependencies or database server management required. Databases are stored as single, portable files (`.duckdb`), making them easy to manage, share, and archive.
-   **Direct Data Access:** Can directly query various file formats, including the **Parquet and GeoParquet files** we are generating, as well as (geo)pandas DataFrames(!), without needing a separate, time-consuming ingestion/copy step. This is highly efficient for consolidating data from multiple files and remote sources (e.g., S3, GCS).
-   **Powerful SQL:** Offers a rich, modern SQL dialect, including window functions, complex joins, and support for common table expressions (CTEs), allowing sophisticated data manipulation and analysis directly in SQL.
-   **Geospatial Capabilities:** Crucially, DuckDB has a **`spatial` extension** that provides functions for handling and querying geospatial data types (like points, lines, and polygons) using libraries like GEOS. This enables operations such as spatial joins (e.g., `ST_Intersects`, `ST_Contains`), area calculations (`ST_Area`), centroid computation (`ST_Centroid`), and reading/writing WKT/WKB formats directly within the database. This is essential for our tasks like deduplication and integrating PV labels with contextual layers like Overture Maps.
-   **Performance:** Its **column-vectorized query execution engine** is optimized for analytical performance, and is often *significantly faster than row-based systems* and *pure Python/Pandas operations* on large datasets, including those that may not fit into memory.
-   **Python Integration:** Seamlessly integrates with Python libraries like Pandas and GeoPandas through its client API and tools like `jupysql`, allowing easy data exchange between dataframes and the database directly from our notebooks! 

We use DuckDB to:
1.  Efficiently consolidate multiple GeoParquet files (one per source dataset) into a single database table using its ability to natively read Parquet (including directly from s3/httpfs!).
2.  Leverage its `spatial` extension for geospatial indexing, filtering, and performing spatial joins with the [Overture Maps divisions](#Overture-Maps-Divisions) data based on [H3 indices](#H3-Geospatial-Indexing-System-and-Spatial-Clustering).
3.  Provide a persistent, queryable, and portable database (`.duckdb` file) containing the cleaned, consolidated, and spatially enriched PV label data.


In [None]:
%%time
# os.getenv returns a string when the variable is set, so cast to int for a consistent type
ZSTD_COMPRESSION = int(os.getenv("GPQ_ZSTD_COMPRESSION", 5))
TABLE_NAME = os.getenv("PV_DB_TABLE", "global_consolidated_pv")

dataset_choices = [
    'global_harmonized_large_solar_farms_2020', 'global_pv_inventory_sent2_spot_2021', 'ind_pv_solar_farms_2022', 'usa_cali_usgs_pv_2016', 'uk_crowdsourced_pv_2020'
]

# get list of geoparquet files to be consolidated
def get_full_gpq_path(f):
    return DATASET_DIR / 'raw' / 'labels' / 'geoparquet' / f

parquet_files = [
    get_full_gpq_path(f)
    for f in os.listdir(DATASET_DIR / 'raw' / 'labels' / 'geoparquet')
    if any(os.path.splitext(f)[0].startswith(ds) for ds in selected_datasets)
]
flist = '\n-'.join([os.path.relpath(f) for f in parquet_files])
print(f"Consolidating these {len(parquet_files)} files:\n-{flist}")

DB_DIR = Path(os.getenv("DUCKDB_DIR", DATASET_DIR / 'db'))
out_consolidated_parquet = DATASET_DIR / 'prepared' / 'labels' / 'geoparquet' / 'global_consolidated_pv.geoparquet'
out_consolidated_db = DB_DIR / 'global_consolidated_pv.duckdb'
# create the output directories if they don't exist
print(f"Creating output directories: {out_consolidated_parquet.parent}")
os.makedirs(out_consolidated_parquet.parent, exist_ok=True)

# consolidate the dataset into a single duckdb database that will also be saved as a geoparquet file
# NOTE: keep_geoms below currently includes POINT and MULTIPOINT; remove them from the list to exclude
# point labels until we have implemented a heuristic to extract a PV polygon label from the points
# TODO: look at usability of SAM2 models that perform segmentation from single point input 
db_file = geom_db_consolidate_dataset(
    parquet_files=parquet_files,
    table_name=TABLE_NAME,
    geom_column="geometry",
    keep_geoms=["POLYGON", "MULTIPOLYGON", "POINT", "MULTIPOINT"],
    spatial_index=True,
    out_parquet=out_consolidated_parquet,
    printout=True
)

In [None]:
%load_ext sql
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
conn = get_duckdb_connection(db_file)
%sql conn --alias duckdb 

In [None]:
# display db tables
%sql SHOW TABLES;

In [None]:
%sql DESCRIBE {{TABLE_NAME}};

In [None]:
%sql SUMMARIZE SELECT unified_id, area_m2, centroid_lon, centroid_lat, dataset, bbox FROM {{TABLE_NAME}}

In [None]:
# load the consolidated geoparquet file into a GeoDataFrame for visualization
%time ds_gdf = gpd.read_parquet(out_consolidated_parquet)
print(f"Loaded {len(ds_gdf)} geometries from {out_consolidated_parquet}")
# display some stats about our raw combined gdf
print(f"Combined {len(parquet_files)} datasets into one gdf with {len(ds_gdf)} rows and {len(ds_gdf.columns)} columns.")
print(f"Combined gdf has the following columns:\n{list(ds_gdf.columns)}")
display(ds_gdf.describe())
display(ds_gdf.sample(3))

In [None]:
# source gdf for visualization directly from db in cases where ds_gdf is deleted further below
fetch_query = f"""
SELECT unified_id, dataset, area_m2, centroid_lon, centroid_lat, bbox, ST_AsText(geometry) AS geometry
FROM {TABLE_NAME}; """
%time viz_gdf = conn.sql(fetch_query).df()

# convert the geometry column from WKT to shapely geometries
viz_gdf = gpd.GeoDataFrame(viz_gdf, geometry=viz_gdf['geometry'].apply(wkt.loads), crs="EPSG:4326")
# only keep the rows that have area_m2 > 0
viz_gdf = viz_gdf[viz_gdf['area_m2'] > 0]
# # sample 50K rows for visualization
# viz_gdf = viz_gdf.sample(50000, random_state=42)

In [None]:
%sql DESCRIBE country_geoms

In [None]:
%sql SUMMARIZE SELECT country_iso, division_name, country_pv_count, country_pv_area_m2 FROM country_geoms

In [None]:
# get country geoms in gdf 
from mpl_toolkits.axes_grid1 import make_axes_locatable
countries_query = f"""
SELECT division_id, country_iso, division_name, country_pv_count, country_pv_area_m2, ST_AsText(geometry) AS geometry
FROM country_geoms;"""
country_gdf = conn.sql(countries_query).df()
# convert the geometry column from WKT to shapely geometries
country_gdf = gpd.GeoDataFrame(country_gdf, geometry=country_gdf['geometry'].apply(wkt.loads), crs="EPSG:4326")

# plot choropleth map of the country geoms colored by the number of PV installations
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
divider = make_axes_locatable(ax)
cax = divider.append_axes("left", size="5%", pad=0.1)

country_gdf.plot(
    column='country_pv_count', ax=ax,
    cmap='viridis', linewidth=0.8, edgecolor='0.8',
    legend=True, cax=cax, vmin=0, vmax=country_gdf['country_pv_count'].max(),
    legend_kwds={'label': "Number of PV Installations",
                 'orientation': "vertical", 'shrink': 0.5, 'aspect': 30,
                 'ticks': [0, 1000, 2000, 3000, 4000, 5000]})
ax.set_title('Number of PV Installations by Country', fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.set_axis_off()
plt.show()

In [None]:
# prepare a static scatterplot map of PV installations with matplotlib

# first plot geodatasets natural earth basemap
natural_earth = geodatasets.get_path('naturalearth.land')
natural_earth_gdf = gpd.read_file(natural_earth)

fig, ax = plt.subplots(1, 1, figsize=(20, 10))
natural_earth_gdf.plot(ax=ax, color='lightgray', edgecolor='black')
# plot the country geometries 
# country_gdf.plot(ax=ax, color='none', edgecolor='black', linewidth=0.5)
# plot the PV installations
viz_gdf.plot(ax=ax, color='red', markersize=5, alpha=0.8)
plt.title(f'PV Installation Scatterplot (n={len(viz_gdf)})', fontsize=20)
plt.axis('off')

### Querying and Searching STAC Collections

- **S**patio**T**emporal **A**sset **C**atalog (STAC).
- Standardized specification for describing geospatial information.
- Enables searching and discovery of EO data (imagery, etc.) across different catalog providers (e.g. Microsoft Planetary Computer, AWS Open Data, Google Earth Engine, etc.)
- STAC collections are a standardized way to describe datasets, including metadata, spatial and temporal extents, and links to assets.
- Key concepts: 
    - **STAC Item**: A GeoJSON Feature representing a single spatiotemporal observation (e.g. one satellite scene), including metadata and links to its assets (e.g., images, metadata files).
    - **STAC Collection**: A group of related STAC Items that share metadata such as spatial and temporal extents, license, and providers.
    - **STAC Catalog**: A container that links to STAC Items, Collections, and other Catalogs, organized hierarchically.
    - **STAC API**: RESTful API for searching and retrieving STAC items and collections.
    - **STAC Browser**: Web-based interface for exploring and visualizing STAC collections.
- Libraries like `pystac-client` facilitate programmatic searching based on spatial (bbox, geometry) and temporal criteria.
    - STAC API supports CQL2 (Common Query Language) through its Filter extension for complex queries over catalog fields (e.g. `datetime` in range, `eo:cloud_cover` < 20%, etc.)
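
As a minimal sketch of what such a search looks like on the wire, the helper below assembles the JSON body for a STAC API item search (`POST /search`) with a bounding box, a time range, and a cloud-cover filter in CQL2 JSON. The helper name `build_stac_search`, the collection id `sentinel-2-l2a`, and the bounding box are assumptions chosen for illustration. No request is sent, and whether a given catalog supports the Filter extension has to be checked per provider.

```python
import json

def build_stac_search(bbox, start, end, collections, max_cloud_cover=None, limit=100):
    """Assemble a STAC API item-search request body (hypothetical helper).

    bbox: (xmin, ymin, xmax, ymax) in WGS84 degrees.
    start, end: RFC 3339 timestamps, combined into a closed interval.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have four values: xmin, ymin, xmax, ymax")
    body = {
        "collections": list(collections),
        "bbox": list(bbox),
        "datetime": f"{start}/{end}",
        "limit": limit,
    }
    if max_cloud_cover is not None:
        # CQL2 JSON filter; only honoured by APIs that implement the Filter extension.
        body["filter-lang"] = "cql2-json"
        body["filter"] = {
            "op": "<",
            "args": [{"property": "eo:cloud_cover"}, max_cloud_cover],
        }
    return body

search_body = build_stac_search(
    bbox=(-67.3, 17.9, -65.2, 18.6),          # assumed area of interest
    start="2023-01-01T00:00:00Z",
    end="2023-12-31T23:59:59Z",
    collections=["sentinel-2-l2a"],           # assumed collection id
    max_cloud_cover=20,
)
print(json.dumps(search_body, indent=2))
```

`pystac-client` exposes equivalent search parameters (`collections`, `bbox`, `datetime`, `filter`) on `Client.search`, so in practice a body like this rarely needs to be written by hand.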

In [None]:
import leafmap
m = leafmap.Map(center=[36.844461, 37.386475], zoom=8)
url = 'https://github.com/opengeos/maxar-open-data/raw/master/datasets/Kahramanmaras-turkey-earthquake-23.geojson'
m.add_geojson(url, layer_name="Footprints")
m

In [None]:
from IPython.display import Image
Image('report/assets/figures/maxar_stac_demo_tile_footprints.gif', width=800, height=400)

In [None]:
from IPython.display import IFrame
IFrame("https://radiantearth.github.io/stac-browser/#/external/maxar-opendata.s3.amazonaws.com/events/catalog.json?.language=en", width=1080, height=600)

### Xarray and ND-arrays in Scientific Computing

- Xarray introduces labels (dimensions, coordinates, attributes) to multi-dimensional arrays (like NumPy's ndarray).
- **Benefits for Geospatial/EO Data:**
  - Handles complex data like satellite image time series (e.g., dimensions: time, band, y, x).
  - Facilitates operations like alignment, indexing, and aggregation based on labels (e.g., time series analysis, **operations over multispectral bands**).
  - Integrates well with Dask for **parallel computing on large datasets**
- Common in fields such as climate science, oceanography, and remote sensing, which require analysis of multi-dimensional data.
- **Xarray + Dask:** Enables parallel processing of large datasets, leveraging Dask's task scheduling and lazy evaluation.
- **Xarray + GeoParquet:** Enables efficient reading/writing of geospatial data in Parquet format, leveraging Xarray's capabilities for handling multi-dimensional data.
- **Xarray + STAC:** Enables easy access to EO data stored in STAC collections, allowing for efficient querying and analysis of large datasets.

### The Critical Role of Virtualization in Cloud Advances

- **Foundation of Cloud Computing:** Allows abstraction of physical hardware (networks, storage, compute, you name it!) into virtual, ephemeral resources.
- **Resource Pooling & Elasticity:** Enables efficient sharing and dynamic allocation/scaling of compute, storage, and network resources centralized in data centers distributed across the globe.
- **Separation of Concerns:** Decouples applications from **underlying infrastructure**, allowing developers to focus on building applications without worrying about hardware management and reproducibility across environments.
- **Cost Efficiency:** Pay-as-you-go model reduces upfront capital expenses and allows for on-demand scaling.
    - Cloud providers offer flexible pricing models (on-demand, reserved, spot instances).
    - Be aware of potential vendor lock-in and hidden costs that can impact long-term economics.
- **Computational Scale:** Access to dirt-cheap storage and massive computing resources on demand without infrastructure management overhead.
    - **Making Big Data Cheap:** Easily scale resources up or down, enabling rapid deployment and cost-effective large analytical workloads.
- **Enabler for:**
  - Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS).
  - Modern data architectures (Data Lakes, Lakehouses)
  - **Serverless computing**
    - particularly relevant for data processing and analysis
    - easier to scale up and run only for as long as the analysis needs
        - e.g. no need to keep a high-availability cluster with replicas and failover running 24/7

### Rise of Virtual Datasets in EO 

(Kerchunk, VirtualiZarr, Icechunk)

- **Concept:** Datasets defined by *references* to data assets stored elsewhere (often cloud object storage), rather than containing the data itself.
    - Pointers or references, but for terabytes of scientific data.
    - These formats create lightweight indexes that map to specific byte ranges in cloud-stored files.
- **Motivation:** Facilitate rapid development of derivative data products from huge raw archives.
- **Benefits:**
    - Avoids data duplication and large data transfers.
    - Enables analysis of data *in place*.
    - Facilitates sharing and collaboration without transferring large datasets.
- **Impact:** Enables efficient analysis of massive planetary-scale archives (e.g., climate models, satellite imagery) directly from cloud storage.
- **Examples:**
  - **Kerchunk:** Creates reference files that map logical chunks (e.g., in Xarray) to byte ranges in cloud storage, allowing libraries like Xarray to read cloud data **as if it were a single local file**.
      - completely serverless architecture: asynchronous concurrent fetching, parallel access to multiple files, and lazy loading of data
      - supports reading from all of the storage backends supported by fsspec (s3, gcs, abfs, etc), http, cloud user storage (dropbox, **gdrive**) and network protocols (ftp, ssh, hdfs, smb…)
      - default JSON schema can be slow to load and heavy on memory → **supports exporting references as [parquet files](https://fsspec.github.io/kerchunk/spec.html#parquet-references)** for efficient storage and retrieval
  - **VirtualiZarr:** Similar concepts for creating virtual Zarr datasets via Kerchunk references.
  - **Icechunk:** An emerging open-source transactional storage engine for Zarr data in cloud object storage, inspired by table formats such as Apache Iceberg; it adds versioned, chunked access so large datasets can be read without downloading them entirely.
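
The reference idea above can be sketched in a few lines of plain Python. This is a simplified illustration, not the Kerchunk API: the chunk keys, the `archive.bin` file, and the helpers `build_reference_index` and `read_chunk` are all hypothetical, and a local file stands in for cloud object storage (the `seek`/`read` pair plays the role of an HTTP range request). The index layout loosely follows the `[url, offset, length]` triples used by Kerchunk's JSON references.

```python
import json
import os
import tempfile

# Hypothetical "archive": three chunks stored back to back in one binary file.
chunks = {"temp/0": b"AAAA", "temp/1": b"BBBBBB", "temp/2": b"CC"}


def build_reference_index(path, chunks):
    """Write the chunks to `path` and return {key: [path, offset, length]}."""
    refs = {}
    offset = 0
    with open(path, "wb") as f:
        for key, payload in chunks.items():
            f.write(payload)
            refs[key] = [path, offset, len(payload)]
            offset += len(payload)
    return refs


def read_chunk(refs, key):
    """Read only the byte range a reference points to (stand-in for a range request)."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "archive.bin")
    refs = build_reference_index(data_path, chunks)

    # The "virtual dataset" is just this small JSON document; the data never moves.
    index_json = json.dumps({"version": 1, "refs": refs})
    loaded = json.loads(index_json)["refs"]

    middle_chunk = read_chunk(loaded, "temp/1")
```

The index is a few hundred bytes regardless of how large the chunks are, which is why such references are cheap to share and version.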

## Hierarchical Spatial Clustering

- **Hierarchical Clustering:** Groups data points into a hierarchy of clusters.
- **Spatial Clustering:** Groups data points based on their spatial proximity.
- **Hierarchical Spatial Clustering:** Combines both concepts, creating a hierarchy of clusters based on spatial relationships.
- **Benefits:**
  - Captures multi-scale spatial patterns.
  - Provides a hierarchical structure for data exploration and analysis.
  - Useful for large datasets with varying spatial resolutions.

### "Clustering with minimum spanning trees: How good can it be?"

- **What:** Given a weighted, undirected graph with nodes V and edges E, a minimum spanning tree (MST) is the subset of edges that *connects all vertices with minimum total edge weight* and contains no cycles.
    * Key property: n vertices ⟹ n−1 edges.
- **Why?**
    * Natural cluster representation: removing the k−1 "longest" MST edges yields k clusters (connected components).
    * Detects clusters of arbitrary shapes (see fig), unlike k-means and other methods that require knowing the number of clusters a priori.
    * Foundation for various clustering algorithms (single linkage, divisive, agglomerative).
- **Answer:** 
    * Authors: 
        - "As far as the current benchmark battery is concerned, the MST-based methods outperform the popular “parametric” approaches (Gaussian Mixtures, K-means) and other algorithms (Birch, Ward, Average, Complete linkage, and spectral clustering with proper parameters) implemented in the scikit-learn package"
        - "[MST Clustering methods] are quite simple and easy to compute: once the minimum spanning tree is considered (which takes up to O(n²) time, but approximate methods exist as well) we can potentially **get a whole hierarchy of clusters of any cardinality**"
            - "For instance, our top performer...needs O(n√n) to *generate all possible partitions given a prebuilt MST*"
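
The "remove the k−1 longest MST edges" idea can be sketched as follows. This is a minimal illustration, not the paper's top-performing method: it uses a plain O(n²) Prim's algorithm on Euclidean distances and a small union-find to label components. The function names and the sample points are made up for the example.

```python
import math


def prim_mst(points):
    """Return the MST edges as (weight, i, j) tuples using O(n^2) Prim's algorithm."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and return one component label per point."""
    edges = sorted(prim_mst(points))
    kept = edges[: len(edges) - (k - 1)]
    root = list(range(len(points)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path compression
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


# Three visually obvious groups: a triangle, a pair, and an isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
labels = mst_clusters(points, k=3)
```

Because the MST is built once, calling `mst_clusters` with different `k` only changes how many edges are cut, which is the "whole hierarchy of clusters" property the authors describe.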

<!-- display MST cluster examples -->
<figure style="text-align: center">
<img src="report/assets/figures/mst_arbitrary_clusters.png" style="width:40%; height:40%;">
<figcaption align = "center"> MST to Cluster examples </figcaption>
</figure>

### Discrete Global Grid Systems and Geospatial Indexing

<!-- From their [home page](https://h3geo.org/), [announcement blog](https://www.uber.com/blog/h3/), and [overview page](https://h3geo.org/docs/core-library/overview/): -->
- **What:** A consistent global framework for hierarchical *tessellation* of Earth's surface into cells.
- **H3 DGGS:**
    * The H3 geospatial indexing system is a **discrete global grid system** developed at Uber (2018)
    * It was designed for indexing geographies via multi-resolution tiling into a **hexagonal grid with hierarchical indexes**.
        - Advantages of Hexagons: Uniform neighbor distances, good for spatial calculations
    * Geospatial coordinates can be indexed to *cell IDs* at different resolutions (0-15); each ID represents a *unique cell* in the grid at that resolution.
    * Natural clustering of PoIs/RoIs within H3's hierarchy
        - The hexagonal grid system is designed to be **hierarchical**: each cell at a given resolution is subdivided into 7 smaller cells at the next finer resolution (approximately, since hexagons cannot be tiled exactly by smaller hexagons), allowing for efficient spatial queries and analysis.
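
To get a feel for how quickly the hierarchy refines, the sketch below counts cells per resolution without calling the H3 library. It relies on the cell-count formula from the H3 documentation, 2 + 120 · 7^r, which accounts for the 12 pentagons that have 6 children instead of 7. The helper name `h3_cell_count` is made up for this example.

```python
def h3_cell_count(resolution: int) -> int:
    """Total number of H3 cells at a resolution, using 2 + 120 * 7**r."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    return 2 + 120 * 7 ** resolution


# Cells per resolution, and the growth factor between consecutive levels.
counts = {res: h3_cell_count(res) for res in range(16)}
growth = {res: counts[res] / counts[res - 1] for res in range(1, 16)}

for res in (0, 1, 6, 7, 15):
    print(f"resolution {res:2d}: {counts[res]:,} cells")
```

The growth factor is slightly below 7 at coarse resolutions because of the pentagons and approaches 7 as the resolution increases.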

   
<!-- It is common to use WGS84/EPSG:4326 CRS data with the H3 library. -->

<figure style="text-align: center">
<img src="https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=100,onerror=redirect,format=auto/wp-content/uploads/2018/06/Twitter-H3.png" style="width:75%; height:50%;">
<figcaption align = "center"> H3 enables users to partition the globe into hexagons for more accurate analysis. </figcaption>
</figure>

In [None]:
from IPython.display import IFrame
# display h3 viewer 
IFrame("https://h3.chotard.com", width=1080, height=540)

## Application in "Planning for Earth Imaging Tasks via Grid Significance Mapping"

- Proposes using H3 to uniformly map Points of Interest (PoIs) and Regions of Interest (RoIs) for **future** EO satellite task planning.
- H3 levels (e.g., 6 and 7) are chosen so grid cell sizes are relevant to **satellite strip widths** for better planning.
- Introduces a method to calculate the "significance" (importance) of each grid cell based on the PoIs it contains.
- These and other authors note that clustering data into grid cells lends itself naturally to parallelization.
- **Let's flip the time dimension!**
    * Instead of planning **new** image acquisitions, this grid significance scheme can be used to **optimize queries to STAC** archives for **existing imagery**.
    * **Objective:** *Maximize coverage* of your clustered dataset while *minimizing the number of queries* and downloaded STAC assets.
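
One way to make this objective concrete is to treat it as a set-cover problem and solve it greedily. This is an assumption of this sketch, not the method from the paper: each candidate STAC item is reduced to the set of grid cells its footprint touches, and the plan repeatedly picks the item that covers the most still-uncovered cells. The item IDs, cell IDs, and the `greedy_query_plan` helper are hypothetical.

```python
def greedy_query_plan(footprints, targets, max_items=None):
    """Greedy set cover: choose items until all target cells are covered.

    footprints: {item_id: set of cell ids the item covers}
    targets:    set of cell ids that contain labels
    Returns (ordered list of chosen item ids, set of cells left uncovered).
    """
    uncovered = set(targets)
    plan = []
    while uncovered and (max_items is None or len(plan) < max_items):
        # Item with the largest gain; ties resolve to the first item in dict order.
        best_id = max(footprints, key=lambda item: len(footprints[item] & uncovered))
        gain = footprints[best_id] & uncovered
        if not gain:
            break  # remaining cells are not covered by any candidate item
        plan.append(best_id)
        uncovered -= gain
    return plan, uncovered


# Hypothetical footprints expressed as sets of grid-cell ids.
footprints = {
    "scene_a": {"c1", "c2", "c3"},
    "scene_b": {"c3", "c4"},
    "scene_c": {"c4", "c5", "c6"},
    "scene_d": {"c1"},
}
targets = {"c1", "c2", "c3", "c4", "c5", "c6"}

plan, uncovered = greedy_query_plan(footprints, targets)
```

Replacing `len(...)` with a sum of per-cell significance scores would turn this into a weighted variant that favours the most important cells first.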

<figure style="text-align: center">
<img src="report/assets/figures/h3_dggs_EO_tasks.png" style="width:70%; height:60%;">
<figcaption align = "center"> H3 DGGS for EO tasks </figcaption>
</figure>

### Research References
- "Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering"
- "Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering" (resolves and parallelizes sequential bottleneck in algorithm above)
- "PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU" (same as above but for GPUs)

## Application with H3 for PV Cluster Analysis

- **Goal:** Identify optimal spatial clusters of PV solar panel labels.
- **Approach:**
  - Use Uber's H3 DGGS.
  - Index PV label locations (centroids or polygons) into H3 cells at an appropriate resolution.
  - Treat H3 cells containing PV labels as the nodes/vertices in the clustering algorithms.
- **Leveraging H3 Features:**
  - **Proximity:** Efficiently find neighboring cells (`k_ring`, renamed `grid_disk` in h3-py v4).
  - **Hierarchy:** Quickly compute parent/child cells for potential multi-resolution clustering or aggregation.
  - **Traversal:** Efficient grid traversal algorithms can be adapted.
- **Hypothesis:** H3 provides a performant spatial indexing foundation for implementing and scaling these hierarchical clustering algorithms, especially for distributed/parallel computation (relevant to thesis).
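
As a rough sketch of "cells as nodes", the example below groups occupied hexagonal cells into clusters by adjacency. To stay dependency-free it does not use H3: axial `(q, r)` coordinates on a flat hex grid stand in for H3 cell IDs, and the hypothetical `neighbors` helper stands in for a ring-1 neighbor lookup. Connected components over adjacent occupied cells correspond to single-linkage clustering with a one-cell distance threshold.

```python
from collections import deque

# The six axial-coordinate steps to a hexagon's neighbors (flat grid stand-in for H3).
HEX_STEPS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def neighbors(cell):
    """Return the six cells adjacent to `cell` (hypothetical stand-in for a ring-1 lookup)."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in HEX_STEPS]


def cluster_occupied_cells(occupied):
    """Group occupied cells into connected components with breadth-first search."""
    occupied = set(occupied)
    seen = set()
    clusters = []
    for start in sorted(occupied):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            cell = queue.popleft()
            component.append(cell)
            for nb in neighbors(cell):
                if nb in occupied and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(component))
    return clusters


# Hypothetical cells that contain at least one PV label.
pv_cells = [(0, 0), (1, 0), (1, -1), (5, 5), (5, 6), (9, -3)]
clusters = cluster_occupied_cells(pv_cells)
```

With real H3 indexes, the same traversal would operate on cell ID strings, and coarser clusters could be obtained by first mapping each cell to its parent at a lower resolution.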

## Application to PV Array Analysis: Next steps & Future work

# [Live Demo]: 

## Plead with the professor to let us submit the written report later this week

# Thank you! Questions?