# Can we sample millions of points from a 30 m xarray dataset?

## Challenges
- Super high res Zarr + millions of points!

### Geoparquet sample strategy - default, but scaling is hard
Issues:
- Scaling is hard! geopandas holds everything in memory, and we have millions of points.
- Lots of conversions: duckdb -> geopandas -> shapely -> xarray -> pandas, etc.
- Maybe we can batch writes by xarray chunk or some other atomic unit.
- If so, do we write partitioned geoparquet keyed by some index?
- Then combine afterwards by reading the hive-partitioned data?
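
The "batch by xarray chunk" idea above needs a way to bucket points by the chunk they fall in. A minimal sketch, assuming a regular north-up grid whose top-left outer corner (`lon0`, `lat0`) and pixel size (`res`) are known; those three names and both helper functions are hypothetical, and only the (3000, 5000) chunk shape comes from the dataset below:

```python
import numpy as np

def chunk_ids(lon, lat, lon0, lat0, res, chunk_shape=(3000, 5000)):
    """Return (row_chunk, col_chunk) for each point on a regular north-up grid.

    Assumes lon increases with column, lat decreases with row, and
    (lon0, lat0) is the outer corner of the top-left pixel.
    """
    col = np.floor((np.asarray(lon) - lon0) / res).astype(np.int64)
    row = np.floor((lat0 - np.asarray(lat)) / res).astype(np.int64)
    return row // chunk_shape[0], col // chunk_shape[1]

def group_by_chunk(lon, lat, **grid):
    """Map each (row_chunk, col_chunk) key to the indices of the points inside it."""
    row_chunk, col_chunk = chunk_ids(lon, lat, **grid)
    keys = np.stack([row_chunk, col_chunk], axis=1)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # shape of `inverse` differs across numpy versions
    return {
        tuple(int(v) for v in key): np.flatnonzero(inverse == i)
        for i, key in enumerate(uniq)
    }

# Toy usage with a small chunk shape so the three points land in three chunks
groups = group_by_chunk(
    [-124.995, -124.845, -124.995],
    [49.995, 49.995, 49.745],
    lon0=-125.0, lat0=50.0, res=0.01, chunk_shape=(10, 10),
)
```

Each group could then be sampled against a single chunk and written as its own partition (for example one directory per chunk key), which is one way the hive-partitioned read-back might be organised.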


### Keep in duckdb?
- Convert the zarr grid to a duckdb table of geometries and risks:

  | risk | geom      |
  |------|-----------|
  | 3    | rectangle |

- If we get the zarr grid into duckdb, can we sample the points there? ST_Within?
- Can we open with xvec, get geoms, export?

### XVEC
- Can we convert the Zarr data to an xvec vector data cube?
- Can we convert the geoparquet building points into a vector data cube?
- Then sample or run zonal stats, and export back to geoparquet?

In [2]:
import numpy as np
import xarray as xr
import dask.array as da
from datetime import datetime
import zarr 
import coiled
import icechunk
import icechunk.xarray
import geopandas as gpd 
import duckdb 
from ocr import datasets


In [12]:

from ocr.utils import apply_s3_creds, install_load_extensions

install_load_extensions()
apply_s3_creds()


InvalidInputException: Invalid Input Error: Temporary secret with name '__default_s3' already exists!

In [94]:
query = duckdb.sql("""SELECT ST_AsText(building_centroid) as centroid from read_parquet('s3://carbonplan-ocr/intermediate/fire-risk/vector/CONUS_overture_buildings_with_centroid_2025-03-19.1.parquet') LIMIT 100000;""")

In [95]:
from shapely import wkt
df = query.df()
geometry_series = df['centroid'].apply(
    lambda g: wkt.loads(g) if g is not None else None
)

# gdf = gpd.GeoDataFrame(df, geometry=geometry_series, crs="EPSG:4326")

In [96]:
x_list = [point.x for point in geometry_series]
y_list = [point.y for point in geometry_series]

In [97]:
x = xr.DataArray(x_list, dims=['location'])
y = xr.DataArray(y_list, dims=['location'])

In [91]:
storage = icechunk.s3_storage(bucket="carbonplan-ocr", prefix="intermediate/fire-risk/tensor/30m_CONUS_synthetic_risk_4326_icechunk", from_env=True)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session('main')
rtds = xr.open_zarr(session.store, consolidated=False)
rtds

| | Array | Chunk |
|---|---|---|
| Bytes | 19.22 GiB | 14.31 MiB |
| Shape | (94444, 218518) | (3000, 5000) |
| Dask graph | 1408 chunks in 2 graph layers | |
| Data type | int8 numpy.ndarray | |


In [98]:
nearest_pixels = rtds.risk.sel(
    lon=xr.DataArray(x_list, dims='points'),
    lat=xr.DataArray(y_list, dims='points'),
    method='nearest',
)
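
For millions of points, the same vectorized `sel` call could be run in batches so that only one batch of results is computed at a time. A minimal sketch, assuming the array has `lat`/`lon` coordinates as above; the helper name and `batch_size` default are hypothetical:

```python
import numpy as np
import xarray as xr

def sample_nearest_in_batches(da, lons, lats, batch_size=100_000):
    """Nearest-pixel lookup for many points, one batch at a time."""
    lons = np.asarray(lons)
    lats = np.asarray(lats)
    out = []
    for start in range(0, lons.size, batch_size):
        stop = start + batch_size
        picked = da.sel(
            lon=xr.DataArray(lons[start:stop], dims="points"),
            lat=xr.DataArray(lats[start:stop], dims="points"),
            method="nearest",
        )
        out.append(picked.values)  # triggers compute for dask-backed arrays
    if not out:
        return np.empty(0, dtype=da.dtype)
    return np.concatenate(out)

# Toy usage on a 3x4 grid
toy = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [10, 20, 30], "lon": [0, 10, 20, 30]},
)
values = sample_nearest_in_batches(toy, [1, 29, 11], [11, 29, 19], batch_size=2)
```

Sorting the points so that each batch touches few chunks would likely reduce repeated chunk reads, though that is untested here.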


In [None]:
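# Materialise the sampled pixels as a table.
# Note: lat/lon here are the matched pixel coordinates, not the original centroids.
df = nearest_pixels.to_dataset().to_dataframe().reset_index()[['lat', 'lon', 'risk']]
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.lon, df.lat), crs="EPSG:4326"
)[['risk', 'geometry']]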



In [101]:
gdf.to_parquet(
    's3://carbonplan-ocr/intermediate/fire-risk/vector/risk_sampled_10K_points.parquet',
    compression='zstd',
    write_covering_bbox=True,
    schema_version='1.1.0',
)


In [102]:
rtgdf = gpd.read_parquet('s3://carbonplan-ocr/intermediate/fire-risk/vector/risk_sampled_10K_points.parquet')

In [105]:
result  = duckdb.sql("""SELECT * FROM read_parquet('s3://carbonplan-ocr/intermediate/fire-risk/vector/risk_sampled_10K_points.parquet')""")
result

┌──────┬────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ risk │                    geometry                    │                                                        bbox                                                        │
│ int8 │                    geometry                    │                             struct(xmin double, ymin double, xmax double, ymax double)                             │
├──────┼────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│    7 │ POINT (-115.91011810302734 30.5081787109375)   │ {'xmin': -115.91011810302734, 'ymin': 30.5081787109375, 'xmax': -115.91011810302734, 'ymax': 30.5081787109375}     │
│    1 │ POINT (-115.90984344482422 30.5081787109375)   │ {'xmin': -115.90984344482422, 'ymin': 30.5081787109375, 'xmax': -11

In [None]:
ds = xr.tutorial.open_dataset('air_temperature')

In [None]:
ds

In [None]:
import geopandas as gpd
from shapely import Point

# Point(x, y) is (lon, lat); air_temperature spans lon 200-330 and lat 15-75
points = [
    Point(205, 45),
    Point(320, 23),
]
gdf = gpd.GeoDataFrame({'id': [1, 2]}, geometry=points, crs=4326)

In [None]:
# The geometries are already points, so read x/y directly instead of taking centroids
x_coords, y_coords = gdf.geometry.x.to_numpy(), gdf.geometry.y.to_numpy()

nearest_pixels = ds.air.sel(
    lon=xr.DataArray(x_coords, dims='points'),
    lat=xr.DataArray(y_coords, dims='points'),
    method='nearest',
)


In [None]:
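# DataArray has no to_parquet, and a 0-d selection cannot become a DataFrame,
# so keep the points dimension and go through pandas
nearest_pixels.isel(time=0).to_dataframe().reset_index().to_parquet('tmp.parquet')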