# Geospatial File Format Performance Evaluation

This notebook contains: 
- An Overview of the GeoParquet standard
- Benchmark code for evaluating the GeoParquet file format using the following datasets:
    - The [Google-Microsoft combined Open Buildings](https://beta.source.coop/vida/google-microsoft-open-buildings/) dataset for 2D building footprint data
    - The [Overture buildings dataset](https://medium.com/mapular/overture-maps-a-fusion-of-open-and-commercial-data-for-a-new-era-in-mapping-f26b4b56ad9a) for 2.5D building data
        - Explore the impact of Overture's 
- A discussion of the current state of cloud-optimized geospatial file formats generally and the potential of GeoParquet specifically

# Introduction

GeoParquet is [an incubating Open Geospatial Consortium (OGC) standard](https://geoparquet.org/) that adds interoperable geospatial [geometry types](https://docs.safe.com/fme/html/FME-Form-Documentation/FME-ReadersWriters/geoparquet/Geometry-Support.htm) (MultiPoint, LineString, Polygon, etc.) to Apache Parquet, a mature and widely adopted columnar storage format commonly used in big data processing. Because a GeoParquet file is still a Parquet file, it works with the wide ecosystem of existing Parquet tools and libraries. This is analogous to how the GeoTIFF raster format adds geospatial metadata to the TIFF standard. GeoParquet is designed to be a simple and efficient way to store geospatial *vector* data in a columnar format while staying compatible with existing Parquet tooling, which enables cloud _data warehouse_ interoperability.

A Parquet file is made up of a set of file chunks called "row groups". A row group is a logical grouping of columns that all have the same number of rows. Each of these columns is stored as a "column chunk", a contiguous block of data for that column. The schema must be consistent across row groups, i.e. the data types and number of columns must be the same for every row group. The GeoParquet standard adds geospatial metadata on top of this, such as each geometry's Coordinate Reference System (CRS) and other metadata for geometry columns, and future releases are expected to add support for spatial indexing. [Spatial indexing](https://towardsdatascience.com/geospatial-data-engineering-spatial-indexing-18200ef9160b) is a technique that indexes or partitions data by the location of its geometries, so that spatial queries (e.g. intersects, within, within a given distance) can skip data that cannot match.
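
To make the idea of spatial indexing concrete, here is a minimal sketch of a uniform grid index over points, using only the Python standard library. It is an illustration of the general technique, not how GeoParquet or any particular library implements indexing; the function names and the `cell_size` parameter are hypothetical.

```python
from collections import defaultdict


def cell_of(x, y, cell_size):
    """Return the integer grid cell that contains the point (x, y)."""
    return (int(x // cell_size), int(y // cell_size))


def build_grid_index(points, cell_size):
    """Map each grid cell to the ids of the points that fall inside it."""
    index = defaultdict(list)
    for point_id, (x, y) in enumerate(points):
        index[cell_of(x, y, cell_size)].append(point_id)
    return index


def query_bbox(index, points, cell_size, min_x, min_y, max_x, max_y):
    """Return ids of points inside the bounding box, scanning only overlapping cells."""
    col_min, row_min = cell_of(min_x, min_y, cell_size)
    col_max, row_max = cell_of(max_x, max_y, cell_size)
    hits = []
    for col in range(col_min, col_max + 1):
        for row in range(row_min, row_max + 1):
            # Cells outside this range are never touched: that is the saving.
            for point_id in index.get((col, row), []):
                x, y = points[point_id]
                if min_x <= x <= max_x and min_y <= y <= max_y:
                    hits.append(point_id)
    return sorted(hits)


# Illustrative data only
points = [(0.5, 0.5), (1.5, 1.5), (2.5, 0.2), (9.0, 9.0)]
index = build_grid_index(points, cell_size=1.0)
result = query_bbox(index, points, 1.0, 0.0, 0.0, 2.0, 2.0)
print(result)
```

The query only visits the cells that overlap the bounding box, so the point at (9.0, 9.0) is never examined. Partitioning files or row groups by location applies the same principle at a coarser scale.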

<figure>
<img src="https://miro.medium.com/v2/resize:fit:1400/1*QEQJjtnDb3JQ2xqhzARZZw.png" style="width:100%">
<figcaption align = "center"> Visualization of the layout of a Parquet file </figcaption>
</figure>

Beyond the data itself, Parquet stores metadata in a footer at the end of the file. The footer describes the internal "chunking" of the file, including the byte ranges of every column chunk and per-column statistics (such as minimum and maximum values), among other things.

<figure>
<img src="https://guide.cloudnativegeo.org/images/geoparquet_layout.png" style="width:100%">
<figcaption align = "center"> GeoParquet has the same layout with additional metadata </figcaption>
</figure>

 

## Features and Advantages

- Efficient storage and compression: 
    - Leverages the columnar layout, so a query reads only the columns it needs
    - GeoParquet files are typically compressed internally by default, and can be configured to optimize decompression time or storage size depending on the use case
    - These properties make it well suited to applications dealing with _massive_ geospatial datasets and cloud data warehouses
- Scalability and High-Performance:
    - Row groups and files can be read independently, which suits parallel and/or distributed processing such as in Spark, Dask, or Hadoop
    - Support for data partitioning: 
        - Parquet files can be partitioned by one or more columns
        - In the geospatial context this enables efficient spatial queries and filtering (e.g. partitioning by ISO country code) 
- Optimized for *read-heavy workflows*: 
    - Parquet files are immutable once written, so the format favors cheap reads and efficient filtering and aggregation over in-place updates
        - This is ideal for data warehousing and modern analytic workflows
        - Best paired with analytical databases like Amazon Redshift, Google BigQuery, or DuckDB
        - Ideal for OLAP (Online Analytical Processing) and BI (Business Intelligence) workloads that rely on historical and aggregated data and don't require frequent updates
- Interoperability and wide ecosystem:
    - GeoParquet is designed to be compatible with existing Parquet readers, tools, and libraries
    - Facilitates integration into existing data pipelines and workflows
    - Broad compatibility:
        - support for multiple spatial reference systems 
        - support for multiple geometry types and multiple geometry columns
        - works with both planar and spherical coordinates 
        - support for 2D and 3D geometries
        
## Limitations and Disadvantages

- Poorly suited for write-heavy workflows:
    - Parquet files are not well suited to transactional and CRUD (Create, Read, Update, Delete) operations
    - Not recommended for applications that require frequent updates or real-time data ingestion
- Not a silver bullet for all geospatial data:
    - deals only with vector data, not raster data
    - storage and compression benefits require a certain scale of data to be realized
    - performance overhead for small datasets
- Limited support for spatial indexing:
    - GeoParquet did not implement spatial indexing in the 1.0.0 release
    - This is planned for future release in 

# Benchmark Results

In [None]:
# Constants and local env configuration

# URI for combined open buildings data
combined_open_buildings_uri = "https://beta.source.coop/vida/google-microsoft-open-buildings/geoparquet/by_country/"


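# path to the open buildings data (not yet set; fill in before running)
open_buildings_path = None

# list of ISO country codes to fetch
buildings_countries = ["USA"]

# file formats to compare against GeoParquet
compared_file_formats = [".geojson", ".shp", ".fgb", ".gpkg"]

# Parquet compression codecs to compare
compression_types = ["snappy", "gzip", "brotli"]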

### Filesystem performance comparison

In [None]:
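import os

import gdaltools

# use ogr2ogr to convert geoparquet to our target formats

def ogr_gdal_convert(input_file, output_format):
    # write next to the input file, swapping in the target extension (e.g. ".gpkg")
    output_file = os.path.splitext(input_file)[0] + output_format

    # use pygdaltools to convert via ogr2ogr
    try:
        ogr = gdaltools.ogr2ogr()
        ogr.set_input(input_file)
        ogr.set_output(output_file)
        ogr.execute()
        return output_file
    except Exception as e:
        print(f"Error converting {input_file} to {output_format}: {e}")
        return None

# use geopandas to convert to our target formats

def geopandas_convert(input_df, output_stem, output_format):
    # recent geopandas versions infer the output driver from the file extension
    output_file = output_stem + output_format
    try:
        input_df.to_file(output_file)
        return output_file
    except Exception as e:
        print(f"Error converting GeoDataFrame to {output_format}: {e}")
        return None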

# use duckdb to convert to our target formats
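
To compare formats on the filesystem, each conversion needs to be timed and the output size recorded. The helper below is a minimal sketch of such a harness using only the standard library; `benchmark_write`, its parameters, and the dummy writer are hypothetical names introduced for illustration and are not part of the notebook's own code.

```python
import os
import tempfile
import time


def benchmark_write(write_fn, output_path, repeats=3):
    """Call write_fn(output_path) several times; report best wall time and file size."""
    timings = []
    for _ in range(repeats):
        # Start each run from a clean slate so overwrites do not skew timings
        if os.path.exists(output_path):
            os.remove(output_path)
        start = time.perf_counter()
        write_fn(output_path)
        timings.append(time.perf_counter() - start)
    return {
        "best_seconds": min(timings),
        "size_bytes": os.path.getsize(output_path),
    }


def write_dummy(path):
    """Stand-in writer used only to demonstrate the harness."""
    with open(path, "wb") as f:
        f.write(b"x" * 1024)


result = benchmark_write(write_dummy, os.path.join(tempfile.mkdtemp(), "dummy.bin"))
print(result)
```

Note that this measures a single output file. Formats that write several sidecar files, such as Shapefile, would need the sizes of all of them summed for a fair comparison.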

### Querying in-memory with GeoPandas 

### Querying from files with DuckDB

### Visualization with Basemaps

### 3D Data with Overture

# Discussion on cloud-native geospatial data formats

# References
- https://geoparquet.org/
- https://geopandas.org/
- https://radiant.earth/blog/2023/10/what-is-source-cooperative/
- https://guide.cloudnativegeo.org/geoparquet/
- https://medium.com/mapular/overture-maps-a-fusion-of-open-and-commercial-data-for-a-new-era-in-mapping-f26b4b56ad9a
- https://towardsdatascience.com/geospatial-data-engineering-spatial-indexing-18200ef9160b
- https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md 
- https://medium.com/radiant-earth-insights/geoparquet-1-1-coming-soon-9b72c900fbf2