# GeoParquet Example

This notebook gives an overview of reading and writing GeoParquet files with GeoPandas, emphasizing cloud-native operations where possible.

The easiest way to read and write GeoParquet files is to use GeoPandas' [`read_parquet`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_parquet.html) function and [`to_parquet`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_parquet.html) method.

::: {.callout-note}

Make sure to use the dedicated `read_parquet` and `to_parquet` functions rather than the generic `read_file` and `to_file`. The dedicated functions are much faster.

:::

In [3]:
from urllib.request import urlretrieve  # standard library: download a file over HTTP

import fsspec
import geopandas as gpd
from fsspec.implementations.http import HTTPFileSystem  # fsspec filesystem for HTTP(S) URLs

## Comparison with FlatGeobuf

To compare reading GeoParquet with reading FlatGeobuf, we'll first read and write GeoParquet files on local disk. To stay consistent with the FlatGeobuf example, we'll fetch the same US counties FlatGeobuf file (13 MB) and convert it to GeoParquet using `ogr2ogr`.

In [4]:
# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving to the current directory
local_fgb_path, _ = urlretrieve(url, "countries.fgb")

In [5]:
!ogr2ogr countries.parquet countries.fgb

Loading this GeoParquet file is fast: 13% faster than loading the same data via FlatGeobuf (shown in the FlatGeobuf example notebook).

In [20]:
%timeit gpd.read_parquet("countries.parquet")
gdf = gpd.read_parquet("countries.parquet")  # %timeit discards assignments, so load once for later cells

21.2 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Writing to local disk

We can use `GeoDataFrame.to_parquet` to write this data out to a local GeoParquet file. This is about 3x faster than writing the same dataset to FlatGeobuf, though the comparison is not like for like: the FlatGeobuf writer also builds a spatial index.

In [22]:
%timeit gdf.to_parquet("countries_written.parquet")

35.9 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Reading from the cloud

As of GeoParquet version 1.0.0-rc.1, spatial indexing has not yet been implemented. Therefore, there is not yet an API in GeoPandas to read data given a specific bounding box.
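
Without a bounding-box read, one workaround is to load the data first and filter it in memory. The sketch below does this with GeoPandas' `.cx` coordinate indexer on a made-up three-point dataset; the names, coordinates, and box are illustrative only.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical dataset: three points along a diagonal.
points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0, 0), Point(5, 5), Point(10, 10)],
    crs="EPSG:4326",
)

# .cx[xmin:xmax, ymin:ymax] keeps the rows whose geometry intersects the box.
subset = points.cx[4:6, 4:6]
print(subset["name"].tolist())
```

Note that this does not reduce the amount of data downloaded. It only trims the `GeoDataFrame` after the whole file has been read.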

What GeoParquet already does efficiently is read only specified columns from a dataset.

In [17]:
url = "https://data.source.coop/cholmes/eurocrops/unprojected/geoparquet/FR_2018_EC21.parquet"

Note that since we're fetching this data directly from the cloud, we need to pass in an `fsspec` filesystem object. Otherwise GeoPandas will attempt to load a local file.

In [18]:
filesystem = HTTPFileSystem()

By default, calling `read_parquet` will fetch the entire file and parse it all into a single `GeoDataFrame`. Since this is a 3 GB file, downloading it takes a long time:

In [12]:
%time gdf = gpd.read_parquet(url, filesystem=filesystem)

CPU times: user 27.2 s, sys: 21.3 s, total: 48.5 s
Wall time: 5min 56s


We can make this faster by fetching only specific columns. Because GeoParquet stores data column by column, selecting a subset of columns lets us download far less data.
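
To see why the storage layout matters, here is a toy model in plain Python. It is not how Parquet is actually encoded (it ignores row groups, compression, and metadata), and the `notes` column and all byte sizes are invented; it only counts how many bytes a reader has to touch to get two columns under each layout.

```python
# Toy table: a small ID, a bulky attribute (hypothetical), and a geometry blob.
records = [
    {"ID_PARCEL": b"0001", "notes": b"x" * 200, "geometry": b"g" * 100},
    {"ID_PARCEL": b"0002", "notes": b"y" * 200, "geometry": b"g" * 100},
    {"ID_PARCEL": b"0003", "notes": b"z" * 200, "geometry": b"g" * 100},
]


def bytes_read_row_layout(records):
    # Row-oriented: the fields of each record are interleaved, so a simple
    # reader scans every field of every record, wanted or not.
    return sum(len(value) for record in records for value in record.values())


def bytes_read_column_layout(records, wanted):
    # Column-oriented: each column is stored contiguously, so a reader
    # fetches only the chunks belonging to the requested columns.
    return sum(len(record[name]) for record in records for name in wanted)


wanted = ["ID_PARCEL", "geometry"]
row_bytes = bytes_read_row_layout(records)
col_bytes = bytes_read_column_layout(records, wanted)
print(row_bytes, col_bytes)  # 912 312
```

The more data the skipped columns hold, the larger the saving; when the requested columns dominate the file, as geometry columns often do, the gain is smaller.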

In [23]:
%time gdf = gpd.read_parquet(url, columns=["ID_PARCEL", "geometry"], filesystem=filesystem)

CPU times: user 19.6 s, sys: 14 s, total: 33.6 s
Wall time: 3min 5s
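
To put the two timings side by side, the short calculation below uses the wall times reported above. Only the variable names are new.

```python
from datetime import timedelta

# Wall times copied from the two %time outputs above.
full_read = timedelta(minutes=5, seconds=56)
two_columns = timedelta(minutes=3, seconds=5)

speedup = full_read / two_columns   # ratio of the two durations (a float)
saved = full_read - two_columns     # absolute time saved
fraction_saved = saved / full_read  # share of the original wall time

print(f"speedup: {speedup:.2f}x")  # speedup: 1.92x
print(f"time saved: {saved.total_seconds():.0f} s ({fraction_saved:.0%})")  # time saved: 171 s (48%)
```

For this file and this connection, restricting the read to two columns roughly halves the wall time. The exact ratio will vary with network conditions and with how much of the file the skipped columns occupy.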
