# GeoParquet Example

This notebook will give an overview of how to read and write GeoParquet files with GeoPandas, putting an emphasis on cloud-native operations where possible.

The easiest way to read and write GeoParquet files is with GeoPandas' [`read_parquet`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_parquet.html) function and [`GeoDataFrame.to_parquet`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_parquet.html) method.

::: {.callout-note}

Use the Parquet-specific `read_parquet` and `to_parquet` rather than the generic `read_file` and `to_file`. The Parquet-specific functions are much faster.

:::

In [4]:
from urllib.request import urlretrieve

import fsspec
import geopandas as gpd
from fsspec.implementations.http import HTTPFileSystem

## Comparison with FlatGeobuf

In order to compare reading GeoParquet with FlatGeobuf, we'll cover reading and writing GeoParquet files on local disk storage. To be consistent with the FlatGeobuf example, we'll fetch the same US counties FlatGeobuf file (13 MB) and convert it to GeoParquet using `ogr2ogr`.

In [4]:
# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving to the current directory
local_fgb_path, _ = urlretrieve(url, "countries.fgb")

In [5]:
!ogr2ogr countries.parquet countries.fgb

Loading this GeoParquet file is fast: about 13% faster than loading the same data via FlatGeobuf (as shown in the FlatGeobuf example notebook).

In [20]:
gdf = gpd.read_parquet("countries.parquet")  # %timeit discards assignments, so load once for later cells
%timeit gpd.read_parquet("countries.parquet")

21.2 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Writing to local disk

We can use `GeoDataFrame.to_parquet` to write this data to a local GeoParquet file. This is about 3x faster than writing the same dataset to FlatGeobuf, but note that the FlatGeobuf writer also computes a spatial index.

In [22]:
%timeit gdf.to_parquet("countries_written.parquet")

35.9 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Reading from the cloud

As of GeoParquet version 1.0.0-rc.1, spatial indexing has not yet been implemented. Therefore, there is not yet an API in GeoPandas to read data given a specific bounding box.

What is already efficient in GeoParquet is reading only specified columns from a dataset.

In [5]:
url = "https://data.source.coop/cholmes/eurocrops/unprojected/geoparquet/FR_2018_EC21.parquet"

Note that since we're fetching this data directly from the cloud, we need to pass in an `fsspec` filesystem object. Otherwise GeoPandas will attempt to load a local file.

In [6]:
filesystem = HTTPFileSystem()

By default, `read_parquet` fetches the entire file and parses it into a single `GeoDataFrame`. Since this is a 3 GB file, the download takes a long time:

In [12]:
%time gdf = gpd.read_parquet(url, filesystem=filesystem)

CPU times: user 27.2 s, sys: 21.3 s, total: 48.5 s
Wall time: 5min 56s


We can make this faster by only fetching specific columns. Because GeoParquet stores data in a columnar fashion, when selecting only specific columns we can download a lot less data.

In [23]:
%time gdf = gpd.read_parquet(url, columns=["ID_PARCEL", "geometry"], filesystem=filesystem)

CPU times: user 19.6 s, sys: 14 s, total: 33.6 s
Wall time: 3min 5s
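
To put the two wall times above side by side, the arithmetic can be sketched as follows. The numbers are copied from the cell outputs above; they come from a single run and will vary with network conditions, so treat the result as indicative only.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xmin Ys' wall time into seconds."""
    return minutes * 60 + seconds


full_read = to_seconds(5, 56)    # all columns: 5min 56s
two_columns = to_seconds(3, 5)   # ID_PARCEL + geometry only: 3min 5s

speedup = full_read / two_columns
saved = 1 - two_columns / full_read

print(f"{speedup:.2f}x faster, {saved:.0%} less wall time")
# prints "1.92x faster, 48% less wall time"
```

The geometry column is still fetched in both runs, which is one plausible reason the saving is not larger.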


## Working with GeoParquet row groups (Advanced)

As described in the [intro document](./index.qmd), GeoParquet is a chunked format, which allows you to access one of the chunks of rows very efficiently. This can allow you to stream a dataset — loading and operating on one chunk at a time — if the dataset is larger than your memory.

GeoPandas does not yet have built-in support for working with row groups, so this section will use the underlying [`pyarrow`](https://arrow.apache.org/docs/python/index.html) library directly.

In [18]:
import pyarrow.parquet as pq
from geopandas.io.arrow import _arrow_to_geopandas

First, we'll create a [`ParquetFile`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile) object from the remote URL. All this does is load the metadata from the file, allowing you to inspect the schema and number of columns, rows, and row groups. Because this doesn't load any actual data, it's nearly instant to complete.

In [10]:
parquet_file = pq.ParquetFile(url, filesystem=filesystem)
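
Opening the file is cheap because Parquet keeps its file metadata in a footer at the end of the file. For a file with an unencrypted footer, the final 8 bytes hold a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can locate the metadata with small range requests. The sketch below illustrates that layout using only the standard library. `parse_parquet_tail` and `footer_byte_range` are hypothetical helpers written for this illustration, not part of `pyarrow`.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def parse_parquet_tail(tail: bytes) -> int:
    """Return the footer length encoded in the final 8 bytes of a Parquet file."""
    if len(tail) != 8:
        raise ValueError("expected exactly the final 8 bytes of the file")
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic; not a plain Parquet file")
    # The first 4 bytes of the tail are an unsigned little-endian integer.
    (footer_length,) = struct.unpack("<I", tail[:4])
    return footer_length


def footer_byte_range(file_size: int, footer_length: int) -> tuple[int, int]:
    """Half-open byte range [start, end) that holds the serialized footer."""
    end = file_size - 8  # skip the length field and the trailing magic
    start = end - footer_length
    if start < 4:  # the first 4 bytes of the file are the leading magic
        raise ValueError("footer length is inconsistent with the file size")
    return start, end
```

Given the file size and the last 8 bytes, these two helpers identify the only byte range needed to read the schema and row group layout, regardless of how large the data pages are.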

We can access the column names in the dataset:

In [23]:
parquet_file.schema_arrow.names

['ID_PARCEL',
 'SURF_PARC',
 'CODE_CULTU',
 'CODE_GROUP',
 'CULTURE_D1',
 'CULTURE_D2',
 'EC_org_n',
 'EC_trans_n',
 'EC_hcat_n',
 'EC_hcat_c',
 'geometry']

As well as the number of row groups:

In [11]:
parquet_file.num_row_groups

146

Then to load one of the row groups by numeric index, we can call [`ParquetFile.read_row_group`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.read_row_group).

In [12]:
pyarrow_table = parquet_file.read_row_group(0)

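Note that this returns a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table), not a `geopandas.GeoDataFrame`. To convert between the two, we can use `_arrow_to_geopandas`. This conversion is very fast. Keep in mind that `_arrow_to_geopandas` is a private GeoPandas helper (note the leading underscore), so it may change between releases.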

In [27]:
geopandas_gdf = _arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata)
geopandas_gdf.head()

| | ID_PARCEL | SURF_PARC | CODE_CULTU | CODE_GROUP | CULTURE_D1 | CULTURE_D2 | EC_org_n | EC_trans_n | EC_hcat_n | EC_hcat_c | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123563 | 6.38 | CZH | 5 | | | Colza d’hiver | Winter rapeseed | winter_rapeseed_rape | 3301060401 | MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8... |
| 1 | 5527076 | 2.3 | PPH | 18 | | | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49... |
| 2 | 11479241 | 6.33 | PPH | 18 | | | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5... |
| 3 | 12928442 | 5.1 | PPH | 18 | | | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48... |
| 4 | 318389 | 0.92 | PPH | 18 | | | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0... |


As before, we can speed up the data fetching by requesting only specific columns in the `read_row_group` call:

In [28]:
pyarrow_table = parquet_file.read_row_group(0, columns=["ID_PARCEL", "geometry"])

Then the resulting `GeoDataFrame` will only have those two columns:

In [29]:
_arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata).head()

| | ID_PARCEL | geometry |
|---|---|---|
| 0 | 123563 | MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8... |
| 1 | 5527076 | MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49... |
| 2 | 11479241 | MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5... |
| 3 | 12928442 | MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48... |
| 4 | 318389 | MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0... |
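
The row group calls above can be combined into a streaming loop that touches one chunk at a time. The following is a minimal sketch, not a GeoPandas or `pyarrow` API: `iter_row_groups` and `count_rows` are hypothetical helpers, and they only assume an object that exposes `num_row_groups` and `read_row_group(index, columns=...)` the way `pyarrow.parquet.ParquetFile` does.

```python
from typing import Any, Iterator, Optional, Sequence


def iter_row_groups(
    parquet_file: Any, columns: Optional[Sequence[str]] = None
) -> Iterator[Any]:
    """Yield one table per row group, in file order.

    `parquet_file` is expected to behave like `pyarrow.parquet.ParquetFile`:
    it needs a `num_row_groups` attribute and a `read_row_group` method.
    """
    for index in range(parquet_file.num_row_groups):
        # Only one row group is read per iteration.
        yield parquet_file.read_row_group(index, columns=columns)


def count_rows(parquet_file: Any, columns: Optional[Sequence[str]] = None) -> int:
    """Example reduction: total number of rows, computed chunk by chunk."""
    return sum(table.num_rows for table in iter_row_groups(parquet_file, columns))
```

With the `parquet_file` opened earlier, each yielded table could be passed to `_arrow_to_geopandas(table, parquet_file.metadata.metadata)` as in the cells above, so that only one row group's worth of geometries is held as a `GeoDataFrame` at any time.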
