# FlatGeobuf example

This notebook gives an overview of how to read and write FlatGeobuf files with GeoPandas, with an emphasis on cloud-native operations where possible.

The primary way to interact with FlatGeobuf in Python is via bindings to GDAL, as there is no pure-Python implementation of FlatGeobuf.

There are two Python libraries that bridge Python and GDAL's vector support: `fiona` and `pyogrio`. Both are integrated into [`geopandas.read_file`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) via the `engine` keyword, but `pyogrio` is much faster. Set `engine="pyogrio"` when using `read_file` or [`GeoDataFrame.to_file`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html) to speed up reading and writing significantly. We also suggest passing `use_arrow=True` when reading for a slight extra speedup (this is not supported when writing).

::: {.callout-note}

[`fiona`](https://github.com/Toblerity/Fiona) is the default engine for `geopandas.read_file`. It provides full-featured bindings to GDAL but does not implement _vectorized_ operations. [Vectorization](https://wesmckinney.com/book/numpy-basics#ndarray_binops) refers to operating on whole arrays of data at once rather than operating on individual values using a Python for loop. `fiona`'s non-vectorized approach means that each row of the source file is read individually inside a Python for loop. In contrast, [`pyogrio`](https://github.com/geopandas/pyogrio)'s vectorized implementation reads all rows in C before passing the data to Python, allowing it to achieve vast speedups (up to 40x) over `fiona`.

You can opt in to using `pyogrio` with `geopandas.read_file` by passing `engine="pyogrio"`.

Additionally, if you're using GDAL version 3.6 or higher (usually the case when using pyogrio), you can pass `use_arrow=True` to `geopandas.read_file` to use `pyogrio`'s support for [GDAL's RFC 86](https://gdal.org/development/rfc/rfc86_column_oriented_api.html), which speeds up data reading even more.

:::
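To make the loop-versus-vectorized distinction from the note above concrete, here is a minimal sketch using NumPy. It is not how `fiona` or `pyogrio` are implemented; the array contents and function names are made up for illustration, and NumPy is assumed to be installed (it is a dependency of GeoPandas).

```python
import numpy as np

# Hypothetical data: parcel areas in hectares.
areas_ha = np.arange(1_000, dtype="float64")


def to_square_metres_loop(values):
    """Non-vectorized: Python handles every value one at a time."""
    out = []
    for value in values:
        out.append(value * 10_000.0)
    return out


def to_square_metres_vectorized(values):
    """Vectorized: one call, with the loop running in compiled code."""
    return values * 10_000.0


loop_result = to_square_metres_loop(areas_ha)
vector_result = to_square_metres_vectorized(areas_ha)

# Both approaches produce the same numbers; only the cost differs.
print(np.allclose(loop_result, vector_result))
```

The two functions return the same values, but the vectorized version avoids per-element Python overhead, which is the same reason a reader that collects all rows in C can outpace one that yields rows to Python individually.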

In [1]:
from tempfile import TemporaryDirectory
from urllib.request import urlretrieve

import geopandas as gpd
import pyogrio

## Reading from local disk

First we'll cover reading FlatGeobuf from local disk storage. As a first example, we'll use the US counties FlatGeobuf data from [this example](https://observablehq.com/@bjornharrtell/streaming-flatgeobuf). This file is only 13 MB, so we'll download it to demonstrate simple loading from disk.

In [12]:
# Create a temporary directory in which to save the file
tmpdir = TemporaryDirectory()

# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving the output path
local_fgb_path, _ = urlretrieve(url, f"{tmpdir.name}/counties.fgb")

In each of the cases below, we use `geopandas.read_file` to read the file into a `GeoDataFrame`.

First we'll show that reading this file with `engine="fiona"` (the default) is slower. An extra 500 milliseconds might not seem like a lot, but this file contains only about 3,000 rows, so the difference grows with larger files.

In [20]:
%timeit gdf = gpd.read_file(local_fgb_path, engine="fiona")

518 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Passing `engine="pyogrio"` speeds up loading by 18x here!

In [21]:
%timeit gdf = gpd.read_file(local_fgb_path, engine="pyogrio")

28.9 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Using `use_arrow=True` often makes loading slightly faster still. We're now more than 21x faster than with fiona.

In [19]:
%timeit gdf = gpd.read_file(local_fgb_path, engine="pyogrio", use_arrow=True)

24 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
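The speedup factors quoted above follow directly from the mean timings. As a quick check, the arithmetic can be reproduced with the numbers copied by hand from the `%timeit` outputs (the dictionary and variable names are only illustrative):

```python
# Mean times in milliseconds, copied from the %timeit outputs above.
timings_ms = {
    "fiona": 518.0,
    "pyogrio": 28.9,
    "pyogrio + use_arrow": 24.0,
}

# Express every timing as a speedup relative to the fiona baseline.
baseline_ms = timings_ms["fiona"]
speedups = {name: baseline_ms / ms for name, ms in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

These figures come from a single machine and a single small file, so treat them as indicative rather than as a general benchmark.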


## Writing to local disk

Similarly, we can use `GeoDataFrame.to_file` to write to a local FlatGeobuf file. As expected, writing is considerably faster with `engine="pyogrio"`: about 4x in the timings below (394 ms versus 94 ms wall time).

In [31]:
gdf = gpd.read_file(local_fgb_path, engine="pyogrio")  # %timeit above does not keep `gdf`
%time gdf.to_file(f"{tmpdir.name}/out_fiona.fgb")

CPU times: user 354 ms, sys: 32 ms, total: 386 ms
Wall time: 394 ms


In [32]:
%time gdf.to_file(f"{tmpdir.name}/out_pyogrio.fgb", engine="pyogrio")

CPU times: user 61 ms, sys: 23.5 ms, total: 84.5 ms
Wall time: 94 ms


## Reading from the cloud

Reading and writing local files is useful, but FlatGeobuf is a cloud-optimized format, so it's just as important to be able to read from cloud-hosted files.

For this example, we'll use the EuroCrops data hosted on Source Cooperative because it has versions of the same data in both FlatGeobuf and GeoParquet format. Using the same dataset in both the FlatGeobuf and GeoParquet example notebooks should make the two formats easier to compare.

In [5]:
url = "https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

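When reading from the cloud, you usually want to filter on some spatial extent rather than download the whole file. Before doing so, it helps to inspect the file. Pyogrio offers a `read_info` function that returns metadata such as the CRS, the field names and types, and the number of features: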

In [10]:
pyogrio.read_info(url)

{'crs': 'EPSG:4326',
 'encoding': 'UTF-8',
 'fields': array(['ID_PARCEL', 'SURF_PARC', 'CODE_CULTU', 'CODE_GROUP', 'CULTURE_D1',
        'CULTURE_D2', 'EC_org_n', 'EC_trans_n', 'EC_hcat_n', 'EC_hcat_c'],
       dtype=object),
 'dtypes': array(['object', 'float64', 'object', 'object', 'object', 'object',
        'object', 'object', 'object', 'object'], dtype=object),
 'geometry_type': 'MultiPolygon',
 'features': 9517874,
 'driver': 'FlatGeobuf',
 'capabilities': {'random_read': 1,
  'fast_set_next_by_index': 0,
  'fast_spatial_filter': 1},
 'layer_metadata': None,
 'dataset_metadata': None}
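The `capabilities` entry is worth a glance before issuing a spatial query, since `fast_spatial_filter` reports whether GDAL considers bounding-box filtering on this layer to be efficient. A minimal sketch of such a check follows; the dictionary is a hand-copied subset of the output above so the snippet runs without network access, and the helper name is hypothetical.

```python
# Subset of the pyogrio.read_info(url) output shown above, copied by hand.
info = {
    "crs": "EPSG:4326",
    "features": 9517874,
    "driver": "FlatGeobuf",
    "capabilities": {
        "random_read": 1,
        "fast_set_next_by_index": 0,
        "fast_spatial_filter": 1,
    },
}


def reports_fast_spatial_filter(info):
    """Return True if the layer advertises an efficient spatial filter."""
    capabilities = info.get("capabilities") or {}
    return bool(capabilities.get("fast_spatial_filter", 0))


print(reports_fast_spatial_filter(info))
print(f"{info['features']:,} features in {info['crs']}")
```

In a live session you would pass the dictionary returned by `pyogrio.read_info(url)` instead of the hand-copied one.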

::: {.callout-note}

Sadly the output of `read_info` does [not yet include](https://github.com/geopandas/pyogrio/issues/274) the bounding box of the file, even though the FlatGeobuf file contains that information in the header. This may be a reason to consider externalizing metadata using [Spatio-Temporal Asset Catalog files](https://stacspec.org/en) (STAC) in the future. For now we'll hard-code a region around Valence in the south of France.

:::

In [6]:
# The order of bounds is
# (min longitude, min latitude, max longitude, max latitude)
bounds = (3.733296, 44.677061, 4.717124, 45.179431)

We can fetch a dataframe containing only the records in these bounds by passing a `bbox` argument to `read_file`. Note that the Coordinate Reference System of this bounding box **must match** the CRS of the dataset. Here, we know from the output of `read_info` that the CRS of the dataset is EPSG:4326, so we can pass a longitude-latitude bounding box.
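If a region of interest arrives in a different CRS, it has to be converted before being used as `bbox`. As an illustration only, the sketch below converts a Web Mercator (EPSG:3857) box to longitude-latitude using the standard spherical Mercator formulas and the Python standard library. The function and variable names are hypothetical, and a dedicated library such as `pyproj` is the more usual choice for real work, particularly for CRSs whose axes are not aligned with longitude and latitude.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def lonlat_to_web_mercator(lon, lat):
    """Forward spherical Mercator: degrees to metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def web_mercator_to_lonlat(x, y):
    """Inverse spherical Mercator: metres to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


def bounds_3857_to_4326(bounds_3857):
    """Convert (minx, miny, maxx, maxy) in metres to lon/lat degrees."""
    minx, miny, maxx, maxy = bounds_3857
    min_lon, min_lat = web_mercator_to_lonlat(minx, miny)
    max_lon, max_lat = web_mercator_to_lonlat(maxx, maxy)
    return (min_lon, min_lat, max_lon, max_lat)


# Pretend the region was supplied in Web Mercator metres, then convert it back.
bounds_4326 = (3.733296, 44.677061, 4.717124, 45.179431)
bounds_3857 = (
    *lonlat_to_web_mercator(bounds_4326[0], bounds_4326[1]),
    *lonlat_to_web_mercator(bounds_4326[2], bounds_4326[3]),
)
roundtrip = bounds_3857_to_4326(bounds_3857)
print(roundtrip)
```

Converting only the two corners works here because Web Mercator keeps meridians and parallels as straight, axis-aligned lines; that shortcut does not hold for projections in general.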

In [7]:
%time crops_gdf = gpd.read_file(url, bbox=bounds)

CPU times: user 14.7 s, sys: 1.16 s, total: 15.9 s
Wall time: 2min 12s


Passing `engine="pyogrio"` is only slightly faster in wall time (1min 59s versus 2min 12s), even though CPU time drops from 15.9 s to 5.12 s. This suggests that most of the time is spent on network requests rather than on parsing the data into Python.

In [9]:
%time crops_gdf = gpd.read_file(url, bbox=bounds, engine="pyogrio")

CPU times: user 3.58 s, sys: 1.55 s, total: 5.12 s
Wall time: 1min 59s


This gives us a much smaller dataset of only 123,000 rows (down from 9.5 million rows in the original dataset).

In [11]:
crops_gdf.head()

| | ID_PARCEL | SURF_PARC | CODE_CULTU | CODE_GROUP | CULTURE_D1 | CULTURE_D2 | EC_org_n | EC_trans_n | EC_hcat_n | EC_hcat_c | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1291862 | 4.79 | PTR | 19 | | | Autre prairie temporaire de 5 ans ou moins | Other temporary grassland 5 years or less | temporary_grass | 3301090100 | MULTIPOLYGON (((3.18191 45.07997, 3.18192 45.0... |
| 1 | 466911 | 2.88 | LU8 | 16 | | | Luzerne implantée pour la récolte 2018 | Alfalfa planted for the 2018 harvest | alfalfa_lucerne | 3301090301 | MULTIPOLYGON (((4.71244 44.67436, 4.71288 44.6... |
| 2 | 527813 | 0.15 | SNE | 28 | | | Surface agricole temporairement non exploitée | Temporarily unused agricultural area | unmaintained | 3308000000 | MULTIPOLYGON (((4.71242 44.67437, 4.71247 44.6... |
| 3 | 465635 | 2.08 | ORH | 3 | | | Orge d'hiver | Winter barley | winter_barley | 3301010401 | MULTIPOLYGON (((4.71240 44.67440, 4.71151 44.6... |
| 4 | 465636 | 0.76 | PPH | 18 | | | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((4.71240 44.67440, 4.71247 44.6... |


In [12]:
crops_gdf.shape

(123537, 11)
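To put the size of the selection in perspective, the share of features returned can be worked out from the two counts reported in this notebook. The variable names below are only illustrative, and the numbers are copied by hand from the outputs above.

```python
total_features = 9_517_874  # "features" in the read_info output
selected_rows = 123_537     # first element of crops_gdf.shape

# Share of the full dataset that the bounding-box query returned.
fraction = selected_rows / total_features
print(f"{fraction:.2%} of the features were returned by the bounding-box query")
```

Note that the wall time above did not shrink in the same proportion, which is consistent with network round trips, rather than data volume alone, dominating the read.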

There are other useful keyword arguments to `read_file`. Since we're using the `pyogrio` engine, we can pass specific column names into `read_file`, and only those columns will be parsed. In the case of FlatGeobuf, this doesn't save us much time: the format stores each feature's attributes and geometry together, so the same amount of data needs to be fetched regardless of which columns are kept.

In [13]:
column_names = ["ID_PARCEL", "SURF_PARC", "CODE_CULTU", "geometry"]
%time crops_gdf = gpd.read_file(url, bbox=bounds, columns=column_names, engine="pyogrio")

CPU times: user 3.55 s, sys: 1.54 s, total: 5.09 s
Wall time: 2min 30s


In [14]:
crops_gdf.head()

| | CODE_CULTU | ID_PARCEL | SURF_PARC | geometry |
|---|---|---|---|---|
| 0 | PTR | 1291862 | 4.79 | MULTIPOLYGON (((3.18191 45.07997, 3.18192 45.0... |
| 1 | LU8 | 466911 | 2.88 | MULTIPOLYGON (((4.71244 44.67436, 4.71288 44.6... |
| 2 | SNE | 527813 | 0.15 | MULTIPOLYGON (((4.71242 44.67437, 4.71247 44.6... |
| 3 | ORH | 465635 | 2.08 | MULTIPOLYGON (((4.71240 44.67440, 4.71151 44.6... |
| 4 | PPH | 465636 | 0.76 | MULTIPOLYGON (((4.71240 44.67440, 4.71247 44.6... |
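The observation that column selection does not reduce the amount of data fetched follows from FlatGeobuf's row-oriented layout. The toy model below contrasts it with a column-oriented layout. All byte sizes are invented for illustration, and the model ignores headers, the spatial index, and per-request overhead, so it shows only the shape of the trade-off rather than real numbers.

```python
# Hypothetical per-feature sizes in bytes; real sizes vary by dataset.
column_bytes = {
    "ID_PARCEL": 8,
    "SURF_PARC": 8,
    "CODE_CULTU": 3,
    "CODE_GROUP": 2,
    "EC_org_n": 40,
    "EC_trans_n": 40,
    "geometry": 400,
}
n_features = 123_537
wanted = ["ID_PARCEL", "SURF_PARC", "CODE_CULTU", "geometry"]


def bytes_fetched_row_oriented(n, sizes, selected):
    # Each feature is one contiguous record, so the whole record is
    # downloaded no matter which columns are selected.
    return n * sum(sizes.values())


def bytes_fetched_column_oriented(n, sizes, selected):
    # Columns are stored separately, so only the selected ones are read.
    return n * sum(sizes[name] for name in selected)


row_all = bytes_fetched_row_oriented(n_features, column_bytes, list(column_bytes))
row_some = bytes_fetched_row_oriented(n_features, column_bytes, wanted)
col_some = bytes_fetched_column_oriented(n_features, column_bytes, wanted)

print(row_all == row_some)  # selecting columns changes nothing for row storage
print(col_some < row_some)  # a columnar layout would fetch less
```

Selecting columns still reduces parsing work and memory use on the client, which is why the CPU time stays low even though the wall time does not improve.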
