# Kata 0: Full scan

For small and medium-sized datasets (say, fewer than a million items), we sometimes just want to get _everything_.
This might be true for large datasets as well, especially in a train-a-model workflow.
Here, we build up to a full scan of ten thousand [NAIP](https://naip-usdaonline.hub.arcgis.com/) items over Colorado.

## Baby steps

First, though, we want to explore how our API's performance varies with page size.
Let's start with the default page size.

In [20]:
from pystac_client import Client

from labs_375 import STAC_FASTAPI_GEOPARQUET_URI, Timer
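
`Timer` comes from the `labs_375` helper package, whose source isn't shown here. As a rough sketch of what such a helper might look like, it could be a small context manager that records a start time and reports throughput. The class name `SketchTimer` and its internals are assumptions, inferred only from how `Timer` is used and from the output format in this notebook.

```python
import time


class SketchTimer:
    """Hypothetical stand-in for labs_375.Timer; the real one may differ."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Returning False lets any exception from the block propagate.
        return False

    def report(self, items):
        # Elapsed time is measured at the moment report() is called.
        elapsed = time.perf_counter() - self.start
        count = len(items)
        rate = count / elapsed if elapsed > 0 else float("inf")
        line = f"Retrieved {count} in {elapsed:.2f}s ({rate:.2f} items/s)"
        print(line)
        return line


with SketchTimer() as timer:
    items = list(range(1000))  # stand-in for a real search
    timer.report(items)
```

Because `report` is called inside the `with` block, the elapsed time covers everything up to that call, including materializing the item list.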

In [21]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)

with Timer() as timer:
    items = list(client.search(collections=["naip"], max_items=100).items_as_dicts())
    timer.report(items)

Retrieved 100 in 17.31s (5.78 items/s)


That's not excellent.
Let's try bumping it up.

In [22]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=100, limit=100).items_as_dicts()
    )
    timer.report(items)

Retrieved 100 in 1.69s (59.16 items/s)


So much better.
Lots of little requests are much slower than one big one, at least in this full scan case.
Is there a maximum?
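
To make the round-trip arithmetic concrete, here is a small simulation of paging. It doesn't talk to a server or reproduce pystac-client's internals; `paged_scan` and the in-memory "collection" are made up for illustration. It only counts how many page requests a client would need for a given `max_items` and `limit`. The page size of 10 in the first call is an assumption, since the default depends on the server.

```python
def paged_scan(total_available, max_items, limit):
    """Simulate paging through a collection; returns (items, request_count)."""
    items = []
    requests = 0
    offset = 0
    while len(items) < max_items and offset < total_available:
        # Each loop iteration stands in for one HTTP round trip.
        page = list(range(offset, min(offset + limit, total_available)))
        requests += 1
        # Trim the last page so we never return more than max_items.
        items.extend(page[: max_items - len(items)])
        offset += limit
    return items, requests


for limit in (10, 100):
    items, requests = paged_scan(total_available=10_000, max_items=100, limit=limit)
    print(f"limit={limit}: {len(items)} items in {requests} request(s)")
```

If every request carries a fixed cost, cutting the request count from ten to one should cut the total time by close to that factor.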

In [23]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=2000, limit=2000).items_as_dicts()
    )
    timer.report(items)

with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=5000, limit=5000).items_as_dicts()
    )
    timer.report(items)

Retrieved 2000 in 3.04s (656.88 items/s)
Retrieved 5000 in 4.96s (1008.75 items/s)


Whoa, no.
Our **stac-fastapi-geoparquet** server, at least for large requests with no search filters, wants as large a page size as possible.
So let's fetch all ten thousand items in a single page.

In [24]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
with Timer() as timer:
    items = list(
        client.search(
            collections=["naip"], max_items=10000, limit=10000
        ).items_as_dicts()
    )
    timer.report(items)

Retrieved 10000 in 8.23s (1215.79 items/s)
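
The timings so far are roughly consistent with a simple cost model: a fixed cost per request plus a cost per item. As a back-of-the-envelope sketch, we can fit those two numbers from the 2000- and 5000-item runs (each assumed to be a single request) and check them against the other measurements. The linear model and the default page size of 10 are assumptions, and each measurement is a single run.

```python
# Measured single-request runs from above: (items, seconds).
small = (2000, 3.04)
large = (5000, 4.96)

# Fit seconds = requests * per_request + items * per_item from two points.
per_item = (large[1] - small[1]) / (large[0] - small[0])
per_request = small[1] - small[0] * per_item


def predict(items, limit):
    """Predicted seconds to fetch `items` using pages of size `limit`."""
    requests = -(-items // limit)  # ceiling division
    return requests * per_request + items * per_item


print(f"per request: {per_request:.2f}s, per item: {per_item * 1000:.2f}ms")
for items, limit, measured in [
    (100, 10, 17.31),  # default page size, ASSUMED to be 10
    (100, 100, 1.69),
    (10000, 10000, 8.23),
]:
    predicted = predict(items, limit)
    print(f"{items} items, limit={limit}: predicted {predicted:.2f}s, measured {measured}s")
```

Under these assumptions the fit comes out to about 1.76 s per request and 0.64 ms per item, and the predictions land within a few percent of the measured times. That suggests the per-request overhead, rather than the per-item cost, dominates when pages are small.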


One neat feature of **stac-geoparquet** is that we can query it directly using **DuckDB** from our client.
[stacrs](https://stac-utils.github.io/stacrs/) can do that.
What happens when we read our **stac-geoparquet** file in an S3 bucket directly?

!!! note "You need to configure your AWS account, either with access to the bucket via the eoAPI sub-account, or with requester pays"

In [None]:
from rustac import DuckdbClient

from labs_375 import NAIP_GEOPARQUET_URI

client = DuckdbClient()
client.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)")
with Timer() as timer:
    items = client.search(
        NAIP_GEOPARQUET_URI,
    )
    timer.report(items)

Retrieved 10000 in 1.40s (7137.57 items/s)


## Comparison with pgstac

We've got the same items loaded into a [pgstac](https://github.com/stac-utils/pgstac) database, with a [stac-fastapi-pgstac](https://github.com/stac-utils/stac-fastapi-pgstac) serving them over HTTP.
Let's try the same tests against that server, except for the full scan case, which times out.

In [28]:
from labs_375 import STAC_FASTAPI_PGSTAC_URI

client = Client.open(STAC_FASTAPI_PGSTAC_URI)

with Timer() as timer:
    items = list(client.search(collections=["naip"], max_items=100).items_as_dicts())
    timer.report(items)

with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=100, limit=100).items_as_dicts()
    )
    timer.report(items)

with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=2000, limit=2000).items_as_dicts()
    )
    timer.report(items)

with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=5000, limit=5000).items_as_dicts()
    )
    timer.report(items)

Retrieved 100 in 1.01s (99.10 items/s)
Retrieved 100 in 0.21s (484.68 items/s)
Retrieved 2000 in 2.72s (734.03 items/s)
Retrieved 5000 in 6.96s (718.13 items/s)
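
To compare the two servers side by side, we can put the timings reported so far into a small table and compute the ratio of geoparquet time to pgstac time for each case. The numbers are copied from the outputs in this notebook; they come from single runs, so treat the ratios as rough.

```python
# (description, geoparquet seconds, pgstac seconds), copied from the outputs above.
timings = [
    ("100 items, default limit", 17.31, 1.01),
    ("100 items, limit=100", 1.69, 0.21),
    ("2000 items, limit=2000", 3.04, 2.72),
    ("5000 items, limit=5000", 4.96, 6.96),
]

ratios = {}
for label, geoparquet_s, pgstac_s in timings:
    # A ratio above 1 means pgstac was faster; below 1 means geoparquet was.
    ratios[label] = geoparquet_s / pgstac_s
    print(
        f"{label:<26} {geoparquet_s:>6.2f}s {pgstac_s:>6.2f}s  ratio {ratios[label]:.2f}"
    )
```

In these runs pgstac is well ahead for small pages, the two are close at 2000 items, and geoparquet pulls ahead at 5000.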


## Sorting

It looks like performance is about equal in the 2000-item case, so let's use that point to explore how sorting affects performance.
Our best guess is that **pgstac** will perform better, since it's a database!
Let's see.

In [34]:
geoparquet_client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
pgstac_client = Client.open(STAC_FASTAPI_PGSTAC_URI)

for sortby in ["datetime", "-datetime", "naip:year"]:
    with Timer() as timer:
        items = list(
            geoparquet_client.search(
                collections=["naip"], sortby=sortby, max_items=2000, limit=2000
            ).items_as_dicts()
        )
        print("geoparquet", sortby)
        timer.report(items)
    with Timer() as timer:
        items = list(
            pgstac_client.search(
                collections=["naip"], sortby=sortby, max_items=2000, limit=2000
            ).items_as_dicts()
        )
        print("pgstac", sortby)
        timer.report(items)

    print()

/Users/gadomski/Code/developmentseed/labs-375-stac-geoparquet-backend/.venv/lib/python3.12/site-packages/pystac_client/item_search.py:442: DoesNotConformTo: Server does not conform to SORT


geoparquet datetime
Retrieved 2000 in 3.03s (660.35 items/s)
pgstac datetime
Retrieved 2000 in 2.98s (672.20 items/s)

geoparquet -datetime
Retrieved 2000 in 2.78s (718.80 items/s)
pgstac -datetime
Retrieved 2000 in 2.90s (688.56 items/s)

geoparquet naip:year
Retrieved 2000 in 2.80s (714.57 items/s)
pgstac naip:year
Retrieved 2000 in 3.08s (650.32 items/s)
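
For reference, the `sortby` strings used here follow the STAC API sort convention: a leading `-` means descending, and a bare field name (or a leading `+`) means ascending. In POST search bodies this is expressed as objects with `field` and `direction` keys. The helper below is an illustrative sketch of that mapping; `parse_sortby` is a made-up name, not a pystac-client function.

```python
def parse_sortby(sortby):
    """Convert "-datetime"-style strings into sort-extension style objects."""
    if isinstance(sortby, str):
        sortby = sortby.split(",")
    parsed = []
    for entry in sortby:
        entry = entry.strip()
        if entry.startswith("-"):
            parsed.append({"field": entry[1:], "direction": "desc"})
        else:
            # A leading "+" is optional for ascending order.
            parsed.append({"field": entry.lstrip("+"), "direction": "asc"})
    return parsed


for value in ["datetime", "-datetime", "naip:year"]:
    print(value, "->", parse_sortby(value))
```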



## Fields

One of the selling points of (geo)parquet is that you don't need to fetch all of the data if you only need a few of the fields.
For example, if you're only visualizing the STAC items, you might just return the `id` and the `geometry`.
How do the two backends perform in this scenario?
Let's also test against the direct access (without the API server).

In [None]:
geoparquet_client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
pgstac_client = Client.open(STAC_FASTAPI_PGSTAC_URI)
duckdb_client = DuckdbClient()
duckdb_client.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)")

with Timer() as timer:
    items = list(
        geoparquet_client.search(
            collections=["naip"], fields=["id", "geometry"], max_items=2000, limit=2000
        ).items_as_dicts()
    )
    print("geoparquet")
    timer.report(items)

with Timer() as timer:
    items = list(
        pgstac_client.search(
            collections=["naip"], fields=["id", "geometry"], max_items=2000, limit=2000
        ).items_as_dicts()
    )
    print("pgstac")
    timer.report(items)

with Timer() as timer:
    items = duckdb_client.search(
        NAIP_GEOPARQUET_URI, fields=["id", "geometry"], max_items=2000, limit=2000
    )
    print("duckdb")
    timer.report(items)

/Users/gadomski/Code/developmentseed/labs-375-stac-geoparquet-backend/.venv/lib/python3.12/site-packages/pystac_client/item_search.py:480: DoesNotConformTo: Server does not conform to FIELDS


geoparquet
Retrieved 2000 in 2.97s (672.97 items/s)
pgstac
Retrieved 2000 in 1.60s (1251.75 items/s)
Retrieved 2000 in 1.12s (1778.71 items/s)
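
To illustrate what a `fields` request asks for, here is a client-side sketch that keeps only the requested keys of an item dictionary. A server would apply this before sending the response, and a parquet reader can skip unneeded columns entirely, which is where any savings would come from. `project`, `get_path`, and the sample item are hypothetical and only show the shape of the result.

```python
_MISSING = object()


def get_path(item, path):
    """Look up a dotted path like "properties.datetime" in a nested dict."""
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def project(item, include):
    """Return a new dict containing only the dotted paths listed in `include`."""
    result = {}
    for path in include:
        value = get_path(item, path)
        if value is _MISSING:
            continue  # silently skip fields the item doesn't have
        target = result
        *parents, leaf = path.split(".")
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return result


# A made-up item, just for demonstration.
item = {
    "type": "Feature",
    "id": "example-item",
    "geometry": {"type": "Point", "coordinates": [-105.0, 39.7]},
    "properties": {"datetime": "2021-01-01T00:00:00Z", "naip:year": "2021"},
    "assets": {"image": {"href": "https://example.com/image.tif"}},
}
print(project(item, ["id", "geometry"]))
print(project(item, ["id", "properties.datetime"]))
```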
