# Kata 0: Full scan

For small and medium-sized datasets (say, fewer than a million items), we sometimes just want to get _everything_.
This might be true for large datasets as well, especially in a train-a-model workflow.
Here, we build up to a full scan of ten thousand [NAIP](https://naip-usdaonline.hub.arcgis.com/) items over Colorado.

## Baby steps

First, though, we want to explore how our API's performance varies with page size.
Let's start with the default page size (10), which is what we get when we don't pass `limit`.

In [3]:
from pystac_client import Client

from labs_375 import STAC_FASTAPI_GEOPARQUET_URI, Timer
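`Timer` comes from this lab's helper module, and its source isn't shown on this page. As a rough idea of what such a helper could look like, here is a minimal sketch. `SketchTimer` and `format_report` are hypothetical names, and the real `labs_375.Timer` may well be implemented differently; the only things taken from this page are the `with Timer() as timer:` / `timer.report(items)` usage and the shape of the printed line.

```python
import time


def format_report(count: int, elapsed: float) -> str:
    """Format a count and a duration like the cell outputs on this page."""
    return f"Retrieved {count} in {elapsed:.2f}s ({count / elapsed:.2f} items/s)"


class SketchTimer:
    """Hypothetical stand-in for labs_375.Timer (assumption, not the real class)."""

    def __enter__(self) -> "SketchTimer":
        # Start the clock when the with-block is entered.
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc_info) -> None:
        # Nothing to clean up; returning None lets exceptions propagate.
        return None

    def report(self, items) -> str:
        # report() is called inside the with-block, so measure elapsed time now.
        elapsed = time.perf_counter() - self.start
        line = format_report(len(items), elapsed)
        print(line)
        return line
```

Because `report()` reads the clock itself, it works when called before the `with` block exits, which is how the cells here use it.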

In [4]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)

with Timer() as timer:
    items = list(client.search(collections=["naip"], max_items=100).items_as_dicts())
    timer.report(items)

Retrieved 100 in 32.19s (3.11 items/s)


That's not excellent.
Let's try bumping the page size up to 100, so that all 100 items fit in a single page.

In [5]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=100, limit=100).items_as_dicts()
    )
    timer.report(items)

Retrieved 100 in 3.25s (30.73 items/s)


Much better: roughly ten times the throughput.
Lots of little requests are much worse, at least in this full-scan case.
Is there a maximum page size?
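A quick back-of-the-envelope check helps explain the two timings so far. The sketch below assumes the client issues exactly `ceil(max_items / limit)` requests and that the first run used the default page size of 10 mentioned earlier; neither was measured directly, so treat the per-request figures as estimates.

```python
import math


def request_count(max_items: int, limit: int) -> int:
    """Pages needed to return max_items at a given page size (assumed model)."""
    return math.ceil(max_items / limit)


# Timings copied from the two runs so far.
runs = [
    {"limit": 10, "max_items": 100, "seconds": 32.19},
    {"limit": 100, "max_items": 100, "seconds": 3.25},
]

for run in runs:
    pages = request_count(run["max_items"], run["limit"])
    per_request = run["seconds"] / pages
    print(f"limit={run['limit']}: {pages} request(s), {per_request:.2f}s per request")
```

Under those assumptions both runs come out at a little over three seconds per request, which suggests the cost here is dominated by a fixed per-request overhead rather than by the number of items in each page.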

In [6]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
with Timer() as timer:
    items = list(
        client.search(collections=["naip"], max_items=2000, limit=2000).items_as_dicts()
    )
    timer.report(items)

Retrieved 2000 in 4.63s (432.17 items/s)


Whoa, no.
Our **stac-fastapi-geoparquet** server, at least for large unfiltered requests, wants as large a page size as possible.
Let's try that for everything.

In [7]:
client = Client.open(STAC_FASTAPI_GEOPARQUET_URI)
with Timer() as timer:
    items = list(
        client.search(
            collections=["naip"], max_items=10000, limit=10000
        ).items_as_dicts()
    )
    timer.report(items)

Retrieved 10000 in 10.64s (939.65 items/s)
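The four cells above repeat the same pattern with a different page size each time. One way to fold them into a single sweep is sketched below. `measure`, `sweep`, and `naip_fetch` are hypothetical helpers, not part of **pystac-client** or this lab's module; `naip_fetch` simply wraps the same `client.search(...)` call used in the cells above.

```python
import time
from typing import Any, Callable, Iterable


def measure(fetch: Callable[[int], Iterable[Any]], limit: int) -> dict:
    """Time one full retrieval at a given page size."""
    start = time.perf_counter()
    count = sum(1 for _ in fetch(limit))  # consume the iterator to force every request
    seconds = time.perf_counter() - start
    rate = count / seconds if seconds > 0 else float("inf")
    return {"limit": limit, "count": count, "seconds": seconds, "items_per_second": rate}


def sweep(fetch: Callable[[int], Iterable[Any]], limits: Iterable[int]) -> list[dict]:
    """Run measure() once per page size, in order."""
    return [measure(fetch, limit) for limit in limits]


def naip_fetch(client: Any, limit: int) -> Iterable[dict]:
    """Same search as the cells above, with max_items equal to the page size."""
    return client.search(
        collections=["naip"], max_items=limit, limit=limit
    ).items_as_dicts()
```

With an open client, `sweep(lambda limit: naip_fetch(client, limit), [100, 2000, 10000])` would reproduce the last three runs. Each row retrieves a different number of items, so `items_per_second` is the column to compare, not `seconds`.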


One neat feature of **stac-geoparquet** is that we can query it directly using **DuckDB** from our client.
[stacrs](https://stac-utils.github.io/stacrs/), imported below under its newer name `rustac`, is a relatively new Python library that can do that.
What happens when we query our **stac-geoparquet** file in an S3 bucket directly?

!!! note "You need to configure your AWS account, either with access to the bucket via the eoAPI sub-account or with requester pays"

In [None]:
from rustac import DuckdbClient

from labs_375 import NAIP_GEOPARQUET_URI

client = DuckdbClient()
with Timer() as timer:
    items = client.search(
        NAIP_GEOPARQUET_URI,
    )["features"]
    timer.report(items)

Retrieved 10000 in 1.60s (6239.00 items/s)
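To put the five runs side by side, the sketch below recomputes throughput from the item counts and timings printed on this page. The labels are informal descriptions added here, and the recomputed rates differ slightly from the printed ones because the printed durations are rounded to two decimals.

```python
# (label, items retrieved, seconds) copied from the cell outputs on this page.
runs = [
    ("API, default page size", 100, 32.19),
    ("API, limit=100", 100, 3.25),
    ("API, limit=2000", 2000, 4.63),
    ("API, limit=10000", 10000, 10.64),
    ("DuckDB, direct read", 10000, 1.60),
]


def throughput(count: int, seconds: float) -> float:
    """Items retrieved per second."""
    return count / seconds


baseline = throughput(runs[0][1], runs[0][2])
summary = []
for label, count, seconds in runs:
    rate = throughput(count, seconds)
    summary.append((label, rate, rate / baseline))
    print(f"{label:<24} {rate:>9.1f} items/s  {rate / baseline:>7.1f}x")

# The last two runs fetch the same 10,000 items, so their times compare directly.
duckdb_vs_api = 10.64 / 1.60
print(f"Direct read vs. API at 10,000 items: {duckdb_vs_api:.2f}x faster")
```

For the same 10,000 items, reading the file directly with DuckDB is between six and seven times faster than going through the API at its largest page size.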
