feat: implements async iterators #3352

Merged
l2ysho merged 34 commits into master from
3338-implement-the-missing-iterators-in-crawleememory-storage
Feb 6, 2026
Conversation

Contributor

@l2ysho l2ysho commented Jan 19, 2026

PR Summary: Implement Async Iterators for Storage Classes

Overview

This PR implements async iterators for KeyValueStore and Dataset storage classes, allowing users to iterate over storage items using for await...of loops. The implementation follows the pattern established in apify-client-js.

Changes

@crawlee/types (packages/types/src/storages.ts)

  • Updated DatasetClient.listItems() return type to Partial<AsyncIterable<Data>> & Promise<PaginatedList<Data>>
  • Added DatasetClient.listEntries() optional method returning AsyncIterable<[number, Data]> & Promise<PaginatedList<[number, Data]>>
  • Updated KeyValueStoreClient.listKeys() return type to Partial<AsyncIterable<KeyValueStoreItemData>> & Promise<KeyValueStoreClientListData>
  • Added KeyValueStoreClient.keys() optional method returning AsyncIterable<string> & Promise<KeyValueStoreClientListData>
  • Added KeyValueStoreClient.values() optional method returning AsyncIterable<unknown> & Promise<unknown[]>
  • Added KeyValueStoreClient.entries() optional method returning AsyncIterable<[string, unknown]> & Promise<[string, unknown][]>

@crawlee/memory-storage

packages/memory-storage/src/utils.ts

  • Added createPaginatedList() helper for offset-based pagination (Dataset)
  • Added createPaginatedEntryList() helper for offset-based pagination with index-value entries (Dataset)
  • Added createKeyList() helper for cursor-based pagination (KeyValueStore)
  • Added createKeyStringList() helper for cursor-based pagination yielding key strings
  • All helpers return a hybrid object that can be awaited directly OR iterated with for await...of
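
A minimal sketch of how such a hybrid object could be built, assuming offset-based paging. The `Page` shape, the `createHybridList` name and the `fetchPage` callback are hypothetical and are not the helpers added in this PR; the idea is only to show a promise that also carries a `Symbol.asyncIterator`.

```typescript
// Hypothetical page shape; the real PaginatedList has more fields.
interface Page<T> {
    items: T[];
    total: number;
}

type Hybrid<T> = AsyncIterable<T> & Promise<Page<T>>;

// Wraps a page-fetching callback so the result can either be awaited
// (resolves to the first page) or iterated with `for await` (all pages).
function createHybridList<T>(fetchPage: (offset: number) => Promise<Page<T>>): Hybrid<T> {
    const firstPage = fetchPage(0);

    async function* iterate(): AsyncGenerator<T> {
        // Reuse the already requested first page instead of fetching it again.
        let page = await firstPage;
        let offset = 0;
        while (page.items.length > 0) {
            yield* page.items;
            offset += page.items.length;
            if (offset >= page.total) break;
            page = await fetchPage(offset);
        }
    }

    // Attach the async iterator to the promise itself.
    return Object.defineProperty(firstPage, Symbol.asyncIterator, {
        value: iterate,
    }) as Hybrid<T>;
}
```

With this shape, `await createHybridList(fetchPage)` keeps returning a single page, while `for await (const item of createHybridList(fetchPage))` walks every page.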

packages/memory-storage/src/resource-clients/dataset.ts

  • Refactored listItems() to use createPaginatedList helper
  • Added listEntries() method using createPaginatedEntryList helper
  • Added private listItemsPage() method for fetching individual pages

packages/memory-storage/src/resource-clients/key-value-store.ts

  • Refactored listKeys() to use createKeyList helper
  • Added keys() method using createKeyStringList helper
  • Added values() method yielding record values (not full records)
  • Added entries() method yielding [key, value] tuples
  • Added private listKeysPage() method for fetching individual pages

@crawlee/core

packages/core/src/storages/key_value_store.ts

  • Added keys(options?) - async generator yielding all keys
  • Added values<T>(options?) - returns hybrid AsyncIterable<T> & Promise<T[]> yielding all values
  • Added entries<T>(options?) - returns hybrid AsyncIterable<[string, T]> & Promise<[string, T][]> yielding [key, value] tuples
  • Added [Symbol.asyncIterator]() - makes KeyValueStore directly iterable (yields entries)
  • Added KeyValueStoreIteratorOptions interface

packages/core/src/storages/dataset.ts

  • Added values(options?) - returns hybrid AsyncIterable<Data> & Promise<PaginatedList<Data>> yielding all items
  • Added entries(options?) - returns hybrid AsyncIterable<[number, Data]> & Promise<PaginatedList<[number, Data]>> yielding [index, item] tuples
  • Added [Symbol.asyncIterator]() - makes Dataset directly iterable (yields items)
  • Added DatasetIteratorOptions interface

Tests

packages/memory-storage/test/async-iteration.test.ts (new file)

  • Comprehensive tests covering memory-storage async iteration for Dataset and KeyValueStore
  • Tests for listItems, listKeys, keys, values, and entries methods
  • Tests for pagination options (limit, offset, exclusiveStartKey, prefix, desc)

test/core/storages/key_value_store.test.ts

  • New tests for KeyValueStore async iterators (keys, values, entries, Symbol.asyncIterator)

test/core/storages/dataset.test.ts

  • New tests for Dataset async iterators (values, entries, Symbol.asyncIterator)

Usage Examples

// KeyValueStore iteration
const kvs = await KeyValueStore.open();

for await (const key of kvs.keys()) {
    console.log(key);
}

for await (const value of kvs.values()) {
    console.log(value);
}

for await (const [key, value] of kvs.entries()) {
    console.log(key, value);
}

// Direct iteration (yields entries)
for await (const [key, value] of kvs) {
    console.log(key, value);
}

// Dataset iteration
const dataset = await Dataset.open();

for await (const item of dataset.values()) {
    console.log(item);
}

for await (const [index, item] of dataset.entries()) {
    console.log(index, item);
}

// Direct iteration (yields items)
for await (const item of dataset) {
    console.log(item);
}

// Memory-storage client level (backward compatible)
const result = await client.listItems(); // still works
for await (const item of client.listItems()) { // also works
    console.log(item);
}

Backward Compatibility

All existing code continues to work unchanged. The listItems() and listKeys() methods can still be awaited directly to get the paginated response object.

Closes #3338

@l2ysho l2ysho linked an issue Jan 19, 2026 that may be closed by this pull request
@github-actions github-actions bot added this to the 132nd sprint - Tooling team milestone Jan 19, 2026
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 19, 2026
@l2ysho l2ysho requested review from B4nan, barjin and janbuchar January 19, 2026 09:42
let currentPage = await firstPagePromise;
yield* currentPage.items;

while (currentPage.isTruncated && currentPage.nextExclusiveStartKey) {
Member

isTruncated is true until the very end of the KVS.

This has consequences - e.g. the following snippet will always return 10 items (discarding the limit: 2 param).

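    import { MemoryStorage } from '@crawlee/memory-storage';

    const storage = new MemoryStorage().keyValueStore('test');

    // Store ten records.
    for (let i = 0; i < 10; i++) {
        await storage.setRecord({ key: `key-${i}`, value: `value-${i}` });
    }

    // Expected: two items. Observed: all ten.
    for await (const item of storage.listKeys({ limit: 2 })) {
        console.log(item);
    }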

edit: note that apify-client respects the limit param, only printing two items in the example above.

Contributor Author

This is a great catch. I added a test for this scenario and fixed it in 828f5cc.
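
A minimal sketch of cursor-based iteration that stops at the requested limit, assuming a page shape with the `isTruncated` and `nextExclusiveStartKey` fields named in the review. `KeyPage`, `ListKeysPage` and `iterateKeys` are hypothetical names for illustration; this is not the code from the fix commit.

```typescript
// Hypothetical page shape modelled on the fields named in the review.
interface KeyPage {
    items: { key: string }[];
    isTruncated: boolean;
    nextExclusiveStartKey?: string;
}

type ListKeysPage = (opts: { exclusiveStartKey?: string; limit?: number }) => Promise<KeyPage>;

// Walks the pages until the store is exhausted or `limit` items were yielded.
async function* iterateKeys(listKeysPage: ListKeysPage, limit = Infinity): AsyncGenerator<{ key: string }> {
    let remaining = limit;
    let exclusiveStartKey: string | undefined;

    while (remaining > 0) {
        const page = await listKeysPage({
            exclusiveStartKey,
            // Never ask for more than what is still needed.
            limit: Number.isFinite(remaining) ? remaining : undefined,
        });
        for (const item of page.items) {
            if (remaining <= 0) return;
            yield item;
            remaining--;
        }
        // `isTruncated` alone is not a stop condition for a limited listing,
        // so the remaining count is checked by the loop header as well.
        if (!page.isTruncated || !page.nextExclusiveStartKey) return;
        exclusiveStartKey = page.nextExclusiveStartKey;
    }
}
```

The key point is that the generator tracks how many items are still owed to the caller, rather than relying only on whether the store has more keys.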

*
* @param options Options for the iteration.
*/
async *keys(options: KeyValueStoreIteratorOptions = {}): AsyncGenerator<string, void, undefined> {
Member

Should we make all the new methods also awaitable? So e.g. this works?

const keys = await kvs.keys();
// keys == ['a', 'b', 'c', ...];

Array.fromAsync is a fairly recent addition to Node and might not be the best DX. On the other hand, the AsyncIterable Promise might also be an unexpected twist for many.

wdyt? :)

Contributor Author

Since Array.fromAsync is a thing, I would follow the native Map and Set approach, where these functions always return an iterator 🤔 But I don't have a problem with the AsyncIterable Promise way.

Member

I also don't have a horse in this race, let's hear others' ideas then 👍

One note is that Set and Map methods all return synchronous iterators, so you can use spreads ([...new Set().keys()]), map, reduce etc. - users won't even notice they are not using native arrays. AsyncIterators are understandably much more limited, so the DX might suffer.
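
To illustrate the trade-off discussed here, a small sketch of what a caller has to do when a method only returns an async iterator and array helpers are wanted. `collect` and `fakeKeys` are hypothetical names; `fakeKeys` merely stands in for something like `kvs.keys()`.

```typescript
// Collects every item of an async iterable into an array.
// Similar in spirit to Array.fromAsync, for runtimes that lack it.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
    const out: T[] = [];
    for await (const item of source) out.push(item);
    return out;
}

// Hypothetical stand-in for an async key iterator.
async function* fakeKeys(): AsyncGenerator<string> {
    yield 'a';
    yield 'b';
    yield 'c';
}

// Once collected, the usual array helpers are available again.
async function upperCaseKeys(): Promise<string[]> {
    const keys = await collect(fakeKeys());
    return keys.map((key) => key.toUpperCase());
}
```

An awaitable return value would fold the `collect` step into the API itself, at the cost of holding all results in memory.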

Contributor

I'd probably support both AsyncIterable and Promise so that we provide a more consistent API. No strong opinion though.

Contributor Author

@B4nan any chance you have strong opinion on this?

Contributor Author

Speed is one part of the issue; the other is that users end up with all the results in memory 🤔 Frankly, I can't find a problem I would solve with this solution (await entries()).

Member

@barjin barjin Jan 30, 2026

users end with results in memory

I think this is the expected outcome if they call KVS.entries(), and e.g. hitting the potential memory limit is understandable for the end-user.

Ofc I also like the iterative approach better (and we still should educate the user through the docs), but I would still like to have the Promise option, just to have the API predictable among the storages.

Contributor

Yeah, I can imagine a legit use case for the Promise option, too - such as loading a bunch of small JSON files and doing some aggregations on them.

Contributor Author

OK, let's do it if it is valuable for us. With my current context, I can't evaluate this.

Contributor Author

updated

Member

barjin commented Jan 19, 2026

Sorry, I dropped the main comment for the review 😅

I'm mostly pro-merging here (good job!), please treat most of the comments above as discussion points rather than directly actionable commands.

Thanks!

Contributor

@janbuchar janbuchar left a comment

Looks mostly fine 🙂 The PR title feels a bit off - the changes are not limited to memory-storage.

*
* @param options Options for the iteration.
*/
async *values<T = unknown>(options: KeyValueStoreIteratorOptions = {}): AsyncGenerator<T, void, undefined> {
Contributor

The T type parameter is just trust-me-bro-typing. I believe we should improve this as a part of #3082

@l2ysho l2ysho changed the title from "feat(memory-storage): implements async iterators" to "feat(): implements async iterators" Jan 21, 2026
@B4nan B4nan changed the title from "feat(): implements async iterators" to "feat: implements async iterators" Jan 21, 2026
Contributor

@janbuchar janbuchar left a comment

LGTM

@l2ysho l2ysho requested a review from janbuchar January 30, 2026 14:17
Comment on lines 89 to 92
listItems(options?: DatasetClientListOptions): AsyncIterable<Data> & Promise<PaginatedList<Data>>;
listEntries(
options?: DatasetClientListOptions,
): AsyncIterable<[number, Data]> & Promise<PaginatedList<[number, Data]>>;
Contributor

@B4nan now that I think of it, isn't this a BC break? Not that I'm aware of any 3rd party storage implementations, but still, this would require them to implement additional methods.

Member

Good point, we have at least the old sql storage, which I think some people still use. So let's make those optional to be sure?

Contributor

Sounds good to me. And an exception in the "storage frontends" if the method is not implemented?

Contributor Author

@l2ysho l2ysho Jan 30, 2026

The new functions are optional now. What do you mean by "storage frontends"?

Member

Yeah, if we can't use a simple fallback, let's throw.

Contributor Author

Do a runtime check and throw in resource-clients? I am probably not following.

Contributor

storage frontends means the Dataset, KeyValueStore and RequestQueue classes from @crawlee/core, i.e., the ones that delegate to a resource client which may or may not have the new, optional iteration methods implemented.

Contributor Author

@B4nan @janbuchar something like this: 1464b60?

Contributor

Exactly. Just a bunch of nits:

  1. it's a method, not a function
  2. missing "the" before " function"
  3. it's good to delimit identifiers so that the reader can easily distinguish them from natural language - "missing the keys method" would be much better
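
A minimal sketch of the runtime check discussed in this thread, assuming a storage frontend that delegates to a resource client whose iteration methods are optional. `MinimalKvsClient`, `KvsFrontend` and the exact error wording are hypothetical and are not taken from the referenced commit; the message only follows the nits above by calling `keys` a method and delimiting the identifier.

```typescript
// Hypothetical minimal client interface: `keys` is optional, as in this PR.
interface MinimalKvsClient {
    keys?: () => AsyncIterable<string>;
}

// Hypothetical frontend that delegates to the client.
class KvsFrontend {
    private readonly client: MinimalKvsClient;

    constructor(client: MinimalKvsClient) {
        this.client = client;
    }

    keys(): AsyncIterable<string> {
        // Legacy clients may not implement the optional method, so fail loudly.
        if (typeof this.client.keys !== 'function') {
            throw new Error('Resource client is missing the "keys" method.');
        }
        return this.client.keys();
    }
}
```

Clients that implement `keys` are used as-is; older third-party clients keep compiling against the interface and only fail when the new frontend method is actually called.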

Contributor Author

updated

this.listItemsPage({
desc,
offset: pageOffset,
limit: Math.min(pageLimit, LIST_ITEMS_LIMIT),
Contributor

Maybe it's too late for this, but do we really want to cap the limit without telling the user? In their place, I would prefer the library to tell me, instead of quietly changing behavior.

Contributor Author

The default limit was there before as well, and it is quite a number: 999_999_999_999

Contributor

that leaves me wondering what the purpose of this check even is... but it's not important enough to be resolved here and now

@l2ysho l2ysho requested a review from janbuchar February 4, 2026 07:41
delete(): Promise<void>;
downloadItems(...args: unknown[]): Promise<Buffer>;
- listItems(options?: DatasetClientListOptions): Promise<PaginatedList<Data>>;
+ listItems(options?: DatasetClientListOptions): AsyncIterable<Data> & Promise<PaginatedList<Data>>;
Contributor

@janbuchar janbuchar Feb 4, 2026

In a similar vein to #3352 (comment), this is breaking. Can we use Partial<AsyncIterable<Data>> in the interface and use the isAsyncIterable helper to check the return value in the storage frontends?

Contributor Author

Comment on lines 629 to 631
if (!isAsyncIterable(result)) {
    throw new Error('Resource client "listItems" method does not return an async iterable.');
}
Contributor

I'm afraid that achieving full BC will be trickier than that - you'll need to return something that will behave normally when await-ed and throw when async-iterated.

Contributor Author

🤔 Any specific idea how to do so? I am getting a feeling that these combined Promise + Iterator types would introduce these BC problems everywhere.

Contributor

If the result is not AsyncIterable, just Object.defineProperty a Symbol.asyncIterator that throws an error. Not sure about the exact symbol, perhaps it's the other one.

Well, that's my idea of it, anyways. Let's see if that works.

Member

This is about some (legacy) storage clients not having the async iterable interface, right?

Any chance we can add a compat layer here in Crawlee, something like:

if (!(Symbol.asyncIterator in maybeAsyncIterable)) {
    Object.defineProperty(maybeAsyncIterable, Symbol.asyncIterator, {
        value: async function* () {
            yield* await maybeAsyncIterable;
        }
    });
}

and provide all Crawlee methods to everyone, regardless of the storage implementation?

Or am I understanding the problem wrong?

Contributor Author

@barjin I think you are right, but I moved all this Object.defineProperty out of the storage frontends to the memory-storage (so everything is handled there) and now I need to redefine the object here anyway. 🤔 But I can't see any other option.

Contributor Author

@l2ysho l2ysho Feb 4, 2026

Got it (515dd77), but the iterator returns only the first page, which is not the same as the iterator returned from memory-storage. I can reimplement it with something like

 Object.defineProperty(result, Symbol.asyncIterator, {
        async *value() {
            let offset = opts.offset ?? 0;
            const limit = opts.limit ?? DATASET_ITERATORS_DEFAULT_LIMIT;
            while (true) {
                const page = await client.listItems({ ...opts, offset, limit });
                yield* page.items;
                if (offset + page.count >= page.total) break;
                offset += limit;
            }
        },
    });

but then we are back at the solution from 10 commits ago, with the iterator in the storage frontend (and we are doing the same thing twice in 2 different places).

Member

Right, nice catch 👍 maybe BC for the new iterable interface is not
really necessary (given the near-100% market share that memory-storage
and apify have) and we can throw from the Symbol.asyncIterator, if
it's not specified by the storage backend.

Sorry about the detour, my bad here.

Contributor Author

@barjin no problem I guess 👍
@janbuchar what do you say?

  1. keep throwing the error
  2. return an iterator for the first page (current state, but different behaviour than the iterator from memory storage)
  3. reimplement the iterator again in the frontend for this method (but the same logic is applied in memory storage)
  4. some other idea I cannot see right now

Option 1 seems to me to add less noise to the code and flow; if we break BC for ~1% of users, it is not a big deal.

Contributor

1, but throw it only in Symbol.asyncIterator
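
A minimal sketch of that option, assuming a legacy client whose `listItems()` returns a plain promise: awaiting keeps working, and only async iteration throws. The `withThrowingIterator` name is hypothetical; the error text reuses the message quoted earlier in this thread.

```typescript
// Leaves async-iterable results untouched. For anything else, defines a
// Symbol.asyncIterator that throws, so only `for await` fails.
function withThrowingIterator<T extends object>(result: T, methodName: string): T & AsyncIterable<never> {
    if (!(Symbol.asyncIterator in result)) {
        Object.defineProperty(result, Symbol.asyncIterator, {
            value(): AsyncIterator<never> {
                throw new Error(`Resource client "${methodName}" method does not return an async iterable.`);
            },
        });
    }
    return result as T & AsyncIterable<never>;
}
```

Because the property is defined on the returned promise itself, existing `await client.listItems()` call sites are unaffected.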

Contributor Author

@l2ysho l2ysho requested a review from janbuchar February 4, 2026 15:47
Member

@barjin barjin left a comment

Thank you @l2ysho !

I don't have many more points to discuss, except for the double fetching in memory-storage. Once that's sorted out (or discussed that it's wontfix), I'm in favor of merging.

Cheers!

Comment on lines 205 to 224
const firstPagePromise = (async () => {
    const firstPageKeys = await self.keys(options);
    const entries: [string, unknown][] = [];
    for (const item of firstPageKeys.items) {
        const record = await self.getRecord(item.key);
        if (record) {
            entries.push([item.key, record.value]);
        }
    }
    return entries;
})();

async function* asyncGenerator(): AsyncGenerator<[string, unknown]> {
    for await (const key of self.keys(options)) {
        const record = await self.getRecord(key);
        if (record) {
            yield [key, record.value];
        }
    }
}
Member

This will fetch each KVS record two times if we use the async iterator. The firstPagePromise IIFE is executing immediately, but its result is unused if we for await the returned object (and the asyncGenerator will fetch all the results again).
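
One way to avoid the eager work, sketched under assumptions: return a lazy thenable instead of a real Promise, so nothing is fetched until the object is either awaited or iterated. This differs from the `AsyncIterable & Promise` types used in this PR, and `RecordSource`, `Entry` and `lazyEntries` are hypothetical names.

```typescript
type Entry = [string, unknown];

// Hypothetical store interface for this sketch.
interface RecordSource {
    listKeys(): Promise<string[]>;
    getRecord(key: string): Promise<{ value: unknown } | undefined>;
}

function lazyEntries(source: RecordSource): AsyncIterable<Entry> & PromiseLike<Entry[]> {
    // Fetches records one by one, only while being iterated.
    async function* iterate(): AsyncGenerator<Entry> {
        for (const key of await source.listKeys()) {
            const record = await source.getRecord(key);
            if (record) yield [key, record.value];
        }
    }

    // Drains the generator into an array; used only when awaited.
    async function collectAll(): Promise<Entry[]> {
        const all: Entry[] = [];
        for await (const entry of iterate()) all.push(entry);
        return all;
    }

    return {
        [Symbol.asyncIterator]: iterate,
        // `await` calls `then`, which is the first point where fetching starts.
        then<R1 = Entry[], R2 = never>(
            onFulfilled?: ((value: Entry[]) => R1 | PromiseLike<R1>) | null,
            onRejected?: ((reason: any) => R2 | PromiseLike<R2>) | null,
        ): PromiseLike<R1 | R2> {
            return collectAll().then(onFulfilled, onRejected);
        },
    };
}
```

The trade-off is that a thenable has no `catch` or `finally`, so code relying on the full Promise interface would need a different approach.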

Contributor Author

updated

Comment on lines 201 to 202
// eslint-disable-next-line @typescript-eslint/no-this-alias
const self = this;
Member

ESLint knows (usually) better, why this? 😅

If it's for the this binding in async function* asyncGenerator, can't we use value: asyncGenerator.bind(this) instead?
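
A small sketch of the `bind(this)` suggestion, using a hypothetical `Counter` class rather than the actual storage client: the generator declares its `this` type and is bound to the instance, so no `self` alias is needed.

```typescript
// Hypothetical class used only to demonstrate the binding.
class Counter {
    private readonly max = 3;

    numbers(): AsyncIterable<number> {
        // The `this` parameter is type-only and erased at runtime.
        async function* generate(this: Counter): AsyncGenerator<number> {
            for (let i = 1; i <= this.max; i++) yield i;
        }

        // bind(this) fixes the receiver, replacing `const self = this`.
        return { [Symbol.asyncIterator]: generate.bind(this) };
    }
}
```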

Contributor Author

updated

Member

@B4nan B4nan left a comment

I mostly checked the tests and it looks good to me. Let's address the double fetching Jindra mentioned, otherwise let's ship it finally :]

Comment on lines 300 to 301
test('respects limit option when iterating', async () => {
const values: unknown[] = [];
Member

Why drop the tests? The limit option should still work as before this commit, right?


// Yield first page entries
for (const item of firstPageKeys.items) {
const record = await getRecord(item.key);
Member

This will still double-fetch each record's value on for await... but since this is memory-storage, let's merge this now and keep optimizations for later.

Can you please create an issue about this once merged @l2ysho ?

@l2ysho l2ysho merged commit 7f7a4ab into master Feb 6, 2026
9 checks passed
@l2ysho l2ysho deleted the 3338-implement-the-missing-iterators-in-crawleememory-storage branch February 6, 2026 13:18
Development

Successfully merging this pull request may close these issues.

Implement the missing iterators in @crawlee/memory-storage

4 participants