Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer bug in @crawlee/memory-storage triggering eventLoop overloads and Node heap crashes #1710

Closed
1 task
corford opened this issue Dec 7, 2022 · 2 comments · Fixed by #1734
Closed
1 task
Assignees
Labels
bug Something isn't working.

Comments

@corford
Copy link
Contributor

corford commented Dec 7, 2022

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/memory-storage

Issue description

The setRecord() function in @crawlee/memory-storage/resource-clients/key-value-store.js uses shapeshift to enforce the types of the record object. If a Buffer is passed through this, it can sometimes cause shapeshift's .parse() to hang for upwards of 10 seconds before returning (sometimes it blows the heap and kills Node).

This has caused havoc with many of our spiders after migrating to Crawlee v3.1.3 (from Apify 2.3.2). One symptom is a pegged eventLoop, the other is heap allocation errors that eventually kill the Node main process.

Many but not all of our spiders were affected which leads us to believe it depends heavily on the raw contents of the Buffer as to what pathological behaviour shapeshift exhibits when trying to parse it. Preventing shapeshift from parsing Buffers fixed all the issues immediately.

Our workaround looks like this:

    async setRecord(record: storage.KeyValueStoreRecord): Promise {
        if (!Buffer.isBuffer(record?.value)) {
            s.object({
                key: s.string.lengthGreaterThan(0),
                value: s.union(s.null, s.string, s.number, s.object({}).passthrough),
                contentType: s.string.lengthGreaterThan(0).optional,
            }).parse(record);
        }

        // Check by id
        const existingStoreById = await findOrCacheKeyValueStoreByPossibleId(this.client, this.name ?? this.id);

        if (!existingStoreById) {
            this.throwOnNonExisting(StorageTypes.KeyValueStore);
        }

        const { key } = record;
        let { value, contentType } = record;

        const valueIsStream = Buffer.isBuffer(value) ? false : isStream(value);

       ...

Original source is here:

async setRecord(record: storage.KeyValueStoreRecord): Promise<void> {

Code sample

// Note the `body` and `contentType` vars come from https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext

import { v4 as uuidv4 } from 'uuid';

const encoding = Buffer.isBuffer(body) && contentType?.encoding
      ? contentType.encoding
      : 'utf8';

const resourceBuffer = Buffer.isBuffer(body)
    ? body
    : Buffer.from(body, encoding);

const resourceWal = await KeyValueStore.open('wal-resource');
await resourceWal.setValue(
    uuidv4(),
    resourceBuffer,
    { contentType: 'application/octet-stream' },
);

Package version

@crawlee/memory-storage@3.1.2

Node.js version

18.11.0

Operating system

Ubuntu focal

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@corford corford added the bug Something isn't working. label Dec 7, 2022
@B4nan
Copy link
Member

B4nan commented Dec 8, 2022

cc @vladfrangu I guess something to fix in shapeshift?

@corford
Copy link
Contributor Author

corford commented Dec 8, 2022

Looking at the shapeshift README, I couldn't see a single mention of Buffer anywhere, suggesting it's not a supported or anticipated type. So even if the performance issue was solved, not sure how Buffers would be treated by s.union(s.null, s.string, s.number, s.object({}).passthrough) ?

B4nan pushed a commit that referenced this issue Jan 12, 2023
Closes #1732 
Closes #1710 

Simple fix really, tell shapeshift inputs can be buffers too (I'll also
see if we can add a generic "just check that its an object" in
shapeshift for times like these)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants