
ScanCommand with no Limit set only ever reads one page of data #6043

Open

mn-prp opened this issue May 1, 2024 · 2 comments
Labels: bug (This issue is a bug.) · p3 (This is a minor priority issue)

mn-prp commented May 1, 2024

Describe the bug

Using ScanCommand without a Limit results in only one page of data being read, with no LastEvaluatedKey returned to resume at the next page. The documentation clearly states that when the 1 MB page limit is hit, the response should include a LastEvaluatedKey.

I am (now) aware that this functionality can be achieved using paginateScan, but I think it is still a bug in ScanCommand as a standalone command.

SDK version number

@aws-sdk/lib-dynamodb@npm:3.565.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

Node v20.12.0

Reproduction Steps

Here is what I wrote. Note from my comment below that adding or removing the Limit changes the behavior of this code, even though I do not expect it to.

const scanFrom = (lastEvaluatedKey?: Record<string, any>) =>
  client.send(
    new ScanCommand({
      TableName: this.tableName,
      FilterExpression: 'begins_with(sk, :sk)',
      ExpressionAttributeValues: {
        ':sk': sk
      },
      ExclusiveStartKey: lastEvaluatedKey,
      // Limit: 1000 <-- if we uncomment this line and set it to anything other than 0 or undefined,
      // it works; otherwise only one page of data is returned, with no LastEvaluatedKey
    })
  )

let results: ScanCommandOutput['Items'] = []

let page = await scanFrom(undefined)

while (page.LastEvaluatedKey !== undefined) {
  results = results.concat(page.Items ?? [])
  page = await scanFrom(page.LastEvaluatedKey)
}

return results

Observed Behavior

As the comment in the snippet indicates, adding or removing Limit changes whether a LastEvaluatedKey is returned; when Limit is missing, we only ever get one page of data (not the complete scan results after paginating).

Expected Behavior

This is the behavior I was trying to achieve (i.e., get everything from the table matching the filter expression), but using the raw ScanCommand instead:

const paginator = paginateScan(
  { client },
  {
    TableName: this.tableName,
    FilterExpression: 'begins_with(sk, :sk)',
    ExpressionAttributeValues: {
      ':sk': sk
    }
  }
)

let results: ScanCommandOutput['Items'] = []
for await (const page of paginator) {
  results = results.concat(page.Items ?? [])
}

return results

Possible Solution

No response

Additional Information/Context

No response

@mn-prp mn-prp added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 1, 2024
@aBurmeseDev aBurmeseDev self-assigned this May 7, 2024
@aBurmeseDev aBurmeseDev added investigating Issue is being investigated and/or work is in progress to resolve the issue. and removed needs-triage This issue or PR still needs to be triaged. labels May 7, 2024
aBurmeseDev (Member) commented:

Hi @mn-prp - thanks for reaching out.

I'm not able to reproduce this on my end and wanted to verify a few things with you. The expected behavior is that Limit is used for non-paginated requests, whereas pageSize can be used when you want to specify the number of results per page.

If you'd like to limit the total returned results, here's what I'd do:

const paginator = paginateScan(
  {
    client: DDBClient,
    pageSize: 1,
  },
  {
    TableName: tableName,
    FilterExpression: 'begins_with(sk, :sk)',
    ExpressionAttributeValues: {
      ':sk': sk
    }
  }
);

const LIMIT = 2;
let count = 0;

for await (const page of paginator) {
  if (++count >= LIMIT) {
    break;
  }
}

Hope that makes sense but let me know if you have any further questions.
Best,
John

@aBurmeseDev aBurmeseDev added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p3 This is a minor priority issue and removed investigating Issue is being investigated and/or work is in progress to resolve the issue. labels May 9, 2024

mn-prp commented May 10, 2024

Thanks for looking into this. My use case is to retrieve all items, without any (final) pagination, since the goal of the query is to sum some values across the whole table for an internal analytics job that is run periodically.

I don't have the time to create a reproduction, since it obviously depends on a populated database, but I can add some context. We have two environments, one staging and one production; the staging environment has a table ~1 MB in size, whereas production's is ~5 MB. The "total" values in both tables were known to be increasing, but while the staging total did continue to increase, we noticed that the production total was stagnating around some value (say 200,000 -- we were seeing totals of 210,000, then 209,000, then 197,000, then 201,000, etc.). This led me to suspect that a page-size limit was being hit.

So I went and looked at the SDK call, which is the first snippet in this issue. By adding Limit: 1000 I got the expected, much higher total. Then I tried Limit: 50 and again got the expected higher value (but with many more network requests, as my logging showed). But when I removed the line, the value dropped back to ~200,000, and my logging revealed that only one loop iteration was executed, with no LastEvaluatedKey.

Is it expected that Limit is required to read the full data set from a multi-megabyte table? I would have expected that not setting Limit would either (a) retrieve the maximum page size and return a LastEvaluatedKey, or (b) make the SDK continue fetching the next page automatically (appending it to the result set so far) until there were no more pages.
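To make the contract I'm assuming explicit, here is a minimal self-contained sketch (mock types and a mock scan function, not the real SDK or my real table): each Scan-like call returns up to a page-size worth of items, and LastEvaluatedKey is present only when more data remains, so a loop that appends each page before checking the key drains the whole table. Note also that with a FilterExpression the real service can return a page with zero Items that still carries a LastEvaluatedKey, so a correct loop must key off LastEvaluatedKey, not off Items being empty.

```typescript
// Hypothetical stand-ins for the DynamoDB Scan pagination contract.
type Key = number;
interface ScanPage {
  Items: number[];
  LastEvaluatedKey?: Key; // present only when more data remains
}

// Mock of a single Scan request: returns one "page" starting at `start`.
function mockScan(data: number[], pageSize: number, start: Key = 0): ScanPage {
  const Items = data.slice(start, start + pageSize);
  const next = start + pageSize;
  // Only include LastEvaluatedKey when there is more data to read.
  return next < data.length ? { Items, LastEvaluatedKey: next } : { Items };
}

// Drain the table: append every page, including the final one that
// comes back without a LastEvaluatedKey.
function scanAll(data: number[], pageSize: number): number[] {
  let results: number[] = [];
  let key: Key | undefined = undefined;
  let page: ScanPage;
  do {
    page = mockScan(data, pageSize, key);
    results = results.concat(page.Items);
    key = page.LastEvaluatedKey;
  } while (key !== undefined);
  return results;
}
```

With five items and a page size of two, this takes three requests and returns all five items, which is the behavior I expected from the real ScanCommand loop.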

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 11, 2024