
Feature request: Remove or increase download cap, restrict pagination on large datasets #5884

lbeaufort opened this issue Jun 26, 2024 · 2 comments

lbeaufort commented Jun 26, 2024

Issue

When paginating through millions of records, it can take several minutes to retrieve just 100 records at a time. This inefficiency prevents users from accessing the data they need promptly and results in expensive queries being run repeatedly.

Proposed solution

To improve this process, we propose either removing or increasing the download cap and restricting pagination for datasets larger than 500k or 1 million records. This change would allow users to queue up a download for large datasets, eliminating the need to paginate through all the data.
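
For illustration only, here is a minimal sketch of how pagination past a threshold could be rejected in a Flask view, pointing the caller at a queued bulk download instead. The route, the 500k threshold, and the fetch_page() helper are hypothetical placeholders, not the existing openFEC implementation.

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

MAX_PAGINATION_RECORDS = 500_000  # threshold under discussion (500k or 1M)
PER_PAGE = 100


def fetch_page(page):
    # Stand-in for the real database query.
    return {"page": page, "results": []}


@app.route("/v1/schedules/schedule_a/")
def schedule_a():
    page = request.args.get("page", default=1, type=int)
    # Reject requests that would paginate past the cap and point the caller
    # at a bulk-download workflow instead of running the expensive query.
    if page * PER_PAGE > MAX_PAGINATION_RECORDS:
        abort(
            422,
            description=(
                f"Pagination is limited to the first {MAX_PAGINATION_RECORDS:,} "
                "records; queue a bulk CSV download for larger result sets."
            ),
        )
    return jsonify(fetch_page(page))
```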

Action item(s)

  • Load test with custom tests
  • Compare performance
  • Consider protections

Completion criteria

(What does the end state look like? As long as these tasks are done, this work is complete.)

  • [ ]

References/resources/technical considerations

(Is there sample code or a screenshot you can include to highlight a particular issue? Here is where you reinforce why this work is important)

cnlucas commented Jul 23, 2024

Background: #1378 (the download cap was originally set to 100k and was raised in #2584)

patphongs commented Jul 23, 2024

> increasing the download cap and restricting pagination for datasets larger than 500k or 1 million records

Notes from 7/23/2024 discussion

  • The Excel spreadsheet limit is 1,048,576 rows
  • Use a specific key for the downloads endpoint
  • Direct users to generate a bulk CSV download of those records
  • Downloads generally run 3k-6k per day, sometimes up to 50k
  • Locust load testing is an option (a rough sketch follows this list)
  • Two calls are made: one for the count and one for the data
  • A long-running query isn't necessarily a complex query
  • Start with API Umbrella and the calls that take over 5 minutes
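
A rough Locust sketch for the load-testing idea above; the endpoint path, per_page/page values, and the DEMO_KEY api_key are placeholders, not production settings.

```python
from locust import HttpUser, between, task


class DeepPaginationUser(HttpUser):
    wait_time = between(1, 5)

    @task
    def deep_page(self):
        # Simulate a client walking deep into a large result set.
        self.client.get(
            "/v1/schedules/schedule_a/",
            params={"api_key": "DEMO_KEY", "per_page": 100, "page": 5000},
            name="schedule_a deep page",
        )
```

This could be run against a staging host with, for example, `locust -f locustfile.py --host https://api-stage.example.gov` (host shown is a placeholder).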

What are meaningful indicators of expensive queries?

  • Should we be looking at response time, or at a 500k+ record count?
  • How do we measure query complexity? (The EXPLAIN plan may have a cost score? This is run to get the count; see the sketch below.)
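
One possible complexity signal is the planner cost reported by EXPLAIN (FORMAT JSON). A rough psycopg2 sketch, with a placeholder connection string, table name, and filter:

```python
import psycopg2

conn = psycopg2.connect("dbname=fec")  # placeholder connection string
with conn.cursor() as cur:
    # EXPLAIN (FORMAT JSON) returns the plan as JSON, including the planner's
    # estimated "Total Cost", without executing the query.
    cur.execute(
        "EXPLAIN (FORMAT JSON) "
        "SELECT COUNT(*) FROM ofec_sched_a WHERE two_year_transaction_period = %s",
        (2024,),
    )
    plan = cur.fetchone()[0][0]["Plan"]
    print("Estimated total cost:", plan["Total Cost"])
```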

Questions?

  • Would this be faster?
    • How long does it take to generate a CSV of 500k+ records? (See the timing sketch after these questions.)
  • Should we do this only for API users at first, and not for the public website for now?
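
To get a first answer on CSV generation time, a quick measurement sketch using PostgreSQL's COPY through psycopg2; the connection string, table name, and row limit are placeholders:

```python
import time

import psycopg2

conn = psycopg2.connect("dbname=fec")  # placeholder connection string
start = time.monotonic()
with conn.cursor() as cur, open("schedule_a_export.csv", "w") as outfile:
    # Stream 500k rows straight to a CSV file via COPY, bypassing pagination.
    cur.copy_expert(
        "COPY (SELECT * FROM ofec_sched_a LIMIT 500000) "
        "TO STDOUT WITH (FORMAT CSV, HEADER)",
        outfile,
    )
print(f"Exported 500k rows in {time.monotonic() - start:.1f}s")
```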
