
Benchmark bulk entity create API #1095

Closed
ktuite opened this issue Feb 23, 2024 · 7 comments

@ktuite
Member

ktuite commented Feb 23, 2024

Questions about bulk upload limits:

  • What is the upper limit of an entity CSV that backend can handle?
  • How does that change with # of rows vs. # of columns?
  • What limits will be imposed by other parts of the system (e.g., the 100 MB nginx request body limit), and what limits can Frontend handle?
  • Might Frontend be able to parse a larger file than Backend can accept?

Optimization idea:

An optimization using sql.unnest was made in another part of Backend for inserting many form fields at once. Should that approach be used here, too? https://github.com/gajus/slonik?tab=readme-ov-file#inserting-large-number-of-rows

Commit where this was added to insertMany: 222b2a8
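
For reference, the pattern from the Slonik README linked above looks roughly like this when adapted to entities. This is only a sketch: the table name, column names, and the db/entities variables are illustrative placeholders, not Central's actual schema or code. The point is that sql.unnest sends one array parameter per column instead of one parameter per value, so the bind-parameter count no longer grows with the number of rows.

const { sql } = require('slonik');

// Illustrative only: placeholder table/columns, not the real entity_defs schema.
const rows = entities.map((e) => [e.uuid, e.label]);

await db.query(sql`
  INSERT INTO entity_defs (uuid, label)
  SELECT * FROM ${sql.unnest(rows, ['text', 'text'])}
`);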

What is the upper limit?

PostgreSQL is a constraint: the database supports at most 65535 bind parameters per statement.

At 10 parameters per entity def, that suggests roughly ~6.5K entities can be inserted at once without the unnest optimization. Things to try!
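
Back-of-the-envelope, assuming one bind parameter per value and the 10 values per entity def estimated above:

// 65535 bind parameters per statement is the PostgreSQL protocol limit.
const maxParams = 65535;
const paramsPerEntity = 10; // assumed, per the estimate above
const maxEntities = Math.floor(maxParams / paramsPerEntity); // 6553 (matches the 6553/6554 boundary observed below)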

Related idea:

Should createMany and createNew (multiple vs. single entity creation) be combined? Does it make sense to use the multi case for a single entity or is createNew better optimized for the single-entity scenario?

@ktuite ktuite added the entities Multiple Encounter workflows label Feb 23, 2024
@ktuite ktuite self-assigned this Feb 23, 2024
@ktuite ktuite changed the title Test limits of bulk entity create API Benchmark bulk entity create API Feb 27, 2024
@matthew-white
Member

I have some basic CSV parsing in Frontend in place, and I've been benchmarking that parsing. I also tried sending the parsed entities to Backend, and so far, I'm seeing two limits.

I'm able to send 6553 entities, but there's a Postgres error for 6554 entities. That's consistent with the limit mentioned above. Happy to share my JSON data if that'd be helpful!

Secondly, even though client_max_body_size is 100m in nginx (here), Backend itself seems to set a much smaller limit for JSON request bodies, of 250 kB. I had to increase that limit in order for larger requests to go through.

service.use(bodyParser.json({ type: 'application/json', limit: '250kb' }));

@ktuite
Member Author

ktuite commented Mar 1, 2024

I just ran into the 6553-entity max and the 250 kB JSON limit myself.

With that limit removed, I've so far been able to upload 6553 entities with 1000 properties each, but not with 1500 properties.

@matthew-white
Member

Based on discussion yesterday, we're hoping to be able to bulk-create 50,000 entities with 20 properties each, where each value is 100 characters.

@lognaturel
Member

250kb JSON limit

This seems like a good idea to have in general. Can we lift/override the limit only for this endpoint?
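
One way this could work, as a minimal sketch rather than the actual Backend code: express/body-parser lets a JSON parser with a larger limit be mounted on just the bulk-create route, registered ahead of the global 250 kB parser so the larger limit applies only to that path. The route path and the 100 MB figure here are illustrative assumptions.

const express = require('express');
const bodyParser = require('body-parser');

const service = express();

// Larger JSON limit only for the bulk entity creation endpoint (path is illustrative).
service.post(
  '/v1/projects/:projectId/datasets/:name/entities',
  bodyParser.json({ type: 'application/json', limit: '100mb' }),
  (request, response) => { /* ... create entities ... */ response.sendStatus(200); }
);

// Default 250 kB limit for every other JSON request.
service.use(bodyParser.json({ type: 'application/json', limit: '250kb' }));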

@matthew-white
Member

100MB nginx request body limit

Two thoughts about the nginx limit:

  1. 50,000 entities x 20 properties/entity x 100 bytes/property = 100 MB of data, which doesn't include JSON syntax characters (e.g., quotes and commas) or property names that are repeated for each entity. I think that means that we should increase the nginx limit, though probably just for this one endpoint. What should we increase it to?
  2. Currently, before Frontend attempts to send a file to Backend (e.g., a form attachment), it checks whether it exceeds the nginx limit, showing an informative error message if it does. If we think there's a risk of the new nginx limit being exceeded, I think we should consider having Frontend do something similar for request bodies that aren't files. (Frontend will send JSON, not the file itself, so the current check won't be tripped.) Right now, Frontend would just say, "Something went wrong: error code 413."

@matthew-white
Member

matthew-white commented Mar 13, 2024

I think that means that we should increase the nginx limit

We discussed this today and concluded that it'd be fine to leave the nginx limit where it is today, at 100 MB. We've seen that the JSON request body is usually larger than the CSV file it was parsed from, so this will probably limit the size of CSV files to less than 100 MB. (I think the JSON is larger in large part because certain strings are repeated for each entity: label, data, and each property name. The CSV file only lists property names once.) At some point, we may make the nginx limit configurable (related: getodk/central#609).
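
To illustrate the repetition with a tiny sketch (the request shape here is assumed for illustration, not quoted from the API docs): the CSV names each property once in its header row, while the JSON body repeats label, data, and every property name for each entity.

// CSV: property names appear once, in the header row.
//   label,first_name,age
//   Alice,Alice,34
//   Bob,Bob,41

// JSON request body (sketch): the keys recur for every entity, which is why
// the JSON ends up larger than the CSV it was parsed from.
const body = {
  entities: [
    { label: 'Alice', data: { first_name: 'Alice', age: '34' } },
    { label: 'Bob', data: { first_name: 'Bob', age: '41' } },
  ],
};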

Currently, before Frontend attempts to send a file to Backend (e.g., a form attachment), it checks whether it exceeds the nginx limit, showing an informative error message if it does. If we think there's a risk of the new nginx limit being exceeded, I think we should consider having Frontend do something similar for request bodies that aren't files.

I've filed getodk/central#611 for this idea, but for now, we're just going to implement getodk/central#610. I've filed that as a separate issue so that we can continue to use this issue to discuss other limits and aspects of benchmarking.

@matthew-white
Member

I just wanted to leave a comment with some observations from @alxndrsn on Slack. @alxndrsn uploaded a CSV file of 100K entities:

  • from drag-drop -> preview is near-instant
  • "sending file" about 25s (counting in my head)
  • "processing file" about 90s (counting in my head)
  • as a ballpark, it takes less than 1 second to insert this data into my local db

It'd be interesting to learn more about why processing the file takes so long.
