Benchmark bulk entity create API #1095
I have some basic CSV parsing in Frontend in place, and I've been benchmarking that parsing. I also tried sending the parsed entities to Backend, and so far I'm seeing two limits. First, I'm able to send 6553 entities, but there's a Postgres error for 6554 entities. That's consistent with the limit mentioned above. Happy to share my JSON data if that'd be helpful! Secondly, even though central-backend/lib/http/service.js, line 33 (at f98187d)
I just ran into the 6553-entity max and the 250kb JSON limit myself. Removing that limit, I've so far been able to upload 6553 entities with 1000 properties, but not 1500 properties.
Based on discussion yesterday, we're hoping to be able to bulk-create 50,000 entities with 20 properties each, where each value is 100 characters.
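For scale, a quick back-of-envelope estimate (not from the discussion itself, and assuming one byte per character): 50,000 entities × 20 properties × 100 characters is about 100,000,000 bytes, roughly 100 MB of raw values before JSON keys, quoting, and metadata are added. That already sits at the nginx limit discussed further down.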
This seems like a good idea to have in general. Can we lift/override the limit only for this endpoint?
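A minimal sketch of what a per-endpoint override could look like, assuming the 250kb cap comes from an Express JSON body parser; the route path, limit values, and handler below are illustrative, not Central's actual configuration:

```js
// Sketch only: assumes the 250kb cap is enforced by Express's JSON body parser.
// The route path, limit values, and handler are illustrative, not Central's code.
const express = require('express');
const service = express();

// Route-level parser with a larger limit, registered before the app-wide default
// so requests to the bulk endpoint never hit the small parser.
service.post(
  '/v1/projects/:projectId/datasets/:name/entities',
  express.json({ limit: '100mb' }),
  (req, res) => {
    // A bulk create handler would run here; req.body is the parsed request body.
    res.sendStatus(201);
  }
);

// Everything else keeps the small default.
service.use(express.json({ limit: '250kb' }));

service.listen(8383); // port is arbitrary for this sketch
```

Registration order matters here: if the small app-wide parser ran first, it would reject large bodies with a 413 before the route-level parser ever saw them.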
Two thoughts about the nginx limit:
We discussed this today and concluded that it'd be fine to leave the nginx limit where it is today, at 100 MB. We've seen that the size of the JSON request body is usually larger than the CSV file, so this will probably limit the size of CSV files to less than 100 MB. (I think the JSON is larger in large part because certain strings are repeated for each entity:
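For reference, assuming the nginx limit referred to here is the standard `client_max_body_size` directive, leaving it at 100 MB means the config keeps something like `client_max_body_size 100m;` in the relevant server or location block, and nginx rejects larger request bodies with a 413 before they ever reach Backend.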
I've filed getodk/central#611 for this idea, but for now, we're just going to implement getodk/central#610. I've filed that as a separate issue so that we can continue to use this issue to discuss other limits and aspects of benchmarking.
I just wanted to leave a comment with some observations from @alxndrsn on Slack. @alxndrsn uploaded a CSV file of 100K entities:
It'd be interesting to learn more about why processing the file takes so long.
Questions about bulk upload limits:

- Optimization idea: There was an optimization using `sql.unnest` made to another part of Backend for inserting many form fields at once. Should that approach be used here, too? (See the sketch after this list.) https://github.com/gajus/slonik?tab=readme-ov-file#inserting-large-number-of-rows
  Commit where this was added to `insertMany`: 222b2a8
- What is the upper limit? 10 parameters per entity def --> maybe ~6K entities able to be inserted at once without the unnest optimization? Things to try!!
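As a rough illustration of the slonik pattern linked above: the table and column names below are hypothetical, not the actual entity schema, and the function is only a sketch of the technique, not Central's `insertMany`.

```js
// Sketch only: hypothetical table/columns; the pattern follows the slonik
// README section "Inserting large number of rows" via sql.unnest.
const { sql, createPool } = require('slonik');

const insertEntities = (pool, entities) =>
  pool.query(sql`
    INSERT INTO entities (uuid, label, creator)
    SELECT *
    FROM ${sql.unnest(
      entities.map(({ uuid, label, creator }) => [uuid, label, creator]),
      ['uuid', 'text', 'text']
    )}
  `);

// Without unnest, a multi-row VALUES list binds (columns x rows) parameters, so
// at ~10 parameters per entity def the 65535-parameter protocol cap is reached
// around floor(65535 / 10) = 6553 rows. With unnest, the whole batch is passed
// as a fixed, small number of array parameters, so row count is no longer
// bounded by the bind-parameter limit.
```

That arithmetic also lines up with the 6553-vs-6554 behavior reported at the top of this issue.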
Related idea: Should `createMany` and `createNew` (multiple vs. single entity creation) be combined? Does it make sense to use the multi case for a single entity, or is `createNew` better optimized for the single-entity scenario?
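On that related idea, a purely hypothetical sketch of the "combine them" option: the single-entity path just delegates to the bulk path with a one-element array, and benchmarking would show whether that costs anything versus a dedicated single-row insert. The names echo the discussion above, but the signatures are invented for illustration.

```js
// Hypothetical sketch: names mirror createMany/createNew from the discussion,
// but the signatures and bodies are invented here, not taken from central-backend.
const createMany = (pool, entities) =>
  insertEntities(pool, entities); // batched insert, e.g. the unnest sketch above

const createNew = (pool, entity) =>
  createMany(pool, [entity]); // single entity as the n = 1 case of the bulk path
```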