Pulling media files from server with 1gb of RAM fails through direct API call or export generation #297

Closed
6 of 9 tasks
lognaturel opened this issue Oct 9, 2020 · 9 comments


@lognaturel
Member

lognaturel commented Oct 9, 2020

Attempting to export submissions for https://test.central.getodk.org/#/projects/188/forms/all-widgets/submissions produces a 45-byte, corrupt zip (presumably empty).

  • verify that published form definition works in Collect
  • look at resource utilization while the request is being handled -- memory usage from node and postgres goes up
  • download submissions with Briefcase to see whether there are any media attachments -- yes, and the biggest is 17mb.
  • try the same form definition with fewer submissions -- works fine without media
  • look at DO's memory usage -- doesn't seem to have gone above 80% in last 14 days?!
  • try with text submissions plus the one with the largest attachment -- works fine
  • explore whether the issue is related to a threshold of total media files across submissions or to the aggregate size of media files for a single submission -- with the same form, I get failures when pulling with Briefcase (which uses the submission attachments endpoint) on 1gb of RAM. The CPU also spikes when attachments are requested. Once I add swap and do the original export, 400mb of swap gets used. The resulting zip is 115mb.

Possible actions:

  • Always recommend swap when using a machine with 1GB of RAM
  • Add a troubleshooting section that recommends adding memory or, as a second option, adding swap

Possibly related service log, though I don't see it with every failed request:

service               |
service               | <--- Last few GCs --->
service               | [458:0x3482490]   243418 ms: Mark-sweep 31.7 (36.5) -> 31.7 (36.5) MB, 27.9 / 0.0 ms  (+ 0.1 ms in 2 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 114 ms) (average mu = 0.872, current mu = 0.858) finalize incremental marking
service               | [458:0x3482490]   243584 ms: Mark-sweep 31.7 (36.5) -> 31.7 (36.5) MB, 18.1 / 0.0 ms  (+ 0.1 ms in 2 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 82 ms) (average mu = 0.880, current mu = 0.890) finalize incremental marking
service               |
service               | <--- JS stacktrace --->
service               |
service               | ==== JS stack trace =========================================
service               |
service               |     0: ExitFrame [pc: 0x13162b9]
service               | Security context: 0x09e716040911 <JSObject>
service               |     1: fromString(aka fromString) [0x3e3bbccf34e9] [buffer.js:327] [bytecode=0x2d63ab9768a1 offset=151](this=0x2d47e73c04d1 <undefined>,0x286e43202f41 <Very long string[29575502]>,0x168ea14a8b89 <String[#3]: hex>)
service               |     2: from [0x1eac040b7c89] [buffer.js:198] [bytecode=0x27cf830aadf1 offset=11](this=0x3e3bbccc4a51 <JSFunction Buffer (sfi = 0x1eac0409edd1)>,0x286e4...
service               |
service               |     at processTicksAndRejections (internal/process/task_queues.js:85:5)
service               | FATAL ERROR: NewArray Allocation failed - process out of memory
service               |  1: 0x9aedf0 node::Abort() [/usr/local/bin/node]
service               |  2: 0x9aff86 node::OnFatalError(char const*, char const*) [/usr/local/bin/node]
service               |  3: 0xb078ce v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
service               |  4: 0xb07c49 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
service               |  5: 0xb07d3b  [/usr/local/bin/node]
service               |  6: 0xb2b3b6 v8::String::Value::Value(v8::Isolate*, v8::Local<v8::Value>) [/usr/local/bin/node]
service               |  7: 0xa652a3 node::StringBytes::Write(v8::Isolate*, char*, unsigned long, v8::Local<v8::Value>, node::encoding, int*) [/usr/local/bin/node]
service               |  8: 0x994571 node::Buffer::New(v8::Isolate*, v8::Local<v8::String>, node::encoding) [/usr/local/bin/node]
service               |  9: 0x99480b  [/usr/local/bin/node]
service               | 10: 0xb8df76  [/usr/local/bin/node]
service               | 11: 0xb8fe97 v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
service               | 12: 0x13162b9  [/usr/local/bin/node]
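
For context, the `fromString` frame in `buffer.js` above is `Buffer.from(<Very long string[29575502]>, 'hex')`: a blob is being materialized as a hex string and then decoded into a Buffer, so two full copies of the data are live at once. A rough standalone illustration of that allocation pattern (not Central's actual code):

```js
// Rough illustration of the allocation pattern in the stack trace above,
// not Central's code. A blob read back as hex costs ~2 characters of string
// per byte of data, and decoding it allocates a second full copy.
const blobBytes = 15 * 1024 * 1024;                        // ~15mb of binary data
const hex = Buffer.alloc(blobBytes, 0xab).toString('hex'); // ~30m-character string

// The same kind of call that aborts in the log: both the hex string and the
// decoded Buffer are in memory at the same time.
const decoded = Buffer.from(hex, 'hex');
console.log(hex.length, decoded.length);                   // 31457280 15728640
```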

@lognaturel
Member Author

lognaturel commented Oct 9, 2020

This seems pretty clearly related to memory, although it's confusing that DigitalOcean doesn't report RAM usage above 80% for the QA server.

We should describe this case in the documentation and help users recover (by adding RAM?). It would be helpful to know more about what conditions are likely to lead to this outcome. I initially thought it was driven by the largest media file but my experimentation suggests that's not the case. Does it have to do with the total size of media? The submission count?

It almost feels like media does not start streaming until the CSV is generated. Is that possible? As I add more submissions, I see nothing happening for a while and then a quick download.

@issa-tseng
Member

yes, that's correct: the root table is the first stream inserted into the zip so it's the first thing over the wire.
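
A minimal sketch of that ordering, using the `archiver` npm package as a stand-in (it may not be the library Central actually uses): entries stream out of the zip in the order they are appended, so nothing from the media entries is sent until the CSV entry has been fully consumed.

```js
// Illustrative sketch only: shows why the first entry appended to a streaming
// zip is also the first thing the client receives. Assumes the `archiver`
// package; Central's implementation may differ.
const archiver = require('archiver');

function buildExport(res, csvStream, attachments) {
  const zip = archiver('zip');
  zip.pipe(res); // stream the zip straight to the HTTP response

  // Root table: first entry appended, first bytes over the wire.
  zip.append(csvStream, { name: 'submissions.csv' });

  // These entries are only written after the CSV entry finishes, which matches
  // the long pause before any download progress is visible.
  for (const { name, stream } of attachments) {
    zip.append(stream, { name: `media/${name}` });
  }
  return zip.finalize();
}
```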

@lognaturel
Member Author

> it's the first thing over the wire

And so everything else for the export including all media has to fit in memory?

@issa-tseng
Member

issa-tseng commented Oct 11, 2020 via email

@lognaturel
Member Author

lognaturel commented Oct 11, 2020

Does this mean that the root table must be *started* for the zip to start being delivered, but that in principle media files should be inserted as they are processed? Right now it seems like there's a threshold of total media file size beyond which the server crashes and nothing ever gets sent.

EDIT: I can get a server with 1gb of memory to run out of RAM when pulling from Briefcase, which uses the attachment endpoint. Both pulling from the attachment endpoint and generating the export zip work with 1gb of added swap. 400mb of swap is used to produce a 115mb zip, and again there's a long pause before anything is transmitted.

@lognaturel lognaturel changed the title Corrupt zip on export Pulling media files from server with 1gb of RAM fails through direct API call or export generation Oct 15, 2020
@issa-tseng
Member

solutions here are imprecise. i can only limit my results from the database by number of rows, not total size in the stream buffer. so when we get blobs to stream i set it to a number that is not likely to ever run out of memory given the number of entries (i think it's like 4 or 8 or something pretty tiny), but it can still happen.

either way, it is not true that the media attachments all sit around in node memory. that's what the backpressure's for. the small number of blobs should just sit until the main data is done processing, then the rest of the blobs come in. it just apparently isn't good enough.
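
For anyone following along, here is a rough sketch of the kind of row-count limit described above, using `pg-query-stream` as an assumed mechanism (not Central's actual code). The stream's highWaterMark caps how many rows sit buffered at once, but each buffered row still carries its whole blob, which is why a handful of very large attachments can still exhaust 1gb of RAM.

```js
// Sketch of a row-count limit on a blob stream, assuming pg-query-stream.
// Hypothetical table/column names; not Central's actual code.
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');

const pool = new Pool();

async function streamBlobs(formId, onBlob) {
  const client = await pool.connect();
  try {
    const query = new QueryStream(
      'SELECT name, content FROM blobs WHERE form_id = $1', // hypothetical schema
      [formId],
      { highWaterMark: 4, batchSize: 4 }                    // buffer at most ~4 rows
    );
    const stream = client.query(query);
    // Backpressure: no further rows are fetched until these are consumed,
    // but each buffered row still holds its entire blob in memory.
    for await (const row of stream) {
      await onBlob(row);
    }
  } finally {
    client.release();
  }
}
```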

@issa-tseng
Member

once we insert streams into the archives they are there in that order. we can't "insert" media files midway.

@lognaturel
Member Author

I did another trial with ~1.5mb images. The export RAM utilization was roughly the same between 1000 and 2000 submissions. When the download starts, there's a spike of about 400mb in memory usage (same as with the original scenario) and then it holds steady. In htop, I see 8 processes, each with roughly the same RAM usage. It seems that's the 8 rows @issa was talking about, and I think it's all consistent and explainable. I'm surprised that each row requires so much RAM, but for most contexts it shouldn't matter.

One thing I'm not clear on is whether each attachment needs to be held entirely in memory. That is, if there's one 500mb attachment, will the process handling it need 500+mb of memory? I vaguely remember someone saying they couldn't export because their audit file got really big, but I'm not immediately finding it. Do you know, @issa-tseng? This is also something I could try if we'd rather see how it looks for real.
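
On that question, my understanding (worth verifying) is that plain node-postgres returns a `bytea` column as one complete Buffer per row, so an attachment would be fully materialized in the Node process while its row is being handled. A small hypothetical check, not Central's code:

```js
// Hypothetical check of the per-row behaviour in question, assuming plain
// node-postgres and made-up table/column names; not Central's code.
const { Client } = require('pg');

async function peekAttachment(id) {
  const client = new Client();
  await client.connect();
  const { rows } = await client.query('SELECT content FROM blobs WHERE id = $1', [id]);
  // A bytea value arrives as a single Buffer, so a 500mb attachment means a
  // 500mb Buffer (plus whatever copies are made downstream).
  console.log(Buffer.isBuffer(rows[0].content), rows[0].content.length);
  await client.end();
}
```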

@lognaturel
Member Author

We recently found out that we were affected by brianc/node-postgres#2240, which has a big effect on media performance. I haven't verified these exact scenarios with the patch, but given what I have tried, I'm pretty sure the behavior is much better now.
