Pulling media files from server with 1gb of RAM fails through direct API call or export generation #297

Closed
6 of 9 tasks
lognaturel opened this issue Oct 9, 2020 · 9 comments


@lognaturel
Member

lognaturel commented Oct 9, 2020

Attempting to export submissions for https://test.central.getodk.org/#/projects/188/forms/all-widgets/submissions produces a 45-byte, corrupt zip (presumably empty).

  • verify that published form definition works in Collect
  • look at resource utilization while the request is being handled -- memory usage from node and postgres goes up
  • download submissions with Briefcase to see whether there are any media attachments -- yes, and the biggest is 17mb.
  • try the same form definition with fewer submissions -- works fine without media
  • look at DO's memory usage -- doesn't seem to have gone above 80% in last 14 days?!
  • try with text submissions plus the one with the largest attachment -- works fine
  • explore whether the issue is related to a threshold of total media files across submissions or to the aggregate size of media files for a single submission -- with the same form, I get failures when pulling with Briefcase (which uses the submission attachments endpoint) on 1gb of RAM. The CPU also spikes when attachments are requested. Once I add swap and do the original export, 400mb of swap gets used. The resulting zip is 115mb.

Possible actions:

  • Always recommend swap when using a machine with 1GB of RAM
  • Add a troubleshooting section that recommends adding memory or, as a second option, adding swap

Possibly related service log, though I don't see it with every failed request:

service               |
service               | <--- Last few GCs --->
service               | [458:0x3482490]   243418 ms: Mark-sweep 31.7 (36.5) -> 31.7 (36.5) MB, 27.9 / 0.0 ms  (+ 0.1 ms in 2 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 114 ms) (average mu = 0.872, current mu = 0.858) finalize incremental marking
service               | [458:0x3482490]   243584 ms: Mark-sweep 31.7 (36.5) -> 31.7 (36.5) MB, 18.1 / 0.0 ms  (+ 0.1 ms in 2 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 82 ms) (average mu = 0.880, current mu = 0.890) finalize incremental marking
service               |
service               | <--- JS stacktrace --->
service               |
service               | ==== JS stack trace =========================================
service               |
service               |     0: ExitFrame [pc: 0x13162b9]
service               | Security context: 0x09e716040911 <JSObject>
service               |     1: fromString(aka fromString) [0x3e3bbccf34e9] [buffer.js:327] [bytecode=0x2d63ab9768a1 offset=151](this=0x2d47e73c04d1 <undefined>,0x286e43202f41 <Very long string[29575502]>,0x168ea14a8b89 <String[#3]: hex>)
service               |     2: from [0x1eac040b7c89] [buffer.js:198] [bytecode=0x27cf830aadf1 offset=11](this=0x3e3bbccc4a51 <JSFunction Buffer (sfi = 0x1eac0409edd1)>,0x286e4...
service               |
service               |     at processTicksAndRejections (internal/process/task_queues.js:85:5)
service               | FATAL ERROR: NewArray Allocation failed - process out of memory
service               |  1: 0x9aedf0 node::Abort() [/usr/local/bin/node]
service               |  2: 0x9aff86 node::OnFatalError(char const*, char const*) [/usr/local/bin/node]
service               |  3: 0xb078ce v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
service               |  4: 0xb07c49 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
service               |  5: 0xb07d3b  [/usr/local/bin/node]
service               |  6: 0xb2b3b6 v8::String::Value::Value(v8::Isolate*, v8::Local<v8::Value>) [/usr/local/bin/node]
service               |  7: 0xa652a3 node::StringBytes::Write(v8::Isolate*, char*, unsigned long, v8::Local<v8::Value>, node::encoding, int*) [/usr/local/bin/node]
service               |  8: 0x994571 node::Buffer::New(v8::Isolate*, v8::Local<v8::String>, node::encoding) [/usr/local/bin/node]
service               |  9: 0x99480b  [/usr/local/bin/node]
service               | 10: 0xb8df76  [/usr/local/bin/node]
service               | 11: 0xb8fe97 v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
service               | 12: 0x13162b9  [/usr/local/bin/node]
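
For context, the `fromString` frame in `buffer.js` above is `Buffer.from(<Very long string[29575502]>, 'hex')`: a blob is being materialized as a hex string and then decoded into a Buffer, so two full copies of the data are live at once. A rough standalone illustration of that allocation pattern (not Central's actual code):

```js
// Rough illustration of the allocation pattern in the stack trace above,
// not Central's code. A blob read back as hex costs ~2 characters of string
// per byte of data, and decoding it allocates a second full copy.
const blobBytes = 15 * 1024 * 1024;                        // ~15mb of binary data
const hex = Buffer.alloc(blobBytes, 0xab).toString('hex'); // ~30m-character string

// The same kind of call that aborts in the log: both the hex string and the
// decoded Buffer are in memory at the same time.
const decoded = Buffer.from(hex, 'hex');
console.log(hex.length, decoded.length);                   // 31457280 15728640
```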

@lognaturel
Member Author

lognaturel commented Oct 9, 2020

This seems pretty clearly related to memory, although it's confusing that DigitalOcean doesn't report RAM usage above 80% for the QA server.

We should describe this case in the documentation and help users recover (by adding RAM?). It would be helpful to know more about what conditions are likely to lead to this outcome. I initially thought it was driven by the largest media file but my experimentation suggests that's not the case. Does it have to do with the total size of media? The submission count?

It almost feels like media does not start streaming until the CSV is generated. Is that possible? As I add more submissions, I see nothing happening for a while and then a quick download.

@issa-tseng
Member

yes, that's correct: the root table is the first stream inserted into the zip so it's the first thing over the wire.
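
A minimal sketch of that ordering, using the `archiver` npm package as a stand-in (it may not be the library Central actually uses): entries stream out of the zip in the order they are appended, so nothing from the media entries is sent until the CSV entry has been fully consumed.

```js
// Illustrative sketch only: shows why the first entry appended to a streaming
// zip is also the first thing the client receives. Assumes the `archiver`
// package; Central's implementation may differ.
const archiver = require('archiver');

function buildExport(res, csvStream, attachments) {
  const zip = archiver('zip');
  zip.pipe(res); // stream the zip straight to the HTTP response

  // Root table: first entry appended, first bytes over the wire.
  zip.append(csvStream, { name: 'submissions.csv' });

  // These entries are only written after the CSV entry finishes, which matches
  // the long pause before any download progress is visible.
  for (const { name, stream } of attachments) {
    zip.append(stream, { name: `media/${name}` });
  }
  return zip.finalize();
}
```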

@lognaturel
Member Author

> it's the first thing over the wire

And so everything else for the export including all media has to fit in memory?

@issa-tseng
Member

issa-tseng commented Oct 11, 2020 via email

@lognaturel
Member Author

lognaturel commented Oct 11, 2020

Does this mean that the root table must be *started* for the zip to start being delivered, but that in principle media files should be inserted as they are processed? Right now it seems like there's a threshold of total media file size beyond which the server crashes and nothing ever gets sent.

EDIT: I can get a server with 1gb of memory to run out of RAM when pulling from Briefcase, which uses the attachment endpoint. Both pulling from the attachment endpoint and generating the export zip work with 1gb of added swap. 400mb of swap is used to produce a 115mb zip, and again there's a long pause before anything is transmitted.

@lognaturel lognaturel changed the title Corrupt zip on export Pulling media files from server with 1gb of RAM fails through direct API call or export generation Oct 15, 2020
@issa-tseng
Member

solutions here are imprecise. i can only limit my results from the database by number of rows, not total size in the stream buffer. so when we get blobs to stream i set it to a number that is not likely to ever run out of memory given the number of entries (i think it's like 4 or 8 or something pretty tiny), but it can still happen.

either way, it is not true that the media attachments all sit around in node memory. that's what the backpressure's for. the small number of blobs should just sit until the main data is done processing, then the rest of the blobs come in. it just apparently isn't good enough.
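
For anyone following along, here is a rough sketch of the kind of row-count limit described above, using `pg-query-stream` as an assumed mechanism (not Central's actual code). The stream's highWaterMark caps how many rows sit buffered at once, but each buffered row still carries its whole blob, which is why a handful of very large attachments can still exhaust 1gb of RAM.

```js
// Sketch of a row-count limit on a blob stream, assuming pg-query-stream.
// Hypothetical table/column names; not Central's actual code.
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');

const pool = new Pool();

async function streamBlobs(formId, onBlob) {
  const client = await pool.connect();
  try {
    const query = new QueryStream(
      'SELECT name, content FROM blobs WHERE form_id = $1', // hypothetical schema
      [formId],
      { highWaterMark: 4, batchSize: 4 }                    // buffer at most ~4 rows
    );
    const stream = client.query(query);
    // Backpressure: no further rows are fetched until these are consumed,
    // but each buffered row still holds its entire blob in memory.
    for await (const row of stream) {
      await onBlob(row);
    }
  } finally {
    client.release();
  }
}
```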

@issa-tseng
Member

once we insert streams into the archives they are there in that order. we can't "insert" media files midway.

@lognaturel
Member Author

I did another trial with ~1.5mb images. The export RAM utilization was roughly the same between 1000 and 2000 submissions. When the download starts, there's a spike of about 400mb in memory usage (same as with the original scenario) and then it holds steady. In htop, I see 8 processes, each with roughly the same RAM usage. It seems that's the 8 rows @issa was talking about, and I think it's all consistent and explainable. I'm surprised that each row requires so much RAM, but for most contexts it shouldn't matter.

One thing I'm not clear on is whether each attachment needs to be held entirely in memory. That is, if there's one 500mb attachment, will the process handling it need 500+mb of memory? I vaguely remember someone saying they couldn't export because their audit file got really big, but I'm not immediately finding it. Do you know, @issa-tseng? This is also something I could try if we'd rather see how it looks for real.
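
On that question, my understanding (worth verifying) is that plain node-postgres returns a `bytea` column as one complete Buffer per row, so an attachment would be fully materialized in the Node process while its row is being handled. A small hypothetical check, not Central's code:

```js
// Hypothetical check of the per-row behaviour in question, assuming plain
// node-postgres and made-up table/column names; not Central's code.
const { Client } = require('pg');

async function peekAttachment(id) {
  const client = new Client();
  await client.connect();
  const { rows } = await client.query('SELECT content FROM blobs WHERE id = $1', [id]);
  // A bytea value arrives as a single Buffer, so a 500mb attachment means a
  // 500mb Buffer (plus whatever copies are made downstream).
  console.log(Buffer.isBuffer(rows[0].content), rows[0].content.length);
  await client.end();
}
```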

@lognaturel
Member Author

We recently found out that we were affected by brianc/node-postgres#2240, which has a big effect on media performance. I haven't verified these exact scenarios with the patch, but given what I have tried, I'm pretty sure the behavior is much better now.
