Add flag to ignore attachments when downloading submissions csv #150

ghostfreak3000 · 2020-07-21T05:18:16Z

Usecase:

Recently we had a requirement to sync form data ( minus multimedia ) from our ODK instance to one of our MongoDB clusters.
The form data made up ~80 MB of a 2 GB download ( roughly 1:25 ). Since most of the download trouble came from the large size ( and it's still growing, 6 GB as of this writing ), an option to ignore attachments ( maybe via a query param ) would be nice.

As a side note, because this sync runs every 6 hours ( with a standing requirement for it to run every hour ), it brought our modest server ( 2 CPU, 4GB RAM ) to its knees and we had to provision a larger one ( about 2x larger )... I'm thinking ignoring the attachments might let us scale back the server ( considering this thing only runs ODK and nothing else )... but that's not the main concern.

lognaturel · 2020-07-21T17:32:01Z

Thanks for writing this up, @ghostfreak3000!

Did you consider using the OData access? If so, what made you favor the CSV? Did you see that the OData data document provides a straightforward JSON representation?

Does your form definition have repeats?

We'll definitely consider your specific request of CSV-only access. However, I do want to make sure you've considered using the JSON representation. Note also that the OData access allows for paging. It won't significantly improve the performance of the export itself but depending on your connection speed it could still be helpful not to download 80mb of data each time.
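To illustrate the paging idea, here is a minimal sketch that builds one URL per page using the standard OData `$top` and `$skip` query options. The feed URL, the row count, and the function name `odata_page_urls` are hypothetical placeholders, and how a particular server honors these options is an assumption to check against its documentation.

```python
from urllib.parse import urlencode


def odata_page_urls(feed_url, total_rows, page_size):
    """Yield one URL per page of an OData feed.

    feed_url, total_rows and page_size are illustrative inputs;
    $top/$skip are the standard OData paging options.
    """
    for skip in range(0, total_rows, page_size):
        # safe="$" keeps the leading "$" of the option names unescaped.
        query = urlencode({"$top": page_size, "$skip": skip}, safe="$")
        yield f"{feed_url}?{query}"
```

Each URL can then be fetched in turn, so a dropped connection only costs one page rather than the whole export.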

ghostfreak3000 · 2020-07-22T07:16:32Z

Hi @lognaturel

Context:

The particular project we were working on had a heavy time constraint ( about 2 days ) and the "sync odk to mongodb" was just a small milestone of many.

Answer:

A: Did we consider OData?

Yes, we considered OData because the submissions endpoint did not support filtering of data ( we wanted to only sync data for the day the sync ran ), but it was dropped for three reasons:

1) A parser would have needed to be written in order to extract relevant data from the OData response.
2) We were using mongoimport to upload the data to the cluster, and having to write a parser would have required us to drop mongoimport and write our own uploader ( a fun activity I might add, but not something to do when on a tight schedule ).
3) This particular server's network connection was not limited, so dealing with hundreds of GBs wouldn't really raise anyone's eyebrows, and the "download all data from odk every hour and upload to mongo" approach wasn't too scary from a network perspective. It's been a headache from a RAM/CPU perspective though ( with the server hanging every now and then ).

So point 3) effectively negated the main reason for considering OData.

B: Does your form definition have repeats?

I wouldn't know; I wasn't part of the team that handles form definitions. What I do know is that we were supposed to sync all non-multimedia data ( projects, forms, etc. ) from ODK to Mongo regardless of whether more were added ( another source of headache, considering the xmlFormID takes the name of the form and a lot of the form names had spaces, causing URL encoding issues, but I digress ).
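The URL encoding problem with spaces in form names can be handled by percent-encoding the form ID before placing it in a path. This is a minimal sketch using Python's standard library; the function name `encode_form_id` and the sample form name are hypothetical.

```python
from urllib.parse import quote


def encode_form_id(xml_form_id):
    """Percent-encode a form ID so it is safe as a single URL path segment."""
    # safe="" also escapes "/", which quote() leaves alone by default.
    return quote(xml_form_id, safe="")


print(encode_form_id("Household Survey 2020"))  # Household%20Survey%202020
```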

AOB:

I did tell @yanokwa that I wouldn't mind sending in a PR, but I'm not sure when I would have time to do it, so it was decided that the issue with all the context be documented for future reference ( in case someone else wants to do it ).

florianm · 2020-09-09T02:18:12Z

A1. is indeed quite involved. If R is an option at all, the R package ruODK can list projects, forms, form tables, and programmatically download all submissions (with or without media attachments, or with skip logic to only download new or force-download all media attachments) and parse each data type into native R objects. From there, you can write the data to CSV files ready for mongoimport.

lognaturel · 2020-09-09T05:53:50Z

We're still finalizing criteria for v1.1 but are likely to add a CSV-only endpoint for form definitions without repeats.

ghostfreak3000 · 2020-09-09T06:54:33Z

> A1. is indeed quite involved. If R is an option at all, the R package ruODK can list projects, forms, form tables, and programmatically download all submissions (with or without media attachments, or with skip logic to only download new or force-download all media attachments) and parse each data type into native R objects. From there, you can write the data to CSV files ready for mongoimport.

@florianm Never had a reason to try out R. Will definitely give it a try

florianm · 2020-09-09T08:35:07Z

Great to hear!
If you want to "go faster", a pre-built Docker image running RStudio Server with ruODK installed lives at https://hub.docker.com/u/dbcawa/ruodk, or you can use the hosted version at BinderHub (linked from https://github.com/ropensci/ruODK).
I'm using ruODK in combination with the make-style tool drake for data ETL/QA pipelines similar to your use case; an example lives here.
Lastly, https://ropensci.org/ is a great community of R users and packages around data access and processing; they run a very helpful Slack.


matthew-white · 2020-12-23T22:58:20Z

We have just released v1.1, which makes two changes to the API along these lines:

/projects/…/forms/…/submissions.csv allows download of the root table (excluding repeat data) as CSV, without a zipfile.
/projects/…/forms/…/submissions.csv.zip now allows ?attachments=false to exclude attachments.
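As a rough sketch of how a client might target these two endpoints, the helper below only builds the URLs. The names `submissions_urls`, `api_root`, `project_id` and `xml_form_id` are hypothetical stand-ins for the parts elided as "…" above, the example host and its `/v1` prefix are assumptions, and authentication is left out.

```python
from urllib.parse import quote, urlencode


def submissions_urls(api_root, project_id, xml_form_id):
    """Build URLs for the CSV-only and no-attachments downloads.

    All three arguments are illustrative placeholders; the path layout
    follows the two endpoints listed above.
    """
    root = api_root.rstrip("/")
    # Percent-encode the form ID so spaces do not break the path.
    form_id = quote(xml_form_id, safe="")
    form_path = f"{root}/projects/{project_id}/forms/{form_id}"
    no_attachments = urlencode({"attachments": "false"})
    return {
        "root_table_csv": f"{form_path}/submissions.csv",
        "zip_without_attachments": f"{form_path}/submissions.csv.zip?{no_attachments}",
    }
```

For example, `submissions_urls("https://odk.example.org/v1", 7, "my form")` gives a `root_table_csv` URL ending in `/projects/7/forms/my%20form/submissions.csv`.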

matthew-white added the "needs discussion" label Aug 26, 2020

matthew-white added a commit to getodk/central-frontend that referenced this issue Nov 12, 2020

Add submission download options

07e43d0

See getodk/central#150.

matthew-white mentioned this issue Nov 12, 2020

Add submission download options getodk/central-frontend#384

Merged

matthew-white closed this as completed Dec 23, 2020
