Add flag to ignore attachments when downloading submissions csv #150

Closed
ghostfreak3000 opened this issue Jul 21, 2020 · 7 comments
Labels: needs discussion (Discussion needed before work can begin)

Comments

@ghostfreak3000
Contributor

Use case:

Recently we had a requirement to sync form data (minus multimedia) from our ODK instance to one of our MongoDB clusters.
The form data made up ~80 MB of a 2 GB download (about 1:25 going by those numbers). Considering most of the download trouble was a result of the large size (and it's still growing, 6 GB as of this writing), an option to ignore attachments (maybe via a query param) would be nice.

As a side note, because this sync runs every 6 hours (with a standing requirement for it to run every hour), it brought our modest server (2 CPU, 4 GB RAM) to its knees and we had to provision a larger one (about 2x larger). I'm thinking ignoring the attachments might let us scale back the server (considering this thing only runs ODK and nothing else), but that's not the main concern.

@lognaturel
Member

Thanks for writing this up, @ghostfreak3000!

Did you consider using the OData access? If so, what made you favor the CSV? Did you see that the OData data document provides a straightforward JSON representation?

Does your form definition have repeats?

We'll definitely consider your specific request of CSV-only access. However, I do want to make sure you've considered using the JSON representation. Note also that the OData access allows for paging. It won't significantly improve the performance of the export itself, but depending on your connection speed it could still be helpful not to download 80 MB of data each time.
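
To make the paging suggestion concrete, here is a minimal sketch of paging through an OData feed with the standard `$top` and `$skip` query options. The `fetch_page` callable is a hypothetical stand-in for whatever HTTP call retrieves one page; the real service URL and authentication are left out, and the in-memory table exists only so the loop can be run as-is.

```python
from typing import Callable, Dict, Iterator, List


def iter_rows(fetch_page: Callable[[Dict[str, int]], List[dict]],
              page_size: int = 500) -> Iterator[dict]:
    """Yield rows page by page until a short or empty page is returned."""
    skip = 0
    while True:
        rows = fetch_page({"$top": page_size, "$skip": skip})
        yield from rows
        if len(rows) < page_size:
            break
        skip += page_size


# Stand-in for a real HTTP request: serves slices of an in-memory list.
_fake_table = [{"id": i} for i in range(7)]


def fake_fetch(params: Dict[str, int]) -> List[dict]:
    start = params["$skip"]
    return _fake_table[start:start + params["$top"]]


all_rows = list(iter_rows(fake_fetch, page_size=3))
```

Each request only carries one page of rows, so a slow connection never has to hold the whole export in flight at once.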

@ghostfreak3000
Contributor Author

ghostfreak3000 commented Jul 22, 2020

Hi @lognaturel

Context:

The particular project we were working on had a heavy time constraint ( about 2 days ) and the "sync odk to mongodb" was just a small milestone of many.

Answer:

A: Did we consider OData?

Yes, we considered OData because the submissions endpoint did not support filtering of data (we wanted to sync only the data for the day the sync ran), but it was dropped for three reasons:

  1. A parser would have needed to be written in order to extract relevant data from the OData response.

  2. We were using mongoimport to upload the data to the cluster, and having to write a parser would have required us to drop mongoimport and write our own uploader (a fun activity, I might add, but not something to do when on a tight schedule).

  3. This particular server did not have a limited network connection, so dealing with hundreds of GBs wouldn't really raise anyone's eyebrows, and "download all data from ODK every hour and upload to Mongo" wasn't too scary from a network perspective. It has been a headache from a RAM/CPU perspective though (with the server hanging every now and then).

So point 3 effectively negated the main reason for considering OData.
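
For reference, the parser from point 1 can be fairly small when only the top-level rows are needed. This is a minimal sketch, assuming the OData response is a JSON object whose rows sit in a `value` array (the usual OData JSON layout); the sample payload and its field names are made up. It emits one JSON document per line, which is the input format mongoimport reads by default.

```python
import json
from typing import List


def odata_to_ndjson(odata_text: str) -> str:
    """Convert an OData JSON document into newline-delimited JSON."""
    payload = json.loads(odata_text)
    rows: List[dict] = payload.get("value", [])
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Made-up sample response; real field names depend on the form.
sample = json.dumps({
    "value": [
        {"__id": "uuid:1", "name": "Alice", "age": 30},
        {"__id": "uuid:2", "name": "Bob", "age": 25},
    ]
})

ndjson = odata_to_ndjson(sample)
```

Repeat groups and nested structures are not handled here and would need their own treatment.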

B: Does your form definition have repeats?

I wouldn't know; I wasn't part of the team that handles form definitions. What I do know is that we were supposed to sync all non-multimedia data (projects, forms, etc.) from ODK to Mongo regardless of whether more were added. (Another source of headache, considering the xmlFormID takes the name of the form and a lot of the form names had spaces, causing URL encoding issues, but I digress.)
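
The URL encoding issue with spaces in form names is typically handled by percent-encoding the form ID before placing it in a path. A minimal sketch using Python's standard `urllib.parse.quote`; the form ID here is a made-up example.

```python
from urllib.parse import quote

xml_form_id = "Household Survey 2020"  # hypothetical form ID containing spaces

# safe="" also encodes "/" so the ID stays a single path segment.
encoded = quote(xml_form_id, safe="")
```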

AOB:

I did tell @yanokwa that I wouldn't mind sending in a PR, but I'm not sure when I would have time to do it, so it was decided that the issue, with all the context, would be documented for future reference (in case someone else wants to do it).

@matthew-white added the "needs discussion" label (Discussion needed before work can begin) on Aug 26, 2020

@florianm

florianm commented Sep 9, 2020

A1. is indeed quite involved. If R is an option at all, the R package ruODK can list projects, forms, form tables, and programmatically download all submissions (with or without media attachments, or with skip logic to only download new or force-download all media attachments) and parse each data type into native R objects. From there, you can write the data to CSV files ready for mongoimport.

@lognaturel
Member

We're still finalizing criteria for v1.1 but are likely to add a CSV-only endpoint for form definitions without repeats.

@ghostfreak3000
Contributor Author

> A1. is indeed quite involved. If R is an option at all, the R package ruODK can list projects, forms, form tables, and programmatically download all submissions (with or without media attachments, or with skip logic to only download new or force-download all media attachments) and parse each data type into native R objects. From there, you can write the data to CSV files ready for mongoimport.

@florianm Never had a reason to try out R. Will definitely give it a try.

@florianm

florianm commented Sep 9, 2020

Great to hear!
If you want to "go faster", a pre-built Docker image running RStudio Server with ruODK installed lives at https://hub.docker.com/u/dbcawa/ruodk, or you can use the hosted version at BinderHub (linked from https://github.com/ropensci/ruODK).
I'm using ruODK in combination with the Make-like tool drake for data ETL/QA pipelines similar to your use case; an example lives here.
Lastly, https://ropensci.org/ is a great community of R users and packages around data access and processing; they run a very helpful Slack.

@matthew-white
Member

We have just released v1.1, which makes two changes to the API along these lines:

  • /projects/…/forms/…/submissions.csv allows download of the root table (excluding repeat data) as CSV, without a zipfile.
  • /projects/…/forms/…/submissions.csv.zip now allows ?attachments=false to exclude attachments.
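
As an illustration, the two endpoints could be addressed as below. This is a sketch only: `base_url`, the project ID and the form ID are hypothetical placeholders for the parts elided above, and authentication and the actual HTTP request are omitted.

```python
from urllib.parse import quote, urlencode


def submissions_url(base_url: str, project_id: int, xml_form_id: str,
                    zipped: bool = False, attachments: bool = True) -> str:
    """Build a submissions export URL for the root-table CSV or the zip."""
    path = (f"{base_url.rstrip('/')}/projects/{project_id}"
            f"/forms/{quote(xml_form_id, safe='')}/submissions.csv")
    if not zipped:
        return path
    url = path + ".zip"
    if not attachments:
        # Query parameter described in the v1.1 notes above.
        url += "?" + urlencode({"attachments": "false"})
    return url


# Hypothetical values for illustration only.
csv_url = submissions_url("https://central.example.org/v1", 1, "my form")
zip_url = submissions_url("https://central.example.org/v1", 1, "my form",
                          zipped=True, attachments=False)
```

The form ID is percent-encoded so that names containing spaces still produce a valid path.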
