
Export data (resources) and metadata (datapackage.json) in zip file #52

Open
danfowler opened this issue Aug 8, 2016 · 2 comments

@danfowler

Source:

My assumption was that, since it was uploaded as a data package, it would be delivered on download as a data package, exactly as it had been uploaded; i.e. a single ZIP file containing the datapackage.json and the CSV.

@Stephen-Gates

I too was surprised to learn that CSVs are replaced by URLs. Perhaps offering a choice would be helpful.

@amercader

Exporting a ZIP file that includes the data as well as the descriptor was one of the first things discussed when this extension was started. You can follow some of the discussion with @vitorbaptista here: #30.

Basically this is something that on paper is relatively straightforward, but it's hard to implement in a way that is safe for all CKAN instances. Generating the zip file on demand is dangerous (imagine exporting a dataset like this), so it's almost a given that the zipped data package should be generated asynchronously in a queue and stored at a known location. Starting from CKAN 2.7 we have really nice support for background jobs (which can also be enabled on 2.6). But this presents further problems: you need to trigger a rebuild of the zip file whenever the dataset or its resources are updated, so there might be a period when the cached data package and the dataset are out of sync.
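One way to bound the out-of-sync window is to record, alongside the cached zip, the dataset's `metadata_modified` timestamp at build time, and treat the zip as stale whenever the dataset has changed since. A minimal sketch, assuming an in-memory cache for illustration (the `record_build` and `zip_is_stale` names are hypothetical, not part of ckanext-datapackager):

```python
from datetime import datetime

# Hypothetical in-memory cache: dataset id -> metadata_modified at zip build time.
# A real implementation would persist this next to the cached zip file.
_zip_built_for = {}

def record_build(dataset_id, metadata_modified):
    """Remember which dataset revision the cached zip was built from."""
    _zip_built_for[dataset_id] = metadata_modified

def zip_is_stale(dataset_id, metadata_modified):
    """True if there is no cached zip, or it predates the last dataset update."""
    built_for = _zip_built_for.get(dataset_id)
    return built_for is None or built_for < metadata_modified
```

A rebuild would then be enqueued whenever `zip_is_stale` returns true, either from an update hook or lazily at export time.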

Then there is the question of whether to run the async creation on demand (i.e. when someone wants to export the data package) or to pregenerate all data packages in the background. The UI for the former is difficult to implement (what happens after the user clicks, while the data package is being created?), but the latter basically implies keeping a duplicate of all the uploaded files (minus the compression savings) for all datasets, for a feature that might not be heavily used (until data packages take over the world, of course). Not all maintainers might be keen on that.

The implementation I would be most keen on would be:

  • Config option that controls whether zipped data packages with data are enabled at all
  • If true, data package creation happens on demand (i.e. when the user clicks "Export data package")
  • The data package generation happens asynchronously; in the meantime the user sees a spinner or something in the UI. The page polls an endpoint regularly to see if the export has finished, and redirects to the zip when it has
  • The cached zip file is updated whenever the dataset is updated

Even this is not trivial to implement, so if we decide to go for it I'd spec it out more thoroughly.

A ballpark estimate for this feature is 4-7 days
