
DataStore to CSV service, for download of large resources. #34

Open
cphsolutionslab opened this issue Apr 8, 2014 · 8 comments

@cphsolutionslab

Explanation:

I would like to use the DataStore, via the API, as the primary data source. This already works without a problem.

However, if people want to download the entire resource as a CSV via /dump/, it only returns 100K records (this limit is hardcoded into CKAN).
It also takes quite a long time to generate the CSV file.

I have resources with more than 10 million rows and would like to offer a complete CSV download, but simply raising the hardcoded 100K row limit puts a lot of pressure on the system.
It would be very nice to have a feature where writes made through the DataStore API also update a corresponding CSV file for download, so the download wouldn't need to generate the file on the fly.
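Roughly, the flow I'm imagining looks like this. Just a sketch: `regenerate_csv_dump` is a hypothetical helper, not an existing CKAN function, and a real implementation would probably run it in the background.

```python
import ckan.plugins.toolkit as toolkit

def upsert_and_refresh_dump(context, resource_id, records):
    """Write records through the DataStore API, then refresh the
    pre-generated CSV so /dump/ can serve a static file instead of
    rebuilding it on every request."""
    toolkit.get_action('datastore_upsert')(context, {
        'resource_id': resource_id,
        'records': records,
        'method': 'upsert',
    })
    # Hypothetical helper that rewrites the cached CSV for this resource.
    regenerate_csv_dump(resource_id)
```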

@andresmrm

+1

@davidread

This sounds most useful

@rufuspollock
Member

+1 - I think this seems sensible. It shouldn't be hard to link this with the FileStore so the generated file auto-pushes there ...

@nigelbabu

This sounds like it could use a Celery-like service, which we're talking about in #66.

@jqnatividad
Contributor

jqnatividad commented Dec 1, 2016

Bumping this up now that we have background jobs: http://docs.ckan.org/en/latest/maintaining/background-tasks.html

Instead of the current dump code, we could rely on running a native Postgres COPY command asynchronously, which is much faster, doesn't need to load everything into memory, and sidesteps the 100,000 row limit.
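A minimal sketch of what that background job could look like, assuming psycopg2 and a direct connection URL for the datastore database (the function name and CSV layout are just illustrative):

```python
import psycopg2

def dump_resource_to_csv(datastore_url, resource_id, out_path):
    """Stream a whole datastore table to a CSV file with Postgres COPY,
    without loading rows into Python memory and without any row limit."""
    conn = psycopg2.connect(datastore_url)
    try:
        with conn.cursor() as cur, open(out_path, 'w') as f:
            # Each datastore resource lives in a table named after its id;
            # a real implementation should validate/quote the identifier.
            cur.copy_expert(
                'COPY "{0}" TO STDOUT WITH CSV HEADER'.format(resource_id), f)
    finally:
        conn.close()
```

With the new background jobs this could then be enqueued with something like `toolkit.enqueue_job(dump_resource_to_csv, [url, resource_id, path])` rather than run inside the request.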

We have several clients who want to use the Datastore transactionally for large resources, and the current dump mechanism is presenting a problem.

If we do implement it as a background task, I also suggest putting in place a caching mechanism so that this relatively expensive process is not started unnecessarily when the table has not changed.
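For the caching, even something as simple as comparing timestamps might be enough (the last-modified value is assumed to be tracked elsewhere, e.g. updated on every datastore write):

```python
import os

def dump_is_stale(out_path, table_last_modified):
    """Return True if the cached CSV is missing or older than the last
    change to the datastore table (table_last_modified is a Unix
    timestamp we assume is recorded on every write)."""
    if not os.path.exists(out_path):
        return True
    return os.path.getmtime(out_path) < table_last_modified
```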

I'd also add a way to export to formats other than CSV (e.g. JSON, XLSX, XML), since other data portal solutions allow files to be uploaded as CSV and downloaded in different formats. Perhaps using tools like https://github.com/lukasmartinelli/pgclimb

cc @wardi @amercader

@wardi
Contributor

wardi commented Dec 2, 2016

Streaming data on request is a nice approach too. That gives you live data and doesn't multiply your storage requirements.

edit: I've found that openpyxl dumps XLSX data quite efficiently and has constant memory overhead in its write_only=True mode. CSV, JSON and XML can be streamed easily too.
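For reference, the write-only mode looks roughly like this (the fields/rows inputs are placeholders for whatever the dump code iterates over):

```python
from openpyxl import Workbook

def rows_to_xlsx(fields, rows, out_path):
    """Write rows to an XLSX file using openpyxl's write-only mode, which
    keeps memory use roughly constant instead of building the whole
    worksheet in RAM."""
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    ws.append(fields)        # header row
    for row in rows:         # rows can be any iterable, e.g. a DB cursor
        ws.append(row)
    wb.save(out_path)
```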

@wardi
Contributor

wardi commented Dec 2, 2016

Here's a simple fix that reduces memory usage and allows large CSV dumps from datastore: ckan/ckan#3344
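The PR has the real details; the general streaming shape is something like the following, where fetch_page is a stand-in for a paginated datastore query, not a real CKAN function:

```python
import csv
import io

def stream_csv(fields, fetch_page, page_size=10000):
    """Yield CSV text chunks for a datastore table one page at a time,
    so memory use stays flat no matter how many rows there are."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    offset = 0
    while True:
        records = fetch_page(limit=page_size, offset=offset)  # assumed helper
        for record in records:
            writer.writerow([record.get(f) for f in fields])
        yield buf.getvalue()
        buf.seek(0)
        buf.truncate(0)
        if len(records) < page_size:
            break
        offset += page_size
```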
