
DataStore to CSV service, for download of large resources. #34

Open
cphsolutionslab opened this issue Apr 8, 2014 · 8 comments

@cphsolutionslab

Explanation:

I would like to use the DataStore, via the API, as the primary data source. This already works without a problem.

However, if people want to download the entire resource as a CSV via /dump/, it only returns 100K records (this limit is hardcoded into CKAN).
It also takes quite a long time to generate the CSV file.

I have resources with more than 10 million rows and would like to offer a complete CSV download, but simply raising the hardcoded 100K row limit puts a lot of pressure on the system.
It would be very nice to have a feature where writes made through the DataStore API also update a corresponding CSV file for download, so the download wouldn't need to generate the file on the fly.
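Roughly, the flow I'm imagining looks like this. Just a sketch: `regenerate_csv_dump` is a hypothetical helper, not an existing CKAN function, and a real implementation would probably run it in the background.

```python
import ckan.plugins.toolkit as toolkit

def upsert_and_refresh_dump(context, resource_id, records):
    """Write records through the DataStore API, then refresh the
    pre-generated CSV so /dump/ can serve a static file instead of
    rebuilding it on every request."""
    toolkit.get_action('datastore_upsert')(context, {
        'resource_id': resource_id,
        'records': records,
        'method': 'upsert',
    })
    # Hypothetical helper that rewrites the cached CSV for this resource.
    regenerate_csv_dump(resource_id)
```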

@andresmrm

+1

@davidread

This sounds most useful

@rufuspollock
Member

+1 - I think this seems sensible. It shouldn't be hard to link this with the FileStore so the generated file auto-pushes there ...

@nigelbabu

This sounds like it could use a Celery-like service, which we're talking about in #66.

@jqnatividad
Contributor

jqnatividad commented Dec 1, 2016

Bumping this up now that we have background jobs: http://docs.ckan.org/en/latest/maintaining/background-tasks.html

Instead of the current dump code, we could rely on running a native Postgres COPY command asynchronously, which is much faster, doesn't need to load everything into memory, and sidesteps the 100,000 row limit.
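A minimal sketch of what that background job could look like, assuming psycopg2 and a direct connection URL for the datastore database (the function name and CSV layout are just illustrative):

```python
import psycopg2

def dump_resource_to_csv(datastore_url, resource_id, out_path):
    """Stream a whole datastore table to a CSV file with Postgres COPY,
    without loading rows into Python memory and without any row limit."""
    conn = psycopg2.connect(datastore_url)
    try:
        with conn.cursor() as cur, open(out_path, 'w') as f:
            # Each datastore resource lives in a table named after its id;
            # a real implementation should validate/quote the identifier.
            cur.copy_expert(
                'COPY "{0}" TO STDOUT WITH CSV HEADER'.format(resource_id), f)
    finally:
        conn.close()
```

With the new background jobs this could then be enqueued with something like `toolkit.enqueue_job(dump_resource_to_csv, [url, resource_id, path])` rather than run inside the request.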

We have several clients who want to use the Datastore transactionally for large resources, and the current dump mechanism is presenting a problem.

If we do implement it as a background task, I also suggest putting in place a caching mechanism so that this relatively expensive process is not started unnecessarily when the table has not changed.
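For the caching, even something as simple as comparing timestamps might be enough (the last-modified value is assumed to be tracked elsewhere, e.g. updated on every datastore write):

```python
import os

def dump_is_stale(out_path, table_last_modified):
    """Return True if the cached CSV is missing or older than the last
    change to the datastore table (table_last_modified is a Unix
    timestamp we assume is recorded on every write)."""
    if not os.path.exists(out_path):
        return True
    return os.path.getmtime(out_path) < table_last_modified
```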

I'd also add a way to export to formats other than CSV (e.g. JSON, XLSX, XML), since other data portal solutions allow files to be uploaded as CSV and downloaded in different formats. Perhaps using tools like https://github.com/lukasmartinelli/pgclimb

cc @wardi @amercader

@wardi
Contributor

wardi commented Dec 2, 2016

Streaming data on request is a nice approach too. That gives you live data and doesn't multiply your storage requirements.

edit: I've found that openpyxl dumps XLSX data quite efficiently and has constant memory overhead in its write_only=True mode. CSV, JSON and XML can be streamed easily too.
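For reference, the write-only mode looks roughly like this (the fields/rows inputs are placeholders for whatever the dump code iterates over):

```python
from openpyxl import Workbook

def rows_to_xlsx(fields, rows, out_path):
    """Write rows to an XLSX file using openpyxl's write-only mode, which
    keeps memory use roughly constant instead of building the whole
    worksheet in RAM."""
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    ws.append(fields)        # header row
    for row in rows:         # rows can be any iterable, e.g. a DB cursor
        ws.append(row)
    wb.save(out_path)
```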

@wardi
Contributor

wardi commented Dec 2, 2016

Here's a simple fix that reduces memory usage and allows large CSV dumps from datastore: ckan/ckan#3344
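The PR has the real details; the general streaming shape is something like the following, where fetch_page is a stand-in for a paginated datastore query, not a real CKAN function:

```python
import csv
import io

def stream_csv(fields, fetch_page, page_size=10000):
    """Yield CSV text chunks for a datastore table one page at a time,
    so memory use stays flat no matter how many rows there are."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    offset = 0
    while True:
        records = fetch_page(limit=page_size, offset=offset)  # assumed helper
        for record in records:
            writer.writerow([record.get(f) for f in fields])
        yield buf.getvalue()
        buf.seek(0)
        buf.truncate(0)
        if len(records) < page_size:
            break
        offset += page_size
```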
