
Update Datastore Datapusher documentation to mention Express Loader #4415

Open

jqnatividad opened this issue Aug 21, 2018 · 18 comments

@jqnatividad
Contributor
jqnatividad commented Aug 21, 2018

The current Datastore documentation makes no mention of Express Loader (https://github.com/davidread/ckanext-xloader), which is a marked improvement over Datapusher.

https://ckan.org/2017/11/20/express-loader-for-ckan-datastore/

For legacy purposes, I guess Datapusher needs to be maintained. But for new installations, shouldn't Express Loader be the recommended mechanism?

cc @davidread

@LukeTully
Contributor

LukeTully commented Aug 22, 2018

@jqnatividad This seems sensible, given that I'd not heard of express-loader until just now.

Of course, I'm apparently not as informed as I should be, but I spend most of my time in the docs.

@davidread
Contributor

I'm supportive of this. Since xloader was launched it's been used in a dozen or so CKANs at OpenGov. And we had some good validation of it being used and improved (PR1, PR2) by @metaodi, who switched the Open Data Zurich site from datapusher to xloader.

I have two concerns about xloader being the recommended choice:

  • datapusher autodetects column types, whereas xloader assumes they are all text. So either an administrator manually changes them using the Data Dictionary, or they stay as text. Users may miss that when doing SQL queries on the datastore, though they can still cast a column to int (say) in the query (see the sketch after this list). I think it is a reasonable trade-off for the speed, robustness, simplicity and debuggability of xloader.

  • Unresponsive resource pages during xloading ckanext-xloader#41. Loading large CSVs (100MB+) into Postgres causes Postgres to consume a lot of CPU, which can make resource pages unresponsive while a load is running. While there is a solution, adding a second Postgres server, it could do with a bit more battle testing. But maybe it's not going to get that without more people knowing about it.
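
For reference, a minimal sketch of casting at query time through the documented datastore_search_sql action. The site URL, resource id and column names here are hypothetical:

```python
# Sketch: cast a text column to numeric at query time via the CKAN
# DataStore SQL API. URL, resource id and columns are made up.
import requests

CKAN_URL = "https://demo.ckan.org"     # hypothetical site
RESOURCE_ID = "aaaa-bbbb-cccc"         # hypothetical resource id

# In the DataStore, the table name is the resource id.
sql = f'''
    SELECT "region", SUM(CAST("amount" AS numeric)) AS total
    FROM "{RESOURCE_ID}"
    GROUP BY "region"
'''

resp = requests.post(f"{CKAN_URL}/api/3/action/datastore_search_sql",
                     json={"sql": sql})
resp.raise_for_status()
print(resp.json()["result"]["records"])
```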

It would be good to get comments from @amercader @wardi and the rest of the tech team on this.

If we make xloader the official choice, we should also move it from github.com/davidread/ckanext-xloader to github.com/ckan/ckanext-xloader (I am supporting it).

@jqnatividad
Contributor Author

jqnatividad commented Aug 22, 2018

As all our sites have the Datastore installed, the single largest source of ongoing support issues before Express Loader was the Datapusher.

One day it works without a hitch (albeit very slowly); the next day it fails because of some data quality issue in the last row of a large CSV, which aborts the whole job. That is even more frustrating for large CSVs, as the failure happens hours into the datapusher job!

@davidread, your first concern is a feature, not a bug. Uploading everything as text and casting the columns after they're loaded is a much better workflow, with more graceful error handling (thanks to xloader's more verbose logging messages too), in my book.

As for your second concern, we have some insights that we added to ckan/ckanext-xloader#41 that should help.

@TkTech
Member

TkTech commented Aug 22, 2018

I've had similar experiences, and not just with CKAN. Guessing column types always causes more problems than it solves, especially when it starts doing things like truncating leading zeros because it thinks that a UPC is a number.

IMO the "better" solution is to continue expanding the data dictionary feature. We should only load the data after the data dictionary has been defined. Load the first couple of rows using something like papaparser to get column names and type hints. We could also allow selecting previously defined data dictionaries (for example we have one site that uploads many datasets of energy usage samples, one per province. The data dictionary would be identical for every one)

The pros as I see it:

  1. Not using variable length text fields (better data layout on disk without vacuums)
  2. Avoids forcing the use of CAST/:: everywhere, which is more user-friendly when writing queries.
  3. Properly typed, efficient indexes (instead of text)
  4. Simple user-friendly form to define the schema
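
A rough sketch of the first-rows idea (in Python rather than Papa Parse, purely for illustration): read a handful of rows, seed the column names, and offer naive type hints for the user to confirm on the form. The heuristics are made up, not anything CKAN ships:

```python
# Sketch: sample the first rows of a CSV to pre-fill a data dictionary
# form with column names and naive type hints. Illustrative only.
import csv
from datetime import datetime

def guess_type(values):
    """Return a naive type hint for a sample of string values."""
    def all_parse(parse):
        try:
            for v in values:
                parse(v)
            return True
        except (ValueError, TypeError):
            return False

    # Note: int("00012345") parses fine, which is exactly how UPCs lose
    # their leading zeros; hence hints the user confirms, not automatic
    # conversions.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"  # safe default, as xloader assumes

def preview_schema(path, sample_rows=5):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        sample = [row for _, row in zip(range(sample_rows), reader)]
    return {name: guess_type([row[name] for row in sample])
            for name in (reader.fieldnames or [])}
```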

@TkTech
Member

TkTech commented Aug 22, 2018

For legacy purposes, I guess Datapusher needs to be maintained. But for new installations, shouldn't Express Loader be the recommended mechanism?

@jqnatividad switching away from datapusher should be trivial; xloader is mostly a drop-in replacement. I would vote for immediate deprecation and removal in the next release unless there's a strong reason to keep it. We would backport security fixes only. Replace the documentation with xloader, along with a footnote pointing to earlier versions of the documentation for help with legacy datapusher installs.
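
For anyone following along, the switch is roughly this in the CKAN config file (option names should be checked against the current ckanext-xloader README; values here are illustrative):

```ini
# Replace the datapusher plugin with xloader
ckan.plugins = ... datastore xloader ...
# Queue database used by xloader's background jobs (illustrative URI)
ckanext.xloader.jobs_db.uri = postgresql://ckan:pass@localhost/ckan_default
```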

@jqnatividad
Contributor Author

jqnatividad commented Aug 22, 2018

@TkTech, xloader's genesis was this discussion 3 years ago :)

ckan/ideas#150

And I'm all for expanding the data dictionary too, to help not just with xloading but also as a way to get descriptive statistics metadata as well.

ckan/ideas#196

But instead of using csvkit to compile the descriptive statistics, leverage the pg_statistic_ext statistics compiled by Postgres after an ANALYZE.

@TkTech
Member

TkTech commented Aug 22, 2018

Regarding the statistics, I'm all for it, but rather than using csvkit and an asynchronous job, it should just be part of the loading process. Once the data is in Postgres we can easily get all of that information without having to use something like csvkit (a sketch follows the list below).

  1. No additional read of the data via csvkit
  2. No additional dependency
  3. Since it's SQL on the post-loaded data it works regardless of the original import type (csv, excel, JSONL, etc...)
  4. A lot of these statistics are already known by the postgres planner and available in the statistics table (most common, min, max, unique count, etc..)
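
A minimal sketch of point 4, assuming direct access to the datastore database; connection details and the resource id are hypothetical:

```python
# Sketch: read planner statistics for a datastore table after ANALYZE.
import psycopg2

RESOURCE_ID = "aaaa-bbbb-cccc"  # datastore table name == resource id

conn = psycopg2.connect("dbname=datastore_default user=ckan")
with conn, conn.cursor() as cur:
    # Refresh statistics first (identifier quoting simplified here).
    cur.execute('ANALYZE "%s"' % RESOURCE_ID)
    cur.execute("""
        SELECT attname, null_frac, n_distinct,
               most_common_vals, histogram_bounds
        FROM pg_stats
        WHERE tablename = %s
    """, (RESOURCE_ID,))
    for column, null_frac, n_distinct, common_vals, bounds in cur:
        print(column, null_frac, n_distinct, common_vals, bounds)
```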

@jqnatividad
Contributor Author

jqnatividad commented Aug 22, 2018

@TkTech LOL! Our comments re using Postgres statistics crossed paths in the internets :)

@LukeTully
Contributor

I like the idea of simply including the data dictionary workflow as an obvious next step in the resource addition process. (Create dataset -> add resource -> define dictionary)

Rather than loading just a few rows, or maybe in addition to that, we could try to surface data validation exceptions on the client before upload. For example, a user uploads a 100MB+ CSV, and we gradually validate the columns to surface any data that wouldn't work given the user's selected datatype (sketched below).
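
Something like the following control flow, though it would run client-side over File API chunks in practice; the type set and error format here are made up:

```python
# Sketch: stream a large CSV and collect values that fail the user's
# chosen column types, without loading anything into the datastore.
import csv

PARSERS = {"integer": int, "number": float, "text": str}

def validate_stream(path, chosen_types, max_errors=50):
    """Yield (line_no, column, bad_value) for values that don't parse."""
    errors = 0
    with open(path, newline="") as f:
        # Header is line 1, so data rows start at line 2.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            for column, type_name in chosen_types.items():
                try:
                    PARSERS[type_name](row[column])
                except (ValueError, KeyError):
                    yield line_no, column, row.get(column)
                    errors += 1
                    if errors >= max_errors:
                        return

# e.g. list(validate_stream("energy.csv", {"usage_kwh": "number"}))
```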

@TkTech
Member

TkTech commented Aug 22, 2018

Rather than loading just a few rows, or maybe in addition to that, we could try to surface data validation exceptions on the client before upload. For example, a user uploads a 100MB+ CSV, and we gradually validate the columns to surface any data that wouldn't work given the user's selected datatype.

I'm fine with this as long as it's 100% client side and running transparently in the background, but I'd give it a pretty low priority given restricted time. Reading the first few rows is just to populate the initial column names and type guesses on the form.

@amercader
Member

Thanks all for this very useful discussion. There's lots to unpack here so I'll summarize my views separately.

  • Mention Express Loader in docs: absolutely. We should reword the DataPusher section to "Automatically getting data in the DataStore" and describe EL there (with or without mentioning DataPusher too, see next point)

  • Dropping DataPusher and replacing it with EL: I'm happy with this if the rest of the team is happy. I suffered DataPusher issues like everyone, so if EL provides a more robust experience then we should embrace it. But one thing I'd say for the DataPusher: many, many people use it (basically because there was no other option), and I'd venture to say that for all its flaws it works in a lot of cases (I don't have data to sustain this other than my own experience). My point is that EL should match this. As I understand it, the main caveats of EL right now are:

    1. No type guessing. I don't think it's a problem, as I agree with others that we should move towards explicitly-defined data schemas (or data dictionaries). More on this later. The only problem I can think of is whether having all columns defined as text by default affects the visualizations. Without having actually tried, I'd imagine that e.g. charts wouldn't work correctly, and I'm not sure about maps (lat/lon columns). We should review the default behaviour of the default visualizations when EL is activated (this must have been assessed in the opengov instances using it)
    2. No support for private datasets. I got that from the EL README; not sure if it's still relevant. That could be a blocker for adoption to me. @davidread, is this something that can be overcome?
    3. Performance issues. Looking at the issue I understand that they can be worked around.

    Again, these are not listed to argue against EL adoption, just as points that need to be clarified

  • Moving EL to the CKAN org: yes, that makes perfect sense. But I'm going to be completely honest: if EL is going to be the default way of getting data into the DataStore, why not make it part of the datastore plugin (ckan.datastore.automatic_upload defaulting to True)? It's clear that 99% of people using the DataStore (wild guess here, but >80% of CKANs?) want data to be automatically uploaded, so shouldn't this functionality be baked into the datastore itself? I know we have the argument of easier iterations and releases when living in a separate extension, but we also have the counter-argument of less attention by maintainers, as is the case with DataPusher now. What do people (especially @davidread) think?

  • Defining the Data Schema before uploading (can you tell that I prefer the term Schema instead of Dictionary? ;) ): this is absolutely the way to go, and one we have been working on heavily in ckanext-validation. We have a working prototype widget that infers the fields client-side and presents an interface so publishers can tweak the field types directly in the form (see screenshot below). This stores the information as a Table Schema object on the resource, which could then be used by EL (or DataPusher) to create the relevant field types (a sketch follows this comment). It only needs some UX and UI polishing that could be fitted into future wider work around the dataset form. IMO, given all the work done, I think this is the best path forward for extending the current data dictionary support and having the best of both worlds (proper field types and EL). Would love to see if there is interest in getting this in core (and resources!)

    [screenshot: Add resource form showing the schema-inference widget on the demo-validation site]

    The current prototype can be tested live at https://demo-validation.l3.ckan.io/, PM me for a user / password.
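
To make the hand-off concrete, a sketch of how a loader might turn the stored Table Schema object into datastore field types. The schema literal and the type mapping are illustrative; Table Schema itself is the Frictionless Data spec:

```python
# Sketch: map a Table Schema object (stored on the resource by
# ckanext-validation) to datastore/Postgres field types. Illustrative.
table_schema = {
    "fields": [
        {"name": "date", "type": "date"},
        {"name": "price", "type": "number"},
        {"name": "notes", "type": "string"},
    ]
}

# Hypothetical mapping; a real loader would need to cover more types.
TYPE_MAP = {
    "integer": "int",
    "number": "numeric",
    "date": "date",
    "datetime": "timestamp",
    "boolean": "bool",
    "string": "text",
}

datastore_fields = [
    {"id": f["name"], "type": TYPE_MAP.get(f["type"], "text")}
    for f in table_schema["fields"]
]
# datastore_fields could then be passed as `fields` to datastore_create
# before the loader inserts the rows.
```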

@jqnatividad
Contributor Author

Hi @amercader,
On Data Schema, that'd be excellent! Though switching over might be non-trivial given all the work that was done to store Dictionary values in Postgres comments.

I've been tracking the Frictionless Data project and we've been planning on installing ckanext-datapackager as a standard extension especially since we've announced support for Fiscal Data Package and Data Packages in general. Can ckanext-validation and ckanext-datapackager coexist?

@wardi
Contributor

wardi commented Aug 25, 2018

+1M to data schemas before uploading data

open.canada.ca has been doing something like this internally with ckanext-recombinant for a couple years now, and we have departments lined up to use it because the users love it and the data quality is awesome.

Recombinant benefits:

  • controlled lists (single and multiple select) are supported
  • schemas are applied across organizations so we can easily merge and publish data from all orgs
  • schemas are managed by a developer who can help choose the best primary keys, remove redundant information, link to existing controlled lists and migrate old data as required
  • custom validation is possible (e.g. field B required only when field A is a specific value -- this is a very common requirement)
  • data is validated first in a provided Excel template so users can immediately see their errors before uploading
  • validation is enforced by the datastore database using triggers (very fast!) and whole update is rolled back on errors (very safe!)
  • data is updated on an incremental basis and old data is kept in datastore tables
  • triggers are used to record which user updated every row, and when every row was created and updated

Recombinant drawbacks:

  • schemas are applied across organizations, not ad-hoc for individual datasets
  • schemas need to be written by a developer (bottleneck!)
  • custom validation needs to be written by a developer (in Excel and pl/pgsql functions)
  • migrations are required on some schema changes, written by a developer
  • updating data requires copying and pasting from the datatable preview, or keeping old templates around, which can cause conflicts and re-uploading of old data

Data schemas might benefit from some of the lessons we've learned from recombinant, e.g.

  1. Thought about providing templates based on the schema already populated with validation rules? Users would love it (immediate error reporting), and the code is mostly there in recombinant.
  2. What about incremental updates instead of reloading all the data all the time? It's been good for recombinant and enabled regular reporting on all the data types we collect. Too difficult?
  3. Separate the data schema from the dataset so that the people allowed to upload data aren't the same ones designing the schema? This is a common, and pretty darn useful requirement.
  4. Custom validation? Extend JSON schema enough to cover enough of the missing cases?
  5. Enforced validation? We can prevent bad data with triggers instead of reporting on it afterwards. That's good for some users and some types of data (see the sketch after this list).
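
On point 5, CKAN already exposes trigger support through the API (datastore_function_create, and the triggers argument to datastore_create, since 2.7). A sketch of the field-B-required-when-field-A rule, with hypothetical site, credentials and field names:

```python
# Sketch: enforce "field_b required when field_a = 'other'" with a
# datastore trigger. Site URL, API key, resource id and field names are
# hypothetical; the trigger API itself is real (CKAN 2.7+).
import ckanapi

ckan = ckanapi.RemoteCKAN("https://demo.ckan.org", apikey="...")

ckan.action.datastore_function_create(
    name="require_b_when_a_is_other",
    or_replace=True,
    rettype="trigger",
    definition="""
        BEGIN
            IF NEW.field_a = 'other' AND NEW.field_b IS NULL THEN
                RAISE EXCEPTION 'field_b is required when field_a = other';
            END IF;
            RETURN NEW;
        END;
    """,
)

ckan.action.datastore_create(
    resource_id="aaaa-bbbb-cccc",  # hypothetical
    force=True,
    fields=[{"id": "field_a", "type": "text"},
            {"id": "field_b", "type": "text"}],
    triggers=[{"function": "require_b_when_a_is_other"}],
)
```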

@wardi
Contributor

wardi commented Aug 25, 2018

This got lost in my long-winded comment: ckanext-validation and the approach @amercader describes look awesome and super useful without any changes!

Wait, is this an issue about xloader?

Yes, I think we should recommend xloader over datapusher :-)

@amercader amercader self-assigned this Aug 28, 2018
@davidread
Contributor

davidread commented Aug 31, 2018

Covering off the caveats that @amercader raises:

i. No type guessing potentially affecting visualizations.

There doesn't seem to be a problem with it interpreting the text as numbers:

[screenshot: chart rendering correctly with text columns interpreted as numbers]

However there is an issue with dates (e.g. '2018-01-25') which end up just as a year, distorting the gold prices example:

[screenshot: gold prices chart distorted, with dates reduced to just the year]

So this seems a concern for datasets which are time series like this one.

ii. Private dataset support.

I've now tested this and it works fine with private datasets. 👍

iii. Performance issues.

Looking at the issue I understand that they can be worked around.

Indeed. I'll look at this more now, though.

@davidread
Contributor

I've finished the work on CKAN performance that was the key concern on this ticket.

Yes, there is the concern about recline not graphing dates until the Data Dictionary is edited to say they are dates. But I reckon we can live with that in exchange for the improved loading (a sketch of the workaround follows). (And btw I'm fine with renaming Data Dictionary to Data Schema; I think the latter is more commonly used.)
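
For anyone hitting this, a minimal sketch of that workaround: declare the date column's type up front so the table is created with a timestamp rather than text. The resource id and field names are hypothetical; the same thing can be done through the Data Dictionary form:

```python
# Sketch: pre-declare field types with datastore_create so dates are
# typed before loading. Details are hypothetical.
import ckanapi

ckan = ckanapi.RemoteCKAN("https://demo.ckan.org", apikey="...")

ckan.action.datastore_create(
    resource_id="aaaa-bbbb-cccc",
    force=True,
    fields=[
        {"id": "date", "type": "timestamp"},
        {"id": "price", "type": "numeric"},
    ],
)
# A subsequent xloader run can then load into the typed columns
# (behaviour may vary by loader version).
```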

I'll do a PR to change the recommendation in the docs from DataPusher to Express Loader.

I'll also look at moving Express Loader into the Datastore extension (i.e. in the core ckan repo), as @amercader suggests. I'll ensure there is an option to disable it, so that datapusher or some other loader can still be used.

@Gauravp-NEC
Contributor

I have updated the DataStore Extension documentation to mention Express Loader in PR #6972. Do we need to completely deprecate the DataPusher?

@Labuschagne-Miro
Contributor

#8044
