Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advanced Data Dictionary #18

Open
Tracked by #5
jqnatividad opened this issue Apr 28, 2022 · 12 comments
Open
Tracked by #5

Advanced Data Dictionary #18

jqnatividad opened this issue Apr 28, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@jqnatividad
Copy link
Contributor

jqnatividad commented Apr 28, 2022

Currently, CKAN's Data Dictionary is limited to data type, label and description.

With qsv stats we collect descriptive statistics when we infer each column's data type during the Analysis phase of a DP+ job.

Currently - sum, min/max, min/max length, mean, stddev, variance, quartiles, median, modes, cardinality & nullcount.

When cardinality = rowcount and nullcount = 0, we can infer that a column can be a primary key and be a unique index, and annotate its data dictionary accordingly (and going further, create a unique index on it after the Copy phase).

When nullcount = rowcount, we can infer that a column is empty, and note it in the data dictionary as well.

And with qsv frequency - we can also compile frequency tables for the top N values of a column, and if the cardinality of a column is below a given N threshold, we can even infer the domain of a column as enumerated values.

Since we paid for compiling the statistics when we inferred the column data types already, we can store these statistics in the data dictionary as well as "schemata" (a term I coined for schema metadata) for "free" ( or nearly free, as running qsv frequency is not currently done by DP+, but even against a large file like the 1m row, 500mb NYC 311 benchmark data, it only takes 2.8 seconds).

We have several options:

  1. add additional properties to the existing Data Dictionary JSON which is stored as a table comment
  2. keep the existing Data Dictionary JSON as is, and store RESOURCE_NAME-stats.csv and RESOURCE_NAME-freq.csv as "system resources", that can be downloaded and queried with the CKAN API.
  3. extend the Data Dictionary with additional properties, and also store the -stats and -freq CSVs as system resources.
  4. alternatively, instead of using the original Data Dictionary JSON, we can instead insert the jsonschema file produced by the qsv schema command. The added benefit of doing so is that we use the jsonschema file with qsv validate to check if an external file conforms to the schema. And since qsv validate accepts a jsonschema URL, you can even validate an external file against the CKAN hosted jsonschema.
  5. Do 4, and add the "system resources" like 3.
  6. Store all these schemata data in a "schemata catalog" in the datastore database or a dedicated schemata database as native PostgreSQL objects. Perhaps, by using a resource's ID and adding a special prefix and/or suffix to it (e.g. RESOURCEID_datadict, RESOURCEID_stats, RESOURCEID_freq). Doing so has the added benefit of being able to query all the data dictionaries - e.g. columns with the same name, infer related resources, suggest joins, etc. and other "Linked Data" like queries and inferences in a performant manner.

@wardi , since you originally implemented the Data Dictionary, would be keen to get your opinion.

@wardi
Copy link
Contributor

wardi commented Apr 29, 2022

I like the idea of adding the jsonschema generated by qsv to the resource as a "schema" metadata field like the way ckanext-validation does. This way there's a clear method to download and reuse the schema, and it might even be interoperable with other extensions.

I don't like the data dictionary option because this data is all generated on upload so populating the data dictionary with these fields would be mix generated data together with user-supplied metadata. (would you make them read-only in the form? what about in the api?)

@jqnatividad
Copy link
Contributor Author

I agree.The jsonschema produced by qsv is really robust. It enforces pretty much all the validation rules as per the jsonschema spec, and the generated schema can even have the enumerated constraints inferred by qsv frequency for a given threshold. It can even infer a regex pattern from the domain values of a column and add it to the jsonschema (https://github.com/jqnatividad/qsv/blob/master/src/cmd/schema.rs#L23).

Validating with qsv validate is also quite fast as its multi-threaded. When we were testing it, we were validating a CSV against a non-trivial jsonschema spec at 300k records/second (jqnatividad/qsv#164).

And as you pointed out, the jsonschema can even be used by other extensions, as the jsonschema validation engine we use has python bindings.

As for the Data Dictionary UI, our client actually wants to be able to specify some things manually beyond just changing the description. For the alpha launch this summer, we're just going to stay with the current Data Dictionary form, and add the jsonschema file as a "system resource."

But yeah, ultimately, having some form of jsonschema editor would be nice. At the very least, to change/sync the description. Maybe, there can be a "power user" view they can switch to where they can manually edit the jsonschema, and just do a json lint check before saving the edits...

As for the Data Dictionary API, was thinking we can keep current interface for backwards compatibility and just document that there is a jsonschema file at a standard URL endpoint for each resource...

WDYT?

@wardi
Copy link
Contributor

wardi commented May 1, 2022

ckanext-validation stores the jsonschema as a resource metadata field so it can be edited as part of the form or via the API. I'm not sure what you mean by "system resource", is it stored in the db or on the filesystem or using an installed IUploader extension? what APIs are available for updating the contents?

@jqnatividad
Copy link
Contributor Author

I'll have to look into ckanext-validation and understand how it works, and perhaps "borrow" some ideas. Do you have it running in production?

In my mind, a "system resource" is nothing more than a regular resource that's created by DP+ when pushing a "regular" resource with a special suffix (RESOURCEID_stats, RESOURCEID_freq, etc.), so it can be queried with the API just like any other resource.

@wardi
Copy link
Contributor

wardi commented May 2, 2022

So a resource that appears next to the others in the dataset? Or just some tables in the datastore db with corresponding table names?

We are using ckanext-validation in production on an internal version of our site.

@jqnatividad
Copy link
Contributor Author

I was thinking it will just be a resource that appears next to the "regular resources". In that way, they're still addressable with the Datastore API... but there might be some advantages in storing them as regular PosgtgreSQL tables...

@wardi
Copy link
Contributor

wardi commented May 2, 2022

If they're regular resources I can see issues with users:

  • wanting them displayed differently or hidden from the form/UI
  • reordering them so it's not clear which system resource corresponds to which resource
  • making changes to them that will get lost the next time the original resource changes

Or the extension itself might have problems creating system resources due to a customized resource metadata schema.

A specially named table in the datastore might be the way to go. Next add a custom endpoint that generates jsonschema based on data in a special table, and hook resource deletion so that these special tables are cleaned up. That should be most of the features you would want.

@jqnatividad
Copy link
Contributor Author

Hi @wardi, in the 5/5/2022 dev team call, you mentioned you were working on a Table Designer.

Is that in a publicly available repo I can check?

@wardi
Copy link
Contributor

wardi commented May 6, 2022

just the branch I started a while ago where you can see some of the data dictionary overrides: https://github.com/ckan/ckan/compare/6118-table-designer#diff-9de4be8179eb39da879b2e19646f874128c85c2d673d277735c709405d706eac

@jqnatividad
Copy link
Contributor Author

For the 0.4.0 release, DP+ sets the Data Dictionary Label field to the original column name as it was uploaded by the user.

This is because DP+ sanitizes the column names to be "db-safe" (i.e. valid postgres identifiers) with the qsv safenames command.

This pretty much guarantees a table will be created in the Datastore even if a CSV/Excel file has invalid column names.

By setting the Data Dictionary label to the original column name, they'll still see the original name with DataTables view as it uses the Data Dictionary label when its defined instead of the underlying postgresql column name.

@jqnatividad jqnatividad added the enhancement New feature or request label Dec 19, 2022
@jqnatividad
Copy link
Contributor Author

For the next release, DP+ will now have a setting SUMMARY_STATISTICS with a default value of false.

If set, a detailed summary statistics CSV file will be added as an additional resources with the "-stats" filename suffix.

@jqnatividad
Copy link
Contributor Author

Just to update, DP+ has to option to add summary statistics to a resource since 0.5.0.

Here's the relevant section in the .env file

# -------- SUMMARY STATS SETTINGS -----------
# Create a resource for calculated summary stats?
ADD_SUMMARY_STATS_RESOURCE = False
# Summary Stats don't make sense if PREVIEW_ROWS > 0
# because, you're only summarizing the preview, not the whole file
# Set to True if Summary Stats should also be done for previews
SUMMARY_STATS_WITH_PREVIEW = False
# additional command line options to pass to qsv stats when creating
# summary stats. Set to `--everything` if you want to include all the stats,
# particularly, when ADD_SUMMARY_STATS_RESOURCE is True
SUMMARY_STATS_OPTIONS = ''

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants