Advanced Data Dictionary #18

jqnatividad · 2022-04-28T22:11:42Z

Currently, CKAN's Data Dictionary is limited to data type, label and description.

With qsv stats we collect descriptive statistics when we infer each column's data type during the Analysis phase of a DP+ job.

Currently - sum, min/max, min/max length, mean, stddev, variance, quartiles, median, modes, cardinality & nullcount.

When cardinality = rowcount and nullcount = 0, we can infer that a column can be a primary key and be a unique index, and annotate its data dictionary accordingly (and going further, create a unique index on it after the Copy phase).

When nullcount = rowcount, we can infer that a column is empty, and note it in the data dictionary as well.

And with qsv frequency - we can also compile frequency tables for the top N values of a column, and if the cardinality of a column is below a given N threshold, we can even infer the domain of a column as enumerated values.

Since we paid for compiling the statistics when we inferred the column data types already, we can store these statistics in the data dictionary as well as "schemata" (a term I coined for schema metadata) for "free" ( or nearly free, as running qsv frequency is not currently done by DP+, but even against a large file like the 1m row, 500mb NYC 311 benchmark data, it only takes 2.8 seconds).

We have several options:

add additional properties to the existing Data Dictionary JSON which is stored as a table comment
keep the existing Data Dictionary JSON as is, and store RESOURCE_NAME-stats.csv and RESOURCE_NAME-freq.csv as "system resources", that can be downloaded and queried with the CKAN API.
extend the Data Dictionary with additional properties, and also store the -stats and -freq CSVs as system resources.
alternatively, instead of using the original Data Dictionary JSON, we can instead insert the jsonschema file produced by the qsv schema command. The added benefit of doing so is that we use the jsonschema file with qsv validate to check if an external file conforms to the schema. And since qsv validate accepts a jsonschema URL, you can even validate an external file against the CKAN hosted jsonschema.
Do 4, and add the "system resources" like 3.
Store all these schemata data in a "schemata catalog" in the datastore database or a dedicated schemata database as native PostgreSQL objects. Perhaps, by using a resource's ID and adding a special prefix and/or suffix to it (e.g. RESOURCEID_datadict, RESOURCEID_stats, RESOURCEID_freq). Doing so has the added benefit of being able to query all the data dictionaries - e.g. columns with the same name, infer related resources, suggest joins, etc. and other "Linked Data" like queries and inferences in a performant manner.

@wardi , since you originally implemented the Data Dictionary, would be keen to get your opinion.

The text was updated successfully, but these errors were encountered:

wardi · 2022-04-29T17:43:59Z

I like the idea of adding the jsonschema generated by qsv to the resource as a "schema" metadata field like the way ckanext-validation does. This way there's a clear method to download and reuse the schema, and it might even be interoperable with other extensions.

I don't like the data dictionary option because this data is all generated on upload so populating the data dictionary with these fields would be mix generated data together with user-supplied metadata. (would you make them read-only in the form? what about in the api?)

jqnatividad · 2022-04-30T01:46:47Z

I agree.The jsonschema produced by qsv is really robust. It enforces pretty much all the validation rules as per the jsonschema spec, and the generated schema can even have the enumerated constraints inferred by qsv frequency for a given threshold. It can even infer a regex pattern from the domain values of a column and add it to the jsonschema (https://github.com/jqnatividad/qsv/blob/master/src/cmd/schema.rs#L23).

Validating with qsv validate is also quite fast as its multi-threaded. When we were testing it, we were validating a CSV against a non-trivial jsonschema spec at 300k records/second (jqnatividad/qsv#164).

And as you pointed out, the jsonschema can even be used by other extensions, as the jsonschema validation engine we use has python bindings.

As for the Data Dictionary UI, our client actually wants to be able to specify some things manually beyond just changing the description. For the alpha launch this summer, we're just going to stay with the current Data Dictionary form, and add the jsonschema file as a "system resource."

But yeah, ultimately, having some form of jsonschema editor would be nice. At the very least, to change/sync the description. Maybe, there can be a "power user" view they can switch to where they can manually edit the jsonschema, and just do a json lint check before saving the edits...

As for the Data Dictionary API, was thinking we can keep current interface for backwards compatibility and just document that there is a jsonschema file at a standard URL endpoint for each resource...

WDYT?

wardi · 2022-05-01T13:21:59Z

ckanext-validation stores the jsonschema as a resource metadata field so it can be edited as part of the form or via the API. I'm not sure what you mean by "system resource", is it stored in the db or on the filesystem or using an installed IUploader extension? what APIs are available for updating the contents?

jqnatividad · 2022-05-02T16:45:38Z

I'll have to look into ckanext-validation and understand how it works, and perhaps "borrow" some ideas. Do you have it running in production?

In my mind, a "system resource" is nothing more than a regular resource that's created by DP+ when pushing a "regular" resource with a special suffix (RESOURCEID_stats, RESOURCEID_freq, etc.), so it can be queried with the API just like any other resource.

wardi · 2022-05-02T17:34:58Z

So a resource that appears next to the others in the dataset? Or just some tables in the datastore db with corresponding table names?

We are using ckanext-validation in production on an internal version of our site.

jqnatividad · 2022-05-02T20:16:52Z

I was thinking it will just be a resource that appears next to the "regular resources". In that way, they're still addressable with the Datastore API... but there might be some advantages in storing them as regular PosgtgreSQL tables...

wardi · 2022-05-02T20:50:52Z

If they're regular resources I can see issues with users:

wanting them displayed differently or hidden from the form/UI
reordering them so it's not clear which system resource corresponds to which resource
making changes to them that will get lost the next time the original resource changes

Or the extension itself might have problems creating system resources due to a customized resource metadata schema.

A specially named table in the datastore might be the way to go. Next add a custom endpoint that generates jsonschema based on data in a special table, and hook resource deletion so that these special tables are cleaned up. That should be most of the features you would want.

jqnatividad · 2022-05-06T17:52:40Z

Hi @wardi, in the 5/5/2022 dev team call, you mentioned you were working on a Table Designer.

Is that in a publicly available repo I can check?

wardi · 2022-05-06T19:09:17Z

just the branch I started a while ago where you can see some of the data dictionary overrides: https://github.com/ckan/ckan/compare/6118-table-designer#diff-9de4be8179eb39da879b2e19646f874128c85c2d673d277735c709405d706eac

jqnatividad · 2022-12-14T15:46:58Z

For the 0.4.0 release, DP+ sets the Data Dictionary Label field to the original column name as it was uploaded by the user.

This is because DP+ sanitizes the column names to be "db-safe" (i.e. valid postgres identifiers) with the qsv safenames command.

This pretty much guarantees a table will be created in the Datastore even if a CSV/Excel file has invalid column names.

By setting the Data Dictionary label to the original column name, they'll still see the original name with DataTables view as it uses the Data Dictionary label when its defined instead of the underlying postgresql column name.

jqnatividad · 2023-01-04T19:50:00Z

For the next release, DP+ will now have a setting SUMMARY_STATISTICS with a default value of false.

If set, a detailed summary statistics CSV file will be added as an additional resources with the "-stats" filename suffix.

jqnatividad · 2023-06-18T13:15:37Z

Just to update, DP+ has to option to add summary statistics to a resource since 0.5.0.

Here's the relevant section in the .env file

datapusher-plus/datapusher/dot-env.template

Lines 116 to 128 in 48ca3eb

    
           # -------- SUMMARY STATS SETTINGS ----------- 
        
           # Create a resource for calculated summary stats? 
        
           ADD_SUMMARY_STATS_RESOURCE = False 
        
           # Summary Stats don't make sense if PREVIEW_ROWS > 0 
        
           # because, you're only summarizing the preview, not the whole file 
        
           # Set to True if Summary Stats should also be done for previews 
        
           SUMMARY_STATS_WITH_PREVIEW = False 
        
           # additional command line options to pass to qsv stats when creating 
        
           # summary stats. Set to `--everything` if you want to include all the stats, 
        
           # particularly, when ADD_SUMMARY_STATS_RESOURCE is True 
        
           SUMMARY_STATS_OPTIONS = ''

jqnatividad mentioned this issue Apr 28, 2022

Roadmap Tracking Issue - EPIC #5

Open

27 tasks

jqnatividad added the enhancement New feature or request label Dec 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Advanced Data Dictionary #18

Advanced Data Dictionary #18

jqnatividad commented Apr 28, 2022 •

edited

Loading

wardi commented Apr 29, 2022

jqnatividad commented Apr 30, 2022

wardi commented May 1, 2022

jqnatividad commented May 2, 2022

wardi commented May 2, 2022

jqnatividad commented May 2, 2022

wardi commented May 2, 2022

jqnatividad commented May 6, 2022

wardi commented May 6, 2022

jqnatividad commented Dec 14, 2022

jqnatividad commented Jan 4, 2023

jqnatividad commented Jun 18, 2023

Advanced Data Dictionary #18

Advanced Data Dictionary #18

Comments

jqnatividad commented Apr 28, 2022 • edited Loading

wardi commented Apr 29, 2022

jqnatividad commented Apr 30, 2022

wardi commented May 1, 2022

jqnatividad commented May 2, 2022

wardi commented May 2, 2022

jqnatividad commented May 2, 2022

wardi commented May 2, 2022

jqnatividad commented May 6, 2022

wardi commented May 6, 2022

jqnatividad commented Dec 14, 2022

jqnatividad commented Jan 4, 2023

jqnatividad commented Jun 18, 2023

jqnatividad commented Apr 28, 2022 •

edited

Loading