Advanced Data Dictionary #18
I like the idea of adding the jsonschema generated by qsv to the resource as a "schema" metadata field, the way ckanext-validation does. This way there's a clear method to download and reuse the schema, and it might even be interoperable with other extensions. I don't like the data dictionary option because this data is all generated on upload, so populating the data dictionary with these fields would mix generated data with user-supplied metadata. (Would you make them read-only in the form? What about in the API?)
I agree. The jsonschema produced by qsv is really robust. It enforces pretty much all the validation rules as per the jsonschema spec, and the generated schema can even have the enumerated constraints inferred by `qsv frequency`. Validation is then done with `qsv validate`. And as you pointed out, the jsonschema can even be used by other extensions, as the jsonschema validation engine we use has Python bindings.

As for the Data Dictionary UI, our client actually wants to be able to specify some things manually beyond just changing the description. For the alpha launch this summer, we're just going to stay with the current Data Dictionary form and add the jsonschema file as a "system resource." But yeah, ultimately, having some form of jsonschema editor would be nice. At the very least, to change/sync the description. Maybe there can be a "power user" view they can switch to where they can manually edit the jsonschema, with just a JSON lint check before saving the edits...

As for the Data Dictionary API, I was thinking we can keep the current interface for backwards compatibility and just document that there is a jsonschema file at a standard URL endpoint for each resource... WDYT?
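A minimal sketch of the JSON lint check mentioned above - purely illustrative, not DP+ code; the `save_schema` callback is hypothetical:

```python
import json

def lint_and_save(raw_text: str, save_schema) -> bool:
    """Only persist the edited jsonschema if it parses as valid JSON."""
    try:
        schema = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Reject the edit and surface the parse error to the power user.
        print(f"Invalid JSON at line {exc.lineno}: {exc.msg}")
        return False
    save_schema(schema)  # hypothetical persistence callback
    return True
```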
ckanext-validation stores the jsonschema as a resource metadata field so it can be edited as part of the form or via the API. I'm not sure what you mean by "system resource" - is it stored in the db, on the filesystem, or using an installed IUploader extension? What APIs are available for updating the contents?
I'll have to look into ckanext-validation and understand how it works, and perhaps "borrow" some ideas. Do you have it running in production? In my mind, a "system resource" is nothing more than a regular resource that's created by DP+ when pushing a "regular" resource, with a special suffix (`RESOURCEID_stats`, `RESOURCEID_freq`, etc.), so it can be queried with the API just like any other resource.
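For illustration, such a system resource would be queryable like any other datastore resource. A sketch using the standard CKAN `datastore_search` action (the instance URL and resource id are hypothetical):

```python
import requests

CKAN_URL = "https://demo.ckan.org"   # hypothetical CKAN instance
STATS_RESOURCE_ID = "abc123_stats"   # hypothetical system resource id

# datastore_search is a standard CKAN Datastore API action.
resp = requests.get(
    f"{CKAN_URL}/api/3/action/datastore_search",
    params={"resource_id": STATS_RESOURCE_ID, "limit": 5},
)
resp.raise_for_status()
for record in resp.json()["result"]["records"]:
    print(record)
```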
So a resource that appears next to the others in the dataset? Or just some tables in the datastore db with corresponding table names? We are using ckanext-validation in production on an internal version of our site. |
I was thinking it would just be a resource that appears next to the "regular" resources. That way, they're still addressable with the Datastore API... but there might be some advantages in storing them as regular PostgreSQL tables...
If they're regular resources, I can see issues with users:
Or the extension itself might have problems creating system resources due to a customized resource metadata schema. A specially named table in the datastore might be the way to go. Next, add a custom endpoint that generates jsonschema based on data in a special table, and hook resource deletion so that these special tables are cleaned up. That should cover most of the features you would want.
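A rough sketch of that cleanup hook, assuming CKAN 2.9-style `IResourceController` method names (they were renamed in CKAN 2.10) and the table-name suffixes floated earlier; the datastore call is illustrative, as dropping a non-resource table may need backend-specific handling:

```python
import ckan.plugins as p
import ckan.plugins.toolkit as tk


class SchemataCleanupPlugin(p.SingletonPlugin):
    """Clean up the special stats/freq/datadict tables on resource delete."""
    p.implements(p.IResourceController, inherit=True)

    def before_delete(self, context, resource, resources):
        # Hypothetical suffixes for the specially named datastore tables.
        for suffix in ("_datadict", "_stats", "_freq"):
            try:
                tk.get_action("datastore_delete")(
                    context,
                    {"resource_id": resource["id"] + suffix, "force": True},
                )
            except tk.ObjectNotFound:
                pass  # nothing to clean up for this resource
```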
Hi @wardi, in the 5/5/2022 dev team call, you mentioned you were working on a Table Designer. Is that in a publicly available repo I can check? |
just the branch I started a while ago where you can see some of the data dictionary overrides: https://github.com/ckan/ckan/compare/6118-table-designer#diff-9de4be8179eb39da879b2e19646f874128c85c2d673d277735c709405d706eac |
For the 0.4.0 release, DP+ sets the Data Dictionary Label field to the original column name as it was uploaded by the user. This is because DP+ sanitizes the column names to be "db-safe" (i.e. valid postgres identifiers) with the `qsv safenames` command. This pretty much guarantees a table will be created in the Datastore even if a CSV/Excel file has invalid column names. By setting the Data Dictionary label to the original column name, users will still see the original name in the DataTables view, as it uses the Data Dictionary label when it's defined instead of the underlying PostgreSQL column name.
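For context, the Data Dictionary label lives in a field's `info` dict in the datastore, so setting the original column name as the label could look like this sketch (`datastore_create` and the `info` key are standard CKAN Datastore API; the instance, token, resource id, and column names are made up):

```python
import requests

# Hypothetical CKAN instance, API token, and resource id.
CKAN_URL = "https://demo.ckan.org"
API_TOKEN = "xxx"
RESOURCE_ID = "my-resource-id"

# "Total $ Amount" was sanitized to the db-safe "total_amount";
# the original name is preserved as the Data Dictionary label.
fields = [
    {"id": "total_amount", "type": "numeric",
     "info": {"label": "Total $ Amount", "notes": ""}},
]

requests.post(
    f"{CKAN_URL}/api/3/action/datastore_create",
    headers={"Authorization": API_TOKEN},
    json={"resource_id": RESOURCE_ID, "fields": fields, "force": True},
).raise_for_status()
```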
For the next release, DP+ will have a new `SUMMARY_STATISTICS` setting with a default value of false. If set, a detailed summary statistics CSV file will be added as an additional resource with the "-stats" filename suffix.
Just to update: DP+ has had the option to add summary statistics to a resource since 0.5.0. Here's the relevant section in the .env file: `datapusher-plus/datapusher/dot-env.template`, lines 116 to 128 (at commit `48ca3eb`).
Currently, CKAN's Data Dictionary is limited to data type, label and description.

With `qsv stats`, we collect descriptive statistics when we infer each column's data type during the Analysis phase of a DP+ job. Currently: sum, min/max, min/max length, mean, stddev, variance, quartiles, median, modes, cardinality & nullcount.

When cardinality = rowcount and nullcount = 0, we can infer that a column can be a primary key and a unique index, and annotate its data dictionary accordingly (and going further, create a unique index on it after the Copy phase).

When nullcount = rowcount, we can infer that a column is empty, and note it in the data dictionary as well.
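A sketch of those two inference rules over the stats listed above (the dict shape and field names are hypothetical, not DP+ internals):

```python
def annotate_column(stats: dict, rowcount: int) -> list:
    """Derive data dictionary annotations from per-column summary stats.

    `stats` is a hypothetical dict carrying the qsv stats fields named
    above, e.g. {"cardinality": 1_000_000, "nullcount": 0}.
    """
    notes = []
    if stats["cardinality"] == rowcount and stats["nullcount"] == 0:
        notes.append("candidate primary key / unique index")
    if stats["nullcount"] == rowcount:
        notes.append("empty column")
    return notes


# Example: a fully populated, all-unique id column in a 1,000,000-row table.
print(annotate_column({"cardinality": 1_000_000, "nullcount": 0}, 1_000_000))
# -> ['candidate primary key / unique index']
```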
And with `qsv frequency`, we can also compile frequency tables for the top N values of a column, and if the cardinality of a column is below a given N threshold, we can even infer the domain of a column as enumerated values.

Since we already paid for compiling the statistics when we inferred the column data types, we can store these statistics in the data dictionary as well, as "schemata" (a term I coined for schema metadata), for "free" (or nearly free, as running `qsv frequency` is not currently done by DP+, but even against a large file like the 1m row, 500mb NYC 311 benchmark data, it only takes 2.8 seconds).

We have several options:
- Store `RESOURCE_NAME-stats.csv` and `RESOURCE_NAME-freq.csv` as "system resources" that can be downloaded and queried with the CKAN API.
- Add the `-stats` and `-freq` CSVs as system resources.
- Generate a jsonschema file with the `qsv schema` command. The added benefit of doing so is that we can use the jsonschema file with `qsv validate` to check if an external file conforms to the schema. And since `qsv validate` accepts a jsonschema URL, you can even validate an external file against the CKAN-hosted jsonschema (a sketch follows at the end of this post).
- Store the data dictionary in the datastore itself (`RESOURCEID_datadict`, `RESOURCEID_stats`, `RESOURCEID_freq`). Doing so has the added benefit of being able to query all the data dictionaries - e.g. find columns with the same name, infer related resources, suggest joins, and run other "Linked Data"-style queries and inferences in a performant manner.

@wardi, since you originally implemented the Data Dictionary, I'd be keen to get your opinion.
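To illustrate the `qsv schema`/`qsv validate` option, a sketch that validates an external CSV against a CKAN-hosted jsonschema URL (`qsv validate [<input>] [<json-schema>]` is the documented usage; the file and URL are hypothetical, and the CLI call is wrapped in Python only to match the other sketches):

```python
import subprocess

# Hypothetical inputs: an external CSV and the jsonschema published on CKAN.
CSV_FILE = "external-data.csv"
SCHEMA_URL = "https://demo.ckan.org/dataset/d1/resource/r1/download/schema.json"

# qsv validate exits non-zero when the file does not conform to the schema.
result = subprocess.run(
    ["qsv", "validate", CSV_FILE, SCHEMA_URL],
    capture_output=True, text=True,
)
print("valid!" if result.returncode == 0 else f"invalid:\n{result.stderr}")
```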