Metadata resources: A special resource type for storing metadata #7856
-
Thanks again for chatting today @jqnatividad. I'm adding some questions below; let me know if you think they belong elsewhere.
Our datastore_profiler (currently an MVP, not yet a deployable extension) stores metadata in the CKAN database using
Are there negatives to datapusher+ storing metadata resources in the same way?
On the topic of datastore_profiler: while there's lots of overlap, there are things that (at least currently) don't:
IYO, would it be best to keep these extensions separate in the short term? We could then combine (parts of) them later, after we see what kinds of successes we have with them.
-
It would be great to settle on a standard for storing summary statistics/resource metadata/column profiles in CKAN. There are a few ways we discussed doing this:
Each has advantages, but the data dictionary/fields info approach is the one that automatically exposes the information to
If we decide to go this way and choose a common set of JSON keys for column metadata/profiles/statistics, they could be generated at CSV load time by a tool like datapusher+, on demand by one like datastore_profiler, or however a user likes with their own custom extension/tool.
Can we also settle on a name for the thing we're collecting? Of the suggestions:
"statistics" or "stats", like the qsv command, is a good fit. Quarantined data feels like a separate feature that could be implemented as a separate resource, since the amount of data would be much larger. Frequency tables and detected controlled lists could become unwieldy as JSON data in the data dictionary/fields info. I'm not sure this is a strong enough argument to store them as auxiliary tables or resources, because those options come with a management cost.
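To make the data dictionary/fields info idea concrete, here is a minimal sketch of nesting per-column statistics under each field's info dict. It assumes field dicts shaped like the `id`/`type`/`info` entries the DataStore uses; the `stats` key, the statistic names, and the `attach_stats` helper are hypothetical placeholders, not an agreed standard.

```python
def attach_stats(fields, stats_by_column, key="stats"):
    """Return a copy of `fields` with statistics nested under each field's info dict.

    `key` is a hypothetical name for the agreed-upon JSON key.
    """
    enriched = []
    for field in fields:
        info = dict(field.get("info", {}))  # copy so the caller's dicts stay untouched
        if field["id"] in stats_by_column:
            info[key] = stats_by_column[field["id"]]
        enriched.append({**field, "info": info})
    return enriched


# Illustrative data only: the column names and statistics are made up.
fields = [
    {"id": "age", "type": "int4", "info": {"label": "Age"}},
    {"id": "name", "type": "text"},
]
stats_by_column = {"age": {"min": 0, "max": 99, "nullcount": 3}}
enriched = attach_stats(fields, stats_by_column)
```

Because the statistics live beside the existing label and notes, any tool could write them (at load time or on demand) as long as it agrees on the keys.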
-
Hey @jqnatividad 👋 I had a half-baked thought that overlaps some with
CKAN has vocabularies and tags. I'm considering developing an extension that allows tags to be added to a datastore attribute (instead of, or in addition to, a resource/package). Nothing too complicated.
From there (and here's the maybe-overlapping-with- part), I'm considering adding (likely as a supplementary resource in a package) unique values and counts for each column in a datastore resource (like from a
I could see that being useful for a team like mine that wants to answer "what kind of data is in our catalog?" at a slightly more granular level.
I wanted to hear your thoughts on this, and whether it has too much overlap with what
Thanks in advance :D and let me know if this belongs somewhere outside of this discussion thread 😅
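For reference, the "unique values and counts for each column" idea can be sketched in a few lines of plain Python. This is only an illustration of the output shape, not how any existing extension does it; the `frequency_table` name, the `field`/`value`/`count` keys, and the default limit are assumptions.

```python
import csv
import io
from collections import Counter


def frequency_table(csv_text, limit=50):
    """Return the `limit` most common values of each column as flat rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counters = {name: Counter() for name in reader.fieldnames or []}
    for row in reader:
        for name, counter in counters.items():
            counter[row.get(name) or ""] += 1  # empty string stands in for nulls
    return [
        {"field": name, "value": value, "count": count}
        for name, counter in counters.items()
        for value, count in counter.most_common(limit)
    ]


# Made-up sample data.
sample = "color,size\nred,S\nblue,S\nred,M\n"
rows = frequency_table(sample)
```

The flat row layout means the result could be stored as an ordinary table, one row per column/value pair.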
-
Datapusher+ (DP+) does guaranteed type inferencing with qsv.
It can guarantee its type inferences because it scans the entire file, unlike messytables in Datapusher, which only samples the beginning of a file. Messytables works for the most part, but it is not reliable: a column value outside the sample can invalidate the inferred type and abort the Datapusher job when that value is actually inserted into the Datastore, which can mean a failure at the very end of a long job. Datapusher is also slow because it inserts data via the Datastore API.
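A small sketch of why sampling is fragile, assuming a toy inference function with only three types. It is not how messytables or qsv are implemented; `infer_type` and the sample size of 100 are made up for illustration.

```python
def infer_type(values):
    """Return the narrowest of 'integer', 'float', 'text' that fits every value."""
    order = ["integer", "float", "text"]
    level = 0
    for value in values:
        if value == "":
            continue  # treat empty cells as nulls
        while level < 2:
            try:
                (int if level == 0 else float)(value)
                break  # the current candidate type still fits
            except ValueError:
                level += 1  # widen to the next candidate type
    return order[level]


# 1000 clean integers followed by a single stray value at the very end.
column = [str(i) for i in range(1000)] + ["N/A"]
sampled = infer_type(column[:100])  # what a sample-based approach would see
full = infer_type(column)           # what a full scan would see
```

Here the sample suggests an integer column, while the full scan correctly falls back to text, which is exactly the mismatch that would surface only at insert time.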
XLoader does no type inferencing at all and just loads everything as text. It is much faster, though, because it loads the data directly via PostgreSQL COPY instead of the API.
DP+ combines the type inferencing of Datapusher with the loading speed of XLoader.
Even for a 1M-row, 41-column, 520 MB file, well above the typical file uploaded to CKAN, it can scan the entire file in 2 seconds with the qsv stats command (see https://qsv.dathere.com/benchmarks).
It's fast because qsv is written in Rust, is multi-threaded, and the file is partitioned by the number of logical CPU cores available, with each core opening a separate file handle to work on a partition.
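The partitioning idea can be illustrated with a rough Python sketch. This is not qsv's code: the function names are made up, it assumes no quoted newlines inside fields, and it uses threads only to show the "one file handle per partition" structure (real CPU-bound speedups in Python would need processes).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition_offsets(path, parts):
    """Split a file into byte ranges that each start and end on a line boundary."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            f.seek(size * i // parts)
            f.readline()  # move forward to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def count_lines(path, start, end):
    """Count the lines in [start, end) using this worker's own file handle."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")


def parallel_line_count(path, parts=None):
    """Count lines by giving each worker one partition of the file."""
    parts = parts or os.cpu_count() or 1
    ranges = partition_offsets(path, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(lambda r: count_lines(path, *r), ranges))
```

Counting lines stands in for the real per-partition work; the point is that partitions are independent, so partial results only need to be merged at the end.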
Apart from doing guaranteed type inferences, qsv stats also compiles summary statistics, as the name implies.
Currently, DP+ can be configured to save these summary statistics as a supporting resource with the "-stats" suffix.
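As a rough picture of what such a stats table holds, here is a minimal sketch that computes a few per-column statistics in one pass. The statistic names and row layout are illustrative and not claimed to match qsv's output, and `stats_resource_name` assumes the suffix is simply appended to the resource name.

```python
import csv
import io


def column_stats(csv_text):
    """Compute simple per-column summary statistics in a single pass."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    acc = {n: {"nullcount": 0, "count": 0, "sum": 0.0, "min": None, "max": None}
           for n in names}
    for row in reader:
        for n in names:
            raw, a = row.get(n), acc[n]
            if raw is None or raw == "":
                a["nullcount"] += 1
                continue
            try:
                value = float(raw)
            except ValueError:
                continue  # non-numeric values only contribute to nullcount here
            a["count"] += 1
            a["sum"] += value
            a["min"] = value if a["min"] is None else min(a["min"], value)
            a["max"] = value if a["max"] is None else max(a["max"], value)
    return [
        {"field": n, "nullcount": a["nullcount"], "min": a["min"], "max": a["max"],
         "mean": a["sum"] / a["count"] if a["count"] else None}
        for n, a in acc.items()
    ]


def stats_resource_name(resource_name):
    """Hypothetical naming helper: append the "-stats" suffix."""
    return f"{resource_name}-stats"


# Made-up sample data.
stats = column_stats("a,b\n1,x\n3,\n,y\n")
```

Each output row describes one column, so the result is itself tabular and can be loaded like any other table.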
Other optional supporting "metadata resources" that can be created in the Datastore include:
qsv frequency command (compiling a frequency table for the same file for the top 50 values of each column takes 1 second)
This works fairly well, but it has some issues:
I propose we create a new class of Datastore-backed resource types called "metadata resources".
They will still be stored as tables in the Datastore but will not be displayed in the Resource list. Instead, they will appear, as appropriate, in a separate "metadata resources" tab and/or complement the existing Data Dictionary.
With "metadata resources", we go beyond the simple key-value metadata that we currently have, allowing us to: