Metadata resources: A special resource type for storing metadata #7856

jqnatividad · 2023-10-10T12:45:20Z

jqnatividad
Oct 10, 2023

Datapusher+ (DP+) does guaranteed type inferencing with qsv.

It can guarantee its type inferences because it scans the entire file, unlike messytables in Datapusher, which only samples the beginning of a file. Messytables works for the most part but is not reliable: a column value outside the sample can invalidate the inferred type, aborting the Datapusher job when that value is actually inserted into the Datastore. This can mean a failure at the very end of a long-running job. Datapusher is also slow, as it inserts data via the Datastore API.
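A minimal sketch of why sampling is fragile, using only the Python standard library. It is not how messytables or qsv are implemented; the `infer_type` helper, the three-row sample size and the tiny CSV are all made up for illustration.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of Integer, Float or String that fits every value."""
    inferred = "Integer"
    for value in values:
        if value == "":
            continue  # treat empty cells as NULLs, which fit any type
        try:
            int(value)
            continue
        except ValueError:
            pass
        try:
            float(value)
            inferred = "Float"
        except ValueError:
            return "String"  # one non-numeric value settles the column
    return inferred


# Hypothetical file: the offending value only shows up in the last row.
CSV_TEXT = "id,amount\n1,10\n2,20\n3,30\n4,N/A\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
amounts = [row["amount"] for row in rows]

sampled = infer_type(amounts[:3])  # sample-based: looks only at the first 3 rows
full = infer_type(amounts)         # full scan: looks at every row

print(sampled, full)
```

The sampled inference picks an integer column, and the load would then fail on the last row; the full scan settles on text before anything is inserted.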

XLoader does no type inferencing at all and just loads everything as text. It is, however, much faster, as it loads the data directly via PostgreSQL COPY instead of the API.

DP+ combines the type inferencing of Datapusher and the loading speed of XLoader.

Even for a 1M-row, 41-column, 520 MB file (well above the typical file uploaded to CKAN), it can scan the entire file in 2 seconds with the qsv stats command (see https://qsv.dathere.com/benchmarks).

It's fast because qsv is written in Rust, is multi-threaded, and the file is partitioned by the number of logical CPU cores available, with each core opening a separate file handle to work on a partition.
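The split-and-merge shape of that approach can be sketched in Python. This is only an illustration of the idea, not qsv's Rust implementation: it partitions an in-memory list instead of giving each worker its own file handle, and Python threads will not deliver real CPU parallelism here. The helper names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split values into at most n_parts contiguous, non-empty chunks."""
    size = -(-len(values) // n_parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Compute statistics for one partition that can be merged later."""
    return {"count": len(chunk), "sum": sum(chunk), "min": min(chunk), "max": max(chunk)}


def merge_stats(parts):
    """Combine per-partition statistics into statistics for the whole column."""
    return {
        "count": sum(p["count"] for p in parts),
        "sum": sum(p["sum"] for p in parts),
        "min": min(p["min"] for p in parts),
        "max": max(p["max"] for p in parts),
    }


values = list(range(1, 1001))
n_workers = os.cpu_count() or 1  # one partition per logical core
chunks = partition(values, n_workers)

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parts = list(pool.map(partial_stats, chunks))

stats = merge_stats(parts)
stats["mean"] = stats["sum"] / stats["count"]
print(stats)
```

Count, sum, min and max merge trivially across partitions, which is what makes this kind of work easy to spread over cores.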

Apart from doing guaranteed type inferences, qsv stats also compiles summary statistics as the name implies.

Currently, DP+ can be configured to save these summary statistics as a supporting resource with the "-stats" suffix.

Other optional supporting "metadata resources" that can be created in the Datastore include:

a "frequency" ("-freq" suffix) resource showing the top N values of a column compiled with the qsv frequency command (compiling a frequency table for the same file for the top 50 values of each column takes 1 second)
a "quarantined" ("-quarantined") resource with PII-candidates (takes about 2 seconds to quarantine rows with Bank Account Number, telephone numbers, email, credit card account numbers and social security number)
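To make the "frequency" resource concrete, here is a rough stand-in written with `collections.Counter`. It is not the qsv frequency command; the `frequency_table` helper, the sample data and the field/value/count layout are assumptions for illustration.

```python
import csv
import io
from collections import Counter


def frequency_table(csv_text, limit):
    """Return the top `limit` values of every column as field/value/count rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counters = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            counters[name][value] += 1
    table = []
    for name, counter in counters.items():
        for value, count in counter.most_common(limit):
            table.append({"field": name, "value": value, "count": count})
    return table


# Hypothetical sample data.
SAMPLE = "borough,status\nQueens,open\nBronx,closed\nQueens,open\nBrooklyn,open\n"
freq = frequency_table(SAMPLE, limit=2)

for entry in freq:
    print(entry)
```

The result is itself a small table, which is why it can be stored and queried like any other Datastore resource.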

This works fairly well but it has some issues:

it blows up the resource list with each metadata resource
the data publisher has to manually manage and delete the supporting metadata resources

I propose we create a new class of Datastore-backed resource types called "metadata resources".

They will still be stored as tables in the Datastore but will not be displayed in the Resource list. Instead, they will appear, as appropriate, in a separate "metadata resources" tab and/or complement the existing Data Dictionary.

With "metadata resources", we go beyond the simple key-value metadata that we currently have, allowing us to:

precalculate metadata by leveraging summary statistics (e.g. precalc spatial extent with min/max of lat/long fields, precalc date range of available observations, etc.)
use them to power more intelligent scheming forms (e.g. as we know the domain/range of each column, we can have the different field controls constrained appropriately)
do proxied, global, two-stage record-level search. If we add the summary statistics and frequency table of each resource to the index, which are relatively small, we can get a search hit on a global catalog search, which is good enough to populate the initial search result. The user clicking on a proxied hit, can then trigger a datastore_search/search_sql on the resource and do a record-level search.
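As a sketch of the first point, spatial extent and date range can be read straight off per-column min/max values. The `derive_extent` helper, the column names and the values below are hypothetical; the rows only assume a stats table with `field`, `type`, `min` and `max` columns.

```python
def derive_extent(stats_rows, lat_field, lon_field, date_field):
    """Build a bounding box and date range from per-column min/max statistics."""
    by_field = {row["field"]: row for row in stats_rows}
    lat, lon, when = by_field[lat_field], by_field[lon_field], by_field[date_field]
    return {
        # west, south, east, north
        "bbox": [float(lon["min"]), float(lat["min"]), float(lon["max"]), float(lat["max"])],
        "temporal_start": when["min"],
        "temporal_end": when["max"],
    }


# Hypothetical stats rows for one resource.
stats_rows = [
    {"field": "latitude", "type": "Float", "min": "42.23", "max": "42.39"},
    {"field": "longitude", "type": "Float", "min": "-71.18", "max": "-70.99"},
    {"field": "open_dt", "type": "DateTime",
     "min": "2022-01-01T00:16:00+00:00", "max": "2022-01-31T11:46:00+00:00"},
]

extent = derive_extent(stats_rows, "latitude", "longitude", "open_dt")
print(extent)
```

No pass over the raw data is needed; the extent comes entirely from the already-compiled statistics.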

mnichol4 · 2023-10-12T19:45:46Z

mnichol4
Oct 12, 2023

Thanks again for chatting today @jqnatividad. I'm adding some questions below - let me know if you think they belong elsewhere

Our (currently an MVP - not yet a deployable extension) datastore_profiler currently stores metadata in the CKAN database using datastore_create's fields object.
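One plausible shape for such a payload, sketched without any network call. `datastore_create` accepts a `fields` list whose entries have `id`, `type` and an optional `info` dict; the `stats` key inside `info`, the helper name and the sample values are assumptions, not necessarily what datastore_profiler does.

```python
import json


def build_datastore_create_payload(resource_id, columns):
    """Attach per-column statistics to the `info` dict of each field."""
    fields = []
    for column in columns:
        fields.append({
            "id": column["name"],
            "type": column["type"],
            "info": {
                "label": column.get("label", ""),
                "stats": column["stats"],  # hypothetical key, not a CKAN standard
            },
        })
    return {"resource_id": resource_id, "fields": fields}


# Hypothetical resource id and column profile.
payload = build_datastore_create_payload(
    "my-resource-id",
    [{"name": "age", "type": "int", "label": "Age", "stats": {"min": 1, "max": 90}}],
)

print(json.dumps(payload, indent=2))
```

The payload would then be sent to the `datastore_create` action by whatever client the extension already uses.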

Are there negatives to datapusher+ storing metadata resources in the same way?

On the topic of datastore_profiler; while there's lots of overlap, there are things that (at least currently) don't:

datastore_profiler makes a CKAN preview for a datastore resource using created summaries
datastore_profiler is aiming to allow users to associate CKAN tags to a datastore column
datapusher+ reads from an uploaded file - datastore_profiler reads from a datastore resource
datastore_profiler is a CKAN extension

IYO, would it be best, short-term, to keep these extensions separate? Then we could combine (parts of) them later after we see what kinds of successes we have with them?

3 replies

jqnatividad Oct 13, 2023
Author

Hi @mnichol4 ,

And thanks for reaching out as I think we're going towards the same goal of creating "Living" Data Catalogs with "automagical metadata!" 😄

To your queries:

Are there negatives to DP+ storing metadata resources in the same way?

When I first started writing DP+, I was thinking of storing them as part of Data Dictionary JSON as the stats are small enough and makes sense as extended metadata for each column.

dathere/datapusher-plus#18

I ended up storing them as Datastore Tables though as:

it was easier
I thought data users would like to consume that information as a tabular resource
Advanced CKAN Datastore API users would be able to run advanced "metadata resource" queries using the existing Datastore API datastore_search_sql, what I've been calling "FUSION QUERIES": universally applicable SQL queries that allow you to fuse/join seemingly unrelated datasets based on spatio-temporal fields (e.g. show me all resources that have timestamp columns with observations for the last two years, and WGS84 coordinate fields for this specific region). That is not possible if we store the stats in the Data Dictionary JSON, as the Data Dictionary is stored as a table comment, which is text, so we can't leverage PostgreSQL's JSON capabilities to run queries against it.
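A rough idea of what such a fusion query could look like, run here against an in-memory SQLite table so the example is self-contained. In CKAN the SQL would go through datastore_search_sql against PostgreSQL instead. The combined `stats` table, its columns, the `latitude`/`longitude` field names and all the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical catalog-wide stats table: one row per resource column.
conn.execute(
    'CREATE TABLE stats (resource_id TEXT, field TEXT, type TEXT, "min" TEXT, "max" TEXT)'
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
    [
        ("res-a", "open_dt", "DateTime", "2021-01-01T00:00:00+00:00", "2023-09-30T00:00:00+00:00"),
        ("res-a", "latitude", "Float", "42.23", "42.39"),
        ("res-a", "longitude", "Float", "-71.18", "-70.99"),
        ("res-b", "created", "DateTime", "2018-01-01T00:00:00+00:00", "2019-12-31T00:00:00+00:00"),
        ("res-b", "latitude", "Float", "42.30", "42.35"),
        ("res-b", "longitude", "Float", "-71.10", "-71.00"),
        ("res-c", "created", "DateTime", "2022-01-01T00:00:00+00:00", "2023-08-01T00:00:00+00:00"),
        ("res-c", "latitude", "Float", "40.50", "40.90"),
        ("res-c", "longitude", "Float", "-74.20", "-73.70"),
    ],
)

# Resources with a timestamp column reaching into the window AND a
# coordinate range that overlaps the requested bounding box.
FUSION_SQL = """
SELECT d.resource_id
FROM stats AS d
JOIN stats AS lat ON lat.resource_id = d.resource_id AND lat.field = 'latitude'
JOIN stats AS lon ON lon.resource_id = d.resource_id AND lon.field = 'longitude'
WHERE d.type = 'DateTime'
  AND d."max" >= :since
  AND CAST(lat."max" AS REAL) >= :south AND CAST(lat."min" AS REAL) <= :north
  AND CAST(lon."max" AS REAL) >= :west AND CAST(lon."min" AS REAL) <= :east
GROUP BY d.resource_id
ORDER BY d.resource_id
"""

params = {
    "since": "2021-10-13T00:00:00+00:00",
    "south": 42.0, "north": 42.5, "west": -71.5, "east": -70.5,
}
matches = [row[0] for row in conn.execute(FUSION_SQL, params)]
print(matches)
```

Only the stats rows are touched, so the query stays cheap no matter how large the underlying resources are. The timestamp comparison works on text here because every value uses the same normalized format.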

On the topic of datastore_profiler; while there's lots of overlap, there are things that (at least currently) don't:

datastore_profiler makes a CKAN preview for a datastore resource using created summaries

We display the stats as a resource, so arguably, we also have a preview, but we also have on our roadmap some "metadata visualizations" that allow users to view the characteristics of the dataset resource at a glance (box plot showing distribution, skewness, error bars, outliers, etc.)

datastore_profiler is aiming to allow users to associate CKAN tags to a datastore column

We're doing initial experiments on this front as well - we have had some success with qsv's describegpt command to create an expanded data dictionary which we can mine for auto-tagging/classification.

Ultimately, we hope to train an LLM on the catalog's metadata and metadata resources so users can also query the Catalog using a Natural Language Interface (or what I've been calling the "Answering People Interface") 😄.

This is another motivation to store the stats/frequency tables as tabular data as it's easier to use for training.
An added bonus of training on the extended metadata is that the metadata volume is minuscule in comparison to the raw data; you eliminate/minimize the possibility of hallucinations as the LLM just generates normal SQL to answer "fusion queries" - you don't need to rely on a model's "inscrutable matrices" to figure out why it answered your query in a certain way - it's fully reproducible, decipherable SQL that the model generated based on your Data Dictionary, stats and frequency tables; and the model does not need to train on the raw data which may include sensitive information, you just train it on the catalog's metadata.

Having the frequency tables stored as system resources will also allow the model to infer additional metadata based on the domain/range/top N values of the columns (e.g. when we ran describegpt on NYC's 311 data - the borough column only has these five values - Manhattan, Queens, Staten Island, Brooklyn, Bronx - it then created a narrative in the Data Dictionary Description on its own that these are the five boroughs of New York City and, on some runs, even said these boroughs correspond to US counties, which is correct but is not in the stats/frequency data we fed the model)

datapusher+ reads from an uploaded file - datastore_profiler reads from a datastore resource

That is correct. But as I mentioned yesterday, we've been thinking of creating an affiliated background job for "grooming/gardening" the catalog - "DataGroomers" (or maybe "Data Gardeners" in line with the "Living" Data Catalog and to stay away from the baggage the term "grooming" has acquired of late) 😉

dathere/datapusher-plus#13

datastore_profiler is a CKAN extension

And so will DP+ v1.0 that @tino097 has been working on...

Apart from being an extension and getting rid of the ckanserviceprovider dependency, we hope to use DP+ v1.x as:

a better way to manage the Datastore that is also available to the editor and org-admin, not just the sys-admin
a way to configure per resource DP+ jobs (e.g. for some resources, you may not want the stats, frequency table or to dedup, sort and screen for PII)
a way to manage jobs on the DP+ queue
a way to manage db-level knobs/switches for the resource (like how DP+ uses a heuristic to auto-index datastore resources for faster datastore_search_sql queries, some folks may want better control of these PostgreSQL indices and see how much storage they use (perhaps, by creating a new resource_info API ala datastore_info). Sometimes, they may also want to drop the FTS index as well, which tends to blow up the footprint of a dataset in the CKAN datastore).
a way to manage the preview sample that's displayed for the resource (as I shared, the implementation we're working on only wants to have a 1,000-row preview of each dataset they catalog. However, they still want to be the canonical source of truth/metadata about all the state's data resources WITHOUT having to store the raw data in the CKAN filestore, never mind the CKAN datastore. The user still needs to get the raw data from the source URL and we just link to it. Having the stats/frequencies to characterize the entire dataset allows them to do so without having to take on the onus of storage/bandwidth of storing all the data as well)

cc @wardi @samibaig @twdbben

jqnatividad Oct 13, 2023
Author

And to your last question @mnichol4:

IYO, would it be best, short-term, to keep these extensions separate? Then we could combine (parts of) them later after we see what kinds of successes we have with them?

Ever since we got into this space, the way we do product development is by working on a concrete, real world use case - typically from one of our clients; iterating quickly on it, and open-sourcing it as much as possible as I described here.

A lot of commercial folks often recoil 😱 when we say "open source" as they need to protect their "IP."

But what we've found is that by doing so, we get to stand on the shoulders of other users; we get real-world, valuable feedback from them, and we often get contributions back in the way of tests, sample data, additional edge cases, pull requests, and not too infrequently, projects. We also don't end up creating a fork that we now have to maintain all by ourselves, without the benefit of being able to share the load with the community.

At the same time, another thing I love about "open-source" first development is its "permissionless innovation" - I don't need permission or a license from anybody to start experimenting/innovating.

You're "scratching your itch" in the best way that you see fit as "we scratch ours".

But I think the overlap is there for us to open a channel to start synching our efforts (this is actually in the spirit of what we're trying to achieve with the NSF-funded program to scale up the CKAN ecosystem - https://civicdataecosystem.org) and at a logical point in the not too distant future, do "synchronized back-scratching" when it makes sense.

generated with DALL-E - "a stylized image of male synchronized swimmers holding back scratchers" 😆

mnichol4 Oct 13, 2023

Wow thanks a ton @jqnatividad 😁 I appreciate this.

Well then, at least, we can talk here to agree on some of the points below (naming, storage location of summary stats, common keys, etc). From there, I'll noodle on whether my team creates something separate but similar or not.

wardi · 2023-10-13T03:38:32Z

wardi
Oct 13, 2023
Maintainer

Would be great to settle on a standard for storing summary statistics/resource metadata/column profiles in ckan. There are a few ways we discussed doing this:

separate resources (original discussion suggestion)
plugin extras (at dataset level or introduce new plugin extras at resource level)
data dictionary/fields info (like datastore_profiler)
auxiliary datastore tables

Each has advantages, but the data dictionary/fields info approach is the one that automatically exposes the information to datastore_search users and associates statistics directly with a column.

If we decide to go this way and choose a common set of json keys for column metadata/profiles/statistics then they could be generated at csv load time by a tool like datapusher+, on demand by one like datastore_profiler or however a user likes with their own custom extension/tool.

Can we also settle on a name for the thing we're collecting? Of the suggestions:

resource metadata
summary statistics
datastore profiles

"statistics" or "stats" like the qsv command is a good fit.

Quarantined data feels like a separate feature that could be implemented as a separate resource since the amount of data would be much larger.

Frequency tables and detected controlled lists could become unwieldy as json data in the data dictionary/fields info. Not sure this is a strong enough argument to store them as auxiliary tables or resources because those options come with a management cost.

4 replies

mnichol4 Oct 13, 2023

Re Naming:
"stats" to match qsv sounds smart 👍

Re Stats Storage Method:
If stats are stored in plugin extras or other datastore tables, how could we expose the stats to users?

Re Common JSON Keys:
I'm not sure if there's appetite for this here; would we want each key to be a different stat (min, max, frequency, etc)?

If so, do we want stats for different data types to be different?
Ex:

numeric stats has min, max, mean etc
text stats has frequency, patterns, etc
date stats has min_year, max_month, etc

wardi Oct 13, 2023
Maintainer

for dates and datetimes min and max with proper ISO8601 values would be easier to work with than separate year/month/day/.../timezone values

mnichol4 Oct 13, 2023

Agreed re dates and datetimes.

Related:
What are the benefits/costs of storing text frequencies in the data dictionary vs in auxiliary tables?

jqnatividad Oct 13, 2023
Author

re Naming:

"stats" is universally understood, accessible, and a good label. However, we should explain in detail what these stats are, namely "summary statistics", and that, as far as the main job people hire CKAN to do is concerned, they are an expanded, more detailed form of metadata. That addresses the "why" behind compiling these stats.

re Stats storage:

I favor storing them as "auxiliary" datastore resources that are automatically maintained/managed through the Datastore API. So we can run "fusion queries" on them using the existing datastore_search_sql API. Also, users should be able to download them as discrete resources themselves.

re Common keys:

Though I do see the value of storing stats in the Data Dictionary JSON as well, as it's relatively small and nicely complements the existing Data Dictionary. Given the relative "smallness" of the stats, it may not be bad (though not as elegant) to store them as both.
To minimize them getting out-of-sync, perhaps the stats can be pushed as an auxiliary resource first, which then fires a trigger to update the Data Dictionary JSON?

I agree there should be common keys. To your point @mnichol4 that the keys should be context/type-sensitive, for the most part, we can still use the same vocabulary across data types and just don't add them or set them to null when they don't apply.
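A small sketch of that "same vocabulary, null when not applicable" idea. The key names in `COMMON_KEYS` and the sample values are placeholders, not an agreed standard.

```python
import json

# Hypothetical shared vocabulary; the real key set is still to be agreed on.
COMMON_KEYS = ("type", "min", "max", "mean", "stddev", "cardinality")


def normalize_stats(raw):
    """Return a dict with every common key, using None where a stat does not apply."""
    return {key: raw.get(key) for key in COMMON_KEYS}


numeric = normalize_stats(
    {"type": "Integer", "min": 1, "max": 90, "mean": 37.5, "stddev": 12.1, "cardinality": 60}
)
text = normalize_stats(
    {"type": "String", "min": "Bronx", "max": "Staten Island", "cardinality": 5}
)

# None serializes to JSON null, so consumers see the same keys for every column.
print(json.dumps(text))
```

Consumers can then rely on one schema for all columns and simply skip null entries.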

BTW, when date inferencing is enabled, qsv converts the dates/datetimes to unix timestamps, so we can compute stats on them you wouldn't normally be able to (e.g. stddev, range, mean, quartiles, inter-quartile range, etc.). For example, running

qsv stats boston311-100.csv --everything --infer-dates results in:

https://github.com/jqnatividad/qsv/blob/master/resources/test/boston311-100-everything-date-stats-variance-stddev.csv
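The timestamp trick can be sketched in a few lines of Python. This is not qsv's implementation; the sample dates are made up, UTC is assumed, and using the population standard deviation is an arbitrary choice for the example.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev

SECONDS_PER_DAY = 86400

# Hypothetical date column.
dates = ["2022-01-01", "2022-01-03", "2022-01-05"]

# Convert each date to a Unix timestamp so ordinary numeric stats apply.
stamps = [
    datetime.strptime(d, "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp()
    for d in dates
]

mean_date = datetime.fromtimestamp(mean(stamps), tz=timezone.utc)
range_days = (max(stamps) - min(stamps)) / SECONDS_PER_DAY
stddev_days = pstdev(stamps) / SECONDS_PER_DAY

print(mean_date.isoformat(), range_days, round(stddev_days, 3))
```

Once the values are numbers, mean, range, standard deviation and quartiles all fall out of the usual formulas, and the results can be converted back to dates or durations for display.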

For text stats, I quite like your idea of compiling patterns for them. Right now, qsv schema can be run to derive a regular expression and use it to create a JSON Schema regex validator that you can use in the qsv validate command. We should be able to reuse that code to add the derived regex to the data dictionary.

re date/datetimes:

DP+ and qsv normalize dates to RFC3339 format from any recognized date format (qsv recognizes 19 date formats, with each format having several variants)
https://github.com/jqnatividad/belt/tree/main/dateparser#accepted-date-formats

This is done as DP+ uses PostgreSQL copy to bulk copy at native speeds direct into the database and it allows users to do date-arithmetic and date/datetime queries properly with SQL (again, because we want "fusion queries" to join seemingly unrelated datasets).

The downside of this is when you view the data in the datastore, dates will often be displayed in a different format than the original format.
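A simplified sketch of that normalization. The handful of formats in `CANDIDATE_FORMATS` is an arbitrary subset chosen for illustration, not qsv's actual list, and treating values without a timezone as UTC is an assumption.

```python
from datetime import datetime, timezone

# Arbitrary subset of formats, tried in order.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%m/%d/%Y", "%Y/%m/%d")


def to_rfc3339(text):
    """Return the value as an RFC 3339 string, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assumption: values without a timezone are treated as UTC.
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    return None


print(to_rfc3339("01/31/2022 11:46"))
print(to_rfc3339("2022-01-31 11:46:00"))
print(to_rfc3339("not a date"))
```

The first two inputs come out as the same string even though they were written differently, which is the display difference described above: the stored value no longer looks like the original.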

There is some pending work right now to return the inferred date format as well with the qsv stats command when date inferencing is enabled.
jqnatividad/qsv#1197

mackeynichols · 2024-04-17T19:49:49Z

mackeynichols
Apr 17, 2024

Hey @jqnatividad 👋
(same user as mnichol4 above - just a different account)
Digging this up after your CKAN Monthly Live presentation a couple weeks back got my wheels turning again.

I had a half-baked thought that overlaps some with qsv and datapusher+'s implementation of the idea of ✨automagical metadata✨

CKAN has vocabularies and tags. I'm considering developing an extension that allows tags to be added to a datastore attribute (instead of / in addition to onto a resource/package). Nothing too complicated.

From there (and here's the maybe-overlapping-with- part), I'm considering adding (likely as a supplementary resource in a package) unique values and counts for each column in a datastore resource (like from a SELECT DISTINCT, COUNT SQL query). I'd use those to enhance a team's ability to manually or automatically assign tags to a datastore attribute.

I could see that being useful for a team like mine that wants to know the answer of "what kind of data is in our catalog?" at a slightly more granular level

I wanted to hear your thoughts on this, and whether it has too much overlap with what qsv and datapusher+ are already doing

Thanks in advance :D and let me know if this belongs somewhere outside of this discussion thread 😅

1 reply

jqnatividad Apr 19, 2024
Author

Hi @mackeynichols 👋 ,

Automagically tagging a dataset is something we dabbled with in DP+. We've done some experiments on that front with qsv's describegpt command and the results have been promising.

describegpt leverages qsv's stats and frequency commands to compile summary statistics along with a frequency table detailing the frequency counts of each value in a column.

We then send these to OpenAI to infer an extended data dictionary and suggest tags.

The good thing is that even for very large datasets, these summary statistics are relatively "tiny" and are well within the context window of OpenAI's various models, so the incremental cost of inferring these "extrametadata" is relatively low. And since the LLMs are constantly improving and are general-purpose - the results have been improving markedly over time.

The main reason why we haven't integrated it with DP+ is that, as you know, with current LLMs you need to interact with the model to get your desired results, and DP+'s background processing model does not lend itself to this.

That's why we're integrating it with qsv pro, which I briefly mentioned during the CKAN Monthly Live but ran out of time to demo. Think of qsv pro as a Desktop CKAN client that combines the familiar interface of Excel and the data-wrangling power of qsv to make for an interactive DP+.

The target audience for it is Business analysts and Data Stewards (who are the "data owners" and domain experts of the datasets they're managing, but may not necessarily be data-wrangling devs or comfortable with a CLI tool like qsv) who need to groom data interactively, compute viz-ready "aggregation resources", and automagically infer and enrich metadata based on a dataset's attributes (e.g. compute spatial extent; date range of time-series datasets; geocode additional columns, etc.) using customizable recipes (the recipes can be written in either Luau or Python).

I'd be happy to demo it to you if you want to see it in action and would be interested in your feedback.

In the meantime, you may want to experiment with qsv describegpt to see what kind of tags it returns. Maybe, we can even fine-tune describegpt to use a tag vocabulary to constrain the tag suggestions.

FYI, I've also been looking at IDataDictionaryForm contributed by @wardi with keen interest (#7971).

We're currently working on a project where we will take advantage of it to store the summary/frequency stats as part of the Data Dictionary rather than as a separate resource as what DP+ currently does.

Finally, we're also experimenting with a qsv-enhanced Datastore. The problem with both DP+ and XLoader is that we're still pushing/loading data into PostgreSQL with a very large index overhead. In installations we're involved in, it's the largest line item in our AWS bill.

So instead of pushing/loading the tabular data into Postgres, we're thinking of just intercepting the datastore_search and datastore_search_sql routes for now and using qsv's search/searchset[^1] and Polars-powered sqlp commands, respectively, to search and query CSVs in place in the Filestore. No need to push/load data into the Datastore, no big AWS bills, and best of all, it's blazing fast! (cc @EricSoroos on Apr 18 2024 CKAN tech team discussion on the topic of alternate datastore backends)
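To give a feel for the idea, here is a toy stand-in for an in-place search, written with Python's `csv` and `re` modules rather than qsv. The `search_csv` helper, the sample data and the loosely datastore_search-like `fields`/`records` result shape are all assumptions for illustration.

```python
import csv
import io
import re


def search_csv(handle, pattern, limit=100):
    """Scan a CSV stream and return rows where any cell matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    reader = csv.DictReader(handle)
    records = []
    for row in reader:
        # Only string cells are searched; short rows can yield None values.
        if any(isinstance(value, str) and regex.search(value) for value in row.values()):
            records.append(row)
            if len(records) >= limit:
                break
    return {"fields": reader.fieldnames, "records": records}


# Hypothetical file contents; in practice this would be a file in the Filestore.
SAMPLE = "case_id,borough,complaint\n1,Queens,Noise\n2,Bronx,Pothole\n3,Queens,Graffiti\n"
result = search_csv(io.StringIO(SAMPLE), r"queens")

print(result["records"])
```

Nothing is loaded into a database; the file is streamed and filtered where it sits, which is the property that removes the index overhead.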

cc @rzmk @samibaig

[^1] : Using the same regex engine used by ripgrep which in turn, powers Visual Studio Code's magical "Find in Files" function.
