Discussion of Catalogs re Data Packages #37
Comments
TODO: review again the US gov data.json spec ...
Laid out 3 options. Inclined toward option 3, with the hash form inside datapackages. Note this is similar to https://github.com/datasets/registry/blob/4e19a9d2fb58e885f0622bb0ea1b7f3eaa78d334/datapackage-index.json Questions:
Based on experience with the registry I'd like to suggest something even simpler. The basic thing people have is a "Catalog List" in a file: a list of URLs (file or http etc.) to Data Packages, with one data package per line. Questions:
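To illustrate the "Catalog List" idea, here is a minimal sketch (the function name and file layout are hypothetical, not part of any spec) that reads such a list into a list of package URLs:

```python
def read_catalog_list(path):
    """Read a hypothetical 'Catalog List' file: one Data Package
    URL (or local file path) per line; blank lines are skipped."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

Each entry could then be fetched and its datapackage.json parsed to build out a full catalog.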
Based on recent discussion with @bnvk it seems to me that making a registry a (tabular) data package is nice and kind of cute. As a straw man suggestion the registry is a Data Package with the following structure:
catalog.csv is a CSV file with the following structure:
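A sketch of loading such a catalog.csv with Python's csv module, assuming the url/name/owner columns proposed later in this thread (the loader itself is hypothetical; only the column names come from the proposal):

```python
import csv

def load_catalog(path):
    """Load a catalog.csv into a list of dicts. The url column is
    treated as required; name and owner may be empty (optional)."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        if not row.get("url"):
            raise ValueError("catalog row missing url: %r" % (row,))
    return rows
```

A sample file might look like `url,name,owner` on the header line, followed by one row per dataset.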
+1 in favor of a (tabular) data package, but be aware that a GitHub repository may not be enough in some cases; see datasets/awesome-data#113 (see problem with …)
Agreed, definitely the way to go.
Seems like some other values might be desirable, such as:
So in the case where the dataset is NOT hosted on GitHub, perhaps the …
Note: amended the description with the current simple pattern we use. Also:
@rufuspollock let's WONTFIX?
INVALID / WONTFIX. This discussion has been fruitful, but there is no explicit action here, so closing. (Maybe one day we can have a pattern!)
I think adding a spec for this would be useful beyond the registry/catalog use case. Something like:

```
batch:
  - metadata.json
  - user_packages[]:
      { metadata, relation_a, relation_b }
```

This could theoretically be modeled with a single data package, in a "normalized" way, but it isn't practical (and certainly isn't frictionless). If we had a spec for …, I think the right approach for this would be first to add a …. As an additional upside, it's hard for me to think of a data topology this approach wouldn't let us model.
an example of an ETL tool leveraging these ideas:

```
package(user_packages):
  # profile specifies metadata json schema and user_package schema
  profile: "http://foo.bar/batch_user_package_profile.json"
  metadata: ./batch_metadata.json  # includes `created` timestamp
  resources:
    user_packages:
      - user: "./user_a.json"  # includes `id` uuid
      - data: "./tabular_user_a_data.csv"
```

one could normalize it into csvs ready for ingestion into a relational DB:

```
# reduce the package resources into a single normalized package ready for ingestion:
reduce_package_stream(
    batch_package_stream,
    # a spec with the normalized resources
    target_profile="normalized_ingestion_profile.json",
    target="./normalized_data/",
    # just some pointfree pseudo code for transforming user data
    # and merging some top level batch data
    transform=across_packages(
        "/user_packages",
        reduce_to_resources={
            "users": concat(merge("user", "/metadata#created,batch_id")),
            "data": concat(merge("data", "user#id"))
        }
    )
)
```
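The pointfree pseudocode above could be approximated in plain Python. This is a sketch under the assumption that the batch package has already been loaded into dicts; every name here is hypothetical, not from any spec:

```python
def reduce_batch(batch):
    """Flatten a batch package into 'users' and 'data' row lists
    ready for relational ingestion, merging in batch-level metadata."""
    meta = batch["metadata"]
    users, data = [], []
    for up in batch["user_packages"]:
        # merge batch-level fields into each user row
        users.append({**up["user"],
                      "created": meta["created"],
                      "batch_id": meta["batch_id"]})
        # tag each tabular row with the owning user's id
        for row in up["data"]:
            data.append({**row, "user_id": up["user"]["id"]})
    return {"users": users, "data": data}
```

The result corresponds to the two normalized resources ("users" and "data") in the pseudocode's `reduce_to_resources`.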
@micimize could you give a bit more info on the spec you propose for the catalog - i'm not quite sure i follow ...
@rufuspollock in order of implementation, we make: …
Does that make sense? The example is just meant to illustrate how this approach would enable really powerful dataflow-style tooling.
@micimize thanks - can you give a sample of what the catalog file would look like ...
@rufuspollock sure:

```
{
  "profile": "data-package-catalog",
  "name": "climate-change-packages",
  "resources": [
    {
      // this would probably actually be a custom profile,
      // like "aq-deployment-data-package"
      "profile": "json-data-package",
      "name": "beacon-network-description",
      "path": "http://beacon.berkeley.edu/hypothetical_deployment_description.json"
    },
    {
      "profile": "tabular-data-package",
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "profile": "tabular-data-package",
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}
```

Or, for a ubiquitous package registry where the profile constrains the valid …:

```
{
  "profile": "tabular-data-package-catalog",
  "name": "datahub-climate-change-packages",
  "resources": [
    {
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}
```

I think all fields could be hypothetically optional except …
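As a sketch of how a tool might walk such a catalog descriptor, treating each resource as a pointer to another package (the function name and the "strip `-catalog` to get the default resource profile" rule are my assumptions, inferred from the examples, not from any spec):

```python
def iter_catalog_packages(catalog):
    """Yield (profile, name, path) for each package listed as a
    resource of a catalog-style descriptor (a plain dict)."""
    # assume e.g. "tabular-data-package-catalog" implies resources
    # default to the "tabular-data-package" profile
    default = catalog.get("profile", "").replace("-catalog", "")
    for res in catalog.get("resources", []):
        yield (res.get("profile", default), res.get("name"), res["path"])
```

A consumer could then fetch each `path`, parse the referenced datapackage.json, and validate it against the yielded profile.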
@micimize ok i get it - you are actually listing the data packages as resources. Interesting. I hadn't thought of that. I'd been thinking you'd have a single CSV or JSON resource that was the catalog. What do you think are the pros/cons of this approach vs having a JSON or CSV file listing the data packages?
@rufuspollock so, a CSV listing data packages is simple, and an alright solution for the "static registry website" use case. But it is limited to that use case: without a "document resource" spec we can't model the domain properly, so we also can't make spec-driven frictionless tooling for manipulating sets of data packages, or any other data topology beyond flat relational data (see my first comment).

To put it a different way - I view the "document resource" spec as the necessary next step in a truly frictionless toolchain, because it will let us model pretty much all real-world data topologies. That we can then model datapackages as a subset/…
@micimize sounds good - I suggest writing this up as a pattern and doing a PR on patterns.md.
Cool stuff 😄
Need to think further about this. Removed the material below from the current spec since this is not finalized.
Current Primary Proposal
Making your registry into a (tabular) Data Package. A real-life example here:
https://github.com/datasets/registry
Here's the rough structure:
catalog.csv is a CSV file with the following structure:
url
: URL to the dataset, usually the URL to the GitHub repository

name
: the name of the dataset as set in its datapackage.json (will usually be the same as the name of the repository)

owner
: the username of the owner of the package. For datasets in GitHub this will be the GitHub username

name and owner are both optional.

# OLD
Options
Option 1
Option 2
Option 3
Existing material
Catalogs and Discovery
In order to find Data Packages, tools may make use of a "consolidated" catalog, either online or locally.
A general specification for (online) Data Catalogs can be found at
http://spec.datacatalogs.org/.
For local catalogs on disk we suggest locating the catalog at "HOME/.dpm/catalog.json", with the following structure:
When Package metadata is added to the catalog, a field called bundle is added pointing to a bundle for this item (see below for more on bundles).