
Discussion of Catalogs re Data Packages #37

Closed
rufuspollock opened this issue Apr 14, 2013 · 20 comments

Comments

@rufuspollock
Contributor

rufuspollock commented Apr 14, 2013

Need to think further about this. Removed the material below from the current spec since this is not finalized.

Current Primary Proposal

Making your registry into a (tabular) Data Package. A real-life example here:

https://github.com/datasets/registry

Here's the rough structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of its GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually
    the same as the name of the repository)
  • owner: the username of the package's owner. For datasets hosted on
    GitHub this is the GitHub username

name and owner are both optional.
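A minimal sketch of consuming such a catalog.csv, assuming the three-column layout above (the sample rows are hypothetical):

```python
import csv
import io

# Hypothetical sample following the url,name,owner layout proposed above;
# name and owner are optional and may be left empty.
SAMPLE = """url,name,owner
https://github.com/datasets/gdp,gdp,datasets
https://github.com/datasets/country-codes,,
"""

def parse_catalog(text):
    """Return a list of row dicts, treating empty name/owner as missing."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (value or None) for key, value in row.items()})
    return rows

entries = parse_catalog(SAMPLE)
print(entries[0]["name"])   # "gdp"
print(entries[1]["owner"])  # None, since owner is optional
```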


# OLD

Options

Option 1

[ 
   { data-package },
   { data-package }
]

Option 2

{ 
   dp-id: { data-package },
   dp-id: { data-package }
}

Option 3

 {
    dataPackageCatalogVersion: [an integer indicating version of the spec this corresponds to]
    dataPackages: 
      like option 1 or 2 ...
    ...
 }
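A hypothetical concrete instance of Option 3, using the hash (Option 2) form inside dataPackages; the field names follow the sketch above, and the package contents are placeholders:

```python
# Option 3 with the hash form inside dataPackages: packages are keyed by id,
# and the catalog carries its own spec version.
catalog = {
    "dataPackageCatalogVersion": 1,
    "dataPackages": {
        "gdp": {"name": "gdp", "resources": []},
        "country-codes": {"name": "country-codes", "resources": []},
    },
}

# Looking up a package by its id is a single dict access:
print(catalog["dataPackages"]["gdp"]["name"])  # "gdp"
```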

Existing material

Catalogs and Discovery

In order to find Data Packages, tools may make use of a "consolidated" catalog,
either online or local.

A general specification for (online) Data Catalogs can be found at
http://spec.datacatalogs.org/.

For local catalogs on disk we suggest locating at "HOME/.dpm/catalog.json" and
having the following structure::

 {
    "version": ...,
    "datasets": {
      "{name}": {
        "{version}": {
          "metadata": { ... },
          "bundles": [
            { "url": ..., "type": "file | url | ckan | zip | tgz" }
          ]
        }
      }
    }
 }

When Package metadata is added to the catalog, a field called bundle is added
pointing to a bundle for this item (see below for more on bundles).
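A minimal sketch of reading a local catalog with this shape; the nested name -> version -> {metadata, bundles} layout follows the structure above, and the sample document is hypothetical (a real tool would read it from HOME/.dpm/catalog.json):

```python
import json

# Hypothetical local catalog document in the proposed structure.
CATALOG_JSON = """
{
  "version": 1,
  "datasets": {
    "gdp": {
      "1.0": {
        "metadata": {"title": "Country GDP"},
        "bundles": [{"url": "https://example.org/gdp.zip", "type": "zip"}]
      }
    }
  }
}
"""

def list_datasets(catalog_text):
    """Yield (name, version, metadata) triples from a catalog document."""
    catalog = json.loads(catalog_text)
    for name, versions in catalog.get("datasets", {}).items():
        for version, entry in versions.items():
            yield name, version, entry.get("metadata", {})

for name, version, meta in list_datasets(CATALOG_JSON):
    print(name, version, meta.get("title"))
```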

@rufuspollock
Contributor Author

TODO: review the US gov data.json spec again ...

@rufuspollock
Contributor Author

Laid out 3 options. Inclined toward option 3, with the hash form inside dataPackages.

Note this is similar to https://github.com/datasets/registry/blob/4e19a9d2fb58e885f0622bb0ea1b7f3eaa78d334/datapackage-index.json

Questions:

  • Would we ever have a situation where we have multiple versions of a given datapackage.json inside the catalog? If so we would need to key by another level.
    • At the moment I think the answer is: no requirement for this complexity, so let's not have it.
  • Performance considerations: storing all this as JSON does not seem a huge deal. Even at 1kb per data package, 5k data packages is only around 5mb ...

rufuspollock added a commit to rufuspollock/fd-specs that referenced this issue Feb 25, 2014
@rufuspollock
Contributor Author

Based on experience with the registry I'd like to suggest something even simpler.

Basic thing people have is a "Catalog List" in a file named catalog-list.txt

This is a list of URLs (file or http etc) to Data Packages with one data package per line.

Questions:

  • Consolidation / caching: often I will want to get the full data package metadata together (perhaps in a searchable form) (that's what we have to do on data.okfn.org for example). Suggest we have catalog.json which is a consolidated set of datapackage.json and perhaps catalog-index.json or similar ...?
  • support for relative URLs (??)
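The consolidation step above can be sketched as follows; fetch_metadata is a hypothetical stand-in for an HTTP GET of each package's datapackage.json, and the sample URLs are placeholders:

```python
import json

# Hypothetical catalog-list.txt content: one data package URL per line.
CATALOG_LIST = """\
https://example.org/packages/gdp
https://example.org/packages/co2-ppm
"""

def fetch_metadata(url):
    # Placeholder: a real implementation would fetch url + "/datapackage.json".
    return {"name": url.rstrip("/").rsplit("/", 1)[-1]}

def consolidate(list_text):
    """Build a consolidated catalog.json document from a catalog list."""
    urls = [line.strip() for line in list_text.splitlines() if line.strip()]
    return {"datapackages": [fetch_metadata(u) for u in urls]}

catalog = consolidate(CATALOG_LIST)
print(json.dumps(catalog, indent=2))
```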

@rufuspollock
Contributor Author

Based on recent discussion with @bnvk it seems to me that making a registry a (tabular) data package is nice and kind of cute.

As a straw man suggestion the registry is a Data Package with the following structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of its GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually
    the same as the name of the repository)
  • owner: the username of the package's owner. For datasets hosted on
    GitHub this is the GitHub username

name and owner are both optional.

@femtotrader

+1 in favor of a (tabular) data package, but be aware that a GitHub repository may not be enough in some cases; see datasets/awesome-data#113 (the problem with gh-pages).
Maybe another column like url_optional is necessary.

@rufuspollock
Contributor Author

Agreed, definitely the way to go.

@bnvk

bnvk commented Sep 21, 2015

Seems like some other values might be desirable, such as:

url, name, owner, keywords, last_updated
...

So in the case where the dataset is NOT hosted on GitHub, perhaps the owner value could be better scoped by something like:

@github-username
name@domain.com
https://domain.com
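The three owner forms suggested above could be distinguished mechanically; this is a hypothetical classifier, not part of any spec:

```python
import re

def owner_kind(value):
    """Classify an owner value as a GitHub handle, URL, or email."""
    if value.startswith("@"):
        return "github"
    if re.match(r"^https?://", value):
        return "url"
    if "@" in value:
        return "email"
    return "unknown"

print(owner_kind("@bnvk"))               # github
print(owner_kind("name@domain.com"))     # email
print(owner_kind("https://domain.com"))  # url
```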

@roll roll added the backlog label Aug 8, 2016
@roll roll removed the backlog label Aug 29, 2016
@pwalsh pwalsh modified the milestone: Backlog Feb 5, 2017
@rufuspollock
Contributor Author

Note: amended the description with current simple pattern we use.

Also:

@pwalsh
Member

pwalsh commented May 29, 2017

@rufuspollock let's WONTFIX ?

@rufuspollock
Contributor Author

INVALID / WONTFIX. This discussion has been fruitful but there is no explicit action here so closing. (Maybe one day we can have a pattern!)

@micimize
Contributor

I think adding a spec for this would be useful beyond the registry/catalog use case.
For instance, I'm currently working with a system where the topology is roughly:

batch:
  - metadata.json
  - user_packages[]:
     { metadata, relation_a, relation_b }

This could theoretically be modeled with a single data package, in a "normalized" way, but it isn't practical (and certainly isn't frictionless).

If we had a spec for package(packages[]), we could build tools to leverage it generically (this is actually roughly what I thought datapackage-pipelines/dataflows was for, initially)

I think the right approach for this would be first to add a document resource spec (json schema + json doc to start). Then a datapackage resource can be built from that.

As an additional upside, it's hard for me to think of a data topology this approach wouldn't let us model.

@micimize
Contributor

An example of an ETL tool leveraging these ideas:
Given the package

package(user_packages):
  # profile specifies metadata json schema and user_package schema
  profile: "http://foo.bar/batch_user_package_profile.json" 
  metadata: ./batch_metadata.json # includes `created` timestamp
  resources:
    user_packages:
    - user: "./user_a.json" # includes `id` uuid
    - data: "./tabular_user_a_data.csv"

one could normalize it into csvs ready for ingestion into a relational DB:

# reduce the package resources into a single normalized package ready for ingestion:
reduce_package_stream(
    batch_package_stream,
    # a spec with the normalized resources 
    target_profile="normalized_ingestion_profile.json",
    target="./normalized_data/",

    # just some pointfree pseudo code for transforming user data
    # and merging some top level batch data
    transform=across_packages(
        "/user_packages",
        reduce_to_resources={
            "users": concat(merge("user", "/metadata#created,batch_id")),
            "data": concat(merge("data", "user#id"))
        }
    )
)

@rufuspollock
Contributor Author

@micimize could you give a bit more info on the spec you propose for the catalog - i'm not quite sure i follow ...

@micimize
Contributor

@rufuspollock in order of implementation, we make:

  • a "document resource" spec, like a tabular resource but with json schema
  • a "datapackage resource", which is just a refinement of that - a document that conforms to the datapackage json schema
  • then, a "datapackage datapackage" is just a datapackage where all the resources are packages. This can be refined to a "package of tabular datapackages", etc.

Does that make sense? To use a datapackage-datapackage as a catalogue, you'd just have resources of { "path": "http://website.com/package", ... } without any inline data, and a profile.

The example is just meant to illustrate how this approach would enable really powerful dataflow-style tooling

@rufuspollock
Contributor Author

@micimize thanks - can you give a sample of what the catalog file would look like ...

@micimize
Contributor

@rufuspollock sure:
So, for a "mixed" package registry:

{
  "profile": "data-package-catalog",
  "name": "climate-change-packages",
  "resources": [
    {
      // this would probably actually be a custom profile,
      // like "aq-deployment-data-package"
      "profile": "json-data-package",
      "name": "beacon-network-description",
      "path": "https://beacon.berkeley.edu/hypothetical_deployment_description.json"
    },
    {
      "profile": "tabular-data-package",
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "profile": "tabular-data-package",
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}

Or, for a ubiquitous package registry where the profile constrains the valid resources:

{
  "profile": "tabular-data-package-catalog",
  "name": "datahub-climate-change-packages",
  "resources": [
    {
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}

I think all fields could be hypothetically optional except "path", as it can be used to pull the rest.
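A sketch of resolving such a catalog, where each resource's "path" points at a datapackage.json and missing names are filled in from the fetched descriptor; fetch_descriptor is a hypothetical stand-in for an HTTP GET:

```python
def fetch_descriptor(path):
    # Placeholder for downloading and parsing the datapackage.json at `path`;
    # here we just derive the name from the datahub-style URL layout.
    return {"name": path.rsplit("/", 3)[-3]}

def resolve_catalog(catalog):
    """Return [{name, path}] for every package resource in the catalog."""
    resolved = []
    for resource in catalog.get("resources", []):
        descriptor = fetch_descriptor(resource["path"])
        name = resource.get("name") or descriptor.get("name")
        resolved.append({"name": name, "path": resource["path"]})
    return resolved

catalog = {
    "profile": "tabular-data-package-catalog",
    "resources": [
        {"path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"},
        {"name": "co2-fossil-global",
         "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"},
    ],
}

for entry in resolve_catalog(catalog):
    print(entry["name"])
```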

@rufuspollock
Contributor Author

@micimize ok i get it - you are actually listing the data packages as resources. Interesting. I hadn't thought of that. I'd been thinking you'd have a single CSV or json resource that was the catalog.

What do you think are the pros/cons of this approach vs having a json or CSV file listing the data packages.

@micimize
Contributor

@rufuspollock so, a csv listing data packages is simple, and an alright solution for the "static registry website" use case.

But it is limited to that use case: without a "document resource" spec we can't model the domain properly, so we also can't build spec-driven frictionless tooling for manipulating sets of data packages, or any other data topology beyond flat relational data (see my first comment).

To put it a different way - I view the "document resource" spec as the necessary next step in a truly frictionless toolchain, because it will let us model pretty much all real-world data topologies. That we can then model datapackages as a subset/profile of document resources, and build more powerful tooling based on that model, just naturally falls out of that foundational feature.

@rufuspollock
Contributor Author

@micimize sounds good - I suggest writing this up as a pattern and doing a PR on patterns.md.

rufuspollock added a commit that referenced this issue Jul 10, 2019
…Data Package Catalog as a Data Package - refs #37.

Document resources, Data package resources

Merge pull request #632 from micimize/master
@bnvk

bnvk commented Nov 12, 2019

Cool stuff 😄

7 participants