
Discussion of Catalogs re Data Packages #37

Closed
rufuspollock opened this issue Apr 14, 2013 · 20 comments

Comments

@rufuspollock
Contributor

rufuspollock commented Apr 14, 2013

Need to think further about this. Removed the material below from the current spec since this is not finalized.

Current Primary Proposal

Making your registry into a (tabular) Data Package. A real-life example here:

https://github.com/datasets/registry

Here's the rough structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of its GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually
    the same as the name of the repository)
  • owner: the username of the package's owner. For datasets hosted on
    GitHub this is the GitHub username

name and owner are both optional.
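A minimal sketch of consuming such a catalog.csv, assuming the three-column layout above (the sample rows are hypothetical):

```python
import csv
import io

# Hypothetical sample following the url,name,owner layout proposed above;
# name and owner are optional and may be left empty.
SAMPLE = """url,name,owner
https://github.com/datasets/gdp,gdp,datasets
https://github.com/datasets/country-codes,,
"""

def parse_catalog(text):
    """Return a list of row dicts, treating empty name/owner as missing."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (value or None) for key, value in row.items()})
    return rows

entries = parse_catalog(SAMPLE)
print(entries[0]["name"])   # "gdp"
print(entries[1]["owner"])  # None, since owner is optional
```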


# OLD

Options

Option 1

[ 
   { data-package },
   { data-package }
]

Option 2

{ 
   dp-id: { data-package },
   dp-id: { data-package }
}

Option 3

 {
    dataPackageCatalogVersion: [an integer indicating version of the spec this corresponds to]
    dataPackages: 
      like option 1 or 2 ...
    ...
 }
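A hypothetical concrete instance of Option 3, using the hash (Option 2) form inside dataPackages; the field names follow the sketch above, and the package contents are placeholders:

```python
# Option 3 with the hash form inside dataPackages: packages are keyed by id,
# and the catalog carries its own spec version.
catalog = {
    "dataPackageCatalogVersion": 1,
    "dataPackages": {
        "gdp": {"name": "gdp", "resources": []},
        "country-codes": {"name": "country-codes", "resources": []},
    },
}

# Looking up a package by its id is a single dict access:
print(catalog["dataPackages"]["gdp"]["name"])  # "gdp"
```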

Existing material

Catalogs and Discovery

In order to find Data Packages, tools may make use of a "consolidated" catalog,
either online or local.

A general specification for (online) Data Catalogs can be found at
http://spec.datacatalogs.org/.

For local catalogs on disk we suggest locating at "HOME/.dpm/catalog.json" and
having the following structure::

 {
    "version": ...,
    "datasets": {
      "{name}": {
        "{version}": {
          "metadata": { ... },
          "bundles": [
            { "url": ..., "type": "file | url | ckan | zip | tgz" }
          ]
        }
      }
    }
 }

When Package metadata is added to the catalog, a field called bundle is added
pointing to a bundle for this item (see below for more on bundles).
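A minimal sketch of reading a local catalog with this shape; the nested name -> version -> {metadata, bundles} layout follows the structure above, and the sample document is hypothetical (a real tool would read it from HOME/.dpm/catalog.json):

```python
import json

# Hypothetical local catalog document in the proposed structure.
CATALOG_JSON = """
{
  "version": 1,
  "datasets": {
    "gdp": {
      "1.0": {
        "metadata": {"title": "Country GDP"},
        "bundles": [{"url": "https://example.org/gdp.zip", "type": "zip"}]
      }
    }
  }
}
"""

def list_datasets(catalog_text):
    """Yield (name, version, metadata) triples from a catalog document."""
    catalog = json.loads(catalog_text)
    for name, versions in catalog.get("datasets", {}).items():
        for version, entry in versions.items():
            yield name, version, entry.get("metadata", {})

for name, version, meta in list_datasets(CATALOG_JSON):
    print(name, version, meta.get("title"))
```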

@rufuspollock
Contributor Author

TODO: review the US gov data.json spec again ...

@rufuspollock
Contributor Author

Laid out 3 options. Inclined toward option 3, with the hash form inside dataPackages.

Note this is similar to https://github.com/datasets/registry/blob/4e19a9d2fb58e885f0622bb0ea1b7f3eaa78d334/datapackage-index.json

Questions:

  • Would we ever have a situation where we have multiple versions of a given datapackage.json inside the catalog? If so we would need to key by another level.
    • At the moment I think the answer is: no requirement for this complexity, so let's not have it.
  • Performance considerations: storing all this as JSON does not seem a huge deal. Even at 1kb per data package, 5k data packages is only around 5mb ...

rufuspollock added a commit to rufuspollock/fd-specs that referenced this issue Feb 25, 2014
@rufuspollock
Contributor Author

Based on experience with the registry I'd like to suggest something even simpler.

Basic thing people have is a "Catalog List" in a file named catalog-list.txt

This is a list of URLs (file or http etc) to Data Packages with one data package per line.

Questions:

  • Consolidation / caching: often I will want to get the full data package metadata together (perhaps in a searchable form) (that's what we have to do on data.okfn.org for example). Suggest we have catalog.json which is a consolidated set of datapackage.json and perhaps catalog-index.json or similar ...?
  • support for relative URLs (??)
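The consolidation step above can be sketched as follows; fetch_metadata is a hypothetical stand-in for an HTTP GET of each package's datapackage.json, and the sample URLs are placeholders:

```python
import json

# Hypothetical catalog-list.txt content: one data package URL per line.
CATALOG_LIST = """\
https://example.org/packages/gdp
https://example.org/packages/co2-ppm
"""

def fetch_metadata(url):
    # Placeholder: a real implementation would fetch url + "/datapackage.json".
    return {"name": url.rstrip("/").rsplit("/", 1)[-1]}

def consolidate(list_text):
    """Build a consolidated catalog.json document from a catalog list."""
    urls = [line.strip() for line in list_text.splitlines() if line.strip()]
    return {"datapackages": [fetch_metadata(u) for u in urls]}

catalog = consolidate(CATALOG_LIST)
print(json.dumps(catalog, indent=2))
```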

@rufuspollock
Contributor Author

Based on recent discussion with @bnvk it seems to me that making a registry a (tabular) data package is nice and kind of cute.

As a straw man suggestion the registry is a Data Package with the following structure:

datapackage.json
catalog.csv

catalog.csv is a CSV file with the following structure:

url,name,owner
...
  • url: the URL of the dataset, usually the URL of its GitHub repository
  • name: the name of the dataset as set in its datapackage.json (usually
    the same as the name of the repository)
  • owner: the username of the package's owner. For datasets hosted on
    GitHub this is the GitHub username

name and owner are both optional.

@femtotrader

+1 in favor of a (tabular) data package, but be aware that a GitHub repository may not be enough in some cases; see datasets/awesome-data#113 (the problem with gh-pages).
Maybe another column like url_optional is necessary.

@rufuspollock
Contributor Author

Agreed, definitely the way to go.

@bnvk

bnvk commented Sep 21, 2015

Seems like some other values might be desirable, such as:

url, name, owner, keywords, last_updated
...

So in the case where the dataset is NOT hosted on GitHub, perhaps the owner value could be better scoped by something like:

@github-username
name@domain.com
https://domain.com
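The three owner forms suggested above could be distinguished mechanically; this is a hypothetical classifier, not part of any spec:

```python
import re

def owner_kind(value):
    """Classify an owner value as a GitHub handle, URL, or email."""
    if value.startswith("@"):
        return "github"
    if re.match(r"^https?://", value):
        return "url"
    if "@" in value:
        return "email"
    return "unknown"

print(owner_kind("@bnvk"))               # github
print(owner_kind("name@domain.com"))     # email
print(owner_kind("https://domain.com"))  # url
```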

@roll roll added the backlog label Aug 8, 2016
@roll roll removed the backlog label Aug 29, 2016
@pwalsh pwalsh modified the milestone: Backlog Feb 5, 2017
@rufuspollock
Contributor Author

Note: amended the description with current simple pattern we use.

Also:

@pwalsh
Member

pwalsh commented May 29, 2017

@rufuspollock let's WONTFIX ?

@rufuspollock
Contributor Author

INVALID / WONTFIX. This discussion has been fruitful but there is no explicit action here so closing. (Maybe one day we can have a pattern!)

@micimize
Contributor

I think adding a spec for this would be useful beyond the registry/catalog use case.
For instance, I'm currently working with a system where the topology is roughly:

batch:
  - metadata.json
  - user_packages[]:
     { metadata, relation_a, relation_b }

This could theoretically be modeled with a single data package, in a "normalized" way, but it isn't practical (and certainly isn't frictionless).

If we had a spec for package(packages[]), we could build tools to leverage it generically (this is actually roughly what I thought datapackage-pipelines/dataflows was for, initially)

I think the right approach for this would be first to add a document resource spec (json schema + json doc to start). Then a datapackage resource can be built from that.

As an additional upside, it's hard for me to think of a data topology this approach wouldn't let us model.

@micimize
Contributor

An example of an ETL tool leveraging these ideas:
Given the package

package(user_packages):
  # profile specifies metadata json schema and user_package schema
  profile: "http://foo.bar/batch_user_package_profile.json" 
  metadata: ./batch_metadata.json # includes `created` timestamp
  resources:
    user_packages:
    - user: "./user_a.json" # includes `id` uuid
    - data: "./tabular_user_a_data.csv"

one could normalize it into csvs ready for ingestion into a relational DB:

# reduce the package resources into a single normalized package ready for ingestion:
reduce_package_stream(
    batch_package_stream,
    # a spec with the normalized resources 
    target_profile="normalized_ingestion_profile.json",
    target="./normalized_data/",

    # just some pointfree pseudo code for transforming user data
    # and merging some top level batch data
    transform=across_packages(
        "/user_packages",
        reduce_to_resources={
            "users": concat(merge("user", "/metadata#created,batch_id")),
            "data": concat(merge("data", "user#id"))
        }
    )
)

@rufuspollock
Contributor Author

@micimize could you give a bit more info on the spec you propose for the catalog - i'm not quite sure i follow ...

@micimize
Contributor

@rufuspollock in order of implementation, we make:

  • a "document resource" spec, like a tabular resource but with json schema
  • a "datapackage resource", which is just a refinement of that - a document that conforms to the datapackage json schema
  • then, a "datapackage datapackage" is just a datapackage where all the resources are packages. This can be refined to a "package of tabular datapackages", etc.

Does that make sense? To use a datapackage-datapackage as a catalogue, you'd just have resources of { "path": "http://website.com/package", ... } without any inline data, and a profile.

The example is just meant to illustrate how this approach would enable really powerful dataflow-style tooling

@rufuspollock
Contributor Author

@micimize thanks - can you give a sample of what the catalog file would look like ...

@micimize
Contributor

@rufuspollock sure:
So, for a "mixed" package registry:

{
  "profile": "data-package-catalog",
  "name": "climate-change-packages",
  "resources": [
    {
      // this would probably actually be a custom profile,
      // like "aq-deployment-data-package"
      "profile": "json-data-package",
      "name": "beacon-network-description",
      "path": "https://beacon.berkeley.edu/hypothetical_deployment_description.json"
    },
    {
      "profile": "tabular-data-package",
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "profile": "tabular-data-package",
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}

Or, for a ubiquitous package registry where the profile constrains the valid resources:

{
  "profile": "tabular-data-package-catalog",
  "name": "datahub-climate-change-packages",
  "resources": [
    {
      "path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"
    },
    {
      "name": "co2-fossil-global",
      "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"
    }
  ]
}

I think all fields could be hypothetically optional except "path", as it can be used to pull the rest.
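A sketch of resolving such a catalog, where each resource's "path" points at a datapackage.json and missing names are filled in from the fetched descriptor; fetch_descriptor is a hypothetical stand-in for an HTTP GET:

```python
def fetch_descriptor(path):
    # Placeholder for downloading and parsing the datapackage.json at `path`;
    # here we just derive the name from the datahub-style URL layout.
    return {"name": path.rsplit("/", 3)[-3]}

def resolve_catalog(catalog):
    """Return [{name, path}] for every package resource in the catalog."""
    resolved = []
    for resource in catalog.get("resources", []):
        descriptor = fetch_descriptor(resource["path"])
        name = resource.get("name") or descriptor.get("name")
        resolved.append({"name": name, "path": resource["path"]})
    return resolved

catalog = {
    "profile": "tabular-data-package-catalog",
    "resources": [
        {"path": "https://pkgstore.datahub.io/core/co2-ppm/10/datapackage.json"},
        {"name": "co2-fossil-global",
         "path": "https://pkgstore.datahub.io/core/co2-fossil-global/11/datapackage.json"},
    ],
}

for entry in resolve_catalog(catalog):
    print(entry["name"])
```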

@rufuspollock
Contributor Author

@micimize ok i get it - you are actually listing the data packages as resources. Interesting. I hadn't thought of that. I'd been thinking you'd have a single CSV or json resource that was the catalog.

What do you think are the pros/cons of this approach vs having a json or CSV file listing the data packages.

@micimize
Contributor

@rufuspollock so, a csv listing data packages is simple, and an alright solution for the "static registry website" use case.

But it is limited to that use case: without a "document resource" spec we can't model the domain properly, so we also can't build spec-driven frictionless tooling for manipulating sets of data packages, or any other data topology beyond flat relational data (see my first comment).

To put it a different way - I view the "document resource" spec as the necessary next step in a truly frictionless toolchain, because it will let us model pretty much all real-world data topologies. That we can then model datapackages as a subset/profile of document resources, and build more powerful tooling based on that model, just naturally falls out of that foundational feature.

@rufuspollock
Contributor Author

@micimize sounds good - I suggest writing this up as a pattern and doing a PR on patterns.md.

rufuspollock added a commit that referenced this issue Jul 10, 2019
…Data Package Catalog as a Data Package - refs #37.

Document resources, Data package resources

Merge pull request #632 from micimize/master
@bnvk

bnvk commented Nov 12, 2019

Cool stuff 😄

7 participants