[DP] Data Dependencies #75

Closed
rufuspollock opened this Issue Nov 12, 2013 · 8 comments


@rufuspollock
Contributor

We want a data package to be able to list the other data packages it depends on, so that they are installed along with it.

Implementation

We could either:

  • reuse the CommonJS dependencies attribute
  • create a new dataDependencies attribute

I think the latter is better, as we avoid collisions - see oki-archive/datapackage.js-deprecated#3

@rufuspollock
Contributor

@paulfitz any thoughts here?

@paulfitz
Contributor

@rgrp I'd generally be +1 for adding a specific attribute for a specific kind of dependency (so favoring the dataDependencies option). I'm not quite sure how this is going to play out. It could well be a mistake long-term, but that is the particular mistake I would vote for right now :-).

@sballesteros

I understand the need for dataDependencies outside of a data package, for instance when you consume data packages from a programming language. Taking Node.js as an example, you list the data packages that you want to use as dependencies (or dataDependencies, once npm supports it) in a package.json and you are good to go. However, within a data package, the usage is a bit less clear.

I see 2 main benefits of having dataDependencies within a data package:

  1. for foreign keys: ensuring that the values of a field with a foreignkey property are a subset of the values of the field, in the referenced resource, that the foreign key points to.
  2. to explicitly refer to resources from the data packages listed as dependencies, so that a user can distribute a data package containing a set of resources that they grouped together according to criteria explained in the description of the data package.
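
To make point 1 concrete, here is a minimal Python sketch of the check it describes. The function name, the rows-as-dicts representation and the sample data are all assumptions made for illustration; nothing here comes from the spec.

```python
def missing_foreign_keys(rows, field, referenced_rows, referenced_field):
    """Return the values of `field` that do not appear in the referenced resource."""
    allowed = {row[referenced_field] for row in referenced_rows}
    return sorted({row[field] for row in rows if row[field] not in allowed})


# Hypothetical data: `countries` stands for a resource from a package listed
# in dataDependencies, `cities` for a resource of the depending package.
countries = [{"code": "FR"}, {"code": "GB"}]
cities = [{"name": "Paris", "country": "FR"}, {"name": "Oslo", "country": "NO"}]

# An empty result would mean the foreign key constraint holds.
print(missing_foreign_keys(cities, "country", countries, "code"))
```

A validator could only run this kind of check if the referenced package is known and installed, which is the argument for declaring it in dataDependencies.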

I do not know whether 2. is within the scope of the data package protocol. To give an example application: within the reproducibility initiative we want to use data packages to package all the inputs used in a scientific study. Being able to refer to resources from other data packages is particularly useful in this case, as it avoids data duplication (hazardous from the perspective of reproducibility) while allowing an explicit listing of which subset of the dependencies was used.

Note: for the implementation (valid for 1. and 2.) I would suggest that we follow the folder structure and dependency resolution used by npm.
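
As a rough illustration of what npm-style resolution could look like for data packages, the Python sketch below walks up from a starting directory and checks a dependency folder at each level, much as Node.js does with node_modules. The folder name `datapackages` and the function name are hypothetical; the thread does not settle on either.

```python
from pathlib import Path

# Hypothetical folder name, playing the role that node_modules plays for npm.
DEPS_DIR = "datapackages"


def resolve_data_package(name, start_dir):
    """Return the directory of data package `name`, or None if it is not found.

    Looks for <dir>/datapackages/<name>/datapackage.json in `start_dir`,
    then in each parent directory up to the filesystem root.
    """
    start = Path(start_dir).resolve()
    for directory in (start, *start.parents):
        candidate = directory / DEPS_DIR / name / "datapackage.json"
        if candidate.is_file():
            return candidate.parent
    return None
```

With this layout, a nested package finds its own copy of a dependency first and falls back to a copy shared higher up the tree.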

@rufuspollock
Contributor

@sballesteros both 1+2 are good. I guess one can also think back to basic use cases for data packages and imagine a situation where data packages are part of a tool chain that, say, loads all of the data into an SQL db. You can then imagine a situation where one requires package A which requires package B + C. However, I think this example fits inside of the foreignkey case you describe in most circumstances.
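
The "A requires B + C" situation amounts to ordering packages so that dependencies are loaded first. A minimal sketch using Python's standard graphlib module (Python 3.9+) follows; the mapping of package name to dependency names is an assumed in-memory form of each package's dataDependencies, and the function name is made up for the example.

```python
from graphlib import TopologicalSorter


def install_order(data_dependencies):
    """Return package names ordered so that every dependency precedes its dependents.

    `data_dependencies` maps a package name to the set of packages it requires.
    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(data_dependencies).static_order())


# Package A requires B and C; B and C require nothing.
deps = {"A": {"B", "C"}, "B": set(), "C": set()}
order = install_order(deps)
print(order)  # B and C come first (in either order), then A
```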

@sballesteros

So for 2. above, how about:

{
  "name": "dpkg1",
  "version": "0.0.0",
  "dataDependencies": {
    "dpkg2": "0.0.1"
  },
  "resources": [
    {
      "name": "SDF_resource",
      "require": {"datapackage": "dpkg2", "resource": "data", "fields": ["date", "x", "y"]}
    },
    {
      "name": "classical_resource",
      "require": {"datapackage": "dpkg2", "resource": "foo"}
    }
  ]
}

require is an additional way, alongside data, path and url, to specify where the data of a resource comes from.
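
To show how a consumer might interpret such a require object, here is a small Python sketch. It assumes the dependencies have already been loaded into memory as a dict of package name to resource name to list of rows; that layout, the function name and the sample rows are assumptions for illustration only.

```python
def resolve_require(require, packages):
    """Return the rows a `require` object points to.

    If `fields` is present, only those fields are kept; otherwise the whole
    external resource is returned.
    """
    rows = packages[require["datapackage"]][require["resource"]]
    fields = require.get("fields")
    if fields is None:
        return [dict(row) for row in rows]
    return [{name: row[name] for name in fields} for row in rows]


# Hypothetical contents of the dependency dpkg2.
packages = {
    "dpkg2": {
        "data": [{"date": "2013-11-12", "x": 1, "y": 2, "z": 3}],
        "foo": [{"a": 1}],
    }
}

subset = resolve_require(
    {"datapackage": "dpkg2", "resource": "data", "fields": ["date", "x", "y"]},
    packages,
)
print(subset)
```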

For SDF resources, the solution proposed above only allows us to import a subset (or the whole) of a single external resource.
If this is too limiting, we could support a basic merge, i.e. defining a schema in which each field can come from a different data package (including the main one). So something like:

{
  "name": "merged_resource",
  "schema": {
    "fields":[
      {
        "name": "date",
        "require": {"datapackage": "dpkg2", "resource": "data", "field": "date"}
      },
      {
        "name": "seq",
        "require": {"datapackage": "dpkg2", "resource": "seq", "field": "seq"}
      },
      {
        "name": "x",
        "type": "integer"
      }
    ]
  },
  "data": [
    {"x": 10},
    {"x": 20},
    {"x": 30},
    {"x": 40}
  ]
}
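
A rough Python sketch of how such a merged resource could be materialised is given below. The proposal does not say how rows from different sources line up, so the sketch assumes they are aligned by position and have the same length; that assumption, the in-memory `packages` layout and the function name are all hypothetical.

```python
def build_merged_resource(resource, packages):
    """Combine inline `data` with fields pulled from other packages via `require`.

    Assumption: row i of every source corresponds to row i of the result.
    """
    rows = [dict(row) for row in resource.get("data", [])]
    for field in resource["schema"]["fields"]:
        require = field.get("require")
        if require is None:
            continue  # the field comes from the inline data
        source = packages[require["datapackage"]][require["resource"]]
        if len(source) != len(rows):
            raise ValueError(f"row count mismatch for field {field['name']!r}")
        for row, source_row in zip(rows, source):
            row[field["name"]] = source_row[require["field"]]
    return rows


# Hypothetical contents of dpkg2 and a two-row merged resource.
packages = {"dpkg2": {"data": [{"date": "d1"}, {"date": "d2"}]}}
merged = build_merged_resource(
    {
        "schema": {
            "fields": [
                {"name": "date",
                 "require": {"datapackage": "dpkg2", "resource": "data", "field": "date"}},
                {"name": "x", "type": "integer"},
            ]
        },
        "data": [{"x": 10}, {"x": 20}],
    },
    packages,
)
print(merged)
```

The need to decide on an alignment rule at all is one reason a merge may be better left to the application level.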

The second option has the "advantage"(?) of letting us define new constraints (foreign keys...) for the created resource.

I think that I prefer the simplicity of the first solution (I would tend to say that merges should be done at the application level).

@JDureau
JDureau commented Dec 2, 2013

I think these are good solutions to deal with the dependencies on other data packages. The first solution seems better to me too.

@ldodds
Contributor
ldodds commented Dec 5, 2013

Small point, but the DP specification already has dependencies (although it's under-defined). Is the proposal to rename that and document it, or is this issue covering a different purpose?

@rufuspollock
Contributor

@ldodds I should correct the title: this is a proposal to "rename" dependencies in the data protocols spec to dataDependencies.
