
Reconciling with package.jsonld #110

Closed · jbenet opened this issue Apr 3, 2014 · 58 comments

@jbenet commented Apr 3, 2014

Hey guys!

I'm the author of datadex, and now working with @maxogden on dat. As a package manager for datasets, datadex uses a package file to describe its datasets. Choosing between data-package.json and package.jsonld is hard:

It's confusing for adopters to have two different specs. I think we should reconcile these two standards and push forward with one. Thoughts? What work would it entail?

To ease transition costs, I'm happy to take on the convergence work if others are too busy. Also, I can write a tool to convert between current data-package.json and package.jsonld and whatever else.

Cheers!

cc @rgrp, @maxogden, @sballesteros

@sballesteros

Hi!

What we love about JSON-LD is that it can be seen as one serialization of RDF and can therefore be converted to RDFa, and so inserted directly into HTML documents. It opens up some cool possibilities: you could be reading a New York Times article, for instance, ldpm install it, and start hacking on the data. Everything your data package manager needs to know is embedded directly in the HTML!
To me, being able to embed a package.json-like thing into a webpage, respecting open web standards, is amazing.
Regarding schema.org, our "dream" is to be able to leverage the web as the registry, using markup already being indexed by the major search engines (Google, Yahoo!, Yandex, and Bing). Check http://datasets.schema-labs.appspot.com/ for instance.
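As a rough illustration of that idea (not from either spec - the dataset name and URLs here are made up, though the property names are real schema.org terms), a publisher could embed a schema.org Dataset description directly in a page inside a <script type="application/ld+json"> tag:

{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "name": "nyt-crime-figures",
  "description": "Data behind a hypothetical article",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "http://example.com/data.csv"
  }
}

A package manager crawling the page could read that block and know everything it needs to fetch and install the data.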

I would encourage anyone interested in that to go read the JSON-LD spec and the RDFa Lite spec. Both are super well written. The RDFa Lite spec in particular is remarkably short.

That being said, we are still experimenting a lot with that approach and 100% agree that soon enough we should work on merging all of that (and happy to contribute to the work)...

Another thing to follow closely: CSV-LD.

@sballesteros

Forgot to mention, but for datatypes and the like, http://www.w3.org/TR/xmlschema-2/#built-in-datatypes is here to help (and can prevent re-inventing a spec for datatypes).
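For example, a resource schema could reuse those built-ins instead of inventing its own type names - a minimal sketch, assuming a hypothetical fields layout:

{
  "fields": [
    { "name": "date_reported", "type": "http://www.w3.org/2001/XMLSchema#date" },
    { "name": "incident_count", "type": "http://www.w3.org/2001/XMLSchema#integer" }
  ]
}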

@rufuspollock

@jbenet great to hear from you and good questions. Obviously my recommendation here would be that we converge on datapackage.json - I should also say that @sballesteros has been a major contributor to the datapackage.json spec as it stands :-)

I note there are plans to introduce a few json-ld-isms (see #89) into datapackage.json but the basic aim is to keep this as simple as possible and pretty close to the commonjs package spec. Whilst I appreciate RDF's benefits (I've been a heavy RDF user in times past) I think we need to keep things super-simple if we are going to generate adoption - most data producers and users are closer to the Excel than the RDF end of the spectrum. (I note one can always choose to enhance a given datapackage.json to be json-ld like - I just think we don't want to require that).

That said the differences seem pretty minor here in the basic outline so with a small bit of tweaking we could have compatibility.

@sballesteros I note main differences seem, at the moment, to be:

  • rename of 'resources' to 'datasets'
  • rename of 'path' to 'distribution'
  • 'code' element (currently 'scripts' is being used informally in datapackage.json echoing the approach in node)

If we could resolve these, and perhaps define a natural enhancement path for a datapackage.json to become "json-ld" compliant, we could have a common base - those who wanted full json-ld could 'enhance' the base datapackage.json in their desired ways, but we'd keep the simplicity (and commonjs compatibility) for non-RDF folks.

wdyt?

@rufuspollock

@jbenet more broadly - great to see what you are up to. Have you seen https://github.com/okfn/dpm - the data package manager? That seems to have quite a bit in common with the data tool I see you are working on. Perhaps we could converge efforts there too?

There's also a specific issue for the registry at frictionlessdata/dpm-js#5 - the current suggestion had been piggy-backing on github but I know we also have options in terms of CKAN and @sballesteros has worked on a couchdb based registry.

@sballesteros

I would say that, given that using the npm registry is no longer really an option, alignment with schema.org is more interesting than commonjs compatibility, but I am obviously biased ;)

A counter-argument to that would be dat: @maxogden, do you know how dat is going to leverage transformation modules (do you plan to use the scripts property of CommonJS?)

To me, alignment with schema.org means we can generate a package.jsonld from any webpage with RDFa markup (or microdata). You can treat JSON-LD as almost plain JSON (just an extra @context), and in that case there is no additional complexity involved and no need to mention or know RDF at all.

@JDureau commented Apr 5, 2014

Hey all,

Another argument in favour of a spec supporting JSON-LD and aligned with schema.org is explorability. Being able to communicate unambiguously that a given dataset/resource deals with http://en.wikipedia.org/wiki/Crime and http://en.wikipedia.org/wiki/Sentence_(law) for example, goes a longer way than keywords and a description. It makes it query-ready.

"dataset": [
  {
    "name":  "mycsv",
    "about": [
      { 
        "name": “crimes”,
        "sameAs": "http://en.wikipedia.org/wiki/Crime"
      },
      { 
        "name": “sentences”,
        "sameAs": "http://en.wikipedia.org/wiki/Sentence_(law)
      }
    ],
    ...
  }
]

@jbenet commented Apr 5, 2014

@rgrp thanks for checking this out!

@rgrp said:

Whilst I appreciate RDF's benefits ... I think we need to keep things super-simple if we are going to generate adoption ... (I note one can always choose to enhance a given datapackage.json to be json-ld like - I just think we don't want to require that).

Strong +1 for simplicity and ease of use for end users. My target user is the average scientist. Friction (in terms of having to learn how to use tools, or ambiguity in the process) is deadly.

I don't think that making the format JSON-LD compliant will add complexity beyond that ease of use. JSON-LD was designed specifically for the smallest overhead that still provides data-linking power. I found the blog posts from Manu (the primary creator) quite informative.

If I were building any package manager today, I would aim for JSON-LD as a format, seeking the easiest (and readable-ish) integration with other tools. I think JSON is already "difficult to read" for non-developers (hence people's use of YAML and TOML, which are inadequate for other reasons), and the JSON-LD @context additions don't seem to make matters significantly worse.

I think even learning the super simple npm package.json is hard enough for most scientists -- as a prerequisite to publishing their data. I claim the solution is to build a very simple tool (both CLI and GUI) that guides users through populating whatever package format we end up converging on. My data tool already does this, though it could still be even simpler. GUIs will be important for lots of users.


@rgrp I found your dpm after I had already built mine. We should definitely discuss converging there. I'm working with @maxogden and we're building dat + datadex to be interoperable. Also, one of the use cases I care a lot about is large datasets (100GB+) in Machine Learning and Bioinformatics. I'm not sure how much you've dug into how data and datadex work, but they separate the data from the registry metadata, such that the registry can be very lightweight and the data can be retrieved directly from S3, Google Drive, or peers (yes, p2p distribution). The way it works right now will change. Let's talk more out of band. I'll send you an email :)


@sballesteros what do you think of the differences @rgrp pointed out? Namely:

  • rename of 'resources' to 'datasets'
  • rename of 'path' to 'distribution'
  • 'code' element (currently 'scripts' is being used informally in datapackage.json echoing the approach in node)

IMO:

  • resources is more general.
  • distribution seems more general. @sballesteros what else (other than contentPath) could go here?
  • the code in package.jsonld seems to be more descriptive than scripts, however, scripts is simple and works well already. not sure.

And, @sballesteros, do you see other differences? What else do you remember being explicitly different?

Let's try to get convergence on these :)

@sballesteros

@jbenet before diving into the small differences and trying to converge somewhere, I think we should really think about why we would move away from vocabularies promoted by the W3C (like DCAT). To me, schema.org has already done a huge amount of work to bring as much pragmatism as possible into that space; see http://www.w3.org/wiki/WebSchemas/Datasets for instance.

Why don't we join the W3C mailing lists and take action there so that new properties are added if we need them for our different data package managers?

The way I see it, unlike npm and software package managers, for open data one of the key challenges is to make data more accessible to search engines (there are so many different decentralized data publishers out there...). Schema.org is a great step in that direction, so in my opinion it is worth the small amount of clunkiness in the property names that it imposes.
Like you said, a GUI is going to be needed for beginners anyway, so the incentive to move away from W3C standards for easier-to-read property names is low.

Just wanted to make that clear, but all that being said, super happy to go into convergence mode.

@rufuspollock

Let's separate several concerns here:

  • A. MUST datapackage.json be valid JSON-LD? Note, nothing prevents someone enhancing datapackage.json to be full JSON-LD - the spec specifically allows new attributes. As such the question is: should we require datapackage.json always to be valid JSON-LD?
    • e.g. @JDureau nothing prevents you doing what you want already with datapackage.json :-) i.e. you can add those attributes if you want (the question is MUST everybody do it ...)
  • B. If so, what is required for datapackage.json to be valid JSON-LD? (Could someone specify exactly what would be needed here? Is it that there must be an @context and @id?)
  • C. DCAT "compatibility". There isn't a specific serialization of DCAT to JSON, so we have some flexibility here, since we are ultimately going to have to do a bit of translation to DCAT / schema.org whatever we do. I note here that I had significant input into the DCAT spec - including being the person originally responsible for the "distribution" terminology (which came from early CKAN, where it came from python setup.py stuff). I think distribution is ultimately wrong for what we are trying to describe here and argued for that (sadly unsuccessfully ;-) ...) in later versions of DCAT. However, all that said, I'm not too hung up here on the terminology - my biggest concern is breaking changes given existing adoption. My preference would be:
    • resources over datasets (this does not really occur in the DCAT spec and I think datasets is confusing since what you are describing with the datapackage.json is a dataset)
    • I don't mind if we converge path + url into distribution - though it means a bit of effort for parsers (e.g. to work out if something is a relative path)
    • code vs scripts - not in the spec properly yet so happy either way frankly (we could even support both in the interim)

@jbenet current spec allows you to store data anywhere and tradition from CKAN continued in datapackage.json is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in s3, a specific DB, a dat instance etc). dpm is intended to continue to respect that setup so it sounds like we are very close here :-)

@sballesteros (aside) I'm not sure accessibility to search engines is the major concern in this - the concern is integration with tooling. We already get reasonable discovery from search engines (sure it's far from perfect, but it's no worse than for code). Key here for me is that data is more like code than it is like "content". As such, what we most want is better toolchains and processing pipelines for data. As such, the test of our spec is not how it integrates with HTML page markup but how it supports use in data toolchains. As a basic test: can we do dependencies and automated installation (into a DB!)?

@jbenet commented Apr 8, 2014

  • A. MUST datapackage.json be valid JSON-LD?

I think yes. I understand the implications of MUST semantics, and the unfortunate upgrading overhead costs it imposes. But without requiring this, applications cannot rely on a package definition being proper linked-data. They require data-package.json specific parsing. In a way, it constrains the reach of the format. (FWIW, JSON-LD is a very pragmatic format.)

To better understand the costs of converting existing things, it would be useful to get a clear picture of the current usage of data-package.json. I see datahub.io and other CKAN instances. @rgrp Am I correct in assuming all of these use it? What's the completeness of the CKAN instances list (i.e. is that a tiny or large fraction)?

  • B. What is required for datapackage.json to be valid JSON-LD?

I believe @context and @id are enough for valid JSON-LD, though the spec defines more that would be useful. I'm new to it, so I'll dive in and post back here with a straight JSON-LD-fication of data-package.json. In the meantime, @sballesteros, what's your answer to this? What else did you have to use, or get to use?

  • C. DCAT "compatability".

There are relevant mappings between DCAT and Schema.org. I'm new to DCAT, so I can't comment on its vocabulary beyond echoing "let's try not to break compatibility unless we must." @sballesteros?

@jbenet current spec allows you to store data anywhere and tradition from CKAN continued in datapackage.json is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in s3, a specific DB, a dat instance etc). dpm is intended to continue to respect that setup so it sounds like we are very close here :-)

Sounds great! I care strongly about backing up everything, in case individuals stop maintaining what they published. IMO, what npm does is exactly right: back up published versions, AND link to the github repo. Data is obviously much more complicated, given licensing, storage, and bandwidth concerns. I came up with a solution-- more on this later :).

accessibility to search engines

I don't particularly care much about this either. Search engines already do really well (and links tend to be the problem, not the format). IMO a JSON-LD format that uses either an existing vocabulary or one with good mappings will work well. @sballesteros what are your concerns here?

@rufuspollock

@jbenet thanks for the responses which are very useful.

On point A - what MUST be done to support JSON-LD, the fact that json-ld support would mean a MUST for @id and @context is a concern IMO.

This is significant conceptual complexity for most people (e.g. doesn't the id need to be a valid RDF class?). This is where I'd like to allow but not require that additional complexity.

@jbenet commented Apr 9, 2014

@rgrp

  • @context will probably be the same in all documents. It's important to have though, for non-data-package-specific JSON-LD enabled apps.
  • @id will be unique per dataset per version.

I think these could be filled in automatically by dat, data, dpm, and ldpm before submission. That way people don't have to worry about the conceptual complexity of understanding RDF.

For instance, say I have a dataset cifar that I want to publish to datadex. Its version is named 1.1-py. Based on how datadex works now, this dataset's unique id is jbenet/cifar@1.1-py (I call this a handle; this is datadex specific). dat, dpm, and ldpm also have namespace/versioning schemes that uniquely identify versions, so we can use those as an @id directly. On submission, the tool would automatically fill in:

{
  "@context": "<url to our>/data-package-context.jsonld",
  "@id": "http://datadex.io/jbenet/cifar@1.1-py",
  ... // everything else
}

@rufuspollock

@jbenet I must confess I still think this is unnecessarily burdensome addition as a requirement for all users. As I said there's no reason users or even a given group cannot add these to their datapackage.json but this adds quite a bit of "cognitive complexity" for those users who are unfamiliar with RDF and linked data.

There are very few required fields at the moment in datapackage.json and anything that goes in has to be seen as showing a very strong benefit over cost (remember each time we add stuff we make it more likely people either won't use it or won't actually produce valid datapackage.json).

Whilst I acknowledge that quite a lot (perhaps most) of datapackage.json files will be created by tools, I think some people will want to edit by hand (and want to understand the files they look at). (I'm an example of a by-hand editor ;-) ...)

@jbenet commented Apr 25, 2014

anything that goes in has to be seen as showing a very strong benefit over cost

Entirely agreed. Perhaps the benefits of ensuring every package is JSON-LD compliant aren't clear. Any program that understands JSON-LD would then be able to understand datapackage.jsonld automatically, without the need for human intervention (writing parsers, running them, telling the program how to manipulate this format, etc). This is huge - on the same, or greater, level of importance as having a version.

This video is aimed towards a very general audience, but still highlights the core principles: https://www.youtube.com/watch?v=vioCbTo3C-4

Many people have been harping on the benefits of linking data for over a decade, so I won't repeat all that here. The JSON-LD website and posts by @msporny highlight some of the more pragmatic (yay!) reasoning. I will note that it only works for the entire data web if the context is there (as the video explains). That's what enables programs that know nothing at all about this particular format to completely understand and process the file. Think of it as a link to a machine-understandable RFC spec that teaches the program how to read the rest of the data (without humans having to program that knowledge in manually).
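For a concrete (purely illustrative) sense of what that machine-readable mapping could look like: a data-package context file might map existing datapackage.json keys onto shared vocabulary URLs. The term choices below are my assumptions, not an agreed spec:

{
  "@context": {
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
    "keywords": "http://schema.org/keywords",
    "resources": "http://www.w3.org/ns/dcat#distribution"
  }
}

Any JSON-LD processor that loads this context expands plain keys like title to the same URLs every other linked-data tool understands - that's the "no hand-written parser" point above.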

I think some people will want to edit by hand (and want to understand the files they look at).

Absolutely, me too. But imagine it's your first time looking at datapackage.json. Does

{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors":[ {
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  },
  "resources": [
    {
    }
  ]
}

Look much better than

{
  "@context": "http://okfn.org/datapackage-context.jsonld",
  "@id": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors":[ {
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  },
  "resources": [
    {
    }
  ]
}

?

I would imagine thinking things like:

  • Why are there two versions?
  • What is a datapackage_version ?
  • What is the -beta part?
  • Why do licenses have an object describing them? what other values could go there?
  • Same for contributors, maintainers, publishers.
  • What is the difference between contributors, maintainers, publishers anyway?
  • What bank of words count as keywords? anything?
  • What are the differences between name, title, and description? What will go where?

The latter adds:

  • What is that funky @context thing?

IMO, these would involve looking up the spec and understanding how the format works. I care a lot about readability (I originally picked YAML for datadex), but I claim readability for new users is not significantly affected here. :)

@msporny commented Apr 25, 2014

@rgrp wrote:

On point A - what MUST be done to support JSON-LD, the fact that json-ld support would mean a MUST for @id and @context is a concern IMO. This is significant conceptual complexity for most people (e.g. doesn't the id need to be to a valid RDF class?). This is where I'd like to allow but not require that additional complexity.

@id is not required for a valid JSON-LD document. Also note that you can alias "@id" to something less strange looking, like "id" or "url", for instance. The ID doesn't need to be a valid RDF class. The only thing that's truly required to transform a JSON document to a JSON-LD document is one line - @context. None of your users need to be burdened w/ RDF or Linked Data concepts unless they want to be. Just my $0.02. :)
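To make the aliasing point concrete - a one-line context mapping (the URL here is made up):

{
  "@context": { "id": "@id" },
  "id": "http://example.com/my-dataset"
}

A JSON-LD processor treats the plain id key exactly as it would @id, so authors never see the @-prefixed keyword.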

@junosuarez

Weighing in briefly after being directed to this thread by @maxogden. I am not currently developing any tools, but rather looking for forward-thinking best practices around metadata for datasets I'm working with a city to help publish, so in that sense I am your end user.

From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data - often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward.

Pertinent to this thread: given what @msporny said about being able to alias @id, and given the cognitive overhead of datapackage_version which @jbenet mentioned, would it be possible to use @context to also indicate the version number of the datapackage spec? eg:

{
  "@context": "http://okfn.org/datapackage-context.jsonld#0.1.1",
  "id": "http://dathub.org/my-dataset",
  "title": "my dataset",
  "version": "1b76fa0893628af6c72d7fa7a6c10f8e7101c31c"
}

In my example, I'm also using a hash as a datapackage version rather than a semver, since for frequently changing, properly versioned data, it is unfeasible to have a manually incremented version number.

@jbenet commented May 1, 2014

From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data - often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward.

👍 Thank you. I will quote this in the future. :)

would it be possible to use @context to also indicate the version number of the datapackage spec?

Yeah, absolutely. It's any URL, so you can embed a version number in the url and thus identify a different @context. And, great point - we should establish a culture of doing that. I don't think we should require it, as I can imagine cases where it would be more problematic than ideal (not to mention how hard and annoying it would be to impose a version scheme on others).

"http://okfn.org/datapackage-context.jsonld#0.1.1"

I believe JSON-LD can extract the right @context from a #<fragment>, though not 100% sure. @msporny will know better. If not, embed it in the path:

"http://okfn.org/datapackage-context@<version>.jsonld"
"http://okfn.org/datapackage-context.<version>.jsonld"
"http://okfn.org/datapackage-context/<version>.jsonld"
"http://okfn.org/<version>/datapackage-context.jsonld"

I'm also using a hash as a datapackage version rather than a semver, since for frequently changing, properly versioned data, it is unfeasible to have a manually incremented version number.

👍 hash versions ftw. What are you building? And I encourage you to allow tagging of versions. The rightest thing I've seen is to have hashes (content addressing) identify versions, and to allow human-readable tags/symlinks (yay git).

@jbenet commented May 1, 2014

@rgrp thoughts on all this? Can we move fwd with @context or do you still think it is inflexible? Happy to discuss it more/make the case through a higher throughput medium (+hangout?).

@sballesteros if we have @context here, does that satisfy your constraints? (Given that your own @context could alias the names you're currently using in package.jsonld to those used in the standard @context.)

@rufuspollock

@jden great input.

@jden @jbenet re datapackage_version: I actually think we should deprecate this a bit further (it's not strictly required, but frankly I think we should remove it completely). People rarely add it and I'm doubtful it would be reliably maintained, in which case its value to consumers rapidly falls towards zero. (I was sort of doubtful when it was first added, but there were strong arguments in favour by others at the time.)

Re the general version field, I note that semver allows using hashes as a sort of extra, e.g. 1.0.0-beta+5114f85...

However, I do wonder about using the version field at all if you are using full version control for the data - I imagined the version field being more like the version field for software packages, where its increment means something substantial (but where you can get individual revisions if you want from the version control system - cf the node.js package.json, where dependencies can refer either to package versions or to specific revisions of git repos).

@jbenet commented May 1, 2014

re datapackage_version I actually think we should deprecate this a bit further (its not strictly required but I think, frankyl, we remove completely).

I agree - let's remove datapackage_version and put a version number in the url of our @context. That's the LD way of doing it. FWIW, having a version is a good thing, and this way (in the url) we get seamless enforcement without the additional complexity of an optional field. And it's pretty good: the version isn't just a number to look up somewhere; it points directly to a different file. :)

However, I do wonder about using version field at all

Having versions in package managers/registries is really useful. Let's not remove this. The package manager websites want to show meaningful descriptions of important version changes (semver). Users can understand the difference between 1.0.3 and 2.6.4 (one's newer) and conclude, usually correctly, which one is better. Git, which is full version control, makes extensive use of tags/branches (which are both just named version pointers). Hence I recommended to @jden to allow tagging :).

semver

By the way -- I'm not sure if you came up with something similar, but I put a tiny bit of thought into making a data-semver jbenet/data#13 which might be useful. Clearly expressing what constitutes a MAJOR.MINOR.PATCH version change in the data context will help avoid confusion for people working with data who don't understand the subtleties of code semver.

@rgrp can we go fwd with @context? (Lmk if you need more time for consideration-- just wanting to get closer to done with this as we'll be using data packages soon and would like to have things resolved before that happens :) ).

@rufuspollock

On the @context question: let me reiterate that I'm a strong +1 on allowing this in (even encouraging it) but I am still concerned about making it a MUST. Strictly we only have one required field at the moment (name).

Also do be aware that @context on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF, you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources). Again, I think this is great if you can do it, since you get much richer info - and data package has been designed so you can do this progressive enhancement really easily (just add the @type to your resource schema) - but I don't think it should be required for everyone.

@rufuspollock

@jbenet to be clear, I wasn't suggesting removing version - I was saying I wasn't sure about using it for the sha hash of a changeset (since, as @jden mentions, that changes so much). I think version is super-useful and isn't going anywhere. As you say, the primary use for version (IMO) is more like the tags one has in git.

Also note I wrote my previous comment before I'd read your response. My suggested approach at present is that we add @context to the spec with a clear description of use, but that we don't make it a MUST. If you were able to draft a description for the @context field to include, that would be great, and we could then review and slot it into the appropriate place.

@msporny commented May 1, 2014

I believe JSON-LD can extract the right @context from a #<fragment>, though not 100% sure. @msporny will know better. If not, embed it in the path:

No, JSON-LD will not extract the "right" context from a #fragment :). We considered that option and felt that it adds unnecessary complexity (when a simpler alternative would solve the same problem). Just do this if you want to version the context:

"@context": "http://okfn.org/datapackage-context/v1.jsonld"

You are probably going to want to use a URL redirecting service so that your developers don't see breaking changes if okfn.org ever goes away. For example, use https://w3id.org/ and make this your context URL:

https://w3id.org/datapackage/v1

This does three things:

  1. It decouples the hosting location of the actual file from the identifier that people type out to use the context. So, if you decide to change the hosting location from okfn.org to some other hosting provider, none of the programs using the context are affected.
  2. It gives people an easy-to-remember URL for the JSON-LD Context.
  3. It provides a hint to clients as to the version of the vocabulary you're using.

I can add it to the site in less than a minute if you want (or you can submit a pull request). w3id.org is backed by multiple companies and is designed to be around for 50+ years. You can learn more about it by going here: https://w3id.org/

(edit: fixed UTF-8 BOM - no idea how that got in there)

@paulfitz commented May 1, 2014

Somehow the w3id.org homepage link at the end of #110 (comment) is broken for me due to a UTF-8 BOM that's crept in? Source code shows it as https://w3id.org/%EF%BB%BF. Strange. https://w3id.org/ works.

@junosuarez

@jbenet @rgrp Here's an example dataset I'm building: https://github.com/jden/data-bike-chattanooga-docks

Some thoughts from the experience (albeit tangential to this thread):

  1. all of this metadata was created by hand, without tooling. What I've filled out is about as far as I got before moving on. I am still interested in creating schemas to describe the csv and geojson representations of the data, but haven't looked into it yet.

  2. it seems silly having both package.json and package.jsonld at the root level of the directory. I'm torn between whether I'd prefer npm to be package.jsonld aware and just parse package.jsonld or to have some other place to put package.json. From a package user experience point of view, I really want someone to be able to rebuild my data from git cloning, npm installing, npm starting.

  3. How would I indicate that this package has two representations of the same resource (data.csv and data.geojson), as opposed to for example two separate-but-related tables? From a REST background, my inclination would be to do something like

"resources": [
  {
    "name": "data",
    "mediatype": "text/csv",
    "path": "data.csv"
  },
  {
    "name": "data",
    "mediatype": "application/json",
    "path": "data.geojson"
  }
]

  4. it wasn't clear to me how to specify an appropriate license for the code part in the scripts/ directory vs the package as a whole. I ended up describing it in plaintext in the readme, putting the code license (ISC) in the package.json, and the datapackage license (PDDL) in the package.jsonld.

@rufuspollock

@jden

  1. I agree - I've largely hand-created these, though I've started using dpm init to bootstrap the file, as it auto-extracts table info

  2. we debated having package.json (for node) and datapackage.json for datapackage stuff and ultimately decided you do want them different - see [DP] Rename datapackage.json to package.json in Data Package #73 for a bit more on this

  3. I think your suggestion seems good and how we've done datapackage.json resource entries so far

  4. I think that is correct and how I'd go with things :-)

@jbenet commented May 12, 2014

@jden

Here's an example dataset I'm building

Cool!

I am still interested in creating schemas to describe the csv and geojson representations of the data, but haven't looked into it yet.

If you give me a couple of weeks, Transformer (repo) will help you do this really easily.

  2. it seems silly having both package.json and package.jsonld at the root level of the directory.

This will be the case as long as different registries do not agree on their structures. If you published this to rubygems too you'd also have a .gemspec.

I'm torn between whether I'd prefer npm to be package.jsonld aware and just parse package.jsonld

We could open up a discussion about getting to this. Frankly, now that JSON-LD exists, there's no reason we can't have the same package.jsonld spec for every package manager out there, and use different @contexts (thank you, @msporny!!). Developers are free to adhere to whatever @context they want (and thus whatever key/val structure), and yet would be compatible across the board, if mappings between the properties of both contexts exist. ❤️ data integration!

But, don't expect this to happen for years. :)

Actually... we might even be able to be package.json compatible... we'd need a special @context (which npm will just ignore) that gives us the mappings from the npm keys to our package.jsonld keys. Hm! (thank you @msporny !!!! in one @context swoop you fixed so much).
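A rough sketch of that idea (the property mappings are illustrative assumptions, not an agreed vocabulary): add an @context mapping npm's existing keys to shared URLs, and the file stays a normal package.json for npm while becoming JSON-LD for everything else.

{
  "@context": {
    "name": "http://schema.org/name",
    "description": "http://schema.org/description",
    "version": "http://schema.org/version"
  },
  "name": "my-package",
  "version": "1.0.0",
  "description": "npm ignores the unknown @context key and reads the rest as usual"
}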

  3. How would I indicate that this package has two representations of the same resource

In my biased world view, I'd include only one and use transformer to generate the second representation with a makefile. (or have both, but still have the transform) Something like:

data.geojson: data.csv
    cat data.csv | transform csv my-data-schema geojson > data.geojson

Note: this doesn't work yet. It will soon :)

On indicating this, your same name with different mediatype (or something like it) sgtm. And I would definitely put it in the Readme too.

  4. it wasn't clear to me how to specify an appropriate license for the code part in the scripts/ directory vs the package as a whole. I ended up describing it in plaintext in the readme, putting the code license (ISC) in the package.json, and the datapackage license (PDDL) in the package.jsonld.

I think this (ISC in package.json#license and PDDL in package.jsonld#license) is precisely the right thing to do.

For code, it's common to add a LICENSE file in packages. We could establish a convention of putting the various licenses into the same file, or perhaps have two: LICENSE as usual for code, and DATA-LICENSE. (Personally, I think it's ugly to have these files and I never include them, because I think the license in package.json / Readme is enough for most modules I write. That said, if something gets really popular, and lots of people start using it to make products, it becomes more important to be legally safe than to have a clean directory. :) )


@rgrp

  2. we debated having package.json (for node) and datapackage.json for datapackage stuff and ultimately decided you do want them different

does this change in light of the comments I made above, re being directly package.json compatible? I haven't given it full thought. I agree with your comments here, and also think it's fine to have both package.json and datapackage.jsonld.

Actually, the .jsonld extension-- though nice-- may add much more trouble than it's worth yet, given that many tools know to parse .json as JSON and don't understand .jsonld (node's require, for example). Thoughts, @sballesteros @msporny? Is there a strong reason (other than indicating to humans that this is JSON-LD) to use .jsonld over .json ? JSON-LD was designed as an upgrade to (and fully compatible with) json, so maybe we should just use json? @rgrp, you probably prefer this, no?


@rgrp

Also do be aware that @context on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources).

Not necessarily? As I understand, we can remap type -> @type in the context file, and try to use the types that are currently there for richer stuff. We get this for free without having to change anything, thanks to how JSON-LD works. Though, not sure whether people's use of type is well defined or that useful.
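(A minimal sketch of that remapping - the same keyword-aliasing trick @msporny described for @id:)

{
  "@context": { "type": "@type" }
}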

I want to get this finished soon, so let's settle our thought on @context, so we can draft changes and move fwd. :D

I'm a strong +1 on allowing this in (even encouraging it) but I am still concerned about making it a MUST. Strictly we only have one required field at the moment (name).

I think it's really important to make the move to JSON-LD. And this IMO makes it a MUST (I actually think it's more important than name itself; I could go into why name itself could become non-MUST, but that's probably not helpful here :) ).

I will definitely require it in any registries I write for data packages. Again, tools like dpm init help. AND, you can have dpm publish (or the registries themselves) insert it automatically on publishing. That way, users never have to worry about it at all! It just happens for them.

I'm happy to help in upgrading all existing packages (scripts to upgrade, plus crawling CKAN and adding @context to everything). Would we need to email people to ask for permission? I'm not clear on the ToS of CKAN.

If you're set on not making it a MUST, then I propose we keep data-packages.json as is. I can set up a fork of the Data Packages site with a Linked Data Packages spec that tracks the original to a T, except for the MUST on @context. Apologies, I don't mean to be uncompromising. This simply is a very important step forward to take, not just for data packages, but for the entire web itself.

If you were able to draft a description for the @context field to include that would be great and we could then review and slot it in the appropriate place.

How's this as a first draft?

  • @context (required) - URL of the Data Packages JSON-LD context definition. This MUST be a valid URL to a Data Packages compatible JSON-LD context. The @context SHOULD just be the URL of the most recent Data Packages context file: https://w3id.org/datapackage/v1.

    The @context MAY be the URL to another context file, but it MUST be compatible with the Data Packages context: there MUST be relation between the properties outlined in Data Packages and the equivalent properties outlined in your own context.

The last line is super awkward. Does it even make sense?


@rgrp @msporny please correct any nonsense I might have spewed! :)

@pvgenuchten

hi @rgrp @jbenet what is the current status of this "MAY" item in the next spec? I also wonder if you guys are lining up with the DCAT & PROV initiatives from the W3C. DCAT and PROV address a similar use case: introduce a spec for metadata to describe datasets in RDF, which can be encoded as json-ld easily.

@jbenet commented Apr 1, 2015

@pvgenuchten not sure-- @rgrp ?

@rufuspollock

@jbenet @pvgenuchten no-one commented on the proposal so nothing happened :-) Generally I want to see a fair number of comments on a proposal to put a change in.

@rufuspollock

@pvgenuchten @jbenet if someone could give me some sample language or submit a PR this can go in.

@jbenet commented Apr 17, 2015

@rgrp sample language for the datapackage.json?

@rufuspollock

@jbenet yes plus language for the actual spec proposal

@jbenet commented Apr 19, 2015

@rgrp can you give me a precise example of what you want, say for another field, like version?

@rufuspollock

@jbenet I'd be looking for relevant language to add to the spec specifying what property (or properties) to add, e.g. @context, and how they should be used.

@jbenet commented Apr 19, 2015

@rgrp the directions are too vague. Do you want a patch to http://dataprotocols.org/data-packages/ ?

how much of a connection to @context are you willing to have? Proper JSON-LD makes this required. IIRC you disagree with forcing all data-packages to be linked data. So-- do you want it under the SHOULD part of the required fields? (I still think the point is lost, even if a little better.)

Also, as mentioned before, @context sort of obviates specs-- or rather, makes them machine-readable. So one way to go about this is to make the @context point to a JSON-LD context file with the data-package spec, allowing users to point somewhere else if they're using a different spec. But you probably don't want that-- you probably want them to still strictly follow the data-packages spec (otherwise non-LD parsers would break)-- so maybe make it so any other context url needs to be derived from yours (have every field covered)?

It's also easy to treat all data-packages without an @context as if they had one default @context url-- namely yours.

Note this also needs a proper JSON-LD context file representing the machine-readable version of this spec. Hmmm, I don't have enough time to take this whole thing on right now-- do you have anyone else on the team who cares about linked data to work with me on this?

@rufuspollock

@jbenet I think it would come under the "MAY" style fields. I'm not sure I understand enough here to get the complexity. No problem if you don't have time for this right now and we can wait to see if someone else volunteers to get this in.

@hubgit commented May 1, 2015

I note that the link to package.jsonld in the issue description now leads to a 404 page - is package.jsonld still a thing?

@jbenet Would you be able to write out the data contained in a datapackage.json file as Turtle, so we can see what the URLs for each property would be?

@sballesteros

Yep still a thing. We haven't had time to give it a new home yet.
We have been merging it with the work done by the CSV on the web working group: see http://www.w3.org/standards/techs/csv#w3c_all.
More soon.

@hubgit commented May 1, 2015

We have been merging it with the work done by the CSV on the web working group

In that case, this issue should probably be closed in favour of a new "Use the W3C Metadata Vocabulary for Tabular Data" issue.

@rgrp - Is the plan to transition to the CSV-WG's JSON data package description, when it becomes a Recommendation?

@rufuspollock

@hubgit no, no intention to transition to that spec as it isn't Data Package. Whilst it was directly inspired by Data Package and Tabular Data Package (and I'm an editor), I think it has currently diverged a lot.

So, still useful to get JSON-LD compatibility in here and this issue should stay open.

@hubgit commented May 1, 2015

Ok, in that case we just need the mapping from Data Package property names to URLs, and a stable place to host a JSON-LD context file.

@rufuspollock

@hubgit great - would you be up for having a stab? Also is there anything we need to add to datapackage.json itself?

@hubgit commented May 1, 2015

I'll have a look, yes. I'm not sure what the best URL for each property would be, though: maybe something like http://dataprotocols.org/data-packages/#name?


All that needs to be added to datapackage.json is this:

"@context": "URL of the context.json file"

@msporny commented May 1, 2015

Hey @hubgit, @jbenet just a quick note on best practices wrt. JSON-LD context files:

  • Use w3id.org for the URLs, because dataprotocols.org may not be around forever and you don't want to break your apps if it goes away. w3id.org has relationships set up to ensure that those URLs will resolve for 50+ years (or as long as the Web is around) - https://w3id.org/
  • Use HTTPS for the context file and data URLs - you can do nasty things to apps by man-in-the-middle-ing a JSON-LD context. For example, you can switch 'source' and 'destination' in a financial transaction by MiTM'ing a payments.jsonld context file and mapping 'source' to 'http://w3id.org/payments#destination' and vice versa. I don't know if you have this sort of potential vulnerability, but it's good to keep this attack vector in mind; really, just using HTTPS mitigates as much of the risk as you can (short of only using locally cached JSON-LD contexts or digitally signing JSON-LD context documents).
  • Use major versions, so the context file might be: https://w3id.org/data-packages/v1
  • Don't use minor versions and point releases - vocabulary and context files should provide stability between major versions. You can always add things to context files. You should never remove or change the mapping of a term to a URL (because it'll break existing apps).
  • Design your vocabulary documents for a 5-10 year lifecycle between major versions. That's the sort of stability you should be shooting for.
  • Put all your vocabulary terms in a single document (if you choose to publish a vocabulary document). For example: http://dataprotocols.org/data-packages#name (this is good)... http://dataprotocols.org/data-packages/#name (this is not as good, note the trailing slash)

That's just the stuff I can think of off the top of my head. I'd be happy to look at the JSON-LD context, URL mappings, and other stuff as you make more progress.

@hubgit commented May 1, 2015

Thanks @msporny, that's helpful.

It looks like Chrome's not happy about w3id.org's HTTPS encryption?
[screenshot: Chrome showing a certificate warning for w3id.org]

@msporny commented May 1, 2015

Thanks @hubgit, looks like the new versions of Chrome mark certs that use RSA w/ SHA1 as invalid - we'll get a new cert that uses SHA256 from our CA... there may also be a problem w/ the fact that our CA doesn't publish their public audit records. Working to fix it now.

@msporny commented May 1, 2015

@hubgit fixed - w3id.org now uses RSA w/ SHA256, which'll get rid of the warning in the newer versions of Chrome.

@rufuspollock

INVALID / DUPLICATE in favour of #218.

This issue has moved quite far from the original discussion and is quite lengthy. I'm therefore closing it in favour of a new, specific issue on providing a JSON-LD context file for Data Package and Tabular Data Package.
