Restrict resource location fields #223

Closed
amercader opened this Issue Oct 27, 2015 · 19 comments


@amercader
Member

(To be labelled spec-datapackage)

Apologies if this has been discussed before.

Can you elaborate on, or give a use case for, why resources are allowed to have url, path and data at the same time? I can see how path (plus base) and url might point to the same thing, but all other combinations seem to suggest different versions of the actual data. Regardless, you want to make sure that different consumers are getting and working with the same data, so allowing ambiguity around its location doesn't seem right to me.

Why not require one and only one of them?

@rufuspollock
Contributor

@amercader only one is required. However, users can specify more than one without making the Data Package "invalid".

I think the point you raise is that only one should be specified. I think there was originally a logic to this in that, for example, you might download a data package that had a url specified and then cache the data locally and set the path. However, I think this is both unlikely and confusing - in such cases delete the url and specify it in some other way.

Thus, thinking about it now I think we should probably state that one and only one of these should be present ;-)

@amercader
Member

That's exactly what I was suggesting: that publishers need to pick one of the three. It makes life much easier for Data Package consumers and avoids ambiguity.

@rufuspollock
Contributor

@amercader do you want to have a stab at a pull request to make this change - should be pretty simple language change. Do remember to note it in the changelog at the top of the file as it is strictly a "breaking change" and please put [!] at the start of the git commit message as it allows me to track breaking changes ;-)

@amercader
Member

sure, will do

@amercader
Member

@davidread do you have any views on this? After checking your implementation of a Data Package on DGU I noticed that you use both url and path on the datapackage.json file, url pointing to the online file and path to the local zipped one. That actually makes a lot of sense when you are downloading a zip file with datapackage.json plus data files. I'm now less eager to enforce only one of them, although we can provide some more guidance on the spec.

@davidread

@morty, you coded the DGU one, do you want to comment?

@morty
Contributor
morty commented Oct 30, 2015

OK, I included url and path in our datapackage.json file because of the way these are used in tools like dpm, which download the data from the url and put it into a file defined by the path. This allowed us to specify how files with duplicate names would be stored when downloaded. There is a link to just the datapackage.json for each dataset, but it's tucked away in a menu, e.g.:

https://test.data.gov.uk/dataset/driving-licence-data/datapackage.json

Likewise, if a datapackage.json had inline data in the data field, where would a tool processing that file put the data on the file system so that it could be accessed (by non-datapackage-aware tools) if there is no path specified? Maybe you could use the name, but that's optional and wouldn't result in the correct file ending, so it would cause problems on Windows, for example.

The spec also seems to suggest you could define a base URL for the whole datapackage and then path would give the URL of the resource relative to that base URL.
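
To make the base-plus-path reading concrete, here is a minimal sketch in Python. It assumes the package-level base is a plain URL string and that path is relative to it; the function name `resolve_path` and the example.org URL are made up for illustration and are not taken from the spec.

```python
from urllib.parse import urljoin


def resolve_path(base: str, path: str) -> str:
    """Join a package-level base URL with a resource's relative path."""
    # urljoin replaces the last segment of the base unless it ends with "/",
    # so normalise the base first.
    if not base.endswith("/"):
        base += "/"
    return urljoin(base, path)


# Hypothetical values, for illustration only.
print(resolve_path("https://example.org/my-package", "data/licences.csv"))
# prints https://example.org/my-package/data/licences.csv
```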

The three seem to change use depending on the context.

There seems to be a bit of confusion about what a datapackage is. Rufus was surprised that we had zipped up resources along with the datapackage.json and called that a datapackage. Until there is more tooling around datapackage to surface more concrete use cases, it's hard to nail down.

@amercader
Member

thanks @morty that was really useful. I'm now convinced that given the flexibility (or ambiguity) of the spec it doesn't make sense to enforce only one of the fields. I'll close this now.

As for the "what exactly is a Data Package" question, I agree 100%. All devs I've met so far (myself included) initially thought about the Data Package as a zipped datapackage.json plus data files. I think the spec doesn't do a great job of clarifying what exactly the Data Package is, but that should be a separate issue.

@amercader amercader closed this Oct 30, 2015
@rufuspollock
Contributor

@amercader I'd actually like to reopen, at least to clarify the order of processing. Based on @morty's comment and my own experience, I think the processing order needs to be corrected to data, url, then path (?). I'm also still concerned that your experience is quite common and that having multiple options will lead to confusion (even in my suggested change there are some issues: e.g. when downloading from DGU you may want to process url, not path - path won't work, in fact - but in other circumstances, e.g. when you have the data package on your local machine, you want to use path, not url ...)
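
A minimal sketch of what a fixed processing order could look like for a consumer, assuming a resource is a plain dict parsed from datapackage.json. The order (data, url, then path) is the tentative one proposed in the comment above; the names `PROCESSING_ORDER` and `pick_location` are hypothetical.

```python
from typing import Any, Optional, Tuple

# Tentative order from the comment above: inline data, then url, then path.
PROCESSING_ORDER = ("data", "url", "path")


def pick_location(resource: dict) -> Optional[Tuple[str, Any]]:
    """Return the first location field present on the resource, with its value."""
    for field in PROCESSING_ORDER:
        if resource.get(field) is not None:
            return field, resource[field]
    return None  # no location field at all


resource = {"url": "https://example.org/data.csv", "path": "data.csv"}
print(pick_location(resource))
```

Even with a fixed order, the DGU case shows the weakness: the "right" field depends on whether the consumer is online or working from a local copy.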

@rufuspollock rufuspollock reopened this Nov 3, 2015
@rufuspollock
Contributor

OK, I've come to a definite conclusion that one and only one of these should be present on a data package resource at a time -- this issue already demonstrates how allowing multiple values results in confusion.
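
For tool authors, the "one and only one" rule could be checked roughly as follows. This is a sketch only, assuming a resource is a dict parsed from datapackage.json; `LOCATION_FIELDS` and `check_single_location` are illustrative names, not part of any existing library.

```python
LOCATION_FIELDS = ("url", "path", "data")


def check_single_location(resource: dict) -> str:
    """Return the single location field in use, or raise if there is not exactly one."""
    present = [field for field in LOCATION_FIELDS if field in resource]
    if len(present) != 1:
        raise ValueError(
            "resource must have exactly one of url, path, data; found: "
            + (", ".join(present) or "none")
        )
    return present[0]


print(check_single_location({"name": "licences", "path": "data/licences.csv"}))
# prints path
```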

@morty your points are really good and similar to why I originally allowed this (as you note re dpm). However, I now think it leads to too much confusion, as it isn't really clear how things should work.

If we want to support things like specifying where a resource with a url should be stored locally, I think we probably want a separate field rather than (ab)using path, or else we just define a standard algorithm.

@rufuspollock rufuspollock added a commit that closed this issue Jan 31, 2016
@rufuspollock rufuspollock [!,resource][s]: only one of url, path, or data can be present on a resource - fixes #223.

Allowing more than one was confusing with unclear processing semantics.
a639b24
@rufuspollock
Contributor

FIXED.

@pwalsh
Member
pwalsh commented Feb 1, 2016

I'm not too excited by this. I think there is a clear use case for allowing url, path and data, as long as we get more explicit on the resolution order across those.

Use Case:

OpenSpending Next stores all data in Data Package format. When a user adds a Data Package, that user can of course include resources from publicly-accessible URLs. However, in order to ensure data is always available, we would like to also store a local copy of that resource. In the Data Package, that means there would be, for each resource, the url field for the original data source the user added, and path would be the local version of the same. If the URL ever went offline, we'd still use the local file accessible via path.
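
A rough sketch of the fallback described here, assuming a resource dict that carries both url and path. The function name `read_resource` is hypothetical, and this is not how OpenSpending actually implements it.

```python
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen


def read_resource(resource: dict, timeout: float = 10.0) -> bytes:
    """Try the original url first; fall back to the local copy at path."""
    url = resource.get("url")
    if url:
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read()
        except (URLError, OSError, ValueError):
            # URL is dead, unreachable, or malformed: use the local copy.
            pass
    return Path(resource["path"]).read_bytes()
```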

I think this use case is critical considering the distributed nature of data packages and the fact that URLs do die. The pattern I describe is so important to availability of data in data registries that support Data Package that I think we should not only revert this change, but make such a pattern much more explicit :).

Edit: I reopened the issue so we can address this concern.

@pwalsh pwalsh reopened this Feb 1, 2016
@danfowler
Contributor

@pwalsh do you know of any existing examples for this pattern?

@pwalsh
Member
pwalsh commented Feb 1, 2016

@danfowler I just gave a very clear use case, that we really should implement in OpenSpending in order to ensure we always have source data available to rebuild databases that are derived from the datastore. As OpenSpending is already the biggest potential deployment of Data Package to date, is that not a good example?

@pwalsh
Member
pwalsh commented Feb 1, 2016

@danfowler @rgrp I am definitely happy enough if you still decide this is irreversible, but in such a case, for reasons listed above, OpenSpending (and even other data stores like the coming reboot of the Data Package Registry) would likely still implement a local path for a URL to ensure availability of data. In such cases we could store this path on a field called cache or something to make the reasoning clearer and to ensure we are not going against the spec.

@rufuspollock
Contributor

@pwalsh very sensible questions. I do have thoughts:

  • "source" vs "path to the data": you can use the sources property primarily defined on data package on resources. This sources property can be used to point to other "source" version of the data whilst url / path / data point to the actual primary version stored with this data package
  • "archived / backup" copies: I think this is a definite use case. However, my sense is that we should make this explicit rather than have complex semantics for url / path / data. e.g. we could have a property called archives or similar. If we want this we can open a specific issue for this.

For OS specifically I am increasingly of the view that:

  • You should copy the data the user gives you locally. If it is a backup (so url is in the resource), then put it in the archive/ directory and reference it appropriately
  • Often you will end up with local data, as you will need to tweak or transform it. In that case, just use sources for the original data (though even here I would copy a version locally into archives/)
@pwalsh
Member
pwalsh commented Feb 4, 2016
  • TBH I never thought of sources like that, and while it def. works, it feels like too much information for this use case (sources has the same footprint as resources, essentially).
  • For the use case I outline, I agree, let's not overload path, url, data with this (so: let's close this ticket again), but also let's not use archives either, as we already have a convention for archive in DP.
  • Suggestion: A resource can have a cache property which is always a local path to a file representation of path, url, data. If we want to consider that as part of spec, let's open an issue for it. If not, we can first just implement this pattern in OS and see how it develops.
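
A sketch of how the suggested cache property might behave, assuming cache is a local file path kept alongside exactly one primary location field. Nothing here is in the spec; `load_bytes` and the `fetch_primary` callback are hypothetical names, and the callback stands in for whatever code reads path, url or data.

```python
from pathlib import Path
from typing import Callable


def load_bytes(resource: dict, fetch_primary: Callable[[dict], bytes]) -> bytes:
    """Read from the primary location, refreshing cache; fall back to cache on failure."""
    cache = resource.get("cache")
    try:
        content = fetch_primary(resource)
    except OSError:
        # Primary location unavailable: serve the cached copy if there is one.
        if cache and Path(cache).is_file():
            return Path(cache).read_bytes()
        raise
    if cache:
        # Keep the local copy in step with the primary source.
        Path(cache).parent.mkdir(parents=True, exist_ok=True)
        Path(cache).write_bytes(content)
    return content
```

Keeping cache separate means url, path and data retain a single meaning, while the availability concern is handled by a field whose purpose is explicit.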
@rufuspollock
Contributor
  • sources is more limited than a resource in that it only has 3 or 4 properties, I think
  • we have no convention for archive yet in DP afaik. We have a convention for the archives/ directory but no property in the datapackage.json.
  • I like the cache idea and we should open an issue for that, I think
@rufuspollock
Contributor

FIXED. Re-closing as I think this is resolved for now.

@rufuspollock rufuspollock added a commit to rufuspollock/fd-specs that referenced this issue Nov 28, 2016
@rufuspollock rufuspollock [dp,!][m]: merge resource property url into path - fixes #250.
Major, breaking change. Major justification is simplification.

In addition to basic change have also addressed a security concern by
introducing limitations on path (no / or ../).

Simplicity.

Logic for this spelled out in detail in the github issue thread. Summary:

At the moment we have path and url. I originally had this to make it super easy
for tool implementors (no lists of web protocols to match against: `http://`,
`https://`, `ftp://`, etc).

At the same time it adds cognitive complexity to the spec and for publishers
and confusion about whether one could use both e.g. #223 #232.

Whilst change increases demand on consumers to parse out urls from simple paths
this is relatively straightforward and consumer could not rely on url vs path
being used correctly anyway
2aab215
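
As a rough illustration of what the merged path asks of consumers, the sketch below separates URLs from local paths and applies the safety limits mentioned in the commit message (no leading / and no ../). The scheme list and the name `classify_path` are assumptions for illustration; the spec text itself is the authority on the exact rules.

```python
from urllib.parse import urlparse

# Assumed list of web protocols; the commit message names http, https and ftp.
URL_SCHEMES = ("http", "https", "ftp")


def classify_path(path: str) -> str:
    """Return "url" or "local"; raise ValueError for unsafe local paths."""
    if urlparse(path).scheme.lower() in URL_SCHEMES:
        return "url"
    # Local paths must stay inside the package directory.
    if path.startswith("/") or ".." in path.split("/"):
        raise ValueError("unsafe local path: " + path)
    return "local"


print(classify_path("https://example.org/data.csv"))  # prints url
print(classify_path("data/file.csv"))  # prints local
```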