Merging path and url for resources #250

Closed
rufuspollock opened this Issue Mar 7, 2016 · 37 comments

Projects

None yet

9 participants

@rufuspollock
Contributor
rufuspollock commented Mar 7, 2016 edited

At the moment we have path and url. I originally had this to make it super easy for tool implementors (no lists of web protocols to match against `http://, https://, ftp://, etc).

At the same time it adds cognitive complexity to the spec and for publishers and confusion about whether one could use both #223 #232

The approach would be probably going forward just to have path and define like:

the location of the resource specified either as a fully qualified url or a unix-style path.
Data Packages may be found online or locally on disk and path should be interpreted
relative to that context (on disk it is a file path, online it is a url path). 

If a relative path is used it is interpreted relative to the base directory for the data
package (the parent directory of the datapackage.json).

Examples:

http://ex.datapackages.org/big-csv/my-big.csv
/examples/my-csv.csv  # absolute path
examples/my-csv.csv   # relative path
@pwalsh
Member
pwalsh commented Mar 7, 2016

+1.

Easy enough to implement code and check if we have a protocol scheme or just a plain old path to file.

@rufuspollock
Contributor

Questions:

  • Do we change to a new name e.g. location - this is breaking!
  • How do implementors distinguish fully qualified urls from (relative) paths?

My sense if we do do this, approach would be:

  • Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
  • Deprecate url
  • Require implementors to do some regex'ing to distinguish urls from relative paths ...

Aside:

  • Must paths always be relative to the data package base directory to avoid security risks
@ivotron
ivotron commented Jul 9, 2016

For use cases where the data can't be fetched (e.g. too big), one would still like to retrieve a datapackage.json that has a URL pointing to where the data resides. Would this issue (merging path and URL) cover this use case?

@pwalsh
Member
pwalsh commented Jul 10, 2016

@ivotron yes, that issue is covered right now, and would not be removed by the proposal in this issue.

@pwalsh
Member
pwalsh commented Jul 12, 2016

So @rgrp I'm in agreement with your points:

  • Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
  • Deprecate url
  • Require implementors to do some regex'ing to distinguish urls from relative paths

Additionally, cache should be implemented, and follow the same rules. (worth mentioning here for the possibility that users were relying on presence of both url and path to keep local copies of remote resources.)

@roll @amercader @akariv @vitorbaptista you are all writing code around this in various places. What do you think?

@roll
Member
roll commented Jul 12, 2016 edited

tabulator has just path for anything and OPTIONAL scheme and format to explicitly describe the path kind.

@vitorbaptista
Contributor

Sorry if this was discussed before. We've agreed on removing the base path on #232, however there are some relative paths being described here. How does the path resolution would work with this proposed change? Consider both the cases when a datapackage is remote (e.g. http://somewhere.com/datapackage.json) and local (e.g. /home/vitor/data/datapackage.json)

@pwalsh
Member
pwalsh commented Jul 18, 2016

@vitorbaptista I think, always relative to the directory that contains datapackage.json for local filepaths, and, remote resources should always be fully qualified URLs.

@vitorbaptista
Contributor

So, imagine we have this datapackage.json:

{
    "name": "foo",
    "resources": [
        { "path": "data/foo.csv" }
    ],
}

If this datapackage is at http://example.org/datapackage.json, its resource would be at http://example.org/data/foo.csv. If it's in /home/vitor/foo/datapackage.json, the resource would be at /home/vitor/foo/data/foo.csv. All good. Now, consider this other datapackage:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

This resource then doesn't depend on the path for the datapackage itself. Even if the datapackage.json lives in /home/vitor/bar/datapackage.json, its resource would still live at http://example.org/bar/data/bar.csv.

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Am I missing something?

@pwalsh
Member
pwalsh commented Jul 19, 2016

@vitorbaptista those examples are correct as I understand it, but still:

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Not quite sure what you are getting it here.

@rufuspollock
Contributor

@pwalsh relative paths work for remote urls too - that's really important. In both cases you'd compute relative to the datapackage.json base directory.

@vitorbaptista your examples are correct and 👍 to @pwalsh comment.

Note: I'm really thinking explaining this stuff will go in the patterns or FAQs section on the fd.io site (we could add to the spec but it gets kind of bulky).

@vitorbaptista
Contributor

@pwalsh Consider the second example:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

Also consider that we have the files in the local filesystem:

/home/vitor/bar/datapackage.json
/home/vitor/bar/data/datapackage.json  

Then I open python and try to load this, as such:

import datapackage
dp = datapackage.DataPackage('/home/vitor/bar/datapackage.json')

Where should datapackage.DataPackage look for the datapackage's resources?

Going through its resources array, it'll only find { "path": "http://example.org/bar/data/bar.csv" }. As far as I can see, there's no (easy) way for the datapackage library to know that this resource also lives in ./data/bar.csv relative to the datapackage.json, so it'll end up downloading it from the remote URL even though we have a local copy.

@rufuspollock
Contributor

@vitorbaptista it should load the value of path which is the url.

@vitorbaptista
Contributor

@rgrp So if a datapackage.json has a resource with a full URL in its path, the resource should always be downloaded from the Internet, even if we have the data locally, right?

Sounds fine to me, just making sure this was the intended behaviour.

@rufuspollock
Contributor
@roll
Member
roll commented Aug 5, 2016

format field is already available so what about scheme field? Sometimes there could be a need to set a scheme explicitly. And implementations could be able to pass this path,format,scheme as it is to tabulator.

@roll roll referenced this issue in frictionlessdata/project Aug 6, 2016
Closed

Finalize a few critical to implementations changes to specs #267

2 of 5 tasks complete
@roll roll added backlog priority and removed backlog labels Aug 8, 2016
@roll roll modified the milestone: specs-v1 Aug 8, 2016
@rufuspollock
Contributor

OK, I think this is ready to go in. It is a breaking change so I think I will need to add an actual warning to the spec mentioning that to support data packages created under earlier version of the spec implementors should convert url to path.

Welcome final +1 / -1 as this is a significant change (please put these as explicit comments rather than emoticons on this comment!)

/cc @pwalsh @paulfitz @danfowler @vitorbaptista @floppy @jpmckinney ...

@akariv
akariv commented Aug 9, 2016

Sorry for joining this discussion so late, but I seem to be missing something. I realise that this PR is the result of very long discussions in various threads - but when trying to compare the proposed change vs. the current state, I get this:

In the current state, we have:

  • url - absolute path to resource
  • path - relative path to resource
    (only one of these attributes can appear)

In the proposed state, we will have

  • path - relative or absolute path to resource
  • cache - relative path to resource
    (both of these attributes can appear together)

So it looks like we've introduced some ambiguity to the spec, complicated the implementation (which needs to detect network schemas), and it's also a breaking change. Plus it doesn't feel to me that the original goal of reducing cognitive complexity is being achieved.

Just an idea - If we allowed path and url to co-exist, and also added a recommended resolving order (path before url) and a few examples and common use cases to the spec, we might be able to achieve the same functionality without breaking the spec or burdening implementors.

@roll
Member
roll commented Aug 9, 2016

Do we handle cache in this proposal?

@akariv
akariv commented Aug 9, 2016

Not in this specific change, but based on @pwalsh 's comment I understand
that this is the final intention.

On Tue, 9 Aug 2016 at 12:58 roll notifications@github.com wrote:

Do we handle cache in this proposal?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#250 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAQMde9s0_mS_FVYVv9eiv0Qfm-CaiH9ks5qeE8WgaJpZM4HqtgY
.

@roll roll added progress and removed priority labels Aug 9, 2016
@roll roll assigned roll and pwalsh and unassigned roll Aug 9, 2016
@pwalsh pwalsh removed their assignment Aug 9, 2016
@roll roll self-assigned this Aug 9, 2016
@roll
Member
roll commented Aug 9, 2016 edited

So may be we should to handle this both (path/url and cache related) as one change to provide a complete solution? As @akariv has suggested.

@rufuspollock
Contributor

@akariv whilst it is conceptually simpler for the "experts", for many users there is confusion about which of the two to use and when to use them (or whether both can be used at once - see the issues referenced in the description). Also people get confused whether to use path or url for stuff online, or whether they can use url offline. Having one value I hope will reduce this confusion a bit.

It is true that consumers of data packages (esp tools) may need to do a bit more work: they now need to work out when they have a full qualified url vs a path. However, for publishers it is simpler and the overall spec is simpler.

@roll roll added backlog and removed progress labels Aug 11, 2016
@roll roll removed their assignment Aug 11, 2016
@roll roll removed the backlog label Aug 29, 2016
@rufuspollock rufuspollock modified the milestone: Current Sep 27, 2016
@rufuspollock
Contributor

OK. I think we want to move this forward.

  • @akariv i see your point about path being guaranteed to be relative. The challenge is people end up using both url and path sometimes and there is some confusion. As someone actively using this in the wild your views are very important so please reiterate if this would be a real issue in your opinion.
  • cache is unrelated to this change IMO. It may happen but it is not a dependency.

AGREED: We are going to implement this.

Question is what we do for backwards compatibility. I suggest we explicitly state that clients should remap url to path if path not present. Otherwise behaviour is undefined and should raise an error.

If people are happy i will start a pull request so people can comment on the changes.

@roll
Member
roll commented Sep 28, 2016 edited

@rgrp
awesome

No plans to introduce scheme keyword in terms of path=[scheme]://path-core.[format] to have an explicit ability to force exact schemes?

@rufuspollock
Contributor

@roll what is use case for implementors? How hard is it to work out scheme.

@roll
Member
roll commented Sep 28, 2016 edited

@rgrp
Scheme is inferrable from path with no problem so I mean optional explicit schema options as we have for format.

tabulator from born support explicit schema (loading) and format (parsing) options. But it's mostly been done because of some nice symmetry in schema-format parts of path (an url analogy). But also on implementation level it's important to have ability to ensure what loader for file will be used. For example goodtables using tabulator.Stream('path', scheme='http') is sure this call is safe. Explicitly saying the source is remote. This approach something like other flavor of path-url separation approach. Solving to same problem.

But you're right for now I don't know a real use case for it on a spec level (may be @danfowler knows when publishers need to ensure path is ONLY local or remote) -> without a use case no reason to consider it.

@roll
Member
roll commented Sep 28, 2016 edited

Just to clarify what I mean - consider there is a filesystem supporting many letters drive names - path: drive://path.csv. In this case implantation just fail with UnknownScheme error but if publisher will provide path: drive://path.csv, scheme: file it will be correctly opened as a local file. I suppose thing to investigate in a future.

@jpmckinney jpmckinney added the blocker label Sep 28, 2016
@waylonflinn
waylonflinn commented Oct 9, 2016 edited

👍

Since the intent is to modify the spec with this change (and I'm working on the relevant code today), I'm going to go ahead and do this in Dataship.

@waylonflinn
waylonflinn commented Oct 9, 2016 edited

I have a question about an edge case. Does the following seem correct?

When a path doesn't have a protocol descriptor but starts with /, it MUST be treated as an absolute path referring to the root of the resource from which the datapackage is accessed.

Example

datapackage at http://www.example.com/user/foo/datapackage.json
contains:

{
  "name" : "foo"
  "resource" : [
    {
      "name" : "bar"
      "path" : "/data/bar.csv"
    }
  ]

The resource bar MUST be found at http://www.example.com/data/bar.csv

@rufuspollock
Contributor

@waylonflinn great question - and i've actually worked on a draft here and came up with exactly this case. I'm coming to the view that you do not allow absolute paths without a scheme - only relative paths (and also no ../). So basically any path is:

  • EITHER a url
  • OR: a relative path (which is interpreted relative to the directory in which datapackage.json is found

This also addresses security concerns.

@roll
Member
roll commented Oct 10, 2016 edited

HTTP, zip etc uses / as a pointer to a root of container. Is it an option also for us? If / means datapackage.json dir - no security problems at all (just ban .. or getting out of dp dir).

@waylonflinn

@rgrp Thanks for giving this some thought and coming up with a reasonable answer.

Having the absolute path (/) available definitely helps my use case. Is it possible to change this?

@rufuspollock
Contributor

@waylonflinn really useful feedback. BTW why do you want "root paths" i.e. /...?

Lesson for me: people definitely want to retain allowing / as a starter in paths. I'm ok with that.

Does anyone want to allow ../ paths? I think not though it could be a bit complex to start having these rules in the spec. Maybe more of a thing for "notes for implementors".

For security conscious implementors: they can add a switch to disallow all absolute and ../ paths.

@roll
Member
roll commented Oct 19, 2016

In my opinion a secure recommendation for implementors will be option to enable absolute/dotted pathes disabled by default

@pwalsh
Member
pwalsh commented Nov 10, 2016

@rgrp do you have a PR in the wings for this? I think the wording for the spec should be quite straight forward. I think we should also have a set of clear examples alongside that that addresses some of the above points.

Again here, I think in the various places this has been discussed, there is general agreement that it simplifies the prep of a data package for a publisher, and reduces ambiguity in multiple properties that do essentially the same thing, at the relatively small cost of implementing code needing to handle the file pointer disambiguation (fs vs http).

Any comments from the @frictionlessdata/specs-working-group before we start the PR process?

@roll roll removed the blocker label Nov 16, 2016
@rufuspollock
Contributor
rufuspollock commented Nov 17, 2016 edited

OK, here's a specific proposal for the text change:

NOTE: we will disallow absolute paths like /... and parent relative paths ../ for security reasons and simplicity of implementation.

Required Fields - Data Location

Resource information MUST contain a property describing the location of the
data associated to this resource.

Data associated to a resource can be located online or locally on disk or
inline in the datapackage.json itself.

The location of resource data MUST be specified by the presence of one
(and only one) of these two properties:

  • path: for data in files online or locally.
  • data: for data inline in the datapackage.json descriptor itself.

Data in Files - path

path the location of the data specified either as a fully qualified url or a relative
unix-style path ('/' as separator). A relative path should be interpreted
relative relative to the directory or URL in which the descriptordatapackage.json listing this resource resides.

Examples:

# fully qualified url
"path": "http://ex.datapackages.org/big-csv/my-big.csv"

# relative path
"path": "examples/my-csv.csv"

`data`: (inline) a field containing the data directly inline in the
  `datapackage.json` file. Further details below.

Inline Data - data

...

@rufuspollock rufuspollock self-assigned this Nov 17, 2016
@rufuspollock
Contributor

@waylonflinn note we are planning to disallow / at the start of paths as paths must be relative or fully qualified url. Do you have any comments on this - i note you had required / being permitted for URLs. If you could provide details on your use case that would be valuable.

@rufuspollock rufuspollock added a commit that closed this issue Nov 28, 2016
@rufuspollock rufuspollock [dp,!][m]: merge resource property url into path - fixes #250.
Major, breaking change. Major justification is simplication.

In addition to basic change have also addressed a security concern by
introducing limitations on path (no / or ../).

Simplicity.

Logic for this spelled out in detail in the github issue thread. Summary:

At the moment we have path and url. I originally had this to make it super easy
for tool implementors (no lists of web protocols to match against `http://,
https://, ftp://, etc).

At the same time it adds cognitive complexity to the spec and for publishers
and confusion about whether one could use both e.g. #223 #232.

Whilst change increases demand on consumers to parse out urls from simple paths
this is relatively straightforward and consumer could not rely on url vs path
being used correctly anyway
2aab215
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment