New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging path and url for resources #250

Closed
rufuspollock opened this Issue Mar 7, 2016 · 37 comments

Comments

Projects
None yet
9 participants
@rufuspollock
Contributor

rufuspollock commented Mar 7, 2016

At the moment we have path and url. I originally had this to make it super easy for tool implementors (no lists of web protocols to match against `http://, https://, ftp://, etc).

At the same time it adds cognitive complexity to the spec and for publishers and confusion about whether one could use both #223 #232

The approach would be probably going forward just to have path and define like:

the location of the resource specified either as a fully qualified url or a unix-style path.
Data Packages may be found online or locally on disk and path should be interpreted
relative to that context (on disk it is a file path, online it is a url path). 

If a relative path is used it is interpreted relative to the base directory for the data
package (the parent directory of the datapackage.json).

Examples:

http://ex.datapackages.org/big-csv/my-big.csv
/examples/my-csv.csv  # absolute path
examples/my-csv.csv   # relative path
@pwalsh

This comment has been minimized.

Show comment
Hide comment
@pwalsh

pwalsh Mar 7, 2016

Member

+1.

Easy enough to implement code and check if we have a protocol scheme or just a plain old path to file.

Member

pwalsh commented Mar 7, 2016

+1.

Easy enough to implement code and check if we have a protocol scheme or just a plain old path to file.

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Mar 23, 2016

Contributor

Questions:

  • Do we change to a new name e.g. location - this is breaking!
  • How do implementors distinguish fully qualified urls from (relative) paths?

My sense if we do do this, approach would be:

  • Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
  • Deprecate url
  • Require implementors to do some regex'ing to distinguish urls from relative paths ...

Aside:

  • Must paths always be relative to the data package base directory to avoid security risks
Contributor

rufuspollock commented Mar 23, 2016

Questions:

  • Do we change to a new name e.g. location - this is breaking!
  • How do implementors distinguish fully qualified urls from (relative) paths?

My sense if we do do this, approach would be:

  • Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
  • Deprecate url
  • Require implementors to do some regex'ing to distinguish urls from relative paths ...

Aside:

  • Must paths always be relative to the data package base directory to avoid security risks
@ivotron

This comment has been minimized.

Show comment
Hide comment
@ivotron

ivotron Jul 9, 2016

For use cases where the data can't be fetched (e.g. too big), one would still like to retrieve a datapackage.json that has a URL pointing to where the data resides. Would this issue (merging path and URL) cover this use case?

ivotron commented Jul 9, 2016

For use cases where the data can't be fetched (e.g. too big), one would still like to retrieve a datapackage.json that has a URL pointing to where the data resides. Would this issue (merging path and URL) cover this use case?

@pwalsh

This comment has been minimized.

Show comment
Hide comment
@pwalsh

pwalsh Jul 10, 2016

Member

@ivotron yes, that issue is covered right now, and would not be removed by the proposal in this issue.

Member

pwalsh commented Jul 10, 2016

@ivotron yes, that issue is covered right now, and would not be removed by the proposal in this issue.

@pwalsh

This comment has been minimized.

Show comment
Hide comment
@pwalsh

pwalsh Jul 12, 2016

Member

So @rgrp I'm in agreement with your points:

  • Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
  • Deprecate url
  • Require implementors to do some regex'ing to distinguish urls from relative paths

Additionally, cache should be implemented, and follow the same rules. (worth mentioning here for the possibility that users were relying on presence of both url and path to keep local copies of remote resources.)

@roll @amercader @akariv @vitorbaptista you are all writing code around this in various places. What do you think?

Member

pwalsh commented Jul 12, 2016

So @rgrp I'm in agreement with your points:

  • Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
  • Deprecate url
  • Require implementors to do some regex'ing to distinguish urls from relative paths

Additionally, cache should be implemented, and follow the same rules. (worth mentioning here for the possibility that users were relying on presence of both url and path to keep local copies of remote resources.)

@roll @amercader @akariv @vitorbaptista you are all writing code around this in various places. What do you think?

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Jul 12, 2016

Member

tabulator has just path for anything and OPTIONAL scheme and format to explicitly describe the path kind.

Member

roll commented Jul 12, 2016

tabulator has just path for anything and OPTIONAL scheme and format to explicitly describe the path kind.

@vitorbaptista

This comment has been minimized.

Show comment
Hide comment
@vitorbaptista

vitorbaptista Jul 15, 2016

Contributor

Sorry if this was discussed before. We've agreed on removing the base path on #232, however there are some relative paths being described here. How does the path resolution would work with this proposed change? Consider both the cases when a datapackage is remote (e.g. http://somewhere.com/datapackage.json) and local (e.g. /home/vitor/data/datapackage.json)

Contributor

vitorbaptista commented Jul 15, 2016

Sorry if this was discussed before. We've agreed on removing the base path on #232, however there are some relative paths being described here. How does the path resolution would work with this proposed change? Consider both the cases when a datapackage is remote (e.g. http://somewhere.com/datapackage.json) and local (e.g. /home/vitor/data/datapackage.json)

@pwalsh

This comment has been minimized.

Show comment
Hide comment
@pwalsh

pwalsh Jul 18, 2016

Member

@vitorbaptista I think, always relative to the directory that contains datapackage.json for local filepaths, and, remote resources should always be fully qualified URLs.

Member

pwalsh commented Jul 18, 2016

@vitorbaptista I think, always relative to the directory that contains datapackage.json for local filepaths, and, remote resources should always be fully qualified URLs.

@vitorbaptista

This comment has been minimized.

Show comment
Hide comment
@vitorbaptista

vitorbaptista Jul 19, 2016

Contributor

So, imagine we have this datapackage.json:

{
    "name": "foo",
    "resources": [
        { "path": "data/foo.csv" }
    ],
}

If this datapackage is at http://example.org/datapackage.json, its resource would be at http://example.org/data/foo.csv. If it's in /home/vitor/foo/datapackage.json, the resource would be at /home/vitor/foo/data/foo.csv. All good. Now, consider this other datapackage:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

This resource then doesn't depend on the path for the datapackage itself. Even if the datapackage.json lives in /home/vitor/bar/datapackage.json, its resource would still live at http://example.org/bar/data/bar.csv.

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Am I missing something?

Contributor

vitorbaptista commented Jul 19, 2016

So, imagine we have this datapackage.json:

{
    "name": "foo",
    "resources": [
        { "path": "data/foo.csv" }
    ],
}

If this datapackage is at http://example.org/datapackage.json, its resource would be at http://example.org/data/foo.csv. If it's in /home/vitor/foo/datapackage.json, the resource would be at /home/vitor/foo/data/foo.csv. All good. Now, consider this other datapackage:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

This resource then doesn't depend on the path for the datapackage itself. Even if the datapackage.json lives in /home/vitor/bar/datapackage.json, its resource would still live at http://example.org/bar/data/bar.csv.

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Am I missing something?

@pwalsh

This comment has been minimized.

Show comment
Hide comment
@pwalsh

pwalsh Jul 19, 2016

Member

@vitorbaptista those examples are correct as I understand it, but still:

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Not quite sure what you are getting it here.

Member

pwalsh commented Jul 19, 2016

@vitorbaptista those examples are correct as I understand it, but still:

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Not quite sure what you are getting it here.

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Jul 19, 2016

Contributor

@pwalsh relative paths work for remote urls too - that's really important. In both cases you'd compute relative to the datapackage.json base directory.

@vitorbaptista your examples are correct and 👍 to @pwalsh comment.

Note: I'm really thinking explaining this stuff will go in the patterns or FAQs section on the fd.io site (we could add to the spec but it gets kind of bulky).

Contributor

rufuspollock commented Jul 19, 2016

@pwalsh relative paths work for remote urls too - that's really important. In both cases you'd compute relative to the datapackage.json base directory.

@vitorbaptista your examples are correct and 👍 to @pwalsh comment.

Note: I'm really thinking explaining this stuff will go in the patterns or FAQs section on the fd.io site (we could add to the spec but it gets kind of bulky).

@vitorbaptista

This comment has been minimized.

Show comment
Hide comment
@vitorbaptista

vitorbaptista Jul 19, 2016

Contributor

@pwalsh Consider the second example:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

Also consider that we have the files in the local filesystem:

/home/vitor/bar/datapackage.json
/home/vitor/bar/data/datapackage.json  

Then I open python and try to load this, as such:

import datapackage
dp = datapackage.DataPackage('/home/vitor/bar/datapackage.json')

Where should datapackage.DataPackage look for the datapackage's resources?

Going through its resources array, it'll only find { "path": "http://example.org/bar/data/bar.csv" }. As far as I can see, there's no (easy) way for the datapackage library to know that this resource also lives in ./data/bar.csv relative to the datapackage.json, so it'll end up downloading it from the remote URL even though we have a local copy.

Contributor

vitorbaptista commented Jul 19, 2016

@pwalsh Consider the second example:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

Also consider that we have the files in the local filesystem:

/home/vitor/bar/datapackage.json
/home/vitor/bar/data/datapackage.json  

Then I open python and try to load this, as such:

import datapackage
dp = datapackage.DataPackage('/home/vitor/bar/datapackage.json')

Where should datapackage.DataPackage look for the datapackage's resources?

Going through its resources array, it'll only find { "path": "http://example.org/bar/data/bar.csv" }. As far as I can see, there's no (easy) way for the datapackage library to know that this resource also lives in ./data/bar.csv relative to the datapackage.json, so it'll end up downloading it from the remote URL even though we have a local copy.

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Jul 19, 2016

Contributor

@vitorbaptista it should load the value of path which is the url.

Contributor

rufuspollock commented Jul 19, 2016

@vitorbaptista it should load the value of path which is the url.

@vitorbaptista

This comment has been minimized.

Show comment
Hide comment
@vitorbaptista

vitorbaptista Jul 19, 2016

Contributor

@rgrp So if a datapackage.json has a resource with a full URL in its path, the resource should always be downloaded from the Internet, even if we have the data locally, right?

Sounds fine to me, just making sure this was the intended behaviour.

Contributor

vitorbaptista commented Jul 19, 2016

@rgrp So if a datapackage.json has a resource with a full URL in its path, the resource should always be downloaded from the Internet, even if we have the data locally, right?

Sounds fine to me, just making sure this was the intended behaviour.

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock
Contributor

rufuspollock commented Jul 19, 2016

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Aug 5, 2016

Member

format field is already available so what about scheme field? Sometimes there could be a need to set a scheme explicitly. And implementations could be able to pass this path,format,scheme as it is to tabulator.

Member

roll commented Aug 5, 2016

format field is already available so what about scheme field? Sometimes there could be a need to set a scheme explicitly. And implementations could be able to pass this path,format,scheme as it is to tabulator.

@roll roll added backlog and removed backlog labels Aug 8, 2016

@roll roll modified the milestone: specs-v1 Aug 8, 2016

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Aug 9, 2016

Contributor

OK, I think this is ready to go in. It is a breaking change so I think I will need to add an actual warning to the spec mentioning that to support data packages created under earlier version of the spec implementors should convert url to path.

Welcome final +1 / -1 as this is a significant change (please put these as explicit comments rather than emoticons on this comment!)

/cc @pwalsh @paulfitz @danfowler @vitorbaptista @Floppy @jpmckinney ...

Contributor

rufuspollock commented Aug 9, 2016

OK, I think this is ready to go in. It is a breaking change so I think I will need to add an actual warning to the spec mentioning that to support data packages created under earlier version of the spec implementors should convert url to path.

Welcome final +1 / -1 as this is a significant change (please put these as explicit comments rather than emoticons on this comment!)

/cc @pwalsh @paulfitz @danfowler @vitorbaptista @Floppy @jpmckinney ...

@akariv

This comment has been minimized.

Show comment
Hide comment
@akariv

akariv Aug 9, 2016

Contributor

Sorry for joining this discussion so late, but I seem to be missing something. I realise that this PR is the result of very long discussions in various threads - but when trying to compare the proposed change vs. the current state, I get this:

In the current state, we have:

  • url - absolute path to resource
  • path - relative path to resource
    (only one of these attributes can appear)

In the proposed state, we will have

  • path - relative or absolute path to resource
  • cache - relative path to resource
    (both of these attributes can appear together)

So it looks like we've introduced some ambiguity to the spec, complicated the implementation (which needs to detect network schemas), and it's also a breaking change. Plus it doesn't feel to me that the original goal of reducing cognitive complexity is being achieved.

Just an idea - If we allowed path and url to co-exist, and also added a recommended resolving order (path before url) and a few examples and common use cases to the spec, we might be able to achieve the same functionality without breaking the spec or burdening implementors.

Contributor

akariv commented Aug 9, 2016

Sorry for joining this discussion so late, but I seem to be missing something. I realise that this PR is the result of very long discussions in various threads - but when trying to compare the proposed change vs. the current state, I get this:

In the current state, we have:

  • url - absolute path to resource
  • path - relative path to resource
    (only one of these attributes can appear)

In the proposed state, we will have

  • path - relative or absolute path to resource
  • cache - relative path to resource
    (both of these attributes can appear together)

So it looks like we've introduced some ambiguity to the spec, complicated the implementation (which needs to detect network schemas), and it's also a breaking change. Plus it doesn't feel to me that the original goal of reducing cognitive complexity is being achieved.

Just an idea - If we allowed path and url to co-exist, and also added a recommended resolving order (path before url) and a few examples and common use cases to the spec, we might be able to achieve the same functionality without breaking the spec or burdening implementors.

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Aug 9, 2016

Member

Do we handle cache in this proposal?

Member

roll commented Aug 9, 2016

Do we handle cache in this proposal?

@akariv

This comment has been minimized.

Show comment
Hide comment
@akariv

akariv Aug 9, 2016

Contributor

Not in this specific change, but based on @pwalsh 's comment I understand
that this is the final intention.

On Tue, 9 Aug 2016 at 12:58 roll notifications@github.com wrote:

Do we handle cache in this proposal?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#250 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAQMde9s0_mS_FVYVv9eiv0Qfm-CaiH9ks5qeE8WgaJpZM4HqtgY
.

Contributor

akariv commented Aug 9, 2016

Not in this specific change, but based on @pwalsh 's comment I understand
that this is the final intention.

On Tue, 9 Aug 2016 at 12:58 roll notifications@github.com wrote:

Do we handle cache in this proposal?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#250 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAQMde9s0_mS_FVYVv9eiv0Qfm-CaiH9ks5qeE8WgaJpZM4HqtgY
.

@roll roll added progress and removed priority labels Aug 9, 2016

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Aug 9, 2016

Member

So may be we should to handle this both (path/url and cache related) as one change to provide a complete solution? As @akariv has suggested.

Member

roll commented Aug 9, 2016

So may be we should to handle this both (path/url and cache related) as one change to provide a complete solution? As @akariv has suggested.

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Aug 9, 2016

Contributor

@akariv whilst it is conceptually simpler for the "experts", for many users there is confusion about which of the two to use and when to use them (or whether both can be used at once - see the issues referenced in the description). Also people get confused whether to use path or url for stuff online, or whether they can use url offline. Having one value I hope will reduce this confusion a bit.

It is true that consumers of data packages (esp tools) may need to do a bit more work: they now need to work out when they have a full qualified url vs a path. However, for publishers it is simpler and the overall spec is simpler.

Contributor

rufuspollock commented Aug 9, 2016

@akariv whilst it is conceptually simpler for the "experts", for many users there is confusion about which of the two to use and when to use them (or whether both can be used at once - see the issues referenced in the description). Also people get confused whether to use path or url for stuff online, or whether they can use url offline. Having one value I hope will reduce this confusion a bit.

It is true that consumers of data packages (esp tools) may need to do a bit more work: they now need to work out when they have a full qualified url vs a path. However, for publishers it is simpler and the overall spec is simpler.

@roll roll added backlog and removed progress labels Aug 11, 2016

@roll roll removed their assignment Aug 11, 2016

@roll roll removed the backlog label Aug 29, 2016

@rufuspollock rufuspollock modified the milestone: Current Sep 27, 2016

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Sep 28, 2016

Contributor

OK. I think we want to move this forward.

  • @akariv i see your point about path being guaranteed to be relative. The challenge is people end up using both url and path sometimes and there is some confusion. As someone actively using this in the wild your views are very important so please reiterate if this would be a real issue in your opinion.
  • cache is unrelated to this change IMO. It may happen but it is not a dependency.

AGREED: We are going to implement this.

Question is what we do for backwards compatibility. I suggest we explicitly state that clients should remap url to path if path not present. Otherwise behaviour is undefined and should raise an error.

If people are happy i will start a pull request so people can comment on the changes.

Contributor

rufuspollock commented Sep 28, 2016

OK. I think we want to move this forward.

  • @akariv i see your point about path being guaranteed to be relative. The challenge is people end up using both url and path sometimes and there is some confusion. As someone actively using this in the wild your views are very important so please reiterate if this would be a real issue in your opinion.
  • cache is unrelated to this change IMO. It may happen but it is not a dependency.

AGREED: We are going to implement this.

Question is what we do for backwards compatibility. I suggest we explicitly state that clients should remap url to path if path not present. Otherwise behaviour is undefined and should raise an error.

If people are happy i will start a pull request so people can comment on the changes.

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Sep 28, 2016

Member

@rgrp
awesome

No plans to introduce scheme keyword in terms of path=[scheme]://path-core.[format] to have an explicit ability to force exact schemes?

Member

roll commented Sep 28, 2016

@rgrp
awesome

No plans to introduce scheme keyword in terms of path=[scheme]://path-core.[format] to have an explicit ability to force exact schemes?

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Sep 28, 2016

Contributor

@roll what is use case for implementors? How hard is it to work out scheme.

Contributor

rufuspollock commented Sep 28, 2016

@roll what is use case for implementors? How hard is it to work out scheme.

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Sep 28, 2016

Member

@rgrp
Scheme is inferrable from path with no problem so I mean optional explicit schema options as we have for format.

tabulator from born support explicit schema (loading) and format (parsing) options. But it's mostly been done because of some nice symmetry in schema-format parts of path (an url analogy). But also on implementation level it's important to have ability to ensure what loader for file will be used. For example goodtables using tabulator.Stream('path', scheme='http') is sure this call is safe. Explicitly saying the source is remote. This approach something like other flavor of path-url separation approach. Solving to same problem.

But you're right for now I don't know a real use case for it on a spec level (may be @danfowler knows when publishers need to ensure path is ONLY local or remote) -> without a use case no reason to consider it.

Member

roll commented Sep 28, 2016

@rgrp
Scheme is inferrable from path with no problem so I mean optional explicit schema options as we have for format.

tabulator from born support explicit schema (loading) and format (parsing) options. But it's mostly been done because of some nice symmetry in schema-format parts of path (an url analogy). But also on implementation level it's important to have ability to ensure what loader for file will be used. For example goodtables using tabulator.Stream('path', scheme='http') is sure this call is safe. Explicitly saying the source is remote. This approach something like other flavor of path-url separation approach. Solving to same problem.

But you're right for now I don't know a real use case for it on a spec level (may be @danfowler knows when publishers need to ensure path is ONLY local or remote) -> without a use case no reason to consider it.

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Sep 28, 2016

Member

Just to clarify what I mean - consider there is a filesystem supporting many letters drive names - path: drive://path.csv. In this case implantation just fail with UnknownScheme error but if publisher will provide path: drive://path.csv, scheme: file it will be correctly opened as a local file. I suppose thing to investigate in a future.

Member

roll commented Sep 28, 2016

Just to clarify what I mean - consider there is a filesystem supporting many letters drive names - path: drive://path.csv. In this case implantation just fail with UnknownScheme error but if publisher will provide path: drive://path.csv, scheme: file it will be correctly opened as a local file. I suppose thing to investigate in a future.

@jpmckinney jpmckinney added the blocker label Sep 28, 2016

@waylonflinn

This comment has been minimized.

Show comment
Hide comment
@waylonflinn

waylonflinn Oct 9, 2016

👍

Since the intent is to modify the spec with this change (and I'm working on the relevant code today), I'm going to go ahead and do this in Dataship.

waylonflinn commented Oct 9, 2016

👍

Since the intent is to modify the spec with this change (and I'm working on the relevant code today), I'm going to go ahead and do this in Dataship.

@waylonflinn

This comment has been minimized.

Show comment
Hide comment
@waylonflinn

waylonflinn Oct 9, 2016

I have a question about an edge case. Does the following seem correct?

When a path doesn't have a protocol descriptor but starts with /, it MUST be treated as an absolute path referring to the root of the resource from which the datapackage is accessed.

Example

datapackage at http://www.example.com/user/foo/datapackage.json
contains:

{
  "name" : "foo"
  "resource" : [
    {
      "name" : "bar"
      "path" : "/data/bar.csv"
    }
  ]

The resource bar MUST be found at http://www.example.com/data/bar.csv

waylonflinn commented Oct 9, 2016

I have a question about an edge case. Does the following seem correct?

When a path doesn't have a protocol descriptor but starts with /, it MUST be treated as an absolute path referring to the root of the resource from which the datapackage is accessed.

Example

datapackage at http://www.example.com/user/foo/datapackage.json
contains:

{
  "name" : "foo"
  "resource" : [
    {
      "name" : "bar"
      "path" : "/data/bar.csv"
    }
  ]

The resource bar MUST be found at http://www.example.com/data/bar.csv

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Oct 10, 2016

Contributor

@waylonflinn great question - and i've actually worked on a draft here and came up with exactly this case. I'm coming to the view that you do not allow absolute paths without a scheme - only relative paths (and also no ../). So basically any path is:

  • EITHER a url
  • OR: a relative path (which is interpreted relative to the directory in which datapackage.json is found

This also addresses security concerns.

Contributor

rufuspollock commented Oct 10, 2016

@waylonflinn great question - and i've actually worked on a draft here and came up with exactly this case. I'm coming to the view that you do not allow absolute paths without a scheme - only relative paths (and also no ../). So basically any path is:

  • EITHER a url
  • OR: a relative path (which is interpreted relative to the directory in which datapackage.json is found

This also addresses security concerns.

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Oct 10, 2016

Member

HTTP, zip etc uses / as a pointer to a root of container. Is it an option also for us? If / means datapackage.json dir - no security problems at all (just ban .. or getting out of dp dir).

Member

roll commented Oct 10, 2016

HTTP, zip etc uses / as a pointer to a root of container. Is it an option also for us? If / means datapackage.json dir - no security problems at all (just ban .. or getting out of dp dir).

@waylonflinn

This comment has been minimized.

Show comment
Hide comment
@waylonflinn

waylonflinn Oct 10, 2016

@rgrp Thanks for giving this some thought and coming up with a reasonable answer.

Having the absolute path (/) available definitely helps my use case. Is it possible to change this?

waylonflinn commented Oct 10, 2016

@rgrp Thanks for giving this some thought and coming up with a reasonable answer.

Having the absolute path (/) available definitely helps my use case. Is it possible to change this?

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Oct 16, 2016

Contributor

@waylonflinn really useful feedback. BTW why do you want "root paths" i.e. /...?

Lesson for me: people definitely want to retain allowing / as a starter in paths. I'm ok with that.

Does anyone want to allow ../ paths? I think not though it could be a bit complex to start having these rules in the spec. Maybe more of a thing for "notes for implementors".

For security conscious implementors: they can add a switch to disallow all absolute and ../ paths.

Contributor

rufuspollock commented Oct 16, 2016

@waylonflinn really useful feedback. BTW why do you want "root paths" i.e. /...?

Lesson for me: people definitely want to retain allowing / as a starter in paths. I'm ok with that.

Does anyone want to allow ../ paths? I think not though it could be a bit complex to start having these rules in the spec. Maybe more of a thing for "notes for implementors".

For security conscious implementors: they can add a switch to disallow all absolute and ../ paths.

@roll

This comment has been minimized.

Show comment
Hide comment
@roll

roll Oct 19, 2016

Member

In my opinion a secure recommendation for implementors will be option to enable absolute/dotted pathes disabled by default

Member

roll commented Oct 19, 2016

In my opinion a secure recommendation for implementors will be option to enable absolute/dotted pathes disabled by default

@pwalsh

This comment has been minimized.

Show comment
Hide comment
@pwalsh

pwalsh Nov 10, 2016

Member

@rgrp do you have a PR in the wings for this? I think the wording for the spec should be quite straight forward. I think we should also have a set of clear examples alongside that that addresses some of the above points.

Again here, I think in the various places this has been discussed, there is general agreement that it simplifies the prep of a data package for a publisher, and reduces ambiguity in multiple properties that do essentially the same thing, at the relatively small cost of implementing code needing to handle the file pointer disambiguation (fs vs http).

Any comments from the @frictionlessdata/specs-working-group before we start the PR process?

Member

pwalsh commented Nov 10, 2016

@rgrp do you have a PR in the wings for this? I think the wording for the spec should be quite straight forward. I think we should also have a set of clear examples alongside that that addresses some of the above points.

Again here, I think in the various places this has been discussed, there is general agreement that it simplifies the prep of a data package for a publisher, and reduces ambiguity in multiple properties that do essentially the same thing, at the relatively small cost of implementing code needing to handle the file pointer disambiguation (fs vs http).

Any comments from the @frictionlessdata/specs-working-group before we start the PR process?

@roll roll removed the blocker label Nov 16, 2016

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Nov 17, 2016

Contributor

OK, here's a specific proposal for the text change:

NOTE: we will disallow absolute paths like /... and parent relative paths ../ for security reasons and simplicity of implementation.

Required Fields - Data Location

Resource information MUST contain a property describing the location of the
data associated to this resource.

Data associated to a resource can be located online or locally on disk or
inline in the datapackage.json itself.

The location of resource data MUST be specified by the presence of one
(and only one) of these two properties:

  • path: for data in files online or locally.
  • data: for data inline in the datapackage.json descriptor itself.

Data in Files - path

path the location of the data specified either as a fully qualified url or a relative
unix-style path ('/' as separator). A relative path should be interpreted
relative relative to the directory or URL in which the descriptordatapackage.json listing this resource resides.

Examples:

# fully qualified url
"path": "http://ex.datapackages.org/big-csv/my-big.csv"

# relative path
"path": "examples/my-csv.csv"

`data`: (inline) a field containing the data directly inline in the
  `datapackage.json` file. Further details below.

Inline Data - data

...

Contributor

rufuspollock commented Nov 17, 2016

OK, here's a specific proposal for the text change:

NOTE: we will disallow absolute paths like /... and parent relative paths ../ for security reasons and simplicity of implementation.

Required Fields - Data Location

Resource information MUST contain a property describing the location of the
data associated to this resource.

Data associated to a resource can be located online or locally on disk or
inline in the datapackage.json itself.

The location of resource data MUST be specified by the presence of one
(and only one) of these two properties:

  • path: for data in files online or locally.
  • data: for data inline in the datapackage.json descriptor itself.

Data in Files - path

path the location of the data specified either as a fully qualified url or a relative
unix-style path ('/' as separator). A relative path should be interpreted
relative relative to the directory or URL in which the descriptordatapackage.json listing this resource resides.

Examples:

# fully qualified url
"path": "http://ex.datapackages.org/big-csv/my-big.csv"

# relative path
"path": "examples/my-csv.csv"

`data`: (inline) a field containing the data directly inline in the
  `datapackage.json` file. Further details below.

Inline Data - data

...

@rufuspollock rufuspollock self-assigned this Nov 17, 2016

@rufuspollock

This comment has been minimized.

Show comment
Hide comment
@rufuspollock

rufuspollock Nov 18, 2016

Contributor

@waylonflinn note we are planning to disallow / at the start of paths as paths must be relative or fully qualified url. Do you have any comments on this - i note you had required / being permitted for URLs. If you could provide details on your use case that would be valuable.

Contributor

rufuspollock commented Nov 18, 2016

@waylonflinn note we are planning to disallow / at the start of paths as paths must be relative or fully qualified url. Do you have any comments on this - i note you had required / being permitted for URLs. If you could provide details on your use case that would be valuable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment