"Chunks" and resources split over multiple physical files #228

Closed
danfowler opened this Issue Nov 10, 2015 · 24 comments

danfowler (Contributor) commented Nov 10, 2015

(Original title: Support concatenations of CSV files in a Tabular Data Package)

Dealing with large datasets split across multiple CSVs is a common enough problem that we should support a way of specifying concatenation directly in the datapackage.json.

As originally raised here and here as part of the Fiscal Data Package work, @pwalsh suggested allowing path, data, and url in a resource object to be arrays that a given implementation would transparently concatenate and treat as a single entity. Given that the data property of a resource specifies inline data (i.e. concatenation is irrelevant there), I will amend this to a recommendation allowing path and url to be arrays.

@pwalsh's original post below:

I thought about this more. I think we should definitely support the basic use case of multiple files of the "same" data, and we should do it as part of the core Tabular Data Package spec, on a resource object.

e.g. given:

budget1.csv
amount,date

budget2.csv
amount,date

We would not have 2 distinct resources: rather, we'd allow path, data and url to be arrays:

{
  "name": "My Stuff",
  "resources": [
    {
      "path": ["budget1.csv", "budget2.csv"]
    }
  ]
}

I don't like having a distinct "concatenations" property: this pattern is a much better representation of the thing we are describing IMHO, and simply extends what already exists.

I agree that this is blurring a line between ETL and metadata.

I think the pros outweigh the cons considering how common this really is out in the wild.

In the various libs that exist and will be further developed for dealing with Data Packages and the resources therein, it is trivial to extend iteration patterns over files to work on an array of them, or to perform some concatenation first and internally treat the files as one thing (shutil.copyfileobj, or whatever is behind csvstack, for example).
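
A minimal sketch, in Python, of the two approaches just mentioned - iterating over an array of files as one logical table, or physically concatenating them first - assuming each file repeats the header row (function names are illustrative, not from any datapackage library):

import csv
import shutil


def iter_rows(paths):
    """Iterate over the rows of several CSV files as one logical table."""
    for i, path in enumerate(paths):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)      # each file repeats the header here
            if i == 0:
                yield header           # keep only the first file's header
            yield from reader


def concat_files(paths, out_path):
    """Physically concatenate files csvstack-style, stripping the
    repeated header line from every file after the first."""
    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                if i > 0:
                    next(f)            # skip the repeated header line
                shutil.copyfileobj(f, out)


rows = list(iter_rows(["budget1.csv", "budget2.csv"]))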

Alternate suggestion from @rgrp for context:

Rather than addressing this in the mapping by having to reference multiple files from each mapped attribute, I propose that instead we have some way to indicate that files should be concatenated, e.g. in the mapping attribute or similar we have:

concatenations: [
  ["budget1.csv", "budget2.csv", ...],
  ["payee1.csv", "payee2.csv"]
]

Thoughts? @pudo @rgrp @jindrichmynarz @pwalsh

pwalsh (Member) commented Nov 10, 2015

It might be good to copy over the concatenations example too, as it gives context for the suggestion.

rufuspollock (Contributor) commented Nov 17, 2015

I have been thinking about this a lot after a conversation with @jbenet re chunking and hashing in IPFS.

I think we could introduce "chunking" into Data Package resources quite easily and it would provide quite a bit of power:

  • We introduce a "chunks" attribute representing a single resource broken up into multiple physical "files" or "urls".
  • The guess is that the logic is simple concatenation - but do we need to allow for the case of repeated headers?
  • How much metadata do we push down onto the chunks? hash and path/url is all, I guess; maybe size as well?
  • mediatype and schema are common to the whole resource
  • stuff like title and name is not needed / not allowed

{
  "resources": [
    {
      "name": ...,
      "title": ...,
      "chunks": [
        {
          "path": ... (or "url": ...),
          "hash": ...
        }
      ],
      "mediatype": ...
    }
  ]
}
pwalsh (Member) commented Nov 17, 2015

@rgrp great to see the chunking idea ;).

It will essentially do what I was looking for, and has the benefit of not overloading current concepts.

Would chunks be an option adjacent to data, path and url? e.g.: if chunks is present we'd treat that as the data source.

About metadata, schema, etc.: I'm a big -1 on pushing metadata into the chunks (as per my original argument).

rufuspollock (Contributor) commented Nov 17, 2015

@pwalsh yes, chunks would mean that the other options could not be present. Re metadata in the chunks, I generally agree: the only stuff in there would be a location identifier (path or url) plus hash (hash is important as it allows for rapid change detection).

What's really nice about chunks generally is that they allow for much improved efficiency in the storage and management of data that changes. For example, compare a single 100MB CSV vs a chunked version with spending broken into a file per month, and the impact of a correction to a single row. In the former case the whole file has to be re-uploaded, whereas in the chunked case only that one file changes. This was a point I really took from the conversation with @jbenet, as his IPFS system does this nicely, and it would be relevant to e.g. OpenSpending in terms of storage in S3 (if we turned autoversioning on - or alternatively you could even name chunks by their hash - which is what IPFS does - and then you only upload a chunk if you don't already have it).
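
A toy sketch of the hash-named-chunk idea (the store here is a plain dict standing in for S3/IPFS; none of this is a real client API):

import hashlib


def upload_chunks(chunk_paths, store):
    """Name each chunk by its content hash; upload only missing ones."""
    names = []
    for path in chunk_paths:
        with open(path, "rb") as f:
            data = f.read()
        name = hashlib.sha256(data).hexdigest()
        if name not in store:          # unchanged chunks are already stored
            store[name] = data
        names.append(name)
    return names                       # record these as the resource's chunks


store = {}
upload_chunks(["spend-2015-01.csv", "spend-2015-02.csv"], store)
# after correcting one row in spend-2015-02.csv, re-running uploads only
# that one chunk; spend-2015-01.csv is found by its unchanged hash
upload_chunks(["spend-2015-01.csv", "spend-2015-02.csv"], store)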

As an aside, I also think we can get stricter about only having one of data, path or url, as per #223.

pwalsh (Member) commented Nov 17, 2015

@rgrp great, I'm a big +1 on this as a solution to the original problem, and as a great addition to the spec.

jbenet commented Nov 19, 2015

(Will write a better response later -- mobile is hard.)

Would it help if we made a small spec for chunking? We're going for a super pluggable thing where people can create a plethora of chunkers that receive a stream (or a file, if they need to analyze the whole thing) and just emit chunks over time.

We (will soon) default to Rabin-fingerprint-based chunking, but the pluggability is for when users know the data much better than general systems do.

I'll write up the chunking spec and move our Go implementations out of go-ipfs. We'll also write them in JS soon, but we're not there yet.
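
For context, a toy content-defined chunker in the spirit of (but far simpler than) Rabin-fingerprint chunking: a rolling hash over a sliding byte window picks boundaries from the content itself, so an edit early in a stream shifts boundaries only locally rather than invalidating every later chunk. All names and parameters are illustrative, not from go-ipfs:

MOD = 1 << 32


def chunk(data: bytes, window=16, mask=0x3FF, min_size=512, max_size=8192):
    """Cut `data` wherever the window hash hits a target bit pattern."""
    pw = pow(31, window, MOD)          # factor of the byte leaving the window
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) % MOD         # slide the window forward by one byte
        if i >= window:
            h = (h - data[i - window] * pw) % MOD
        size = i + 1 - start
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            out.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])       # final partial chunk
    return out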

jbenet commented Nov 19, 2015

(Happy to cross-post that spec into the dataprotocols specs and take review.)

rufuspollock (Contributor) commented Nov 19, 2015

@jbenet a small, general spec for chunking would be good to see. Please post a link or cross-post and we can take a look.

Here we are going to have a very simple use case - e.g. I'm not sure we need to know how someone performed the chunking (at least I think so). The only issue I can think of is whether, when the CSV files were split, a header was kept in each file or not (if so, you need to strip it from each file except the first when you concatenate).

Please post and let's have a look :-)

danfowler (Contributor) commented Dec 4, 2015

👍

@rufuspollock rufuspollock changed the title from Support concatenations of CSV files in a Tabular Data Package to "Chunks" and resources split over multiple physical files Jan 29, 2016

rufuspollock (Contributor) commented Jan 31, 2016

Right now, I think we're going to go with the simplest possible option: just have an array of paths with no extra metadata:

"chunks": ["file1.csv", "file2.csv"]

Questions:

  • What name should we use? chunks or blocks? Would love input on this one.
  • Why not merge chunks with path and make path an array?
    • Answer: I generally do not like overloading, as it is easy to cause confusion. The semantics of path and chunks are different. Not a biggie to have two separate things.
  • Could the chunks/blocks array contain urls as well as paths? Can you have a mixture? Not so sure on this one, but my inclination is to say path only.
  • What do we do about repeated headers, e.g. in CSV files? Not so sure. I feel this is properly a separate option, e.g. stripChunkHeaders, which means you should take the first line of all but the first chunk and throw it away (the first line being everything up to \n or \r\n) - see the sketch below. Thoughts welcome.
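
A minimal sketch of how a reader might honor these two properties; both chunks and stripChunkHeaders are treated as hypothetical names here, pending the spec:

def read_chunked(resource):
    """Concatenate a resource's chunks, optionally stripping repeated headers."""
    strip = resource.get("stripChunkHeaders", False)
    parts = []
    for i, path in enumerate(resource["chunks"]):
        with open(path) as f:          # universal newlines cover \n and \r\n
            if strip and i > 0:
                next(f, None)          # throw away the repeated header line
            parts.append(f.read())
    return "".join(parts)


text = read_chunked({"chunks": ["file1.csv", "file2.csv"],
                     "stripChunkHeaders": True})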

@pwalsh @danfowler @jpmckinney @rossjones @morty any thoughts here?

@jbenet still great to get an update on how you are doing with the minispec (plus any thoughts on the proposed approach here)

jpmckinney (Member) commented Feb 1, 2016

Why not parts?

It's easy to determine whether a header is repeated or not. I don't think we need an explicit property. It can be done automatically.

In HTTP, a chunk is just some length of bytes. With a single table split across multiple CSVs, the parts are always sensibly split so that you don't have the start of one row in one file and the end of the row in another. So, chunk sounds wrong to me. I don't know who uses "blocks" but it suggests the same sort of data-agnosticism as "chunk".
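
A sketch of that automatic detection: compare the first line of each subsequent part against the first part's header and drop it only on a match. This is a hypothetical helper, and it is ambiguous if a data row can legitimately equal the header - which is one argument for an explicit property instead:

import csv


def combine_parts(paths):
    """Merge CSV parts into one table, auto-detecting repeated headers."""
    rows, header = [], None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            first = next(reader, None)
            if header is None:
                header = first
                rows.append(header)
            elif first is not None and first != header:
                rows.append(first)     # no repeated header: this line is data
            rows.extend(reader)
    return rows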

pwalsh (Member) commented Feb 1, 2016

  1. I like the thought behind @jpmckinney's suggestion of parts rather than chunks.
  2. About headers in subsequent files: I think I prefer the explicit boolean property.
  3. I still prefer overloading path / url / data to introducing a new concept, but if we are introducing one, we need to explicitly define:
    • What happens if one (or more) of path / url / data are present along with chunks?
    • Can each item in a chunks array be one of path / url / data?
danfowler (Contributor) commented Feb 3, 2016

What name should we use? chunks or blocks? Would love input on this one.

If this were a separate key, then +1 for parts, for the reasons @jpmckinney suggests.

Why not merge chunks with path and make path an array?

I am +1 on overloading path / url (not data*). We already have a pattern that allows primaryKey to be either a string representing a single value or an array of multiple values. This could be extended to path / url (with a side benefit of retaining backwards compatibility for data packages with single-valued path / url).

Could the chunks/blocks array contain urls as well as paths? Can you have a mixture? Not so sure on this one, but my inclination is to say path only.

My inclination is to include url. Remote resources would benefit most from the "chunking" mechanism we're discussing (that is, if we include something like the hash key mentioned above). But, if we're specifying this as a separate key, then we need a way to specify that type. So, we either need pathParts and urlParts (not a fan) or the elements of the array need to go back to objects (also not really a fan):

"parts": [{"path": "a.csv"}, {"path": "b.csv"}]

What do we do about repeated headers, e.g. in CSV files? Not so sure. I feel this is properly a separate option, e.g. stripChunkHeaders, which means you should take the first line of all but the first chunk and throw it away (the first line being everything up to \n or \r\n). Thoughts welcome.

My feeling is to default to stripping the first line in subsequent files the way csvstack does. After all, what is a CSV file with no header :)? In that case, of course, we would need a csvsplit utility.

What happens if one (or more) of path / url / data are present along with chunks?

Reiterating @rgrp: "yes chunks would mean that the other options could not be present"


* IMO, inline data does not make sense to "chunk". Plus, if we are overloading the key, data can already be a string, object, or array of values.

rufuspollock (Contributor) commented Feb 3, 2016

  • Naming: @jpmckinney great suggestion re parts.
  • Overloading vs a distinct name: @danfowler + @pwalsh - I am starting to think we could go for the array approach on path and url.
  • Stripping headers: I think we want an explicit boolean property so that it is not something left to guessing by external tools (though such tools are welcome to check / guess).

@ALL: really welcome your vote / view on overloading url and path vs creating parts.

pwalsh (Member) commented Feb 3, 2016

  • +1 on overloading for url, path and data

I add data because it should have the same API as url and path, IMHO.

danfowler (Contributor) commented Feb 4, 2016

@pwalsh for consistency's sake, I agree, but then I think we might have to revisit data because it acts so differently when compared to path and url (e.g. data can already be an array).

pwalsh (Member) commented Feb 4, 2016

@danfowler

In Tabular Data Package, the payload from a file (via url or path) is an array, so conceptually it is the "same". For Data Package in general, pretty much the same principle would apply (items in an array, though not necessarily with each item also being an array).

In any event, it doesn't seem to me that data really behaves differently on this level (path and url are pointers to "data"; data is the resolved "data"). To actually interact with the data in a Data Package you need to dereference the url and path pointers, whereby they end up being the same as data.
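
A sketch of the equivalence being argued: path and url are pointers that, once dereferenced, yield the same thing as inline data (illustrative code, not any library's API):

import csv
import io
import urllib.request


def resolve(resource):
    """Return a resource's rows regardless of how the data is located."""
    if "data" in resource:                 # inline data: already resolved
        return resource["data"]
    if "path" in resource:                 # local pointer: dereference file
        with open(resource["path"], newline="") as f:
            return list(csv.reader(f))
    if "url" in resource:                  # remote pointer: dereference URL
        with urllib.request.urlopen(resource["url"]) as r:
            text = r.read().decode("utf-8")
        return list(csv.reader(io.StringIO(text)))
    raise ValueError("resource has no data, path, or url")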

pwalsh (Member) commented Mar 7, 2016

@rgrp @danfowler any moves towards resolution on this issue?

pwalsh (Member) commented Jul 12, 2016

@rgrp

Where should we take this now? With path and url becoming just path, overloading path to be an array of parts is still the simplest option to me.

@roll roll added the backlog label Aug 8, 2016

rufuspollock (Contributor) commented Aug 11, 2016

OK, I will start a WIP on a draft text - thoughts welcome.

### A Resource in Multiple Files ("Chunked Resources")

Usually a resource has a single file associated with it containing the data for that resource.

Sometimes, however, it may be convenient to have a single resource whose data is split across multiple files -- perhaps the data is large and having it in one file would be inconvenient.

To support this use case we allow the `path` property to be an array of file paths rather than a single file path.

@roll roll removed the backlog label Aug 29, 2016

danfowler (Contributor) commented Sep 23, 2016

Here's an example in the wild from @GregoryRhysEvans of the path-as-array approach. Knowing nothing else about this, it seems to prove out the intuitiveness of the approach.

https://gist.github.com/GregoryRhysEvans/7c07337abbefd64ac06562d4e5a54243

...
  {
      "name": "opensnmp",
      "format": "csv",
      "compression": "gz",
      "path": [
        "snmp-data/snmp-data.20141014.csv.gz",
        "snmp-data/snmp-data.20141021.csv.gz"
      ],
      "schema": {
        "fields": [
          {
            "type": "datetime",
            "name": "timestamp",
            "description": "Date and time of observation"
          },
          {
            "type": "integer",
            "name": "risk_id",
            "description": "ID from Postgres DB representing risk type"
          },
          {
            "type": "string",
            "name": "ip",
            "description": "IP address related to observation"
          },
          {
            "type": "integer",
            "name": "asn",
            "description": "Autonomous Network Number"
          },
          {
            "type": "string",
            "name": "cc",
            "description": "ISO ALPHA-2 country code"
          }
        ]
      }
  }
...
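
For illustration, a minimal consumer of a descriptor like the one above, iterating rows across the gzipped parts as one logical table (assuming each part repeats the header row; nothing here is a spec'd library call):

import csv
import gzip


def iter_resource_rows(resource):
    """Stream rows from a multi-file, gzip-compressed CSV resource."""
    header = None
    for path in resource["path"]:
        with gzip.open(path, "rt", newline="") as f:
            reader = csv.reader(f)
            first = next(reader, None)
            if header is None:
                header = first             # keep the first part's header
                yield header
            yield from reader              # later parts: header was discarded


resource = {"name": "opensnmp",
            "format": "csv",
            "compression": "gz",
            "path": ["snmp-data/snmp-data.20141014.csv.gz",
                     "snmp-data/snmp-data.20141021.csv.gz"]}
rows = iter_resource_rows(resource)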

@rufuspollock rufuspollock added this to the Current milestone Sep 27, 2016

jpmckinney (Member) commented Sep 27, 2016

@rgrp's suggestion looks fine to me.

@jpmckinney jpmckinney added the blocker label Sep 28, 2016

pwalsh (Member) commented Nov 10, 2016

@rgrp I think this is solid and extremely useful. Can you do a PR and let's take it from there?

rufuspollock (Contributor) commented Dec 1, 2016

The one remaining question is what to do with headers.

My sense here is that in Data Package the behaviour is naive: simple concatenation. In Tabular Data Package there could be a concept of skipInitialRows or headerRow etc. - see #326.

@roll roll added the duplicate label Dec 6, 2016
