"Chunks" and resources split over multiple physical files #228

Closed
danfowler opened this Issue Nov 10, 2015 · 24 comments


@danfowler
Contributor

(Original title: Support concatenations of CSV files in a Tabular Data Package)

Dealing with large datasets split across multiple CSVs is a common enough problem that we should support a way of specifying concatenation directly in the datapackage.json.

Originally referenced here and here as part of the Fiscal Data Package work, @pwalsh had the suggestion to allow path, data, and url in a resource object to be arrays that a given implementation would transparently concatenate and treat as a single entity. Given that the data property of a resource specifies inline data (i.e. concatenation is irrelevant), I will amend this to a recommendation allowing path and url to be arrays.

@pwalsh's original post below:

I thought about this more. I think we should definitely support the basic use case of multiple files of the "same" data, and we should do it as part of the core Tabular Data Package spec, on a resource object.

e.g. given:

budget1.csv
amount,date

budget2.csv
amount,date

We would not have 2 distinct resources: rather, we'd allow path, data and url to be arrays:

{
  "name": "My Stuff",
  "resources": [
    {
      "path": ["budget1.csv", "budget2.csv"]
    }
  ]
}

I don't like having a distinct "concatenations" property: this pattern is a much better representation of the thing we are describing IMHO, and simply extends what already exists.

I agree that this is blurring a line between ETL and metadata.

I think the pros outweigh the cons considering how common this really is out in the wild.

In the various libs that exist and will be further developed for dealing with Data Packages and the resources therein, it is trivial to extend iteration patterns to work over an array of files, or to perform some concatenation first and internally treat the files as one thing (shutil.copyfileobj, or whatever is behind csvstack, for example).
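For illustration, here is a minimal sketch of that iteration pattern in Python (the function name is hypothetical, and it assumes each file repeats the header row - a question the thread below has not yet settled):

import csv

def iter_rows(paths):
    """Iterate over several CSV files as if they were one logical table.

    Assumes every file repeats the header row, so the header is kept
    from the first file and skipped in the rest (csvstack behaviour).
    """
    for i, path in enumerate(paths):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            if i > 0:
                next(reader, None)  # drop the repeated header line
            yield from reader

# e.g. rows = list(iter_rows(["budget1.csv", "budget2.csv"]))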

Alternate suggestion from @rgrp for context:

Rather than addressing this in the mapping by having to reference multiple files from each mapped attribute I propose that instead we have some way to indicate that files should be concatenated e.g. in mapping attribute or similar we have:

"concatenations": [
  ["budget1.csv", "budget2.csv", ...],
  ["payee1.csv", "payee2.csv"]
]

Thoughts? @pudo @rgrp @jindrichmynarz @pwalsh

@danfowler danfowler referenced this issue in openspending/fiscal-data-package Nov 10, 2015
Closed

Concatenation and support for data split over multiple files #75

@pwalsh
Member
pwalsh commented Nov 10, 2015

It might be good to copy over the concatenations example too, as it gives context for the suggestion.

@danfowler danfowler referenced this issue in openspending/fiscal-data-package Nov 12, 2015
Closed

mapping XML dataset to FDP datasets #67

@rufuspollock
Contributor

I have been thinking about this a lot after a conversation with @jbenet re chunking and hashing in IPFS.

I think we could introduce "chunking" into Data Package resources quite easily and it would provide quite a bit of power:

  • We introduce a "chunks" attribute representing a single resource broken up into multiple physical "files" or "urls".
  • The guess logic is that we concatenate - but we need to allow for the case of repeated headers?
  • How much metadata do we push down onto the chunks? hash and path/url is all, I guess; maybe size as well?
  • mediatype and schema are common to the whole resource
  • stuff like title and name is not needed / not allowed
{
  "resources": [
    {
      "name": ...,
      "title": ...,
      "chunks": [
        {
          "path" or "url": ...,
          "hash": ...
        }
      ],
      "mediatype": ...
    }
  ]
}
@pwalsh
Member
pwalsh commented Nov 17, 2015

@rgrp great to see the chunking idea ;).

It will essentially do what I was looking for, and has the benefit of not overloading current concepts.

Would chunks be an option adjacent to data, path and url? e.g.: if chunks is present we'd treat that as the data source.

About metadata, schema, etc.: I'm a big -1 on pushing metadata into the chunks (as per my original argument).

@rufuspollock
Contributor

@pwalsh yes, chunks would mean that the other options could not be present. Re metadata in the chunks, I generally agree: the only stuff in there would be the location identifier (path or url) plus hash (hash is important as it allows changes to be detected rapidly).

What's really nice about chunks generally, btw, is that they allow for much improved efficiency in the storage and management of data that changes. For example, compare a single 100Mb CSV vs a chunked version with spending broken into a file per month, and consider the impact of a correction to a row. In the former case the whole file has to be re-uploaded, whereas in the chunked case only that one file changes. This was a point I really got from talking with @jbenet, as his IPFS system really does this nicely, and it would be relevant to e.g. OpenSpending in terms of storage in S3 (if we turned autoversioning on - or alternatively you could even name chunks via their hash - which is what IPFS does - and then you only upload a chunk if you don't already have it).
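To illustrate the hash-naming idea, a minimal sketch (the local directory stands in for e.g. an S3 bucket; all names here are illustrative, not spec):

import hashlib
from pathlib import Path

STORE = Path("chunk-store")  # stand-in for e.g. an S3 bucket

def put_chunk(path):
    """Store a chunk under its content hash, skipping the upload when the
    content is already present - so a correction to one monthly file
    re-uploads only that chunk, never the whole resource."""
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    target = STORE / digest
    if not target.exists():
        STORE.mkdir(exist_ok=True)
        target.write_bytes(data)
    return digest  # record as the chunk's "hash" in datapackage.json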

As an aside I also think we can get stricter about only having one of data, path or url as per #223

@pwalsh
Member
pwalsh commented Nov 17, 2015

@rgrp great, I'm a big +1 on this as a solution to the original problem, and as a great addition to the spec.

@jbenet
jbenet commented Nov 19, 2015

(Will write a better response later-- mobile is hard)

Would it help if we made a small spec for chunking? We're going for a super pluggable thing where people can create a plethora of chunkers that receive a stream (or a file, if they need to analyze the whole thing) and just emit chunks over time.

We (will soon) default to Rabin fingerprinting based chunking, but the pluggability is for when users know the data much better than general systems do.

I'll write up the chunking spec and move our Go implementations out of go-ipfs. We'll also write them in JS soon, but we're not there yet.

@jbenet
jbenet commented Nov 19, 2015

(Happy to cross post that spec into dataprotocols specs and take review)

@rufuspollock
Contributor

@jbenet a small, general spec for chunking would be good to see. Please post a link or cross post and we can take a look.

Here we are going to have a very simple use case: I'm not sure we need to know how someone performed the chunking (at least I think so). The only issue I can think of is whether, having split the CSV files, you kept a header in each file or not (if so, you need to strip it from each file except the first when you concatenate).

Please post and let's have a look :-)

@danfowler
Contributor

👍

@rufuspollock rufuspollock changed the title from Support concatenations of CSV files in a Tabular Data Package to "Chunks" and resources split over multiple physical files Jan 29, 2016
@rufuspollock
Contributor
rufuspollock commented Jan 31, 2016 edited

Right now, I think we're going to go with the simplest possible option: just have an array of paths with no extra metadata:

"chunks": ["file1.csv", "file2.csv"]

Questions:

  • What name should we use? chunks or blocks? Would love input on this one.
  • Why not merge chunks with path and make path an array?
    • Ans: I generally do not like overloading as it easily leads to confusion. The semantics of path and chunks are different. Not a biggie to have two separate things.
  • Could the chunks/blocks array contain urls as well as paths? Can you have a mixture? Not so sure on this one, but my inclination is to say path only.
  • What do we do about repeated headers e.g. in CSV files? Not so sure. I feel this is properly a separate option, e.g. stripChunkHeaders, which would mean you take the first line of all but the first chunk and throw it away (the first line being everything up to \n or \r\n) - see the sketch after this list. Thoughts welcome.
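A minimal sketch of what that stripChunkHeaders option could mean in practice (the option name is only proposed above; nothing here is specced):

def concat_chunks(paths, strip_chunk_headers=False):
    # Concatenate chunk files into one byte stream. With
    # strip_chunk_headers=True, drop everything up to and including the
    # first "\n" in every chunk except the first ("\r\n" also ends in
    # "\n", so both line endings are covered).
    out = bytearray()
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            data = f.read()
        if strip_chunk_headers and i > 0:
            data = data.split(b"\n", 1)[1] if b"\n" in data else b""
        out += data
    return bytes(out)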

@pwalsh @danfowler @jpmckinney @rossjones @morty any thoughts here?

@jbenet still great to get an update on how you are doing with the minispec (plus any thoughts on the proposed approach here)

@jpmckinney
Member

Why not parts?

It's easy to determine whether a header is repeated or not. I don't think we need an explicit property. It can be done automatically.
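For what it's worth, a sketch of that automatic check (a hypothetical helper, assuming the resource's schema field names are known):

import csv

def first_line_is_header(chunk_path, field_names):
    # Guess whether a chunk repeats the header by comparing its first
    # line against the known field names from the resource's schema.
    with open(chunk_path, newline="") as f:
        first_row = next(csv.reader(f), [])
    return [cell.strip() for cell in first_row] == list(field_names)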

In HTTP, a chunk is just some length of bytes. With a single table split across multiple CSVs, the parts are always sensibly split so that you don't have the start of one row in one file and the end of the row in another. So, chunk sounds wrong to me. I don't know who uses "blocks" but it suggests the same sort of data-agnosticism as "chunk".

@pwalsh
Member
pwalsh commented Feb 1, 2016
  1. I like the thought behind the @jpmckinney suggestion for parts and not chunks
  2. About headers in subsequent files: I think I prefer the explicit boolean property
  3. I still do prefer overloading path / url / data to introducing a new concept, but, if we are introducing one, we need to explicitly define:
    • What happens if one (or more) of path / url / data are present along with chunks
    • Can each item in a chunk array be one of path / url / data?
@danfowler
Contributor

What name should we use? chunks or blocks? Would love input on this one.

If this were a separate key, then +1 parts for the reasons @jpmckinney suggests

Why not merge chunks with path and make path an array?

I am +1 on overloading path / url (not data*). We already have a pattern that allows primaryKey to be either a string representing a single value or an array of multiple values. This could be extended to path / url (with a side benefit of retaining backwards compatibility for data packages with single-valued path / url).

Could the chunks/blocks array contain urls as well as paths? Can you have a mixture? Not so sure on this one, but my inclination is to say path only.

My inclination is to include url. Remote resources would benefit most from the "chunking" mechanism we're discussing (that is, if we include something like the hash key mentioned above). But, if we're specifying this as a separate key, then we need a way to specify that type. So, we either need pathParts and urlParts (not a fan) or the elements of the array need to go back to objects (also not really a fan):

"parts" : [{"path": "a.csv"},{"path": "b.csv"}]

What do we do about repeated headers e.g. in CSV files? Not so sure. I feel this is properly a separate option, e.g. stripChunkHeaders, which would mean you take the first line of all but the first chunk and throw it away (the first line being everything up to \n or \r\n). Thoughts welcome.

My feeling is to default to stripping the first line in subsequent files the way csvstack does. After all, what is a CSV file with no header :)? In that case, of course, we would need a csvsplit utility.
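A sketch of what such a csvsplit utility might do (csvsplit is hypothetical; it follows the csvstack convention of repeating the header in every part):

import csv
from itertools import islice

def csvsplit(path, rows_per_part=100000):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # each loop iteration pulls the first data row of the next part
        for n, first_row in enumerate(reader, 1):
            with open(f"{path}.part{n}.csv", "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)  # every part is a valid CSV
                writer.writerow(first_row)
                writer.writerows(islice(reader, rows_per_part - 1))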

What happens if one (or more) of path / url / data are present along with chunks

Reiterating @rgrp: "yes chunks would mean that the other options could not be present"


* IMO, inline data does not make sense to "chunk". Plus, if we are overloading the key, data can already be a string, object, or array of values.

@rufuspollock
Contributor
  • Naming: @jpmckinney great suggestion re parts.
  • Overloading vs distinct name: @danfowler + @pwalsh - I am starting to think we could go for the array approach on path and url.
  • Stripping headers: I think we want an explicit boolean property so that it is not left to guessing by external tools (though such tools are welcome to check / guess).

@all: really welcome your vote / view on overloading url and path vs creating parts.

@pwalsh
Member
pwalsh commented Feb 3, 2016
  • +1 on overloading for url, path and data

I add data because it should have the same API as url and path IMHO.

@danfowler
Contributor

@pwalsh for consistency's sake, I agree, but then I think we might have to revisit data because it acts so differently when compared to path and url (e.g. data can already be an array).

@pwalsh
Member
pwalsh commented Feb 4, 2016

@danfowler

In Tabular Data Package, the payload from a file (via url or path) is an array, so conceptually it is the "same". For DataPackage in general, pretty much the same principle would apply (items in an array, but not necessarily that each item is also an array).

In any event, it doesn't seem to me that data really behaves differently on this level (path and url are pointers to "data"; data is the resolved "data"). To actually interact with data in a DataPackage you need to de-reference the url and path pointers, whereby they end up being the same as data.
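A sketch of that dereferencing argument in code (illustrative names only, not the API of any datapackage library):

import urllib.request

def resolve_data(resource):
    # path and url are pointers to data; dereferencing them yields the
    # same kind of thing as inline data.
    if "data" in resource:
        return resource["data"]
    if "url" in resource:
        with urllib.request.urlopen(resource["url"]) as response:
            return response.read().decode("utf-8")
    if "path" in resource:
        with open(resource["path"], encoding="utf-8") as f:
            return f.read()
    raise ValueError("resource has no data source")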

@pwalsh
Member
pwalsh commented Mar 7, 2016

@rgrp @danfowler any moves towards resolution on this issue?

@pwalsh
Member
pwalsh commented Jul 12, 2016

@rgrp

where should we take this now? With path and url becoming just path, overloading path to be an array of parts is still the simplest option to me.

@roll roll added the backlog label Aug 8, 2016
@rufuspollock
Contributor

OK I will start a WIP on a draft text - thoughts welcome.

### A Resource in Multiple Files ("Chunked Resources")

Usually a resource has a single file associated with it containing the data for that resource.

Sometimes, however, it may be convenient to have a single resource whose data is split across
multiple files -- perhaps the data is large and having it in one file would be inconvenient.

To support this use case we allow the `path` property to be an array of file paths rather
than a single file path.
@roll roll removed the backlog label Aug 29, 2016
@danfowler
Contributor

Here's an example in the wild from @GregoryRhysEvans of the path-as-array approach. Knowing nothing else about it, this seems to prove out the intuitiveness of the approach.

https://gist.github.com/GregoryRhysEvans/7c07337abbefd64ac06562d4e5a54243

...
  {
      "name": "opensnmp",
      "format": "csv",
      "compression": "gz",
      "path": [
        "snmp-data/snmp-data.20141014.csv.gz",
        "snmp-data/snmp-data.20141021.csv.gz"
      ],
      "schema": {
        "fields": [
          {
            "type": "datetime",
            "name": "timestamp",
            "description": "Date and time of observation"
          },
          {
            "type": "integer",
            "name": "risk_id",
            "description": "ID from Postgres DB representing risk type"
          },
          {
            "type": "string",
            "name": "ip",
            "description": "IP address related to observation"
          },
          {
            "type": "integer",
            "name": "asn",
            "description": "Autonomous Network Number"
          },
          {
            "type": "string",
            "name": "cc",
            "description": "ISO ALPHA-2 country code"
          }
        ]
      }
  }
...
@rufuspollock rufuspollock added this to the Current milestone Sep 27, 2016
@jpmckinney
Member

@rgrp's suggestion looks fine to me.

@jpmckinney jpmckinney added the blocker label Sep 28, 2016
@pwalsh
Member
pwalsh commented Nov 10, 2016

@rgrp I think this is solid and extremely useful. Can you do a PR and let's take it from there?

@roll roll removed the blocker label Nov 16, 2016
@rufuspollock rufuspollock self-assigned this Nov 17, 2016
@rufuspollock
Contributor

The one remaining question is what to do with headers:

My sense here is that in Data Package the behaviour is naive: simple concatenation. In Tabular Data Package there could be a concept of skipInitialRows or headerRow etc. - see #326

@roll roll added duplicate and removed duplicate labels Dec 6, 2016