New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Foreign Key attribute in JSON schema #23

Closed
rufuspollock opened this Issue Feb 15, 2013 · 49 comments

Comments

Projects
None yet
10 participants
@rufuspollock
Contributor

rufuspollock commented Feb 15, 2013

Suggest in a type field:

foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...
  // id / name of the dataset
  file: ...
  # id of the field in the referenced table
  field: ...
}
@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Feb 15, 2013

@mk270 comments please :-)

@mk270

This comment has been minimized.

Contributor

mk270 commented Mar 7, 2013

As discussed, we should handle this outside the table schema.

@trestletech

This comment has been minimized.

Contributor

trestletech commented Apr 3, 2013

Can either of you point me to a doc explaining the proper way to do this, if not directly embed it in the Table Schema? It seems like a useful feature and I'm not sure of the best way to accomplish something like this.

Our use case is a bit different since we're using REST web services, so we're considering either embedding small factors directly into the Schema (a la #29) or providing a reference to the REST call which would provide such a mapping ({"id":"color", "rest":"/colors/"}). I realize that particular attribute would be outside of any spec, but if there were an encouraged way to reference another Schema, perhaps we could use that more heavily in the R package then customize from there.

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Apr 3, 2013

I think this and primary keys #21 should probably go in.

@trestletech

This comment has been minimized.

Contributor

trestletech commented Apr 3, 2013

I like it. The more I think about it, the REST-based approach I mentioned would actually fit within your proposed solution. If I were to host a datapackage.json file at the root of my webapp, all of my data/file references could just be relative URLs defining the GET/list functions for each type of object I'm interested in. Assuming my REST service produces strictly JSON Table Schema, we could end up with a pretty clean solution.

What's the best next step for this? We're trying to move pretty quickly on our client application, so I'd be happy to write up the docs and submit them in a pull request as we continue our development if we're in agreement that this should happen within the scope of this spec? If not, I'll just document it internally and save myself some time. Let me know...

@trestletech

This comment has been minimized.

Contributor

trestletech commented Apr 17, 2013

FYI: The RODProt project now supports foreign keys in the following format:

foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...
  // id / name of the dataset
  resource: ...
  # id of the field in the referenced table
  field: ...
}

Note that we're now using resource instead of file. Since IDs are not guaranteed to be provided for a given resource, we use the following logic to map a resource attribute to another resource in the specified DataPackage:

  1. If the resources in this DataPackage are a named hash, check in those names for the resource id.
  2. Check the id field of each resource for a match.
  3. Check the name field of each resource for a match.
  4. Check the path and url field of each resource for a match.

As soon as a single match is found at any of these steps, no more searching is done. (As you can see, I think it would be easier for clients to implement this feature once this is pinned down in the spec).

In R, there's not an obvious way to preserve the underlying ID of an object and still map it to another abstract object ("hash" in JS or "list" in R) within the context of a data.frame, so we support this fairly naively at this point by only supporting mapping integer IDs to a single field (which get cast to a string, if its not already). The details (excerpt from https://github.com/QBRC/RODProt/blob/master/R/incorporate-foreign-keys.R) are copied below:

The behavior is as follows: if there is only one column in the referenced resource and that column
matches the \code{ID} given for the foreignkey, that column will be used as both the IDs and
the levels of the factor -- meaning that any values defined in this external table will map
to themselves, and any other values not in the table will map to \code{NA}. If the table has two
columns, one of which matches the \code{ID} value of the foreignkey, then the \code{ID} column
will be used for the IDs, and the second column, regardless of name, will be used as the levels
for the factor. If there are more than two columns, then the column to be used as the name must
be specified using the \code{name.column} parameter.

I created an issue to support more robust mappings of IDs to complex objects (QBRC/RODProt#12), but it will require some engineering to get that functionality in place.

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Apr 18, 2013

@trestletech I really like this. Where does this go inside the JTS? Do you put it parallel to fields?

@trestletech

This comment has been minimized.

Contributor

trestletech commented Apr 18, 2013

We went within fields to more clearly identify which source field was associated with the foreign key. For instance (https://github.com/QBRC/RODProt/blob/master/inst/extdata/datapackage.json#L58)

[...
  "path": "data3.json",
  "id": "data3",
  "schema": {
    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"integer",
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
         }
      },{
        "id":"B",
        "label":"Column B",
        "type":"string"
      }
    ]
  }
...
]

(in this case package is a relative path, but I imagine it would typically be a URL -- just harder to unit test a package on a remote URL).

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Apr 19, 2013

Interesting. How would you handle multiple foreign keys (is that possible)?

I'm in 2 minds whether we inline or separate into a separate foreignkeys attribute (more like SQL).

  • SQL approach: nicely partitionable from the fields list.
  • Inline: natural simplicity

Best thing here is you have running code which is what matters!

@trestletech

This comment has been minimized.

Contributor

trestletech commented Apr 19, 2013

I'm not sure which way you intend "multiple." To me, multiple foreign keys (single-columned FKs on multiple columns) could be handled well in this approach (indeed, we're using that pretty heavily in our application). To extend the previous example:

[...
  "path": "data3.json",
  "id": "data3",
  "schema": {
    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"integer",
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
         }
      },{
        "id":"B",
        "label":"Column B",
        "type":"string",
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "anotherData",
            "id": "B"
         }
      }
    ]
  }
...
]

If you mean a single column referencing multiple other resources simultaneously, I suppose you could support a foreign key array?

[...
  "path": "data3.json",
  "id": "data3",
  "schema": {
    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"integer",
         "foreignkey": [{
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
           },{
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
           }
         ]
      }
  ]
]

(pardon any syntax errors). To be honest, though, I don't think I've ever seen that use case -- maybe it's more common in other fields.

Finally, if you meant composite foreign keys, then I agree that would be a limitation of this style. In my experience, it seems like the trend (for better or worse) has been to move away from composite keys where possible, though. I think at least part of this is motivated by the rise of web applications and the need to more easily de/serialize data. Most of my ORM experience has been with Hibernate, so I can only speak to that library, but I know support for Composite keys is pretty sloppy. Essentially, you typically need to embed your composite key as a single object so that HIbernate can treat it as a single reference to another table.

So instead of:

public class Person(){
  String firstName; //foreign key part1
  String lastName; //foreign key part2
  int height;
}

you'd have to use

public class NamedPerson(){
  String firstName; //foreign key part1
  String lastName; //foreign key part2
}

public class Person(){
  @EmbeddedId
  NamedPerson names; //now your foreign key is a single object
  int height;
}

I suppose you could take a similar approach and specify that the type for this field is actually another embedded JSON Table Schema (again, not thinking thoroughly through the precise syntax):

    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"embedded", // or "any" or some other type 
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "anotherData",
            "id": "nameObjCol"
         },
         "fields": [{
             "id":"first",
             "label": "FirstName",
             "type": "string"
           },{
             "id":"last",
             "label": "LastName",
             "type": "string"                 
           }
         }
       }           
     ]

I'm not sure how well that would work or if it would be worth the complexity, but it would be one approach.

I may just be biased by our application, but if I'm already serializing the data to JSON, I would have long given up the luxury of "fancy" relational mappings involving composite keys. Perhaps I'm in the minority there, though. I'm not sure about other common client languages like Python, but I know in R the support for a concept of a foreign key is so simplistic that it would require a pretty foundational re-write of some of the base data structures in the language to even get something like a composite foreign key into the language.

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Apr 21, 2013

I think we are making great progress here. I think there is agreement on:

  • At most one foreignkey per field
  • Inlining foreignkey on the field attribute
  • I think we should probably introduce a name field on resources - i will comment more in #32

In terms of the foreignkey hash I would suggest a minor tweak to rename id to field:

   {
      "pkg": "../extdata/datapackage.json",
      "resource": "anotherData",
      "field": "id-of-field"
   }

Questions:

  • pkg: should we rename to more explicit datapackage? (Probably no)
  • pkg: what can it be? Suggest either:
    • a package name (resolvable by means we have not yet described!)
    • a relative path in unix style
    • a url
  • resource resolution is done by algorithm you describe above (though I suggest we simplify that via #32!)
@trestletech

This comment has been minimized.

Contributor

trestletech commented Apr 21, 2013

All sounds good to me. pkg seems reasonable to me, but I don't think it would be too inconvenient were it expanded.

@davidmiller

This comment has been minimized.

davidmiller commented Apr 30, 2013

My immediate use case for data packages was discoverability of other datasets, at which point multiple foreign key relationships becomes v.useful...

e.g. for drug relationships, the BNF code is a foreign key to

  • Prescribing data
  • Research papers on this drug
  • Recent health warnings/new side effects issued (Yellow card)
  • Patent expiry and grants
  • OpenBNF formulary dosage information
  • Individual hospital / commissioning group formulary notes

Being able to hit each of these in an explorer would allow me to get all the related metadata from different publishers...

In principle.

@gthb

This comment has been minimized.

Contributor

gthb commented Jun 17, 2013

A foreign key into another resource in the same package could simply omit the pkg attribute, right? So it would look like:

{
   "id": "otherdata",
   "foreignkey": {
      "resource": "anotherData",
      "field": "id-of-field"
   }
}
@sballesteros

This comment has been minimized.

sballesteros commented Oct 16, 2013

Any news on this issue ?
Is what follows the current consensus ?


{
  "name": "resource_A",
  "schema":{
    "fields":[
      {
        "name": "X"
        "foreignkey": {
          "pkg": "datapackage_dep",
          "resource": "resource_X",
          "field": "field_X"
        }
      },
      {
        "name": "Y"
        "foreignkey": {
          "pkg": "datapackage_dep",
          "resource": "resource_X",
          "field": "field_Y"
        }
      }
    ]
  }
}

Thanks!

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Oct 16, 2013

OK, final recommendation is:

    "foreignkey": {
      "datapackage": "datapackage_dep",
      "resource": "resource_X",
      "field": "field_X"
    }

With:

  • datapackage: [optional] a url or a single name reference. If a single name it will be up to client to resolve that to a datapackage (we will not allow relative path references). If not supplied then we resolve for a resource within this datapackage.json
  • resource: [required] name of the resource within the datapackage
  • field: [required] the name of the field within the resource

Note rename of pkg attribute to datapackage

Example:

{
  "name": "resource_A",
  "schema":{
    "fields":[
      {
        "name": "X"
        "foreignkey": {
          "datapackage": "datapackage_dep",
          "resource": "resource_X",
          "field": "field_X"
        }
      }, ...
    ]
  }
}
@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Nov 29, 2013

@paulfitz interested in your thoughts here especially given your comments on primary key and experience with coopy ...

I'd like to get this and the primary key stuff closed and into JTS asap ...

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Nov 29, 2013

Why not make datapackage always be a URL? If an implementation doesn't want to link to something accessible over HTTP, they can just use a custom scheme like custom://domain/path which is still a URL.

Also, why foreignkey not foreign_key?

As in #21, I think a separate section for constraints would be best, in case other constraints are implemented later.

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Nov 30, 2013

@jpmckinney all good points. For foreign keys which are already quite complex I guess this makes sense. I suppose for other items e.g. even unique or primary key it makes thinks more hasslesome to write.

Overall: I get the feeling that a good set of people want constraints outside of the fields in a separate section a la sql.

@paulfitz your thoughts either way would be very useful.

@sballesteros ditto ...

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Nov 30, 2013

I think a composite (multi-column) primary key is much easier to read if all the columns appear side-by-side, e.g.:

{
  "fields": [
    ...
  ],
  "indexes": [
    {
      "type": "primary",
      "fields": ["column1", "column10"]
    }
  ]
}

If you see a "primary_key" property on the first field in a datapackage.json file, you might assume that field is the primary key, when in fact it's composite with a field that appears later in the definition.

@sballesteros

This comment has been minimized.

sballesteros commented Nov 30, 2013

EDIT: This was based on a wrong understanding of foreign keys. What follows is largely irrelevant for this issue and is closer to materialized view.

Having foreignkeys in the fields array is convenient to do "joints" / merge.
For instance if there is a resource myresource in a data package called dpkg1 listing dpkg2 as dependencies:

"name": "dpkg1",
"version": "0.0.0",
"dependencies": {
    "dpkg2": "0.0.1"
},
"resources": [
    {
      "name": "myresource",
      "schema": {
        "fields":[
          {
            "name": "date",
            "foreignkey": {"datapackage": "dpkg2", "resource": "data", "field": "date"}
          },
          {
            "name": "seq",
            "foreignkey": {"datapackage": "dpkg2", "resource": "seq", "field": "seq"}
          },
          {
            "name": "x",
            "type": "integer"
          }
        ]
      },
      "data": [
        {"x": 10},
        {"x": 20},
        {"x": 30},
        {"x": 40}
      ]
    }

data referred by the foreignkeys are not available in dpkg1 (the resource data only contains x and not date and seq). To get date and seq, one need to go into dpkg2 and extract them from here (note that this is recursive as dpkg2 can also have foreignkeys...).
In this scenario it is convenient to know the order of the fields and have the foreignkeys right into schema.fields.

So to me foreignkeys allow to recreate composite JSON Table Schema from fields originating from different data packages.

Generalizing outside JSON Table Schema it would be good to have a discussion on how to use the top-level property dependencies or dataDependencies of a datapackage.
Maybe we should think of a more generic solution of "importing" resources (from x import y as z used in Python or var y = require('x') in node.js.).

If we stick with the current foreignkey hash and generalize it for non SDF resources we just need to omit the field property.
For instance we could have a resource y in dpkg1 coming from dpkg2:

"name": "dpkg1",
"version": "0.0.0",
"dependencies": {
    "dpkg2": "0.0.1"
},
"resources": [
  {
    "name": "y",
    "foreignkey": {"datapackage": "dpkg2", "resource": "x"}
  }

Here "foreignkey" is an extra possible property in addition to "data", "path" or "url".

If it is the intended use maybe "foreignkey" is not the best possible word.
What I really like about this usage of "foreignkey" is that it allows to repackage resources coming from different datapackages into one datapackage (tailored to someone specific data need) with full versioning support and without any data duplication.

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Nov 30, 2013

@sballesteros That's not how foreign keys work... the data for myresource would still have values for date and seq. A foreign key is just a constraint that can be enforced to ensure that the values in date and seq exist in the referenced columns of a resource in dpkg2. It's a way to check data consistency.

What you're describing is an entirely different feature, which may be worthwhile, but is not what a foreign key is. What you're describing is closer to a materialized view, that joins multiple tables together.

@sballesteros

This comment has been minimized.

sballesteros commented Nov 30, 2013

@jpmckinney thanks for clarifying. So what I suggest is more related to the way dependencies will be handled. Got it now... Foreign key is a constraint to ensure that the values taken by the foreignkey are elements of the resource field it points to.

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Dec 1, 2013

I think constraints is more appropriate. A primary key constrains that field to be unique, as does a unique key. Indexes are for avoiding table scans, and such optimizations are usually use case-specific. For example, you might index on a person's name, but not have any uniqueness or other constraints on that name. It just so happens that most DBs automatically create indexes for most constraints, hence the potential confusion.

I prefer constraints to be a simple array, where each object has a type key equal to "foreign key", "primary key" or "unique key". Unless there is a good conceptual reason for adding structure like nesting, in general I prefer a more flat structure (a conservative, "built-it-when-you-need-it" approach). In either case, though, implementations would just loop through the constraints once, and read in whatever constraints they support.

Re: using the string "foreign key" as a value, etc.: If people prefer using underscores instead of spaces, I'm fine with that. I don't think we should remove the space - it's easier to read words when they are properly separated, and no one is crying over the extra 1 byte used for each space/underscore.

If using foreign key as a key instead of a value, as in the parallel example, then I think we should simply conform to whatever the rest of datapackage.json does. datapackage_version and last_modified use underscore style. If we want to switch to camel-case to be more like commonjs, then we should open a new ticket and do it for all properties at once.

@sballesteros

This comment has been minimized.

sballesteros commented Dec 1, 2013

+1 for constraints.
I like the single array idea but I am not a huge fan of spaces for type values.
In an implementation, if you create a hash to handle all the types, for some languages (e.g JS) having a space in the keys is annoying (you cannot use hash.key and have to use hash[key]).

More generally for the data package keys, I think switching to camelCase to be more like CommonJS would be nice, especially if datapackage.json is renamed package.json.

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Dec 1, 2013

@sballesteros If we put the types as keys, then yes, using underscores or camelcase is better. But if we're putting the types as values, then it makes no difference. Edit: Nevermind, I misunderstood what you meant by creating a hash.

@sballesteros

This comment has been minimized.

sballesteros commented Dec 1, 2013

@jpmckinney I was talking about an implementation. Using a hash with types values as keys to avoid some if statements.

@paulfitz

This comment has been minimized.

Contributor

paulfitz commented Dec 2, 2013

+1 for parallel option. @jpmckinney's array suggestion seems reasonable. Agree with @sballesteros about avoiding spaces. Could also rename constraints to keys and give them type foreign.

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Dec 5, 2013

Another +1 from @besquared in #21 for keys.

@besquared

This comment has been minimized.

Contributor

besquared commented Dec 5, 2013

Going to repost this here since apparently #21 isn't the right place anymore.

Howdy guys I just wanted to add my recommendation here based on a few things.

  1. Keys are intrinsic descriptions of a dataset and not a field

I don't think adding any kind of key description at the field level is correct. I don't think It's a matter of shorthand either as structurally all keys are properties of a datasets and not a field.

  1. Keys aren't indices and aren't necessarily constraints either

Indices are a database specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either although generally implementations constrain primary keys to be unique. One vendor specific example is that the InnoDB engine allows both non-unique indices and foreign keys to non-unique indices.

IMO If the specification's goal is to be producer and consumer agnostic then the it wouldn't include references to the concept of an index and leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag as it might be seen as a description of the key's relationship to the dataset and not a constraint that the package is defining.

As far as how to implement a specification for keys I think the most generic form would try to only include things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:

{
  "schema: {
    "fields": [...]
  },

  "keys": [
    {
      "type": "primary",
      "unique": true, // for the convenience of the consumer
      "fields": ['field1', 'field2', ...]
    },
    {
      "type": "foreign",
      "fields": ['field3', ...],
      "references": {
         "url": <datapackage.json>, // url of datapackage.json
         "resource": <resource>, // name of the resource in the referenced package
         "fields": ['field1', ...] // ordered respective to the ancestor's fields
       }
    }
  ]
@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Dec 5, 2013

My one objection to the array approach for keys/constraints is that it makes it more painful for consumers - rather than just doing:

resources[0].schema.constraints.primarykey.fields

i have to iterate through the list of constraints.

@besquared re constraints / keys in naming I feel a little ambivalent. I think it would be nice to include unique requirements in their at some point and they definitely aren't keys ... (however this isn't a biggie)

I do think generally that keys/constraints should live inside the schema attribute and be part of json table schema (@besquared your example suggested otherwise but i was not sure if that was intentional ...)

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Dec 5, 2013

@rgrp For all except the primary key, you'll still need to iterate, as only the primary key will have a single key object as its value. The others will have arrays. If your app loads all keys, you'll have a double-nested iteration if the constraints are an object, instead of a simple iteration if the constraints are an array. It depends on what use cases you want to prioritize.

Also, can we have separation between words, e.g. primary_key?

Also, unique keys are keys (it's in the name). How are they "definitely not keys"? @besquared is correct in saying that constraints in the language of DBs are something different, e.g. you can put a constraint at the field-level on a price field to say that it must have a value greater than zero. NOT NULL is maybe the most common constraint. "Keys" is the more universal DB word for what we're talking about.

@besquared

This comment has been minimized.

Contributor

besquared commented Dec 5, 2013

+1 for everything @jpmckinney is saying. Also @rgrp I originally did think keys belongs to the resource and not the schema but I take it back now they obviously belong in the schema. To draw some inspiration from Codd's relational model something like the format I proposed above allows us to correctly express the three types of relational constraints.

@rgrp other options include a keys section keyed by type

{
  "schema: {

    "fields": [...]

    "keys": {

      "primary": {
        "fields": ['field1', 'field2', ...]
      },

      "foreign": [
        {
          "fields": ['field3', ...],
          "references": {
            "url": <datapackage.json>, // url of datapackage.json
            "resource": <resource>, // name of the resource in the referenced package
            "fields": ['field1', ...] // ordered respective to the ancestor's fields
        }
      ]

    }

  }
}

This seems fine too and allows the kind of object access one would expect (resource.schema.keys.primary) while still allowing multiple foreign keys and allows the keys section to remain pretty robust to extension later.

@besquared

This comment has been minimized.

Contributor

besquared commented Dec 5, 2013

This might also be a good time to talk about foreign keys vs. references. In the relational model foreign keys imply a constraint in which the foreign relation must have a corresponding value for the keyed fields in order for a new relation to be added. This allows referential integrity between relations to be maintained.

If what the specification is trying to achieve is a way to link data together (but not necessarily imply a constraint) then we might want to have a section called "references" instead of a set of foreign keys. Consumers who would like to resolve reference packages and apply foreign key constraints in their data management system may choose to do so but wouldn't be required to.

This leads to another possible schema layout:

{
  "schema: {

    "fields": [...]

    "primary_key": {
      "fields": ['field1', 'field2', ...]
    }

    "references": [
      {
        "url": <datapackage.json>, // url of datapackage.json
        "resource": <resource>, // name of the resource in the referenced package
        "required": true, // require the referenced row to exist in the foreign resource?
        "fields": {
          'field1': 'field4',
          'field3': 'field5'
        } // map between this resource and the foreign resource
      }
    ]
  }
}

I like this as well for various reasons. It leaves the concept of constraints for a later time and still allows consumers to optionally infer foreign key constraints.

Thoughts?

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Dec 7, 2013

So along with @jpmckinney's proposal we're going with out of line items. Here's the updated proposal:

{
  "name": "resource-a",
  "schema":{
    "fields":[
      {
        "name": "X"
      }, ...
    ],
    "foreignKeys": [
        {
          "fields": string or array of fields on local resource
          "reference": {
            "datapackage": "datapackage_dep",
            "resource": "resource_X",
            "fields": ["field_X"]
          }
        }
      ]
    ]
  }
}

Questions:

  • foreignKeys vs references vs relations vs foreign as name for attribute - votes please (+1,+0,0, -0,-1)
  • camelcase vs underscores (foreignKey vs foreign_key)

rfc @paulfitz @jpmckinney @besquared @davidmiller @sballesteros

@jpmckinney

This comment has been minimized.

Member

jpmckinney commented Dec 8, 2013

  • I prefer foreign keys to "relations", etc. because it is a label that precisely refers to the existing concept.
  • datapackage_version and last_modified use underscores. Do any current properties use camelcase? The case should be consistent across all properties, which for now would mean sticking to underscores. I don't mind switching everything to camelcase, but then we'd have to change at least datapackage_version and last_modified.
@besquared

This comment has been minimized.

Contributor

besquared commented Dec 8, 2013

There's some mixing here between #21 and #23.

Everything should just standardize on camelCase.

Given that it's likely that primaryKey will become a thing it seems reasonable that foreignKeys should become the other thing. I think it's important to tell people that this might not have the same meaning that it does in their rdbms. This seems ok.

@paulfitz

This comment has been minimized.

Contributor

paulfitz commented Dec 9, 2013

+1 for foreignKeys and lowerCamelCase.

@sballesteros

This comment has been minimized.

sballesteros commented Dec 9, 2013

+1 for foreignKeys and lowerCamelCase too.

@trickvi

This comment has been minimized.

Contributor

trickvi commented Jan 22, 2014

This has become a pretty big feature for me to get into the specification. Any ideas about ETA @rgrp?

@trickvi

This comment has been minimized.

Contributor

trickvi commented Jan 23, 2014

One idea I have which would be a nice to have is to create some sort of a fallback sequence. It might make everything more complicated but what I'm thinking is that resource could be an array of resources (most significant resource on the left) and when looking up the key it would just go through the resources in that order until it finds the key being searched for.

This means I can have multiple resources for datasets published in different phases. I'm thinking about a case for budget data where countries have different budget phases. The problematic one is when we have an approved budget which can be adjusted at a later point in time. Adjustments are not necessarily a completely new dataset but changes to different rows of the approved budget.

Let's say we represent this with two resources (marking them with dates for clarity): approved-2013-12-25.csv and adjusted-2014-01-22.csv. This problem could be solved by just making adjusted-2014-01-22.csv contain the whole budget with changes (so it also contains rows which haven't changed). For this the foreign key resource would just be the most recent budget resource. This means that if there are a few adjustments made we'll be copying a lot of the same rows between resources. To compare changes you would have to do a diff.

Another option would be to make adjusted-2014-01-22.csv only contain lines that have changed, with the same schema (fields, primary keys, and foreign keys) and define the resources for the foreign key with an array. Then when trying to reference an item in the budget it would first try to find it in adjusted-2014-01-22.csv and if not found go through approved-2013-12-25.csv. This reduces amount of redundancy and makes it clearer what the changes are but this makes the implementation of foreign keys more difficult.

Just an idea. I'm glad to drop it and use up more disk space if that means we can get foreign keys in as soon as possible.

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Jan 23, 2014

@tryggvib at this point I doubt we're going to expand for this use-case. Right now we have a finalized spec and I think it will go in.

@trickvi

This comment has been minimized.

Contributor

trickvi commented Jan 23, 2014

Great. Happy to see this go in as long as I have foreign keys :)

@trickvi

This comment has been minimized.

Contributor

trickvi commented Jan 23, 2014

Another use case I came up with though just now was that this would enable you to break up resources to more manageable sizes. For example one resource per year and some foreign key in another resource could include the year which could fallback through the resources broken down by years.

@rufuspollock

This comment has been minimized.

Contributor

rufuspollock commented Feb 25, 2014

FIXED. Foreign Key support is now in - see http://dataprotocols.org/json-table-schema/#foreign-keys

Huge thank-you to everyone who contributed here and please double check for the final implemented version for any errors that need to be corrected or emendations that should be made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment