Foreign Key attribute in JSON schema #23

Closed
rufuspollock opened this Issue Feb 15, 2013 · 49 comments

Projects

None yet

10 participants

@rufuspollock
Contributor

Suggest in a type field:

foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...
  // id / name of the dataset
  file: ...
  # id of the field in the referenced table
  field: ...
}
@rufuspollock
Contributor

@mk270 comments please :-)

@mk270
Contributor
mk270 commented Mar 7, 2013

As discussed, we should handle this outside the table schema.

@trestletech
Contributor

Can either of you point me to a doc explaining the proper way to do this, if not directly embed it in the Table Schema? It seems like a useful feature and I'm not sure of the best way to accomplish something like this.

Our use case is a bit different since we're using REST web services, so we're considering either embedding small factors directly into the Schema (a la #29) or providing a reference to the REST call which would provide such a mapping ({"id":"color", "rest":"/colors/"}). I realize that particular attribute would be outside of any spec, but if there were an encouraged way to reference another Schema, perhaps we could use that more heavily in the R package then customize from there.

@rufuspollock
Contributor

I think this and primary keys #21 should probably go in.

@trestletech
Contributor

I like it. The more I think about it, the REST-based approach I mentioned would actually fit within your proposed solution. If I were to host a datapackage.json file at the root of my webapp, all of my data/file references could just be relative URLs defining the GET/list functions for each type of object I'm interested in. Assuming my REST service produces strictly JSON Table Schema, we could end up with a pretty clean solution.

What's the best next step for this? We're trying to move pretty quickly on our client application, so I'd be happy to write up the docs and submit them in a pull request as we continue our development if we're in agreement that this should happen within the scope of this spec? If not, I'll just document it internally and save myself some time. Let me know...

@trestletech
Contributor

FYI: The RODProt project now supports foreign keys in the following format:

foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...
  // id / name of the dataset
  resource: ...
  # id of the field in the referenced table
  field: ...
}

Note that we're now using resource instead of file. Since IDs are not guaranteed to be provided for a given resource, we use the following logic to map a resource attribute to another resource in the specified DataPackage:

  1. If the resources in this DataPackage are a named hash, check in those names for the resource id.
  2. Check the id field of each resource for a match.
  3. Check the name field of each resource for a match.
  4. Check the path and url field of each resource for a match.

As soon as a single match is found at any of these steps, no more searching is done. (As you can see, I think it would be easier for clients to implement this feature once this is pinned down in the spec).

In R, there's not an obvious way to preserve the underlying ID of an object and still map it to another abstract object ("hash" in JS or "list" in R) within the context of a data.frame, so we support this fairly naively at this point by only supporting mapping integer IDs to a single field (which get cast to a string, if its not already). The details (excerpt from https://github.com/QBRC/RODProt/blob/master/R/incorporate-foreign-keys.R) are copied below:

The behavior is as follows: if there is only one column in the referenced resource and that column
matches the \code{ID} given for the foreignkey, that column will be used as both the IDs and
the levels of the factor -- meaning that any values defined in this external table will map
to themselves, and any other values not in the table will map to \code{NA}. If the table has two
columns, one of which matches the \code{ID} value of the foreignkey, then the \code{ID} column
will be used for the IDs, and the second column, regardless of name, will be used as the levels
for the factor. If there are more than two columns, then the column to be used as the name must
be specified using the \code{name.column} parameter.

I created an issue to support more robust mappings of IDs to complex objects (QBRC/RODProt#12), but it will require some engineering to get that functionality in place.

@rufuspollock
Contributor

@trestletech I really like this. Where does this go inside the JTS? Do you put it parallel to fields?

@trestletech
Contributor

We went within fields to more clearly identify which source field was associated with the foreign key. For instance (https://github.com/QBRC/RODProt/blob/master/inst/extdata/datapackage.json#L58)

[...
  "path": "data3.json",
  "id": "data3",
  "schema": {
    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"integer",
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
         }
      },{
        "id":"B",
        "label":"Column B",
        "type":"string"
      }
    ]
  }
...
]

(in this case package is a relative path, but I imagine it would typically be a URL -- just harder to unit test a package on a remote URL).

@rufuspollock
Contributor

Interesting. How would you handle multiple foreign keys (is that possible)?

I'm in 2 minds whether we inline or separate into a separate foreignkeys attribute (more like SQL).

  • SQL approach: nicely partitionable from the fields list.
  • Inline: natural simplicity

Best thing here is you have running code which is what matters!

@trestletech
Contributor

I'm not sure which way you intend "multiple." To me, multiple foreign keys (single-columned FKs on multiple columns) could be handled well in this approach (indeed, we're using that pretty heavily in our application). To extend the previous example:

[...
  "path": "data3.json",
  "id": "data3",
  "schema": {
    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"integer",
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
         }
      },{
        "id":"B",
        "label":"Column B",
        "type":"string",
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "anotherData",
            "id": "B"
         }
      }
    ]
  }
...
]

If you mean a single column referencing multiple other resources simultaneously, I suppose you could support a foreign key array?

[...
  "path": "data3.json",
  "id": "data3",
  "schema": {
    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"integer",
         "foreignkey": [{
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
           },{
            "pkg": "../extdata/datapackage.json",
            "resource": "data",
            "id": "A"
           }
         ]
      }
  ]
]

(pardon any syntax errors). To be honest, though, I don't think I've ever seen that use case -- maybe it's more common in other fields.

Finally, if you meant composite foreign keys, then I agree that would be a limitation of this style. In my experience, it seems like the trend (for better or worse) has been to move away from composite keys where possible, though. I think at least part of this is motivated by the rise of web applications and the need to more easily de/serialize data. Most of my ORM experience has been with Hibernate, so I can only speak to that library, but I know support for Composite keys is pretty sloppy. Essentially, you typically need to embed your composite key as a single object so that HIbernate can treat it as a single reference to another table.

So instead of:

public class Person(){
  String firstName; //foreign key part1
  String lastName; //foreign key part2
  int height;
}

you'd have to use

public class NamedPerson(){
  String firstName; //foreign key part1
  String lastName; //foreign key part2
}

public class Person(){
  @EmbeddedId
  NamedPerson names; //now your foreign key is a single object
  int height;
}

I suppose you could take a similar approach and specify that the type for this field is actually another embedded JSON Table Schema (again, not thinking thoroughly through the precise syntax):

    "fields":[
      {
        "id":"A",
         "label":"Column A",
         "type":"embedded", // or "any" or some other type 
         "foreignkey": {
            "pkg": "../extdata/datapackage.json",
            "resource": "anotherData",
            "id": "nameObjCol"
         },
         "fields": [{
             "id":"first",
             "label": "FirstName",
             "type": "string"
           },{
             "id":"last",
             "label": "LastName",
             "type": "string"                 
           }
         }
       }           
     ]

I'm not sure how well that would work or if it would be worth the complexity, but it would be one approach.

I may just be biased by our application, but if I'm already serializing the data to JSON, I would have long given up the luxury of "fancy" relational mappings involving composite keys. Perhaps I'm in the minority there, though. I'm not sure about other common client languages like Python, but I know in R the support for a concept of a foreign key is so simplistic that it would require a pretty foundational re-write of some of the base data structures in the language to even get something like a composite foreign key into the language.

@rufuspollock
Contributor

I think we are making great progress here. I think there is agreement on:

  • At most one foreignkey per field
  • Inlining foreignkey on the field attribute
  • I think we should probably introduce a name field on resources - i will comment more in #32

In terms of the foreignkey hash I would suggest a minor tweak to rename id to field:

   {
      "pkg": "../extdata/datapackage.json",
      "resource": "anotherData",
      "field": "id-of-field"
   }

Questions:

  • pkg: should we rename to more explicit datapackage? (Probably no)
  • pkg: what can it be? Suggest either:
    • a package name (resolvable by means we have not yet described!)
    • a relative path in unix style
    • a url
  • resource resolution is done by algorithm you describe above (though I suggest we simplify that via #32!)
@trestletech
Contributor

All sounds good to me. pkg seems reasonable to me, but I don't think it would be too inconvenient were it expanded.

@davidmiller

My immediate use case for data packages was discoverability of other datasets, at which point multiple foreign key relationships becomes v.useful...

e.g. for drug relationships, the BNF code is a foreign key to

  • Prescribing data
  • Research papers on this drug
  • Recent health warnings/new side effects issued (Yellow card)
  • Patent expiry and grants
  • OpenBNF formulary dosage information
  • Individual hospital / commissioning group formulary notes

Being able to hit each of these in an explorer would allow me to get all the related metadata from different publishers...

In principle.

@gthb
Contributor
gthb commented Jun 17, 2013

A foreign key into another resource in the same package could simply omit the pkg attribute, right? So it would look like:

{
   "id": "otherdata",
   "foreignkey": {
      "resource": "anotherData",
      "field": "id-of-field"
   }
}
@sballesteros

Any news on this issue ?
Is what follows the current consensus ?


{
  "name": "resource_A",
  "schema":{
    "fields":[
      {
        "name": "X"
        "foreignkey": {
          "pkg": "datapackage_dep",
          "resource": "resource_X",
          "field": "field_X"
        }
      },
      {
        "name": "Y"
        "foreignkey": {
          "pkg": "datapackage_dep",
          "resource": "resource_X",
          "field": "field_Y"
        }
      }
    ]
  }
}

Thanks!

@rufuspollock
Contributor

OK, final recommendation is:

    "foreignkey": {
      "datapackage": "datapackage_dep",
      "resource": "resource_X",
      "field": "field_X"
    }

With:

  • datapackage: [optional] a url or a single name reference. If a single name it will be up to client to resolve that to a datapackage (we will not allow relative path references). If not supplied then we resolve for a resource within this datapackage.json
  • resource: [required] name of the resource within the datapackage
  • field: [required] the name of the field within the resource

Note rename of pkg attribute to datapackage

Example:

{
  "name": "resource_A",
  "schema":{
    "fields":[
      {
        "name": "X"
        "foreignkey": {
          "datapackage": "datapackage_dep",
          "resource": "resource_X",
          "field": "field_X"
        }
      }, ...
    ]
  }
}
@rufuspollock
Contributor

@paulfitz interested in your thoughts here especially given your comments on primary key and experience with coopy ...

I'd like to get this and the primary key stuff closed and into JTS asap ...

@jpmckinney
Member

Why not make datapackage always be a URL? If an implementation doesn't want to link to something accessible over HTTP, they can just use a custom scheme like custom://domain/path which is still a URL.

Also, why foreignkey not foreign_key?

As in #21, I think a separate section for constraints would be best, in case other constraints are implemented later.

@rufuspollock
Contributor

@jpmckinney all good points. For foreign keys which are already quite complex I guess this makes sense. I suppose for other items e.g. even unique or primary key it makes thinks more hasslesome to write.

Overall: I get the feeling that a good set of people want constraints outside of the fields in a separate section a la sql.

@paulfitz your thoughts either way would be very useful.

@sballesteros ditto ...

@jpmckinney
Member

I think a composite (multi-column) primary key is much easier to read if all the columns appear side-by-side, e.g.:

{
  "fields": [
    ...
  ],
  "indexes": [
    {
      "type": "primary",
      "fields": ["column1", "column10"]
    }
  ]
}

If you see a "primary_key" property on the first field in a datapackage.json file, you might assume that field is the primary key, when in fact it's composite with a field that appears later in the definition.

@sballesteros

EDIT: This was based on a wrong understanding of foreign keys. What follows is largely irrelevant for this issue and is closer to materialized view.

Having foreignkeys in the fields array is convenient to do "joints" / merge.
For instance if there is a resource myresource in a data package called dpkg1 listing dpkg2 as dependencies:

"name": "dpkg1",
"version": "0.0.0",
"dependencies": {
    "dpkg2": "0.0.1"
},
"resources": [
    {
      "name": "myresource",
      "schema": {
        "fields":[
          {
            "name": "date",
            "foreignkey": {"datapackage": "dpkg2", "resource": "data", "field": "date"}
          },
          {
            "name": "seq",
            "foreignkey": {"datapackage": "dpkg2", "resource": "seq", "field": "seq"}
          },
          {
            "name": "x",
            "type": "integer"
          }
        ]
      },
      "data": [
        {"x": 10},
        {"x": 20},
        {"x": 30},
        {"x": 40}
      ]
    }

data referred by the foreignkeys are not available in dpkg1 (the resource data only contains x and not date and seq). To get date and seq, one need to go into dpkg2 and extract them from here (note that this is recursive as dpkg2 can also have foreignkeys...).
In this scenario it is convenient to know the order of the fields and have the foreignkeys right into schema.fields.

So to me foreignkeys allow to recreate composite JSON Table Schema from fields originating from different data packages.

Generalizing outside JSON Table Schema it would be good to have a discussion on how to use the top-level property dependencies or dataDependencies of a datapackage.
Maybe we should think of a more generic solution of "importing" resources (from x import y as z used in Python or var y = require('x') in node.js.).

If we stick with the current foreignkey hash and generalize it for non SDF resources we just need to omit the field property.
For instance we could have a resource y in dpkg1 coming from dpkg2:

"name": "dpkg1",
"version": "0.0.0",
"dependencies": {
    "dpkg2": "0.0.1"
},
"resources": [
  {
    "name": "y",
    "foreignkey": {"datapackage": "dpkg2", "resource": "x"}
  }

Here "foreignkey" is an extra possible property in addition to "data", "path" or "url".

If it is the intended use maybe "foreignkey" is not the best possible word.
What I really like about this usage of "foreignkey" is that it allows to repackage resources coming from different datapackages into one datapackage (tailored to someone specific data need) with full versioning support and without any data duplication.

@jpmckinney
Member

@sballesteros That's not how foreign keys work... the data for myresource would still have values for date and seq. A foreign key is just a constraint that can be enforced to ensure that the values in date and seq exist in the referenced columns of a resource in dpkg2. It's a way to check data consistency.

What you're describing is an entirely different feature, which may be worthwhile, but is not what a foreign key is. What you're describing is closer to a materialized view, that joins multiple tables together.

@sballesteros

@jpmckinney thanks for clarifying. So what I suggest is more related to the way dependencies will be handled. Got it now... Foreign key is a constraint to ensure that the values taken by the foreignkey are elements of the resource field it points to.

@jpmckinney
Member

@sballesteros Duplication is useful if the third-party resource disappears. There are many other reasons. Update: My previous message was about a comment that was later edited. Yes, that's what foreign keys do.

@rufuspollock
Contributor

@paulfitz @jpmckinney @sballesteros @maxogden Here's a summary of 2 major alternatives. I think we are very close to a good decision here :-)

Agreed

Generally agreed on structure of a foreign key reference as:

{
  "datapackage": "datapackage_dep",
  "resource": "resource_X",
  "fields": ["field_X"]
}

With:

  • datapackage: [optional] if not provided we assume a reference to resource within this datapackage. Otherwise should have a structure of that used by npm to reference dependent packages - see https://npmjs.org/doc/json.html#dependencies (only difference being that instead of tarballs you can point to a directory online)
  • resource: [required] name of the resource within the referenced datapackage
  • fields: [required] string or array indicating referenced field(s) on referenced resource.

Implementation inside JSON Table Schema Structure

Option 1 - Parallel (not inline)

{
  "name": "resource-a",
  "schema":{
    "fields":[
      {
        "name": "X"
      }, ...
    ],
    constraints: {
      "foreign_keys": [
        {
          "name": # optional name for this foreign key
          "fields": string or array of fields on local resource
          "reference": {
            foreign-key reference as above
          }
        }
      ]
    ]
  }
}

Option 2 - Inline on fields

{
  "name": "resource-a",
  "schema":{
    "fields":[
      {
        "name": "X"
        "foreign_key": {
          # foreign key reference as above
        }
      }, ...
    ]
  }
}

Questions

Main question: which do we prefer? Do we allow both? At the moment I'm inclining towards out of band for foreign keys

Other questions:

  • indexes versus constraints for name of out of band item
  • constraints as an array with type attribute or keyed by type (so if we had primary keys it would have primary_key: ... as an attribute)
  • foreign_key vs foreignKey vs foreignkey
    • camel case seems more json-y (see commonjs devDependencies etc)
@jpmckinney
Member

It's impossible to inline an n-n foreign key (which you seem to want to support, otherwise, foreign key reference wouldn't have fields plural as an array). You can only describe such a key in the parallel option. If this is to be supported, then I think we should implement the feature as only the parallel option, and if we get real demand from real users for an inline option, it can be implemented later.

@rufuspollock
Contributor

@jpmckinney that's correct - inline you wouldn't be able to have n-n foreign keys.

@jpmckinney
Member

I think constraints is more appropriate. A primary key constrains that field to be unique, as does a unique key. Indexes are for avoiding table scans, and such optimizations are usually use case-specific. For example, you might index on a person's name, but not have any uniqueness or other constraints on that name. It just so happens that most DBs automatically create indexes for most constraints, hence the potential confusion.

I prefer constraints to be a simple array, where each object has a type key equal to "foreign key", "primary key" or "unique key". Unless there is a good conceptual reason for adding structure like nesting, in general I prefer a more flat structure (a conservative, "built-it-when-you-need-it" approach). In either case, though, implementations would just loop through the constraints once, and read in whatever constraints they support.

Re: using the string "foreign key" as a value, etc.: If people prefer using underscores instead of spaces, I'm fine with that. I don't think we should remove the space - it's easier to read words when they are properly separated, and no one is crying over the extra 1 byte used for each space/underscore.

If using foreign key as a key instead of a value, as in the parallel example, then I think we should simply conform to whatever the rest of datapackage.json does. datapackage_version and last_modified use underscore style. If we want to switch to camel-case to be more like commonjs, then we should open a new ticket and do it for all properties at once.

@sballesteros

+1 for constraints.
I like the single array idea but I am not a huge fan of spaces for type values.
In an implementation, if you create a hash to handle all the types, for some languages (e.g JS) having a space in the keys is annoying (you cannot use hash.key and have to use hash[key]).

More generally for the data package keys, I think switching to camelCase to be more like CommonJS would be nice, especially if datapackage.json is renamed package.json.

@jpmckinney
Member

@sballesteros If we put the types as keys, then yes, using underscores or camelcase is better. But if we're putting the types as values, then it makes no difference. Edit: Nevermind, I misunderstood what you meant by creating a hash.

@sballesteros

@jpmckinney I was talking about an implementation. Using a hash with types values as keys to avoid some if statements.

@paulfitz
Contributor
paulfitz commented Dec 2, 2013

+1 for parallel option. @jpmckinney's array suggestion seems reasonable. Agree with @sballesteros about avoiding spaces. Could also rename constraints to keys and give them type foreign.

@jpmckinney
Member

Another +1 from @besquared in #21 for keys.

@besquared
Contributor

Going to repost this here since apparently #21 isn't the right place anymore.

Howdy guys I just wanted to add my recommendation here based on a few things.

  1. Keys are intrinsic descriptions of a dataset and not a field

I don't think adding any kind of key description at the field level is correct. I don't think It's a matter of shorthand either as structurally all keys are properties of a datasets and not a field.

  1. Keys aren't indices and aren't necessarily constraints either

Indices are a database specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either although generally implementations constrain primary keys to be unique. One vendor specific example is that the InnoDB engine allows both non-unique indices and foreign keys to non-unique indices.

IMO If the specification's goal is to be producer and consumer agnostic then the it wouldn't include references to the concept of an index and leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag as it might be seen as a description of the key's relationship to the dataset and not a constraint that the package is defining.

As far as how to implement a specification for keys I think the most generic form would try to only include things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:

{
  "schema: {
    "fields": [...]
  },

  "keys": [
    {
      "type": "primary",
      "unique": true, // for the convenience of the consumer
      "fields": ['field1', 'field2', ...]
    },
    {
      "type": "foreign",
      "fields": ['field3', ...],
      "references": {
         "url": <datapackage.json>, // url of datapackage.json
         "resource": <resource>, // name of the resource in the referenced package
         "fields": ['field1', ...] // ordered respective to the ancestor's fields
       }
    }
  ]
@rufuspollock
Contributor

My one objection to the array approach for keys/constraints is that it makes it more painful for consumers - rather than just doing:

resources[0].schema.constraints.primarykey.fields

i have to iterate through the list of constraints.

@besquared re constraints / keys in naming I feel a little ambivalent. I think it would be nice to include unique requirements in their at some point and they definitely aren't keys ... (however this isn't a biggie)

I do think generally that keys/constraints should live inside the schema attribute and be part of json table schema (@besquared your example suggested otherwise but i was not sure if that was intentional ...)

@jpmckinney
Member

@rgrp For all except the primary key, you'll still need to iterate, as only the primary key will have a single key object as its value. The others will have arrays. If your app loads all keys, you'll have a double-nested iteration if the constraints are an object, instead of a simple iteration if the constraints are an array. It depends on what use cases you want to prioritize.

Also, can we have separation between words, e.g. primary_key?

Also, unique keys are keys (it's in the name). How are they "definitely not keys"? @besquared is correct in saying that constraints in the language of DBs are something different, e.g. you can put a constraint at the field-level on a price field to say that it must have a value greater than zero. NOT NULL is maybe the most common constraint. "Keys" is the more universal DB word for what we're talking about.

@besquared
Contributor

+1 for everything @jpmckinney is saying. Also @rgrp I originally did think keys belongs to the resource and not the schema but I take it back now they obviously belong in the schema. To draw some inspiration from Codd's relational model something like the format I proposed above allows us to correctly express the three types of relational constraints.

@rgrp other options include a keys section keyed by type

{
  "schema: {

    "fields": [...]

    "keys": {

      "primary": {
        "fields": ['field1', 'field2', ...]
      },

      "foreign": [
        {
          "fields": ['field3', ...],
          "references": {
            "url": <datapackage.json>, // url of datapackage.json
            "resource": <resource>, // name of the resource in the referenced package
            "fields": ['field1', ...] // ordered respective to the ancestor's fields
        }
      ]

    }

  }
}

This seems fine too and allows the kind of object access one would expect (resource.schema.keys.primary) while still allowing multiple foreign keys and allows the keys section to remain pretty robust to extension later.

@besquared
Contributor

This might also be a good time to talk about foreign keys vs. references. In the relational model foreign keys imply a constraint in which the foreign relation must have a corresponding value for the keyed fields in order for a new relation to be added. This allows referential integrity between relations to be maintained.

If what the specification is trying to achieve is a way to link data together (but not necessarily imply a constraint) then we might want to have a section called "references" instead of a set of foreign keys. Consumers who would like to resolve reference packages and apply foreign key constraints in their data management system may choose to do so but wouldn't be required to.

This leads to another possible schema layout:

{
  "schema: {

    "fields": [...]

    "primary_key": {
      "fields": ['field1', 'field2', ...]
    }

    "references": [
      {
        "url": <datapackage.json>, // url of datapackage.json
        "resource": <resource>, // name of the resource in the referenced package
        "required": true, // require the referenced row to exist in the foreign resource?
        "fields": {
          'field1': 'field4',
          'field3': 'field5'
        } // map between this resource and the foreign resource
      }
    ]
  }
}

I like this as well for various reasons. It leaves the concept of constraints for a later time and still allows consumers to optionally infer foreign key constraints.

Thoughts?

@rufuspollock
Contributor

So along with @jpmckinney's proposal we're going with out of line items. Here's the updated proposal:

{
  "name": "resource-a",
  "schema":{
    "fields":[
      {
        "name": "X"
      }, ...
    ],
    "foreignKeys": [
        {
          "fields": string or array of fields on local resource
          "reference": {
            "datapackage": "datapackage_dep",
            "resource": "resource_X",
            "fields": ["field_X"]
          }
        }
      ]
    ]
  }
}

Questions:

  • foreignKeys vs references vs relations vs foreign as name for attribute - votes please (+1,+0,0, -0,-1)
  • camelcase vs underscores (foreignKey vs foreign_key)

rfc @paulfitz @jpmckinney @besquared @davidmiller @sballesteros

@jpmckinney
Member
  • I prefer foreign keys to "relations", etc. because it is a label that precisely refers to the existing concept.
  • datapackage_version and last_modified use underscores. Do any current properties use camelcase? The case should be consistent across all properties, which for now would mean sticking to underscores. I don't mind switching everything to camelcase, but then we'd have to change at least datapackage_version and last_modified.
@besquared
Contributor

There's some mixing here between #21 and #23.

Everything should just standardize on camelCase.

Given that it's likely that primaryKey will become a thing it seems reasonable that foreignKeys should become the other thing. I think it's important to tell people that this might not have the same meaning that it does in their rdbms. This seems ok.

@paulfitz
Contributor
paulfitz commented Dec 9, 2013

+1 for foreignKeys and lowerCamelCase.

@sballesteros

+1 for foreignKeys and lowerCamelCase too.

@trickvi
Contributor
trickvi commented Jan 22, 2014

This has become a pretty big feature for me to get into the specification. Any ideas about ETA @rgrp?

@trickvi
Contributor
trickvi commented Jan 23, 2014

One idea I have which would be a nice to have is to create some sort of a fallback sequence. It might make everything more complicated but what I'm thinking is that resource could be an array of resources (most significant resource on the left) and when looking up the key it would just go through the resources in that order until it finds the key being searched for.

This means I can have multiple resources for datasets published in different phases. I'm thinking about a case for budget data where countries have different budget phases. The problematic one is when we have an approved budget which can be adjusted at a later point in time. Adjustments are not necessarily a completely new dataset but changes to different rows of the approved budget.

Let's say we represent this with two resources (marking them with dates for clarity): approved-2013-12-25.csv and adjusted-2014-01-22.csv. This problem could be solved by just making adjusted-2014-01-22.csv contain the whole budget with changes (so it also contains rows which haven't changed). For this the foreign key resource would just be the most recent budget resource. This means that if there are a few adjustments made we'll be copying a lot of the same rows between resources. To compare changes you would have to do a diff.

Another option would be to make adjusted-2014-01-22.csv only contain lines that have changed, with the same schema (fields, primary keys, and foreign keys) and define the resources for the foreign key with an array. Then when trying to reference an item in the budget it would first try to find it in adjusted-2014-01-22.csv and if not found go through approved-2013-12-25.csv. This reduces amount of redundancy and makes it clearer what the changes are but this makes the implementation of foreign keys more difficult.

Just an idea. I'm glad to drop it and use up more disk space if that means we can get foreign keys in as soon as possible.

@rufuspollock
Contributor

@tryggvib at this point I doubt we're going to expand for this use-case. Right now we have a finalized spec and I think it will go in.

@trickvi
Contributor
trickvi commented Jan 23, 2014

Great. Happy to see this go in as long as I have foreign keys :)

@trickvi
Contributor
trickvi commented Jan 23, 2014

Another use case I came up with though just now was that this would enable you to break up resources to more manageable sizes. For example one resource per year and some foreign key in another resource could include the year which could fallback through the resources broken down by years.

@rufuspollock
Contributor

FIXED. Foreign Key support is now in - see http://dataprotocols.org/json-table-schema/#foreign-keys

Huge thank-you to everyone who contributed here and please double check for the final implemented version for any errors that need to be corrected or emendations that should be made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment