[JTS] Primary Key / ID attribute support in JSON table schema #21

Closed
rufuspollock opened this Issue Dec 25, 2012 · 32 comments

Projects

None yet

7 participants

@rufuspollock
Contributor

Ability to specify a field(s) as primary key / id field.

Proposal

{
   fields: [
     {
        id: ...
        type: ...
        primarykey: true
   ]
}

Questions

  • Do we allow multiple fields or insist on a single field
  • Naming of attribute: "primarykey"

Relationship to a possible distinct attribute "unique"

TODO: research / compare with other specs e.g. SQL, bigquery etc

@rufuspollock
Contributor

@mk270 thoughts here?

@mk270
Contributor
mk270 commented Mar 7, 2013

If we do it at all, let's do it in a later version

@rufuspollock
Contributor

AFAICT from my research with json schema there is to provision for this in json schema (there's this discussion but nothing in the spec ...)

Key question:

  • Can you have multiple fields in a primary key or must it be a single field?
@rufuspollock
Contributor

Propose:

{
   fields: [
     {
        id: ...
        type: ...
        primarykey: true
   ]
}

Decision: we allow multiple fields to make up the primary key but RECOMMEND having only one.

Allowed to have no fields set as primarykey.

@rufuspollock
Contributor

@davidmiller - any thoughts?

@rufuspollock
Contributor

I note though json schema has no primary key it has required support and does not put it on each element but rather outside in a list, see e.g. http://json-schema.org/example1.html

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Product",
    "description": "A product from Acme's catalog",
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for a product",
            "type": "integer"
        }
    },
    "required": ["id"]
}
@rufuspollock
Contributor

@maxogden any thoughts here? (This is definitely needed for dat data loading IIRC).

@maxogden
Contributor
maxogden commented Nov 9, 2013

it's not required but I support it (e.g. if you don't specify one then uuids will be generated for your rows for you)

also, how would it work exactly to specify multiple primary keys? would they get concatenated together to form one meta-primary-key or that they can be used individually/optionally?

e.g. if I had a csv with columns city,state then my primary keys would have to be city, state e.g. salem-oregon vs salem-massachusetts

i'm not sure if tools should allow to interpret multiple columns as one composite key or if that is outside the scope

@rufuspollock
Contributor

@maxogden I'd planned that you could have multiple primary keys (set primarykey=true on multiple fields) but that you should not (i.e. heavily discouraged). Interpretation of multiple primary keys would be up to the client but natural one would be to have a composite key (aside: I'd thought about this in terms of an API for data.okfn.org where you have data.okfn.org/{dataset}/{file}/id/{primary-key} and think you'd do for composite stuff {col-1}/{col-2} etc)

@rufuspollock
Contributor

@paulfitz interested in your thoughts here. Hope to close this and push in next day or so.

@paulfitz
Contributor

@rgrp I'd vote for allowing (at most) one primary key, and allowing it to be a composite of multiple columns. I've found this to be common and useful practice. With Coopy I started off not covering this case, but then needed to when dealing with real-life data. For reference, Sqlite's treatment of specifying a primary key is as follows:

  • Each table may have at most one primary key.
  • If "PRIMARY KEY" is added to a column definition (like your "primarykey: true"), then the primary key for that table is exactly that column.
  • If the primary key is expressed as a column constraint (this would be like a list added outside of the column definitions, like json-schema does with required) then the primary key for that table is the list of columns given.

If multi-column primary keys end up allowed by the spec, then I'd suggest doing it wholeheartedly, and not put in language discouraging them or leaving interpretation to the client. Just so everyone knows what they can rely on for real.

@rufuspollock
Contributor

@paulfitz so agree on multi-column primary keys (but only one primary key). What do you think of doing inline i.e. primarkey=true (this is sqlalchemy style) on one or more fields or outside of the fields list (more sql constraint style ...?)

@paulfitz
Contributor

@rgrp inline is ok. Outside-of-fields style could eventually be simpler when dealing also with foreign keys (including composites), indexes, uniqueness, etc. I went with modeling just single-column foreign keys in Coopy and ended up regretting it. But for the primary key, inline is ok, since there's just one primary key and the order of columns in it doesn't matter.

@jpmckinney
Member

A primary key is an index. Most database systems I know of specify indexes separately from schema. So, I would prefer to have another top-level field (alongside fields) called indexes, which you can add primary keys, unique keys, etc. to. If you ever add additional indexes like unique keys, the implementation will be much easier if it's done separately like this. I figure DB systems have other good reasons for defining indexes separately, besides the conceptual difference from fields (e.g. row-level versus table-level).

Also, you can only ever have one primary key (though it may be a composite key). I know of no database system that allows for multiple "primary" keys (what would that even mean?).

@paulfitz
Contributor

@jpmckinney what do you think of Sqlite's compromise, where a simple single-column primary key can be expressed directly in the column definition, while still having more flexible methods of describing constraints and indexes in the general case (like your indexes property)? http://www.sqlite.org/lang_createtable.html The single-column case is common, is it worth streamlining?

@jpmckinney
Member

The goal of SQLite's table definition format is to optimize table creation (the most common interaction people have with the table syntax), so it offers developers many ways to write short-hand. An important goal of datapackage.json, as I understand, is to communicate the data's schema to readers, and reading is much more common than is the case for SQLite. Given that reading is prioritized over writing, it makes sense to choose a format that maximizes clarity and minimizes miscommunication or confusion. In #23 I give an example where putting primary_key on fields can lead to misreading and confusion.

Also, when introducing a new feature, I'm generally in favor of supporting one way of using that feature, and waiting for real demand to grow for alternatives, short-hands, etc.

@rufuspollock
Contributor

@jpmckinney i'm not sure that reading is prioritized over writing exactly - though I get your point.

I'm still at least +0 on the shorthand primarykey: true or similar but appreciate your point. I'm going to comment in #23 now and I hope with that one closed we can decide whether primarykey option makes sense ...

@besquared
Contributor

Howdy guys I just wanted to add my recommendation here based on a few things.

  1. Keys are intrinsic descriptions of a dataset and not a field

I don't think adding any kind of key description at the field level is correct. It's not a matter of shorthand either as structurally all keys are properties of a datasets and not a fields.

  1. Keys aren't indices and aren't necessarily constraints either

Indices are a database specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either although generally implementations constrain primary keys to be unique.

IMO If the specification's goal is to be producer and consumer agnostic then the it wouldn't include references to the concept of an index and leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag as it might be seen as a description of the key's relationship to the dataset and not a constraint that we're defining.

As far as how to implement a specification for keys I think the most generic form would try to only include things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:

{
  "schema: {
    "fields": [...]
  },

  "keys": [
    {
      "type": "primary",
      "unique": true, // for the convenience of the consumer
      "fields": ['field1', 'field2', ...]
    },
    {
      "type": "foreign",
      "fields": ['field3', ...],
      "references": {
         "url": <datapackage.json>, // url of datapackage.json
         "resource": <resource>, // name of the resource in the referenced package
         "fields": ['field1', ...] // ordered respective to the ancestor's fields
       }
    }
  ]
@jpmckinney
Member

@besquared +1 to everything. Happy to call them keys instead of constraints or indexes. The discussion is mostly happening in #23, btw.

@maxogden
Contributor
maxogden commented Dec 5, 2013

just a note, in dat i've been experimenting with this API:

{
    primary: ['CaseNumber', 'Description', 'PoliceBeat', 'CrimeType'],
    hash: true
}

meaning for this CSV:

"CrimeType","DateTime","CaseNumber","Description","PoliceBeat","Address","City","State"
"GRAND THEFT","2013-10-08 17:10:00","13-051656","GRAND THEFT:MONEY/LABOR/PROPERTY  OVER $400","01X","300 WEBSTER ST","Oakland","CA"

the key for that row would be:

13-051656GRAND THEFT:MONEY/LABOR/PROPERTY  OVER $40001XGRAND THEFT

but since hash is true it gets md5'd to something like this:

1b05f83d08267b674cee5a79786b1a07

upside is that I get unique ID's, downside is that I have to know all 4 columns in order to retrieve the data for that row

@jpmckinney
Member

Hmm, simple concatenation is naive. Imagine another row where description is GRAND THEFT:MONEY/LABOR/PROPERTY OVER $4000 and police beat is 1X. The keys will be identical in your implementation, but they shouldn't be. If you do something like COLUMN_NAME_A=VALUE COLUMN_NAME_B=VALUE ... for the key, it will be unique. I'm sure database designers have much more clever solutions.

@besquared
Contributor

In the stuff I'm writing I generate an internal row id using a purely random UUID (version 4) or the SHA1 hash of the concatenated primary key values for that row. I can't think of a case where someone wouldn't have either the internal record id (via some kind of CLI or UI/URL) or all 4 of the values at hand (via a foreign key association for instance).

It also makes it much more convenient for things like URLs. If everyone agrees on the row id generation scheme I could theoretically pass that hashed row id to another service and it could look up the matching records. Of course I could also pass huge lists of strings around too but that seems worse.

@jpmckinney
Member

@besquared You can hash the concatenated string - that's fine - all I'm saying in my comment is that simply concatenating column values is a naive approach, as it doesn't delimit the values in any way, and I give an example where the lack of a delimiter causes a key collision where there should be no collision.

@besquared
Contributor

Sorry, I meant that I was agreeing with you. I'm glad you pointed out that concatenation alone isn't enough. :)

@rufuspollock
Contributor

OK, so how about this for a final proposal:

  • Schema MAY have an attribute primaryKey
    • I think we go with camel-cased finally if we want to keep with CommonJS approach). Happy to hear complaints here.
    • we could drop the "key" but "primaryKey" seems better than "primary" which is a bit ambiguous (though @maxogden i note this is your usage)
  • primaryKey may be either a single string (interpreted as a field name) or an array of strings (denoting that primary key is made up of those fields in order)
    • appreciate that allowing option here makes parsing a little bit harder but makes it that much nicer for creators and users

Here's what it looks like:

{
  "schema": {
     "fields": [ ... ]
     "primaryKey": "a"
     # or if multiple fields
     "primaryKey": ["a", "b"]
   }
}

rfc @maxogden @jpmckinney @besquared @paulfitz @sballesteros

@jpmckinney
Member
  • As mentioned in #23, if we want to start using camelcase, we need to camelcase all the currently underscored properties.
  • I prefer calling it a primary key to any other label, because that is a label that precisely refers to the existing concept.
  • The order of the fields in a primary key should not matter. The order only matters to the indexes that are automatically created for primary keys in SQL systems. Indexes are context- and use case-dependent. I thought we would handle those separately.
  • It should be conformant to create a primary key whose value is a single-value array, e.g. "primaryKey": ["a"], i.e. if using the array format, you are not obliged to have multiple values.
@turicas
turicas commented Dec 8, 2013

+1 for camel case since it is the default on JavaScript (and JSON comes from JS...).

@rufuspollock
Contributor

@jpmckinney

  • agree we'll need to camelcase current underscored properties for consistency. There are not that many i think but it is a definite consideration.
  • Good to hear the +1 for primaryKey over e.g. "primary"
  • We can make clear order of fields does not matter
  • It would be fine to have an array with a single value (in fact given that arrays are allowed that might be the default)

@turicas good to hear the +1 on camel case

@besquared
Contributor

+1 camelCase (not my favorite but consistency is important)
+1 primaryKey

@rufuspollock
Contributor

w00t!

@besquared @jpmckinney @maxogden @paulfitz believe I've captured the agreed consensus here but welcome review of the change: 128726b

@jpmckinney
Member

Looks good!

@besquared
Contributor

Looks great to me! Already implemented the Mode Ruby data packages library. I'll get around the string version soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment