[JTS] Primary Key / ID attribute support in JSON table schema #21

rufuspollock · 2012-12-25T22:32:48Z

Ability to specify a field(s) as primary key / id field.

Proposal

{
   fields: [
     {
        id: ...
        type: ...
        primarykey: true
   ]
}

Questions

Do we allow multiple fields or insist on a single field
Naming of attribute: "primarykey"

Relationship to a possible distinct attribute "unique"

TODO: research / compare with other specs e.g. SQL, bigquery etc

The text was updated successfully, but these errors were encountered:

rufuspollock · 2013-02-15T18:30:40Z

@mk270 thoughts here?

mk270 · 2013-03-07T19:26:59Z

If we do it at all, let's do it in a later version

rufuspollock · 2013-06-23T14:29:52Z

AFAICT from my research with json schema there is to provision for this in json schema (there's this discussion but nothing in the spec ...)

Key question:

Can you have multiple fields in a primary key or must it be a single field?

rufuspollock · 2013-08-06T16:03:46Z

Propose:

{
   fields: [
     {
        id: ...
        type: ...
        primarykey: true
   ]
}

Decision: we allow multiple fields to make up the primary key but RECOMMEND having only one.

Allowed to have no fields set as primarykey.

rufuspollock · 2013-10-06T11:57:54Z

@davidmiller - any thoughts?

rufuspollock · 2013-10-06T11:59:44Z

I note though json schema has no primary key it has required support and does not put it on each element but rather outside in a list, see e.g. http://json-schema.org/example1.html

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Product",
    "description": "A product from Acme's catalog",
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for a product",
            "type": "integer"
        }
    },
    "required": ["id"]
}

rufuspollock · 2013-11-09T10:16:20Z

@maxogden any thoughts here? (This is definitely needed for dat data loading IIRC).

max-mapper · 2013-11-09T11:09:14Z

it's not required but I support it (e.g. if you don't specify one then uuids will be generated for your rows for you)

also, how would it work exactly to specify multiple primary keys? would they get concatenated together to form one meta-primary-key or that they can be used individually/optionally?

e.g. if I had a csv with columns city,state then my primary keys would have to be city, state e.g. salem-oregon vs salem-massachusetts

i'm not sure if tools should allow to interpret multiple columns as one composite key or if that is outside the scope

rufuspollock · 2013-11-09T17:59:15Z

@maxogden I'd planned that you could have multiple primary keys (set primarykey=true on multiple fields) but that you should not (i.e. heavily discouraged). Interpretation of multiple primary keys would be up to the client but natural one would be to have a composite key (aside: I'd thought about this in terms of an API for data.okfn.org where you have data.okfn.org/{dataset}/{file}/id/{primary-key} and think you'd do for composite stuff {col-1}/{col-2} etc)

rufuspollock · 2013-11-11T10:29:08Z

@paulfitz interested in your thoughts here. Hope to close this and push in next day or so.

paulfitz · 2013-11-11T15:03:04Z

@rgrp I'd vote for allowing (at most) one primary key, and allowing it to be a composite of multiple columns. I've found this to be common and useful practice. With Coopy I started off not covering this case, but then needed to when dealing with real-life data. For reference, Sqlite's treatment of specifying a primary key is as follows:

Each table may have at most one primary key.
If "PRIMARY KEY" is added to a column definition (like your "primarykey: true"), then the primary key for that table is exactly that column.
If the primary key is expressed as a column constraint (this would be like a list added outside of the column definitions, like json-schema does with required) then the primary key for that table is the list of columns given.

If multi-column primary keys end up allowed by the spec, then I'd suggest doing it wholeheartedly, and not put in language discouraging them or leaving interpretation to the client. Just so everyone knows what they can rely on for real.

rufuspollock · 2013-11-11T15:07:35Z

@paulfitz so agree on multi-column primary keys (but only one primary key). What do you think of doing inline i.e. primarkey=true (this is sqlalchemy style) on one or more fields or outside of the fields list (more sql constraint style ...?)

paulfitz · 2013-11-11T15:43:18Z

@rgrp inline is ok. Outside-of-fields style could eventually be simpler when dealing also with foreign keys (including composites), indexes, uniqueness, etc. I went with modeling just single-column foreign keys in Coopy and ended up regretting it. But for the primary key, inline is ok, since there's just one primary key and the order of columns in it doesn't matter.

jpmckinney · 2013-11-29T23:03:37Z

A primary key is an index. Most database systems I know of specify indexes separately from schema. So, I would prefer to have another top-level field (alongside fields) called indexes, which you can add primary keys, unique keys, etc. to. If you ever add additional indexes like unique keys, the implementation will be much easier if it's done separately like this. I figure DB systems have other good reasons for defining indexes separately, besides the conceptual difference from fields (e.g. row-level versus table-level).

Also, you can only ever have one primary key (though it may be a composite key). I know of no database system that allows for multiple "primary" keys (what would that even mean?).

paulfitz · 2013-11-30T04:35:54Z

@jpmckinney what do you think of Sqlite's compromise, where a simple single-column primary key can be expressed directly in the column definition, while still having more flexible methods of describing constraints and indexes in the general case (like your indexes property)? http://www.sqlite.org/lang_createtable.html The single-column case is common, is it worth streamlining?

jpmckinney · 2013-11-30T22:11:35Z

The goal of SQLite's table definition format is to optimize table creation (the most common interaction people have with the table syntax), so it offers developers many ways to write short-hand. An important goal of datapackage.json, as I understand, is to communicate the data's schema to readers, and reading is much more common than is the case for SQLite. Given that reading is prioritized over writing, it makes sense to choose a format that maximizes clarity and minimizes miscommunication or confusion. In #23 I give an example where putting primary_key on fields can lead to misreading and confusion.

Also, when introducing a new feature, I'm generally in favor of supporting one way of using that feature, and waiting for real demand to grow for alternatives, short-hands, etc.

rufuspollock · 2013-12-01T13:41:44Z

@jpmckinney i'm not sure that reading is prioritized over writing exactly - though I get your point.

I'm still at least +0 on the shorthand primarykey: true or similar but appreciate your point. I'm going to comment in #23 now and I hope with that one closed we can decide whether primarykey option makes sense ...

besquared · 2013-12-04T22:46:58Z

Howdy guys I just wanted to add my recommendation here based on a few things.

Keys are intrinsic descriptions of a dataset and not a field

I don't think adding any kind of key description at the field level is correct. It's not a matter of shorthand either as structurally all keys are properties of a datasets and not a fields.

Keys aren't indices and aren't necessarily constraints either

Indices are a database specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either although generally implementations constrain primary keys to be unique.

IMO If the specification's goal is to be producer and consumer agnostic then the it wouldn't include references to the concept of an index and leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag as it might be seen as a description of the key's relationship to the dataset and not a constraint that we're defining.

As far as how to implement a specification for keys I think the most generic form would try to only include things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:

{
  "schema: {
    "fields": [...]
  },

  "keys": [
    {
      "type": "primary",
      "unique": true, // for the convenience of the consumer
      "fields": ['field1', 'field2', ...]
    },
    {
      "type": "foreign",
      "fields": ['field3', ...],
      "references": {
         "url": <datapackage.json>, // url of datapackage.json
         "resource": <resource>, // name of the resource in the referenced package
         "fields": ['field1', ...] // ordered respective to the ancestor's fields
       }
    }
  ]

jpmckinney · 2013-12-05T00:38:36Z

@besquared +1 to everything. Happy to call them keys instead of constraints or indexes. The discussion is mostly happening in #23, btw.

max-mapper · 2013-12-05T01:29:35Z

just a note, in dat i've been experimenting with this API:

{
    primary: ['CaseNumber', 'Description', 'PoliceBeat', 'CrimeType'],
    hash: true
}

meaning for this CSV:

"CrimeType","DateTime","CaseNumber","Description","PoliceBeat","Address","City","State"
"GRAND THEFT","2013-10-08 17:10:00","13-051656","GRAND THEFT:MONEY/LABOR/PROPERTY  OVER $400","01X","300 WEBSTER ST","Oakland","CA"

the key for that row would be:

13-051656GRAND THEFT:MONEY/LABOR/PROPERTY  OVER $40001XGRAND THEFT

but since hash is true it gets md5'd to something like this:

1b05f83d08267b674cee5a79786b1a07

upside is that I get unique ID's, downside is that I have to know all 4 columns in order to retrieve the data for that row

jpmckinney · 2013-12-05T01:47:41Z

Hmm, simple concatenation is naive. Imagine another row where description is GRAND THEFT:MONEY/LABOR/PROPERTY OVER $4000 and police beat is 1X. The keys will be identical in your implementation, but they shouldn't be. If you do something like COLUMN_NAME_A=VALUE COLUMN_NAME_B=VALUE ... for the key, it will be unique. I'm sure database designers have much more clever solutions.

besquared · 2013-12-05T02:11:33Z

In the stuff I'm writing I generate an internal row id using a purely random UUID (version 4) or the SHA1 hash of the concatenated primary key values for that row. I can't think of a case where someone wouldn't have either the internal record id (via some kind of CLI or UI/URL) or all 4 of the values at hand (via a foreign key association for instance).

It also makes it much more convenient for things like URLs. If everyone agrees on the row id generation scheme I could theoretically pass that hashed row id to another service and it could look up the matching records. Of course I could also pass huge lists of strings around too but that seems worse.

jpmckinney · 2013-12-05T02:20:34Z

@besquared You can hash the concatenated string - that's fine - all I'm saying in my comment is that simply concatenating column values is a naive approach, as it doesn't delimit the values in any way, and I give an example where the lack of a delimiter causes a key collision where there should be no collision.

besquared · 2013-12-05T06:36:23Z

Sorry, I meant that I was agreeing with you. I'm glad you pointed out that concatenation alone isn't enough. :)

rufuspollock · 2013-12-07T20:42:01Z

OK, so how about this for a final proposal:

Schema MAY have an attribute primaryKey
- I think we go with camel-cased finally if we want to keep with CommonJS approach). Happy to hear complaints here.
- we could drop the "key" but "primaryKey" seems better than "primary" which is a bit ambiguous (though @maxogden i note this is your usage)
primaryKey may be either a single string (interpreted as a field name) or an array of strings (denoting that primary key is made up of those fields in order)
- appreciate that allowing option here makes parsing a little bit harder but makes it that much nicer for creators and users

Here's what it looks like:

{
  "schema": {
     "fields": [ ... ]
     "primaryKey": "a"
     # or if multiple fields
     "primaryKey": ["a", "b"]
   }
}

rfc @maxogden @jpmckinney @besquared @paulfitz @sballesteros

jpmckinney · 2013-12-08T16:14:06Z

As mentioned in Foreign Key attribute in JSON schema #23, if we want to start using camelcase, we need to camelcase all the currently underscored properties.
I prefer calling it a primary key to any other label, because that is a label that precisely refers to the existing concept.
The order of the fields in a primary key should not matter. The order only matters to the indexes that are automatically created for primary keys in SQL systems. Indexes are context- and use case-dependent. I thought we would handle those separately.
It should be conformant to create a primary key whose value is a single-value array, e.g. "primaryKey": ["a"], i.e. if using the array format, you are not obliged to have multiple values.

turicas · 2013-12-08T16:30:40Z

+1 for camel case since it is the default on JavaScript (and JSON comes from JS...).

rufuspollock · 2013-12-08T17:05:25Z

@jpmckinney

agree we'll need to camelcase current underscored properties for consistency. There are not that many i think but it is a definite consideration.
Good to hear the +1 for primaryKey over e.g. "primary"
We can make clear order of fields does not matter
It would be fine to have an array with a single value (in fact given that arrays are allowed that might be the default)

@turicas good to hear the +1 on camel case

besquared · 2013-12-08T19:20:08Z

+1 camelCase (not my favorite but consistency is important)
+1 primaryKey

rufuspollock · 2014-01-12T12:57:01Z

w00t!

@besquared @jpmckinney @maxogden @paulfitz believe I've captured the agreed consensus here but welcome review of the change: 128726b

jpmckinney · 2014-01-12T18:26:24Z

Looks good!

besquared · 2014-01-12T18:29:31Z

Looks great to me! Already implemented the Mode Ruby data packages library. I'll get around the string version soon.

rufuspollock mentioned this issue Apr 3, 2013

Foreign Key attribute in JSON schema #23

Closed

rufuspollock mentioned this issue Apr 18, 2013

Added data transmission info #28

Merged

rufuspollock closed this as completed in 128726b Jan 12, 2014

rufuspollock mentioned this issue Feb 14, 2014

Fields in the resource dict #102

Closed

danfowler mentioned this issue Jan 26, 2016

Generating single value primary keys for dimensions openspending/fiscal-data-package#109

Closed

WillAyd mentioned this issue Oct 22, 2019

Incorrect json round-trip with orient='table' when dataframe contains duplicate index values pandas-dev/pandas#29025

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JTS] Primary Key / ID attribute support in JSON table schema #21

[JTS] Primary Key / ID attribute support in JSON table schema #21

rufuspollock commented Dec 25, 2012

rufuspollock commented Feb 15, 2013

mk270 commented Mar 7, 2013

rufuspollock commented Jun 23, 2013

rufuspollock commented Aug 6, 2013

rufuspollock commented Oct 6, 2013

rufuspollock commented Oct 6, 2013

rufuspollock commented Nov 9, 2013

max-mapper commented Nov 9, 2013

rufuspollock commented Nov 9, 2013

rufuspollock commented Nov 11, 2013

paulfitz commented Nov 11, 2013

rufuspollock commented Nov 11, 2013

paulfitz commented Nov 11, 2013

jpmckinney commented Nov 29, 2013

paulfitz commented Nov 30, 2013

jpmckinney commented Nov 30, 2013

rufuspollock commented Dec 1, 2013

besquared commented Dec 4, 2013

jpmckinney commented Dec 5, 2013

max-mapper commented Dec 5, 2013

jpmckinney commented Dec 5, 2013

besquared commented Dec 5, 2013

jpmckinney commented Dec 5, 2013

besquared commented Dec 5, 2013

rufuspollock commented Dec 7, 2013

jpmckinney commented Dec 8, 2013

turicas commented Dec 8, 2013

rufuspollock commented Dec 8, 2013

besquared commented Dec 8, 2013

rufuspollock commented Jan 12, 2014

jpmckinney commented Jan 12, 2014

besquared commented Jan 12, 2014

[JTS] Primary Key / ID attribute support in JSON table schema #21

[JTS] Primary Key / ID attribute support in JSON table schema #21

Comments

rufuspollock commented Dec 25, 2012

Proposal

Questions

rufuspollock commented Feb 15, 2013

mk270 commented Mar 7, 2013

rufuspollock commented Jun 23, 2013

rufuspollock commented Aug 6, 2013

rufuspollock commented Oct 6, 2013

rufuspollock commented Oct 6, 2013

rufuspollock commented Nov 9, 2013

max-mapper commented Nov 9, 2013

rufuspollock commented Nov 9, 2013

rufuspollock commented Nov 11, 2013

paulfitz commented Nov 11, 2013

rufuspollock commented Nov 11, 2013

paulfitz commented Nov 11, 2013

jpmckinney commented Nov 29, 2013

paulfitz commented Nov 30, 2013

jpmckinney commented Nov 30, 2013

rufuspollock commented Dec 1, 2013

besquared commented Dec 4, 2013

jpmckinney commented Dec 5, 2013

max-mapper commented Dec 5, 2013

jpmckinney commented Dec 5, 2013

besquared commented Dec 5, 2013

jpmckinney commented Dec 5, 2013

besquared commented Dec 5, 2013

rufuspollock commented Dec 7, 2013

jpmckinney commented Dec 8, 2013

turicas commented Dec 8, 2013

rufuspollock commented Dec 8, 2013

besquared commented Dec 8, 2013

rufuspollock commented Jan 12, 2014

jpmckinney commented Jan 12, 2014

besquared commented Jan 12, 2014