
Composite and partial unique constraints #593

Closed
vitorbaptista opened this issue Mar 3, 2018 · 26 comments

@vitorbaptista commented Mar 3, 2018

See previous discussion on frictionlessdata/goodtables-py#252 (comment)

The use case is where you want to validate that multiple fields are unique as a whole. Given the following table:

| id_1 | id_2 |
|------|------|
| 1    | 1    |
| 1    |      |
|      | 1    |

Both id_1 and id_2 are nullable, but we want (id_1, id_2) to be unique. None of the following rows could be added to this table:

| id_1 | id_2 |
|------|------|
| 1    | 1    |
| 1    |      |
|      | 1    |

But these would be valid:

| id_1 | id_2 |
|------|------|
| 1    | 2    |
| 2    | 1    |
| 2    |      |
|      | 2    |

If this were SQL, I would do this by creating these indexes:

CREATE UNIQUE INDEX my_table_unique_keys ON my_table (id_1, id_2);
CREATE UNIQUE INDEX my_table_unique_id_1 ON my_table (id_1) WHERE id_2 IS NULL;
CREATE UNIQUE INDEX my_table_unique_id_2 ON my_table (id_2) WHERE id_1 IS NULL;
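
For reference, here's a minimal Python sketch (illustrative only, not from any implementation) of the row-level check those three indexes express, with None standing in for null:

# Rows are (id_1, id_2) tuples; None stands in for null.
def check_unique(rows):
    seen = set()
    for id_1, id_2 in rows:
        if id_1 is not None and id_2 is not None:
            key = ("both", id_1, id_2)   # composite index
        elif id_1 is not None:
            key = ("id_1", id_1)         # partial index WHERE id_2 IS NULL
        elif id_2 is not None:
            key = ("id_2", id_2)         # partial index WHERE id_1 IS NULL
        else:
            continue                     # both null: no index applies
        if key in seen:
            return False
        seen.add(key)
    return True

check_unique([(1, 1), (1, None), (None, 1)])  # True
check_unique([(1, 1), (1, None), (1, None)])  # False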

Currently there's no way of doing this with Table Schema. For the simple case where we just want (id_1, id_2) to be unique, we could follow a pattern similar to primaryKeys, like:

"schema": {
  "unique": ["id_1", "id_2"]
}

But it wouldn't work for partial indexes. In that case, we also need a WHERE clause. Maybe something like:

"schema": {
  "unique": [
    { "fields": ["id_1", "id_2"] },
    { "fields": ["id_1"], "where": "id_2 IS NULL" },
    { "fields": ["id_2"], "where": "id_1 IS NULL" },
  ]
}

But this could grow implementation complexity pretty quickly, as implementations would now have to parse and evaluate the where clause somehow.

Maybe a good middle ground is just configuring what happens when some of the unique fields are null? The default behaviour of a composite index in SQL is that if any field is null, the entire index isn't used, so these two rows would be valid:

| id_1 | id_2 |
|------|------|
| 2    |      |
| 2    |      |

Maybe this behaviour could be controlled by a boolean flag. Something like:

"schema": {
  "unique": [
    { "fields": ["id_1", "id_2"], "ignoreNullValues": true },
  ]
}
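
To spell out the semantics this hypothetical flag would have (the flag name and the function below are illustrative, not part of the spec):

# Hypothetical semantics for the proposed ignoreNullValues flag.
# rows: list of dicts; fields: constrained field names; None = null.
def is_unique(rows, fields, ignore_null_values=False):
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in fields)
        if ignore_null_values and any(v is None for v in key):
            continue  # SQL-style: constraint skipped when any field is null
        if key in seen:
            return False
        seen.add(key)
    return True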

The naming can be improved.

cc @roll @Stephen-Gates @hydrosquall

@robyoung commented Apr 16, 2019

What needs to be done for this to be "Status:Ready-For-PR"?

@rufuspollock commented May 11, 2019

@robyoung we'd need a concrete proposal of how to do this. Then it would become a pattern and if sufficiently desired by the community it would get added to the spec.

@ezwelty commented Sep 5, 2019

@rufuspollock @vitorbaptista

I just ran into this myself. My use case is spatial surveys of points, some of which are grouped into linear profiles (and hence have a profile id), and others of which are not (and hence have a missing profile id).

| survey | profile | point |
|--------|---------|-------|
| 1      | 1       | 1     |
| 1      | 1       | 2     |
| 2      |         | 1     |
| 2      |         | 2     |

SQL aside, a straight reading of https://frictionlessdata.io/specs/table-schema/#primary-key suggested to me that the following should work, despite the missing values:

a field or set of fields that uniquely identifies each row in the table [...] concatenating values together

"primaryKey": ["survey", "profile", "point"]

Yet that is not how goodtables-py has chosen to interpret and enforce the schema (frictionlessdata/goodtables-py#252).

I don't agree that this requires adding new knobs and switches to the specs. Instead, simply clarify in the specs that any field listed in primaryKey is interpreted as required: true unless explicitly marked as required: false. Why does it need to be more complicated than that?

@rufuspollock commented Sep 9, 2019

@ezwelty isn't what you want what was done in frictionlessdata/goodtables-py#252 - the issue is just that it wasn't merged?

@ezwelty commented Sep 9, 2019

@rufuspollock Sorry, you're right, I wasn't very clear. I support the code in frictionlessdata/goodtables-py#252 proposed by @vitorbaptista and suggest that https://frictionlessdata.io/specs/table-schema/#primary-key be amended to state that any field listed in primaryKey is assumed required: true unless explicitly marked as required: false.

@rufuspollock commented Sep 10, 2019

OK, do you want to submit a PR to this effect and resubmit vitor's PR on goodtables?

@ezwelty commented Sep 10, 2019

@rufuspollock I will do the first. By resubmit, do you mean create a new PR from the same branch, compel vitor to reopen the original PR, ... ?

@rufuspollock commented Sep 10, 2019

I mean a new PR ...

@roll commented Sep 11, 2019

@ezwelty
@rufuspollock
@pwalsh
I'm a little bit late to the discussion. Sorry in advance if I've missed something; I had really limited time to dive in.

Regarding the current proposal, I have a few thoughts:

  1. Correct me if I'm wrong, but the sentence "A primary key is a field or set of fields that uniquely identifies each row in the table." clearly states that a primary key's values can't be null. The current amendment from the PR is not enough. To solve the problem it would have to say something like this (I'm exaggerating, but that's how the new semantics look):

     if any of the fields have explicit required: false definitions, then primary keys don't uniquely identify each row and ..., otherwise, they DO uniquely identify each row and ...

  2. The new concept of explicitly set required: false fields creates a three-state flag (again, sorry if I got it wrong):
     • required: none (implicit false by default)
     • required: false (explicit false to alter primary keys)
     • required: true
  3. Our implementations are really keen on interoperability with SQL, and the new primaryKey semantics will not play well with it.

TBH I don't see it as a perfect solution. Thinking of alternative options:

  1. Just add the schema.uniqueKeys field:
     • exactly what this issue is about (using primaryKey for this is more of a hack)
     • asked for many times by other people
     • trivial to add to goodtables
  2. If schema.uniqueKeys is not an option, at least altering the primaryKey semantics explicitly feels like a cleaner option to me, e.g. schema.primaryKey: "id" with schema.primaryKeyAllowNull: true (or something like this):
     • cleaner semantics than the required: false option

@ezwelty commented Sep 11, 2019

@roll

  1. I read nothing about missing values in the statement "A primary key is a field or set of fields that uniquely identifies each row in the table." Fields can uniquely identify each row despite the presence of null values, as in my original example:
| survey | profile | point |
|--------|---------|-------|
| 1      | 1       | 1     |
| 1      | 1       | 2     |
| 2      |         | 1     |
| 2      |         | 2     |

Hence a desire to enforce this uniqueness constraint with "primaryKey": ["survey", "profile", "point"]. Why be forced to fill missing profile with a dummy value, especially since profile: null is physically meaningful?

  2. Currently, goodtables is assuming required: true for primary keys. The suggestion is to allow required: false in the fields' constraints to take precedence and override this assumption. I don't see this as 3 different states.

  3. I understand the hesitation to break from SQL convention. However, I'd warn that the SQL standard for the UNIQUE constraint (and its various implementations) is a cluster$%#@ when it comes to missing values (see, for example, http://troels.arvin.dk/db/rdbms/#constraints-unique and https://dba.stackexchange.com/a/185370). So the specs would have to break from convention regardless to ensure that uniqueKeys actually meant what I and others in this discussion are asking for. I'm supportive of a schema.primaryKeyAllowNull or similar if you think that would be cleaner than required: false taking precedence over membership in primaryKey.

@roll commented Sep 11, 2019

@ezwelty

I read nothing about missing values in the statement A primary key is a field or set of fields that uniquely identifies each row in the table. Fields can uniquely identify each row despite the presence of null values, as in my original example

What I meant is that the statement "A primary key is a field or set of fields that uniquely identifies each row in the table" says that a primary key MUST uniquely identify each row, as I see it. The same as in SQL. So, for example, in the data below:

id,value
1,value1
null,value2
null,value3

the id field cannot be a primary key because it doesn't uniquely identify the rows with value2 and value3.

However, I'd warn that the SQL standard for the UNIQUE constraint (and its various implementations) is a cluster$%#@ when it comes to missing values (see, for example, http://troels.arvin.dk/db/rdbms/#constraints-unique and https://dba.stackexchange.com/a/185370). So the specs would have to break from convention regardless to ensure that uniqueKeys actually meant what I and others in this discussion are asking for.

I think these links relate to some non-standard implementations. As far as I know, the ANSI SQL standard allows multiple null values for a unique field. And that seems to be exactly what you're trying to achieve in your data example.

@ezwelty commented Sep 11, 2019

@roll

a primary key MUST uniquely identify each row.

Exactly. In your example, primaryKey cannot be ["id"], but it can be ["id", "value"], since that combination of fields is unique for each row.

As far as I know, the ANSI SQL standard allows multiple null values for a unique field. And that seems to be exactly what you're trying to achieve in your data example.

I'm not so sure. The unique constraint is satisfied if and only if no two rows in a table have the same non-null values in the unique columns (http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip).

That's pretty ambiguous with respect to null values (see https://stackoverflow.com/a/22699498, https://www.sqlite.org/nulls.html, etc). The most common interpretation, in the words of the PostgreSQL documentation, is that two null values are never considered equal in this comparison. That means even in the presence of a unique constraint, it is possible to store duplicate rows that contain a null value in at least one of the constrained columns.

Thus, according to this common interpretation, a unique constraint does not require unique rows when null values are present, and we're back where we started.
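
For a concrete illustration, the same behavior can be reproduced with Python's built-in sqlite3 module (SQLite follows this common interpretation too); a quick, self-contained demonstration:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER, y INTEGER, UNIQUE (x, y))")
# Both inserts succeed: the two NULLs are never considered equal,
# so the second (2, NULL) does not collide with the first.
con.execute("INSERT INTO t VALUES (2, NULL)")
con.execute("INSERT INTO t VALUES (2, NULL)")  # no IntegrityError
print(con.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # prints 2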

@roll commented Sep 12, 2019

Sorry, but I think you are exaggerating this SQL inconsistency a little bit. I would say that there are two standard and well-established concepts in SQL:

  • the primary key is a unique column not accepting nulls; its purpose is to identify rows in the table
  • the unique key is a unique column accepting nulls; its purpose is to guarantee the uniqueness of the existing values

These are really two different concepts, and as I see it your use case is about uniqueness, not about identifying rows in the table.

I agree that in your example having such a primary key makes more sense because the primary key is composite. But does it mean that we have to explicitly state that required: false or primaryKeyNullable: true is only applicable to composite primary keys? If so, it opens another can of worms for implementors, etc.

I just feel like:

  • on one hand, we have straightforward SQL-compatible primary key semantics (current)
  • on the other hand, we have more complicated non-SQL-compatible primary key semantics

@vitorbaptista commented Sep 12, 2019

Hello all! o/

I agree with @roll that primary keys shouldn't be nullable. It seems to me that something like:

"schema": {
  "unique": ["col_1", "col_2"]
}

that treats null as any other value could solve this. This is different from SQL, where unique constraints aren't enforced when some of their components are null, but I don't see a problem in deviating from that. WDYT?
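
In other words (a tiny illustrative sketch, with None standing in for null):

# Null (None) is treated as a regular value, so two identical
# (2, None) rows violate the constraint.
def rows_unique(rows):
    return len(rows) == len(set(rows))

rows_unique([(1, 2), (2, 1), (2, None)])  # True
rows_unique([(2, None), (2, None)])       # False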

@ezwelty commented Sep 12, 2019

What follows is a bit long, but hopefully clarifies the conversation.

What is being asked here and in frictionlessdata/goodtables-py#252 is a row uniqueness constraint that treats null as a regular value. In that sense, it is a unique key that can also serve as a primary key, despite the presence of null, and can apply to both single and multiple fields.

For a single field, the requested constraint is met by the following:

| x    |
|------|
| 1    |
| 2    |
| null |

but not met by:

| x    |
|------|
| 1    |
| null |
| null |

Yet this x with a repeating null is considered unique in PostgreSQL (and MySQL, Oracle, Firebird, SQLite):
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=21d04f1b1d151f5d0180a7f753544f95
But not in Microsoft SQL Server:
https://dbfiddle.uk/?rdbms=sqlserver_2019l&fiddle=21d04f1b1d151f5d0180a7f753544f95

For multiple fields, the requested constraint is met by the following:

| x    | y |
|------|---|
| 1    | 1 |
| 2    | 1 |
| null | 2 |

but not met by:

| x | y    |
|---|------|
| 1 | 1    |
| 2 | null |
| 2 | null |

Yet this x, y with a repeating 2, null is considered unique by PostgreSQL (and MySQL, Oracle, Firebird, SQLite):
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=5845f4945ba2fcd20ab710530b2348de
But not by Microsoft SQL Server:
https://dbfiddle.uk/?rdbms=sqlserver_2019l&fiddle=5845f4945ba2fcd20ab710530b2348de

So I believe Table Schema needs to be amended in two ways:

  • Clarify that the fields in primaryKey cannot contain null (i.e. an implicit required: true).
  • Then:
    • Allow null in primaryKey fields (with either explicit required: false in field constraints or a boolean switch on primaryKey) and clarify that null should be treated as a regular value (per Microsoft SQL Server).
    • and/or Add a new unique key constraint which treats null as a regular value either by default or with a boolean switch.

@ezwelty commented Sep 26, 2019

Having given this some more thought, I would like to make a specific proposal for amending the Table Schema:

  • Clarify that fields in primaryKey cannot contain null, per SQL specification.
  • Introduce the uniqueKeys property, an array of uniqueKey objects each of which must contain a fields string or array (like foreignKeys) and a boolean property, say nullUnique, that dictates whether null are considered unique (per PostgreSQL, MySQL, Oracle, Firebird, SQLite) or like any other value (per Microsoft SQL Server). The examples from my previous post would be helpful to include in the documentation. missingValuesUnique would better match the existing missingValues, but could be misconstrued to refer to how string representations of null should be handled.

This follows SQL specifications while also making up for their shortcomings by making the handling of null explicit and adjustable. Thoughts?

{
  "primaryKey": ["a"],
  "uniqueKeys": [
    {
      "fields": ["b", "c"],
      "nullUnique": true
    },
    {
      "fields": ["c", "d"],
      "nullUnique": false
    }
  ]
}
| a | b | c    | d    |
|---|---|------|------|
| 1 | 1 | 1    | 1    |
| 2 | 2 | null | 2    |
| 3 | 2 | null | null |
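
To make the proposed check concrete, here's an illustrative sketch (the function is mine, not spec language) of how a validator might evaluate one uniqueKeys entry against the table above, with None standing in for null:

# Check one uniqueKeys entry. null_unique=True: keys containing a null
# never collide (nulls are "unique"); null_unique=False: null participates
# in the comparison like any other value.
def satisfies_unique_key(rows, fields, null_unique):
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in fields)
        if null_unique and any(v is None for v in key):
            continue
        if key in seen:
            return False
        seen.add(key)
    return True

rows = [
    {"a": 1, "b": 1, "c": 1,    "d": 1},
    {"a": 2, "b": 2, "c": None, "d": 2},
    {"a": 3, "b": 2, "c": None, "d": None},
]
satisfies_unique_key(rows, ["b", "c"], null_unique=True)   # True
satisfies_unique_key(rows, ["c", "d"], null_unique=False)  # True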

@rufuspollock commented Sep 28, 2019

@ezwelty this sounds very sensible. Could you submit a pull request to patterns with the uniqueKeys approach?

@ezwelty commented Oct 1, 2019

@rufuspollock I will do that, and will include a clarification to the specs that primaryKey cannot contain null.

@ezwelty commented Oct 14, 2019

I made a PR for what we have discussed up to now, but there are a couple of additional ambiguities regarding null values that we should address. Perhaps by moving my proposed uniqueKey.nullUnique to schema.nullUnique.

  • field.constraints.unique: Should null values be treated as unique or equal to each other? For example, is [(1), (null), (null)] unique? Yes if schema.nullUnique: true; no if schema.nullUnique: false.
  • foreignKeys: Are null values permitted in the local or reference keys? SQL says yes. If so, how should they be treated? Per SQL, the local key can contain keys with column values not present in the reference key if at least one of the columns is null. For example (dbfiddle): (1, null) is allowed locally even if the reference is [(2, 1), (3, 1)] (see the sqlite3 sketch below). The reference key must be unique, the meaning of which would be determined by its schema.nullUnique.
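
The foreign key behavior above can be demonstrated with Python's built-in sqlite3 module (SQLite implements SQL's default MATCH SIMPLE semantics); an illustrative snippet:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE parent (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
con.execute("CREATE TABLE child (a INTEGER, b INTEGER,"
            " FOREIGN KEY (a, b) REFERENCES parent (a, b))")
con.execute("INSERT INTO parent VALUES (2, 1), (3, 1)")
# Accepted even though (1, NULL) matches no parent row,
# because one column of the local key is null.
con.execute("INSERT INTO child VALUES (1, NULL)")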

p.s. Do you think nullUnique sounds better renamed to nullsUnique or inverted to nullsEqual?

@rufuspollock commented Oct 15, 2019

@ezwelty my feeling would be to go with the default of PostgreSQL and SQLite, which is that nulls are not considered in uniqueness, rather than having yet another config option. Can you see a downside to this? (My concern here is for implementors who have to support this stuff ...)

@ezwelty commented Oct 16, 2019

@rufuspollock I guess I can think of two downsides:

First, there are datasets in the wild where rows containing null should not be ignored in uniqueness constraints. See the example I gave in #593 (comment) (1).

Second, pandas considers null not unique, in the same way that Microsoft SQL Server does...

import numpy as np
import pandas

pandas.DataFrame(dict(x=[1, 1, 1], y=[1, np.nan, np.nan])).duplicated()

0    False
1    False
2     True
dtype: bool

There isn't anything close to a consensus on whether null == null or null != null, so I think it would be best for the specs to provide a standard way to indicate the expected behavior. I would rather have an implementation tell me that it does not support nullUnique: false than have it silently treat null differently than intended.

@rufuspollock commented Oct 21, 2019

@ezwelty ok - and let's have a sane default and implementors can go with that if they don't support the switch explicitly.

@ezwelty commented Oct 21, 2019

Great. In that case, uniqueKeys should just be an array of arrays/strings.

{
  "uniqueKeys": [
      ["b", "c"],
      ["c", "d"],
      "z"
  ]
}
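
(A trivial sketch of the normalization an implementation might apply, mirroring how foreignKeys.fields accepts either form; the helper name is illustrative:)

# Normalize uniqueKeys entries: a bare string like "z" is shorthand
# for the single-field key ["z"].
def normalize_unique_keys(unique_keys):
    return [[k] if isinstance(k, str) else list(k) for k in unique_keys]

normalize_unique_keys([["b", "c"], ["c", "d"], "z"])
# [['b', 'c'], ['c', 'd'], ['z']]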

So we need to decide:

  • Where to put the property. I suggest it should be in the Table Schema "root", alongside global controls like missingValues.
  • What to call the property. Although I originally suggested nullUnique, uniqueNulls might fit best with the current naming convention (missingValues, foreignKeys, unique).
  • What the default should be. This I'd rather leave to the core team to decide. What have you been assuming was the behavior up to now? Choose unique nulls to align with most SQL implementations, and non-unique nulls to align more with Python's pandas, R's data.frame (and derivatives), and spreadsheet software.
@rufuspollock commented Oct 22, 2019

@ezwelty commented Oct 28, 2019

@rufuspollock I asked at https://gitter.im/frictionlessdata/chat. @roll weighed in with a preference for null being unique (i.e. ignored) by default in uniqueness constraints (although currently goodtables-py does the opposite). Seems reasonable to me, especially since null are ignored by all the other field constraints. How does that sound? Should I amend my uniqueKeys pattern and add another for uniqueNulls?

@rufuspollock commented Oct 28, 2019

Agreed, let's go with @roll's suggestion.

@rufuspollock rufuspollock reopened this Nov 7, 2019
rufuspollock added a commit that referenced this issue Nov 7, 2019
Add uniqueNulls property to unique constraints pattern - refs #593.