Foreign Key attribute in JSON schema #23
Comments
@mk270 comments please :-) |
As discussed, we should handle this outside the table schema. |
Can either of you point me to a doc explaining the proper way to do this, if not directly embed it in the Table Schema? It seems like a useful feature and I'm not sure of the best way to accomplish something like this. Our use case is a bit different since we're using REST web services, so we're considering either embedding small factors directly into the Schema (a la #29) or providing a reference to the REST call which would provide such a mapping ( |
I think this and primary keys #21 should probably go in. |
I like it. The more I think about it, the REST-based approach I mentioned would actually fit within your proposed solution. If I were to host a datapackage.json file at the root of my webapp, all of my data/file references could just be relative URLs defining the GET/list functions for each type of object I'm interested in. Assuming my REST service produces strictly JSON Table Schema, we could end up with a pretty clean solution. What's the best next step for this? We're trying to move pretty quickly on our client application, so I'd be happy to write up the docs and submit them in a pull request as we continue our development if we're in agreement that this should happen within the scope of this spec? If not, I'll just document it internally and save myself some time. Let me know... |
FYI: The project now supports foreign keys in the following format:
Note that we're now using
As soon as a single match is found at any of these steps, no more searching is done. (As you can see, I think it would be easier for clients to implement this feature once this is pinned down in the spec). In R, there's not an obvious way to preserve the underlying ID of an object and still map it to another abstract object ("hash" in JS or "list" in R) within the context of a
I created an issue to support more robust mappings of IDs to complex objects (https://github.com/QBRC/RODProt/issues/12), but it will require some engineering to get that functionality in place. |
@trestletech I really like this. Where does this go inside the JTS? Do you put it parallel to fields? |
We went within fields to more clearly identify which source field was associated with the foreign key. For instance (https://github.com/QBRC/RODProt/blob/master/inst/extdata/datapackage.json#L58)
(in this case package is a relative path, but I imagine it would typically be a URL -- just harder to unit test a package on a remote URL). |
Interesting. How would you handle multiple foreign keys (is that possible)? I'm in two minds about whether we inline these or split them out into a separate foreignkeys attribute (more like SQL).
Best thing here is you have running code which is what matters! |
I'm not sure which way you intend "multiple." To me, multiple foreign keys (single-columned FKs on multiple columns) could be handled well in this approach (indeed, we're using that pretty heavily in our application). To extend the previous example:
If you mean a single column referencing multiple other resources simultaneously, I suppose you could support a foreign key array?
(pardon any syntax errors). To be honest, though, I don't think I've ever seen that use case -- maybe it's more common in other fields. Finally, if you meant composite foreign keys, then I agree that would be a limitation of this style. In my experience, though, the trend (for better or worse) has been to move away from composite keys where possible. I think at least part of this is motivated by the rise of web applications and the need to de/serialize data more easily. Most of my ORM experience has been with Hibernate, so I can only speak to that library, but I know its support for composite keys is pretty sloppy. Essentially, you typically need to embed your composite key as a single object so that Hibernate can treat it as a single reference to another table. So instead of:
you'd have to use
I suppose you could take a similar approach and specify that the
I'm not sure how well that would work or if it would be worth the complexity, but it would be one approach. I may just be biased by our application, but if I'm already serializing the data to JSON, I would have long given up the luxury of "fancy" relational mappings involving composite keys. Perhaps I'm in the minority there, though. I'm not sure about other common client languages like Python, but I know in R the support for a concept of a foreign key is so simplistic that it would require a pretty foundational re-write of some of the base data structures in the language to even get something like a composite foreign key into the language. |
I think we are making great progress here. I think there is agreement on:
In terms of the foreignkey hash I would suggest a minor tweak to rename id to field: { "pkg": "../extdata/datapackage.json", "resource": "anotherData", "field": "id-of-field" } Questions:
|
All sounds good to me. |
My immediate use case for data packages was discoverability of other datasets, at which point multiple foreign key relationships become very useful... e.g. for drug relationships, the BNF code is a foreign key to
Being able to hit each of these in an explorer would allow me to get all the related metadata from different publishers... In principle. |
A foreign key into another resource in the same package could simply omit the {
"id": "otherdata",
"foreignkey": {
"resource": "anotherData",
"field": "id-of-field"
}
} |
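A minimal sketch of how a consumer might resolve such a reference, assuming the omitted attribute is the package reference (`pkg` in the earlier proposal) and that leaving it out means "the package containing this schema". The function name `resolve_foreign_key` is hypothetical and not part of any spec or library.

```python
def resolve_foreign_key(foreignkey, current_package):
    """Return (package, resource, field) for a foreignkey hash.

    Assumption: a missing "pkg" entry points at the package that
    contains the schema being read (current_package).
    """
    package = foreignkey.get("pkg", current_package)
    return package, foreignkey["resource"], foreignkey["field"]


# Same-package reference: no "pkg" attribute.
same_package = {"resource": "anotherData", "field": "id-of-field"}

# Cross-package reference, as in the earlier proposal.
other_package = {
    "pkg": "../extdata/datapackage.json",
    "resource": "anotherData",
    "field": "id-of-field",
}

local_target = resolve_foreign_key(same_package, "datapackage.json")
remote_target = resolve_foreign_key(other_package, "datapackage.json")
```

With this default, a client only needs one code path for both local and remote references.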
Any news on this issue?
Thanks! |
OK, final recommendation is:
With:
Note rename of pkg attribute to datapackage Example:
|
@paulfitz interested in your thoughts here especially given your comments on primary key and experience with coopy ... I'd like to get this and the primary key stuff closed and into JTS asap ... |
Why not make Also, why As in #21, I think a separate section for |
@jpmckinney all good points. For foreign keys, which are already quite complex, I guess this makes sense. I suppose for other items, e.g. even unique or primary key, it makes things more of a hassle to write. Overall: I get the feeling that a good set of people want constraints outside of the fields in a separate section à la SQL. @paulfitz your thoughts either way would be very useful. @sballesteros ditto ... |
I think a composite (multi-column) primary key is much easier to read if all the columns appear side-by-side, e.g.: {
"fields": [
...
],
"indexes": [
{
"type": "primary",
"fields": ["column1", "column10"]
}
]
} If you see a "primary_key" property on the first field in a |
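As a rough illustration of why the side-by-side layout is convenient for consumers, here is a sketch that reads the composite primary key from the `indexes` layout shown above and checks rows for duplicate key tuples. The helper names and the sample rows are assumptions made up for this example.

```python
def primary_key_fields(schema):
    """Return the field names of the first index typed "primary", or None."""
    for index in schema.get("indexes", []):
        if index.get("type") == "primary":
            return index["fields"]
    return None


def find_duplicate_keys(rows, key_fields):
    """Return every composite key tuple that occurs more than once."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row[name] for name in key_fields)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


schema = {
    "fields": [],
    "indexes": [{"type": "primary", "fields": ["column1", "column10"]}],
}

# Hypothetical sample data: the first and third rows share a key.
rows = [
    {"column1": "a", "column10": 1},
    {"column1": "a", "column10": 2},
    {"column1": "a", "column10": 1},
]

key_fields = primary_key_fields(schema)
duplicates = find_duplicate_keys(rows, key_fields)
```

Because all key columns sit in one list, the composite key is built in a single pass without scanning every field definition.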
EDIT: This was based on a wrong understanding of foreign keys. What follows is largely irrelevant for this issue and is closer to materialized view. Having foreignkeys in the
data referred by the foreignkeys are not available in dpkg1 (the resource So to me foreignkeys allow to recreate composite JSON Table Schema from fields originating from different data packages. Generalizing outside JSON Table Schema it would be good to have a discussion on how to use the top-level property If we stick with the current foreignkey hash and generalize it for non SDF resources we just need to omit the
Here "foreignkey" is an extra possible property in addition to "data", "path" or "url". If it is the intended use maybe "foreignkey" is not the best possible word. |
@sballesteros That's not how foreign keys work... the What you're describing is an entirely different feature, which may be worthwhile, but is not what a foreign key is. What you're describing is closer to a materialized view, that joins multiple tables together. |
@jpmckinney thanks for clarifying. So what I suggest is more related to the way dependencies will be handled. Got it now... Foreign key is a constraint to ensure that the values taken by the foreignkey are elements of the resource field it points to. |
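The constraint reading of a foreign key can be sketched as a simple membership check: every value in the referencing field must appear in the referenced field. The function name, field names, and sample rows below are hypothetical and only serve to illustrate the idea.

```python
def missing_references(child_rows, child_field, parent_rows, parent_field):
    """Return child values that do not occur in the referenced parent field."""
    allowed = {row[parent_field] for row in parent_rows}
    return [
        row[child_field]
        for row in child_rows
        if row[child_field] not in allowed
    ]


# Hypothetical referenced resource and referencing resource.
parent_rows = [{"id": 1}, {"id": 2}]
child_rows = [{"parent_id": 1}, {"parent_id": 3}]

violations = missing_references(child_rows, "parent_id", parent_rows, "id")
```

An empty result means the foreign key constraint holds; any returned value is a dangling reference.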
@sballesteros Duplication is useful if the third-party resource disappears. There are many other reasons. Update: My previous message was about a comment that was later edited. Yes, that's what foreign keys do. |
@sballesteros If we put the types as keys, then yes, using underscores or camelcase is better. But if we're putting the types as values, then it makes no difference. Edit: Never mind, I misunderstood what you meant by creating a hash. |
@jpmckinney I was talking about an implementation. Using a hash with types values as keys to avoid some if statements. |
+1 for parallel option. @jpmckinney's array suggestion seems reasonable. Agree with @sballesteros about avoiding spaces. Could also rename |
Another +1 from @besquared in #21 for |
Going to repost this here since apparently #21 isn't the right place anymore. Howdy guys I just wanted to add my recommendation here based on a few things.
I don't think adding any kind of key description at the field level is correct. I don't think it's a matter of shorthand either, as structurally all keys are properties of a dataset and not of a field.
Indices are a database-specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either, although implementations generally constrain primary keys to be unique. One vendor-specific example is that the InnoDB engine allows both non-unique indices and foreign keys to non-unique indices. IMO, if the specification's goal is to be producer and consumer agnostic, then it shouldn't include references to the concept of an index and should leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag, as it might be seen as a description of the key's relationship to the dataset and not a constraint that the package is defining. As far as how to implement a specification for keys, I think the most generic form would try to only include things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:
|
My one objection to the array approach for keys/constraints is that it makes it more painful for consumers - rather than just doing:
I have to iterate through the list of constraints. @besquared re constraints / keys in naming: I feel a little ambivalent. I think it would be nice to include unique requirements in there at some point and they definitely aren't keys ... (however this isn't a biggie). I do think generally that keys/constraints should live inside the schema attribute and be part of JSON Table Schema (@besquared your example suggested otherwise but I was not sure if that was intentional ...) |
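One way a consumer could soften the cost of the array approach is to group the entries by type once and then use direct lookups afterwards. This is only a sketch: it reuses the `type`/`fields` layout from the `indexes` example earlier in the thread, and both the `"foreign"` type value and the `group_by_type` helper are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_type(entries):
    """Group a list of key/constraint entries by their "type" value."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["type"]].append(entry)
    return dict(grouped)


indexes = [
    {"type": "primary", "fields": ["column1", "column10"]},
    {"type": "foreign", "fields": ["column2"]},  # hypothetical type value
    {"type": "foreign", "fields": ["column3"]},
]

by_type = group_by_type(indexes)
primary = by_type["primary"][0]      # direct access after one pass
foreign_count = len(by_type["foreign"])
```

The single pass is still an iteration, so this does not remove the objection, but it confines the loop to one place in the client.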
@rgrp For all except the primary key, you'll still need to iterate, as only the primary key will have a single key object as its value. The others will have arrays. If your app loads all keys, you'll have a double-nested iteration if the Also, can we have separation between words, e.g. Also, unique keys are keys (it's in the name). How are they "definitely not keys"? @besquared is correct in saying that constraints in the language of DBs are something different, e.g. you can put a constraint at the field-level on a price field to say that it must have a value greater than zero. |
+1 for everything @jpmckinney is saying. Also @rgrp, I originally did think keys belonged to the resource and not the schema, but I take it back now: they obviously belong in the schema. To draw some inspiration from Codd's relational model, something like the format I proposed above allows us to correctly express the three types of relational constraints. @rgrp other options include a keys section keyed by type
This seems fine too and allows the kind of object access one would expect (resource.schema.keys.primary) while still allowing multiple foreign keys and allows the keys section to remain pretty robust to extension later. |
This might also be a good time to talk about foreign keys vs. references. In the relational model foreign keys imply a constraint in which the foreign relation must have a corresponding value for the keyed fields in order for a new relation to be added. This allows referential integrity between relations to be maintained. If what the specification is trying to achieve is a way to link data together (but not necessarily imply a constraint) then we might want to have a section called "references" instead of a set of foreign keys. Consumers who would like to resolve reference packages and apply foreign key constraints in their data management system may choose to do so but wouldn't be required to. This leads to another possible schema layout:
I like this as well for various reasons. It leaves the concept of constraints for a later time and still allows consumers to optionally infer foreign key constraints. Thoughts? |
So along with @jpmckinney's proposal we're going with out of line items. Here's the updated proposal:
Questions:
rfc @paulfitz @jpmckinney @besquared @davidmiller @sballesteros |
|
There's some mixing here between #21 and #23. Everything should just standardize on camelCase. Given that it's likely that primaryKey will become a thing it seems reasonable that foreignKeys should become the other thing. I think it's important to tell people that this might not have the same meaning that it does in their rdbms. This seems ok. |
+1 for foreignKeys and lowerCamelCase. |
+1 for foreignKeys and lowerCamelCase too. |
This has become a pretty big feature for me to get into the specification. Any ideas about ETA @rgrp? |
One idea I have which would be a nice to have is to create some sort of a fallback sequence. It might make everything more complicated but what I'm thinking is that This means I can have multiple resources for datasets published in different phases. I'm thinking about a case for budget data where countries have different budget phases. The problematic one is when we have an approved budget which can be adjusted at a later point in time. Adjustments are not necessarily a completely new dataset but changes to different rows of the approved budget. Let's say we represent this with two resources (marking them with dates for clarity): Another option would be to make Just an idea. I'm glad to drop it and use up more disk space if that means we can get foreign keys in as soon as possible. |
@tryggvib at this point I doubt we're going to expand for this use-case. Right now we have a finalized spec and I think it will go in. |
Great. Happy to see this go in as long as I have foreign keys :) |
Another use case I came up with just now, though, is that this would let you break resources up into more manageable sizes. For example, with one resource per year, a foreign key in another resource could include the year and fall back through the resources broken down by year. |
FIXED. Foreign Key support is now in - see http://dataprotocols.org/json-table-schema/#foreign-keys Huge thank-you to everyone who contributed here, and please double-check the final implemented version for any errors that need to be corrected or emendations that should be made. |
Suggest in a type field: