Map collections of primitive types to JSON column in relational database #29427

Closed
Tracked by #30731
ajcvickers opened this issue Oct 26, 2022 · 20 comments · Fixed by #30738
Labels
area-json area-primitive-collections closed-fixed type-enhancement
Comments

@ajcvickers
Member

ajcvickers commented Oct 26, 2022

Contrast to #25163 (map primitive collections to another table).
Contrast also to #28688, which is about allowing primitive collections inside JSON documents.
Contrast as well to #28871, which is about allowing any unmapped value in a JSON document, not just primitives.

This type of mapping doesn't involve a new table in the database, but it could still allow queries that use the JSON query functionality of the relational database (a minimal sketch of the proposed shape follows the notes below).

Notes:

  • The JSON support here would be for JSON stored in a simple property, not aggregates mapped via owned types.
  • Also consider other issues related to primitive collections--see label area-primitive-collections.
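
A minimal sketch of the shape being proposed, assuming an illustrative Post entity and BlogContext (none of these names come from the issue itself):

```csharp
using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

public class Post
{
    public int Id { get; set; }

    // The proposal: a plain collection of a primitive type on the entity maps to a
    // single column holding a JSON array (e.g. an nvarchar(max) column containing
    // ["ef","json"] on SQL Server), instead of requiring an owned type or a separate table.
    public List<string> Tags { get; set; } = new();
}

public class BlogContext : DbContext
{
    public DbSet<Post> Posts => Set<Post>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlServer("<connection string>");
}
```

A query such as `context.Posts.Where(p => p.Tags.Contains("ef"))` could then translate to the database's JSON query functions (e.g. OPENJSON on SQL Server), which is the relational JSON query functionality referred to above.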
@roji
Member

roji commented Dec 11, 2022

@ajcvickers @maumar is this a dup of #28688?

@roji roji removed this from the Backlog milestone Dec 11, 2022
@ajcvickers
Member Author

@roji No. #28688 is about collections of primitive types inside another type where the other type is mapped to a JSON document. This is about mapping a property that is itself a collection of primitive types. In my opinion, this should end up being the by-convention mapping for collections of primitive types, although I think some others favor making #25163 the default mapping.
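
A rough illustration of the distinction (all class and property names are made up for the example):

```csharp
using System.Collections.Generic;

// #28688: a primitive collection nested *inside* a type that is mapped to a JSON
// document via the EF 7 owned-entity JSON mapping (OwnsOne(...) followed by ToJson()).
public class Author
{
    public int Id { get; set; }
    public ContactInfo Contact { get; set; } = new();        // owned type, mapped to JSON
}

public class ContactInfo
{
    public string Email { get; set; } = "";
    public List<string> PhoneNumbers { get; set; } = new();  // collection inside the JSON document
}

// This issue (#29427): a property that is *itself* a collection of a primitive type,
// sitting directly on the entity and mapped to its own column.
public class Book
{
    public int Id { get; set; }
    public List<string> Keywords { get; set; } = new();
}
```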

@roji
Member

roji commented Dec 12, 2022

Ah OK, in that case #29825 (which I opened yesterday) is a dup of this. Pasting some stuff I wrote there:

  • This intersects with #28688 (primitive collections in JSON).
  • Both keys and values should support any valid JSON data type (string, int...).
  • However, we could also support Dictionary<string, object>, to map an entire hierarchy. See #26903 on this (currently seems blocked by property bag detection).
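
For reference, the dictionary shape mentioned in the last bullet might look roughly like this on the model side (purely illustrative; whether and how it maps is exactly what the linked issues cover):

```csharp
using System.Collections.Generic;

public class Product
{
    public int Id { get; set; }

    // Could conceivably serialize to a JSON object such as {"color":"red","size":"L"}.
    public Dictionary<string, string> Attributes { get; set; } = new();

    // Dictionary<string, object> (an arbitrary hierarchy) is the harder case
    // discussed in #26903.
    public Dictionary<string, object> Extras { get; set; } = new();
}
```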

@marchy

marchy commented Dec 15, 2022

@ajcvickers mapping to a separate table feels overly heavy for something you are modelling using only a primitive type. The use cases here likely don't scale beyond a few stored values.

I would argue for potentially a CSV format being the default: '1,2,3' for int[], '"a","b","c"' for string[], '"One","Two","Three"' for a MyEnum[] enum mapped with string values.

JSON format ought to be opt-in.

Npgsql has support for these types through the native Postgres text[] type, though it seems like MSSQL could maybe support it using the STRING_SPLIT function on simple string columns? (example)
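
For context, the STRING_SPLIT angle would look roughly like this when issued as raw SQL through EF Core 7's SqlQueryRaw (the Posts table and CSV-encoded Tags column are assumptions for the sketch, and context is the illustrative BlogContext from the first comment):

```csharp
// Assumes a table like: CREATE TABLE Posts (Id int PRIMARY KEY, Tags nvarchar(max))
// where Tags holds CSV text such as 'cow,dog,cat'. STRING_SPLIT (SQL Server 2016+)
// turns the delimited text into rows with a single 'value' column, which is roughly
// the querying story available for CSV-encoded columns today.
var dogPostIds = await context.Database
    .SqlQueryRaw<int>(
        @"SELECT p.Id AS [Value]
          FROM Posts AS p
          WHERE EXISTS (SELECT 1 FROM STRING_SPLIT(p.Tags, ',') AS s WHERE s.value = {0})",
        "dog")
    .ToListAsync();
```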

@roji
Member

roji commented Dec 16, 2022

> I would argue for potentially a CSV format being the default: '1,2,3' for int[], '"a","b","c"' for string[], '"One","Two","Three"' for a MyEnum[] enum mapped with string values.
> JSON format ought to be opt-in.

What would be your argument in favor of CSV over JSON as the default mapping?

@marchy

marchy commented Dec 16, 2022

@roji It’s the simplest possible format for the problem at hand.

The usual use cases I’ve personally seen these used for are tags or categories. Why add extra characters to denote that the data holds an array and have escaping quotes around values when the commas are all that is needed to delineate them?

Correction: I made a mistake in my comment above; the CSV values should be 1,2,3, a,b,c and One,Two,Three - no quotes needed around any of the strings.

Simplicity and storage performance. Opt in to extra complexity (ie: JSON) only if and as you need it.

@roji
Member

roji commented Dec 16, 2022

> Why add extra characters to denote that the data holds an array and have escaping quotes around values when the commas are all that is needed to delineate them?

For one thing, at least for some values (e.g. strings), quoting is just as much a problem for CSV encoding as for JSON. For example, try to use STRING_SPLIT to do querying over a CSV field which does require quoting (because e.g. a comma is present in one of the values) - that's not likely to work well. The point is that JSON is a format that's already recognized and somewhat well-supported by the database, so in that sense it's a superior option to CSV.

The extra characters for quoting are unlikely to make a meaningful difference in real-world scenarios.

@marchy

marchy commented Dec 16, 2022

In what scenarios does quoting actually come into play though?

I would imagine this would only be for user-inputted values, in which case you would likely configure it to be its own separate related table (#25163) as your value-set is going to grow with your userbase (ie: multi-tenancy scenarios) and likely include more complex data structures than a simple primitive value (ie: you'll at least have a Name and a Rank/Order column).

Example: [screenshot omitted]

If you have this scenario you would NOT be leveraging a collection of simple primitive types to model it.

On the flip-side, here's an example of where we do use non-user-inputted tagging/categorization in practice:
[screenshot: column tagging example]

Each of those columns is an array of pre-defined, semantic, well-defined enums.

When it comes to the DB, I would argue perf and storage optimization matter MOST – that's why DBs are DBs, and we use ORMs to abstract them, make them human-centric for developers to work with, and hide away the serialization optimization concerns. Why would we transport any amount of extra bytes/characters between the DB server and our API server? We already have areas where we need to include custom serialization formats for performance reasons to reduce the size of the payload transferred.

Instead of designing for the 1% scenario where someone might be persisting non-semantic data, we should optimize for the scenario that we encounter most often in practice. I believe this has been a recent design philosophy for C# 10/11 as well – simplify and optimize the common real-world scenarios, then provide a fallback to opt in to the once-in-a-blue-moon scenarios if and only when you need it.

@roji
Member

roji commented Dec 16, 2022

> I would imagine this would only be for user-inputted values, in which case you would likely configure it to be its own separate related table [...]

That's very speculative - I'm sure many developers would be happy representing lists of user inputs in a simple column rather than a separate table - especially given how easy it is to do with JSON. Not to mention performance: the cost of doing an extra JOIN because you've decided to put your value list in another table is likely to be significantly more than the overhead of the quoting characters in a JSON document.

I certainly don't think storing lists of arbitrary strings (which may contain commas) is a once-in-a-blue-moon scenario. I also think that before prematurely optimizing for the scenario where there are no commas, we'd want to see some hard proof that JSON really is significantly slower than CSV here - which I doubt.

Bottom line, I don't think we'd want the default mapping strategy for primitive collections to be something that breaks down as soon as you stick a comma in one of the values.

@marchy

marchy commented Dec 16, 2022

Would it be that big a deal to add a .ToJson() spec on the column to opt into the JSON serialization? In fact, would it not actually be preferable to keep it in line with the (fantastic btw) JSON column feature – which requires opt-in if you want to use the JSON format?

> JSON is a format that's already recognized and somewhat well-supported by the database, so in that sense it's a superior option to CSV.

I don't think this is true for MS SQL: "JSON text is stored in VARCHAR or NVARCHAR columns and is indexed as plain text." (SQL Server 2022 docs). Seems like it's treated as just a raw string equivalent in the DB.

Here's another consideration: standardization across EF providers.

Given that Postgres DOES support array types natively and would by default map collections to its native text[] DB type – if you wanted it to do something different, such as leverage a higher-level format on top of a raw string column (ie: JSON) – then this ought to be opt-in.

I think the "bottom line" you stated falls exactly into theoretical / edge-case optimization rather than doing the best thing the large majority of the time (ie: the +90% scenario).

DBs should be optimized for performance, simplicity and storage efficiency. OO is for humans, and that is the reason for Entity Framework to exist: to hide away the optimization complexities around the impedance mismatch – that's its core value prop. If there's any bottom line or guiding principle here, it ought to be that.

@roji
Member

roji commented Dec 16, 2022

> Would it be that big a deal to add a .ToJson() spec on the column to opt into the JSON serialization? In fact, would it not actually be preferable to keep it in line with the (fantastic btw) JSON column feature – which requires opt-in if you want to use the JSON format?

The EF 7 JSON columns feature is an opt-in because there already is a default behavior when mapping owned entity types for relational databases: table splitting. In other words, by default the properties of the owned entity are simply mapped to columns in the same table as the owner, which seems like a very reasonable default that people have been happy with.

However, with primitive collections there's no current default: just sticking a string[] property on your entity type will fail on SQL Server. So there's the question of what the default should be - or possibly whether there should be one at all. We could continue failing by default - requiring you to do ToJson() - or we can map to JSON by default, or CSV or something else. I do believe it's better to pick a reasonable default (just like we did with table splitting for owned entity types) and just do that, rather than forcing users to explicitly opt into it.

BTW one possible reason to require an explicit ToJson would be to allow System.Text.Json to possibly get trimmed when ToJson isn't specified. We'd likely need to do considerable other work to make that happen.
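
The two shapes under discussion, roughly sketched (the property-level ToJson() below is hypothetical API shown only to illustrate the opt-in idea; it is not an existing EF method):

```csharp
public class Post
{
    public int Id { get; set; }
    public string[] Tags { get; set; } = System.Array.Empty<string>();
}

// Option A - by convention: no configuration at all; the provider picks a reasonable
// column representation for Tags (JSON text, CSV, a native array type, ...).

// Option B - explicit opt-in: something along the lines of the hypothetical
// property-level ToJson() being discussed, e.g.
//
//     modelBuilder.Entity<Post>()
//         .Property(p => p.Tags)
//         .ToJson();   // hypothetical; today this call does not exist for properties
```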

> I don't think this is true for MS SQL: "JSON text is stored in VARCHAR or NVARCHAR columns and is indexed as plain text." (SQL Server 2022 docs). Seems like it's treated as just a raw string equivalent in the DB.

I never said SQL Server stored JSON in anything other than text (e.g. nvarchar). What I said was that the format is recognized and well-supported via various functions. For example, you can easily index into a JSON array via a built-in SQL Server function (JSON_VALUE), whereas you mostly cannot do that with CSV, especially if escaping comes into play.
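
As a concrete illustration of "recognized and well-supported" (JSON_VALUE is a built-in SQL Server function; the Posts table and JSON-encoded Tags column are assumptions of the sketch, reusing the illustrative context from earlier):

```csharp
// '$[0]' is a JSON path selecting the first element of the array stored in Tags;
// there is no comparable built-in way to index into a CSV string, especially once
// escaping comes into play.
var postIdsWithEfFirst = await context.Database
    .SqlQueryRaw<int>(
        "SELECT p.Id AS [Value] FROM Posts AS p WHERE JSON_VALUE(p.Tags, '$[0]') = {0}",
        "ef")
    .ToListAsync();
```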

> Given that Postgres DOES support array types natively and would by default map collections to its native text[] DB type – if you wanted it to do something different, such as leverage a higher-level format on top of a raw string column (ie: JSON) – then this ought to be opt-in.

Different providers can certainly do different things - there's no reason to force an opt-in to JSON on the SQL Server provider just because the PG provider by default maps to its natively-supported array types. Just like on the PG side we don't require an explicit opt-in to map to arrays, even though other providers don't support arrays.

> I think the "bottom line" you stated falls exactly into theoretical / edge-case optimization rather than doing the best thing the large majority of the time (ie: the +90% scenario).

You seem to be saying that CSV is superior for over 90% of the scenarios out there; in other words, that over 90% don't need escaping/quoting, and that the perf difference between CSV and JSON is important enough to choose CSV over JSON (assuming it's better).

That really is purely a subjective, unbacked intuition of yours which I don't share... If you'd like to substantiate your argument, I'd start by actually comparing performance here, posting an actual benchmark showing that CSV is really better.

@marchy

marchy commented Dec 16, 2022

Thanks @roji, it's great we're picking this apart.

> The EF 7 JSON columns feature is an opt-in because there already is a default behavior when mapping owned entity types for relational databases: table splitting. In other words, by default the properties of the owned entity are simply mapped to columns in the same table as the owner, which seems like a very reasonable default that people have been happy with.
>
> However, with primitive collections there's no current default: just sticking a string[] property on your entity type will fail on SQL Server. So there's the question of what the default should be - or possibly whether there should be one at all. We could continue failing by default - requiring you to do ToJson() - or we can map to JSON by default, or CSV or something else. I do believe it's better to pick a reasonable default (just like we did with table splitting for owned entity types) and just do that, rather than forcing users to explicitly opt into it.

Great points on the opt-in side. I fully agree that the gap is two-fold: should there be a default mapping, and what should that mapping be?

In terms of the ability for EF to map a primitive collection to a JSON column, I fully agree that there should be. We use both JSON columns and CSV arrays heavily in our DB schemas as a more modern and practical alternative to having tables for everything.

> Different providers can certainly do different things - there's no reason to force an opt-in to JSON on the SQL Server provider just because the PG provider by default maps to its natively-supported array types. Just like on the PG side we don't require an explicit opt-in to map to arrays, even though other providers don't support arrays.

I think minimizing fragmentation across DB providers should be a significant design goal for EF, so taking into account the functionality of other databases (in particular Postgres), and not just applying an MS SQL mindset, ought to be given high consideration.

NOTE: Oracle also supports arrays with its VARRAY data type.

> You seem to be saying that CSV is superior for over 90% of the scenarios out there; in other words, that over 90% don't need escaping/quoting, and that the perf difference between CSV and JSON is important enough to choose CSV over JSON (assuming it's better).
>
> That really is purely a subjective, unbacked intuition of yours which I don't share... If you'd like to substantiate your argument, I'd start by actually comparing performance here, posting an actual benchmark showing that CSV is really better.

Yup, 100% intuitive and subjective, and I am not a core member of the EF team able to take this on as a deep-dive research initiative and turn it into pure data-driven science – you guys are. I draw my sense from a slew of architecture experiences across large enterprises, FAANG and early-stage startups, and from what comes up there in the wild. This is of course an informed/educated opinion, one which leans into favouring simplicity and pragmatism with incremental complexity (the YAGNI principle).

Based on this, CSV feels like a perfectly reasonable default for this with JSON as an opt-in – though I would say even having NO serialization as a default and having JSON as an opt-in would be better than having JSON as an arbitrary default on MS SQL. This would unify the functionality across providers: if a provider has a DB-backed array type it can map to it by default; if not, you can opt in to JSON explicitly, and each provider would model it in whatever way is best for that database.

NOTE: The best JSON representation in Postgres is not a raw string, it's the json or jsonb types (docs). Similarly, Oracle has its JSON type, which uses a binary JSON representation named OSON (docs).

Option 1: CSV default (NEW), JSON opt-in (NEW)
Option 2: No default on MS SQL (existing functionality), JSON opt-in (NEW)
Option 3: JSON default on MS SQL (NEW), no unified JSON opt-in (GAP)

Option 1 scores higher on provider fragmentation (or lack thereof), provides a unified experience across them (don't surprise the user principle), has higher storage and payload efficiency and higher serialization simplicity on MS SQL. This is why it is my favoured and recommended approach – but Option 2 would still be better than Option 3 as it will at least allow JSON serialization to be leveraged by those few people that need it and don't want to add a custom CSV converter for all their semantic array value scenarios (existing status quo).

@roji
Member

roji commented Dec 17, 2022

The thing I'm still missing here is some sort of proof that JSON is significantly less performant, which is what you were claiming above.

> Based on this, CSV feels like a perfectly reasonable default for this with JSON as an opt-in – though I would say even having NO serialization as a default and having JSON as an opt-in would be better than having JSON as an arbitrary default on MS SQL. This would unify the functionality across providers: if a provider has a DB-backed array type it can map to it by default; if not, you can opt in to JSON explicitly, and each provider would model it in whatever way is best for that database.

Well, that wouldn't really be a consistent cross-database experience - on PG (and Oracle) you'd get implicit mapping to the DB native array type, whereas on SQL Server you wouldn't (that option doesn't exist). I'm not sure how forcing people to explicitly opt into JSON (or CSV) on SQL Server helps unify anything; the databases really are different in what they support, and I think the "unified" experience users expect is for the mapping to just work, regardless of the specific mechanism used to represent the array (native array type on PG, JSON or CSV).

Re your options, no option here provides a unified cross-database experience in terms of how the data is stored; that's going to be the case no matter what we do. The only unified experience we can possibly provide is making the mapping "just work" without any explicit opt-in, but with differing storage mechanisms.

Regardless, in your comparison you've left out the fact that CSV breaks down (badly!) when someone inserts a comma. You may speculate that this is rare, but forcing people to discover this and go through the pain of possibly converting their data just doesn't seem justified - at least for string data. We may consider a different mapping based on the data type (e.g. CSV for ints, JSON for strings), though going into even more complexity here also has a price.

Finally, note that again, there's a chance we may want to avoid JSON as an implicit default anywhere in order to enable trimming of System.Text.Json. That really is the strongest argument I'm seeing here for avoiding this as a default.

@marchy

marchy commented Dec 17, 2022

Ahh, there's a misunderstanding here.

What I mean by consistency across providers is that if you were to use an explicit opt-in to JSON (.ToJson()), each database would map to JSON using whatever native JSON representation it has, so there is at least consistency across providers in the EF API if you want to map collections of primitives to JSON. Every DB provider supports JSON at this point, so it would be a win to have the same API on all of them map collections of primitives to JSON.

If you were to NOT specify it, then each DB would do whatever it can to support the collections natively (again, consistent in behaviour here), so that if a particular DB does have a native column format for this out of the box, it can be used – or if there is a reasonable alternative (ie: the delimiter-separated option above), then it can use that (either Option 1 or Option 2 on SQL Server).

One other thing to consider with the whole CSV approach is that if the comma delimiter is an issue, the delimiter could be configured explicitly (either by convention or at the property level), that way if you want to use |, :: etc. to delimit between your values you can. I brought up the comma as a standard delimiter as a reasonable default since it's human-friendly to read.
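
For what it's worth, an explicit, delimiter-configurable mapping is already expressible today with a value converter. A minimal sketch, assuming the illustrative Post entity (with a List<string> Tags property) from earlier and a '|' delimiter chosen by the developer; a ValueComparer would also be needed for the collection to be change-tracked correctly:

```csharp
// Inside OnModelCreating (needs using System; using System.Linq;)
modelBuilder.Entity<Post>()
    .Property(p => p.Tags)
    .HasConversion(
        tags => string.Join("|", tags),                                           // List<string> -> "a|b|c"
        text => text.Split('|', StringSplitOptions.RemoveEmptyEntries).ToList()); // "a|b|c" -> List<string>
```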

@roji
Member

roji commented Dec 18, 2022

> One other thing to consider with the whole CSV approach is that if the comma delimiter is an issue, the delimiter could be configured explicitly (either by convention or at the property level), that way if you want to use |, :: etc. to delimit between your values you can. I brought up the comma as a standard delimiter as a reasonable default since it's human-friendly to read.

Of course, but that doesn't really help with the fundamental problems in any way. First, it doesn't help where escaping is necessary (i.e. where there's no possible safe delimiter choice because the strings are arbitrary).

More importantly, the user must be aware of this detail/complexity (discoverability). I'm thinking of the naïve user simply sticking a string array on their class and expecting EF to just "do the right thing" when mapping it, without even being aware of it being serialized as CSV. One of the values contains the delimiter character, and so when the column is read back, the user gets silent data corruption. That's a classic "pit of failure" with potentially really bad consequences; I'd never want that to be a default behavior.
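
A three-line illustration of that failure mode, with no EF involved:

```csharp
var tags = new[] { "red", "blue, green" };  // one value happens to contain the delimiter
var stored = string.Join(",", tags);        // "red,blue, green"
var readBack = stored.Split(',');           // ["red", "blue", " green"] - three items now: silent data corruption
```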

To summarize my view so far... Although JSON seems like a very reasonable default to me, I'd probably want us to avoid it as an implicit default for trimming reasons. Doing a CSV mapping by default may be acceptable for non-strings, but for strings I'd require users to explicitly specify what mapping strategy is desired (and possibly allow CSV, but require explicit specification of the delimiter).

@roji roji modified the milestone: Backlog Dec 18, 2022
@roji
Member

roji commented Dec 18, 2022

One last note (thanks @ajcvickers): the actual character overhead of a JSON int array ([1,2,3]) is just 2 characters over a CSV int array (1,2,3). For strings, once again, I don't think we'd do CSV by default because of the delimiter problem. That maybe leaves other types which require quoting (e.g. DateTime?), but the char saving there is also very unlikely to be compelling. Basically there simply doesn't seem to be a great reason to prefer CSV.

@marchy

marchy commented Dec 18, 2022

> One last note (thanks @ajcvickers): the actual character overhead of a JSON int array ([1,2,3]) is just 2 characters over a CSV int array (1,2,3). For strings, once again, I don't think we'd do CSV by default because of the delimiter problem. That maybe leaves other types which require quoting (e.g. DateTime?), but the char saving there is also very unlikely to be compelling. Basically there simply doesn't seem to be a great reason to prefer CSV.

Heh, if it were only 2 characters, this push-back wouldn't have had much of an argument.

The general case here to support – as I included with screenshots above – is sets of structured tags, modelled as enums in OO, and encoded as strings in the DB. These require quotes in JSON, thus cow,dog,cat becomes ["cow","dog","cat"].

This means the overhead is 2 chars + 2 chars * the size of the collection. That ISN'T a trivial amount of bloat in a database, either from a performance perspective (both storage and transfer over the precious wire) or from a developer-experience perspective for somebody viewing values in a DB visualizer tool such as Azure Data Studio.

cow,dog,cat
VS
["cow","dog","cat"]

Not remotely the same. I would never want my DB/ORM to arbitrarily use such an inefficient encoding mechanism unless the complexity of the data structure warranted it. Given that these are, by definition, primitive types we are discussing – JSON is overkill, and I would opt out of it each and every time if it were the default.

Maybe a good argument for setting this up as a model building convention that can be configured!
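
A rough sketch of that convention idea using what exists today: pre-convention model configuration (EF Core 6+) applying a CSV converter to every string[] property. The converter type here is made up for the example, it inherits the delimiter/escaping caveats discussed above, and a ValueComparer would also be wanted for change tracking:

```csharp
using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage.ValueConversion;

// Hypothetical converter: string[] <-> comma-separated text.
public class CsvStringArrayConverter : ValueConverter<string[], string>
{
    public CsvStringArrayConverter()
        : base(
            v => string.Join(",", v),
            v => v.Split(',', StringSplitOptions.RemoveEmptyEntries))
    {
    }
}

public class SampleContext : DbContext
{
    // Applies the converter to every string[] property in the model by convention.
    protected override void ConfigureConventions(ModelConfigurationBuilder configurationBuilder)
        => configurationBuilder
            .Properties<string[]>()
            .HaveConversion<CsvStringArrayConverter>();
}
```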

roji added a commit to roji/efcore that referenced this issue Apr 14, 2023
roji added a commit to roji/efcore that referenced this issue Apr 18, 2023
roji added a commit to roji/efcore that referenced this issue Apr 18, 2023
roji added a commit to roji/efcore that referenced this issue Apr 19, 2023
@roji roji mentioned this issue Apr 19, 2023
roji added a commit to roji/efcore that referenced this issue Apr 20, 2023
roji added a commit to roji/efcore that referenced this issue Apr 20, 2023
roji added a commit to roji/efcore that referenced this issue Apr 20, 2023
roji added a commit to roji/efcore that referenced this issue Apr 20, 2023
roji added a commit to roji/efcore that referenced this issue Apr 20, 2023
roji added a commit to roji/efcore that referenced this issue Apr 21, 2023
roji added a commit to roji/efcore that referenced this issue Apr 26, 2023
roji added a commit that referenced this issue Apr 26, 2023
@roji roji modified the milestones: Backlog, 8.0.0 Apr 26, 2023
roji added a commit to roji/efcore that referenced this issue Apr 27, 2023
@ajcvickers ajcvickers modified the milestones: 8.0.0, 8.0.0-preview4 Apr 27, 2023
@ajcvickers ajcvickers added the closed-fixed label Apr 27, 2023
roji added a commit that referenced this issue Apr 27, 2023
@ajcvickers ajcvickers modified the milestones: 8.0.0-preview4, 8.0.0 Nov 14, 2023