SQL Server UTF8 collations #25798

egbertn · 2021-08-28T21:28:56Z

This does not behave as documented and expected.

If I have a entity for which I use Fluent API to define properties...
The SQL field (in my example) is varchar(255) using collation Latin1_General_100_BIN2_UTF8
in EF defined as
p.Property(prop => prop.Param).IsUnicode(false).UseCollation("Latin1_General_100_BIN2_UTF8").HasMaxLength(255);

However, unicode chars get's corrupted anyway on SQL both on Azure as on 2019 express.

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: b27be469-3758-941c-0c5d-3a54b1545493
Version Independent ID: d89924aa-668e-229c-8170-5731c3fff4c1
Content: Collations and case sensitivity - EF Core
Content Source: entity-framework/core/miscellaneous/collations-and-case-sensitivity.md
Product: entity-framework
Technology: entity-framework-core
GitHub Login: @roji
Microsoft Alias: avickers

The text was updated successfully, but these errors were encountered:

roji · 2021-08-28T23:28:36Z

Can you post a runnable code sample which shows the corruption occurring?

egbertn · 2021-08-29T14:59:51Z

Can you post a runnable code sample which shows the corruption occurring?

Sure,
there you go. Thanks for your commitment.

https://github.com/egbertn/EFCore-5-UTF-8-support

ajcvickers · 2021-08-31T12:14:07Z

Note for triage: this is because the DbType is set to AnsiString when IsUnicode is false. Below is a repro without EF Core. Changing the DbType to String makes it work.

Not using IsUnicode(false) makes the test pass, but creates an nvarchar(255) column, which apparently isn't desirable for SQL Server's squirrely UTF8 support.

static class Program
{
	static void Main()
	{
		using (var connection = new SqlConnection(@"Server=localhost;Database=master;Trusted_Connection=True"))
		{
			connection.Open();

			using (var command = connection.CreateCommand())
			{
				command.CommandText = @"DROP DATABASE [UTF8];";
				command.ExecuteNonQuery();
				
				command.CommandText = @"CREATE DATABASE [UTF8];";
				command.ExecuteNonQuery();
			}
			
			connection.Close();
		}

		using (var connection = new SqlConnection(@"Server=localhost;Database=UTF8;Trusted_Connection=True"))
		{
			connection.Open();

			using (var command = connection.CreateCommand())
			{
				command.CommandText = @"
      CREATE TABLE [MyTable] (
      [Id] bigint NOT NULL IDENTITY,
      [Param] varchar(255) COLLATE Latin1_General_100_BIN2_UTF8 NULL,
      CONSTRAINT [PK_MyTable] PRIMARY KEY ([Id]) );";

				command.ExecuteNonQuery();

				command.CommandText = @"INSERT INTO [MyTable] ([Param]) VALUES (@p0);";
				
				var parameter = command.CreateParameter();
				parameter.DbType = DbType.AnsiString;
				parameter.Size = 255;
				parameter.ParameterName = "p0";
				parameter.Value = "Как дела?";
				command.Parameters.Add(parameter);

				command.ExecuteNonQuery();

				command.CommandText = @"SELECT TOP(1) [m].[Id], [m].[Param] FROM [MyTable] AS [m] ORDER BY [m].[Id] DESC";
				using (var reader = command.ExecuteReader())
				{
					reader.Read();
					var value = reader.GetFieldValue<string>(reader.GetOrdinal("Param"));

					if (value != "Как дела?")
					{
						Console.WriteLine($"Wrote {parameter.Value}, read {value}.");
					}
				}
			}
			
			connection.Close();
		}
	}
}

ajcvickers · 2021-08-31T18:53:43Z

Notes from triage:

It's not clear how to fix this; it will likely require type mappings to understand collations, which is not trivial.
For now, the workaround is to use IsUnicode(true) (which is the default) and to edit the migration to generate a varchar column.

egbertn · 2021-09-01T07:27:10Z

Thanks @ajcvickers

Notes from triage:

* It's not clear how to fix this; it will likely require type mappings to understand collations, which is not trivial.

* For now, the workaround is to use `IsUnicode(true)` (which is the default) and to edit the migration to generate a `varchar` column.

Thanks, but isn't it a solution, to see if the collation name contains or ends with _utf8, if so, all string operations must be prepended with N like in N'как лела?'

roji · 2021-09-01T07:34:41Z

@egbertn that could be a fix, but it would "require type mappings to understand collations, which is not trivial" as @ajcvickers wrote above. It's unfortunately too late to go into something like this for 6.0.

roji · 2021-09-02T07:41:45Z

Another direction for a proper fix would be to recognize the UTF8 collation in SQL Server's migration generator, and based on that create a varchar column instead of nvarchar, even if Unicode is true. This would allow keeping collations out of type mapping, and keep them in migrations which is where they currently live. This would also mean users continue running with Unicode=true, which feels more correct (after all, UTF8 is unicode).

egbertn · 2021-09-02T18:27:16Z

Another direction for a proper fix would be to recognize the UTF8 collation in SQL Server's migration generator, and based on that create a varchar column instead of nvarchar, even if Unicode is true. This would allow keeping collations out of type mapping, and keep them in migrations which is where they currently live. This would also mean users continue running with Unicode=true, which feels more correct (after all, UTF8 is unicode).

👍👍

mburbea · 2021-09-15T15:39:00Z

Also there are definite performance disadvantages when using utf8 collation and nvarchar parameters, as the query is then non-sargable.

egbertn · 2021-09-15T15:46:12Z

Also there are definite performance disadvantages when using utf8 collation and nvarchar parameters, as the query is then non-sargable.

THat is off topic. We are dealing here with EF supporting just another collation type.

ajcvickers · 2021-10-21T09:56:26Z

See also #7172 when considering collations in the type mapping.

clement911 · 2022-03-06T22:24:58Z

We are hitting this issue as well and didn't realize our non ASCII characters were getting corrupted.

Is there a work around that doesn't involve manual editing migrations?

I tried the following but EF still sends varchar parameters.

pty.SetIsUnicode(true);
pty.SetColumnType("varchar");

clement911 · 2022-03-07T00:09:08Z

Looking into it further, UTF8 collations seem quite problematic.

My understanding is that UTF8 columns should be declared as varchar but it looks like unicode characters only remain intact if the parameter is of type nvarchar. This will incur a conversion by sql server which can impact performance and SARGability as @mburbea mentioned. IMO this is quite relevant and important to consider this issue.

So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither Microsoft.Data.SqlClient.SqlParameter nor sp_executesql allows passing the collation name of given parameters.

We were exciting to use UTF8 collations to lower IO and storage size but I think we'll stick to UTF16 for now...

egbertn · 2022-03-07T08:27:03Z

Looking into it further, UTF8 collations seem quite problematic.

My understanding is that UTF8 columns should be declared as varchar but it looks like unicode characters only remain intact if the parameter is of type nvarchar. This will incur a conversion by sql server which can impact performance and SARGability as @mburbea mentioned. IMO this is quite relevant and important to consider this issue.

So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither Microsoft.Data.SqlClient.SqlParameter nor sp_executesql allows passing the collation name of given parameters.

We were exciting to use UTF8 collations to lower IO and storage size but I think we'll stick to UTF16 for now...

As suggested in this issue, the current workaround, is to do as EF is supposed to work, declare(keep) it nvarchar create a migration, and modify the migration script to use varchar, but the collation, of course, must be set to utf8. I did it and saw no problems.

clement911 · 2022-03-07T10:09:57Z

I was hoping for a work that doesn't involve manually editing migration.

Also I don't think that #25798 (comment) is off topic. Using a nvarchar will cause a conversion to a varchar which will impact performance and SARGability.

roji · 2022-03-07T10:50:35Z

@clement911

So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither Microsoft.Data.SqlClient.SqlParameter nor sp_executesql allows passing the collation name of given parameters.

The collation isn't something that gets specified on a parameter, but rather on the column (or at the database level for all columns); see our docs for more info on this.

Aside from that, as @egbertn wrote above, a workaround exists but requiring editing the migration to change the type to varchar (doing something better is what this issue tracks). There's no reason to avoid editing the migration file - it's perfectly fine (and frequently recommended) to customize migration code after generating it, see our docs. I definitely wouldn't avoid UTF8 just because it requires a one-time edit to migration code.

clement911 · 2022-03-07T11:17:15Z

Right, I understand that collation are applied to columns but one can also specify a collation on a varchar literal with the COLLATE keyword.

Regarding editing migrations, the reason we avoid it is that we have to squash migrations every now and then because too many migrations causes very slow builds. Unfortunately since squashing migrations is a manual process that involves deleting existing migrations and creating a new one, any manual edits on migrations are lost. So, as a general practice, we attempt to apply all configurations via configuration code instead of custom migration code. I hope that makes sense.

roji · 2022-03-07T12:19:56Z

Right, I understand that collation are applied to columns but one can also specify a collation on a varchar literal with the COLLATE keyword.

You can do that with EF.Collate if that's what you want, but that typically causes indexes to not be used, since they use the column's collation. So I'd really recommend against this.

Regarding editing migrations [...]

Yeah, that does make sense, though I'd check exactly why it is that you need to squash frequently. The number one reason we see for this is users putting lots of data (or big data) in seeding, which is something we explicitly recommend against (seeding was not designed for that). Assuming you're not doing that, squashing should only be necessary quite rarely.

More generally, editing migrations really is absolutely necessary in various cases. The typical one is when you rename a property: EF typically cannot know whether you want a rename (conserving data) or a drop and a creation of a new column, and so assumes the latter. If a rename is intended, the migration needs to be edited to reflect that (see this in our docs).

So I'd advise not trying to avoid migration customization, since it's a normal, required part of the migration process. I'd rather concentrate on how to make sure customizations are recorded somewhere, so that they can be reapplied after squashing etc.

clement911 · 2022-03-07T22:11:31Z

Thanks @roji for your comments.
FYI we don't use seeding at all.

Beside having a lot of migrations, there are other reasons for us to avoid manually editing migrations.
Sometimes we split a context into multiple contexts, or merge multiple context into ones, or move tables from one context to another (our project is pretty big). When that happens, EF might generate new migrations to drop / create tables that were moved to a different context. But we don't actually run these migrations because we don't really want to drop or create tables (they already exist from another context). After we do such changes, we find it's cleaner to regenerate an initial migration (squash) for each context so that each developer can create a fresh DB on their environment.

I understand your comment regarding renamed properties and EF assuming it's a new property. We're ok editing the code in that case, because we know that if we squash the migration, a new table with the latest column name would be created anyway. So, I guess there are some type of manuals edits which don't really matter once migrations are squashed. On the other hand, modifying the migration script to use varchar instead of nvarchar is something that would be lost/overridden when squashing migrations. Having the DB schema be a simple reflection of the code is a nice property which gives us confidence to change our db contexts and not be worried about breaking things.

Sorry, we got off topic a bit with the original question.

roji · 2022-03-07T22:17:28Z

Sorry, we got off topic a bit with the original question.

No worries, it's an interesting discussion. I do have it on my plate to propose something better for 7.0 (which wouldn't require editing migrations), so this will hopefully go away soon.

clement911 · 2022-03-08T00:02:45Z

I'm looking forward to it 😀

roji · 2022-03-08T15:16:43Z

Design proposal:

tl;dr allow users to configure UTF8 by explicitly setting both the column type to char/varchar and IsUnicode to true:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Blog>()
        .Property(b => b.Name)
        .HasColumnType("varchar(max)")
        .IsUnicode(true)
        .UseCollation("LATIN1_GENERAL_100_CI_AS_SC_UTF8");
}

We can add a sugar method which does the above:

modelBuilder.Entity<Blog>()
    .Property(b => b.Name)
    .UseUTF8("LATIN1_GENERAL_100_CI_AS_SC_UTF8");

Notes:

Today, explicitly setting the column to char/varchar also sets DbType=AnsiString (note that Unicode still remains true in the type mapping - not ideal).
Explicitly setting to Unicode to true currently has no effect if the store type is set to char/varchar, i.e. DbType is still AnsiString.
We can allow the user to explicitly set varchar(max) and IsUnicode=true - this would opt into UTF8. The column type is exactly what it should be in migrations (and also in the query pipline etc.), and IsUnicode tells us to send DbType.String instead of DbType.AnsiString.
Note that this isn't enough: a collation is needed as well (and it has to be explicit). We can add model validation that for all UTF8 properties (char/varchar with Unicode=true), to check for a UTF8-compatible collation (ends with _UTF8).
Scaffolding: look into doing this reliably. The combination of a char/varchar property with a collation ending with UTF8 (including at the database level) should lead to the correct UTF8 property being scaffolded (i.e. with either UseUTF8 or IsUnicode(true)`)

Global model configuration

The default database collation can already be set via modelBuilder.UseCollation():

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.UseCollation("LATIN1_GENERAL_100_CI_AS_SC_UTF8");
}

All string properties can be configured to be UTF8 by default via pre-convention model configuration:

protected override void ConfigureConventions(ModelConfigurationBuilder configurationBuilder)
{
    configurationBuilder.DefaultTypeMapping<string>(b => b.HasColumnType("varchar(max)").IsUnicode(true));
}

We could also add a ConfigureUTF8() extension method to do the above.

ErikEJ · 2022-03-08T16:14:18Z

@roji is there support for this in db first scaffolding?

roji · 2022-03-08T17:25:13Z

I hope so - at least the "decomposed version" (i.e. with HasColumnType+IsUnicode+UseCollation), if not the nicer UseUTF8 version... I'll see when I get to implementing it...

roji · 2022-03-11T18:41:20Z

Design decisions:

We'll just do the basic feature (support HasColumnType("varchar(max)").IsUnicode(true))
No UseUTF8() sugar for now, scoped out to SQL Server: introduce UseUtf8() sugar to configure UTF8 properties #27624 based on user feedback.

Closes dotnet#25798

Closes #25798

roji self-assigned this Aug 29, 2021

ajcvickers transferred this issue from dotnet/EntityFramework.Docs Aug 31, 2021

ajcvickers added the customer-reported label Aug 31, 2021

ajcvickers added area-migrations area-type-mapping needs-design type-enhancement labels Aug 31, 2021

ajcvickers added this to the Backlog milestone Aug 31, 2021

roji added the area-sqlserver label Sep 2, 2021

roji changed the title ~~Collation and Entity Framework Core does not behave~~ Using SQL Server UTF8 collations Sep 2, 2021

roji mentioned this issue Sep 2, 2021

Document workaround for using UTF8 on SQL Server dotnet/EntityFramework.Docs#3413

Closed

roji added the consider-for-next-release label Oct 12, 2021

ajcvickers modified the milestones: Backlog, 7.0.0 Oct 27, 2021

ajcvickers removed the consider-for-next-release label Oct 27, 2021

roji mentioned this issue Mar 11, 2022

SQL Server: introduce UseUtf8() sugar to configure UTF8 properties #27624

Open

roji added a commit to roji/efcore that referenced this issue Mar 13, 2022

Support SQL Server UTF8 strings

fe312b1

Closes dotnet#25798

roji mentioned this issue Mar 13, 2022

Support SQL Server UTF8 strings #27634

Merged

roji added closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. and removed needs-design area-migrations labels Mar 13, 2022

roji closed this as completed in #27634 Mar 17, 2022

roji added a commit that referenced this issue Mar 17, 2022

Support SQL Server UTF8 strings (#27634)

587de30

Closes #25798

ajcvickers modified the milestones: 7.0.0, 7.0.0-preview3 Mar 31, 2022

valeriob mentioned this issue Apr 11, 2022

How to use Sql server UTF-8 collation with varchar data type dotnet/ef6#1983

Closed

ajcvickers changed the title ~~Using SQL Server UTF8 collations~~ SQL Server UTF8 collations Apr 18, 2022

ajcvickers mentioned this issue Apr 18, 2022

.NET Data Biweekly Updates (2022) #27185

Closed

roji mentioned this issue Jun 7, 2022

Document configuration of SQL Server UTF8 collations dotnet/EntityFramework.Docs#3885

Closed

ajcvickers modified the milestones: 7.0.0-preview3, 7.0.0 Nov 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQL Server UTF8 collations #25798

SQL Server UTF8 collations #25798

egbertn commented Aug 28, 2021

roji commented Aug 28, 2021

egbertn commented Aug 29, 2021

ajcvickers commented Aug 31, 2021

ajcvickers commented Aug 31, 2021

egbertn commented Sep 1, 2021

roji commented Sep 1, 2021

roji commented Sep 2, 2021 •

edited

Loading

egbertn commented Sep 2, 2021

mburbea commented Sep 15, 2021

egbertn commented Sep 15, 2021

ajcvickers commented Oct 21, 2021

clement911 commented Mar 6, 2022 •

edited

Loading

clement911 commented Mar 7, 2022

egbertn commented Mar 7, 2022

clement911 commented Mar 7, 2022

roji commented Mar 7, 2022

clement911 commented Mar 7, 2022

roji commented Mar 7, 2022

clement911 commented Mar 7, 2022

roji commented Mar 7, 2022

clement911 commented Mar 8, 2022

roji commented Mar 8, 2022 •

edited

Loading

ErikEJ commented Mar 8, 2022

roji commented Mar 8, 2022

roji commented Mar 11, 2022

SQL Server UTF8 collations #25798

SQL Server UTF8 collations #25798

Comments

egbertn commented Aug 28, 2021

Document Details

roji commented Aug 28, 2021

egbertn commented Aug 29, 2021

ajcvickers commented Aug 31, 2021

ajcvickers commented Aug 31, 2021

egbertn commented Sep 1, 2021

roji commented Sep 1, 2021

roji commented Sep 2, 2021 • edited Loading

egbertn commented Sep 2, 2021

mburbea commented Sep 15, 2021

egbertn commented Sep 15, 2021

ajcvickers commented Oct 21, 2021

clement911 commented Mar 6, 2022 • edited Loading

clement911 commented Mar 7, 2022

egbertn commented Mar 7, 2022

clement911 commented Mar 7, 2022

roji commented Mar 7, 2022

clement911 commented Mar 7, 2022

roji commented Mar 7, 2022

clement911 commented Mar 7, 2022

roji commented Mar 7, 2022

clement911 commented Mar 8, 2022

roji commented Mar 8, 2022 • edited Loading

Global model configuration

ErikEJ commented Mar 8, 2022

roji commented Mar 8, 2022

roji commented Mar 11, 2022

roji commented Sep 2, 2021 •

edited

Loading

clement911 commented Mar 6, 2022 •

edited

Loading

roji commented Mar 8, 2022 •

edited

Loading