
SQLServer - fix unicode support for text type - nvarchar(max) instead of varchar(max) #6421

Open - wants to merge 4 commits into base: 4.1.x
Conversation

@michalbundyra (Member) commented Jun 4, 2024

| Q | A |
| --- | --- |
| Type | bug |
| Fixed issues | #2323 |

Summary

A very old bug in the SQLServer platform preventing the use of unicode characters with the "text" type.

Background

Here are previously reported issues and attempts to fix it:

Previously it was unfortunately not classified as a bug - see #5237 (comment) -
but let me shed a bit more light here on why it actually IS a bug.

First of all in SQLServer we have types like:

  • varchar (variable length)
  • char (fixed length)
  • text (deprecated in favour of varchar(max))

But these types do not support UTF-8, so there are the following unicode counterparts (see the sketch after this list):

  • nvarchar (variable length)
  • nchar (fixed length)
  • ntext (deprecated in favour of nvarchar(max))
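To make the difference concrete, here is a minimal T-SQL sketch (table and column names are hypothetical) of a database with a classic non-UTF-8 collation, where unicode data only survives in the n-prefixed types and with the N'' literal prefix:

-- Hypothetical table on a database with a classic (non-UTF-8) collation
CREATE TABLE unicode_demo (
	plain_text   VARCHAR(50)  NOT NULL, -- stored in the database code page, no full unicode coverage
	unicode_text NVARCHAR(50) NOT NULL  -- stored as UTF-16, any unicode character
)

-- The N'' prefix marks a unicode (nvarchar) literal; without it the value is
-- converted through the database code page and characters outside it are lost
INSERT INTO unicode_demo (plain_text, unicode_text)
VALUES ('Zażółć', N'Zażółć')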

At some point, when doctrine evolved the SQLServer platform from MSSQL, this change was made, but not in all places - see 8b29ffe:

[image: excerpt of the diff from commit 8b29ffe]

So it was recognised that we are supposed to use nvarchar in some places, yet ntext was not adopted in the others.
And it has simply stayed that way ever since.

Later it was updated to use varchar(max) instead of text (as text became deprecated): #451

Current situation

Now, on version 4.0.x, we still have no way to use the text field with SQLServer for UTF-8 characters.

The other string fields already map to the unicode n(var)char types,

and there is no easy way to work around this other than overriding the text type to support length=-1 at the type level (so that migrations do not try to create nvarchar(-1) but use nvarchar(max); -1 is reported back from the schema when the max length is used, and this is the same behaviour as for the varchar field).
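For illustration, a minimal sketch of the kind of DDL the SQLServer platform currently emits for a doctrine string column versus a text column (column names and length are hypothetical) - note the mix of unicode and non-unicode types:

-- Current, inconsistent mapping (hypothetical column names)
CREATE TABLE article (
	title NVARCHAR(255) NOT NULL, -- doctrine "string" -> unicode nvarchar
	body  VARCHAR(MAX)  NOT NULL  -- doctrine "text"   -> non-unicode varchar(max)
)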

Additional considerations

While it would solve the current issue, it's not a perfect solution.
It might even be considered a BC break by some, especially when taking into account the storage differences between the varchar and nvarchar types.

Not everyone needs to store unicode characters, but I haven't seen any issue reporting that "(var)char should be used instead of n(var)char".

If we want to provide full flexibility, we should be able to choose the exact type we want - either unicode or non-unicode - BUT the default behaviour should at least be consistent and not a mix of the two.
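For comparison with the sketch above, this is roughly the DDL this pull request would produce once the text type maps to nvarchar(max) as well (column names are again hypothetical):

-- Consistent mapping proposed here (hypothetical column names)
CREATE TABLE article (
	title NVARCHAR(255) NOT NULL, -- doctrine "string" -> nvarchar, unchanged
	body  NVARCHAR(MAX) NOT NULL  -- doctrine "text"   -> nvarchar(max) with this PR
)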

@michalbundyra michalbundyra changed the title [WIP] SQLServer - nvarchar(max) instead of varchar(max) [WIP] SQLServer - support unicode for text type - nvarchar(max) instead of varchar(max) Jun 5, 2024
@michalbundyra michalbundyra changed the title [WIP] SQLServer - support unicode for text type - nvarchar(max) instead of varchar(max) SQLServer - fix unicode support for text type - nvarchar(max) instead of varchar(max) Jun 5, 2024
@michalbundyra michalbundyra marked this pull request as ready for review June 5, 2024 08:06
@fabiang commented Jun 5, 2024

I initially thought that VARCHAR/CHAR were deprecated in SQLServer, but reading the documentation again, this is far from true:

> Starting with SQL Server 2019 (15.x), when a UTF-8 enabled collation is used, these data types store the full range of Unicode character data and use the UTF-8 character encoding.

Only TEXT, NTEXT and IMAGE are deprecated and will be removed.

I've created a small test script:

-- Database with a classic, non-UTF-8 collation
CREATE DATABASE test2 COLLATE Latin1_General_100_BIN2
GO
USE test2
CREATE TABLE collation_test (
	id INT NOT NULL IDENTITY,
	name1 VARCHAR(100) NOT NULL,
	name2 NVARCHAR(100) NOT NULL
)

INSERT INTO collation_test (name1, name2) VALUES (
	'UmlautsÄÜÖß',
	'UmlautsÄÜÖß'
)

-- LEN counts characters, DATALENGTH counts bytes
SELECT LEN(ct.name1), LEN(ct.name2), DATALENGTH(ct.name1), DATALENGTH(ct.name2) FROM collation_test ct
GO

-- Database with a UTF-8 enabled collation (available since SQL Server 2019)
CREATE DATABASE test COLLATE Latin1_General_100_BIN2_UTF8
GO
USE test
CREATE TABLE collation_test (
	id INT NOT NULL IDENTITY,
	name1 VARCHAR(100) NOT NULL,
	name2 NVARCHAR(100) NOT NULL
)

INSERT INTO collation_test (name1, name2) VALUES (
	'UmlautsÄÜÖß',
	'UmlautsÄÜÖß'
)

SELECT LEN(ct.name1), LEN(ct.name2), DATALENGTH(ct.name1), DATALENGTH(ct.name2) FROM collation_test ct
GO

Result:

test2 (Latin1_General_100_BIN2):

| LEN(name1) | LEN(name2) | DATALENGTH(name1) | DATALENGTH(name2) |
|------------|------------|-------------------|-------------------|
| 11 | 11 | 11 | 22 |

test (Latin1_General_100_BIN2_UTF8):

| LEN(name1) | LEN(name2) | DATALENGTH(name1) | DATALENGTH(name2) |
|------------|------------|-------------------|-------------------|
| 11 | 11 | 15 | 22 |
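Reading these numbers: with the UTF-8 enabled collation, the VARCHAR column stores the umlauts as multi-byte UTF-8 (DATALENGTH 15 instead of 11) while LEN stays at 11, so unicode round-trips through plain VARCHAR. As a side note, a minimal sketch (assuming SQL Server 2019+; table name hypothetical): the UTF-8 collation can also be set per column rather than per database:

-- Hypothetical variant: enable UTF-8 for a single column only (SQL Server 2019+)
CREATE TABLE collation_test_utf8 (
	id INT NOT NULL IDENTITY,
	name1 VARCHAR(100) COLLATE Latin1_General_100_BIN2_UTF8 NOT NULL
)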

> But these types do not support UTF-8, so there are the following unicode counterparts:

So even though I have followed the initial issue #2323 for years and use NVARCHAR() everywhere, I think this shouldn't be merged, as it is based on outdated information.

@michalbundyra (Member, Author)

@fabiang thanks for your comment.

Just to summarise:

  • you think it should not be merged,
  • at the same time you use nvarchar fields yourself,
  • the current inconsistency is OK? (the doctrine text type is varchar(max), but string is n(var)char(length))

Is that right?

I am not sure I see your point - if varchar can now store unicode (15.x+ with the correct db collation), why do you still use nvarchar?
Wouldn't it be better to update doctrine/dbal to consistently use just varchar for SQLServer and point users to the correct collation on the db if they want to store unicode?

And what about users who are stuck (for whatever reason) on a version prior to 15.x?

@fabiang commented Jun 6, 2024

I still use NVARCHAR because I had outdated or wrong information. As of SQL Server 2019 I'll now use VARCHAR with the correct collation set. Therefore I think doctrine/dbal should only use VARCHAR(length)/VARCHAR(MAX) everywhere now. And imho SQLServer < 15.x support should be dropped, instead of having a BC break.

@derrabus derrabus changed the base branch from 4.0.x to 4.1.x August 14, 2024 10:51