Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommend a better charset by default #5277

Merged
merged 1 commit into from
Feb 16, 2022

Conversation

greg0ire
Copy link
Member

@greg0ire greg0ire commented Feb 15, 2022

In MySQL, utf8 is an alias for utf8mb3, which is not really utf8 since
it only uses 3 bytes.
As a result utf8 has been deprecated and should not be used.
See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html

Disclaimer: I am by no means a MySQL specialist, and find it a bit surprising that this hasn't been changed for such a long time (it was introduced in 59e17b5 , 6 years ago).

Note that from my understanding, this will not affect the charset of new tables.

EDIT: it looks like I understood wrong:

if (! isset($params['defaultTableOptions']['charset']) && isset($params['charset'])) {
$params['defaultTableOptions']['charset'] = $params['charset'];
}

Introduced in #3003

@morozov
Copy link
Member

morozov commented Feb 15, 2022

Regarding the documentation, why does the DBAL even have to recommend one specific charset? Since it's just an example, instead of using something that may look like a recommendation, could it use something exotic to make sure it's not used as is? Alternatively, we could just remove the "charset" parameter from the URL. The documented value is specific to MySQL and specific MySQL versions while there are other supported drivers and connection parameters.

Regarding the code, what does this code do? Why does it use the connection parameter in the schema definition?

@greg0ire
Copy link
Member Author

Regarding the code, what does this code do? Why does it use the connection parameter in the schema definition?

I'm wondering that too. #3003 does not come with an explanation besides

If the database is utf8mb4, tables should be utf8mb4 by default.

or with documentation, but I'll concede it would make sense to default to consistent charsets.

@greg0ire
Copy link
Member Author

I switched to an invalid charset, that way users will be forced to read the docs to pick the charset they think is correct. Picking an exotic charset might result in people just using it because they copy/paste without understanding.

@morozov
Copy link
Member

morozov commented Feb 15, 2022

I switched to an invalid charset [...]

That's odd because up until now, the URL described a realistic setup while now it doesn't. The bar doesn't look better than utf8 to me.

Picking an exotic charset might result in people just using it because they copy/paste without understanding.

Is it the problem we're trying to address there? The bar is prone to this error as well.

Personally, I wouldn't make this change just for the sake of a change. The documentation illustrates the use of connection parameters in the URL from the syntax perspective. It doesn't describe all supported parameters, their support by the drivers, and the recommended values. It's already good enough.

@derrabus
Copy link
Member

I switched to an invalid charset [...]

That's odd because up until now, the URL described a realistic setup while now it doesn't. The bar doesn't look better than utf8 to me.

In a realistic setup, please let's not use a deprecated charset setting with surprising behavior. Let's use utf8mb4 in this example. Even if someone C&Ps the URL from the documentation, they at least copied a sane configuration.

@morozov
Copy link
Member

morozov commented Feb 15, 2022

Then I would remove the "with the charset set to UTF-8". Otherwise, it makes the DBAL provide a solution to the problem that should be solved using the MySQL documentation.

@derrabus
Copy link
Member

derrabus commented Feb 15, 2022

Then I would remove the "with the charset set to UTF-8".

Let's rephrase, to make it clear we're not talking about the concept of character sets but about passing an option as-is to the driver:

For example, to connect to a "foo" MySQL DB using the ``pdo_mysql``
driver on localhost port 4486 with the "charset" option set to "utf8mb4",
you would use the following URL::

In MySQL, utf8 is an alias for utf8mb3, which is not really utf8 since
it only uses 3 bytes.
As a result utf8 has been deprecated and should not be used.
See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html
@greg0ire greg0ire merged commit 68995c9 into doctrine:3.3.x Feb 16, 2022
@derrabus derrabus added this to the 3.3.3 milestone Feb 16, 2022
@greg0ire greg0ire deleted the recommend-better-charset branch February 16, 2022 09:19
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants