Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change MySQL defaults from broken utf8 to fixed utf8mb4 #851

Open
wants to merge 3 commits into
base: master
from

Conversation

@DHager
Copy link
Contributor

commented May 4, 2015

This is a conservative echo of #317 .

Essentially MySQL's uft8 character set is broken and does not support full UTF-8, and a better alternative, utf8mb4 has existed for about five years now. When a 4-byte UTF-8 character comes in for a utf8 table, by default MySQL will truncate the string and log a warning.

Insofar as Doctrine has any default configuration values for MySQL, I think utf8mb4 is a better, safer choice. If anybody is running MySQL <5.5.3 and has problems, they should have a fairly easy time figuring out what's going wrong due to how distinctive the string is. (The other way around, searching for utf8, is not.)

@doctrinebot

This comment has been minimized.

Copy link

commented May 4, 2015

Hello,

thank you for creating this pull request. I have automatically opened an issue
on our Jira Bug Tracker for you. See the issue link:

http://www.doctrine-project.org/jira/browse/DBAL-1224

We use Jira to track the state of pull requests and the versions they got
included in.

@Ocramius

This comment has been minimized.

Copy link
Member

commented May 4, 2015

This seems reasonable to me, but only if an example string in a functional test is also being provided (in order to prove the validity of the change), especially around the data truncation you reported in the description

@fprochazka

This comment has been minimized.

Copy link
Contributor

commented May 5, 2015

Doctrine is already parsing the database version in order to create Platform objects. I would preffer a condition for this, like Nette did nette/http#28 + nette/database@7988663

@DHager

This comment has been minimized.

Copy link
Contributor Author

commented May 5, 2015

@Ocramius: I don't think a test is worthwhile in this case, because there's no new behavior of Doctrine to test, nor any particular way to "solve" or react the issue in Doctrine. (Yet?) We'd just be writing tests to assert how third-party non-PHP code behaves.

Here's a simple way to validate the issue, with thanks to @mathiasbynens for the example problem-character:

-- Create table using utf8 with two columns, one problematic and one OK
CREATE TABLE `trunctest` (
    `utf8text` VARCHAR(255) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci' NOT NULL,
    `mb4text` VARCHAR(255) CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_general_ci' NOT NULL
);

-- Hex is equivalent to 'foo𝌆bar' and makes things simpler 
SELECT UNHEX('666F6FF09D8C86626172') INTO @astralstring;

-- Will generate a warning, but still insert "foo" into the utf8 field
INSERT INTO trunctest (`utf8text`,`mb4text`) VALUE (@astralstring,@astralstring);

-- Now the two fields don't match up!
SELECT utf8text, mb4text, (char_length(utf8text) = char_length(mb4text)) as success FROM trunctest;

-- Cleanup
DROP TABLE `trunctest`;

Turning on MySQLs "strict" mode should promote the warning to an error, but that's not the default and people would still need to go through the effort of changing their tables if they need to support the problem-characters.

@DHager

This comment has been minimized.

Copy link
Contributor Author

commented May 5, 2015

@fprochazka: I notice there's a MySQL57Platform class, do you mean creating a new one like MySQL553Platform that "layers on" the use of utf8mb4?

Alternately, it might be possible to initialize the Platform to a charset/collation default based on what we can glean from the database connection (i.e. with select @@character_set_server, @@collation_server;) ... But that introduces some new risk. For example, I bet a lot of existing installations are still set to latin1/latin1_swedish_ci, and having that pop in the schema tool is going to be a nasty surprise, since the new value is not a superset of what they used to be using.

@Ocramius

This comment has been minimized.

Copy link
Member

commented May 5, 2015

I don't think a test is worthwhile in this case

A change is only justified by a test here, IMO: we show why this default is
being changed and the test should fail (and decisions should be taken) if
the change is reverted.

You are not really testing the functionality, you are just hardening
support by preventing regressions ;-)

Marco Pivetta

http://twitter.com/Ocramius

http://ocramius.github.com/

On 5 May 2015 at 03:56, Darien notifications@github.com wrote:

@fprochazka https://github.com/fprochazka: I notice there's a
MySQL57Platform class, do you mean creating a new one like
MySQL553Platform that "layers on" the use of utf8mb4?


Reply to this email directly or view it on GitHub
#851 (comment).

@DHager

This comment has been minimized.

Copy link
Contributor Author

commented May 5, 2015

@fprochazka: I've got a side-branch that uses a MySQL553Platform where each platform can override the default collation/charset, but unfortunately it causes a bunch of problems with the AbstractMySQLPlatformTestCase, because it assumes all MySQL platforms will always use one and only one collation and charset between them.

@DHager

This comment has been minimized.

Copy link
Contributor Author

commented May 5, 2015

Ah, yes, now the build-server is telling me something I should've remembered from my own work: utf8mb4 introduces a new quirk, where varchar(255) cannot be indexed because it's (potentially) too long for InnoDB.

It seems like there are a few different choices, turning on MySQL-specific behavior:

  1. Hardcode the default charset/collation, keeping it as utf8, with the lurking incompatibility issue compared the UTF8 used by the rest of the tech-stack.
  2. Hardcode the default charset/collation as utf8mb4, and make people using varchar(255) indexes either specify a charset in the options for that field, or specify a lower length.
  3. Create a system that inspects the database-state on startup to determine which to use, which may bite users who never changed their settings from latin1 and were depending on Doctrine to quietly always override it for them.
  4. Make it possible for users to specify--through PHP code--the database-wide charset/collation defaults.
@DHager

This comment has been minimized.

Copy link
Contributor Author

commented May 5, 2015

@Ocramius, @fprochazka : At least for my immediate use-case, I think I've found a better solution with doctrine/DoctrineBundle#420 instead.

@DHager DHager closed this May 5, 2015

@Ocramius Ocramius reopened this May 6, 2015

@Ocramius

This comment has been minimized.

Copy link
Member

commented May 6, 2015

Reopening as it is worth discussing the upgrade.

@MisatoTremor

This comment has been minimized.

Copy link
Contributor

commented May 21, 2015

As a follow up on the post about varchar(255):

This is because the maximum size of an InnoDB index is 767 bytes. Thus in 3 byte utf8 a varchar type field can be at max 255 chars long (255 * 3 -> 765), while in 4 byte utf8 it can only be 191 chars long (191 * 4 -> 764).

Starting with MySQL 5.6.3 you can circumvent this by configuring it with the setting innodb_large_prefix=ON (defaults is OFF) which itself needs innodb_file_format=Barracuda (default is Antelope) AND creating the table with the option ROW_FORMAT set to either DYNAMIC or COMPRESSED. The maximum index size will then be 3072 bytes, so varchar can be longer again.
Starting with MyQL 5.7.7 however innodb_file_format=Barracuda and innodb_large_prefix=ON will be the default values and even be removed as configurable options in a later release.

Maybe these things should be considered to build a stable solution for this.
I think creating a platform classes for at least 5.5.3 and 5.7.7 should be suitable.

What do you think @Ocramius?

@deeky666

This comment has been minimized.

Copy link
Member

commented Jan 9, 2016

This one is rather tricky and considering the amount of risks I doubt we can easily change this in 2.x. Also I wonder IF we changed it somehow, how we can keep BC and consistency here. When users would start to upgrade they'd have new tables with charset utf8mb4 and old ones being stuck at utf8. So manual upgrade would be required here then. IMO the best option for Doctrine would be to not define a default charset and collation at all and let the user or database server decide about it. This might surely raise some other issues (like the index problem discussed) if a particular charset is used but the can of worms is open already anyways as you can define the default table charset and collation via table options.

@Ocramius

This comment has been minimized.

Copy link
Member

commented Jan 9, 2016

IMO the best option for Doctrine would be to not define a default charset and collation at all and let the user or database server decide about it.

Disagree with this, according to history (remember latin1?)

@deeky666

This comment has been minimized.

Copy link
Member

commented Jan 9, 2016

@Ocramius yeah what about it? Is it wrong to use latin1? For example in my company we are dealing with german customers only and everything is bound to latin1 from the customers side. We are sometimes having a hard job converting from latin1 to utf8 (our application and database default) forth and back. Also we could gain performance and disk space when using latin1 instead of utf8 not to mention sorting/comparing issues latin1_german2_ci is what we actually need for germanized applications.

@deeky666

This comment has been minimized.

Copy link
Member

commented Jan 9, 2016

So what can we do in 2.x go get at least an intermediate solution? I tend to prefer the idea of introducing a MySql553Platform, making the default charset depend on the platform used (utf8mb4 for MySQL > 5.5.3) and have a decent upgrade instruction. As a change of the default charset only affects newly created tables where a custom charset is not defined by the user, this could maybe acceptable?

@realityking

This comment has been minimized.

Copy link

commented Jan 9, 2016

@deeky666 Even if all your customers in Germany, they might have a name that doesn't fit into latin1.

Storing non-legacy data in anything but Unicode really shouldn't be encouraged.

@mcfedr

This comment has been minimized.

Copy link

commented Feb 15, 2016

I am very much in favour of this, there is small backwards compatibility issue that is worth talking about.

The default length of a string type is 255 in Doctrine, which fits in the maximum key length for InnoDB. If the default charset is going to be utf8mb the max key length is 191 characters.

This will cause the generated index SQL for many string columns to fail as it doesn't include key length.

@MisatoTremor

This comment has been minimized.

Copy link
Contributor

commented Feb 15, 2016

@mcfedr See #851 (comment) for details about this.

@ste93cry

This comment has been minimized.

Copy link

commented May 6, 2016

any news on this?

@ThaDafinser

This comment has been minimized.

Copy link

commented Jul 21, 2016

Just also wanted to switch "easily", but got trapped by this issue.

Since there is already a Mysql57Platform...what about changing the default only there?(i know it's sadly not that easy for the mass)
https://github.com/doctrine/dbal/blob/master/lib/Doctrine/DBAL/Platforms/MySQL57Platform.php

//EDIT
For now i created this workaround:

Just register your custom platform in the config
'platform' => new \YOUR_SPACE\Doctrine\DBAL\Platforms\MySQL57PlatformCustom()

<?php
namespace YOUR_SPACE\Doctrine\DBAL\Platforms;

use Doctrine\DBAL\Platforms\MySQL57Platform;

/**
 * Custom Mysql57 Platform, to use utf8mb4 per default
 */
class MySQL57PlatformCustom extends MySQL57Platform
{

    protected function _getCreateTableSQL($tableName, array $columns, array $options = array())
    {
        // Charset
        if (! isset($options['charset'])) {
            $options['charset'] = 'utf8mb4';
        }

        // Collate
        if (! isset($options['collate'])) {
            $options['collate'] = 'utf8mb4_unicode_ci';
        }

        return parent::_getCreateTableSQL($tableName, $columns, $options);
    }
}

j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 8, 2016

Fix emoji insertion in MySQL
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851

j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 11, 2016

Fix emoji insertion in MySQL
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851

j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 11, 2016

Fix emoji insertion in MySQL
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851

j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 22, 2016

Fix emoji insertion in MySQL
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851
@mnapoli

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2017

@ThaDafinser thank you for that piece of code, that's perfect!

And for those using MariaDB you will need a version > 10.2.

@Ocramius

This comment has been minimized.

Copy link
Member

commented Jul 2, 2017

FWIW, this would be an awesome addition if it wasn't for mysql's awful index size limits :-\

If we provide it out of the box, then the varchar size should go down by default.

@mcfedr

This comment has been minimized.

Copy link

commented Jul 2, 2017

@Ocramius Or generate index declarations with size specified, although that might be another can of worms

@mcfedr

This comment has been minimized.

Copy link

commented Jul 17, 2017

Also adding to this conversation the 'correct' collate to use with utf8mb4 is utf8mb4_unicode_520_ci

@afoeder

This comment has been minimized.

Copy link
Contributor

commented Aug 9, 2017

@mcfedr do you have a source for the utf8mb4_unicode_520_ci statement?

@mcfedr

This comment has been minimized.

Copy link

commented Aug 9, 2017

Unfortunately there isn't good information on the MySQL docs, but the best source I have found is the Wordpress bug tracker: https://core.trac.wordpress.org/ticket/32105#comment:3

Wordpress recently changed to using utf8mb4_unicode_520_ci by default. The key reason

The problem stems from MySQL's collation behaviour - it treats all Unicode Supplementary Characters (which emoji fall under) as being equivalent. It's not until MySQL 5.6, wich the addition of the utf8mb4_unicode_520_ci collation that this changes.

@ghost

This comment has been minimized.

Copy link

commented Jan 25, 2018

What's the status on this 'defaults' merge, and utf8mb4 support in general?

Currently, in a new Symfony4 project

bin/console doctrine:phpcr:repository:init

returns

	Successfully registered system node types.
	Executing initializer: CmfRoutingBundle

	In ObjectManager.php line 859:
	  Error inside the transport layer: An exception occurred while executing 'SELECT id FROM phpcr_nodes WHERE path COLLATE utf8mb4_bin = ? AND workspace_name = ?' with params ["\/", "default"]:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

	In AbstractMySQLDriver.php line 121:
	  An exception occurred while executing 'SELECT id FROM phpcr_nodes WHERE path COLLATE utf8mb4_bin = ? AND workspace_name = ?' with params ["\/", "default"]:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

	In PDOStatement.php line 107:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

	In PDOStatement.php line 105:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

also noting

vendor/doctrine/dbal/UPGRADE.md

	## Creating MySQL Tables now defaults to UTF-8

	  If you are creating a new MySQL Table through the Doctrine API, charset/collate are now set to 'utf8'/'utf8_unicode_ci' by default. Previously the MySQL server defaults were used.

the current config includes,

config/packages/doctrine.yaml 
	...
	doctrine:
	    dbal:
	        default_connection: dev
	        connections:
	            dev:
	                ...
	                driver: 'pdo_mysql'
	                server_version: '5.7'
	                charset: utf8mb4
	                default_table_options:
	                    charset: utf8mb4
	                    collate: utf8mb4_unicode_ci
	            phpcr_dev:
	                ...
	                driver: 'pdo_mysql'
	                server_version: '5.7'
	                charset: utf8mb4
	                default_table_options:
	                    charset: utf8mb4
	                    collate: utf8mb4_unicode_ci                                                                                                                                              
	...

	doctrine_phpcr:
	    session:
	        backend:
	            type: doctrinedbal
	            connection: phpcr_dev
	...
@mcfedr

This comment has been minimized.

Copy link

commented Jan 25, 2018

@ghost

This comment has been minimized.

Copy link

commented Jan 25, 2018

@mcfedr
mixing how/where?
both DBs have

	                charset: utf8mb4
	                default_table_options:
	                    charset: utf8mb4
	                    collate: utf8mb4_unicode_ci

consistent with MariaDB defaults ...

@mcfedr

This comment has been minimized.

Copy link

commented Jan 25, 2018

@ghost

This comment has been minimized.

Copy link

commented Jan 25, 2018

@mcfedr
These are 'clean installed' DBs ... so far nothing other than install & (attempts to) initialize, etc.

EDIT: I think my issue better belongs here:
doctrine/DoctrinePHPCRBundle#310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.