Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF8 Encoded SQL Scripts in initdb #131

Closed
ghost opened this issue Jan 13, 2016 · 24 comments
Closed

UTF8 Encoded SQL Scripts in initdb #131

ghost opened this issue Jan 13, 2016 · 24 comments

Comments

@ghost
Copy link

ghost commented Jan 13, 2016

I recently completed a project that populated the database with UTF8 encoded text. My script was able to create the database using the correct character-set, but the text data could not be properly imported. Apparently, the initdb scripts are assumed to be ascii encoded, so I had to make a minor modification to docker-entrypoint.sh.

mysql=( mysql --protocol=socket --default_character_set=utf8 -uroot )

This change was not a big deal for me, but I am raising the question as to whether this would be a good default. I can't think of any way it breaks compatibility with existing scripts, and the new default would provide a real benefit to anyone pre-populating with UTF8 encoded data.

Of course, an even better solution may be to use a variable.

Please note, this is different from the discussion on the initial database's character-set (#14). This is a discussion about the encoding of the docker-entrypoint-initdb.d script files.

@GlenKPeterson
Copy link

I'm having a similar issue. I can source files in utf-8 and they work fine, but I can't use utf-8 when I type in SQL commands, even after setting:

set character_set_client='utf8';
set character_set_connection='utf8';
set character_set_database='utf8';
set character_set_results='utf8';
set character_set_server='utf8';
show variables like 'char%';

Based on some reading, I think the solution is right here:
https://github.com/docker-library/postgres/blob/8e867c8ba0fc8fd347e43ae53ddeba8e67242a53/9.5/Dockerfile#L23

The container operating system is not set, or not set to utf8. This line sets it. If that works and you're willing to merge it, I can try it out and submit a pull request.

@djui
Copy link

djui commented Sep 20, 2016

Any updates on this? I would prefer if I don't have to build my MariaDB container locally with a patched Dockerfile.

@tianon
Copy link
Member

tianon commented Sep 20, 2016

I'd recommend adding --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci to your container command, as noted in the image description under "Configuration without a cnf file" (https://github.com/docker-library/docs/blob/93eaf4f47c52c4b65d1cd31a5dcaa257848e5f13/mysql/README.md#configuration-without-a-cnf-file). That should set the default for the entire daemon.

See also #32 (comment) and docker-library/docs#502.

@djui
Copy link

djui commented Sep 20, 2016

This does not solve the problem.

With using the two flags, I get:

-- init.sql
CREATE TABLE `pages` (
  `title` varchar(75) NOT NULL,
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `pages` (`title`) VALUES
  ('Så fungerar det');
> select title from pages limit 1;
+--------------------+
| title              |
+--------------------+
| SÃ¥ fungerar det   |
+--------------------+
1 row in set (0.00 sec)

I believe the problem is not the default database character encoding setting, but as mentioned above, that the moment of initial import which happens from the shell towards the database imports the database with the wrong character encoding. As one can see I already set specifically for the table the utf8 character set.
I would like to rather see the same change as for Postgres: https://github.com/docker-library/postgres/blob/8e867c8ba0fc8fd347e43ae53ddeba8e67242a53/9.5/Dockerfile#L23

@GlenKPeterson
Copy link

GlenKPeterson commented Sep 20, 2016

@djui - nice simple test, though you have an extra comma after NULL,.

On Ubuntu 16.04 running his test I get

select title from pages;
+----------------+
| title          |
+----------------+
| S fungerar det |
+----------------+

The issue may be with the terminal in the docker container and not in MySQL directly because I don't seem to have problems if I source a sql file that contains non-ascii characters. It's not a docker issue because non-ascii characters work out of the box with the Postgres docker container. It's just something about how the MySQL docker container is (was?) set up.

@djui
Copy link

djui commented Sep 20, 2016

Correct, the issue is not MySQL, but the moment the terminal (shell) imports the SQL statements. One simple proof (in this case for MariaDB, but same for MySQL):

# docker-compose.yml
:
  db:
    #image: mariadb:latest
    build:
      context: .
      dockerfile: Dockerfile.mariadb
    ports:
      - '3306:3306'
    environment:
      - MYSQL_ROOT_PASSWORD=secret
      - MYSQL_DATABASE=foo
      - MYSQL_USER=bar
      - MYSQL_PASSWORD=baz
    volumes:
      - data:/var/lib/mysql
      - ./init.sql.gz:/docker-entrypoint-initdb.d/init.sql.gz
:
# Dockerfile.mariadb
FROM mariadb:latest

# make the "en_US.UTF-8" locale so ~postgres~ mysql/mariadb will be utf-8 enabled by default
RUN apt-get update && apt-get install -y locales && rm -rf /var/lib/apt/lists/* $
 && localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
ENV LANG en_US.utf8
-- init.sql
CREATE TABLE `pages` (
  `title` varchar(75) NOT NULL,
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `pages` (`title`) VALUES
  ('Så fungerar det');
MariaDB [foo]> select title from pages limit 1;
+------------------+
| title            |
+------------------+
| Så fungerar det  |
+------------------+
1 row in set (0.00 sec)

@tianon
Copy link
Member

tianon commented Sep 20, 2016

Can you try again adding -e LANG=C.UTF-8 to your docker run line and see if that's sufficient?

@djui
Copy link

djui commented Sep 20, 2016

I tried -e LANG=en_US.utf8 which didn't work. -e LANG=C.UTF-8 seems to work. That would be a good enough fix! Thanks.

@GlenKPeterson
Copy link

GlenKPeterson commented Sep 20, 2016

@tianon -e LANG=C.UTF-8 worked for me from the docker run command line! Yay! Thank you!

@kisPocok
Copy link

Here comes the no-brainer one liner fix for dummies like me. Place it to your Dockerfile.
ENV LANG=C.UTF-8

@tianon
Copy link
Member

tianon commented Dec 26, 2017

Closing, since this is solved! 👍

@tianon tianon closed this as completed Dec 26, 2017
@ghost
Copy link
Author

ghost commented Dec 27, 2017

Can't we have this mostly harmless setting built into the official image? I don't believe it will impact existing scripts since UTF-8 is ASCII compatible.

@ghost
Copy link
Author

ghost commented Dec 27, 2017

I will do a pull request during the New Years holiday if nobody beats me to it. I have to still build it and test it.

@tianon
Copy link
Member

tianon commented Dec 27, 2017

I guess it depends on what the impact of setting LANG would be on the default behavior of mysqld (since this issue is only regarding the client, which is going to apply regardless of which image is used for the client-side, and will be unaffected by the daemon itself including LANG).

If LANG does have a negative impact on mysqld (or even just changes the default behavior) then we might need to instead just document that the mysql client relies on LANG and that C.UTF-8 is a good value that's available/working out-of-the-box for folks who need UTF-8.

@ghost
Copy link
Author

ghost commented Dec 28, 2017

Alternatively, make the change that I noted at the top and have no potential side effects beyond the mysql command. Hmm.

@Ovsyanka
Copy link

Ovsyanka commented Sep 5, 2019

As I understand setting LANG to C.UTF-8 makes mysql client to use utf-8 by default. But I think that better option would be set up this option directly in the mysql config file alongside with the server options. Like this:

; /etc/mysql/conf.d/utf8.cnf
[mysqld]
init_connect=‘SET collation_connection = utf8_unicode_ci’
character-set-server = utf8
collation-server = utf8_unicode_ci

[client]
default-character-set = utf8

client -> default-character-set = utf8 makes the same thing but it is more obvious and transparent then relaying on the LANG env.

@pothitos
Copy link

Just in case someone is interested in writing a proper docker-compose.yml that supports Unicode in a MySQL-compatible database, they may use something like

services:
  db:
    image: mariadb:10.4.8
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
    environment:
      LANG: C.UTF-8
      MYSQL_ROOT_PASSWORD: root
    volumes:
      - ./db:/docker-entrypoint-initdb.d:ro

@takeyourhatoff
Copy link

takeyourhatoff commented Aug 26, 2020

Setting LANG=C.UTF-8 causes the mysql client to assume the utf8 encoding. This does not support characters outside the BMP, for which the utf8mb4 encoding must be used.

The easiest way I have found to support non BMP characters in the sql files in /docker-entrypoint-initdb.d/ is to prefix the files with SET NAMES utf8mb4;. There is then no need to alter any configuration files, or set any environment variables.

@plpetkov-tech
Copy link

For anyone lost in this thread having problems with cyrillic chars or special chars being changed to special non sense characters at init, using docker-compose:

The only thing you need to fix it is :
environment:
LANG: C.UTF-8
In your docker-compose.yml

This fixed my problem with init.sql in /docker-entrypoint-initdb.d with cyrillic texts

@Loongphy
Copy link

I follow your discuss , but mysql prints out like this.

mysql>  show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)

It doesn't work for database and server. So I add a configuaration in docker-compose.yml as the following.

command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci

The whole yaml file is:

version: "3"
services:
  mysql:
    container_name: mysql
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: root
      LANG: C.UTF-8
    volumes:
      - ./mysql:/var/lib/mysql
    ports:
      - "3306:3306"
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci

Here is the final output.

mysql>  show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)

@SharakPL
Copy link

SharakPL commented Apr 8, 2021

In case you also want to autoimport .sql file (for example ./db-import/db.sql) into your fresh database then all you need is another volume mounted in /docker-entrypoint-initdb.d directory.

From the docs:

When a container is started for the first time, a new database with the specified name will be created and initialized with the provided configuration variables. Furthermore, it will execute files with extensions .sh, .sql and .sql.gz that are found in /docker-entrypoint-initdb.d. Files will be executed in alphabetical order. You can easily populate your mysql services by mounting a SQL dump into that directory and provide custom images with contributed data. SQL files will be imported by default to the database specified by the MYSQL_DATABASE variable.

If file is big with very long values to handle then the import may fail with 'MySql server has gone away' error. To avoid this just set bigger max-allowed-packet parameter:

version: "3"
services:
  mysql:
    container_name: mysql
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: root
      LANG: C.UTF-8
    volumes:
      - ./mysql:/var/lib/mysql
      - ./db-import:/docker-entrypoint-initdb.d
    ports:
      - "3306:3306"
    command: [ "--character-set-server=utf8mb4", "--collation-server=utf8mb4_unicode_ci", "--max-allowed-packet=67108864" ]

@apuppy
Copy link

apuppy commented Apr 9, 2021

I tried -e LANG=en_US.utf8 which didn't work. -e LANG=C.UTF-8 seems to work. That would be a good enough fix! Thanks.

Solved my problem. thank you.

@paulkre
Copy link

paulkre commented Oct 31, 2021

This LANG variable should definitely be documented on Docker Hub.

@cavo789
Copy link

cavo789 commented Aug 29, 2023

Can you try again adding -e LANG=C.UTF-8 to your docker run line and see if that's sufficient?

Great! On my side, same result (of course) with ENV LANG=C.UTF-8 in my Dockerfile. Thanks for the tip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests