Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to output chinese emoji to mysql #306

Open
YunLongND opened this issue Apr 29, 2022 · 13 comments
Open

Failed to output chinese emoji to mysql #306

YunLongND opened this issue Apr 29, 2022 · 13 comments

Comments

@YunLongND
Copy link

My mysql uses code utf8mb4, and mysq's temporary table code is utf8mb4,
When I insert data into mysql temporary table, it has the following error:
org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......

Can someone help me solve this problem? thanks.

@hiroyuki-sato
Copy link
Contributor

hiroyuki-sato commented May 6, 2022

Hello, @YunLongND

  • MySQL: 8.0.28
  • embulk: 0.9.24
  • embulk-output-mysql: 0.10.2

I could insert the following emoji into the MySQL database without any special configuration.
(I created the table using mysql command) I also tested without a creation table.

Could you tell me more detail about your environment (configuration, database version, and so on)?

CREATE TABLE `example` (
  `id` bigint DEFAULT NULL,
  `account` bigint DEFAULT NULL,
  `time` timestamp NULL DEFAULT NULL,
  `purchase` timestamp NULL DEFAULT NULL,
  `comment` text COLLATE utf8mb4_general_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,🍺🍻🍣
in:
  type: file
  path_prefix: sample_
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
  type: mysql
  host: localhost
  mode: insert
  user: user
  password: password
  database: embulk_test
  table: example
mysql> show variables like '%char%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8mb4                                                |
| character_set_connection | utf8mb4                                                |
| character_set_database   | utf8mb4                                                |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8mb4                                                |
| character_set_server     | utf8mb4                                                |
| character_set_system     | utf8mb3                                                |
| character_sets_dir       | /usr/local/Cellar/mysql/8.0.28_1/share/mysql/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)

@YunLongND
Copy link
Author

@hiroyuki-sato Thank you very much for your attention to my question.
My configuration information is as follows:

  • mysql version: 5.7.35
  • embulk: 0.9.23
  • embulk-output-mysql: 0.10.2
    in: type: redshift host: redshift_host user: redshift_user password: redshift_password database: redshift_database port: 5439 fetch_rows: 1000 query: select * from schema.table options: {useUnicode: true, characterEncoding: utf8} out: type: mysql host: mysql_host port: 3306 user: mysql_user password: mysql_password database: mysql_database schema: mysql_schema table: mysql_table create_table_option: DEFAULT CHARSET=utf8mb4 options: {loglevel: 2, useUnicode: true, characterEncoding: utf8} mode: truncate_insert node_count: 16

I noticed that when I use mode truncate_insert to transfer data from Redshift to MySQL, embed first creates a temporary table in mysql, then stores the data in the temporary table, and then transfers the data in the temporary table to the target table.
My chinese emoji have been successfully stored in redshift, and the temporary table has been successfully created, and the code of the temporary table is utf8mb4.
However, when I insert the emoji data in redshift into the temporary table of MySQL, the above error occurs.
If you need more information, please let me know.

@hiroyuki-sato
Copy link
Contributor

Hello, @YunLongND

  • Did you try to insert data into the mysql_table using mysql command? (ex. 1insert into mysql_table values(1,"🍻🍣🍺") ;`
  • Did you try to insert data into another table with mode: insert?

I got the same error message if the target table does not support utf8mb4.
In my case, I altered changed the database character set and recreate the table.
After that, I succeed insert data.

truncate_insert does not recreate the target table.
Did you check the table schema using show create table mysql_table?

I inserted the same data using mode: truncate_insert.

@YunLongND
Copy link
Author

@hiroyuki-sato
I have checked that both my temporary table and my target table are coded utf8mb4.
And using SQL directly to insert emoji into the table can be successful.
Based on the above communication, we cannot confirm where the problem is, so if there is no better suggestion on this issue, let's leave it here for now.
I found a little bit of a similar problem in other embulk projects, I don't know if it's related.
embulk/embulk-input-s3#75
Thank you again for your efforts.

@hiroyuki-sato
Copy link
Contributor

Hello, @YunLongND

I succeed insert emoji in the following environment and configuration.

  • embulk 0.9.24
  • embulk-output-mysql: 0.10.2
  • MySQL: 5.7.38
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

[client]
default-character-set=utf8mb4
id,comment
2,embulk jruby
4,🍺🍻🍣
in:
  type: file
  path_prefix: sample_2
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: comment, type: string}
out:
  type: mysql
  host: 127.0.0.1
  mode: truncate_insert
  user: user
  password: password
  database: embulk_test
  table: example

docker-compose.yml

   version: '3'

   services:
     db:
       image: mysql:5.7
       container_name: mysql_container
       environment:
         MYSQL_ROOT_PASSWORD: password
         MYSQL_DATABASE: embulk_test
         MYSQL_USER: user
         MYSQL_PASSWORD: password
       volumes:
       - ./docker/db/data:/var/lib/mysql
       - ./docker/db/my.cnf:/etc/mysql/conf.d/my.cnf
       restart: always
       ports:
       - 3306:3306

@rajyan
Copy link
Contributor

rajyan commented Aug 5, 2022

I had the exact same problem, and reading the documents, it seems not possible to solve the problem only by setting connector options.

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html

For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.

setting

{ useUnicode: true, characterEncoding: UTF-8 }

using mysql-connector-java:5.1.44 would use utf8mb3.

https://stackoverflow.com/questions/44591895/utf8mb4-in-mysql-workbench-and-jdbc

Starting from MySQL Connector/J 5.1.47,

When UTF-8 is used for characterEncoding in the connection string, it maps to the MySQL character set name utf8mb4.

You can check docs here

I think upgrading the driver version to 5.1.47+ or 8.0.13+ can solve the issue.

Current solution is to set the my.cnf correctly as the document says.

@hiroyuki-sato
Copy link
Contributor

Hello, @rajyan

You can change Connecotor/J to 5.1.47+ (not Connector/J 8.x) using driver_path option. Could you try it?
Could you provide the reproduce configuration/steps like this?

I think 🍣 is the four-byte UTF-8 characters.

echo -n '🍣' | od -t x1
0000000    f0  9f  8d  a3
0000004

About updating the driver.

embulk/embulk-input-jdbc#237 (comment)

We'll eventually want to update JDBC drivers, but actually, JDBC drivers often have silent incompatibility between versions. We have hesitated to update the "default" JDBC driver there, then. (E.g. default options, ...)
Instead, you should be able to switch the JDBC driver version with the driver_path option by yourself, without rebuilding the plugin. Please try that for a while.

See also: embulk/embulk-input-jdbc#238

@rajyan
Copy link
Contributor

rajyan commented Aug 5, 2022

Hi, @hiroyuki-sato

Thank you for noticing driver_path option!

I created a reproducible repo for the error.
https://github.com/rajyan/embulk-mysql-utf8

The error occurred by setting options: { characterEncoding: UTF-8 }, and using Connecotor/J to 5.1.49 didn't help. (maybe we can't change the charset to utf8mb4 by connector option?)
Only setting mysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_general_ci solved the issue.
edit: updated the repro correctly

@rajyan
Copy link
Contributor

rajyan commented Aug 5, 2022

Sorry, I understood my problem and maybe it was not related. The unicode error happens when inserting to the intermediate table (because the default charset is not utf8mb4 the columns of the intermediate tables are not utf8mb4)

@rajyan
Copy link
Contributor

rajyan commented Aug 5, 2022

Updated the repro by adding create_table_option: DEFAULT CHARSET=utf8mb4
rajyan/embulk-mysql-utf8@80880d8

I could confirm that the error only happens in the current connector, and upgrading Connecotor/J to 5.1.49 can solve the error.
Although the inserted 4 byte chars are broken even using 5.1.49 ...

@rajyan
Copy link
Contributor

rajyan commented Aug 5, 2022

Although the inserted 4 byte chars are broken even using 5.1.49 ...

Sorry never mind. It was just my mysql client.

I can confirm that upgrading to Connecotor/J to 5.1.49 can solve the error! 😄
rajyan/embulk-mysql-utf8@0b32bcf

@rajyan
Copy link
Contributor

rajyan commented Aug 5, 2022

Thank you for
embulk/embulk-input-jdbc#238
and
embulk/embulk-input-jdbc#237 (comment)

because there are several simple work arounds/solutions like using driver_path or setting server/default charsets, I’m not that in trouble with this issue now:+1:

@rajyan
Copy link
Contributor

rajyan commented Aug 26, 2022

Summary of the problem
You'll get

org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......

error if you are not setting utf8mb4 in the mysql database default charset.

This error occurs even you are setting

  • The table and column charset as utf8mb4
  • options : { useUnicode: true, characterEncoding: UTF-8 }

because Connector/J before version 5.1.47 uses utf8 (utf8mb3) as a default for the connection with options : { useUnicode: true, characterEncoding: UTF-8 }.

The solution now is

  • Setting the mysql database default charsets to utf8mb4
    or
  • Install driver newer than 5.1.47 and use it by setting through driver_path
    • The default for utf8 in the connector has changed to utf8mb4 as explained above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants