Fix: Convert column value in binlog events to bytes instead of utf8 encoded unicode #1158

Merged
merged 7 commits into from Sep 6, 2022


wangzihuacool
Contributor

Related issue: #1157 #1112 #532

Description:

This PR converts character column data to bytes and changes the applier connection charset from utf8mb4 to latin1, to avoid charset conversion and thus avoid Error 1366: Incorrect string value.

The current approach converts the string parsed from the binlog event into utf8 encoded unicode; the applier then sets the database connection charset to utf8mb4 to insert those characters into the table. This has a drawback: if a latin1 table actually holds utf8 encoded data, the latin1 column may contain bytes that are not valid single-byte utf8 characters, most commonly in the \x80-\xFF range. When written to a utf8mb4 column without conversion, they fail because they do not exist in the utf8 codepage.
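To make the failure mode concrete, here is a minimal sketch in Go (not code from this PR) showing that a lone byte in the \x80-\xFF range is not well-formed UTF-8, while the same character encoded as UTF-8 is. The helper name `isValidUTF8` is made up for illustration; `utf8.Valid` is from the Go standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 is a hypothetical helper: it reports whether raw is a
// well-formed UTF-8 byte sequence.
func isValidUTF8(raw []byte) bool {
	return utf8.Valid(raw)
}

func main() {
	// "café" as stored in a latin1 column: the final character is the single byte 0xE9.
	latin1Bytes := []byte{0x63, 0x61, 0x66, 0xE9}
	// The same text as UTF-8: the final character is the two bytes 0xC3 0xA9.
	utf8Bytes := []byte("café")

	fmt.Println("latin1 bytes valid as UTF-8:", isValidUTF8(latin1Bytes))
	fmt.Println("utf8 bytes valid as UTF-8:", isValidUTF8(utf8Bytes))
}
```

Bytes like the first sequence are the kind of value that can trigger Error 1366 when sent over a utf8mb4 connection as if they were utf8 text.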

Since latin1 is a single-byte encoding and all 256 values of a byte are assigned, in theory any encoded value can be stored in a latin1 field. We convert character data to bytes and then write it to the table as single-byte latin1 characters, so no transcoding problem arises.
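The "every byte value is representable" argument can be sketched as a round trip in Go. This is an illustration only, under the assumption that each byte maps to the code point of the same value (the ISO-8859-1 mapping); MySQL's latin1 differs from that mapping for some bytes in the 0x80-0x9F range, which this sketch ignores. The function names are hypothetical and are not taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// decodeLatin1 maps each byte to the code point with the same value
// (assumption: ISO-8859-1 mapping) and returns the result as a Go string.
func decodeLatin1(raw []byte) string {
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return string(runes)
}

// encodeLatin1 is the inverse; it reports false if s contains a code point
// above 0xFF, which cannot be stored in a single byte.
func encodeLatin1(s string) ([]byte, bool) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, false
		}
		out = append(out, byte(r))
	}
	return out, true
}

// roundTripsAllBytes checks that all 256 byte values survive decode + encode.
func roundTripsAllBytes() bool {
	all := make([]byte, 256)
	for i := range all {
		all[i] = byte(i)
	}
	back, ok := encodeLatin1(decodeLatin1(all))
	return ok && bytes.Equal(all, back)
}

func main() {
	fmt.Println("all 256 byte values round-trip:", roundTripsAllBytes())
}
```

Because no byte value is rejected, passing the raw bytes through untouched never hits an "invalid character" case, which is the property the description relies on.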

@wangzihuacool wangzihuacool changed the title Fix: Convert character to bytes and insert into table using latin1, instead of using utf8 character Fix: Convert column value in binlog events to bytes instead of utf8 encoded unicode Aug 11, 2022
@wangzihuacool
Contributor Author

Restore db connection charset to utf8mb4, no need to change charset to latin1.

@timvaillancourt
Collaborator

@wangzihuacool thanks for this PR! Currently it is not passing CI checks, could you remedy this when you have time? 🙇

@wangzihuacool
Contributor Author

> @wangzihuacool thanks for this PR! Currently it is not passing CI checks, could you remedy this when you have time? 🙇

Thank you for your reply. I have made some fixes for unit tests, and tests pass on my local run.
Looking forward to a CI check.

@wangzihuacool
Contributor Author

wangzihuacool commented Aug 22, 2022

The alter-charset migration tests failed. Since we changed the column value to bytes, there is no character set conversion when applying DML events. To support changing a column's character set when the source table contains characters that are invalid in the destination table (due to charset), we need to merge the commits in PR #1003. Alternatively, when the column's character set is being changed, we can fall back to the previous way of handling characters.

@wangzihuacool
Contributor Author

I've made some changes. If the column value read from the binlog event is a string and the column has no charset conversion, it is converted to bytes; otherwise we fall back to the previous way of handling characters.
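The rule described in this comment could be sketched roughly as follows. This is not the code merged in the PR: the function name `convertColumnValue`, the `hasCharsetConversion` flag, and the `transcode` callback are all hypothetical stand-ins for whatever the real implementation uses.

```go
package main

import "fmt"

// convertColumnValue is a hypothetical sketch of the rule above:
//   - non-string values pass through unchanged;
//   - strings on columns without a charset conversion become raw bytes,
//     so the driver sends them without any re-encoding;
//   - strings on columns with a charset conversion go through the old
//     path, represented here by the transcode callback.
func convertColumnValue(value interface{}, hasCharsetConversion bool, transcode func(string) string) interface{} {
	s, ok := value.(string)
	if !ok {
		return value
	}
	if !hasCharsetConversion {
		return []byte(s)
	}
	return transcode(s)
}

func main() {
	// Placeholder for the previous character handling (assumption).
	identity := func(s string) string { return s }

	fmt.Printf("%T\n", convertColumnValue("caf\xe9", false, identity))
	fmt.Printf("%T\n", convertColumnValue("café", true, identity))
	fmt.Printf("%T\n", convertColumnValue(int64(7), false, identity))
}
```

Keeping the fallback path separate means an alter-charset migration still gets real transcoding, while ordinary migrations copy the bytes as they are.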
Now the migration tests on MySQL 5.7 and 8.0 both pass on my local run. Looking forward to a CI check. Thank you.


3 participants