Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't convert a string to binary base64 unnecessarily #25586

Merged
merged 5 commits into from May 24, 2023

Conversation

rodireich
Copy link
Contributor

@rodireich rodireich commented Apr 26, 2023

What

Mysql column that is encoded with any character set and collated (sorted) in a binary collation logic (e.g utf8mb4_bin) is today treated as a binary value and converted to base64 on detonation side.
This is not necessary as the value itself is still a simple string.
We need to treat it as a string and create one on destination.

How

Our code is checking incorrectly the .isBinary() property of a column when deciding how to convert it.
Remove this logic when copying to json.

🚨 User Impact 🚨

Any string value previously synced as base64 will remain so until a reset or a full refresh.
Newly synced values will appear as the original on source side ("abcd").

Resolves #9938 .

@rodireich rodireich requested a review from a team April 26, 2023 20:52
@rodireich rodireich requested a review from a team as a code owner April 26, 2023 20:52
@rodireich rodireich linked an issue Apr 26, 2023 that may be closed by this pull request
@rodireich
Copy link
Contributor Author

rodireich commented Apr 26, 2023

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/4813268088
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/4813268088
No Python unittests run

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestConnection.test_check: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestDiscovery.test_discover: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestBasicRead.test_read: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestFullRefresh.test_sequential_reads: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/plugin.py:63: Skipping TestIncremental.test_two_sequential_reads: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:100: The previous and actual specifications are identical.
================= 22 passed, 6 skipped, 30 warnings in 18.49s ==================

@github-actions
Copy link
Contributor

github-actions bot commented Apr 26, 2023

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

❌ Sources (2)

Connector Version Changelog Publish
source-mysql 2.0.23
(diff seed version)
source-mysql-strict-encrypt 2.0.23 🔵
(ignored)
🔵
(ignored)
  • See "Actionable Items" below for how to resolve warnings and errors.

✅ Destinations (0)

Connector Version Changelog Publish
  • See "Actionable Items" below for how to resolve warnings and errors.

✅ Other Modules (0)

Actionable Items

(click to expand)

Category Status Actionable Item
Version
mismatch
The version of the connector is different from its normal variant. Please bump the version of the connector.

doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog
doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.

changelog missing
There is no chnagelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish
not in seed
The connector is not in the cloud or oss registry, so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that you have added a metadata.yaml file and the expected registries are enabled.

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 26, 2023
putString(json, columnName, resultSet, colIndex);
}
}
case CHAR, VARCHAR -> putString(json, columnName, resultSet, colIndex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a test case for binary fields to ensure that conversion to string logic is taking place as expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code change is putting us back to handling VARCHAR the same way under any charset or collation.

We don't have tests for variation of charset as far as I can tell, but the change itself is returning to the way this used to behave before checking for .isBinary() was added.

@edgao
Copy link
Contributor

edgao commented May 1, 2023

checking my understanding:

  • current behavior is that we discover these columns as string and read records as base64encode(select the_column)
  • new behavior is still discovering string, but no longer base64 encoding the data?

this sounds like a breaking change. Do we need any coordination to release this? Should we recommend that users run a reset after upgrading? (or they can manually rewrite their raw tables + rebuild normalized tables I guess, but that sounds complicated)

@rodireich
Copy link
Contributor Author

What you describe @edgao is correct - that's the change.
But I thought of it as non breaking as old values that already synced to destination as base64 would not affect new values synced as a simple string.
Do you think we should flag this beforehand to customers so they can be aware and reset or re-run normalization ?

@edgao
Copy link
Contributor

edgao commented May 2, 2023

presumably there are some folks who need to manually base64-decode their data right now; we should at least tell them to stop doing that decode.

does our schema include anything like format: base64? I.e. can we continue reading base64 for existing users, but new users (or after schema refresh) start reading normal strings?

@rodireich
Copy link
Contributor Author

rodireich commented May 6, 2023

does our schema include anything like format: base64

I don't think that's the case.
The column in the case above was marked correctly as a string but its value was treated like a binary.
Note that the only thing binary here was the column collation order. not any encoding or data.

I'm not sure what can be done other than notifying customers that may have reverse conversion on they destination.
I'm not sure we can even check for existence of such cloud connections as the type is string.

@edgao
Copy link
Contributor

edgao commented May 8, 2023

sorry, what does this mean?

the only thing binary here was the column collation order. not any encoding or data

@rodireich
Copy link
Contributor Author

rodireich commented May 8, 2023

The case here is db with charset encoding of utf8,
and a collation (or sorting) order of utf8_bin.

CREATE TABLE `INV_LOT_LOC_ID` ( … ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin

The data is encoded in plain utf8 but when sorted for example in ORDER BY, binary logic is used which is comparing byte to byte values rather than go into all the intricacies of variable length multibyte utf8 characters.
For example utf8 aware sorting may treat ß as equal to ss, while binary sorting would place ß after z

Not to be confused with Binary encoding that has its own "binary" collation order which is unrelated to this case and will still get destination strings converted to base64 as it should.
See https://dev.mysql.com/doc/refman/5.7/en/charset-binary-collations.html

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it :/ that seems challenging to coordinate then, if we can't even recognize which workspaces are affected...

is our plan to just tell all source-mysql users something like "check your source DB for columns with COLLATE=utf8_bin and run a reset if you have them?" (or maybe "update your query to attempt base64_decode, and fallback to the raw value if that decode fails"?)

approving since I have no opinion on the code change, but I think the release plan needs some clarification. (destinations has no problems with this, since the schema was and will remain type: string)

@rodireich
Copy link
Contributor Author

Thanks Ed
And this is largely the plan - alert users to check if their connection is affected. Thanks for suggesting fallback in case a decode fails. that should probably solve it for most.
I will coordinate with cloud and community support to send out a message.

@rodireich
Copy link
Contributor Author

rodireich commented May 8, 2023

Airbyte technical support:
Once this change goes in, previously synced mysql-source connections that are syncing:
A database with any charset encoding which is set with an *_bin collation order (sorting)
For example hebrew charset with hebrew_bin collation.

Will see the following potentially breaking change:
Strings that were previously encoded on destination in base64 will now show up in their regular form 'abcd' rather than 'YWJjZA=='.
This means that a table that was previously synced and has VARCHAR columns converted to base64 will now have new values synced as simple strings.

What can be done:
a. Reset of table data on airbyte will get a fresh copy of all rows with now plain 'abcd' string.
b. If there's any script or procedure on destination that is consuming these base64 encoded columns, it should be updated to also be able to process plain strings (e.g if base64 decode failed use the string as is).

Note: the issue won't affect anybody who is running with default settings but only customers that intentionally created a binary collated DB or tables or columns.

cc: @erica-airbyte

@erica-airbyte
Copy link
Contributor

Thanks @rodireich can you fill out the breaking changes doc?

@rodireich
Copy link
Contributor Author

rodireich commented May 8, 2023

@amaliaroye
Copy link
Contributor

Hi @rodireich, is this still scheduled to be released tomorrow, May 24th?

@rodireich rodireich enabled auto-merge (squash) May 24, 2023 14:11
@rodireich rodireich merged commit 804087a into master May 24, 2023
23 checks passed
@rodireich rodireich deleted the 9938-source-mysql-support-utf8_bin-charset branch May 24, 2023 14:33
@rodireich
Copy link
Contributor Author

/connector-performance connector=connectors/source-mysql

@rodireich
Copy link
Contributor Author

/connector-performance connector=connectors/source-mysql ref=master

nguyenaiden pushed a commit that referenced this pull request May 25, 2023
* Don't convert a string to binary base64 unnecessarily

* update doc

* update doc
marcosmarxm pushed a commit to natalia-miinto/airbyte that referenced this pull request Jun 8, 2023
* Don't convert a string to binary base64 unnecessarily

* update doc

* update doc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Source MySQL: support utf8_bin charset
6 participants