Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 MySQL destination: normalization #4163

Merged
merged 46 commits into from
Jul 4, 2021

Conversation

tuliren
Copy link
Contributor

@tuliren tuliren commented Jun 16, 2021

What

How

Describe the solution

Todo

  • Append hash to table name
    • I could not fix all the test failures after making this change on both Java and Python
    • This change would affect existing normalizations and requires a migration
  • Investigate dependency specification
    • Move dbt-mysql to Dockerfile does not solve the problem
      • pip install dbt-mysql there still resulted in missing dependency exception related to dbt.
    • Cannot change dbt version to 0.19.1
      • dbt-mysql requires dbt 0.19.0
      • Forking our own dbt-mysql that depends on dbt 0.19.1 resulted in a charset issue and test failures.
        • This approach requires more investigation.
  • Add unit test for mysql name converter
  • Check in mysql dbt output
  • Update changed dbt output

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

Expand the checklist which is relevant for this PR.

Connector checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • Secrets are annotated with airbyte_secret in output spec
  • Unit & integration tests added as appropriate (and are passing)
    • Community members: please provide proof of this succeeding locally e.g: screenshot or copy-paste acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • /test connector=connectors/<name> command as documented here is passing.
    • Community members can skip this, Airbyters will run this for you.
  • Code reviews completed
  • Credentials added to Github CI if needed and not already present. instructions for injecting secrets into CI.
  • Documentation updated
    • README
    • CHANGELOG.md
    • Reference docs in the docs/integrations/ directory.
    • Build status added to build page
  • Build is successful
  • Connector version bumped like described here
  • New Connector version released on Dockerhub by running the /publish command described here
  • No major blockers
  • PR merged into master branch
  • Follow up tickets have been created
  • Associated tickets have been closed & stakeholders notified

@github-actions github-actions bot added area/worker Related to worker normalization labels Jun 16, 2021
@tuliren tuliren force-pushed the liren/mysql-destination-normalization branch from 934481e to 47945bf Compare June 18, 2021 08:52
@tuliren
Copy link
Contributor Author

tuliren commented Jun 18, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/950760572
❌ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/950760572

@tuliren
Copy link
Contributor Author

tuliren commented Jun 18, 2021

@ChristopheDuong, I am running into this error in the MySQL integration test:

java.sql.SQLSyntaxErrorException: Identifier name '_airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names' is too long

Do you know why the truncation is not working? It worked in the unit tests.

Here is the logs.

@ChristopheDuong
Copy link
Contributor

ChristopheDuong commented Jun 21, 2021

Do you know why the truncation is not working? It worked in the unit tests.

It seems to be working fine, for example on table names created by normalization: nested_stream_with_co__lting_into_long_names_ab1

but it is not applied to names of raw destination table names that are given as input tables to normalization: _airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names (as listed in the sources.yml file)

Do you know how would destination-mysql choose to call the output raw table for such a stream?

@ChristopheDuong
Copy link
Contributor

ChristopheDuong commented Jun 21, 2021

Do you know how would destination-mysql choose to call the output raw table for such a stream?

After downloading and looking at the output log from the destination-mysql, here is the answer:

2021-06-18 18:49:56 �[1;31mERROR�[m i.a.i.d.b.BufferedStreamConsumer(close):203 - {} - on close failed.
java.sql.SQLSyntaxErrorException: Identifier name '_airbyte_raw_nested_stream_with_complex_columns_resulting_into_long_names' is too long
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120) ~[mysql-connector-java-8.0.22.jar:8.0.22]
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97) ~[mysql-connector-java-8.0.22.jar:8.0.22]
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) ~[mysql-connector-java-8.0.22.jar:8.0.22]
	at com.mysql.cj.jdbc.StatementImpl.executeInternal(StatementImpl.java:764) ~[mysql-connector-java-8.0.22.jar:8.0.22]
	at com.mysql.cj.jdbc.StatementImpl.execute(StatementImpl.java:648) ~[mysql-connector-java-8.0.22.jar:8.0.22]
	at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:194) ~[commons-dbcp2-2.7.0.jar:2.7.0]
	at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:194) ~[commons-dbcp2-2.7.0.jar:2.7.0]
	at io.airbyte.db.jdbc.JdbcDatabase.lambda$execute$0(JdbcDatabase.java:50) ~[io.airbyte-airbyte-db-0.26.0-alpha.jar:?]
	at io.airbyte.db.jdbc.DefaultJdbcDatabase.execute(DefaultJdbcDatabase.java:61) ~[io.airbyte-airbyte-db-0.26.0-alpha.jar:?]
	at io.airbyte.db.jdbc.JdbcDatabase.execute(JdbcDatabase.java:50) ~[io.airbyte-airbyte-db-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.destination.jdbc.DefaultSqlOperations.createTableIfNotExists(DefaultSqlOperations.java:65) ~[io.airbyte.airbyte-integrations.connectors-destination-jdbc-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.destination.jdbc.JdbcBufferedConsumerFactory.lambda$onCloseFunction$3(JdbcBufferedConsumerFactory.java:190) ~[io.airbyte.airbyte-integrations.connectors-destination-jdbc-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.destination.buffered_stream_consumer.OnCloseFunction.accept(OnCloseFunction.java:29) ~[io.airbyte.airbyte-integrations.bases-base-java-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.destination.buffered_stream_consumer.BufferedStreamConsumer.close(BufferedStreamConsumer.java:195) [io.airbyte.airbyte-integrations.bases-base-java-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.base.FailureTrackingAirbyteMessageConsumer.close(FailureTrackingAirbyteMessageConsumer.java:82) [io.airbyte.airbyte-integrations.bases-base-java-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.base.IntegrationRunner.consumeWriteStream(IntegrationRunner.java:138) [io.airbyte.airbyte-integrations.bases-base-java-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.base.IntegrationRunner.run(IntegrationRunner.java:113) [io.airbyte.airbyte-integrations.bases-base-java-0.26.0-alpha.jar:?]
	at io.airbyte.integrations.destination.mysql.MySQLDestination.main(MySQLDestination.java:111) [io.airbyte.airbyte-integrations.connectors-destination-mysql-0.26.0-alpha.jar:?]

So I would say that you need to make sure the destination can implement some logic to fit long stream names successfully and reproduce the same logic in normalization when guessing the input table name... (you are experiencing here one of the pain points I was writing about in my refactoring of Configured Catalog document)

@github-actions github-actions bot added the area/connectors Connector related issues label Jun 23, 2021
@tuliren
Copy link
Contributor Author

tuliren commented Jun 23, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/965479637
❌ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/965479637

@tuliren
Copy link
Contributor Author

tuliren commented Jul 1, 2021

@ChristopheDuong, this PR is ready for review.

However, I was not able to do the following with this PR:

  • Remove dbt-mysql and dbt dependencies from setup
    • I tried to move dbt-mysql to Dockerfile. But installing dbt-mysql using pip does not solve the problem. There were still missing dependency exceptions related to dbt.
    • I also could not change dbt version to 0.19.1.
      • This is because dbt-mysql requires dbt 0.19.0.
      • I tried to fork our own dbt-mysql that depends on dbt 0.19.1. But that ended up with a charset issue, which failed the tests. I think this approach should eventually work. But it requires more investigation. The root cause is probably due to some changes from dbt 0.19.0 to 0.19.1.
  • Append a hash at the end of long table names
    • After making this change in both the Python and Java, I could not fix all the failed integration tests.
    • This change would affect existing normalizations and requires a migration, which seems beyong the scope of this PR.

Can we merge this PR as is for now? I can create issues to track the remaining problems.

I plan to check in the new mysql test output after the code review is done, because there are lots of files and can make it harder to review the code. But I push them if you prefer to review those files as well.

@ChristopheDuong
Copy link
Contributor

ChristopheDuong commented Jul 1, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/989677289
❌ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/989677289

@ChristopheDuong
Copy link
Contributor

ChristopheDuong commented Jul 1, 2021

It seems @marcosmarxm added an extra test case with the dedup fix with CDC...

It is so weird, MySQL runs this query fine:

Screenshot 2021-07-01 at 14 12 57

but fails with this because the column _ab_cdc_deleted_at contains NULL values.

Screenshot 2021-07-01 at 14 13 20

(note that cast(null as float) is perfectly fine, and if you ran the select queries without create table then it's good too)

Comment on lines +67 to +69
# In MySQL, the max number of columns is limited by row size (8KB),
# not by absolute column count. It is way fewer than 1500.
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to be solved in this PR but we should probably handle the validity of rows (in terms of size) in the destination-mysql and log some warning/errors or reject such records?

About 250 columns, seems to be really low, especially for sources like hubspot/salesforce and they could easily run into exceptions there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max number of columns is related to the size of each column. In the test, we are creating string columns, which corresponds to char columns in mysql and each column is relatively large. That's why as few as 250 columns can fail the test. In reality, not every column will be of type char, and the max is 4096 in mysql.

@tuliren
Copy link
Contributor Author

tuliren commented Jul 1, 2021

The root cause of the casting failure is that when using json_extract, the extracted json null is different from a native mysql null. The correct approach is to use json_value for extracting scaler json values, which will return native mysql types.

@tuliren
Copy link
Contributor Author

tuliren commented Jul 1, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/991703530
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/991703530

@tuliren
Copy link
Contributor Author

tuliren commented Jul 1, 2021

/test connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/991823249
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/991823249

@ChristopheDuong
Copy link
Contributor

good!

don't forget to bump and publish MySQL destination and the normalization image then

@tuliren
Copy link
Contributor Author

tuliren commented Jul 4, 2021

/publish connector=bases/base-normalization

🕑 bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/997568949
✅ bases/base-normalization https://github.com/airbytehq/airbyte/actions/runs/997568949

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Jul 4, 2021
@tuliren tuliren merged commit 2caf390 into master Jul 4, 2021
@tuliren tuliren deleted the liren/mysql-destination-normalization branch July 4, 2021 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation area/worker Related to worker normalization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement normalization for the MySql destination.
2 participants