source: implementation for clickhouse #3361

subodh1810 · 2021-05-11T11:57:54Z

What

Issue : #3317
Docs : https://clickhouse.tech/docs/en/
JDBC : https://github.com/ClickHouse/clickhouse-jdbc
Docker images : https://hub.docker.com/r/yandex/clickhouse-server/tags?page=1&ordering=last_updated
The following screenshots show the time taken for full refresh and incremental sync via the JDBC stress test.
A sync takes between 20 and 25 seconds to finish.

How

The solution uses AbstractJdbcSource with a bit of tweaking.

Pre-merge Checklist

Run integration tests
Publish Docker images

Recommended reading order

test.java
component.ts
the rest

subodh1810 · 2021-05-11T12:00:01Z

/test connector=source-clickhouse

🕑 source-clickhouse https://github.com/airbytehq/airbyte/actions/runs/831582250
✅ source-clickhouse https://github.com/airbytehq/airbyte/actions/runs/831582250

subodh1810 · 2021-05-11T12:00:09Z

/test connector=source-mysql

🕑 source-mysql https://github.com/airbytehq/airbyte/actions/runs/831582410
✅ source-mysql https://github.com/airbytehq/airbyte/actions/runs/831582410

subodh1810 · 2021-05-11T12:00:15Z

/test connector=source-postgres

🕑 source-postgres https://github.com/airbytehq/airbyte/actions/runs/831583005
✅ source-postgres https://github.com/airbytehq/airbyte/actions/runs/831583005

subodh1810 · 2021-05-11T12:01:07Z

/test connector=source-mssql

🕑 source-mssql https://github.com/airbytehq/airbyte/actions/runs/831585205
✅ source-mssql https://github.com/airbytehq/airbyte/actions/runs/831585205

subodh1810 · 2021-05-11T12:01:32Z

/test connector=source-oracle

🕑 source-oracle https://github.com/airbytehq/airbyte/actions/runs/831586629
✅ source-oracle https://github.com/airbytehq/airbyte/actions/runs/831586629

subodh1810 · 2021-05-11T12:01:39Z

/test connector=source-redshift

🕑 source-redshift https://github.com/airbytehq/airbyte/actions/runs/831586785
✅ source-redshift https://github.com/airbytehq/airbyte/actions/runs/831586785

jrhizor · 2021-05-11T15:51:55Z

...c/src/testFixtures/java/io/airbyte/integrations/source/jdbc/test/JdbcSourceStandardTest.java

@@ -389,7 +412,8 @@ void testReadMultipleTables() throws Exception {

    setEmittedAtToNull(actualMessages);

-    assertEquals(expectedMessages, actualMessages);
+    assertTrue(expectedMessages.size() == actualMessages.size() && expectedMessages.containsAll(actualMessages)


nit: easier to read as separate asserts
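
A minimal sketch of the suggestion above, written in plain Java rather than JUnit so that it stands alone. The class and method names are hypothetical. It checks the same two conditions as the combined `assertTrue`, but each failure carries its own message, so it is clear whether the sizes differ or an unexpected message showed up.

```java
import java.util.List;

public class SeparateAsserts {

  // Hypothetical helper: the same two conditions as the combined check,
  // reported separately.
  static <T> void assertSameElements(List<T> expected, List<T> actual) {
    if (expected.size() != actual.size()) {
      throw new AssertionError(
          "size mismatch: expected " + expected.size() + " but got " + actual.size());
    }
    if (!expected.containsAll(actual)) {
      throw new AssertionError("actual contains elements that were not expected: " + actual);
    }
  }

  public static void main(String[] args) {
    // Order differs but the contents match, so this passes.
    assertSameElements(List.of("a", "b"), List.of("b", "a"));
  }
}
```

In the JUnit test itself this would presumably become `assertEquals(expectedMessages.size(), actualMessages.size());` followed by `assertTrue(expectedMessages.containsAll(actualMessages));`.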

jrhizor · 2021-05-11T15:58:10Z

...ava/io/airbyte/integrations/source/clickhouse/ClickHouseJdbcStreamingQueryConfiguration.java

+  @Override
+  public void accept(Connection connection, PreparedStatement preparedStatement)
+      throws SQLException {
+


Is there a reason ClickHouse doesn't need autocommit or fetch size adjustments like JdbcStreamingQueryConfiguration for other databases?

If so, it'd be good to have a comment here to understand why this is a no-op.

I am still trying to understand why auto-commit needs to be off for other databases and whether it's required for ClickHouse.

davinchia · 2021-05-12T02:57:43Z

airbyte-db/src/main/java/io/airbyte/db/jdbc/JdbcUtils.java

@@ -169,13 +169,22 @@ public static void setStatementField(PreparedStatement preparedStatement,
    switch (cursorFieldType) {
      // parse date, time, and timestamp the same way. this seems to not cause an problems and allows us


nit: update this comment

can we also leave a comment here summarising what we discussed yesterday around potential errors using Date as a cursor?

davinchia · 2021-05-12T02:58:40Z

airbyte-integrations/connectors/source-clickhouse/build.gradle

+
+dependencies {
+    implementation project(':airbyte-db')
+    implementation project(':airbyte-integrations:bases:base-java')


nit: sort this alphabetically

davinchia · 2021-05-12T02:59:18Z

airbyte-integrations/connectors/source-clickhouse/build.gradle

+    integrationTestJavaImplementation project(':airbyte-integrations:connectors:source-clickhouse')
+    integrationTestJavaImplementation "org.testcontainers:clickhouse:1.15.3"
+
+    implementation files(project(':airbyte-integrations:bases:base-java').airbyteDocker.outputs)


nit: move this to the implementation project block

davinchia · 2021-05-12T03:01:49Z

airbyte-integrations/connectors/source-clickhouse/build.gradle

+
+    implementation 'ru.yandex.clickhouse:clickhouse-jdbc:0.3.1'
+
+    integrationTestJavaImplementation project(':airbyte-integrations:bases:standard-source-test')


nit: group as such

    integrationTestJavaImplementation project(':airbyte-integrations:bases:standard-source-test')
    integrationTestJavaImplementation project(':airbyte-integrations:connectors:source-clickhouse')
    integrationTestJavaImplementation testFixtures(project(':airbyte-integrations:connectors:source-jdbc'))
    integrationTestJavaImplementation "org.testcontainers:clickhouse:1.15.3"

davinchia · 2021-05-12T03:09:20Z

...ava/io/airbyte/integrations/source/clickhouse/ClickHouseJdbcStreamingQueryConfiguration.java

+import java.sql.Connection;
+import java.sql.PreparedStatement;
+
+public class ClickHouseJdbcStreamingQueryConfiguration implements JdbcStreamingQueryConfiguration {


Instead of doing this, let's create a NoOpJdbcStreamingQueryConfiguration in the airbyte-db package and reuse it in the ClickHouse class. The cost of doing so is minimal (we still add only one new class in total), and it lets other classes reuse it in the future (I'd imagine there might be some).
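
A rough sketch of what such a shared no-op class could look like. The `JdbcStreamingQueryConfiguration` interface below is a stand-in declared only so the example compiles on its own; it copies the `accept` signature from the diff, and the real interface in airbyte-db may differ. `NoOpExample` is a hypothetical name.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Stand-in for the Airbyte interface (assumption: it has a single accept method).
interface JdbcStreamingQueryConfiguration {
  void accept(Connection connection, PreparedStatement preparedStatement) throws SQLException;
}

// Shared no-op: leaves auto-commit and fetch size exactly as the driver set them.
class NoOpJdbcStreamingQueryConfiguration implements JdbcStreamingQueryConfiguration {

  @Override
  public void accept(Connection connection, PreparedStatement preparedStatement)
      throws SQLException {
    // Intentionally empty.
  }
}

public class NoOpExample {

  public static void main(String[] args) throws SQLException {
    JdbcStreamingQueryConfiguration config = new NoOpJdbcStreamingQueryConfiguration();
    // The no-op never dereferences its arguments, so nulls are safe in this demo.
    config.accept(null, null);
    System.out.println("no-op configuration applied");
  }
}
```

A source whose driver ignores these settings could then pass the shared class instead of defining its own empty implementation.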

davinchia · 2021-05-12T03:10:46Z

...ava/io/airbyte/integrations/source/clickhouse/ClickHouseJdbcStreamingQueryConfiguration.java

+public class ClickHouseJdbcStreamingQueryConfiguration implements JdbcStreamingQueryConfiguration {
+
+  /**
+   * The reason accept method for ClickHouse is not setting auto commit to false like other JDBC


Nice comment! I would move this to the constructor after we switch to using NoOpJdbcStreamingQueryConfiguration.

davinchia · 2021-05-12T03:11:13Z

...ava/io/airbyte/integrations/source/clickhouse/ClickHouseJdbcStreamingQueryConfiguration.java

+   * sources is cause method {@link ru.yandex.clickhouse.ClickHouseConnectionImpl#setAutoCommit} is
+   * empty. The reason accept method for ClickHouse is not setting fetch size to 1000 like other JDBC
+   * sources is cause method {@link ru.yandex.clickhouse.ClickHouseStatementImpl#setFetchSize} is
+   * empty


Curious - so what's Clickhouse's current fetch size?

I tried to figure this out but I don't know. The JDBC driver returns 0 as the fetch size, and the docs don't mention it either.

davinchia · 2021-05-12T03:14:38Z

...rce-clickhouse/src/main/java/io/airbyte/integrations/source/clickhouse/ClickHouseSource.java

+            tableInfo -> {
+              try {
+                return database.resultSetQuery(connection -> {
+                  String sql = "SELECT name FROM system.columns WHERE database = ? AND  table = ? AND is_in_primary_key = 1";


Remind me again - why does ClickHouse require a different manner of finding primary keys?

/**
 * The default implementation relies on {@link java.sql.DatabaseMetaData#getPrimaryKeys} method to
 * get it but the ClickHouse JDBC driver returns an empty result set from the method
 * {@link ru.yandex.clickhouse.ClickHouseDatabaseMetadata#getPrimaryKeys}. That's why we have to
 * query the system table mentioned here
 * https://clickhouse.tech/docs/en/operations/system-tables/columns/ to fetch the primary keys.
 */

Added this as comment as well
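
For illustration, the lookup described in that comment could be sketched with plain JDBC as follows. The PR itself goes through `database.resultSetQuery`, so the class and helper names here are hypothetical; only the SQL text comes from the diff.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ClickHousePrimaryKeys {

  // Query from the diff above: system.columns flags primary key columns.
  static final String PRIMARY_KEY_SQL =
      "SELECT name FROM system.columns WHERE database = ? AND table = ? AND is_in_primary_key = 1";

  // Hypothetical helper: returns the primary key column names of one table.
  static List<String> primaryKeys(Connection connection, String database, String table)
      throws SQLException {
    try (PreparedStatement statement = connection.prepareStatement(PRIMARY_KEY_SQL)) {
      statement.setString(1, database);
      statement.setString(2, table);
      try (ResultSet resultSet = statement.executeQuery()) {
        List<String> columns = new ArrayList<>();
        while (resultSet.next()) {
          columns.add(resultSet.getString("name"));
        }
        return columns;
      }
    }
  }
}
```

Binding the database and table names as parameters keeps the query safe for names that contain quotes.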

davinchia · 2021-05-12T03:16:52Z

airbyte-integrations/connectors/source-clickhouse/src/main/resources/spec.json

+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "ClickHouse Source Spec",
+    "type": "object",
+    "required": ["host", "port", "database", "username", "password"],


In general, we don't want password to be required, which makes it easier for users to run tests. Does ClickHouse enforce having a password?

davinchia · 2021-05-12T03:20:15Z

airbyte-integrations/connectors/source-clickhouse/src/main/resources/spec.json

@@ -0,0 +1,38 @@
+{
+  "documentationUrl": "https://docs.airbyte.io/integrations/destinations/clickhouse",


Question for my own understanding: we spoke about how ClickHouse has different table engines under the hood. Do you think users/Airbyte might benefit from that being exposed?

Small sanity check that treating Clickhouse as a JDBC source is good enough for our purposes.

As discussed on the call, we don't need the user to specify the engine in the setup wizard because our approach is not specific to an engine.

davinchia · 2021-05-12T03:21:22Z

...byte/integrations/io/airbyte/integration_tests/sources/ClickHouseJdbcStandardSourceTest.java

+  public String createTableQuery(String tableName, String columnClause, String primaryKeyClause) {
+    return String.format("CREATE TABLE %s(%s) %s",
+        tableName, columnClause, primaryKeyClause.equals("") ? "Engine = TinyLog"
+            : "ENGINE = MergeTree() ORDER BY " + primaryKeyClause + " PRIMARY KEY "


I know what this is because we spoke about this yesterday. can we leave a comment explaining/pointing to docs why we require this?
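
One way the requested comment might read, shown as a standalone sketch. The diff excerpt is cut off after `" PRIMARY KEY "`, so the final concatenation is an assumption, as is the idea that `primaryKeyClause` is either empty or a parenthesised column list such as `(id)`. The class name is hypothetical.

```java
public class ClickHouseCreateTable {

  static String createTableQuery(String tableName, String columnClause, String primaryKeyClause) {
    // ClickHouse needs an explicit table engine. TinyLog does not support indexes,
    // so it is only used when the test asks for no primary key. MergeTree needs an
    // ORDER BY clause, and its primary key must be a prefix of the sorting key, so
    // the same column list is used for both.
    String engineClause = primaryKeyClause.isEmpty()
        ? "ENGINE = TinyLog"
        : "ENGINE = MergeTree() ORDER BY " + primaryKeyClause + " PRIMARY KEY " + primaryKeyClause;
    return String.format("CREATE TABLE %s(%s) %s", tableName, columnClause, engineClause);
  }

  public static void main(String[] args) {
    System.out.println(createTableQuery("users", "id Int32, name String", ""));
    System.out.println(createTableQuery("users", "id Int32, name String", "(id)"));
  }
}
```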

davinchia · 2021-05-12T03:32:44Z

.../airbyte/integrations/io/airbyte/integration_tests/sources/ClickHouseStandardSourceTest.java

+  }
+
+  @Override
+  protected ConfiguredAirbyteCatalog getConfiguredCatalog() {


@jrhizor I notice this method is identical across a bunch of the StandardSourceTest implementations, is this something we can move into the parent class, and provide a default implementation for, at a later date?

davinchia · 2021-05-12T03:34:40Z

...c/src/testFixtures/java/io/airbyte/integrations/source/jdbc/test/JdbcSourceStandardTest.java

@@ -143,6 +143,28 @@
   */
  public abstract AbstractJdbcSource getSource();

+  protected String createTableQuery(String tableName, String columnClause, String primaryKeyClause) {


davinchia

Looks good!

Some minor comments for readability + couple of nits. Feel free to merge after addressing them.

Guessing we'll put out documentation + publish in a follow up PR?

subodh1810 · 2021-05-12T08:21:36Z

/test connector=source-clickhouse

🕑 source-clickhouse https://github.com/airbytehq/airbyte/actions/runs/834763622
✅ source-clickhouse https://github.com/airbytehq/airbyte/actions/runs/834763622

source: implementation for clickhouse source

29735a5

subodh1810 requested review from cgardens and davinchia May 11, 2021 11:57

subodh1810 self-assigned this May 11, 2021

auto-assign bot requested review from jrhizor and michel-tricot May 11, 2021 11:57

subodh1810 changed the title ~~source: implementation for clickhouse source~~ source: implementation for clickhouse May 11, 2021

jrhizor approved these changes May 11, 2021

address PR comments

a25359e

davinchia reviewed May 12, 2021

davinchia approved these changes May 12, 2021

address PR comments by Davin

c373ff9

forgot to update comment

69b5d8f

subodh1810 merged commit c786118 into master May 12, 2021

subodh1810 deleted the source-clickhouse branch May 12, 2021 09:06

subodh1810 mentioned this pull request May 13, 2021

New Destination: ClickHouse #1903

Closed

davinchia mentioned this pull request May 26, 2021

Snowflake source integration #1629

Closed

igrankova added connectors/source/clickhouse connectors/sources-database labels Jan 5, 2022
