
Kudu Connector rework #78

Merged: 7 commits, merged on May 19, 2020

Conversation

@mbalassi (Contributor):

Based on our proposal on the developer mailing list, @gyfora and @thebalu have reworked the Kudu connector and added a Table API connector for it. The proposed changes are documented in the following design doc.

The initial commit by @gyfora contains the API rework, followed by the Table API addition and the respective README updates by @thebalu.

@mbalassi (Contributor, Author):

The checks seem to be failing because we also bumped the Flink version to 1.10.0 but missed updating the .travis.yml accordingly. I will propose that change in a separate pull request.

@mbalassi (Contributor, Author):

I have opened the PR bumping the underlying Flink version in #79. That will fix the CI failures for this PR. Let me know if you prefer that I also push that commit to this branch for visibility.

@granthenke (Member) left a comment:

Just passing through with a quick scan while I had some time.

@@ -30,24 +30,38 @@
<packaging>jar</packaging>

<properties>
-<kudu.version>1.10.0</kudu.version>
+<kudu.version>1.11.1</kudu.version>
@granthenke (Member):

Depending on when this lands, the Apache Kudu community is currently in the process of releasing 1.12.0. The relevant features that might impact this integration are support for DATE and VARCHAR types.

Contributor:

I agree, we did not consider this 👍

Contributor:

I took a look at Kudu 1.12.0. Date support seems to be an easy addition, but VARCHAR types are not supported by the Blink planner.
For this PR, I suggest we stay on 1.11.1 and add Date support after the Kudu release is final.

} else {
    Type type = column.getType();
    switch (type) {
        case BINARY:
-           values.setField(pos, name, row.getBinary(name));
+           values.setField(pos, row.getBinary(name));
@granthenke (Member):

Given that values.setField takes an Object, you can probably remove this switch and call row.getObject(name).

Contributor:

Yes, let's simplify that!
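
A minimal sketch of the agreed simplification (the helper name is hypothetical; values, row, pos and name mirror the snippet above):

import org.apache.flink.types.Row;
import org.apache.kudu.client.RowResult;

// Sketch only: Row.setField accepts an Object and RowResult.getObject(name)
// returns the column value boxed, so the whole per-type switch collapses
// into a single call.
static void setRowField(Row values, RowResult row, int pos, String name) {
    values.setField(pos, row.isNull(name) ? null : row.getObject(name));
}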

-    values.setField(pos, name, row.getLong(name) / 1000);
+    try {
+        values.setField(pos, row.getTimestamp(name));
+    } catch (Exception e) {
@granthenke (Member):

What exception is this handling?

Contributor:

That is beyond me; I will ask Balazs if he knows why it was there.

Contributor:

This is actually not necessary, especially once the simplification suggested in the comment above is applied.

* Catalog for reading and creating Kudu tables.
*/
@PublicEvolving
public class KuduCatalog extends AbstractReadOnlyCatalog {
@granthenke (Member):

nit: extending AbstractReadOnlyCatalog to make it not read-only is a bit confusing. Maybe call it AbstractCatalog and note in the documentation that it's read-only by default?

Contributor:

I just added this as a helper class to factor out the unsupported features of the Catalog. Originally the intention was to make it read-only in the sense that it would only grant access to tables already present in Kudu.

Now that we have added support for creating Kudu tables, the user is still not allowed to create other kinds of tables (as you normally would in the Hive or in-memory catalog), but at this point the naming might be unfortunate.

}

@Override
public List<String> listTables(String databaseName) throws DatabaseNotExistException, CatalogException {
@granthenke (Member):

This API is a bit unusual, given that the call to Kudu isn't actually filtering by "database" at all.

Kudu doesn't have a strict first-class database concept yet, though it does have a table name syntax to help loosely represent a database. The format is effectively <database>.<tablename>.

Handling this is likely out of scope for this patch, but I wanted to provide some context.

Contributor:

For now we decided not to introduce any database concept in this catalog, for the reasons you highlighted, and to treat everything as a single database.

In the future we could improve this and adopt the convention you mentioned; a sketch follows. In that case the default database could be the global namespace, as it is now.
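
For illustration only (a hypothetical helper, not part of this PR):

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the "<database>.<tablename>" convention: a future
// listTables(databaseName) could keep only the tables carrying the database
// prefix and strip that prefix from the returned names.
static List<String> filterTablesByDatabase(List<String> allKuduTables, String databaseName) {
    String prefix = databaseName + ".";
    return allKuduTables.stream()
            .filter(name -> name.startsWith(prefix))
            .map(name -> name.substring(prefix.length()))
            .collect(Collectors.toList());
}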

}

private DescriptorProperties getValidatedProps(Map<String, String> properties) {
checkNotNull(properties.get(KUDU_MASTERS), "Missing required property " + KUDU_MASTERS);
@granthenke (Member):

Other integrations have provided an application-level configuration to use a default if a per-table override is not provided. Does that make sense here too?

Contributor:

The KuduCatalog basically provides the default for this, so as long as you use the KuduCatalog (and specify the masters there) you wouldn't have to specify them for individual tables.

On the other hand, when you create Kudu tables in other catalogs you end up in this code path and have to specify the Kudu masters.
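
To illustrate the catalog-level default (a sketch; the KuduCatalog constructor arguments and import path are assumptions, not verified against this PR):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.connectors.kudu.table.KuduCatalog; // assumed package

public class KuduCatalogExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.newInstance().build());
        // Masters are configured once, at the catalog level...
        tEnv.registerCatalog("kudu", new KuduCatalog("master1:7051,master2:7051"));
        tEnv.useCatalog("kudu");
        // ...so tables resolved through this catalog need no per-table
        // masters property; only tables defined in other catalogs do.
    }
}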

public void setKeyFields(String[] keyFields) { /* this has no effect */}

@Override
public void setIsAppendOnly(Boolean isAppendOnly) { /* this has no effect */}
@granthenke (Member):

I guess this could enforce that only the INSERT operation is used? Is that useful?

Contributor:

This method is called to indicate that the sink (which supports upserts) doesn't actually need to upsert, only append, which might be cheaper in certain cases.

In our case upsert and append are the same thing (we cannot append like in Kafka), so we should not do anything here.


public static KuduTableInfo createTableInfo(String tableName, TableSchema schema, Map<String, String> props) {

boolean createIfMissing = props.containsKey(KUDU_HASH_COLS);
@granthenke (Member):

I don't understand this, what does KUDU_HASH_COLS have to do with createIfMissing?

Contributor:

When the user defines a table, we try to infer from the provided properties whether the table already exists in Kudu or not.

The logic is that if the user provides the hash columns, the table should be created (hash columns are only used at creation time).

ColumnSchema.ColumnSchemaBuilder builder = new ColumnSchema
.ColumnSchemaBuilder(t.f0, KuduTypeUtils.toKuduType(t.f1))
.key(keyColumns.contains(t.f0))
.nullable(!keyColumns.contains(t.f0) && t.f1.getLogicalType().isNullable());
@granthenke (Member):

Should this just use t.f1.getLogicalType().isNullable() and let Kudu complain if it's a key column? Silently dropping nullable seems like it could be problematic.

Contributor:

The Flink/Calcite type checking for relational queries takes nullability a bit too seriously at this point.

It is almost impossible to use NOT NULL types in queries or with other connectors. Due to this limitation we decided not to treat Kudu key columns as non-nullable on the Flink side.

What this means is that we don't require Kudu key columns to be NOT NULL in Flink, and rely only on the key-column property to make a column a Kudu key column (which we have to set as non-nullable in the Kudu schema).

}

@Override
public Type visit(VarCharType varCharType) {
@granthenke (Member):

Kudu 1.12.0 will have VARCHAR

Contributor:

👍

Contributor:

VARCHAR does not seem to be supported by the Blink planner yet.

@thebalu (Contributor) commented Apr 22, 2020:

We have addressed @granthenke's comments and implemented the recommended simplification. We have also rebased, so the checks now pass.
Please let us know if you have any additional suggestions.
cc @lresende

@mbalassi (Contributor, Author):

Hi @lresende, we would greatly appreciate it if you could review this and merge it if you are satisfied. Let us know if you need further input from us.

@@ -73,7 +73,8 @@ void testInputFormatWithProjection() throws Exception {
private List<Row> readRows(KuduTableInfo tableInfo, String... fieldProjection) throws Exception {
    String masterAddresses = harness.getMasterAddressesAsString();
    KuduReaderConfig readerConfig = KuduReaderConfig.Builder.setMasters(masterAddresses).build();
-   KuduRowInputFormat inputFormat = new KuduRowInputFormat(readerConfig, tableInfo, new ArrayList<>(), Arrays.asList(fieldProjection));
+   KuduRowInputFormat inputFormat = new KuduRowInputFormat(readerConfig, tableInfo, new ArrayList<>(),
+           fieldProjection == null ? null : Arrays.asList(fieldProjection));
Contributor:

We should also add a test that covers this; a sketch follows.
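
Something along these lines, perhaps (a hypothetical sketch reusing the readRows helper and a tableInfo field from the surrounding test class; the expected arity of 5 is illustrative):

import java.util.List;
import org.apache.flink.types.Row;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical test for the null-projection branch: passing null for the
// projection should read every column of the table.
@Test
public void testInputFormatWithNullProjection() throws Exception {
    List<Row> rows = readRows(tableInfo, (String[]) null); // null = all columns
    for (Row row : rows) {
        assertEquals(5, row.getArity()); // no projection: full row arity
    }
}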

@lresende (Member) commented May 7, 2020:

I see some comments still being addressed; please let me know when it's ready for final review/merge.

@gyfora (Contributor) commented May 7, 2020:

We have made some improvements and bug fixes (after extensive testing) but I think this is ready for final review in any case :)

@nutony111:

I added the flink-connector-kudu_2.11 dependency to my project, but when I run mvn install I get this error message:
Could not transfer metadata org.apache.bahir:flink-connector-kudu_2.11:1.1-SNAPSHOT/maven-metadata.xml from/to ,!cloudera (https://repository.cloudera.com/artifactory/cloudera-repos/): D:\mavencangku\org\apache\bahir\flink-connector-kudu_2.11\1.1-SNAPSHOT\maven-metadata-,!cloudera.xml.part.lock (Incorrect file name, directory name, or volume label syntax.)
How can I deal with this?

@mbalassi (Contributor, Author):

@nutony111 This seems completely unrelated to this pull request. We have not modified the repository information, and that repository is not included in bahir-flink. It looks like a temporary connection issue on your end, or a strange local Maven setup.

@nutony111 commented May 12, 2020 via email.

@gyfora (Contributor) commented May 19, 2020:

@lresende, the PR is ready for your review when you have some time :)
