[CALCITE-2913] Adapter for Apache Kafka #1127
Conversation
I was hoping to see some SQL tests in this new adapter, though. Could that be mocked?
There are some test cases in
kafka/pom.xml
Outdated
</dependency>
<dependency>
  <groupId>com.google.auto.value</groupId>
  <artifactId>auto-value</artifactId>
I realize this is a compile-time dependency, but I'm not sure we want to depend on AutoValue.
Generally we try to minimize dependencies.
If you really need it, we should discuss on the dev list what the appropriate codegen library for Calcite is.
auto-value is not necessary; removed.
consumer.subscribe(Collections.singletonList(tableOptions.topicName()));

return new KafkaMessageEnumerator(consumer, tableOptions.rowConverter());
} catch (Exception e) {
Do we need to catch the coarse-grained `Exception`? Perhaps just `KafkaException`.
No checked exception (besides `RuntimeException`) is thrown here actually; removed.
public abstract class KafkaBaseTable implements ScannableTable {
  final KafkaTableOptions tableOptions;

  public KafkaBaseTable(final KafkaTableOptions tableOptions) {
Please make all constructors package-private for now. Until the API is stable, people should use `KafkaTableFactory` only.
updated
private LinkedList<ConsumerRecord<K, V>> bufferedRecords = new LinkedList<>();
private ConsumerRecord<K, V> curRecord;

public KafkaMessageEnumerator(final Consumer consumer,
package-private constructor
updated
 * @param <V>: type for Kafka message value,
 *   refer to {@link ConsumerConfig#VALUE_DESERIALIZER_CLASS_CONFIG};
 */
public class KafkaMessageEnumerator<K, V> implements Enumerator {
What is the type (`T`) for `Enumerator<T>`?
Updated with `Object[]`, which refers to an array of elements in a row.
site/_docs/kafka_adapter.md
Outdated
The Kafka adapter exposes a Kafka topic as a STREAM table, so it can be queried using
[Calcite Stream SQL]({{ site.baseurl }}/docs/stream.html). Note that the adapter will not attempt to scan all topics;
instead users need to configure tables one-by-one.
Can you mention that one topic maps to one table (a one-to-one mapping)?
Added as:
Note that the adapter will not attempt to scan all topics;
instead, users need to configure tables manually. One Kafka stream table maps to one Kafka topic.
"name": "TABLE_NAME",
"type": "custom",
"factory": "org.apache.calcite.adapter.kafka.KafkaTableFactory",
"row.converter": "com.example.CustKafkaRowConverter",
Note to myself: check whether attributes with a dot (`.`) are a good idea.
/**
 * Parameter constants used to define a Kafka table.
 */
public class KafkaTableConstants {
Make all constants package-private.
Use an interface instead of a class for defining constants.
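The two suggestions pull in opposite directions, since fields declared on an interface are implicitly `public static final`, so an interface cannot hold package-private constants. A minimal sketch of one way to keep constants package-private (the key names are illustrative, not necessarily the adapter's actual keys):

```java
/**
 * Sketch of a non-instantiable constants holder. Unlike interface
 * fields (which are implicitly public static final), these fields
 * stay package-private. Key names here are illustrative.
 */
final class KafkaTableConstantsSketch {
  static final String TOPIC_NAME = "topic.name";
  static final String ROW_CONVERTER = "row.converter";

  private KafkaTableConstantsSketch() {
    // no instances
  }
}
```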
updated
 * @param topicName, Kafka topic name;
 * @return row type
 */
RelDataType rowDataType(String topicName);
From the configuration I see that you define one converter per topic. If so, why do you need this method (`rowDataType(String)`)?
I don't put the row schema in the constructor as `KafkaRowConverter` is defined as an interface. Usually we use an external metadata system to manage Kafka message schemas; calling `rowDataType(String topic)` would be clearer IMO.
I feel like the current interface is both one-to-one (between topic and row) and one-to-many (between topic and `RelDataType`). It seems more logical to have the following:

interface KafkaRowConverter<K, V> {
  RelDataType rowDataType();
  Object[] toRow(ConsumerRecord<K, V> message);
}

I.e. have a separate `KafkaRowConverter` per topic.
Another option is to provide `RelDataType` as a constructor argument to `KafkaTable`.
Please let me know if you agree with such an API.
Traditionally we have been using a single `Function1<Consumer<K,V>, Object[]>` for such transformations.
@asereda-gs sorry that I was lost in the thread; the proposed interface looks good to me. If it's not a blocking issue, I would like to close this PR and update it in a new task (CALCITE-3080), to handle user-specified columns together. That way, there would be two provided implementations: 1) to handle user-specified columns, 2) to support external row metadata.
OK if CALCITE-3080 addresses it separately.
    (String) operand.get(KafkaTableConstants.SCHEMA_ROW_CONVERTER))
    .newInstance();
} catch (InstantiationException | IllegalAccessException | ClassNotFoundException e) {
  throw new RuntimeException(e);
Can you give more details in the exception message (e.g. `Failed to create table $T in schema $S`)? In the current context you have the table name and config.
updated with detailed error message
}

curRecord = bufferedRecords.removeFirst();
return true;
Probably it is also a good idea to check `DataContext.Variable.CANCEL_FLAG`, in case the user explicitly unsubscribes.
See `CsvStreamScannableTable` for an example.
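The suggested pattern can be sketched without Calcite's classes; here an `AtomicBoolean` stands in for the value bound to `DataContext.Variable.CANCEL_FLAG`, and the enumerator stops yielding records as soon as the flag is set:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Minimal sketch of an enumerator that honors a cancel flag, loosely
 * modeled on CsvStreamScannableTable. The AtomicBoolean stands in for
 * the value bound to DataContext.Variable.CANCEL_FLAG.
 */
class CancelableEnumerator {
  private final Queue<String> bufferedRecords = new ArrayDeque<>();
  private final AtomicBoolean cancelFlag;
  private String current;

  CancelableEnumerator(AtomicBoolean cancelFlag) {
    this.cancelFlag = cancelFlag;
  }

  void add(String record) {
    bufferedRecords.add(record);
  }

  /** Returns false once the user cancels, even if records remain. */
  boolean moveNext() {
    if (cancelFlag.get() || bufferedRecords.isEmpty()) {
      return false;
    }
    current = bufferedRecords.remove();
    return true;
  }

  String current() {
    return current;
  }
}
```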
Please squash and rebase all commits.
Surely; rebased from github/master and merged into one commit.
);
} catch (ClassNotFoundException e) {
  throw new RuntimeException(
      String.format(Locale.getDefault(), "Class '%s' is not found.",
What I meant here is more like:

catch (CheckedException e) {
  final String details = String.format("Failed to create table %s (topic %s) in schema %s",
      name, topicName, schema.getName());
  throw new RuntimeException(details, e); // instead of just: throw new RuntimeException(e);
}

You don't need to catch each exception individually.
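As a self-contained sketch of that advice: `ReflectiveOperationException` is a common supertype of the checked reflection exceptions, so a single catch clause can wrap them all with context. The `tableName`/`topicName` parameters are illustrative stand-ins for values available at the call site:

```java
import java.util.Locale;

/**
 * Sketch of wrapping checked reflection exceptions with table/topic
 * context in one catch clause. The names (tableName, topicName) are
 * illustrative, not the adapter's actual fields.
 */
class ConverterLoader {
  static Object instantiate(String className, String tableName, String topicName) {
    try {
      return Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      // One catch covers ClassNotFoundException, NoSuchMethodException,
      // IllegalAccessException, InstantiationException and
      // InvocationTargetException.
      final String details = String.format(Locale.ROOT,
          "Failed to create table '%s' (topic '%s') using converter '%s'",
          tableName, topicName, className);
      throw new RuntimeException(details, e); // keep the original cause
    }
  }
}
```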
Ah, I got you; let me update it.
Reformatted; please take another look. I will squash into one commit after review.
} catch (ClassNotFoundException | NoSuchMethodException | IllegalAccessException
    | InstantiationException | InvocationTargetException e) {
  final String details = String.format(
      Locale.getDefault(),
Use `Locale.ROOT`.
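A small illustration of why `Locale.ROOT` matters for machine-oriented strings: the same format string can produce different output under different locales, so `Locale.getDefault()` makes results depend on the host JVM's settings:

```java
import java.util.Locale;

/**
 * Shows that String.format output depends on the locale: grouping
 * separators differ between Locale.ROOT and, e.g., Locale.GERMANY.
 * Locale.ROOT keeps output stable across machines.
 */
class LocaleDemo {
  static String groupedRoot(long n) {
    return String.format(Locale.ROOT, "%,d", n);
  }

  static String groupedGerman(long n) {
    // Same format string, different locale: different separators.
    return String.format(Locale.GERMANY, "%,d", n);
  }
}
```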
Please mention (in docs) that, currently, the Kafka adapter is in beta (preview phase) and we might change the public API or even remove it. I would like to stay flexible for the first couple of versions.
It would be nice to have examples with some JSON messages (using native Calcite JSON functions).
I've updated a note in the document. For JSON examples, I would defer to my next task, mostly to leverage
 */

/**
 * Query provider that reads from files and web pages in various formats.
This doesn't look like a Kafka package description.
for (int idx = 0; idx < 10; ++idx) {
  addRecord(new ConsumerRecord<byte[], byte[]>("testtopic",
      0, idx, ("mykey" + idx).getBytes(Charset.forName("UTF-8")),
Back to your comment here: do you think a vote is required to add a new adapter? Ideally I don't want to add it today and remove it days later.

I don't think a vote is required for a new adapter; it is just a code change, albeit a significant one. That said, I think we should get consensus on the dev list that it's a good idea. It will add to the workload of every future release manager. I've added some high-level review comments to https://issues.apache.org/jira/browse/CALCITE-2913. I will encourage other committers to make high-level comments in JIRA also. @asereda-gs is doing an excellent job of reviewing line-by-line (thank you Andrei!)
sqlline
Outdated
@@ -37,7 +37,7 @@ if [ ! -f target/fullclasspath.txt ]; then
fi

CP=
for module in core cassandra druid elasticsearch2 elasticsearch5 file mongodb server spark splunk geode example/csv example/function; do
for module in core cassandra druid elasticsearch2 elasticsearch5 file mongodb server spark splunk geode example/csv example/function kafka; do
Julian's comment: you've put kafka at the end, but it should be in alphabetical order.
Putting it at the end is, of course, the modest thing to do.
I learned a while ago that being modest (especially when editing files with lots of unit test methods) creates what DBMS folks call a hot-spot. Everyone adds at the end, so we get merge conflicts. Make your change in the most logical place (which is alphabetical if there is no other organizing principle).
kafka/pom.xml
Outdated
<artifactId>calcite-kafka</artifactId>
<packaging>jar</packaging>
<name>calcite kafka</name>
<description>Calcite provider that reads from kafka topics</description>
Please change to:
Kafka Adapter. Exposes kafka topic(s) as stream table(s).
if (tableOptions.getConsumerParams() != null) {
  consumerConfig.putAll(tableOptions.getConsumerParams());
}
Consumer consumer = new KafkaConsumer<>(consumerConfig);
Can we subscribe as late as possible (inside the Enumerator, on the first `next`)?
`prepare` would be the best place; `next` looks odd to me.
Probably it is fine as-is for now. It can be improved later on.
There are so many small commits now. @asereda-gs, can you take a quick look? I would then do a squash.
@xumingmin you can force-push a squashed commit (no need to do it separately). Just rebase locally.
Is there a template for how to write a commit message? I can't find it in http://calcite.apache.org/develop/#contributing.
Also check the history of existing commits.
Got you; made a minor change.
@xumingmin can you please address Julian's comments in JIRA about proper attribution (especially in documentation):
  return this;
}

public Map getConsumerParams() {
What are the key/value types? `Map<String, String>`?
Yes, it should be.
@xumingmin can you please address Julian's comments in JIRA about proper attribution (especially in documentation) :
Make sure to respect Kafka's branding. Each page in the documentation where Kafka is mentioned, the first mention should call it "Apache Kafka". I'd change the name of this case/commit to "Adapter for Apache Kafka".
I forgot some places; let me go through it again.
What are the key/value types? `Map<String, String>`?
Should be; let me specify it explicitly.
site/_docs/kafka_adapter.md
Outdated
how to decode Kafka message to Calcite row. [KafkaRowConverterImpl]({{ site.apiRoot }}/org/apache/calcite/adapter/kafka/KafkaRowConverterImpl.html)
is used if not provided;

2. More consumer settings can be added in parameter `consumer.paras`;
It should be `consumer.params`.
Yes, let me update it.
  bufferedRecords.add(record);
}

consumer.commitSync();
If you use the `commitSync()` API, you should set `enable.auto.commit` to false before creating the KafkaConsumer instance.
+1
I would remove this line directly. If users need auto-commit, they can set it in `consumer.params`.
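For reference, a sketch of the consumer configuration this discussion implies: explicit `commitSync()` calls pair with `enable.auto.commit=false`, set before the `KafkaConsumer` is created. Plain `Properties` is used here so the snippet stays library-free, and the bootstrap address is a placeholder:

```java
import java.util.Properties;

/**
 * Sketch of consumer configuration for manual offset commits: when
 * commitSync() is called explicitly, "enable.auto.commit" should be
 * "false" before the KafkaConsumer instance is created. The bootstrap
 * address is a placeholder.
 */
class ConsumerConfigSketch {
  static Properties manualCommitConfig(String bootstrapServers) {
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", bootstrapServers);
    // Disable auto-commit so that commitSync() controls offsets.
    props.setProperty("enable.auto.commit", "false");
    return props;
  }
}
```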
@xumingmin Is it necessary to support users consuming from a specific timestamp?
@wangzzu I would like to add this feature in a follow-up task, to support both from/to timestamp/offset. Created https://issues.apache.org/jira/browse/CALCITE-3073 for tracking.
 * @return fields in the row
 */
@Override public Object[] toRow(final ConsumerRecord<String, String> message) {
  Object[] fields = new Object[4];
Why do you take an array of size 4, not 3?
True, thanks for pointing that out.
 */
@Override public RelDataType rowDataType(final String topicName) {
  final RelDataTypeFactory typeFactory =
      new SqlTypeFactoryImpl(RelDataTypeSystem.DEFAULT);
`topicName` is unused; is there any reason to pass it to this method as a parameter?
We may update the way to define the row schema; refer to the thread https://github.com/apache/calcite/pull/1127/files#r283079680. Please let me know if it's a blocking issue.
@asereda-gs any idea when we can close this PR? I notice that the Calcite 1.20.0 release is coming. Btw, I doubt there's enough time to merge CALCITE-3080; it may be a better idea to make the change below now. Any thoughts?
@xumingmin it is fine if we redefine it. That is the reason why I prefer to mention in the 1.20 release notes that the Kafka adapter is currently in "preview mode" (some API might change). Please rebase this PR (there are some conflicts).
I see high-level discussions about scheduling and the dependency on CALCITE-3080 going on here. As I said previously, please have these discussions in JIRA, not git.
Expose an Apache Kafka topic as a stream table.
@asereda-gs rebased from apache/master.
Add an adapter to expose Kafka topics as STREAM tables. `KafkaTableFactory` is used here, so end users need to specify the table-topic mapping one by one.

JIRA: https://issues.apache.org/jira/browse/CALCITE-2913
CC: @danny0405