ARROW-10979: [Rust] Basic Kafka Reader #8968
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8968 +/- ##
==========================================
- Coverage 83.22% 83.17% -0.05%
==========================================
Files 196 199 +3
Lines 48232 48321 +89
==========================================
+ Hits 40142 40193 +51
- Misses 8090 8128 +38
Continue to review full report at Codecov.
Thanks @kflansburg, this is interesting. Would you mind explaining your use case for this? It has been a while since I used Kafka personally and I am wondering what the benefit is of having the messages in Arrow format. Would the intent be to also use Arrow for the payload to take full advantage of this? For example, allowing DataFusion to run SQL queries against Kafka topics, similar to KSQL?
Hey @andygrove, this is definitely not a great use case for Arrow since the format is not columnar, but I'm hoping to implement micro-batch style processing (possibly in DataFusion / Ballista), similar to Spark Structured Streaming. I really like your idea of using Kafka as a transport layer for Arrow Flight messages. I was planning to try to implement some sort of JSON parsing -> Arrow StructArray for the Kafka payload field, but parsing it as Arrow Flight would be very cool as well.

FYI, I'm working now to resolve the CI issues related to compiling and/or linking against `librdkafka`.
Hi @kflansburg, this is some great work. I've just gone through the code briefly.
I'd be interested in seeing how we could go about implementing this.
Our JSON reader already has the building blocks needed to trivially do this, and after #8938, you should be able to read all nested JSON types. I played around with converting Avro messages from Kafka into Arrow data; that would also be an interesting fit for your streaming use case.

There is a slight downside to having the … I'm a proponent of bundling crates into …

With the above said, I think we should use this crate as an opportunity to have a bigger discussion about where additional modules should live. For example, I recently opened a draft RFC for … We could try the …
Great, thanks for the tip!
Thanks for the heads up, this seems to be the main issue with CI right now. I would switch to dynamic linking, but Cargo will still not build without the correct libraries present. I figured that this would be a controversial thing to include in-tree for the reasons you mentioned. I don't really want to re-implement all of …

I will leave this discussion for the core maintainers, though.
Giving this some thought, I think we can have configuration fields that indicate whether the keys and/or payloads should be parsed as Raw Bytes, JSON, Arrow Flight, Avro, etc. The stretch goal here could be support for integration with a schema registry, but I haven't worked much with that. The only concern I have is with inconsistent schemas between messages in the same …
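For concreteness, here is a minimal sketch of what such a configuration surface could look like. The names (`PayloadFormat`, `KafkaReaderConfig`) are hypothetical and not part of this PR; they only illustrate the shape of the per-field format selection discussed above.

```rust
/// Hypothetical sketch (not part of this PR): how the key/payload
/// interpretation discussed above could be expressed as configuration.
#[derive(Debug, Clone)]
pub enum PayloadFormat {
    /// Leave the bytes untouched and expose them as a binary column.
    RawBytes,
    /// Parse each payload as a JSON document against an Arrow schema.
    Json,
    /// Decode Avro, optionally resolving schemas through a registry.
    Avro { schema_registry_url: Option<String> },
    /// Treat the payload as an Arrow IPC / Flight message.
    ArrowIpc,
}

/// Hypothetical reader configuration: which topics to subscribe to, how
/// many messages to collect per batch, and how to interpret keys/payloads.
#[derive(Debug, Clone)]
pub struct KafkaReaderConfig {
    pub bootstrap_servers: String,
    pub topics: Vec<String>,
    pub batch_size: usize,
    pub key_format: PayloadFormat,
    pub payload_format: PayloadFormat,
}

impl Default for KafkaReaderConfig {
    fn default() -> Self {
        Self {
            bootstrap_servers: "localhost:9092".to_string(),
            topics: vec![],
            batch_size: 1024,
            key_format: PayloadFormat::RawBytes,
            payload_format: PayloadFormat::RawBytes,
        }
    }
}
```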
You could address this by only allowing subscription to 1 topic. If you have multiple topics, you'd end up with different schemas for the … Converting …

More rabbit-hole kinds of ideas:
I definitely want to support subscribing to multiple topics; it's often the case that multiple topics share the same schema. My concern is that the full Schema may not be possible to infer from a single message, even within a single topic. It's possible we could have the user supply the full schema, but that would be cumbersome. I think I was planning to have a …

Now I'm thinking, though, that if the parsing happens after the …
Switching to … I'm not sure what the error on the …
Thanks a lot for this PR and for opening this discussion. I am also trying to understand the rationale :) :
Wrt the first point: we could use micro-batches, but doesn't that defeat the purpose of the Arrow format? The whole idea is based on the notion of data locality, low metadata footprint, and batch processing. All of these are shredded in a micro-batch architecture.

Wrt the second point: wouldn't it make sense to build a …

Wrt the third point: why add the complexity of Arrow if in the end the payload is an opaque binary? The user would still have to convert that payload to the Arrow format for compute (or is the idea to keep it opaque?), so IMO we are not really solving the problem: why would a user prefer to use a …

If anything, I would say that the architecture should be something like:
I.e. there should be a stream adapter (a chunk iterator) that maps rows to a batch, either in Arrow or in another format. One idea in this direction would be to reduce the scope of this PR to introducing a stream adapter that does exactly this: a generic struct that takes a stream of rows and returns a new stream of …

But maybe I am completely missing the point ^_^
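For concreteness, a minimal sketch of the kind of adapter described above, assuming nothing beyond the standard library; the names (`ChunkAdapter`, `build_batch`) are illustrative only, and the batch type could be an Arrow `RecordBatch` or anything else:

```rust
/// Illustrative sketch only: a generic chunking adapter that turns an
/// iterator of rows into an iterator of batches. The batch type `B` could
/// be an Arrow RecordBatch or any other columnar container.
pub struct ChunkAdapter<I, F> {
    rows: I,
    chunk_size: usize,
    build_batch: F,
}

impl<I, F> ChunkAdapter<I, F> {
    pub fn new(rows: I, chunk_size: usize, build_batch: F) -> Self {
        Self { rows, chunk_size, build_batch }
    }
}

impl<R, B, I, F> Iterator for ChunkAdapter<I, F>
where
    I: Iterator<Item = R>,
    F: FnMut(Vec<R>) -> B,
{
    type Item = B;

    fn next(&mut self) -> Option<B> {
        // Pull up to `chunk_size` rows, then hand them to the batch builder.
        let chunk: Vec<R> = self.rows.by_ref().take(self.chunk_size).collect();
        if chunk.is_empty() {
            None
        } else {
            Some((self.build_batch)(chunk))
        }
    }
}
```

With that shape, a Kafka consumer, a CSV line reader, or any other row source could be chunked into batches by the same adapter, and only `build_batch` would be format-specific.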
Hey @jorgecarleitao, thanks for the input.
I don't believe this is an accurate statement; while they are called "micro" batches, depending on the throughput of the topics you are consuming, they may contain tens of thousands of messages. Regardless, I think there is an argument for leveraging Arrow's compute functionality, and the ecosystem built around the format, even when processing smaller pieces of data.
To be clear, the blocking behavior is up to the user. They may set a …
According to the Kafka API, the payload is an opaque binary (which, I might add, is a supported data type in Arrow). As you mentioned, it is possible that the topic contains a custom encoding of data, and users should be able to access this raw data. Another case, however, is that the data is JSON or Avro (or Arrow Flight, as Andy suggested). As mentioned in the discussion above, I plan to allow the user to specify additional information when building the reader (schema, format) which will allow it to interpret the payload as JSON, etc. I wanted to keep this PR scoped to basic functionality for now. #8971 is another PR that I have opened to begin work on such functionality for JSON.
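As a minimal sketch of the raw-bytes case, the opaque payload can simply live in a `Binary` column alongside a few metadata columns. This uses arrow's standard array and schema APIs, but the column layout here is illustrative and not necessarily the exact layout in this PR:

```rust
use std::sync::Arc;

use arrow::array::{BinaryArray, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Sketch: pack raw Kafka payload bytes into a RecordBatch with an opaque
/// Binary column plus some metadata columns. The exact columns exposed by
/// the PR are not reproduced here.
fn batch_from_messages(
    topics: Vec<&str>,
    offsets: Vec<i64>,
    payloads: Vec<&[u8]>,
) -> Result<RecordBatch, ArrowError> {
    let schema = Schema::new(vec![
        Field::new("topic", DataType::Utf8, false),
        Field::new("offset", DataType::Int64, false),
        Field::new("payload", DataType::Binary, true),
    ]);

    RecordBatch::try_new(
        Arc::new(schema),
        vec![
            Arc::new(StringArray::from(topics)),
            Arc::new(Int64Array::from(offsets)),
            Arc::new(BinaryArray::from(payloads)),
        ],
    )
}
```

A downstream step (JSON, Avro, or Flight decoding) can then take the `payload` column and produce typed columns from it.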
You've mentioned Streams several times, and I'm glad that you brought it up. I purposefully avoided …
I believe that this is the first step to achieving the more complex features that you are describing. This PR:
In future work, we can:
Hopefully you can see how this PR fits into that process.
Hey @kflansburg, thanks for the clarifications and for explaining the rationale and use cases. I am convinced that there is a good case for using Arrow with Kafka and I would like to thank you for clarifying my view on this topic 👍
Maybe it is easier to illustrate this with an example: let's say that the payload for a given topic is JSON with the schema …
and then convert it to something like …
using a … If yes, note that creating a …

Which brings me to my initial question in item 3: IMO the core issue is not interoperability with … Wrt …
and return …

Does this make sense?
This is correct.
I agree that there is a potential optimization here by parsing at the message level, rather than forming a …
I will give this some thought for future work, though, because I think it becomes more critical when reading zero-copy formats.
I intended …
This all makes sense, and aligns with the roadmap I mentioned above, but I think it is out of scope for this PR. Think of this PR as implementing Rust's …
Thanks for the response. I get what you mean, and I agree with it. From my end, the only missing points are what @nevi-me mentioned wrt how we add adapters for different data producers to the cargo workspace, and the CI user story. One practical path is to exclude the crate from the workspace, like we do for the C data interface, and test it separately (e.g. on a different docker image). We also need to evaluate whether we plan to test this in integration or not (e.g. against a local kafka service). I will listen to the other maintainers (@andygrove, @alamb, @nevi-me) about this PR. Thank you for your patience and insight, @kflansburg
Hey @jorgecarleitao, sounds good. I do have some scripts that I've used for Dockerized integration testing locally if we decide to go that route. I personally would like to get this building on Windows and include it in the existing tests.
In terms of adding this crate to the arrow source tree, it is my personal opinion that it would be better off in a separate repo (with a dependency on Arrow) rather than in the main arrow repo:
- The fraction of people that would use `arrow` and not `kafka` would be large, and so the maintenance / CI burden of putting it in tree would outweigh the benefit of including it.
- It is not part of the arrow spec or project
I think a similar argument could be applied to `datafusion`, and to a lesser degree the `parquet` and `json` writers. The conclusion to "should it be in the arrow repo" comes down to a subjective judgement of value vs cost.
Thus, I am not opposed to including this crate either, if we have sufficient interest from the maintainers. I am thinking of things like "make an integration test" for CI.
I made a diagram to try and show my thinking in this area (and grouping crates together with similar levels complexity vs specialization):
▲
│
│
│
│
│
┌───────────────┐ ┌───────────────┐ ┌────────────────┐ ┌─────────────┐ │
│ DataFusion │ │ arrow-flight │ │ kafka reader │ │ sql-reader │ ... │
└───────────────┘ └───────────────┘ └────────────────┘ └─────────────┘ │
│
│
│
│
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ │
│ Parquet Reader │ │ Flight IPC │ │ JSON Reader/writer │ │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘ │
│
│
│
┌────────────────────────────────────────────────────────┐ │
│ │ │
│ Arrow Columnar Format │ │
│ + Compute Kernels │ │
│ │ │
└────────────────────────────────────────────────────────┘
Increasing
Specialization
Thank you for this PR @kflansburg
+1 on @alamb's diagram. I think what would be interesting to extend in the future, for transport-protocol-based readers like Kafka, is to make them composable with serialization-based readers like JSON, Parquet, and CSV. I.e. as an end user, I can configure the Kafka reader to read JSON from a Kafka stream into Arrow record batches. The JSON reader's parsing logic has now been decoupled from IO, so this should be pretty straightforward to implement.
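A sketch of that composition, with purely hypothetical trait and function names (none of these exist in arrow today): the transport layer yields raw payload bytes, and a pluggable decoder turns chunks of payloads into record batches.

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical sketch of the composition described above: the transport
/// layer (Kafka, file, socket, ...) only yields raw payload bytes, and a
/// serialization decoder (JSON, Avro, CSV, ...) turns a chunk of payloads
/// into a RecordBatch.
pub trait PayloadDecoder {
    fn decode(&mut self, payloads: &[Vec<u8>]) -> Result<RecordBatch, ArrowError>;
}

/// Generic driver: pull raw messages from any byte-producing iterator
/// (e.g. a Kafka consumer wrapper) and hand fixed-size chunks to the decoder.
pub fn read_batches<I, D>(
    mut messages: I,
    decoder: &mut D,
    batch_size: usize,
) -> Result<Vec<RecordBatch>, ArrowError>
where
    I: Iterator<Item = Vec<u8>>,
    D: PayloadDecoder,
{
    let mut batches = Vec::new();
    loop {
        let chunk: Vec<Vec<u8>> = messages.by_ref().take(batch_size).collect();
        if chunk.is_empty() {
            break;
        }
        batches.push(decoder.decode(&chunk)?);
    }
    Ok(batches)
}
```

A JSON implementation of `PayloadDecoder` could delegate to arrow's JSON reader, while the Kafka side never needs to know which format the payload is in.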
What is the status of this PR? As part of trying to clean up the backlog of Rust PRs in this repo, I am going through seemingly stale PRs and pinging the authors to see if there are any plans to continue the work or conversation. |
Seems like a lot of discussion has to happen before a feature like this can be considered.
Introduce a basic Kafka reader based on `rdkafka`. Exposes an `Iterator` interface which yields `Result<RecordBatch>`. Columns in the batch are: …

Note that `rdkafka` has a C++ dependency (`librdkafka`), but we can choose to make this dynamically linked. `rdkafka` provides an `async` Consumer, but I have explicitly chosen the non-`async` Consumer.
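For orientation, a stripped-down sketch of what a non-async reader like this does: poll rdkafka's `BaseConsumer` until a batch's worth of messages (or a timeout) and collect the raw payloads, which would then be turned into a `RecordBatch` (for example as in the earlier `BinaryArray` sketch). This is not the PR's code; the batch size, timeout, and error handling are assumptions.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::message::Message;

/// Sketch only (not the PR's implementation): poll a non-async BaseConsumer
/// and collect up to `batch_size` raw payloads. A real reader would build a
/// RecordBatch from these and surface Kafka errors instead of skipping them.
fn poll_payloads(consumer: &BaseConsumer, batch_size: usize, timeout: Duration) -> Vec<Vec<u8>> {
    let mut payloads = Vec::with_capacity(batch_size);
    while payloads.len() < batch_size {
        match consumer.poll(timeout) {
            // A message arrived; keep its payload bytes (empty if none).
            Some(Ok(msg)) => payloads.push(msg.payload().unwrap_or(&[]).to_vec()),
            // A Kafka error on this poll; skip it in this sketch.
            Some(Err(_e)) => continue,
            // Timed out with no message; stop and yield a partial batch.
            None => break,
        }
    }
    payloads
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "arrow-kafka-example")
        .create()?;
    consumer.subscribe(&["example-topic"])?;

    let payloads = poll_payloads(&consumer, 1024, Duration::from_millis(500));
    println!("received {} messages", payloads.len());
    Ok(())
}
```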