Wanted: topic management, declarative #101
Comments
We are thinking about using Terraform for this: https://github.com/packetloop/terraform-provider-kafka
Maybe this: https://github.com/nbogojevic/kafka-operator
My current best bet is to rely on […]. I found that during stream processing it's quite a legitimate use case to want to add new topics dynamically. For example, in log processing you could split on Kubernetes namespace, and new namespaces may pop up while your Streams app is running.
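To make that concrete, here's a minimal Kafka Streams sketch (assuming Kafka 2.0+, a made-up `logs` input topic, records keyed by namespace, and the `bootstrap.kafka:9092` address - all illustrative, not our actual setup) of routing records to per-namespace topics. The target topics are not known at build or deploy time, which is exactly where auto creation comes back into the picture:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class NamespaceRouter {
  public static void main(String[] args) {
    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "namespace-router");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092");
    config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> logs = builder.stream("logs");

    // Route each record to a topic derived from its key (here: the namespace).
    // These topics appear while the app is running, so they must either
    // pre-exist or be auto-created.
    logs.to((key, value, recordContext) -> "logs-" + key);

    new KafkaStreams(builder.build(), config).start();
  }
}
```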
I've tried to summarize my confusion by closing various open investigations. What surprised me the most was the insight(?) that Schema Registry doesn't help.

At Yolean we store all our persistent data in Kafka. Some of it has the integrity requirements you'd normally entrust to a relational database. If validation is supported at the API level when producing, we'll make design choices based on the type of data (rather than time constraints / laziness). Now I suspect we won't get that far without in-house libs - a maintenance effort we can't sustain, given the rapid evolution of Kafka APIs.

Validation can still be simple enough, if schemas are available as files. We already use json-schema in some core producers using Node.js. Files have the added benefit that they get the same versioning and release cycle as the services they share a repository with. We do need a little bit of tooling to copy schema files, be they Avro or JSON or Protobuf, to the source folders of the services that depend on them. Due to Docker build contexts we can't pull them from a parent folder.

We still don't enforce anything, but luckily there's a compelling alternative: monitoring (explored in #49 and #93). Consumer services can export counters on the number of messages they had to drop due to invalidity, or even if they don't we can look at generic failure rates. We can catch errors that schemas can't: the other day we had misbehaving client-side code that increased the messages/second to a topic by a factor of 100.

We can also write dedicated validation services, whose sole purpose is to expose metrics. This is where naming conventions might matter. Let's say that topic names ending with a dash followed by digits indicate a version number. It'd be quite easy to spot, based on a topic listing and the bytes or message rates reported by broker JMX, that some service is producing to an old version of a topic.
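A rough sketch of the monitoring idea, assuming the Prometheus Java client and the everit-org json-schema validator as stand-ins (topic name, group id, schema path and bootstrap address are all made up): a consumer validates each record against a schema file shipped with the service and exports a counter for the records it has to drop.

```java
import java.io.FileInputStream;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import io.prometheus.client.Counter;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.JSONTokener;

public class ValidatingConsumer {

  // Exposed via the usual Prometheus HTTP endpoint elsewhere in the service.
  static final Counter dropped = Counter.build()
      .name("records_dropped_invalid_total")
      .help("Records dropped because they failed json-schema validation")
      .labelNames("topic")
      .register();

  public static void main(String[] args) throws Exception {
    // Schema file lives in the repo, so it gets the same versioning as the code.
    Schema schema = SchemaLoader.load(
        new JSONObject(new JSONTokener(new FileInputStream("schemas/order.json"))));

    Properties props = new Properties();
    props.put("bootstrap.servers", "bootstrap.kafka:9092");
    props.put("group.id", "order-validator");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("orders"));
      while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
          try {
            schema.validate(new JSONObject(record.value()));
            // ... normal processing ...
          } catch (ValidationException | JSONException e) {
            // Count, don't crash: the metric is the enforcement.
            dropped.labels(record.topic()).inc();
          }
        }
      }
    }
  }
}
```

The same pattern works for a dedicated validation service whose only job is to consume and expose these metrics.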
We're now phasing in […]. However, Kafka Streams in essence requires auto creation. We just have to live with the gotchas.

Regarding topics + schema management, with the arguments I tried to summarize above, I ended up writing a Gradle build that generates Java POJO typed Serde impls from json-schema files. We can also use these schema files directly from libs like ajv in Node. To do this we had to establish a list of topics, or actually topic name regexes, where we configure the model (i.e. schema) that is used for values. With this list we can unit test the whole thing: "deserialize this byte stream for topic x" etc., which is what I argued for earlier.

Regarding topic management, with or without auto create, we could compare that list to existing topics in Kafka, which together with bytes in/out rates could indicate whether there are bogus topics lying around or we're writing to unexpected topics.
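A rough sketch of what such a list could look like in code (class and method names are illustrative, not the actual generated Serdes):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serde;

/**
 * Central list of topic name regexes and the value serde (i.e. schema/model)
 * each maps to. Because the list is plain code it can be unit tested
 * ("deserialize this byte stream for topic x") without a running cluster.
 */
public class TopicSerdes {

  private final Map<Pattern, Serde<?>> serdesByTopicPattern = new LinkedHashMap<>();

  public TopicSerdes register(String topicRegex, Serde<?> valueSerde) {
    serdesByTopicPattern.put(Pattern.compile(topicRegex), valueSerde);
    return this;
  }

  /** First matching pattern wins; an unknown topic is a configuration error. */
  public Serde<?> valueSerdeFor(String topic) {
    return serdesByTopicPattern.entrySet().stream()
        .filter(e -> e.getKey().matcher(topic).matches())
        .map(Map.Entry::getValue)
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("No serde registered for topic " + topic));
  }

  /** The same list can be diffed against the broker's actual topics to spot strays. */
  public Iterable<Pattern> knownTopicPatterns() {
    return serdesByTopicPattern.keySet();
  }
}
```

A test can then assert that `valueSerdeFor("orders-2").deserializer()` decodes a recorded byte stream, and a small job could compare `knownTopicPatterns()` with the broker's topic listing.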
Good points here: https://www.confluent.io/blog/put-several-event-types-kafka-topic/. I agree with a lot of what Kleppmann says, and conclude that my design above with one schema per topic is too restrictive. Schema evolution support basically only caters for new versions adding new fields. A single service version may need to produce using multiple schemas to the same topic, due to ordering requirements. I also happened to read http://www.dwmkerr.com/the-death-of-microservice-madness-in-2018/ today, and it nicely captures some versioning aspects of using topics as API contracts between services. As that post puts it: […]
I wanted to have as much of this complexity as possible known at build time, rather than at provisioning/orchestration time (Kube manifests) or runtime (Schema Registry etc.). Am I mistaken here? As a consumer, with Kleppmann's patch to Confluent's Avro serdes, you'll basically […]. Kleppmann writes: […]
That you can deserialize JSON without a schema doesn't mean you always should. For domain entities or "facts" - not operational data - I'd like all our records to have schemas. If Schema Registry is evolving, it's a pity confluentinc/schema-registry#220 gets so little attention.
Regarding naming conventions for topics, I just now spotted the config property […]
Randomly reading through this issue, I noticed you mentioned […]
@joewood Thanks a lot for correcting my misunderstanding here. We (Yolean) must have drawn some premature conclusions when evaluating Streams. Actually my disappointment summarized in #101 (comment) remains unchanged, and with your insight there's really no argument left for […]. You don't happen to have experience with Create Topic Policy?
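For reference, Create Topic Policy (KIP-108, Kafka 0.10.2+) is a broker-side hook configured with `create.topic.policy.class.name`. A minimal sketch - the regex is just an illustration of the naming convention idea above, not a settled convention - could look like:

```java
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

/**
 * Loaded by the broker via create.topic.policy.class.name=...TopicNamePolicy
 * (the jar must be on the broker classpath). Rejected topics fail creation,
 * whether they come from kafka-topics.sh, AdminClient or auto creation.
 */
public class TopicNamePolicy implements CreateTopicPolicy {

  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void validate(RequestMetadata requestMetadata) throws PolicyViolationException {
    // Example convention: lowercase name, optionally ending in dash + version digits.
    if (!requestMetadata.topic().matches("[a-z0-9.]+(-[0-9]+)?")) {
      throw new PolicyViolationException(
          "Topic name " + requestMetadata.topic() + " does not follow the naming convention");
    }
  }

  @Override
  public void close() {}
}
```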
Reading about the new Knative - apparently awesome enough to need another landing page :) - it could be that CloudEvents matches what I've been looking for. It seems to go further than mapping a schema to topic messages, as it discusses feeds and event types and their relation to an "action". Types are managed in Kubernetes using CRDs. They say that "A producer can generate events before a consumer is listening, and a consumer can express an interest in an event or class of events that is not yet being produced." That's in contrast to auto.create.topics.

Knative supports a "user-provided" Kafka event "Bus". There's an event source for Kubernetes events, interesting as an alternative to #186 #145. I haven't tested any of the above yet. I'll be particularly interested in how it relates to Kleppmann's […] that I referred to earlier. My first goal would be to see if I could implement event sources for our git hosting (Gogs, for which we already publish events to Kafka using webhooks + #183) and Registry, which didn't work with Pixy out of the box.
So we want to undo #107. It was partially based on a false assumption, as pointed out in #101 (comment): topics are created not only at produce but also by, for example, kafkacat -C. Typos cost us more time than it would take to automate topic creation; when we haven't automated, we run ./bin/kafka-topics.sh in a temporary pod.
We're a small team with a small set of kafka topics and clients, but already we need a self-documenting way to create and update topics.
Use cases:
- Set per-topic configuration like retention.ms.
- Using kafka-topics.sh or the Java API is OK for testing, but for production topics we want to create them with 1 replica during dev and >=3 in a real cluster (see the sketch below).
An option is to use Helm for topic creation manifests. I've searched for best practices, like tooling other organizations use, but with no luck. I imagine most teams will want to document their Kafka topics beyond a naming convention.
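Not established tooling either, but a minimal sketch of the Java API route, assuming Kafka's AdminClient (0.11+); topic names, partition counts and the bootstrap address are made-up examples. The replication factor is taken from the environment so dev can run 1 and a real cluster >=3, and the same jar could be run in a temporary pod as mentioned above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.TopicExistsException;

public class DeclaredTopics {

  public static void main(String[] args) throws Exception {
    short replicas =
        Short.parseShort(System.getenv().getOrDefault("TOPIC_REPLICATION_FACTOR", "1"));

    // The self-documenting list: name, partitions, per-topic config.
    List<NewTopic> topics = Arrays.asList(
        new NewTopic("orders", 12, replicas)
            .configs(Map.of("cleanup.policy", "compact")),
        new NewTopic("web-events", 12, replicas)
            .configs(Map.of("retention.ms", String.valueOf(7 * 24 * 3600 * 1000L))));

    Properties props = new Properties();
    props.put("bootstrap.servers", "bootstrap.kafka:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      for (NewTopic topic : topics) {
        try {
          admin.createTopics(Arrays.asList(topic)).all().get();
          System.out.println("Created " + topic.name());
        } catch (ExecutionException e) {
          if (e.getCause() instanceof TopicExistsException) {
            // Idempotent: already-created topics are left alone.
            System.out.println(topic.name() + " already exists, skipping");
          } else {
            throw e;
          }
        }
      }
    }
  }
}
```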