
Wanted: topic management, declarative #101

Open

solsson opened this issue Nov 27, 2017 · 10 comments

Comments

@solsson
Contributor

solsson commented Nov 27, 2017

We're a small team with a small set of Kafka topics and clients, but we already need a self-documenting way to create and update topics.

Use cases:

  • Ensure that a topic exists in minikube, or docker-compose, during development.
  • Ensure that a topic exists in production during service provisioning in Kubernetes.
  • Edit retention.ms.
  • Know whether the topic was created with key and/or value schemas in mind.

Using kafka-topics.sh or the Java API is ok for testing, but production topics, for instance, we want to create with 1 replica during dev and >=3 in a real cluster.

An option is to use Helm for topic creation manifests. I've searched for best practices, like tooling other organizations use, but without luck. I imagine most teams will want to document their Kafka topics beyond a naming convention.
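
Roughly what we'd like to script, sketched with the Java AdminClient; the topic name, partition count and the REPLICATION_FACTOR environment variable are made up for illustration:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class EnsureTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    // Hypothetical: 1 in minikube/docker-compose, >=3 in a real cluster
    short replication = Short.parseShort(
        System.getenv().getOrDefault("REPLICATION_FACTOR", "1"));
    try (AdminClient admin = AdminClient.create(props)) {
      // "orders-1" and 12 partitions are placeholders
      NewTopic topic = new NewTopic("orders-1", 12, replication)
          .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
      Set<String> existing = admin.listTopics().names().get();
      if (!existing.contains(topic.name())) {
        admin.createTopics(Set.of(topic)).all().get();
      }
    }
  }
}
```

The self-documenting part would be keeping that topic list in the repo, next to the manifests of the services that use it.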

@BenjaminDavison

We are thinking about using terraform for this: https://github.com/packetloop/terraform-provider-kafka

@DanielFerreiraJorge

@solsson
Contributor Author

solsson commented Dec 11, 2017

My current best bet is to rely on auto.create.topics.enable=true + default.replication.factor and a naming convention (some ideas here).

I found that during stream processing it's quite a legitimate use case to want to add new topics dynamically. For example, in log processing you could split on Kubernetes namespace, and namespaces may pop up while your streams app is running.
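
For that log-splitting case, a sketch of how dynamic routing could look with Kafka Streams' TopicNameExtractor (available in newer Streams versions); the source topic and the namespace extraction are made up, and the per-namespace target topics are what would rely on auto creation:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class NamespaceRouter {

  public static StreamsBuilder topology() {
    StreamsBuilder builder = new StreamsBuilder();
    // "logs-raw" is a hypothetical source topic of JSON log lines
    KStream<String, String> logs = builder.stream("logs-raw");
    // Route each record to a per-namespace topic; the topic may not exist yet,
    // which is where auto topic creation (or an admin step) comes in
    logs.to((key, value, recordContext) -> "logs-" + namespaceOf(value));
    return builder;
  }

  // Hypothetical helper: pull the Kubernetes namespace out of the log record.
  // Real parsing would use a JSON lib; this is just a placeholder.
  static String namespaceOf(String value) {
    return value.contains("\"namespace\":\"kube-system\"") ? "kube-system" : "default";
  }
}
```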

@solsson
Contributor Author

solsson commented Dec 13, 2017

I've tried to summarize my confusion by closing various open investigations. What surprised me the most was the insight(?) that Schema Registry doesn't help.

At Yolean we store all our persistent data in Kafka. Some of it has the integrity requirements you'd normally entrust to a relational database. If validation is supported at the API level when producing, we'll make design choices based on the type of data (rather than time constraints / laziness).

Now I suspect we won't get that far without in-house libs - a maintenance effort we can't depend on, given the rapid evolution of Kafka APIs.

Validation can still be simple enough, if schemas are available as files. We already use json-schema in some core producers using Node.js. Files have the added benefit that they get the same versioning and release cycle as the services they share repository with.

We do need a little bit of tooling to copy schema files, be they Avro or JSON or Protobuf, to the source folders of the services that depend on them. Due to Docker build contexts we can't pull them from a parent folder.

We still don't enforce anything, but luckily there's a compelling alternative: monitoring (explored in #49 and #93). Consumer services can export counters on the number of messages they had to drop due to invalidity, or even if they don't we can look at generic failure rates. We can catch errors that schemas can't: the other day we had misbehaving client-side code that increased the messages/second to a topic by a factor of 100.

We can also write dedicated validation services, whose sole purpose is to expose metrics. This is where naming conventions might matter. Let's say that topic names ending with a dash followed by digits indicate a version number. It'd be quite easy to spot, based on a topic listing and the bytes or message rates reported by broker JMX, that we have some service producing to an old version of a topic.
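
That "old version of a topic" check could be as small as this sketch; the dash-plus-digits suffix is just our convention, and the broker JMX rate lookup is left out:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class TopicVersionCheck {
  // Our convention: topic names end with a dash followed by a version number
  static final Pattern VERSIONED = Pattern.compile("^(.+)-(\\d+)$");

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      Set<String> topics = admin.listTopics().names().get();
      // Find the highest version seen for each base name
      Map<String, Integer> latest = new TreeMap<>();
      for (String topic : topics) {
        Matcher m = VERSIONED.matcher(topic);
        if (m.matches()) {
          latest.merge(m.group(1), Integer.parseInt(m.group(2)), Math::max);
        }
      }
      // Flag topics that trail behind; cross-check with bytes/messages rates from JMX
      for (String topic : topics) {
        Matcher m = VERSIONED.matcher(topic);
        if (m.matches() && Integer.parseInt(m.group(2)) < latest.get(m.group(1))) {
          System.out.println("Old generation still around: " + topic);
        }
      }
    }
  }
}
```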

@solsson
Contributor Author

solsson commented Jan 8, 2018

We're now phasing in auto topic creation (#107) in production. The gotchas that made me disable it from the start are there for sure -- run kafkacat with some typo in -t and a new topic gets created. Who would configure their database to create a table on INSERT INTO xyz?

However, Kafka Streams in essence requires auto creation. We just have to live with the gotchas.

Regarding topics + schema management, with the arguments I tried to summarize above, I ended up writing a Gradle build that generates Java POJO typed Serde impls from json-schema files. We can also use these schema files directly from libs like ajv in Node.

To do this we had to establish a list of topics, or actually topic name regexes, where we configure the model (i.e. schema) that is used for values. With this list we can unit test the whole thing: "deserialize this byte stream for topic x" etc, which is what I argued for earlier.

Regarding topic management, with or without auto create, we could compare that list to existing topics in kafka, which together with bytes in/out rates could indicate if there are bogus topics lying around or we're writing to unexpected topics.
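
The list is essentially a map from topic name regex to the value model, roughly like this sketch; the patterns and Serdes here are placeholders for what our Gradle build generates from the json-schema files:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

public class TopicModels {
  // Placeholder Serdes; in our build these are generated POJO-typed Serde impls
  static final Map<Pattern, Serde<?>> VALUE_SERDES = new LinkedHashMap<>();
  static {
    VALUE_SERDES.put(Pattern.compile("^orders-\\d+$"), Serdes.String());
    VALUE_SERDES.put(Pattern.compile("^build-events$"), Serdes.ByteArray());
  }

  /** "Deserialize this byte stream for topic x": pick the Serde by topic name. */
  static Serde<?> serdeFor(String topic) {
    return VALUE_SERDES.entrySet().stream()
        .filter(e -> e.getKey().matcher(topic).matches())
        .map(Map.Entry::getValue)
        .findFirst()
        .orElseThrow(() ->
            new IllegalArgumentException("No model configured for topic " + topic));
  }
}
```

The same map is what we could diff against an AdminClient topic listing to spot bogus or unexpected topics.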

@solsson
Contributor Author

solsson commented Jan 18, 2018

Good points here: https://www.confluent.io/blog/put-several-event-types-kafka-topic/. I agree with a lot of what Kleppmann says, and conclude that my design above with one schema per topic is too restrictive. Schema evolution support basically only caters for new versions adding new fields. A single version may need to produce using multiple schemas to the same topic, due to ordering requirements.

I also happened to read http://www.dwmkerr.com/the-death-of-microservice-madness-in-2018/ today, and it nicely captures some versioning aspects of using topics as API contracts between services. As that post puts it:

"When a microservice system uses message queues for intra-service communication, you essentially have a large database (the message queue or broker) glueing the services together. Again, although it might not seem like a challenge at first, schema will come back to bite you."

I wanted to have as much of this complexity as possible known at build time, rather than at provisioning/orchestration time (Kube manifests) or runtime (schema registry etc). Am I mistaken here? As a consumer, with Kleppmann's patch to Confluent's Avro serdes, you'll basically branch on the deserialized type and do your processing from there. This could be used to try to support different versions of the upstream service's messages, without schema evolution. I guess that temptation should be resisted. At Yolean we've tried topic generation numbers for that, with mixed results.
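
By branching on the deserialized type I mean roughly this; the event classes are stand-ins for whatever the schemas would generate:

```java
public class EventDispatch {
  // Stand-ins for classes generated from the schemas (Avro or json-schema)
  static class CustomerCreated {}
  static class CustomerAddressChanged {}

  long unknownEvents = 0;

  // One consumer loop, several event types in the same topic:
  // branch on the deserialized type and process from there.
  void handle(Object event) {
    if (event instanceof CustomerCreated) {
      onCreated((CustomerCreated) event);
    } else if (event instanceof CustomerAddressChanged) {
      onAddressChanged((CustomerAddressChanged) event);
    } else {
      // Unknown type: count and drop, per the monitoring approach above
      unknownEvents++;
    }
  }

  void onCreated(CustomerCreated e) { /* ... */ }
  void onAddressChanged(CustomerAddressChanged e) { /* ... */ }
}
```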

Kleppmann writes:

If you are using a data encoding such as JSON, without a statically defined schema, you can easily put many different event types in the same topic. However, if you are using a schema-based encoding such as Avro, a bit more thought is needed to handle multiple event types in a single topic.

That you can deserialize JSON without a schema doesn't mean you always should. For domain entities or "facts" - not operational data - I'd like all our records to have schemas. If Schema Registry is evolving, it's a pity confluentinc/schema-registry#220 gets so little attention.

@solsson
Contributor Author

solsson commented Feb 9, 2018

Regarding naming conventions for topics, just now I spotted the config property create.topic.policy.class.name. https://cwiki.apache.org/confluence/display/KAFKA/KIP-108%3A+Create+Topic+Policy looks useful.
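
If it works the way KIP-108 describes, a naming-convention policy could be as small as this sketch; the regex is just our dash-plus-version convention, and the class would go on the broker classpath and into create.topic.policy.class.name:

```java
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

public class NamingConventionPolicy implements CreateTopicPolicy {

  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void validate(RequestMetadata requestMetadata) throws PolicyViolationException {
    String topic = requestMetadata.topic();
    // Our convention (sketch): lowercase name, dash, version number
    if (!topic.matches("^[a-z0-9.-]+-\\d+$")) {
      throw new PolicyViolationException(
          "Topic name " + topic + " does not match <name>-<version>");
    }
  }

  @Override
  public void close() {}
}
```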

@joewood

joewood commented Feb 14, 2018

Randomly reading through this issue, I noticed you mentioned auto.create.topics.enable=true is required for streams. I don't believe this is the case, as Kafka Streams uses the admin messages for topic creation and not the metadata request message (which is what this config relates to). Streams should create the derived topics based on the same configuration as the source topic (number of partitions etc.), with compaction for store-backing topics.

@solsson
Contributor Author

solsson commented Feb 14, 2018

@joewood Thanks a lot for correcting my misunderstanding here. We (Yolean) must have drawn some premature conclusions when evaluating streams.

Actually my disappointment summarized in #101 (comment) remains unchanged, and with your insight there's really no argument left for auto.create.topics.enable=true IMO, except that it is the default. I'd really like to do #148.

You don't happen to have experience with Create Topic Policy?

@solsson
Contributor Author

solsson commented Jul 28, 2018

Reading about the new Knative - apparently awesome enough to need another landing page :) - I think CloudEvents could match what I've been looking for. It seems to go further than mapping a schema to topic messages, as it discusses feeds, event types and their relation to an "action". Types are managed in Kubernetes using CRDs.

They say that "A producer can generate events before a consumer is listening, and a consumer can express an interest in an event or class of events that is not yet being produced." That's in contrast to auto.create.topics.

Knative supports a "user-provided" Kafka event "Bus".

There's an event source for Kubernetes events, interesting as an alternative to #186 and #145.

I haven't tested any of the above yet. I'll be particularly interested in how it relates to Kleppmann's post that I referred to earlier. My first goal would be to see if I could implement event sources for our git hosting (Gogs, for which we already publish events to Kafka using webhooks + #183) and our Registry, which didn't work with Pixy out of the box.

solsson added a commit that referenced this issue Nov 28, 2018
so we want to undo #107.
It was partially based on a false assumption, as pointed out in
#101 (comment)

Topics are created not only on produce but also by, for example, kafkacat -C.
Typos cost us more time than it would take to automate topic creation,
or to run ./bin/kafka-topics.sh in a temporary pod where we haven't automated.