## Messages and Batches

- The unit of data within Kafka is called a _message_
    - similar to a _row_ or a _record_ in the database world
    - simply an array of bytes as far as Kafka is concerned
- A message can have an optional bit of metadata, which is referred to as a _key_
    - key is also a byte array
    - keys are used when messages are to be written to partitions in a more controlled manner
- For efficiency, messages are written into Kafka in batches. A _batch_ is just a collection of messages, all of which are being produced to the same topic and partition.
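
As a rough illustration of the batching rule above, the following sketch groups pending messages by topic and partition. The `Message` type and `group_into_batches` helper are hypothetical names used only for this example, not part of any Kafka client; real producers also bound batches by size and time, which is not modeled here.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Message(NamedTuple):
    topic: str
    partition: int
    key: Optional[bytes]  # optional metadata, also just bytes
    value: bytes          # opaque payload as far as Kafka is concerned


def group_into_batches(messages):
    """Group messages so each batch holds a single (topic, partition) pair."""
    batches = defaultdict(list)
    for msg in messages:
        batches[(msg.topic, msg.partition)].append(msg)
    return dict(batches)


pending = [
    Message("orders", 0, b"user-1", b"created"),
    Message("orders", 1, None, b"ping"),
    Message("orders", 0, b"user-1", b"paid"),
]
batches = group_into_batches(pending)
```

Here the two messages bound for partition 0 of `orders` end up in one batch, and the keyless message for partition 1 in another.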

    


## Schemas

- Messages are just opaque byte arrays to Kafka, but it is recommended that additional structure, a _schema_, be imposed on the message content
- Apache Avro, a serialization framework originally developed for Hadoop, is preferred
    - Avro provides a compact serialization format
    - Schemas are separated from the message payloads and do not require code to be generated when they change
    - Strong data typing and schema evolution
    - Backward and forward compatibility
- Consistent data format allows the writing and the reading of messages to be decoupled
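
To make backward and forward compatibility concrete, here is a deliberately simplified sketch using only the standard library. The schemas are written in Avro's JSON style, but the `User` record, its fields, and the `read_with_schema` helper are assumptions made for illustration; the helper imitates the idea of schema resolution and is not Avro's actual algorithm or API.

```python
import json

# Hypothetical Avro-style schemas; record and field names are made up.
SCHEMA_V1 = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"}]}
""")
SCHEMA_V2 = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": "string", "default": ""}]}
""")


def read_with_schema(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Missing fields take the reader's default; unknown fields are dropped.
    """
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result


old_record = {"id": 1, "name": "Ada"}                      # written with v1
new_record = {"id": 2, "name": "Bo", "email": "bo@x.org"}  # written with v2

upgraded = read_with_schema(old_record, SCHEMA_V2)    # new reader, old data
downgraded = read_with_schema(new_record, SCHEMA_V1)  # old reader, new data
```

Because the added `email` field carries a default, a reader on the new schema can still consume old messages, and a reader on the old schema simply ignores the field it does not know. This is what lets producers and consumers be upgraded independently.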


## Topics and Partitions

- Messages in Kafka are categorized into _topics_
    - Analogous to a database table
- Topics are additionally broken down into a number of _partitions_
- Messages are written to partitions in an append-only fashion, and are read in order from beginning to end
- There is no guarantee of message time-ordering across the entire topic, just within a single partition
- Partitions are also the way that Kafka provides redundancy and scalability
    - Each partition can be hosted on a different server
    - A single topic can be scaled horizontally across multiple servers
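
The append-only behaviour can be sketched with a toy in-memory model. The `Partition` and `Topic` classes below are hypothetical and exist only for this example; they model a partition as a list whose index doubles as the message offset.

```python
class Partition:
    """An append-only log; the list index doubles as the offset."""

    def __init__(self):
        self._log = []

    def append(self, value):
        self._log.append(value)
        return len(self._log) - 1  # offset assigned to the new message

    def read_from(self, offset=0):
        """Return messages in order, starting at the given offset."""
        return self._log[offset:]


class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]


topic = Topic("clicks", num_partitions=2)
topic.partitions[0].append(b"a")
topic.partitions[1].append(b"b")
offset = topic.partitions[0].append(b"c")
```

Reading partition 0 always yields `a` before `c`, but nothing in the model says whether `b` was written before or after `c`, which mirrors the lack of ordering across a whole topic.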

<img src="img/Snip20200423_1.png" width=80%/>


## Producers and Consumers

- Producers and consumers are the two basic types of Kafka clients
- There are also advanced client APIs: Kafka Connect API for data integration and Kafka Streams for stream processing
- Producers create new messages
    - In general, a message will be produced to a specific topic
    - Messages can be directed to specific partitions, which is typically done using the message key and a partitioner
- Consumers read messages
    - The consumer subscribes to one or more topics and reads the messages in the order in which they were produced to each partition
    - The consumer keeps track of which messages it has already consumed by keeping track of the offset of messages
    - The _offset_ is an integer value that continually increases, and it is added to each message by Kafka as metadata as it is produced
    - The consumer stores the offset of the last consumed message for each partition either in ZooKeeper or in Kafka itself
- Consumers work as part of a _consumer group_, which is one or more consumers that work together to consume a topic
    - The group ensures that each partition is only consumed by one member
    - The mapping of a consumer to a partition is often called _ownership_ of the partition by the consumer
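
The pieces above (key-based partitioning, partition ownership within a group, and offset tracking) can be sketched as follows. Everything here is a simplified assumption: `choose_partition` uses CRC32 as a stand-in hash rather than the hash a real Kafka partitioner uses, `assign_partitions` is a plain round-robin rather than an actual group assignment strategy, and the `Consumer` class stores the next offset to read instead of talking to a broker.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land on the same one."""
    return zlib.crc32(key) % num_partitions


def assign_partitions(consumers, num_partitions):
    """Round-robin ownership: every partition gets exactly one owner."""
    ownership = {c: [] for c in consumers}
    for p in range(num_partitions):
        ownership[consumers[p % len(consumers)]].append(p)
    return ownership


class Consumer:
    def __init__(self, name):
        self.name = name
        self.offsets = {}  # partition id -> next offset to read

    def poll(self, partition_id, log):
        """Return unread messages from one partition and advance the offset."""
        start = self.offsets.get(partition_id, 0)
        messages = log[start:]
        self.offsets[partition_id] = start + len(messages)
        return messages


ownership = assign_partitions(["c1", "c2"], num_partitions=4)

log = [b"m0", b"m1", b"m2"]  # stand-in for partition 0
c1 = Consumer("c1")
first = c1.poll(0, log)      # reads everything available so far
log.append(b"m3")
second = c1.poll(0, log)     # resumes after the stored offset
```

Since the consumer remembers its position per partition, it can stop and later resume without rereading messages, and since each partition has a single owner, no two members of the group process the same message.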

<img src="img/Snip20200423_2.png" width=80%/>
    

## Brokers and Clusters

- A single Kafka server is called a _broker_
- The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk
- The broker also services consumers, responding to fetch requests for partitions with the messages that have been committed to disk
- Kafka brokers are designed to operate as part of a _cluster_
    - Within a cluster of brokers, one broker will function as the cluster _controller_ (elected automatically from the live members of the cluster)
    - The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures
- A partition is owned by a single broker in the cluster, and that broker is called the _leader_ of the partition
    - A partition may be assigned to multiple brokers, in which case it is replicated; this provides redundancy for the messages in the partition
    - However, all consumers and producers operating on that partition must connect to the leader
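
A minimal sketch of how partitions might be laid out across brokers, assuming a simple round-robin placement in which the first replica acts as leader. The `assign_replicas` function and the broker names are hypothetical; the placement logic a real controller uses is more involved.

```python
def assign_replicas(brokers, num_partitions, replication_factor):
    """Spread replicas across brokers; the first replica listed is the leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed broker count")
    assignment = {}
    for p in range(num_partitions):
        replicas = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment


layout = assign_replicas(["broker-1", "broker-2", "broker-3"], 3, 2)
```

With three brokers and a replication factor of 2, each broker leads one partition and follows another, so losing any single broker still leaves a copy of every partition.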

<img src="img/Snip20200423_3.png" width=80%/>

- Kafka provides the _retention_ feature, which is the durable storage of messages for some period of time
    - Kafka brokers can be configured to retain messages for some period of time, or until the topic reaches a certain size in bytes
    - Once these limits are reached, messages are expired and deleted
    - Individual topics can also be configured with their own retention settings
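
The two retention limits can be illustrated with a small sketch. `apply_retention` is a hypothetical helper that works message by message on an in-memory list; actual brokers expire data in larger on-disk units, so treat this only as a model of the policy.

```python
def apply_retention(log, now, max_age_seconds=None, max_bytes=None):
    """Expire the oldest messages until the configured limits are satisfied.

    `log` is a list of (timestamp, payload) tuples, oldest first.
    """
    kept = list(log)
    if max_age_seconds is not None:
        kept = [(ts, p) for ts, p in kept if now - ts <= max_age_seconds]
    if max_bytes is not None:
        while kept and sum(len(p) for _, p in kept) > max_bytes:
            kept.pop(0)  # expire from the head of the log
    return kept


log = [(0, b"aaaa"), (50, b"bbbb"), (90, b"cccc")]
by_time = apply_retention(log, now=100, max_age_seconds=60)
by_size = apply_retention(log, now=100, max_bytes=8)
```

In both calls the oldest message is dropped: it is older than 60 seconds in the first case, and it pushes the log over 8 bytes in the second.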

- Topics can also be configured as *log compacted*, which means that Kafka will retain only the last message produced with a specific key
    - This can be useful for changelog type data, where only the last update is interesting
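
The effect of compaction can be sketched in a few lines. `compact` is a hypothetical function operating on an in-memory list of `(key, value)` pairs, and the sample data is made up; it shows the outcome of compaction, not how a broker performs it.

```python
def compact(log):
    """Keep only the latest message per key, preserving log order."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [
        (key, value)
        for i, (key, value) in enumerate(log)
        if last_index[key] == i
    ]


changelog = [
    ("user-1", "a@x.org"),
    ("user-2", "b@x.org"),
    ("user-1", "c@x.org"),
]
compacted = compact(changelog)
```

After compaction only the most recent value for `user-1` remains, while `user-2` keeps its single entry.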
