## Reliability Guarantees

- Kafka guarantees the order of messages within a partition
- A produced message is considered *committed* once it has been written to the partition on all of its in-sync replicas
- Messages that are committed will not be lost as long as at least one replica remains alive
- Consumers can only read messages that are committed
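
The guarantees above can be pictured with a small toy model. This is a sketch, not Kafka code: the `Partition` class and its methods are hypothetical names, each replica is modeled as a plain list, and a message counts as committed once every in-sync replica holds it.

```python
class Partition:
    """Toy model of one partition; not a Kafka API."""

    def __init__(self, replica_ids):
        # One append-only log per replica; the first replica acts as leader.
        self.logs = {rid: [] for rid in replica_ids}
        self.leader = replica_ids[0]
        self.in_sync = set(replica_ids)

    def append(self, message):
        """Write to the leader only; followers catch up via replicate()."""
        self.logs[self.leader].append(message)

    def replicate(self, replica_id):
        """Copy the leader's current log to one follower."""
        self.logs[replica_id] = list(self.logs[self.leader])

    def committed_count(self):
        """Number of messages present on every in-sync replica."""
        return min(len(self.logs[rid]) for rid in self.in_sync)

    def readable(self):
        """Consumers only see the committed prefix of the leader's log."""
        return self.logs[self.leader][: self.committed_count()]


p = Partition(["b1", "b2", "b3"])
p.append("m0")
p.append("m1")
p.replicate("b2")        # b2 has both messages, b3 has none yet
before = p.readable()    # nothing is committed yet
p.replicate("b3")
after = p.readable()     # every in-sync replica now has both messages
```

Until the slowest in-sync replica catches up, `readable()` returns nothing, which mirrors the rule that consumers only see committed messages.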


## Replication

- Each Kafka topic is broken down into *partitions*, which are the basic data building blocks
- A partition is stored on a single disk
- Kafka guarantees the order of events within a partition; a partition is either online (available) or offline (unavailable)
- Each partition can have multiple replicas, one of which is a designated leader
    - All events are produced to and consumed from the leader replica
    - Other replicas just need to stay in sync with the leader and replicate all the recent events on time
    - If the leader becomes unavailable, one of the in-sync replicas becomes the new leader


- A follower replica is considered in-sync if it
    - Has an active session with ZooKeeper, meaning that it sends periodic heartbeats to ZooKeeper
    - Has fetched messages from the leader in the last n seconds (configurable)
    - Has fetched the *most recent* messages from the leader in the last n seconds (configurable); still fetching is not enough, the follower must also have almost no lag


- An in-sync replica that is slightly behind can slow down producers and consumers, since they wait for all the in-sync replicas to get a message before it is *committed*
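
The three in-sync conditions listed above can be written as a single predicate. The sketch below is illustrative only: `FollowerState` and `is_in_sync` are hypothetical names, and the 6-second and 10-second thresholds are assumed example values, not settings read from a real cluster.

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    """Timestamps (in seconds) tracked for one follower in this toy model."""
    last_heartbeat: float    # last heartbeat sent to ZooKeeper
    last_fetch: float        # last fetch request sent to the leader
    last_caught_up: float    # last time the follower had the newest messages


def is_in_sync(state, now, session_timeout=6.0, max_lag=10.0):
    """Apply the three in-sync conditions; the thresholds are assumptions."""
    has_session = now - state.last_heartbeat <= session_timeout
    fetched_recently = now - state.last_fetch <= max_lag
    caught_up_recently = now - state.last_caught_up <= max_lag
    return has_session and fetched_recently and caught_up_recently


healthy = FollowerState(last_heartbeat=99.0, last_fetch=99.5, last_caught_up=98.0)
lagging = FollowerState(last_heartbeat=99.0, last_fetch=99.5, last_caught_up=80.0)
```

The `lagging` follower is still fetching, but it has not been caught up for 20 seconds, so it fails the third condition.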

## Broker Configuration 

- There are three configuration parameters in the broker that change Kafka's behavior regarding reliable message storage
    - These configs can be applied at the broker level, where they control the configuration for all topics in the system
    - Or at the topic level, controlling behavior for a specific topic
    - This fine granularity of configuration allows the same Kafka cluster to be used to host reliable and non-reliable topics


### Replication Factor

- Topic level: `replication.factor`, broker level: `default.replication.factor`
- A replication factor of $N$ allows you to lose $N-1$ brokers while still being able to read and write data to the topic reliably
    - A higher replication factor leads to higher availability, higher reliability, and fewer disasters
    - However, you need at least $N$ brokers and $N$ times the disk space, since every partition is stored $N$ times
    
    
- By default, Kafka places each replica of a partition on a separate broker; this is not always safe enough, because all of those brokers may sit on the same rack
- To protect against rack-level failures, it is recommended to place brokers in multiple racks and to set the `broker.rack` broker configuration to the rack name of each broker
- If rack names are configured, Kafka ensures that the replicas of a partition are spread across multiple racks
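
To make the rack requirement concrete, the following sketch checks a given replica assignment for partitions whose replicas all share one rack. It does not reproduce Kafka's placement algorithm; the function names, broker ids, and rack names are made up for illustration.

```python
def racks_per_partition(assignment, broker_rack):
    """Map each partition to the set of racks its replicas live on."""
    return {
        partition: {broker_rack[broker] for broker in brokers}
        for partition, brokers in assignment.items()
    }


def single_rack_partitions(assignment, broker_rack):
    """Partitions that would go offline if a single rack failed."""
    racks = racks_per_partition(assignment, broker_rack)
    return sorted(p for p, rack_set in racks.items() if len(rack_set) < 2)


# Hypothetical cluster: four brokers in two racks.
broker_rack = {1: "rack-a", 2: "rack-a", 3: "rack-b", 4: "rack-b"}
# Hypothetical assignment: partition id -> broker ids holding a replica.
assignment = {0: [1, 3], 1: [2, 4], 2: [1, 2]}

at_risk = single_rack_partitions(assignment, broker_rack)
```

Partition 2 has two replicas, yet both sit in `rack-a`, so a replication factor of 2 alone does not protect it from a rack failure.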


### Unclean Leader Election

- Set with `unclean.leader.election.enable` at the broker level (cluster-wide in practice); it can also be overridden per topic
    - The default is `false` since Kafka 0.11; earlier versions defaulted to `true`
- When the leader for a partition is no longer available, one of the in-sync replicas is chosen as the new leader
    - This leader election is "clean" in that it **guarantees no loss of committed data**


- If no in-sync follower replica exists when the leader becomes unavailable, we can choose whether to allow an out-of-sync replica to become the new leader
    - If we don't allow it, the partition remains offline until the old leader is revived, and we lose availability
    - If we do allow it, we lose all messages that were written to the old leader while the new leader was out of sync, risking data loss and data inconsistencies

- Typically, unclean leader election is disabled in systems where data quality and consistency are critical, e.g., banking systems
- It is enabled where availability matters more, e.g., real-time clickstream analysis
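
The trade-off described above can be sketched as a small election function. This is a toy model under simplifying assumptions (replicas are lists, and the candidate with the lowest id wins); `elect_leader` is a hypothetical helper, not part of Kafka.

```python
def elect_leader(in_sync, alive, allow_unclean):
    """Return the new leader id, or None if the partition must stay offline."""
    clean_candidates = sorted(in_sync & alive)
    if clean_candidates:
        return clean_candidates[0]      # clean election: no committed data lost
    if allow_unclean and alive:
        return sorted(alive)[0]         # unclean election: a stale replica takes over
    return None


# Replica 1 was the leader and the only in-sync replica; replica 2 fell behind.
logs = {1: ["m0", "m1", "m2"], 2: ["m0"]}
in_sync = {1}
alive = {2}  # the old leader has just crashed

strict = elect_leader(in_sync, alive, allow_unclean=False)
lenient = elect_leader(in_sync, alive, allow_unclean=True)
lost = [m for m in logs[1] if m not in logs[lenient]]
```

With unclean election disabled the partition has no leader (`None`) and stays offline; with it enabled, replica 2 takes over and the messages in `lost` are gone.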
    
### Minimum In-Sync Replicas

- Both the topic and the broker level: `min.insync.replicas`
- Even though a topic can be configured with three replicas, you may end up with only one in-sync replica; if that replica becomes unavailable, you have to choose between availability and consistency
- Also, per Kafka's reliability guarantees, data is considered committed once it is written to all in-sync replicas, even when "all" means just one replica


- `min.insync.replicas` can be used to ensure that committed data is written to more than one replica
- When the number of in-sync replicas falls below this number, the brokers stop accepting produce requests and respond with `NotEnoughReplicasException`; consumers can continue reading existing data, so the remaining in-sync replicas effectively become read-only
- This prevents the undesirable situation where data is produced and consumed, only to disappear when unclean election occurs
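
A minimal sketch of that broker-side check follows. It is a toy model: `handle_produce` and the `NotEnoughReplicas` class are hypothetical stand-ins, and in Kafka itself the check applies to producers that request `acks=all`.

```python
class NotEnoughReplicas(Exception):
    """Stand-in for the error the broker returns; not the real client class."""


def handle_produce(message, log, in_sync_count, min_insync_replicas):
    """Append only when enough replicas are in sync to make the write durable."""
    if in_sync_count < min_insync_replicas:
        raise NotEnoughReplicas(
            f"{in_sync_count} in-sync replica(s), need {min_insync_replicas}"
        )
    log.append(message)
    return len(log) - 1  # offset of the appended message


log = ["m0"]
offset = handle_produce("m1", log, in_sync_count=2, min_insync_replicas=2)

try:
    handle_produce("m2", log, in_sync_count=1, min_insync_replicas=2)
    rejected = False
except NotEnoughReplicas:
    rejected = True

readable = list(log)  # existing data stays available to consumers
```

The second write is refused once only one replica is in sync, while the data already in the log remains readable.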





## Using Producers in a Reliable System

- The configuration on the producers needs to match the behavior of the brokers in order to guarantee reliability, thus the producers should
    - use the correct `acks` configuration to match reliability requirements
    - handle errors correctly both in configuration and in code

### Sending Acknowledgements

- `acks=0`: fastest; a message is considered sent successfully as soon as the producer manages to send it over the network. This maximizes throughput, but messages are lost without the producer noticing whenever the leader or the cluster is unavailable
- `acks=1`: the leader sends an ack once it has written the message to its own partition
    - While a leader is being elected, the producer gets `LeaderNotAvailableException` and should retry
    - Data can still be lost if the leader crashes after acknowledging a message but before the followers have replicated it
- `acks=all`: slowest and safest; the leader waits until all in-sync replicas have the message before sending back an ack or an error
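
Since the producer's `acks` setting has to match the broker-side settings, a small consistency check can help when reviewing configurations. The helper below is a hypothetical sketch that encodes only the rules stated in these notes; its name and warning texts are invented, and it expects both settings to be given explicitly.

```python
def reliability_warnings(producer_config, topic_config):
    """Return warnings for settings that weaken the durability guarantees."""
    warnings = []
    acks = str(producer_config["acks"])
    min_isr = int(topic_config["min.insync.replicas"])
    if acks == "0":
        warnings.append("acks=0: failed sends are invisible to the producer")
    elif acks == "1":
        warnings.append("acks=1: acknowledged writes can be lost if the leader crashes")
    elif acks == "all" and min_isr < 2:
        warnings.append("acks=all with min.insync.replicas=1: 'all' may mean one replica")
    return warnings


safe = reliability_warnings({"acks": "all"}, {"min.insync.replicas": 2})
risky = reliability_warnings({"acks": "1"}, {"min.insync.replicas": 2})
weak = reliability_warnings({"acks": "all"}, {"min.insync.replicas": 1})
```

Only the first combination produces no warnings; the other two each trip one of the failure modes described above.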

### Configuring Producer Retries

- Retriable errors will be handled by the producer API
- Non-retriable (non-transient, non-resolvable) errors need to be handled by the developer in code
- It is recommended to configure the producer to keep retrying upon a retriable error if you never want to lose a message
- Retries and error handling can guarantee that each message will be stored **at least once**, but it is difficult to guarantee *exactly once*
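
The at-least-once behavior can be illustrated with a toy retry loop. Nothing here is the Kafka producer API: `RetriableError`, `FatalError`, `send_with_retries`, and `flaky_send` are hypothetical names, and the "broker" is a list that records a write whose acknowledgement is then lost.

```python
class RetriableError(Exception):
    """Stand-in for transient failures such as a leader election."""


class FatalError(Exception):
    """Stand-in for failures that retrying cannot fix."""


def send_with_retries(send, message, max_attempts):
    """Call send(message) until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return attempt
        except RetriableError:
            continue  # transient failure: try again
    raise FatalError(f"gave up after {max_attempts} attempts")


broker_log = []
outcomes = iter(["ack_lost", "ok"])


def flaky_send(message):
    """Hypothetical broker: the first write lands but its ack is lost."""
    broker_log.append(message)
    if next(outcomes) == "ack_lost":
        raise RetriableError("timed out waiting for ack")


attempts = send_with_retries(flaky_send, "m0", max_attempts=3)
```

The message is stored twice because the producer cannot tell a lost ack from a lost write, which is why retries alone give at-least-once rather than exactly-once delivery.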

### Additional Error Handling

- Errors that producer retries cannot resolve still have to be handled by the developer:
    - Nonretriable broker errors, such as message-size errors and authorization errors
    - Errors that occur before the message is sent to the broker, e.g., serialization errors
    - Errors that occur when the producer has exhausted all retry attempts, or when the memory available to the producer fills up with messages held for retrying



## Using Consumers in a Reliable System

- Data is only available to consumers after it has been committed to Kafka (written to all in-sync replicas)
    - Consumers get data that is guaranteed to be consistent
- When reading data from a partition, a consumer is fetching a batch of events, checking the last offset in the batch, then requesting another batch starting from the last offset received
- Consumers commit their offsets so that they can resume consumption from a previous checkpoint
    - For each partition it is consuming, the consumer stores its current location
    

### Consumer Configuration Properties for Reliable Processing

- `group.id`: if multiple consumers have the same group ID and subscribe to the same topic, each will be assigned a subset of the partitions in the topic and will therefore only read a subset of the messages individually
    - If a consumer needs to consume every single message in the topic, it needs a unique `group.id`
    
    
- `auto.offset.reset`: controls the consumer behavior when no offsets were committed (e.g., when the consumer first starts) or when the consumer asks for offsets that don't exist in the broker
    - Either `earliest`: start from the beginning of the partition whenever there is no valid offset; minimizes data loss but can cause the consumer to process many messages twice
    - Or `latest`: start from the end of the partition; minimizes duplicate processing but almost certainly causes some messages to be missed


- `enable.auto.commit`: controls whether offsets are committed implicitly by the consumer API or explicitly by the developer
    - If all processing of consumed records is done within the poll loop, automatic offset commits guarantee that you never commit an offset that has not been processed
    - The drawback is that you cannot control how many duplicates you may have to reprocess after a failure; and if records are handed to another thread for processing, offsets may be committed for records that were read but not yet processed, which risks message loss


- `auto.commit.interval.ms`: if offsets are committed automatically, this controls how frequently the offsets are committed in the poll loop
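
How `auto.offset.reset` interacts with committed offsets can be sketched as a pure function. This is an illustration, not consumer code: `resolve_start_offset` is a hypothetical helper, the offsets are invented, and only the two values described above are modeled.

```python
def resolve_start_offset(committed, log_start, log_end, auto_offset_reset):
    """Pick the offset a consumer starts from for one partition.

    committed is the last committed offset, or None if there is none; the
    partition currently holds offsets in the range [log_start, log_end).
    """
    if committed is not None and log_start <= committed <= log_end:
        return committed  # a valid checkpoint always wins
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    raise ValueError(f"unsupported auto.offset.reset: {auto_offset_reset!r}")


first_run_earliest = resolve_start_offset(None, 100, 250, "earliest")
first_run_latest = resolve_start_offset(None, 100, 250, "latest")
resumed = resolve_start_offset(180, 100, 250, "latest")
expired = resolve_start_offset(40, 100, 250, "earliest")
```

The setting only matters in the first, second, and fourth calls; a consumer with a valid committed offset (`resumed`) continues from its checkpoint regardless of the policy.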


### Explicitly Committing Offsets in Consumers

- Needed when duplicates must be minimized or when event processing is done outside the main consumer poll loop
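
The effect of committing explicitly, and only after processing, can be simulated without a broker. In this sketch `run_consumer` is a hypothetical stand-in for a poll loop, the log is a list, and the committed value is the offset of the next record to read.

```python
def run_consumer(log, committed, process, commit_every, crash_after=None):
    """Process records from `committed` onward and return the committed offset."""
    handled = 0
    for offset in range(committed, len(log)):
        if crash_after is not None and handled == crash_after:
            return committed  # crash: progress since the last commit is lost
        process(offset, log[offset])
        handled += 1
        if handled % commit_every == 0:
            committed = offset + 1  # commit only what has been processed
    return len(log)  # clean shutdown: commit the final position


log = ["a", "b", "c", "d", "e"]
seen = []


def process(offset, value):
    seen.append(offset)


# First run crashes after three records; the last commit covered two of them.
checkpoint = run_consumer(log, 0, process, commit_every=2, crash_after=3)
# Second run resumes from the checkpoint and finishes.
final = run_consumer(log, checkpoint, process, commit_every=2)
```

Offset 2 is processed twice and nothing is skipped. Committing more often would shrink the window of duplicates at the cost of more commit requests.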

#### Rules of Thumb

- Always commit offsets after events have been processed


- Commit frequency is a trade-off between performance and number of duplicates in the event of a crash


- Make sure to commit the correct offset: one that reflects the events actually processed, not the last offset read by the most recent poll


- Rebalances will happen and they need to be handled properly


- Consumers may need to retry when some records cannot be fully processed yet and must be processed later
    - The consumer should not commit the offset of a message that it has not fully processed; two common patterns:
        - Commit the last record that was processed successfully, store the records that still need processing in a buffer, and keep retrying them; keep polling in the meantime to prevent a rebalance (the consumer `pause()` method makes these extra polls return no new data, which makes retrying easier)
        - Write the record to a separate retry topic and continue; either a separate consumer group handles retries from the retry topic, or one consumer subscribes to both the main and the retry topic and pauses the retry topic between retries (similar to a dead-letter queue)
        
        
- Consumers may need to maintain state
    - Optionally write the latest state to a "result" topic at the same time the offset is committed
    - It is recommended to use a library like Kafka Streams, which provides high-level DSL-like APIs for aggregations, joins, windows, and other complex analytics


- Handling long processing times
    - Keep polling (sending heartbeats to the broker) to stop a rebalance from happening
    - Commonly, a thread pool is used to parallelize processing of the records
    - After handing off the records to the worker threads, you can pause the consumer and keep polling without actually fetching additional data until the worker threads finish


- Exactly-once delivery
    - *Idempotent Writes*: The most common way to get exactly-once results is to write them to a system that supports unique keys, such as key-value stores, relational databases, or Elasticsearch
        - The unique key either comes from the record itself or is the combination of topic, partition, and offset
    - *Transactional Writes*: Write the results to a system that supports transactions, such as a relational database; the idea is to write the records and their offsets in the same transaction so that they stay in sync
        - When starting up, retrieve the offsets of the latest records written to the external store and then use `consumer.seek()` to start consuming again from those offsets
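
Both exactly-once patterns can be sketched with an in-memory SQLite database standing in for the external store. The table layout, function names, and sample records are assumptions made for this example; the primary key on topic, partition, and offset makes redelivered records overwrite themselves, and the offset is stored in the same transaction as the result.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE results (topic TEXT, part INTEGER, record_offset INTEGER, "
    "result TEXT, PRIMARY KEY (topic, part, record_offset))"
)
db.execute(
    "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER, "
    "PRIMARY KEY (topic, part))"
)


def store_record(db, topic, part, record_offset, result):
    """Write the result and the next offset to read in one transaction."""
    with db:  # commits on success, rolls back if either statement fails
        db.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (topic, part, record_offset, result),
        )
        db.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
            (topic, part, record_offset + 1),
        )


def restart_position(db, topic, part):
    """Offset to seek to on startup, or None if nothing was stored yet."""
    row = db.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?",
        (topic, part),
    ).fetchone()
    return row[0] if row else None


# Offset 42 is delivered twice, as can happen after a retry or a rebalance.
for record_offset, result in [(41, "created"), (42, "paid"), (42, "paid")]:
    store_record(db, "orders", 0, record_offset, result)

row_count = db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
resume_at = restart_position(db, "orders", 0)
```

On startup, the value returned by `restart_position` is what would be passed to `consumer.seek()` for that partition.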


## Validating System Reliability

### Validating Configuration

- The broker and client configuration can be tested easily in isolation from the application logic
    - Helps to test if the configuration chosen can meet the requirements
    - It is good exercise to reason through the expected behavior of the system


- `org.apache.kafka.tools` includes the `VerifiableProducer` and `VerifiableConsumer` classes, which can be run as command-line tools or embedded in an automated testing framework
    - The verifiable producer produces a sequence of messages containing the numbers 1 to N; it can be configured with the same settings as your own producer (acks, retries, rate of production); when run, it prints success or error for each message sent to the broker, based on the acks received
    - The verifiable consumer consumes the events and prints them in the order consumed, along with information about commits and rebalances
    

- Test scenarios to run:
    - Leader election: kill the leader and observe the behavior of the producers and consumers
    - Controller election: how long does it take the system to resume after a restart of the controller
    - Rolling restart: restart the brokers one by one and no messages should be lost
    - Unclean leader election test: kill all the replicas for a partition one by one to make sure each goes out of sync and then start a broker that was out of sync
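
After any of these scenarios, the check comes down to comparing what was acknowledged with what was consumed. The sketch below assumes the sequence numbers have already been extracted into two lists; it does not parse the actual output of the verifiable tools, and `verify_run` is a hypothetical helper.

```python
from collections import Counter


def verify_run(acked, consumed):
    """Compare acknowledged sequence numbers with what the consumer saw."""
    counts = Counter(consumed)
    missing = sorted(n for n in acked if counts[n] == 0)
    duplicated = sorted(n for n, count in counts.items() if count > 1)
    return {"missing": missing, "duplicated": duplicated}


# Hypothetical run: the producer got acks for 1..5 while a broker was restarted.
acked = [1, 2, 3, 4, 5]
consumed = [1, 2, 3, 3, 5]
report = verify_run(acked, consumed)
```

A non-empty `missing` list means an acknowledged message was lost, which violates the reliability requirements; duplicates are expected with at-least-once delivery.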


### Validating Application

- Application tests check custom error-handling code, offset commits, rebalance listeners, and similar code that interacts with the Kafka clients
- Testing depends on the type of the application, but in general it is good to test under failure conditions:
    - Clients lose connectivity to the cluster
    - Leader election
    - Rolling restart of brokers
    - Rolling restart of consumers
    - Rolling restart of producers


### Monitoring Reliability in Production

- For the producers, the two most important reliability metrics are the error rate and the retry rate per record (aggregated); also monitor the producer logs for send errors logged at WARN, and watch for log entries indicating that the producer has run out of retries
- For the consumers, the most important metric is consumer lag; make sure that consumers do eventually catch up rather than fall farther and farther behind
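
Consumer lag is the difference between the latest offset written to a partition and the offset the consumer group has committed. The sketch below assumes both values are already available as dictionaries keyed by partition; the function names and numbers are illustrative.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed as consumed."""
    return {
        partition: end - committed_offsets[partition]
        for partition, end in log_end_offsets.items()
    }


def is_falling_behind(total_lag_samples):
    """True when total lag grew between every pair of consecutive samples."""
    pairs = list(zip(total_lag_samples, total_lag_samples[1:]))
    return bool(pairs) and all(later > earlier for earlier, later in pairs)


lag = consumer_lag({0: 1500, 1: 900}, {0: 1450, 1: 900})
total = sum(lag.values())
```

A steadily growing series such as `[10, 40, 90]` makes `is_falling_behind` return `True`, which is the situation worth alerting on; a lag that rises and then falls again is normal.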
