In [2]:
%classpath add mvn org.apache.kafka kafka_2.12 2.3.0

In [3]:
%classpath config resolver repository.io.confluent https://packages.confluent.io/maven/
%classpath add mvn io.confluent kafka-avro-serializer 5.4.1

Added new repo: repository.io.confluent


In [4]:
%classpath add mvn org.apache.avro avro 1.8.2

# Producer Overview

- write messages to Kafka: record user activities, record metrics, storing log messages, communicating asynchronously with other applications, buffering information before writing to a database, etc.

## Producing Messages

- first create a `ProducerRecord`, which must include the topic and the value, optionally include a key and/or a partition
- then the producer serialize the key and value objects to ByteArrays so they can be sent over the network
- next, the data is sent to a partitioner; if a partition is specified in the ProducerRecord, then the partitioner performs a no-op; otherwise it chooses a partition for the record, usually based on the ProducerRecord key
- once a partition is selected, the producer adds the record to a batch of records that will also be sent to the same topic and partition
    - a separate thread is responsible for sending those batches of records to the appropriate Kafka brokers
- when the broker receives the messages, it sends back a response; if the messages were successfully written to Kafka, it returns a `RecordMetadata` object with the topic, partition, and the offset of the record within the partition
- if the broker failed to write the messages, it will return an error; when the producer receives the error, it may perform retry sending the messages before giving up and returning an error




<img src="img/Snip20200418_1.png"/>

## Constructing a Kafka Producer

- a Kafka producer has three mandatory properties
- `bootstrap.servers`: list of `host:port` pairs of brokers that the producer will use to establish initial connection to the Kafka cluster; does not need to include all brokers, since the producer will get more information after the initial connection
- `key.serializer`: name of a class that will be used to serialize the keys of the records (`org.apache.kafka.common.serialization.Serializer`)
- `value.serializer`: name of a class that will be used to serialize the values of the records

In [None]:
%%scala
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

private val kafkaProps: Properties = new Properties()
kafkaProps.put("bootstrap.servers", "localhost:9092")
kafkaProps.put("key.serializer",
               "org.apache.kafka.common.serialization.StringSerializer")
kafkaProps.put("value.serializer",
               "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](kafkaProps)

In [5]:
package kafka_definitive_guide.producer;

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerFactory {
    public static KafkaProducer<String, String> makeDefaultProducer() {
        Properties kafkaProps = new Properties();
        kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:19092");
        kafkaProps.put(ProducerConfig.ACKS_CONFIG, "all");
        kafkaProps.put(ProducerConfig.RETRIES_CONFIG, 0);
        kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                       "org.apache.kafka.common.serialization.StringSerializer");
        kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                       "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<String, String>(kafkaProps);
    }
}

kafka_definitive_guide.producer.ProducerFactory

## Serializing Using Apache Avro

- Apache Avro is a language-neutral data serialization format
- Avro data is described in a language-independent schema, usually described in JSON


In [None]:
String schemaString =
    "{\"namespace\": \"customerManagement.avro\", \"type\": \"record\", \"name\": \"Customer\",\"fields\": [{\"name\": \"id\", \"type\": \"int\"}, {\"name\": \"name\", \"type\": \"string\"},{\"name\": \"email\", \"type\": [\"null\",\"string\"], \"default\":\"null\"}]}";


In [7]:
%%bash

curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    --data '{"schema": "{\"namespace\": \"customerManagement.avro\", \"type\": \"record\", \"name\": \"Customer\",\"fields\": [{\"name\": \"id\", \"type\": \"int\"}, {\"name\": \"name\", \"type\": \"string\"},{\"name\": \"email\", \"type\": [\"null\",\"string\"], \"default\":\"null\"}]}"}' \
    http://schema-registry:8081/subjects/Customer/versions

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   285  100     8  100   277    800  27700 --:--:-- --:--:-- --:--:-- 31666
{"id":1}



In [7]:
package kafka_definitive_guide.producer;

// Apache Avro Code gen is needed, the below example won't work
public class Customer {
    public final Integer id;
    public final String name;
    public Customer(Integer id, String name) {
        this.id = id;
        this.name = name;
    }
}

kafka_definitive_guide.producer.Customer

In [8]:
package kafka_definitive_guide.producer;

import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.KafkaException;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import io.confluent.kafka.serializers.AbstractKafkaSchemaSerDeConfig;
import kafka_definitive_guide.producer.Customer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:19092");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, 0);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://schema-registry:8081");

String topic = "customerContacts";
Customer customer = new Customer(1, "yifan");
ProducerRecord<String, Customer> record = new ProducerRecord<>(topic, String.valueOf(customer.id), customer);
KafkaProducer<String, Customer> producer = null;

try {
    producer = new KafkaProducer<>(props);
    RecordMetadata metadata = producer.send(record).get();
    System.out.println("topic: " + metadata.topic() + ", partition: " + metadata.partition() + ", offset: " + metadata.offset());
} catch (KafkaException e) {
    e.printStackTrace();
} finally {
    producer.close();
}


cannot find symbol: cannot find symbol

### Sending Messages

The producer consists of a pool of buffer space that holds records that haven't been transmitted to the server as well as a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster.

- The `send()` method is asynchronous, when called it adds the records to a buffer of pending record sends and immediately returns
- The producer maintains buffers of unsent records for each partition
    - these buffers are of a size specified by the `batch.size` config
- By default a buffer is available to send immediately even there is additional unused space in the buffer, optionally `linger.ms` can be set to something greater than 0 to instruct the producer to wait before sending a request
- `buffer.memory` controls the total amount of memory available to the producer for buffering, if records are sent faster than they can be transmitted then the buffer space will be exhausted, in which case additional `send()` calls will block, for a duration up to `max.block.ms` before it throws `TimeoutException`

#### Idempotent Producer 

- The idempotent producer strengthens Kafka's delivery sematics from at least once to exactly once, in particular producer retries will no longer introducer duplicates
- To enable: `enable.idempotence = true`, then `retries` config will default to `Integer.MAX_VALUE` and `acks` will default to `all`
- The producer can only guarantee idempotence for messages sent within a single session


#### Transactional Producer

- The transactional producer allows an application to send messages to multiple partitions and topics atomically
- To use the transaction, one must set the `transactional.id` config
- Idempotence is automatically enabled along with the producer configs which idempotence depends on
- Topics included in the transactions must be configured for durability
    - `replication.factor >= 3`
    - `min.insync.replicas >= 2`
- For transactional guarantees to be realized from end-to-end, the consumers must be configured to read only committed messages
- `transational.id` enables transaction recovery across multiple sessions of a single producer instance, and it should be unique to each producer instance running within a partitioned application

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "my-transactional-id");
Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());

producer.initTransactions();

try {
    producer.beginTransaction();
    for (int i = 0; i < 100; i++)
        producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    producer.close();  // non recoverable exceptions
} catch (KafkaException e) {
    producer.abortTransaction(); // for all other exception, abort and retry
}
```


#### Calling `send()`

- _Fire-and-forget_
    - send a msg and dont care if it arrives successfully
    - most of the time it arrives
        - since Kafka is HA
        - and producer performs retries
- _Synchronous send_
    - `send()` returns a Future object
    - use `get()` to wait on the future 
- _Asynchronous send_
    - call `send()` with a callback
    - callback gets triggered when it receives a response from the Kafka broker

In [None]:
package kafka_definitive_guide.producer.example;

import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.*;

import kafka_definitive_guide.producer.ProducerFactory;

KafkaProducer<String, String> producer = ProducerFactory.makeDefaultProducer();
ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");

try {
    Future<RecordMetadata> future = producer.send(record);
    RecordMetadata metadata = future.get();
    System.out.println("Topic: " + metadata.topic());
    System.out.println("Partition: " + metadata.partition());
    System.out.println("Offset: " + metadata.offset());
} catch (Exception e) {
    e.printStackTrace();
} finally {
    producer.close();   
}

In [None]:
%%scala
import org.apache.kafka.clients.producer.ProducerRecord
val record: ProducerRecord[String, String] =
    // topic, key, value
    new ProducerRecord("CustomerCountry", "Precision Products", "France")
try {
    // doesnt block!
    // we ignore errors that may occur while
    // sending messages to brokers, or in the brokers themselves
    val future = producer.send(record) // return Future[ProducerRecord]
    // alternativesly, do producer.send(record).get() to block
    val result = future.get()
} catch {
    // we can still get an exception before sending
    // e.g., SerializationException, TimeoutException
    case e: Exception => e.printStackTrace()
}

In [None]:
%%scala
import org.apache.kafka.clients.producer.Callback
import org.apache.kafka.clients.producer.RecordMetadata

class ProducerCallback extends Callback {
    override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
        print(recordMetadata.toString())
        if (e != null) {
            e.printStackTrace()
        }
    }
}

producer.send(record, new ProducerCallback())

## Configuring Producers

### `acks`

- controls how many partitions replicas must receive the record before the producer can consider the write successful
    - `acks=0`: producer will not wait for a reply from the broker before assuming the message was sent successfully
    - `acks=1`: producer will receive a success response from the broker the moment the leader replica received the message
    - `acks=all`: the prodcuer will receive a success response from the broker once all in-sync replicas received the message


### `buffer.memory`

- sets the amount of memory the producer will use to buffer messages waiting to be sent to brokers
- if messages are sent by the application faster than they can be delivered to the server, the producer may run out of space and additional `send()` calls will either block or throw an exception (based on `block.on.buffer.full` parameter)

### `compression.type`

- `snappy`/`gzip`/`lz4`
- default is uncompressed
- `snappy` provides decent compression ratios with low CPU overhead and good performance, recommended in cases where both performance and bandwidth are a concern
- `gzip` uses more CPU and time but result in better compression ratios

### `retries`

- how many times the producer will retry sending the message upon a transient error before giving up and notifying the client of an issue
    - `retry.backoff.ms`: how long the producer wait in between retries
    - backoff should consider the time it takes from recovering from a crashed broker (i.e., how long until all partitions get new leaders)
- developers should focus on handling non-retriable errors or cases where retry attempts were exhausted

### `batch.size`

- when multiple records are sent to the same partition, the producer will batch them together
- controls the amount of memory in bytes that will be used for each batch
- producer will send half-full batches and will not wait for a batch to become full

### `linger.ms`

- controls the amount of time to wait for additional messages before sending the current batch
- the producer sends a batch of messages either when the current batch is full or when `linger.ms` is reached
- increasing `linger.ms` increases latency but also increases throughput since we send more messages at once, there is less overhead per message
    
### `client.id`

- used by the brokers to identify messages sent from the client
- used in logging, metrics, and quotas

### `max.in.flight.requests.per.connection`

- controls how many messages the producer will send to the server without receiving responses
- setting this high can increase memory usage while improving throughput
- setting this too high can reduce throughput as batch becomes less efficient
- setting this to 1 guarantees that the mesages will be written to the broker in the order in which they were sent, even when retries occur

### `timeout.ms`, `request.timeout.ms`, `metadata.fetch.timeout.ms`

- controls how long the producer will wait for a reply from the server when sending the data (`request.timeout.ms`) and when requesting metadata such as the current leaders for the partitions we are writing to (`metadata.fetch.timeout.ms`)
- upon timeout the producer either retry sending or respond with an error (throwing exception or sends callback)
- `timeout.ms` controls the time the broker will wait for in-sync replicas to acknowledge the message in order to meet the `acks` configuration
    - the broker will return an error if the time elapses without necessary acknowledgements
    
### `max.block.ms`

- controls how long the producer will block when calling `send()` and when explicitly requesting metadata via `partitionsFor()`

### `max.request.size`

- controls the size of a produce request sent by the producer
- caps both the size of the largest message that can be sent and the number of messages that the producer can send in one request
- the broker has its own limit on the size of the largest message it will accept (`message.max.bytes`)


### `receive.buffer.bytes` & `send.buffer.bytes`

- the sizes of the TCP send/receive buffers used by the sockets when writing & reading data




### Ordering Guarantees

- Kafka preserves the order of messages within a partition
- however, having nonzero `retries` parameter and higher-than-one `max.in.flights.requests.per.session` means that it is possible the broker will fail to write the first batch of messages, succeed to write the second batch, then retry the first batch and succeed, thereby reversing the order
- setting `retries` to zero is not an option for a reliable system
- set `in.flight.requests.per.session` to 1 to make sure that while a batch of messages is retrying, additional messages will not be sent
- ^ only use when order is important as the throughput will be severely limited


## Partitions

- Kafka messages are key-value pairs; but one can create a `ProducerRecord` with only topic and value, with the key set to null by default
- Keys servers two goals: they add additional information that gets stored with the message; and they are used to decide which one of the topic partitions the message will be written to
- All messages with the same key will go to the same partition
    - When the key is null and the default partitioner is used, the record will be sent to one of the available partitions of the topic at random: a round-robin algorithm will be used to balance the messages among the partitions
    - If a key exists and the default partitioner is used, Kafka will hash the key and use the result to map the message to a specific partition
        - all partitions are considered during key mapping, so if a specific partition is unavailable when the data is written to it, there might be an error
        - the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change


### Custom Partitioning Strategy

- Example: suppose one type of key is particularly common, and it leads to data skew among partitions and brokers 

In [None]:
package kafka_definitive_guide.producer;

import java.util.*;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.record.InvalidRecordException;
import org.apache.kafka.common.utils.Utils;

public class SkewPartitioner implements Partitioner {
    public void configure(Map<String, ?> configs) {}
    
    public int partition(String topic,
                         Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes,
                         Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("All messages are expected to have key");
        
        String keyString = (String) key;
        // let the very common key to always go to a certain partition, in this case the last one
        if (keyString.equals("VeryCommonKey")) return numPartitions; 
        // other records will get hashed to the rest of the partitions
        return (Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1));
    }
    
    public void close() {}
}