# Apache Kafka

Date: 18 July 2024

By Selman Karaosmanoglu

The purpose of a data pipeline is to move data from one place or form to another.

Transformations for ETL pipelines take place within the data pipeline before the data reaches its destination.

The data pipeline needs to be monitored once it is in production to ensure **data integrity**.
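As a rough illustration of the three notes above, the sketch below moves a couple of records through extract, transform, and load steps and runs a simple record-count check as a stand-in for integrity monitoring. The function names, the sample readings, and the in-memory `warehouse` list are all hypothetical; a real pipeline would read from and write to external systems.

```python
def extract() -> list[dict]:
    # Hypothetical raw readings; a real pipeline would read a file, API, or database.
    return [
        {"city": "Ankara", "temp_f": "95.0"},
        {"city": "Izmir", "temp_f": "104.0"},
    ]


def transform(rows: list[dict]) -> list[dict]:
    # In ETL the transformation happens inside the pipeline, before loading.
    return [
        {"city": r["city"], "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
        for r in rows
    ]


def load(rows: list[dict], destination: list) -> None:
    # The destination is just a list here; it stands in for a warehouse table.
    destination.extend(rows)


def run_pipeline(destination: list) -> int:
    raw = extract()
    clean = transform(raw)
    load(clean, destination)
    # Minimal integrity check: no record was lost between extract and load.
    if len(clean) != len(raw):
        raise RuntimeError("record count mismatch between extract and load")
    return len(clean)


warehouse: list[dict] = []
loaded = run_pipeline(warehouse)
```

In production the integrity check would typically be richer (schema validation, checksums, alerting), but the shape stays the same.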

An event stream represents entities’ status updates over time.

The main components of an event streaming platform (ESP) are the event broker, event storage, and an analytic and query engine.

Apache Kafka is a very popular open-source ESP.

Popular Kafka service providers include Azure Event Hubs, Confluent Cloud, IBM Event Streams, and Amazon MSK.

The core components of Kafka are brokers, topics, partitions, replications, producers, and consumers.

A Kafka cluster is made up of one or more servers that act as the event broker to receive, store, and distribute events.

The `kafka-console-consumer` command-line tool runs a consumer that reads events from a topic and prints them.

The Kafka Streams API is a simple client library that supports data processing in event streaming pipelines.

⭐ A stream processor receives a stream, transforms it, and forwards the processed stream.

The Kafka Streams API is based on **a computational graph** called a stream-processing topology.

The Streams API processes and analyzes data one record at a time.

Kafka increases fault tolerance and throughput with topic partitions and replication.

Event streaming is the continuous transportation of events between an event source and an event destination.

An event is an observable state update over time.

Kafka was originally used to track user activities.

There are two special types of processors in the topology: the source processor and the sink processor.
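The topology idea can be sketched in plain Python as a chain of generators: a source processor reads from an input topic, a stream processor transforms each record, and a sink processor writes to an output topic. This is only an illustration of the shape of the graph, not the Kafka Streams API (which is a Java library); the function names, the `amount` field, and the use of lists as topics are assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator

Record = dict


def source_processor(topic: Iterable[Record]) -> Iterator[Record]:
    """Source node: reads records from an input topic (here, any iterable)."""
    for record in topic:
        yield record


def map_processor(
    upstream: Iterable[Record], fn: Callable[[Record], Record]
) -> Iterator[Record]:
    """Stream processor: transforms one record at a time and forwards it."""
    for record in upstream:
        yield fn(record)


def sink_processor(upstream: Iterable[Record], output_topic: list) -> None:
    """Sink node: writes every processed record to an output topic (a list)."""
    for record in upstream:
        output_topic.append(record)


# Hypothetical ATM records; "amount" is an illustrative field.
input_topic = [{"atmid": 1, "amount": 50}, {"atmid": 2, "amount": 120}]
output_topic: list[Record] = []

# Wire the graph: source -> map -> sink.
flagged = map_processor(
    source_processor(input_topic),
    lambda r: {**r, "large": r["amount"] > 100},
)
sink_processor(flagged, output_topic)
```

Because each stage is a generator, a record flows through the whole chain before the next one is read, which mirrors the one-record-at-a-time processing model.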

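The servers in a Kafka cluster are called **brokers**.

Kafka persists events durably (permanent persistency): events stay available after they are consumed, for as long as the topic's retention settings allow.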

Once events are published and properly stored in topic partitions, you can create consumers to read them.

The processor performs operations on data such as serialization, compression, and encryption.

Producers publish events to topics.

KRaft is a consensus protocol that streamlines Kafka’s architecture by consolidating metadata responsibilities within Kafka itself, removing the dependency on ZooKeeper.

Stream processing is designed for ingesting information, such as credit card transactions, that needs to be processed immediately as it occurs.

An **ad hoc data processor** filters raw data based on a condition, for example filtering weather data to only include extreme weather events, such as very high temperatures.
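A minimal sketch of such a filter, assuming each reading is a dict with a `temp_c` field and that 40 °C counts as "very high" (both are assumptions made for the example, not values from the source):

```python
from typing import Iterable, Iterator


def extreme_weather(
    readings: Iterable[dict], threshold_c: float = 40.0
) -> Iterator[dict]:
    """Forward only the readings whose temperature reaches the threshold."""
    for reading in readings:
        if reading["temp_c"] >= threshold_c:
            yield reading


# Hypothetical raw weather stream.
raw_stream = [
    {"station": "A", "temp_c": 21.5},
    {"station": "B", "temp_c": 43.2},
    {"station": "C", "temp_c": 39.9},
    {"station": "D", "temp_c": 40.0},
]
extreme_events = list(extreme_weather(raw_stream))
```

The filter is written as a generator so it can sit in the middle of a stream and handle one record at a time without buffering the whole input.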



In [None]:
# Create a new topic with 2 partitions and a replication factor of 2
# (a replication factor of 2 needs at least two brokers in the cluster)
kafka-topics  --bootstrap-server localhost:9092 --topic log_topic --create --partitions 2 --replication-factor 2

In [None]:
# List topics
kafka-topics --bootstrap-server localhost:9092 --list

In [None]:
# Get topic details
kafka-topics --bootstrap-server localhost:9092 --describe --topic log_topic

In [None]:
# Delete Topic
kafka-topics --bootstrap-server localhost:9092 --topic log_topic --delete

In [None]:
# Console producer: publishes messages to a topic
kafka-console-producer

In [None]:
# Console consumer: reads messages from a topic and prints them
kafka-console-consumer

### Install, configure and run Kafka

In [None]:
# Download Kafka
wget https://downloads.apache.org/kafka/3.7.0/kafka_2.12-3.7.0.tgz

In [None]:
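# Extract Kafka and change into the extracted directory (the bin/ commands below are run from there)
tar -xzf kafka_2.12-3.7.0.tgz
cd kafka_2.12-3.7.0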

In [None]:
# Generate a cluster ID (UUID)
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"

In [None]:
# KRaft requires the log directories to be configured. Run the following command to configure the log directories,
# passing the cluster ID.
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties

In [None]:
# Start the Kafka server by running the following command.
bin/kafka-server-start.sh config/kraft/server.properties

### Create topic

In [None]:
# Create topic news
bin/kafka-topics.sh --create --topic news --bootstrap-server localhost:9092

In [None]:
# Create a producer to send messages to Kafka
bin/kafka-console-producer.sh   --bootstrap-server localhost:9092   --topic news

In [None]:
# After the producer starts and you see the '>' prompt, type any text message and press Enter.
# Alternatively, copy and paste the text below, which sends three messages to Kafka.

Good morning
Good day
Enjoy the Kafka lab


### Start consumer

In [None]:
# Listen to the messages in the topic news
bin/kafka-console-consumer.sh   --bootstrap-server localhost:9092   --topic news   --from-beginning

## ATM example

In [None]:
# Create topic
bin/kafka-topics.sh --create --topic bankbranch --partitions 2 --bootstrap-server localhost:9092

In [None]:
# List all topics
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list

In [None]:
# Check details --describe
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic bankbranch

In [None]:
# Producer to publish ATM transaction messages
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic bankbranch

In [None]:
# Add messages
{"atmid": 1, "transid": 100}
{"atmid": 1, "transid": 101}
{"atmid": 2, "transid": 200}
{"atmid": 1, "transid": 102}
{"atmid": 2, "transid": 201}

In [None]:
# Start a new consumer subscribed to the bankbranch topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic bankbranch --from-beginning

### Produce and Consume with Message Keys

In [None]:
# Start a new producer with the message key enabled
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic bankbranch --property parse.key=true --property key.separator=:

In [None]:
1:{"atmid": 1, "transid": 103}
1:{"atmid": 1, "transid": 104}
2:{"atmid": 2, "transid": 202}
2:{"atmid": 2, "transid": 203}
1:{"atmid": 1, "transid": 105}
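The effect of `parse.key=true` and `key.separator=:` on the lines above can be sketched in Python. This is an assumption-labelled illustration rather than the console producer's actual code: it assumes the line is split at the first occurrence of the separator, which is why the colons inside the JSON value stay intact. `parse_keyed_line` is a hypothetical helper name.

```python
import json


def parse_keyed_line(line: str, separator: str = ":") -> tuple[str, dict]:
    """Split 'key<separator>value' at the FIRST separator and decode the JSON value."""
    key, found, value = line.partition(separator)
    if not found:
        raise ValueError(f"no key separator {separator!r} found in line: {line!r}")
    return key, json.loads(value)


key, value = parse_keyed_line('1:{"atmid": 1, "transid": 103}')
```

Here `key` is the string `"1"` and `value` is the decoded transaction dict.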

In [None]:
# Start a new consumer with --property print.key=true and --property key.separator=: to print the keys.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic bankbranch --from-beginning --property print.key=true --property key.separator=:
# Messages with the same key go to the same partition, so they are consumed in the order they were published
# (for example, for key 1: trans103 -> trans104 -> trans105).
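Per-key ordering follows from how keyed messages are assigned to partitions: the partition is derived from a hash of the key, so equal keys always land in the same partition, and a partition is an ordered log. The sketch below shows the idea with `zlib.crc32` as a stand-in hash; Kafka's default partitioner uses a different hash function (murmur2), so the actual partition numbers will differ. `choose_partition` is a hypothetical helper.

```python
import zlib
from collections import defaultdict


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's partitioner: hash the key, then take it modulo
    # the partition count. crc32 is an assumption made for this example.
    return zlib.crc32(key) % num_partitions


# The keyed messages published above, as (key, transid) pairs.
published = [("1", 103), ("1", 104), ("2", 202), ("2", 203), ("1", 105)]

partitions: dict[int, list[tuple[str, int]]] = defaultdict(list)
for msg_key, transid in published:
    partitions[choose_partition(msg_key.encode("utf-8"), 2)].append((msg_key, transid))

# Within its partition, each key's messages keep their publication order.
key1_partition = choose_partition(b"1", 2)
key1_order = [t for k, t in partitions[key1_partition] if k == "1"]
```

Ordering is only guaranteed within a partition, so messages with different keys, or with no key at all, may be read in a different relative order than they were published.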

### Consumer offset


In [None]:
# Create a new consumer within a consumer group called atm-app
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic bankbranch --group atm-app

In [None]:
# The details of the consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group atm-app

In [None]:
# Publish two messages
1:{"atmid": 1, "transid": 106}
2:{"atmid": 2, "transid": 204}

In [None]:
# Check consumer group details again
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group atm-app
# With the consumer stopped, you should see that the LOG-END-OFFSET of both partitions has increased by 1
# and the LAG column for both partitions has become 1.
# It means there is one new message in each partition waiting to be consumed.
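The relationship between the columns in the `--describe` output can be modelled with a small in-memory sketch: lag is the log-end offset minus the group's committed offset for a partition. The `PartitionState` class and its fields are hypothetical and only mimic the bookkeeping, not Kafka's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionState:
    log: list = field(default_factory=list)  # events stored in the partition
    committed_offset: int = 0                # next offset the group will read

    @property
    def log_end_offset(self) -> int:
        return len(self.log)

    @property
    def lag(self) -> int:
        # Number of stored events the group has not consumed yet.
        return self.log_end_offset - self.committed_offset

    def consume_all(self) -> list:
        """Read everything from the committed offset onward, then commit."""
        new_events = self.log[self.committed_offset:]
        self.committed_offset = self.log_end_offset
        return new_events


partition = PartitionState()
partition.log.extend([{"transid": 103}, {"transid": 104}, {"transid": 105}])
partition.consume_all()                  # the group catches up: lag is 0
partition.log.append({"transid": 106})   # a new event arrives while the consumer is stopped
lag_after_publish = partition.lag        # one event is now waiting
```

Starting the consumer again corresponds to another `consume_all()` call, which returns only the new event and brings the lag back to 0.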

In [None]:
# Start consumer again
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic bankbranch --group atm-app

In [None]:
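# Reset the group's offsets to the earliest position so that all messages are consumed again
# (stop the consumers in the group first; offsets can only be reset while the group is inactive)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092  --topic bankbranch --group atm-app --reset-offsets --to-earliest --execute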

In [None]:
# Shift the offset to the left by using --reset-offsets --shift-by -2
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092  --topic bankbranch --group atm-app --reset-offsets --shift-by -2 --execute
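The two reset strategies amount to simple arithmetic on a partition's committed offset. The sketch below assumes that a shifted offset is clamped to the range between the earliest and latest offsets of the partition; that clamping is an assumption stated for the example, and the function names are hypothetical.

```python
def reset_to_earliest(earliest: int) -> int:
    """--to-earliest: move the committed offset to the start of the partition."""
    return earliest


def shift_by(current: int, delta: int, earliest: int, latest: int) -> int:
    """--shift-by N: move the committed offset by N (negative = to the left).

    Assumption: the result is clamped to the valid [earliest, latest] range.
    """
    return max(earliest, min(latest, current + delta))


# A partition holding offsets 0..4, with the group fully caught up at offset 5.
after_shift = shift_by(current=5, delta=-2, earliest=0, latest=5)
```

With the offset shifted from 5 to 3, the next consumer started in the group would re-read the last two messages of that partition.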

# Kafka Python

In [None]:
from kafka.admin import KafkaAdminClient, NewTopic
admin_client = KafkaAdminClient(bootstrap_servers="localhost:9092", client_id="test")

In [None]:
# The most common use of the admin_client is managing topics, such as creating and deleting topics.
topic_list = []
new_topic = NewTopic(name="bankbranch", num_partitions=2, replication_factor=1)
topic_list.append(new_topic)

In [None]:
# Creating topics
admin_client.create_topics(new_topics=topic_list)
# Note: The create topic operation used above is equivalent to using kafka-topics.sh --create --topic in Kafka CLI client.

In [None]:
# Describe topic configuration
from kafka.admin import ConfigResource, ConfigResourceType
configs = admin_client.describe_configs(
    config_resources=[ConfigResource(ConfigResourceType.TOPIC, "bankbranch")]
)

### Producer


In [None]:
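# With no bootstrap_servers argument, kafka-python connects to localhost:9092
import json
from kafka import KafkaProducer
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode("utf-8"))

In [None]:
producer.send("bankbranch", {"atmid": 1, "transid": 100})
producer.send("bankbranch", {"atmid": 2, "transid": 101})
producer.flush()  # send() is asynchronous; flush() blocks until buffered messages are delivered
# Note: The above producing message operation is equivalent to using kafka-console-producer.sh --topic in Kafka CLI client.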
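The `value_serializer` passed to the producer turns each Python dict into UTF-8 encoded JSON bytes before it is sent. The standalone sketch below shows that step together with its inverse; the function names are hypothetical, and the decoding function is the kind of callable that could be passed to a kafka-python consumer as `value_deserializer`.

```python
import json


def serialize_value(value: dict) -> bytes:
    """Same logic as the producer's lambda: dict -> JSON text -> UTF-8 bytes."""
    return json.dumps(value).encode("utf-8")


def deserialize_value(raw: bytes) -> dict:
    """Inverse step for the consuming side: UTF-8 bytes -> JSON text -> dict."""
    return json.loads(raw.decode("utf-8"))


payload = serialize_value({"atmid": 1, "transid": 100})
restored = deserialize_value(payload)
```

Keeping the two functions side by side makes it easy to check that whatever the producer writes, the consumer can read back unchanged.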

### Consumer

In [None]:
from kafka import KafkaConsumer
consumer = KafkaConsumer("bankbranch")

In [None]:
# Iterating over the consumer blocks and yields messages as they arrive
for msg in consumer:
    print(msg.value.decode("utf-8"))
# Note: The above consuming message operation is equivalent to using kafka-console-consumer.sh --topic in Kafka CLI client.

## Kafka-python example

### Create a topic with admin.py

In [None]:
from kafka.admin import KafkaAdminClient, NewTopic
admin_client = KafkaAdminClient(bootstrap_servers="localhost:9092", client_id="test")
topic_list = []
new_topic = NewTopic(name="bankbranch", num_partitions=2, replication_factor=1)
topic_list.append(new_topic)
admin_client.create_topics(new_topics=topic_list)

### Create a producer with the producer.py file

In [None]:
from kafka import KafkaProducer
import json
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode('utf-8'))
producer.send("bankbranch", {'atmid':1, 'transid':100})
producer.send("bankbranch", {'atmid':2, 'transid':101})
producer.flush()
producer.close()

### Create the consumer.py

In [None]:
from kafka import KafkaConsumer
consumer = KafkaConsumer('bankbranch',
                         group_id=None,
                         bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest')
print("Hello")
print(consumer)

for msg in consumer:
    print(msg.value.decode("utf-8"))

### Finally, run the Python files
python3 admin.py
python3 producer.py
python3 consumer.py

### new_producer.py

In [None]:
from kafka import KafkaProducer
import json

# Serialize each message value as UTF-8 encoded JSON
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode('utf-8'))
transid = 102  # ID of the next transaction to publish
while True:
    user_input = input("Do you want to add a transaction? (press 'n' to stop): ")
    if user_input.lower() == 'n':
        print("Stopping the transactions")
        break
    atm_choice = input("Which ATM do you want to transact in? 1 or 2: ")
    if atm_choice in ('1', '2'):
        producer.send("bankbranch", {'atmid': int(atm_choice), 'transid': transid})
        producer.flush()  # block until the message has been delivered
        transid += 1
    else:
        print('Invalid ATM number')

producer.close()