# Apache Kafka

![](images/kafka-real.jpg)

Jay Kreps, I am a committer on Apache Kafka

Answered Mar 24, 2014

> I thought that since Kafka was a system optimized for writing using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.


https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system

# Apache Kafka® is a distributed streaming platform

https://kafka.apache.org/intro

![](images/kafka-logo.png)

## Capabilities

## A streaming platform has three key capabilities

### Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.

![](https://d1.awsstatic.com/product-marketing/Messaging/sns_img_topic.e024462ec88e79ed63d690a2eed6e050e33fb36f.png)
https://aws.amazon.com/it/pub-sub-messaging/?sc_channel=sm&sc_campaign=Launch&sc_publisher=TWITTER&sc_country=Global&sc_geo=GLOBAL&sc_outcome=adoption&sc_content=SNS_pub_sub_SEOpage&sc_category=Amazon_Simple%20Notification%20Service&linkId=40048296

### Store streams of records in a fault-tolerant durable way.

![](http://tutorials.jenkov.com/images/data-streaming/data-streaming-storage-2.png)

http://tutorials.jenkov.com/data-streaming/index.html

### Process streams of records as they occur.

![](https://miro.medium.com/max/1400/0*Ud9GAwiHAragiaPv)
https://towardsdatascience.com/introduction-to-stream-processing-5a6db310f1b4

# Applications

### Kafka is generally used for two broad classes of applications

### Building real-time streaming data pipelines that reliably get data between systems or applications

![](https://cdn.confluent.io/wp-content/uploads/streaming_platform_rev-768x343.png)
https://www.confluent.io/blog/the-future-of-etl-isnt-what-it-used-to-be/

### Building real-time streaming applications that transform or react to the streams of data

![](https://www.researchgate.net/profile/Olawande_Daramola/publication/333653951/figure/fig1/AS:767176877277184@1559920629392/Data-flow-graph-of-a-stream-processor-The-figure-shows-how-applications-made-up-of.png)

https://www.researchgate.net/publication/333653951_Big_data_stream_analysis_a_systematic_literature_review

## Concepts

Kafka runs as a cluster on one or more servers that can span multiple datacenters.

![](https://1fykyq3mdn5r21tpna3wkdyi-wpengine.netdna-ssl.com/wp-content/uploads/2016/08/image06.png)
https://eng.uber.com/ureplicator-apache-kafka-replicator/

The Kafka cluster stores streams of records in categories called topics.

![](https://dzone.com/storage/temp/7933597-production-1.png)
https://dzone.com/articles/monitoring-kafka-data-pipeline

Each record consists of a key, a value, and a timestamp.

![](https://miro.medium.com/max/1400/1*4UOYy2WLNt3cQCqMDqCYLA.jpeg)
https://medium.com/swlh/exploit-apache-kafkas-message-format-to-save-storage-and-bandwidth-7e0c533edf26

Messages consist of a variable-length header, a variable-length opaque key byte array, and a variable-length opaque value byte array.

Leaving the key and value opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. 

Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage. 
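As a minimal sketch of what "mandating a serialization type" can look like, assume an application that settles on JSON encoded as UTF-8. The helper names `serialize` and `deserialize` are hypothetical; Kafka itself only ever sees the resulting bytes.

```python
import json


def serialize(value) -> bytes:
    """Turn a Python object into the opaque byte array stored as the record value."""
    return json.dumps(value).encode("utf-8")


def deserialize(raw: bytes):
    """Turn the opaque byte array back into a Python object."""
    return json.loads(raw.decode("utf-8"))


payload = serialize({"number": 7})
print(payload)               # b'{"number": 7}'
print(deserialize(payload))  # {'number': 7}
```

Swapping JSON for another format only changes these two functions; nothing on the Kafka side needs to know.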

Messages (aka Records) are always written in batches. 

The technical term for a batch of messages is a record batch, and a record batch contains one or more records. 

In the degenerate case, we could have a record batch containing a single record. 

Record batches and records have their own headers. The format of each is described below.

```
baseOffset: int64
batchLength: int32
partitionLeaderEpoch: int32
magic: int8 (current magic value is 2)
crc: int32
attributes: int16
    bit 0~2:
        0: no compression
        1: gzip
        2: snappy
        3: lz4
        4: zstd
    bit 3: timestampType
    bit 4: isTransactional (0 means not transactional)
    bit 5: isControlBatch (0 means not a control batch)
    bit 6~15: unused
lastOffsetDelta: int32
firstTimestamp: int64
maxTimestamp: int64
producerId: int64
producerEpoch: int16
baseSequence: int32
records: [Record]
```
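The `attributes` field packs several flags into one 16-bit integer. The following sketch decodes it according to the bit layout listed above; the function name `decode_attributes` and the dictionary keys are illustrative choices, not part of any Kafka client API.

```python
# Compression codec ids, as listed for bits 0~2 of the attributes field
COMPRESSION_CODECS = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4", 4: "zstd"}


def decode_attributes(attributes: int) -> dict:
    """Split the record batch attributes field into its documented flags."""
    codec_id = attributes & 0b111  # bits 0~2
    return {
        "compression": COMPRESSION_CODECS.get(codec_id, f"unknown({codec_id})"),
        "timestamp_type": (attributes >> 3) & 1,            # bit 3
        "is_transactional": bool((attributes >> 4) & 1),    # bit 4
        "is_control_batch": bool((attributes >> 5) & 1),    # bit 5
    }


print(decode_attributes(0b010010))
# {'compression': 'snappy', 'timestamp_type': 0, 'is_transactional': True, 'is_control_batch': False}
```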

## API

![](images/kafka-apis.png)

## Producer

Allows an application to publish a stream of records to one or more Kafka topics.

![](https://camo.githubusercontent.com/62c459e0b8ca569011db620f0b57dd17e75090cf/68747470733a2f2f63646e2e73636f7463682e696f2f31353737352f505250673139393854664f36564b58546561547a5f696c6c757374726174696f6e2e6a7067)

https://github.com/amwaleh/Simple-stream-Kafka

```python
# https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1
# Requires the kafka-python package (see the Requirements section)

from time import sleep
from json import dumps
from kafka import KafkaProducer

# Serialize each value as UTF-8 encoded JSON before it is sent to the broker
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x: dumps(x).encode('utf-8'))

# Publish one record every 5 seconds to the 'numtest' topic
for e in range(1000):
    data = {'number': e}
    producer.send('numtest', value=data)
    sleep(5)
```

## Consumer

Allows an application to subscribe to one or more topics and process the stream of records produced to them.

![](images/kafka-meme.jpg)

```python
from kafka import KafkaConsumer
from json import loads

# Subscribe to 'numtest'; start from the earliest offset when the group
# has no committed offset yet, and decode each value from UTF-8 JSON
consumer = KafkaConsumer(
    'numtest',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: loads(x.decode('utf-8')))

for message in consumer:
    value = message.value
    print('{} read'.format(value))
```

## Streams

Allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.

## Connector

Allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

# Communication

![](images/gimli.gif)

In Kafka the communication between the clients and the servers is done with a:

### simple

![](images/simple.jfif)

### high-performance

![](images/high-performance.jpg)

### language agnostic

![](images/language-agnostic.png)

 TCP protocol

![](images/salemove_tcp_rocks.png)

This protocol is versioned and maintains backwards compatibility with older versions.

We provide a Java client for Kafka, but clients are available in many languages.

![](images/kafka-java.jfif)

# Topic

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

![](images/log_anatomy.png)

## Partition

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
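To make the idea of an append-only log with offsets concrete, here is a toy in-memory model. The `Partition` class is purely illustrative (a real partition lives on disk as segment files), but it shows how every appended record gets the next sequential offset and how a reader addresses records by offset.

```python
class Partition:
    """Toy model of a partition: an append-only list indexed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record at the end of the log and return its offset."""
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read_from(self, offset: int) -> list:
        """Return all records starting at the given offset, in log order."""
        return self._records[offset:]


log = Partition()
print(log.append("M1"))   # 0
print(log.append("M2"))   # 1
print(log.read_from(1))   # ['M2']
```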

## Retention

The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
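The two-day example can be sketched as a simple filter on record age. This is a simplification under stated assumptions: the names `StoredRecord` and `apply_retention` are hypothetical, and a real broker discards whole log segments rather than individual records.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class StoredRecord:
    offset: int
    timestamp_ms: int
    value: str


def apply_retention(records, now_ms, retention_ms):
    """Keep records younger than the retention period, consumed or not."""
    return [r for r in records if now_ms - r.timestamp_ms < retention_ms]


records = [
    StoredRecord(offset=0, timestamp_ms=0, value="old"),
    StoredRecord(offset=1, timestamp_ms=DAY_MS, value="recent"),
]
# Two and a half days after the first record was published
kept = apply_retention(records, now_ms=2 * DAY_MS + DAY_MS // 2, retention_ms=2 * DAY_MS)
print([r.value for r in kept])  # ['recent']
```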

![](images/log_consumer.png)

# Producer

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record).
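Both strategies can be sketched in a few lines. Note the assumption: the key-based variant below uses `zlib.crc32` only as a stand-in hash for illustration; actual Kafka clients use their own hash function, so the partition numbers computed here will not match a real producer.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions: int):
    """Return a function that cycles through partitions to balance load."""
    cycle = itertools.cycle(range(num_partitions))
    return lambda: next(cycle)


def key_partitioner(num_partitions: int):
    """Return a function that maps a key to a fixed partition (stand-in hash)."""
    def choose(key: bytes) -> int:
        return zlib.crc32(key) % num_partitions
    return choose


next_partition = round_robin_partitioner(3)
print([next_partition() for _ in range(5)])  # [0, 1, 2, 0, 1]

partition_for = key_partitioner(4)
# The same key always lands on the same partition, which preserves per-key ordering
print(partition_for(b"user-42") == partition_for(b"user-42"))  # True
```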

# Consumer

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
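The two extremes can be simulated with a small delivery function. This is only a sketch: `deliver` is a hypothetical helper, consumer names are assumed to be unique, and it picks a consumer per record in round-robin order, whereas Kafka actually decides through partition assignment.

```python
import itertools


def deliver(records, groups):
    """Deliver each record to exactly one consumer in every group.

    groups maps a consumer group name to the list of its consumer names.
    Returns a mapping of consumer name to the records it received.
    """
    received = {c: [] for members in groups.values() for c in members}
    pickers = {g: itertools.cycle(members) for g, members in groups.items()}
    for record in records:
        for group in groups:
            received[next(pickers[group])].append(record)
    return received


records = ["r0", "r1", "r2", "r3"]

# Same group: records are load balanced over the instances
print(deliver(records, {"workers": ["w1", "w2"]}))
# {'w1': ['r0', 'r2'], 'w2': ['r1', 'r3']}

# Different groups: every record is broadcast to all consumers
print(deliver(records, {"audit": ["a1"], "billing": ["b1"]}))
# {'a1': ['r0', 'r1', 'r2', 'r3'], 'b1': ['r0', 'r1', 'r2', 'r3']}
```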

![](images/consumer-groups.png)

A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

# Guarantees

![](images/garantee.jfif)

Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

A consumer instance sees records in the order they are stored in the log.

For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
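The replication guarantee boils down to "at least one replica must survive". A minimal sketch, assuming hypothetical broker ids and ignoring details such as which replicas are in sync at the time of failure:

```python
def records_survive(replica_brokers, failed_brokers) -> bool:
    """True if at least one broker holding a replica is still alive."""
    return bool(set(replica_brokers) - set(failed_brokers))


replicas = [1, 2, 3]                        # replication factor N = 3
print(records_survive(replicas, [1, 2]))     # True: N-1 failures are tolerated
print(records_survive(replicas, [1, 2, 3]))  # False: all N replicas are gone
```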

# Kafka as a Messaging System

Messaging traditionally has two models: queuing and publish-subscribe. 

In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers.

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

Kafka does it better. By having a notion of parallelism (the partition) within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note, however, that a consumer group cannot usefully have more consumer instances than partitions: any extra instances stay idle.
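The assignment step can be sketched as follows, reusing the four partitions P0-P3 from the figure above. The function `assign_partitions` is a hypothetical round-robin assignment written for illustration; real clients ship several assignment strategies that may distribute partitions differently.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer of the group."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

print(assign_partitions(partitions, ["A1", "A2"]))
# {'A1': ['P0', 'P2'], 'A2': ['P1', 'P3']}

# A fifth consumer has no partition left to read from
print(assign_partitions(partitions, ["B1", "B2", "B3", "B4", "B5"]))
# {'B1': ['P0'], 'B2': ['P1'], 'B3': ['P2'], 'B4': ['P3'], 'B5': []}
```

Because each partition has a single reader within the group, that reader sees the partition's records in offset order.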

# Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

# Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
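A stripped-down version of the retail example might look like the sketch below. Everything here is an assumption made for illustration: the event layout, the thresholds, and the `reorder_processor` name; price adjustments are left out, and plain Python iterables stand in for the input and output topics.

```python
from collections import defaultdict


def reorder_processor(events, reorder_below=5, reorder_quantity=20):
    """Consume sale and shipment events and emit reorder events.

    Stock per item is the state kept between events.
    """
    stock = defaultdict(int)
    for event in events:
        item, quantity = event["item"], event["quantity"]
        if event["type"] == "shipment":
            stock[item] += quantity
        elif event["type"] == "sale":
            stock[item] -= quantity
            if stock[item] < reorder_below:
                yield {"type": "reorder", "item": item, "quantity": reorder_quantity}


input_stream = [
    {"type": "shipment", "item": "widget", "quantity": 10},
    {"type": "sale", "item": "widget", "quantity": 3},   # stock 7, nothing emitted
    {"type": "sale", "item": "widget", "quantity": 4},   # stock 3, reorder emitted
]
print(list(reorder_processor(input_stream)))
# [{'type': 'reorder', 'item': 'widget', 'quantity': 20}]
```

Written as a generator, the processor handles one event at a time and never needs the whole input up front, which is the essential property of a stream processor.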

It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

# Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise, for streaming data pipelines, the subscription to real-time events makes it possible to use Kafka for very low-latency pipelines, while the ability to store data reliably makes it possible to use it for critical data where delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of maintenance. The stream processing facilities make it possible to transform data as it arrives.

![](https://miro.medium.com/max/400/0*eXVQRi1zQ60WkmVz)

# Quick Start (with Docker)

```DockerFile
FROM openjdk:8-jre-alpine
LABEL maintainer="Salvo Nicotra"
ENV PATH /opt/kafka/bin:$PATH
ENV KAFKA_DIR "/opt/kafka"
ARG KAFKA_VERSION="2.13-2.7.0"

RUN apk update && apk add --no-cache bash gcompat

# Installing Kafka
# ADD will automatically extract the file
ADD setup/kafka_${KAFKA_VERSION}.tgz /opt

# Create Sym Link 
RUN ln -s /opt/kafka_${KAFKA_VERSION} ${KAFKA_DIR} 

ADD kafka-manager.sh ${KAFKA_DIR}/bin/kafka-manager
# Copy All conf here
ADD conf/* ${KAFKA_DIR}/config/

ENTRYPOINT [ "kafka-manager" ]
```

# Wrapper

```bash
#!/bin/bash
set -v
ZK_DATA_DIR=/tmp/zookeeper
ZK_SERVER="localhost"
KAFKA_TOPIC="${KAFKA_TOPIC:-tap}" # Use the topic passed via the environment, default to "tap"
[[ -z "${KAFKA_ACTION}" ]] && { echo "KAFKA_ACTION required"; exit 1; }
[[ -z "${KAFKA_DIR}" ]] && { echo "KAFKA_DIR missing"; exit 1; }
# ACTIONS start-zk, start-kafka, create-topic, 

echo "Running action ${KAFKA_ACTION} (Kafka Dir:${KAFKA_DIR}, ZK Server: ${ZK_SERVER})"
case ${KAFKA_ACTION} in
"start-zk")
echo "Starting ZK"
mkdir -p ${ZK_DATA_DIR}; # Data dir is setup in conf/zookeeper.properties
cd ${KAFKA_DIR}
zookeeper-server-start.sh config/zookeeper.properties
;;
"start-kafka")
cd ${KAFKA_DIR}
kafka-server-start.sh config/server.properties
;;
"create-topic")
cd ${KAFKA_DIR}
kafka-topics.sh --create --zookeeper 10.0.100.22:2181 --replication-factor 1 --partitions 1 --topic ${KAFKA_TOPIC}
;;
"producer")
cd ${KAFKA_DIR}
#bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
kafka-console-producer.sh --broker-list 10.0.100.23:9092 --topic ${KAFKA_TOPIC}
;;
"consumer")
cd ${KAFKA_DIR}
kafka-console-consumer.sh --bootstrap-server 10.0.100.23:9092 --topic ${KAFKA_TOPIC} --from-beginning
;;
"connect-standalone")
cd ${KAFKA_DIR}
#connect-standalone-twitter.properties mysqlSinkTwitter.conf
touch /tmp/my-test.txt
bin/connect-standalone.sh config/${KAFKA_WORKER_PROPERTIES} config/${KAFKA_CONNECTOR_PROPERTIES}  
;;
esac
```

<center>
    <h1>Apache ZooKeeper</h1>
    <img src="images/Apache_ZooKeeper_Logo.svg"/>
</center>

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. 

All of these kinds of services are used in some form or another by distributed applications. 

Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable.

Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage.

Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

# Origin of the Name

> ZooKeeper was developed at Yahoo! Research. We had been working on ZooKeeper for a while and pitching it to other groups, so we needed a name. At the time, the group had been working with the Hadoop team and had started a variety of projects with the names of animals, Apache Pig being the most well known. As we were talking about different possible names, one of the group members mentioned that we should avoid another animal name because our manager thought it was starting to sound like we lived in a zoo. That is when it clicked: distributed systems are a zoo. They are chaotic and hard to manage, and ZooKeeper is meant to keep them under control.

> The cat on the book cover is also appropriate, because an early article from Yahoo! Research about ZooKeeper described distributed process management as similar to herding cats. ZooKeeper sounds much better than CatHerder, though.

https://www.oreilly.com/library/view/zookeeper/9781449361297/ch01.html

## Configuration

## Standalone

conf/zoo.cfg 

- tickTime : the basic time unit in milliseconds used by ZooKeeper. It is used to do heartbeats and the minimum session timeout will be twice the tickTime.

- dataDir : the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.

- clientPort : the port to listen for client connections

```properties
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
```
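The relationship between `tickTime` and the minimum session timeout can be checked with a few lines of Python. The `parse_properties` helper is a hypothetical minimal parser written for this sketch, not a ZooKeeper tool.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


ZOO_CFG = """\
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
"""

config = parse_properties(ZOO_CFG)
tick_ms = int(config["tickTime"])
min_session_timeout_ms = 2 * tick_ms  # minimum session timeout is twice the tickTime
print(min_session_timeout_ms)  # 4000
```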

## Start ZooKeeper in Docker

```bash
docker stop kafkaZK
docker container rm kafkaZK
docker build kafka/ --tag tap:kafka
docker run -e KAFKA_ACTION=start-zk --network tap --ip 10.0.100.22  -p 2181:2181 --name kafkaZK -it tap:kafka
```

## Zookeeper UI

## Zk Cli

Open a shell in the kafkaZK container, then run:
```bash
cd /opt/kafka/bin
./zookeeper-shell.sh 10.0.100.22:2181
ls /
```

## Zk UI

https://github.com/juris/docker-zkui

```bash
docker run -d --name zkui -p 9090:9090 --network tap -e ZK_SERVER=kafkaZK:2181 juris/zkui
```


## Pretty Zoo



https://github.com/vran-dev/PrettyZoo

![](https://github.com/vran-dev/PrettyZoo/blob/master/release/img/icon.png?raw=true)

![](https://www.askideas.com/media/48/Please-Wait-While-The-Wizard-Installs-The-Software-Funny-Technology-Meme-Picture.jpg)

### Shall we put it in Docker?

https://betterprogramming.pub/running-desktop-apps-in-docker-43a70a5265c4

https://medium.com/@SaravSun/running-gui-applications-inside-docker-containers-83d65c0db110

# Kafka Server

## Configuration

```properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended for to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=10.0.100.22:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
```
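Several of the values above are easier to judge once converted to everyday units. The snippet below does the arithmetic for a handful of settings copied from the file; the `SETTINGS` dictionary is just a convenience for this sketch.

```python
# Values copied from the server.properties listing above
SETTINGS = {
    "log.retention.hours": 168,
    "log.segment.bytes": 1073741824,
    "socket.request.max.bytes": 104857600,
    "log.retention.check.interval.ms": 300000,
}

retention_days = SETTINGS["log.retention.hours"] / 24
segment_gib = SETTINGS["log.segment.bytes"] / 2**30
max_request_mib = SETTINGS["socket.request.max.bytes"] / 2**20
check_minutes = SETTINGS["log.retention.check.interval.ms"] / 60000

print(retention_days)   # 7.0  -> records are kept for one week
print(segment_gib)      # 1.0  -> a new segment file every 1 GiB
print(max_request_mib)  # 100.0 -> largest accepted request is 100 MiB
print(check_minutes)    # 5.0  -> retention is checked every 5 minutes
```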

## Start Kafka Server in Docker

```bash
#!/usr/bin/env bash
# Stop
docker stop kafkaServer

# Remove previous container
docker container rm kafkaServer
# Optional
docker build kafka/ --tag tap:kafka
# Start
docker run -e KAFKA_ACTION=start-kafka --network tap --ip 10.0.100.23  -p 9092:9092 --name kafkaServer -it tap:kafka
```

# Kafka UI

Several are available, for example:
https://www.conduktor.io/

# Topic

## Create Topic in Docker

```bash
#!/usr/bin/env bash
# Stop
docker stop kafkaTopic
# Remove previous container
docker container rm kafkaTopic

docker build kafka/ --tag tap:kafka
docker run -e KAFKA_ACTION=create-topic -e KAFKA_SERVER=10.0.100.23 -e KAFKA_TOPIC=tap --network tap --ip 10.0.100.24 --name kafkaTopic -it tap:kafka
```

# Producer Console

## Start Producer Console in Docker 

```bash
#!/usr/bin/env bash
docker build ../kafka/ --tag tap:kafka
docker run -e KAFKA_ACTION=producer -e KAFKA_TOPIC=tap --network tap  -it tap:kafka
```

# Consumer Console

## Start Consumer Console in Docker 

```bash
#!/usr/bin/env bash
docker build ../kafka/ --tag tap:kafka
docker run -e KAFKA_ACTION=consumer -e KAFKA_TOPIC=tap --network tap   -it tap:kafka
```

# Kafka Real World Example

https://www.confluent.io/blog/how-kafka-is-used-by-netflix/


# Kafka Hello World

## Python


## DockerFile

```DockerFile
FROM python
ENV PATH /usr/src/app/bin:$PATH
WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY bin/* ./
COPY python-manager.sh /
ENTRYPOINT [ "/python-manager.sh" ]
```

## Requirements
List of Python packages installed automatically in the Docker image

```
kafka-python
```

## python-manager

```bash
#!/bin/bash
[[ -z "${PYTHON_APP}" ]] && { echo "PYTHON_APP required"; exit 1; }
PYTHON_DIR="/usr/src/app/"

echo "Running python ${PYTHON_APP} (Python Dir:${PYTHON_DIR})"
cd "${PYTHON_DIR}"
python "${PYTHON_APP}"
```

# Let's run


- Start Zk
- Start Kafka Server
- Start Producer kafkaPython10linesProducer.sh
- Start Consumer kafkaPython10linesConsumer.sh

- Another Consumer
```bash
docker run --network tap -e PYTHON_APP=tap10linesConsumer.py --name kafkaPython10linesConsumer2 -it tap:python
```

# Let's Play!

# Flume - Kafka

```bash
./flumeTwitterKafka.sh

./kafkaCreateConsumer.sh
```

