In [None]:
#| label: import
#| echo: false
#| include: true
#| code-fold: false

## General imports
import warnings
warnings.filterwarnings('ignore')

## Data manipulation imports
import pandas as pd
import numpy as np

## Display imports
from IPython.display import display, Markdown

## Plot imports
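import matplotlib.style as style
import matplotlib.pyplot as plt
import seaborn as sns

# set_theme() resets style and context, so whitegrid, the paper context,
# and the font scale are passed to it directly. font_scale is a keyword
# argument, not an rc key.
sns.set_theme(context="paper", style="whitegrid", font_scale=1.25)

# Default figure size, set after the theme is applied
plt.rcParams['figure.figsize'] = (5, 5 / 2.5)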
height = plt.rcParams['figure.figsize'][0]
aspect = plt.rcParams['figure.figsize'][0] / plt.rcParams['figure.figsize'][1] / 2

# Introduction 
The name Kafka can refer to two different things: a writer and a software platform. If you are thinking of Franz Kafka, the German-speaking Bohemian novelist and short-story writer based in Prague, this article may not be the right fit for you, although you can read more about him [here](https://en.wikipedia.org/wiki/Franz_Kafka). We are going to talk about Kafka, the software platform.

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications^[https://kafka.apache.org/].

Why is it important? Kafka lets you publish and subscribe to streams of records and store those streams in a fault-tolerant, durable way. It is also fast, scalable, and distributed by design^[https://kafka.apache.org/intro]. In my current work, Kafka is used to store data from different sources and then process it.

Yes, I am talking about 🚂. A modern train is a complex system consisting of many different components (e.g. bogies; doors; heating and ventilation; air compressors; toilets; entertainment and information systems; and many more). The data from these components are processed by control units and sent landside to servers, where they are stored.
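To make the publish/subscribe idea a bit more concrete, here is a minimal in-memory sketch in plain Python. It is not the Kafka client API: the class `InMemoryBroker`, the topic name `train-telemetry`, and the component readings are all hypothetical and only illustrate the idea of an append-only log that several readers can consume independently.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a broker: one append-only list of records per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, key, value):
        """Append a record to the topic and return its offset."""
        log = self._topics[topic]
        log.append((key, value))
        return len(log) - 1

    def read(self, topic, offset=0):
        """Return all records of the topic starting at the given offset."""
        return self._topics[topic][offset:]


broker = InMemoryBroker()

# Hypothetical train components publish their readings to one topic.
broker.publish("train-telemetry", "door-3", {"state": "closed"})
broker.publish("train-telemetry", "hvac-1", {"temperature_c": 21.5})
last_offset = broker.publish("train-telemetry", "compressor-2", {"pressure_bar": 8.9})

# Two independent subscribers read the same log at their own pace.
all_records = broker.read("train-telemetry")            # from the beginning
new_records = broker.read("train-telemetry", offset=2)  # from offset 2 onwards
```

Reading does not remove records from the log, so each subscriber can keep track of its own offset. A real Kafka cluster adds partitioning, replication, and persistence on top of this basic idea.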

In this tutorial we are going to review and reproduce the steps of the [Confluent Kafka Tutorial](https://docs.confluent.io/platform/current/platform-quickstart.html). Confluent is a company founded by the creators of Apache Kafka. It offers a comprehensive platform built around Kafka that facilitates real-time data integration, processing, and streaming. Confluent's solutions make it easier for businesses to build and manage event-driven architectures at scale^[https://developer.confluent.io/]. The tutorial contains the following steps:  

1. Install and run Confluent Platform and Apache Kafka®.  
1. Generate real-time mock data.  
1. Create topics to store your data.  
1. Create real-time streams on your data.  
1. Query and join streams with SQL statements.  
1. Build a view that updates as new events arrive.  
1. Visualize the topology of your streaming app.  

The aim of reproducing these steps is to get a better understanding of Kafka, its components and how to get started with it.


# Installation




For the setup of our environment we use Docker. In particular, we use the `docker-compose` file provided for Confluent Platform.