# Kafka Producer

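This notebook showcases a Kafka producer written in Python. It simulates a "dumb" producer: each line of the source file is sent as is, with no effort to parse it into structured JSON fields. The serializer only wraps the raw line in a single-key JSON object before it goes on the wire.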


## Setup

In [1]:
# Install the required packages
!pip install kafka-python



In [2]:
# Download the source data
!gsutil cp gs://pandora-sde-case/data.csv /home/febiyan_rachman/

Copying gs://pandora-sde-case/data.csv...
- [1 files][ 82.2 MiB/ 82.2 MiB]                                                
Operation completed over 1 objects/82.2 MiB.                                     


In [3]:
from time import sleep
from json import dumps
from itertools import islice
from kafka import KafkaProducer

## Set The Producer Configuration

The producer is currently set to send 1000 records every 5 seconds. Consecutive records in the data are 20 seconds apart, but to experiment with the data quickly, I don't want to wait 20 seconds for every micro-batch.

In [4]:
read_file_name = "/home/febiyan_rachman/data.csv"
read_num_lines = 1000 # Send 1000 lines at a time
read_sleep_time = 5  # We can change this to simulate 20 seconds data submission at a time
kafka_bootstrap_server = "localhost:9092"
kafka_topic = "readings_raw"
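As a rough check on what these settings mean, the sketch below works out how much source data one batch covers and how much faster than real time the replay runs. It assumes every pair of consecutive records is exactly 20 seconds apart and ignores the time spent sending; `record_gap_seconds`, `data_seconds_per_batch`, and `speedup` are illustrative names that are not part of the notebook.

```python
# Values copied from the configuration cell above
read_num_lines = 1000   # records sent per batch
read_sleep_time = 5     # seconds to wait between batches

# Assumption: consecutive records in data.csv are exactly 20 seconds apart
record_gap_seconds = 20

# Seconds of source data covered by a single batch
data_seconds_per_batch = read_num_lines * record_gap_seconds

# How much faster than real time the replay runs (send time ignored)
speedup = data_seconds_per_batch / read_sleep_time

print(f"Each batch covers {data_seconds_per_batch} s of source data")
print(f"Replay speed-up: {speedup:.0f}x")
```

Setting `read_num_lines = 1` and `read_sleep_time = 20` would bring the replay back to roughly real time under the same assumption.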

## Define Producer

In [5]:
producer = KafkaProducer(
  bootstrap_servers=[kafka_bootstrap_server],
  value_serializer=lambda x: dumps(x).encode('utf-8')
)
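To see what actually goes on the wire, the serializer can be run on its own without a broker. The sketch below reuses the same lambda; the sample line is made up, since the real column layout of `data.csv` is not shown in this notebook, and the decoding step is only one way a consumer might reverse it.

```python
from json import dumps, loads

# Same serializer as the one passed to KafkaProducer above
value_serializer = lambda x: dumps(x).encode('utf-8')

# Hypothetical CSV line; the real columns of data.csv may differ
sample_line = "2020-01-01 00:00:00,sensor-1,21.5\n"

# The producer wraps the raw line in a single-key dict before serializing
payload = value_serializer({"data": sample_line.replace('\n', '')})
print(payload)  # b'{"data": "2020-01-01 00:00:00,sensor-1,21.5"}'

# A consumer could reverse the step like this
decoded = loads(payload.decode('utf-8'))
print(decoded["data"])  # 2020-01-01 00:00:00,sensor-1,21.5
```

The CSV fields stay unparsed inside the `data` string, so any splitting or typing is left to whatever consumes the topic.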

## Produce Until The End of Lines

In [None]:
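with open(read_file_name, 'r') as file:
    print("Sending data.")
    while True:
        lines = list(islice(file, read_num_lines))
        # Stop once the file is exhausted; this check must sit outside the
        # for loop, which never runs when the batch is empty
        if not lines:
            break
        for line in lines:
            # Send lines as is
            producer.send(kafka_topic, value={
                "data": line.replace('\n', '')  # Don't bring the \n to the stream
            })
        # Sleep before sending the next batch
        sleep(read_sleep_time)
    # Make sure buffered records are delivered before reporting completion
    producer.flush()
    print("Done.")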

Sending data.
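The read-in-batches pattern can be tried without Kafka by swapping the file for an in-memory buffer. This is a minimal sketch: `iter_batches` is a hypothetical helper, not something defined in this notebook, and the five-line buffer stands in for `data.csv`. It ends as soon as `islice` returns an empty list.

```python
from io import StringIO
from itertools import islice

def iter_batches(lines, batch_size):
    """Yield lists of up to batch_size items until the input is exhausted."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            # An empty batch means there is nothing left to read
            return
        yield batch

# Hypothetical five-line stand-in for data.csv
fake_file = StringIO("r1\nr2\nr3\nr4\nr5\n")

batches = [
    [line.replace('\n', '') for line in batch]
    for batch in iter_batches(fake_file, 2)
]
print(batches)  # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
```

The last batch is shorter than the rest when the line count is not a multiple of the batch size.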
