## Sending data to a Kafka server

This notebook uses the [Python client for the Apache Kafka distributed stream processing system](http://kafka-python.readthedocs.io/en/master/index.html) to send messages to a Kafka server. 

* Sensor data is available from https://uv.ulb.ac.be/pluginfile.php/923479/course/section/165902/data.conv.txt.gz
* Sensor location is available from https://uv.ulb.ac.be/pluginfile.php/923479/course/section/165902/mote_locs.txt

In this example, Kafka is used to send messages containing the temperature data of sensor 1, from 28/02 to 07/03.

You need Kafka and Zookeeper servers running to execute this notebook. If you use the Docker course container, or work on the course cluster, these servers should already be running. Otherwise, you can start them on your machine with

```
nohup $KAFKA_PATH/bin/zookeeper-server-start.sh $KAFKA_PATH/config/zookeeper.properties  > $HOME/zookeeper.log 2>&1 &
nohup $KAFKA_PATH/bin/kafka-server-start.sh $KAFKA_PATH/config/server.properties > $HOME/kafka.log 2>&1 &
```

where `KAFKA_PATH` points to the folder containing Kafka. See https://kafka.apache.org/quickstart for how to install Kafka on your machine. 


### General import

In [1]:
from kafka import KafkaProducer
import time
import numpy as np

### Load measurements, sort by Date/Time, add relative number of seconds since beginning

In [2]:
import pandas as pd

#Takes about one minute to load
data=pd.read_csv("../../data/data.conv.txt.gz",header=None,sep=" ")
data.columns=["Date","Hour","Sensor","Value","Voltage"]
data=data.sort_values(['Date','Hour']).reset_index(drop=True)

In [3]:
data['datetime']=pd.to_datetime(data.Date+' '+data.Hour)
data['relative_datetime']=data['datetime']-data['datetime'][0]
data['seconds']=data['relative_datetime'].dt.total_seconds()
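
The relative time computed above can be illustrated without pandas. The following is a minimal sketch using only the standard library; the sample timestamps and the `YYYY-MM-DD HH:MM:SS.ffffff` layout are assumptions made for illustration, not values read from the data file, and `relative_seconds` is a hypothetical helper.

```python
from datetime import datetime

# Hypothetical timestamps; the layout is an assumption about how the
# Date and Hour columns look once joined with a space.
raw = [
    "2017-02-28 00:00:00.000000",
    "2017-02-28 00:00:00.007000",
    "2017-02-28 00:01:19.124000",
]

def relative_seconds(timestamps, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Return the seconds elapsed since the first timestamp of the list."""
    parsed = [datetime.strptime(t, fmt) for t in timestamps]
    origin = parsed[0]
    return [(t - origin).total_seconds() for t in parsed]

seconds = relative_seconds(raw)
print(seconds)
```

As in the notebook, the first measurement defines time zero, so this only gives meaningful offsets if the measurements are sorted by date and time first.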

In [4]:
sensorId_type=data.Sensor.str.split("-",expand=True)
sensorId_type.columns=['SensorId','Type']
data['SensorId']=sensorId_type['SensorId'].astype(int)
data['Type']=sensorId_type['Type'].astype(int)
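
For a single value, the split performed above amounts to the following sketch. The sample strings are inferred from the parsed output shown further down (for instance sensor 41 with type 0), and `split_sensor` is a hypothetical helper, not part of the notebook.

```python
def split_sensor(sensor):
    """Split a 'SensorId-Type' string such as '41-0' into two integers."""
    sensor_id, measurement_type = sensor.split("-")
    return int(sensor_id), int(measurement_type)

# Sample values shaped like the Sensor column
samples = ["41-0", "44-0", "41-1", "44-2"]
parsed = [split_sensor(s) for s in samples]
print(parsed)
```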


In [5]:
#Drop features not needed for the simulation
data=data.drop(['datetime','relative_datetime','Sensor','Date','Hour','Voltage'],axis=1)

In [6]:
data[:5]

Unnamed: 0,Value,seconds,SensorId,Type
0,17.6364,0.0,41,0
1,16.6956,0.007,44,0
2,45.7037,0.092,41,1
3,2.3,0.237,44,2
4,47.9942,0.285,44,1


### Select temperature data from sensor 1

In [7]:
temp=data[(data.SensorId==1) & (data.Type==0)]
temp=temp.reset_index(drop=True)

In [8]:
temp[:3]

Unnamed: 0,Value,seconds,SensorId,Type
0,19.2436,79.124,1,0
1,19.224,169.155,1,0
2,19.2142,200.931,1,0


In [9]:
temp[-3:]

Unnamed: 0,Value,seconds,SensorId,Type
43044,122.153,2695791.517,1,0
43045,121.526,2695821.42,1,0
43046,121.997,2696386.581,1,0
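
The `seconds` column makes it easy to tell which relative day a measurement belongs to. As a rough sketch (the helper name `relative_day` is an assumption), the first and last rows shown above fall on relative days 0 and 31, so the eight days streamed later in this notebook cover only the beginning of the recording.

```python
SECONDS_PER_DAY = 86400

def relative_day(seconds):
    """Return the zero-based relative day in which a 'seconds' offset falls."""
    return int(seconds // SECONDS_PER_DAY)

first = relative_day(79.124)        # first temperature row of sensor 1
last = relative_day(2696386.581)    # last temperature row of sensor 1
print(first, last)
```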


### Create Kafka producer

In [10]:
producer = KafkaProducer(bootstrap_servers='kafka1:19092,kafka2:29092,kafka3:39092')

NoBrokersAvailable: NoBrokersAvailable
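
A `NoBrokersAvailable` error like the one above usually suggests that none of the bootstrap addresses could be contacted, for example because the servers are not running or the host names do not resolve from your machine. The sketch below is one possible way to check this before creating the producer. It uses only the standard library; `parse_bootstrap` and `reachable` are hypothetical helpers, not part of the Kafka client.

```python
import socket

def parse_bootstrap(servers):
    """Split 'host1:port1,host2:port2' into a list of (host, port) tuples."""
    pairs = []
    for entry in servers.split(","):
        host, port = entry.strip().rsplit(":", 1)
        pairs.append((host, int(port)))
    return pairs

def reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unknown host names
        return False

brokers = parse_bootstrap("kafka1:19092,kafka2:29092,kafka3:39092")
print(brokers)
```

Calling `reachable(host, port)` for each pair in `brokers` shows which addresses accept connections. An open port only indicates that something is listening there; it does not prove that it is a working Kafka broker.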

### Stream data

We simulate the streaming of data by sending, every ten seconds (the value of `interval`), the set of measurements collected during one day. This speeds up the simulation: the 8 days from 28/02/2017 to 7/03/2017 take 8*10=80 seconds.
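
The timing of the simulation can be sketched as follows, assuming `interval=10` as in the cell below. The two helper functions are hypothetical names introduced for illustration: the first one gives the initial wait that aligns the sender with a multiple of `interval`, the second one gives the pause left after a day of data has been sent.

```python
def seconds_until_next_boundary(now, interval):
    """Seconds to wait so that sending starts on a multiple of 'interval'."""
    return interval - now % interval

def remaining_sleep(interval, elapsed):
    """Time left in the current interval, never negative."""
    return max(0.0, interval - elapsed)

interval = 10
n_days = 8

print(seconds_until_next_boundary(1234.5, interval))
print(remaining_sleep(interval, 0.5))
print(n_days * interval)  # total duration of the simulation in seconds
```

Clamping the pause at zero matters when sending one day of data takes longer than `interval`, because `time.sleep` raises a `ValueError` for negative durations.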


In [None]:
interval=10

#Start at relative day 0 (2017-02-28)
day=0

#To synchronize with the receiver (for the sake of the simulation), start at a time that is a multiple of 'interval' seconds
current_time=time.time()
time_to_wait=interval-current_time%interval
time.sleep(time_to_wait)

#Loop sending messages to the Kafka topic 'persistence'
for day in range(0,8):
    
    time_start=time.time()
    
    #Select sensor measurements for the corresponding relative day
    data_current_day=temp[(temp.seconds>=day*86400) & (temp.seconds<(day+1)*86400)]
    data_current_day=data_current_day.dropna()
    #For all measurements in that day
    for i in range(len(data_current_day)):
        #Get data
        current_data=list(data_current_day.iloc[i])
        #Transform list to string
        message=str(current_data)
        #Send
        producer.send('persistence',message.encode())
    
    time_to_send=time.time()-time_start
    print("Time to send "+str(len(data_current_day))+" measurements (day "+str(day)+") : "+str(time_to_send))

    #Wait for the rest of the interval (no wait if sending took longer than 'interval')
    time.sleep(max(0,interval-time_to_send))
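
Each message sent by the loop above is the string form of a Python list, encoded to bytes. The sketch below shows how a receiver might turn such a message back into numbers. The sample row is an assumption based on the first temperature measurement of sensor 1, and the JSON variant is an alternative format, not what this notebook sends.

```python
import ast
import json

# One measurement in the column order Value, seconds, SensorId, Type
# (sample values, assumed for illustration)
row = [19.2436, 79.124, 1.0, 0.0]

# Same encoding as in the sending loop: string form of the list, as bytes
payload = str(row).encode()

# On the receiving side, parse the text back into a list of numbers
decoded = ast.literal_eval(payload.decode())

# Alternative (an assumption, not used in this notebook): JSON with named fields
fields = ["Value", "seconds", "SensorId", "Type"]
json_payload = json.dumps(dict(zip(fields, row))).encode()
json_decoded = json.loads(json_payload.decode())

print(decoded)
print(json_decoded["Value"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose. If you prefer the JSON format, `KafkaProducer` accepts a `value_serializer` argument that can perform the encoding for every message.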