# Streaming with Kafka and Spark

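This notebook implements a basic streaming pipeline for the project: measurements are published to a Kafka topic, processed in batches with Spark Structured Streaming, and the results are written back to a second Kafka topic.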

## Get Kafka and Spark ready

### Standalone cluster deployment

We first initialize the required environment variables with `findspark.init()`, passing the path to the Spark folder we downloaded previously.

In [None]:
import findspark
findspark.init('/home/hilario/PoD/2ndSemester/MAPD-B/Data-processing/Labs/Spark_Migliorini/spark/spark-3.1.2-bin-hadoop3.2/')
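The hard-coded path above ties the notebook to one machine. As a minimal sketch, assuming the `SPARK_HOME` environment variable may already be set on other machines, the path could be resolved like this (the helper name `resolve_spark_home` is hypothetical, not part of `findspark`):

```python
import os

def resolve_spark_home(default_path, env=None):
    """Return SPARK_HOME from the environment if it is set, else default_path."""
    env = os.environ if env is None else env
    return env.get("SPARK_HOME") or default_path

# Usage (the path is a placeholder, not the one used in this notebook):
# import findspark
# findspark.init(resolve_spark_home("/path/to/spark-3.1.2-bin-hadoop3.2"))
```

The explicit `env` argument only exists to make the helper easy to check without touching the real environment.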

First we need to start the master. This spins up the Spark master at `spark://localhost:7077` and a cluster dashboard at `localhost:8080`.

We can then create a worker and attach it to the master.

In [2]:
%%script bash --no-raise-error

# start master 
$SPARK_HOME/sbin/start-master.sh --host localhost \
    --port 7077 --webui-port 8080
    
# start worker
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 \
    --cores 4 --memory 2g



starting org.apache.spark.deploy.master.Master, logging to /home/hilario/PoD/2ndSemester/MAPD-B/Data-processing/Labs/Spark_Migliorini/spark/spark-3.1.2-bin-hadoop3.2//logs/spark-hilario-org.apache.spark.deploy.master.Master-1-Hilario-zenbook.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/hilario/PoD/2ndSemester/MAPD-B/Data-processing/Labs/Spark_Migliorini/spark/spark-3.1.2-bin-hadoop3.2//logs/spark-hilario-org.apache.spark.deploy.worker.Worker-1-Hilario-zenbook.out
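The start scripts return immediately and only point to a log file, so it can be useful to confirm that the master is actually accepting connections before creating the session. A minimal sketch using only the standard library (the helper name `is_listening` is an assumption for illustration):

```python
import socket

def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage with the ports chosen above:
# print("master :", is_listening("localhost", 7077))
# print("web UI :", is_listening("localhost", 8080))
```

This only checks that something is listening on the port; it does not verify that the process is a Spark master.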


## Create the spark session

We can now create the Spark session. With the following command we ask the master (which also acts as resource manager) to create an application with the required resources and configuration. Apart from loading the Kafka connector package, we use all the default options.

In [3]:
from pyspark.sql import SparkSession

# The Kafka connector version is set to match the Spark distribution
# initialized above (spark-3.1.2, Scala 2.12).
spark = SparkSession.builder \
    .master("spark://localhost:7077")\
    .appName("Spark structured streaming application")\
    .config("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2")\
    .getOrCreate()


In [4]:
spark

## KAFKA

In [5]:
KAFKA_HOME = '/home/hilario/PoD/2ndSemester/MAPD-B/Data-processing/Labs/kafka_pazzini/kafka_2.13-2.7.0'
KAFKA_BOOTSTRAP_SERVERS = 'localhost:9092'

In [None]:
# ZooKeeper and the Kafka broker are started outside the notebook, from KAFKA_HOME,
# because both commands run in the foreground and would block the cell.

# Start ZooKeeper
# bin/zookeeper-server-start.sh config/zookeeper.properties
# import os
# os.system('{0}/bin/zookeeper-server-start.sh {0}/config/zookeeper.properties'.format(KAFKA_HOME))

# Start one Kafka broker
# bin/kafka-server-start.sh config/server.properties
# os.system('{0}/bin/kafka-server-start.sh {0}/config/server.properties'.format(KAFKA_HOME))
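Both start scripts run in the foreground by default, so calling them from a notebook cell blocks it. One option, sketched here under the assumption that running the services in the background is acceptable, is the `-daemon` flag that both scripts accept. The helper `kafka_start_commands` is hypothetical and only builds the command lines:

```python
import os

def kafka_start_commands(kafka_home):
    """Build the argv lists that start ZooKeeper and one Kafka broker as daemons."""
    def script(name):
        return os.path.join(kafka_home, "bin", name)

    def config(name):
        return os.path.join(kafka_home, "config", name)

    return [
        [script("zookeeper-server-start.sh"), "-daemon", config("zookeeper.properties")],
        [script("kafka-server-start.sh"), "-daemon", config("server.properties")],
    ]

# To launch (ZooKeeper first, then the broker):
# import subprocess
# for argv in kafka_start_commands(KAFKA_HOME):
#     subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to inspect the commands before anything is started.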

### Create the topics for kafka

In [6]:
from kafka.admin import KafkaAdminClient, NewTopic
kafka_admin = KafkaAdminClient(
        bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
    )

# Topic that receives the raw measurements
new_topic_a = NewTopic(name='Experiment_measurements', 
                       num_partitions=1, 
                       replication_factor=1)

# Topic that receives the processed results (hit counts after cleaning)
new_topic_b = NewTopic(name='results', 
                       num_partitions=1, 
                       replication_factor=1)

kafka_admin.create_topics(new_topics=[new_topic_a,new_topic_b])


CreateTopicsResponse_v3(throttle_time_ms=0, topic_errors=[(topic='Experiment_measurements', error_code=0, error_message=None), (topic='results', error_code=0, error_message=None)])

In [7]:
kafka_admin.list_topics()

['results', 'Experiment_measurements']
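Re-running the topic-creation cell fails once the topics exist, since kafka-python raises `TopicAlreadyExistsError` in that case. A minimal sketch of one way to make the cell re-runnable is to create only the topics that are missing (the helper `missing_topics` is hypothetical):

```python
def missing_topics(existing, wanted):
    """Return the names in `wanted` that are not in `existing`, preserving order."""
    existing = set(existing)
    return [name for name in wanted if name not in existing]

# Usage with the admin client defined above:
# to_create = missing_topics(kafka_admin.list_topics(),
#                            ['Experiment_measurements', 'results'])
# if to_create:
#     kafka_admin.create_topics(
#         [NewTopic(name=n, num_partitions=1, replication_factor=1) for n in to_create])
```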

## KAFKA - SPARK INTEGRATION

### Read the data from the kafka topic (define the consumer)

In [8]:
inputDF = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)\
    .option('subscribe', 'Experiment_measurements')\
    .load()

In [18]:
inputDF.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [10]:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, IntegerType

## the schema of the json data format used to create the messages
schema = StructType(
        [
                StructField("HEAD",        IntegerType()),
                StructField("FPGA",        IntegerType()),
                StructField("TDC_CHANNEL", IntegerType()),
                StructField("ORBIT_CNT",   DoubleType()),
                StructField("BX_COUNTER",  IntegerType()),
                StructField("TDC_MEAS",    DoubleType())
        ]  
)

## a new DF can be created from the previous one by using the pyspark.sql functions:
## the binary Kafka value is cast to a string and parsed as JSON with the schema above
jsonDF = inputDF.select(from_json(col("value").cast("string"), schema).alias('value'))

In [11]:
jsonDF.printSchema()

root
 |-- value: struct (nullable = true)
 |    |-- HEAD: integer (nullable = true)
 |    |-- FPGA: integer (nullable = true)
 |    |-- TDC_CHANNEL: integer (nullable = true)
 |    |-- ORBIT_CNT: double (nullable = true)
 |    |-- BX_COUNTER: integer (nullable = true)
 |    |-- TDC_MEAS: double (nullable = true)
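`from_json` returns null for a value it cannot parse, so malformed messages silently become null rows. A minimal sketch of a producer-side check that a payload carries every field of the schema is shown below. The name `matches_schema` is hypothetical, the sample values in the usage are made up, and accepting integers for the two double fields is an assumption:

```python
import json

# Field names follow the schema above; ints are also accepted for the double fields.
EXPECTED_FIELDS = {
    "HEAD": int,
    "FPGA": int,
    "TDC_CHANNEL": int,
    "ORBIT_CNT": (int, float),
    "BX_COUNTER": int,
    "TDC_MEAS": (int, float),
}

def matches_schema(payload):
    """Return True if payload (bytes) is a JSON object with all expected fields and types."""
    try:
        record = json.loads(payload.decode("utf-8"))
    except ValueError:  # covers both invalid UTF-8 and invalid JSON
        return False
    if not isinstance(record, dict):
        return False
    return all(isinstance(record.get(name), kind)
               for name, kind in EXPECTED_FIELDS.items())

# Usage with an illustrative message:
# hit = {"HEAD": 2, "FPGA": 0, "TDC_CHANNEL": 17,
#        "ORBIT_CNT": 1.0, "BX_COUNTER": 5, "TDC_MEAS": 3.0}
# matches_schema(json.dumps(hit).encode("utf-8"))
```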



In [13]:
#jsonDF.writeStream\
#   .outputMode("append")\
#   .format("console")\
#   .start()\
#   .awaitTermination()

In [12]:
flatDF = jsonDF.selectExpr("value.HEAD", 
                           "value.FPGA", 
                           "value.TDC_CHANNEL",
                           "value.ORBIT_CNT",
                           "value.BX_COUNTER",
                           "value.TDC_MEAS")

In [13]:
flatDF.printSchema()

root
 |-- HEAD: integer (nullable = true)
 |-- FPGA: integer (nullable = true)
 |-- TDC_CHANNEL: integer (nullable = true)
 |-- ORBIT_CNT: double (nullable = true)
 |-- BX_COUNTER: integer (nullable = true)
 |-- TDC_MEAS: double (nullable = true)



In [16]:
#flatDF.writeStream\
#   .outputMode("append")\
#   .format("console")\
#   .start()\
#   .awaitTermination()

### SPARK processing

In [14]:
from pyspark.sql.functions import col

# Keep only the events where HEAD == 2
cleanDF = flatDF.where(col('HEAD')==2)  

In [15]:
import json

def computations(DF, epoch):

    # The four calculations are performed for each chamber, so we build one DataFrame per chamber
    chamber_1 = DF.filter((col("FPGA") == 0) & (col("TDC_CHANNEL")>=0) & (col("TDC_CHANNEL")<=63))
    chamber_2 = DF.filter((col("FPGA") == 0) & (col("TDC_CHANNEL")>=64) & (col("TDC_CHANNEL")<=127))
    chamber_3 = DF.filter((col("FPGA") == 1) & (col("TDC_CHANNEL")>=0) & (col("TDC_CHANNEL")<=63))
    chamber_4 = DF.filter((col("FPGA") == 1) & (col("TDC_CHANNEL")>=64) & (col("TDC_CHANNEL")<=127))

    #Now we can count the number of events in each chamber
    n_c1 = chamber_1.count()
    n_c2 = chamber_2.count()
    n_c3 = chamber_3.count()
    n_c4 = chamber_4.count()

    #Total number of events
    n = n_c1 + n_c2 + n_c3 + n_c4


    #Histograms    
    h_c1 = chamber_1.groupBy('TDC_CHANNEL').count().collect()
    h_c2 = chamber_2.groupBy('TDC_CHANNEL').count().collect()
    h_c3 = chamber_3.groupBy('TDC_CHANNEL').count().collect()
    h_c4 = chamber_4.groupBy('TDC_CHANNEL').count().collect()

    h_active_1 = chamber_1.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    h_active_2 = chamber_2.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    h_active_3 = chamber_3.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    h_active_4 = chamber_4.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    
    
    #Organise the results to send them to one topic as a dictionary
    results = {'Total_events': n,
              'Events_per_chamber': [n_c1,n_c2,n_c3,n_c4],
              'Histogram_1': [h_c1, h_c2, h_c3, h_c4],
              'Histogram_2': [h_active_1,h_active_2,h_active_3,h_active_4]}
    
    #publish the results in the "results" topic for further usage
    producer.send(topic='results', value=json.dumps(results).encode('utf-8'))
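The four filters in `computations` assign each hit to a chamber from its `FPGA` and `TDC_CHANNEL` values. The same mapping can be written as a plain Python function, which is handy for checking the logic on a few hand-made records without a running cluster. This is a sketch with hypothetical helper names (`chamber_of`, `summarize`), not a replacement for the Spark code:

```python
from collections import Counter

def chamber_of(fpga, tdc_channel):
    """Map (FPGA, TDC_CHANNEL) to a chamber number 1-4, or None if outside all ranges."""
    if fpga not in (0, 1) or not 0 <= tdc_channel <= 127:
        return None
    # FPGA 0 -> chambers 1 and 2, FPGA 1 -> chambers 3 and 4;
    # channels 0-63 select the first of the pair, 64-127 the second.
    return 2 * fpga + (1 if tdc_channel <= 63 else 2)

def summarize(hits):
    """Count hits per chamber and per (chamber, channel) in one pass over dict records."""
    per_chamber = Counter()
    per_channel = Counter()
    for hit in hits:
        chamber = chamber_of(hit["FPGA"], hit["TDC_CHANNEL"])
        if chamber is None:
            continue  # hit does not belong to any of the four chambers
        per_chamber[chamber] += 1
        per_channel[(chamber, hit["TDC_CHANNEL"])] += 1
    return per_chamber, per_channel
```

Hits outside the four ranges are skipped, which mirrors the way the total in `computations` is the sum of the four chamber counts.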

In [16]:
# Send the results to the kafka topic
from kafka import KafkaProducer
#Initialize the producer
producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)
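`computations` encodes the results dictionary by hand on every send. As an alternative sketch, `KafkaProducer` accepts a `value_serializer` callable, so the encoding can be defined once (the function name `json_serializer` is an assumption for illustration):

```python
import json

def json_serializer(value):
    """Encode a JSON-serializable Python object as UTF-8 bytes."""
    return json.dumps(value).encode("utf-8")

# Usage (the producer would then be sent the dictionary directly):
# producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
#                          value_serializer=json_serializer)
# producer.send('results', value=results)
# producer.flush()  # block until buffered messages are delivered
```

With this variant the `json.dumps(...).encode('utf-8')` call inside `computations` would have to be dropped, otherwise the payload would be encoded twice.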


In [23]:
#Trigger the processing
cleanDF.writeStream\
    .foreachBatch(computations)\
    .trigger(processingTime='5 seconds')\
    .start()\
    .awaitTermination()

KeyboardInterrupt: 

In [112]:
spark.stop()