# MAPD-B distributed processing exam
## Project 4: Streaming processing of cosmic rays using Drift Tubes detectors


The goal of this project is to reproduce the real-time processing of real data collected by a
particle physics detector and to publish the results on a dashboard for live monitoring.


### Students:
+ **Capettini Hilario** (2013031)

+ **Carmona Gerardo** (2005005)

+ **Monaco Saverio** (2012264)

## Introduction

Extremely brief introduction, the elements and data flow.

## The cluster 

Cluster design, what was installed.

In [None]:
# IMPORTS
import json
import time

import numpy as np

import findspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import from_json, col, when, sum as ssum
from pyspark.sql.types import StructField, StructType, DoubleType, IntegerType

from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

## Streaming with Kafka and Spark

Here we implement a basic pipeline for the project, connecting Kafka with Spark.

## Get Kafka and Spark ready

We can now initialize all the required environment variables with `findspark.init()` by passing the path to the Spark folder we downloaded previously.

In [None]:
findspark.init('/usr/local/spark')

In [None]:
#%%script bash --no-raise-error
#$SPARK_HOME/sbin/start-all.sh
#$SPARK_HOME/sbin/start-master.sh

# # start master 
# $SPARK_HOME/sbin/start-master.sh --host localhost \
#     --port 7077 --webui-port 8080
    
# # start worker
# $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 \
#     --cores 8 --memory 6g

## Create the Spark session

We can now create the Spark session. With the following command we ask the master (and resource manager) to create an application with the required resources and configuration. Apart from the Kafka connector package and the number of shuffle partitions, we use the default options.

In [None]:
spark = SparkSession.builder \
    .master("spark://master:7077")\
    .appName("Spark Streaming")\
    .config("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")\
    .config("spark.sql.shuffle.partitions",8)\
    .getOrCreate()


In [None]:
spark

## Kafka

In [None]:
KAFKA_HOME = '/usr/local/kafka'
KAFKA_BOOTSTRAP_SERVERS = 'slave01:9092'
#KAFKA_BOOTSTRAP_SERVERS = 'localhost:9092'

In [None]:
# For some reason these cannot be launched from here with os.system, so we open terminals
# in the KAFKA_HOME folder and launch the ZooKeeper and Kafka server commands manually


# Start Zookeeper
# bin/zookeeper-server-start.sh config/zookeeper.properties 
#os.system('{0}/bin/zookeeper-server-start.sh {0}/config/zookeeper.properties'.format(KAFKA_HOME)) 
    
# Start one Kafka Broker
#bin/kafka-server-start.sh config/server.properties
#os.system('{0}/bin/kafka-server-start.sh {0}/config/server.properties'.format(KAFKA_HOME)) 

### Create the topics for Kafka

In [None]:
kafka_admin = KafkaAdminClient(
        bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
    )

#Here we will inject the data
new_topic_a = NewTopic(name='Experiment_measurements', 
                       num_partitions=16, 
                       replication_factor=1)

#Here we inject the number of processed hits, post cleaning
new_topic_b = NewTopic(name='results', 
                       num_partitions=1, 
                       replication_factor=1)

kafka_admin.create_topics(new_topics=[new_topic_a,new_topic_b])


In [None]:
kafka_admin.list_topics()

## Kafka - Spark INTEGRATION

### Read the data from the Kafka topic (define the consumer)

In [None]:
inputDF = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)\
    .option('subscribe', 'Experiment_measurements')\
    .load()


In [None]:
## The schema of the json data format used to create the messages
schema = StructType(
        [
                StructField("HEAD",        IntegerType()),
                StructField("FPGA",        IntegerType()),
                StructField("TDC_CHANNEL", IntegerType()),
                StructField("ORBIT_CNT",   DoubleType()),
                StructField("BX_COUNTER",  IntegerType()),
                StructField("TDC_MEAS",    DoubleType())
        ]  
)

## A new DataFrame is created from the previous one with the pyspark.sql functions:
## the Kafka "value" column (bytes) is cast to string and parsed as JSON with the schema above
jsonDF = inputDF.select(from_json(col("value").cast("string"), schema).alias('value'))
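
For reference, a message on `Experiment_measurements` that matches this schema could look like the one below. This is only a minimal sketch: the field values are invented for illustration, and it assumes that the producer injecting the data serializes one hit per message as a UTF-8 encoded JSON object.

```python
import json

# Hypothetical hit: the field names follow the schema above, the values are invented.
sample_hit = {
    "HEAD": 2,
    "FPGA": 0,
    "TDC_CHANNEL": 42,
    "ORBIT_CNT": 1234567.0,
    "BX_COUNTER": 1200,
    "TDC_MEAS": 15.0,
}

# Kafka carries raw bytes, so the JSON text is UTF-8 encoded before sending.
payload = json.dumps(sample_hit).encode("utf-8")

# Decoding reverses the two steps; from_json does the equivalent on the Spark side.
decoded = json.loads(payload.decode("utf-8"))
```

A field that is missing from a message, or that cannot be parsed with the declared type, ends up as null in the corresponding column, so such rows are removed by any later filter on that column.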

In [None]:
flatDF = jsonDF.selectExpr("value.HEAD", 
                           "value.FPGA", 
                           "value.TDC_CHANNEL",
                           "value.ORBIT_CNT",
                           "value.BX_COUNTER",
                           "value.TDC_MEAS")

In [None]:
flatDF.printSchema()

## Spark processing

In [None]:
## FILTERING OF THE DATA
## we only keep the events with "HEAD" = 2 and "TDC_CHANNEL" <= 128

cleanDF = flatDF.where((col('HEAD')==2) & (col('TDC_CHANNEL') <= 128))

In [None]:
## Collection of functions for the main computation

def chamber_assignment(df):
    '''Assign chamber number and leave the scintillator carriers with chamber == null'''

    return(df.withColumn('CHAMBER',when(col("FPGA") == 0,
                                                when(col("TDC_CHANNEL")<=63,1).\
                                                otherwise(when(col("TDC_CHANNEL")<128,2))).\
                                           otherwise(when(col("TDC_CHANNEL")<=63,3).\
                                                     otherwise(when(col("TDC_CHANNEL")<128,4))
                                           )).\
                                           select([ col('TDC_CHANNEL'), col('ORBIT_CNT'),
                                           col('BX_COUNTER'),col('TDC_MEAS'),
                                           col('CHAMBER')])
          )


def scintillator_data(df):
    '''Define a dataframe containing the relevant information for 
    the scintillator analysis''' 
    
    # First we filter the events encoding the passage time,
    # then we add the PASSAGETIME for each event.
    # Finally, if there are two scintillator hits within the same orbit we keep
    # the one with the smaller time
    return(df.filter((col("CHAMBER").isNull()) & (col("FPGA") == 1)).\
                          withColumn("PASSAGETIME", 25 * (col("ORBIT_CNT") * 3564 + col("BX_COUNTER") + col("TDC_MEAS")/30)).\
                          drop("TDC_CHANNEL").drop("BX_COUNTER").\
                          drop("TDC_MEAS").drop("CHAMBER").\
                          groupBy("ORBIT_CNT").min("PASSAGETIME").\
                          withColumnRenamed("ORBIT_CNT","ORBIT_CNT_sci").\
                          withColumnRenamed("min(PASSAGETIME)","PASSAGETIME")
          )


def histogram_a(df,min_v,max_v,inc,key):# TODO: replicate the function generalization to the actual code
    '''Return the bins and counts for the first type of requested histogram'''
    hist_bins = np.arange(min_v,max_v,inc)
    hist = df\
        .filter((min_v<=F.col(key)) & (F.col(key)<=max_v))\
        .withColumn('BIN', F.floor((F.col(key)-min_v)/inc))\
        .groupBy('CHAMBER','BIN')\
        .count().select('CHAMBER','BIN', col('count').alias('COUNT'))
    return (hist_bins, hist)


def histogram_b(df,min_v,max_v,inc,key_1,key_2):
    '''Return the bins and counts for the second type of requested histogram'''
    hist_bins = np.arange(min_v,max_v,inc)
    hist = df\
        .groupBy('CHAMBER',key_1)\
        .agg(F.countDistinct(key_2).alias('ACTIVE'))\
        .filter((min_v<=F.col(key_1))&(F.col(key_1)<=max_v))\
        .withColumn('BIN',F.floor((F.col(key_1)-min_v)/inc))\
        .groupBy('CHAMBER','BIN')\
        .agg(F.sum('ACTIVE').alias('COUNT'))
    return(hist_bins, hist)


def numpify(bins, pos_count):
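    '''Turn a {bin index: count} mapping into a dense NumPy array of counts'''
    # bins holds the left edge of each bin, so there is one counter per entry
    counter = np.zeros(len(bins))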
    positions = np.array(list(pos_count.keys()))
    counts = np.array(list(pos_count.values()))
    counter[positions] = counts
    return counter


def prepare_results(hist, hist_bins):
    '''COLLECTING RESULTS'''    
    _hist = hist.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("BIN", "COUNT"))).alias("COUNT")
        ).collect()    

    # JSON FORMATTING OF RESULTS
    _hist_dict = {row.CHAMBER: {
        'Bins': list(hist_bins), 'Counts': list(numpify(hist_bins,row.COUNT))
    } for row in _hist}
    
    return _hist_dict
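
The binning rule used by `histogram_a` and `histogram_b` can be checked outside Spark with a small NumPy sketch. `bin_index` is a hypothetical helper written only for this illustration; it applies the same `floor((value - min_v) / inc)` expression as the `BIN` column. The sketch shows that `hist_bins` contains the left edge of each bin, and that a value exactly equal to `max_v` (which passes the `<= max_v` filter) maps to index `len(hist_bins)` whenever `max_v - min_v` is a multiple of `inc`, an index that `numpify` has no slot for.

```python
import numpy as np

def bin_index(value, min_v, inc):
    """Same rule as the BIN column: floor((value - min_v) / inc)."""
    return int(np.floor((value - min_v) / inc))

# Bin edges as built in histogram_a for the TDC_CHANNEL histogram.
channel_bins = np.arange(0, 127, 5)      # 0, 5, ..., 125: 26 left edges
last_channel_bin = bin_index(127, 0, 5)  # 25, the last valid index

# Same construction for the DRIFTIME histogram.
drift_bins = np.arange(0, 1000, 10)      # 0, 10, ..., 990: 100 left edges
edge_bin = bin_index(1000, 0, 10)        # 100, one past the last valid index
```

With the ranges used in this notebook only the drift time histogram is exposed to this edge case, and only for a drift time of exactly 1000.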


In [None]:
def computations(df, epoch, log):
    '''This is the main function of the code, it requires a dataframe as input. The dataframe is analysed
       and the results are published in the kafka topic "results" '''
    main_df = chamber_assignment(df)

    scintillator_df = scintillator_data(main_df)
    
    ### Drop the rows with a null CHAMBER (the scintillator hits) from main_df
    hit_df = main_df.na.drop(subset=["CHAMBER"])
    
    ## TOTAL NUMBER OF PROCESSED HITS
    total_hits = hit_df.count()
    if not total_hits: return

    ## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
    chamber_hits = hit_df\
        .groupBy('CHAMBER').count()\
        .select(col('CHAMBER'),col('count').alias('COUNT'))
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER
    min_v_1 = 0
    max_v_1 = 127
    inc_1 = 5
    hist_1_bins, hist_1 = histogram_a(hit_df,min_v_1,max_v_1,inc_1, 'TDC_CHANNEL')
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
    min_v_2 = 6.e5 #main_df.agg(F.min(F.col('ORBIT_CNT')).alias('min')).collect()[-1].min
    max_v_2 = 1.e7 #main_df.agg(F.max(F.col('ORBIT_CNT')).alias('max')).collect()[-1].max
    inc_2 = 0.5e6
    hist_2_bins, hist_2 = histogram_b(hit_df,min_v_2,max_v_2,inc_2, 'ORBIT_CNT', 'TDC_CHANNEL')
    
    
    ### keep only the hits with a scintillator signal within the same orbit
    chamber_sci = hit_df.join(scintillator_df, hit_df.ORBIT_CNT == scintillator_df.ORBIT_CNT_sci, "inner")

    ## ADD TIME CORRECTION BY CHAMBER
    chamber_sci = chamber_sci.withColumn('TIME_OFFSET',when(col("CHAMBER") == 1, 93.9).\
                                                       when(col("CHAMBER") == 2, 101.4).\
                                                       when(col("CHAMBER") == 3, 95.5).\
                                                       when(col("CHAMBER") == 4, 92.4))

    ### Add the ABSOLUTETIME and DRIFTIME
    chamber_sci = chamber_sci.withColumn("ABSOLUTETIME",
                             25 * (col("ORBIT_CNT") * 3564 + col("BX_COUNTER") + col("TDC_MEAS")/30)).\
                              withColumn("DRIFTIME",col("ABSOLUTETIME")-col("PASSAGETIME") + col("TIME_OFFSET"))
   

    ## ACTIVE TDC_CHANNEL PER CHAMBER WITHIN SCINTILLATOR SIGNAL
    min_v_3 = 0
    max_v_3= 127
    inc_3 = 5
    hist_3_bins, hist_3 = histogram_a(chamber_sci,min_v_3,max_v_3,inc_3, 'TDC_CHANNEL')
    

    ## HISTOGRAM OF DRIFTIME, PER CHAMBER
    min_v_4 = 0
    max_v_4= 1000
    inc_4 = 10
    hist_4_bins, hist_4 = histogram_a(chamber_sci,min_v_4,max_v_4,inc_4, 'DRIFTIME')


    # PREPARE THE RESULTS
    _chamber_hits = {row.CHAMBER: int(row.COUNT) for row in chamber_hits.collect()}
    _hist_1_dict = prepare_results(hist_1,hist_1_bins)
    _hist_2_dict = prepare_results(hist_2,hist_2_bins)
    _hist_3_dict = prepare_results(hist_3,hist_3_bins)
    _hist_4_dict = prepare_results(hist_4,hist_4_bins)
    
    # Empty histogram with one zero per bin, matching the length returned by numpify
    default = lambda bins: {'Bins': list(bins), 'Counts': [0]*len(bins)}
    
    results = {f'Chamber_{i}': {
        'Count': _chamber_hits.get(i, 0),
        'Hist_1': _hist_1_dict.get(i, default(hist_1_bins)),
        'Hist_2': _hist_2_dict.get(i, default(hist_2_bins)),
        'Hist_3': _hist_3_dict.get(i, default(hist_3_bins)),
        'Hist_4': _hist_4_dict.get(i, default(hist_4_bins))} for i in range(1,5)}

    results.update({
        'Index': time.time(),
        'Total Count': int(total_hits)
    })

    log(results)
    return
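
The time arithmetic behind `PASSAGETIME`, `ABSOLUTETIME` and `DRIFTIME` can be followed on a single pair of hits. The sketch below is plain Python with invented counter values; `hit_time` is a hypothetical helper that only repeats the expression used in the Spark columns, and the result is assumed to be in nanoseconds (25 per bunch crossing, 3564 bunch crossings per orbit, 30 TDC counts per bunch crossing).

```python
def hit_time(orbit_cnt, bx_counter, tdc_meas):
    """Same expression as ABSOLUTETIME and PASSAGETIME in the notebook."""
    return 25 * (orbit_cnt * 3564 + bx_counter + tdc_meas / 30)

# Invented values: a scintillator hit and a chamber hit in the same orbit.
passage_time = hit_time(orbit_cnt=1000, bx_counter=100, tdc_meas=0)
absolute_time = hit_time(orbit_cnt=1000, bx_counter=104, tdc_meas=15)

# Chamber 1 offset, taken from the TIME_OFFSET column above.
time_offset = 93.9
drift_time = absolute_time - passage_time + time_offset
```

Here the chamber hit arrives 4.5 bunch crossings (112.5) after the scintillator signal, so the drift time is about 206.4 once the chamber 1 offset is added, which falls inside the 0 to 1000 range of the drift time histogram.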


In [None]:
#Send the results to the kafka topic
#Initialize the producer
producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)

logger = lambda value: producer.send(topic="results", value= str(value).encode('utf-8'))
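
The logger above publishes `str(value)`, that is, the Python representation of the dictionary, which uses single quotes and is therefore not valid JSON. If the dashboard is expected to parse JSON, one possible alternative is sketched below. It is not part of the original pipeline: it assumes the histogram arrays are converted with `.tolist()` so that `json.dumps` only sees built-in Python numbers, and the small arrays are invented for illustration.

```python
import json
import numpy as np

bins = np.arange(0, 20, 5)          # integer bin edges, as in Hist_1
counts = np.zeros(len(bins))        # float counts, as returned by numpify
counts[1] = 3

# .tolist() converts NumPy scalars to built-in int/float, which json can encode.
message = {"Bins": bins.tolist(), "Counts": counts.tolist()}
encoded = json.dumps(message).encode("utf-8")

# A consumer can then recover the dictionary with json.loads.
decoded = json.loads(encoded.decode("utf-8"))
```

Whether this is preferable depends on how the dashboard decodes the `results` topic, so the consumer would have to be changed together with the producer.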

In [None]:
#Trigger the processing
cleanDF.writeStream\
    .foreachBatch(lambda df, epoch: computations(df,epoch,logger))\
    .trigger(processingTime='5 seconds')\
    .start()\
    .awaitTermination()

In [None]:
spark.stop()

If you also want to delete any data of your local Kafka environment including any events you have created along the way, run the command:

`` $ rm -rf /tmp/kafka-logs /tmp/zookeeper `` 

##  Results

### Vertical scalability

### Horizontal scalability

### Scaling with amount of data