# Streaming with Kafka and Spark

Here I implement a basic pipeline for the project, connecting Kafka with Spark.

### Computer setup
I downloaded the Spark distribution **spark-3.1.2-bin-hadoop3.2** and the Kafka distribution **kafka_2.13-2.7.0** and placed both in my home directory.

## Get Kafka and Spark ready

### Standalone cluster deployment

We can now initialize all the required variables with `findspark.init()` by passing the path to the spark folder we downloaded previously.

In [1]:
import findspark
findspark.init('/usr/local/spark')

First we need to start the master. This will spin up the Spark master at spark://localhost:7077 and a cluster dashboard at localhost:8080.

We can now create a worker.

## Create the spark session

We can now create the Spark session. With the following command we ask the master (which also acts as resource manager) to create an application with the required resources and configuration. In this case we are using all the default options.

In [2]:
from pyspark.sql import SparkSession

KAFKA_BOOTSTRAP_SERVERS = ''


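# Note: the Kafka connector version (3.1.1) should normally match the Spark version in use (3.1.2 in this setup)
spark = SparkSession.builder \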
    .master("spark://master:7077")\
    .appName("Spark Streaming")\
    .config("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")\
    .getOrCreate()


In [3]:
spark

## KAFKA

In [4]:
KAFKA_HOME = '/usr/local/kafka'
KAFKA_BOOTSTRAP_SERVERS = 'slave01:9092'

In [5]:
# For some reason I can't launch these from here using os.system, so I open terminals in the KAFKA_HOME folder
# and launch the ZooKeeper and Kafka server commands manually


# Start Zookeeper
# bin/zookeeper-server-start.sh config/zookeeper.properties 
#os.system('{0}/bin/zookeeper-server-start.sh {0}/config/zookeeper.properties'.format(KAFKA_HOME)) 
    
# Start one Kafka Broker
#bin/kafka-server-start.sh config/server.properties
#os.system('{0}/bin/kafka-server-start.sh {0}/config/server.properties'.format(KAFKA_HOME)) 
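A possible explanation for the problem above is that both start scripts run in the foreground and never return, while `os.system` waits for the command to finish, so the cell blocks. The sketch below is one way around it, assuming the same `KAFKA_HOME` layout as above: it starts each script with `subprocess.Popen`, which returns immediately. The helper names `kafka_commands` and `start_in_background` and the log file names are my own, not part of Kafka.

```python
import subprocess
from pathlib import Path


def kafka_commands(kafka_home):
    """Build the two start commands used above as argument lists."""
    home = Path(kafka_home)
    zookeeper = [str(home / "bin" / "zookeeper-server-start.sh"),
                 str(home / "config" / "zookeeper.properties")]
    broker = [str(home / "bin" / "kafka-server-start.sh"),
              str(home / "config" / "server.properties")]
    return zookeeper, broker


def start_in_background(command, log_path):
    """Start a long-running command without blocking the notebook cell.

    The process output is appended to log_path (a hypothetical file name).
    """
    log = open(log_path, "ab")
    return subprocess.Popen(command, stdout=log, stderr=subprocess.STDOUT)


# Usage (not executed here): ZooKeeper first, then the broker.
# zk_cmd, broker_cmd = kafka_commands('/usr/local/kafka')
# zk_proc = start_in_background(zk_cmd, "zookeeper.log")
# broker_proc = start_in_background(broker_cmd, "kafka.log")
```

The returned `Popen` objects can later be stopped with `terminate()`, broker first and ZooKeeper second.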

### Create the topics for kafka

In [6]:
from kafka.admin import KafkaAdminClient, NewTopic
kafka_admin = KafkaAdminClient(
        bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
    )

#Here we will inject the data
new_topic_a = NewTopic(name='Experiment_measurements', 
                       num_partitions=1, 
                       replication_factor=1)

#Here we inject the number of processed hits, post cleaning
new_topic_b = NewTopic(name='results', 
                       num_partitions=1, 
                       replication_factor=1)

kafka_admin.create_topics(new_topics=[new_topic_a,new_topic_b])


TopicAlreadyExistsError: [Error 36] TopicAlreadyExistsError: Request 'CreateTopicsRequest_v3(create_topic_requests=[(topic='Experiment_measurements', num_partitions=1, replication_factor=1, replica_assignment=[], configs=[]), (topic='results', num_partitions=1, replication_factor=1, replica_assignment=[], configs=[])], timeout=30000, validate_only=False)' failed with response 'CreateTopicsResponse_v3(throttle_time_ms=0, topic_errors=[(topic='Experiment_measurements', error_code=36, error_message="Topic 'Experiment_measurements' already exists."), (topic='results', error_code=0, error_message=None)])'.
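The error above is raised because `Experiment_measurements` already existed from a previous run (the response shows that `results` was created without error). A minimal sketch of a way to make the cell re-runnable is to request only the topics that are not listed yet. The helper `missing_topics` is a hypothetical name of mine; the commented usage relies on the `kafka_admin` and `NewTopic` objects from the cell above.

```python
def missing_topics(requested, existing):
    """Return the requested topic names that do not exist yet, keeping their order."""
    existing = set(existing)
    return [name for name in requested if name not in existing]


# Usage with the admin client defined above (not executed here):
# to_create = missing_topics(['Experiment_measurements', 'results'],
#                            kafka_admin.list_topics())
# if to_create:
#     kafka_admin.create_topics(new_topics=[
#         NewTopic(name=n, num_partitions=1, replication_factor=1)
#         for n in to_create])
```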

In [9]:
kafka_admin.list_topics()

['Experiment_measurements', 'results']

## KAFKA - SPARK INTEGRATION

### Read the data from the kafka topic (define the consumer)

In [10]:
inputDF = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)\
    .option('subscribe', 'Experiment_measurements')\
    .load()

In [11]:
#inputDF.printSchema()

In [12]:
from pyspark.sql.functions import from_json, col, when, sum as ssum
from pyspark.sql.types import StructField, StructType, DoubleType, IntegerType

## the schema of the json data format used to create the messages
schema = StructType(
        [
                StructField("HEAD",        IntegerType()),
                StructField("FPGA",        IntegerType()),
                StructField("TDC_CHANNEL", IntegerType()),
                StructField("ORBIT_CNT",   DoubleType()),
                StructField("BX_COUNTER",  IntegerType()),
                StructField("TDC_MEAS",    DoubleType())
        ]  
)

## a new DF can be created from the previous by using the pyspark.sql functions
jsonDF = inputDF.select(from_json(col("value").cast("string"), schema).alias('value'))

In [13]:
jsonDF.printSchema()

root
 |-- value: struct (nullable = true)
 |    |-- HEAD: integer (nullable = true)
 |    |-- FPGA: integer (nullable = true)
 |    |-- TDC_CHANNEL: integer (nullable = true)
 |    |-- ORBIT_CNT: double (nullable = true)
 |    |-- BX_COUNTER: integer (nullable = true)
 |    |-- TDC_MEAS: double (nullable = true)
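
For reference, this is a sketch of what a message on `Experiment_measurements` is expected to look like so that `from_json` can parse it with the schema above: a UTF-8 encoded JSON object with the six fields. The helper `encode_hit` and the numeric values are made up for illustration; the real producer is not part of this notebook.

```python
import json

# Field names taken from the schema above
FIELDS = ("HEAD", "FPGA", "TDC_CHANNEL", "ORBIT_CNT", "BX_COUNTER", "TDC_MEAS")


def encode_hit(head, fpga, tdc_channel, orbit_cnt, bx_counter, tdc_meas):
    """Serialize one hit as the JSON bytes expected in the Kafka message value."""
    values = (head, fpga, tdc_channel, orbit_cnt, bx_counter, tdc_meas)
    record = dict(zip(FIELDS, values))
    return json.dumps(record).encode("utf-8")


# Made-up example values, only to show the message shape
message = encode_hit(2, 0, 12, 616953.0, 1530, 17.0)
decoded = json.loads(message.decode("utf-8"))
```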



In [14]:
#jsonDF.writeStream\
#   .outputMode("append")\
#   .format("console")\
#   .start()\
#   .awaitTermination()

In [15]:
flatDF = jsonDF.selectExpr("value.HEAD", 
                           "value.FPGA", 
                           "value.TDC_CHANNEL",
                           "value.ORBIT_CNT",
                           "value.BX_COUNTER",
                           "value.TDC_MEAS")

In [16]:
flatDF.printSchema()

root
 |-- HEAD: integer (nullable = true)
 |-- FPGA: integer (nullable = true)
 |-- TDC_CHANNEL: integer (nullable = true)
 |-- ORBIT_CNT: double (nullable = true)
 |-- BX_COUNTER: integer (nullable = true)
 |-- TDC_MEAS: double (nullable = true)



In [17]:
#flatDF.writeStream\
#   .outputMode("append")\
#   .format("console")\
#   .start()\
#   .awaitTermination()

### SPARK processing

In [18]:
import json
import numpy as np
import time

#Keep the events where "HEAD"=2
cleanDF = flatDF.where(col('HEAD')==2)  

In [19]:
def computations(DF, epoch):
    #This function performs all the operations on the received batch.

    #The calculations are done for each chamber, so we build 4 dataframes
    chamber_1 = DF.filter((col("FPGA") == 0) & (col("TDC_CHANNEL")>=0) & (col("TDC_CHANNEL")<=63))
    chamber_2 = DF.filter((col("FPGA") == 0) & (col("TDC_CHANNEL")>=64) & (col("TDC_CHANNEL")<=127))
    chamber_3 = DF.filter((col("FPGA") == 1) & (col("TDC_CHANNEL")>=0) & (col("TDC_CHANNEL")<=63))
    chamber_4 = DF.filter((col("FPGA") == 1) & (col("TDC_CHANNEL")>=64) & (col("TDC_CHANNEL")<=127))

    #Now we can count the number of events in each chamber
    n_c1 = chamber_1.count()
    n_c2 = chamber_2.count()
    n_c3 = chamber_3.count()
    n_c4 = chamber_4.count()

    #Total number of events
    n = n_c1 + n_c2 + n_c3 + n_c4


    #Histograms    
    h_c1 = chamber_1.groupBy('TDC_CHANNEL').count().collect()
    h_c2 = chamber_2.groupBy('TDC_CHANNEL').count().collect()
    h_c3 = chamber_3.groupBy('TDC_CHANNEL').count().collect()
    h_c4 = chamber_4.groupBy('TDC_CHANNEL').count().collect()

    h_active_1 = chamber_1.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    h_active_2 = chamber_2.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    h_active_3 = chamber_3.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    h_active_4 = chamber_4.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
    
    
    #Organise the results to send them to one topic as a dictionary
    results = {'Total_events': n,
              'Events_per_chamber': [n_c1,n_c2,n_c3,n_c4],
              'Histogram_1': [h_c1, h_c2, h_c3, h_c4],
              'Histogram_2': [h_active_1,h_active_2,h_active_3,h_active_4]}
    
    #publish the results in the "results" topic for further usage
    producer.send(topic='results', value=json.dumps(results).encode('utf-8'))
    producer.flush()
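
The four filters above encode a fixed mapping from (`FPGA`, `TDC_CHANNEL`) to a chamber number. As a plain-Python sketch of that mapping (the function name `chamber_of` is mine, and returning `None` for anything outside the four ranges is my reading of the filters, which simply drop such rows):

```python
def chamber_of(fpga, tdc_channel):
    """Return the chamber (1 to 4) for a hit, or None if it belongs to no chamber.

    FPGA 0, channels 0-63   -> chamber 1
    FPGA 0, channels 64-127 -> chamber 2
    FPGA 1, channels 0-63   -> chamber 3
    FPGA 1, channels 64-127 -> chamber 4
    """
    if fpga not in (0, 1) or not 0 <= tdc_channel <= 127:
        return None
    return 2 * fpga + (1 if tdc_channel <= 63 else 2)
```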

In [20]:
def computations_2(DF, epoch):
    start=time.time()
    #This function performs all the operations on the received batch.

    #The calculations are done for each chamber, so we build 4 dataframes
    chamber_1 = DF.filter((col("FPGA") == 0) & (col("TDC_CHANNEL")>=0) & (col("TDC_CHANNEL")<=63))
    chamber_2 = DF.filter((col("FPGA") == 0) & (col("TDC_CHANNEL")>=64) & (col("TDC_CHANNEL")<=127))
    chamber_3 = DF.filter((col("FPGA") == 1) & (col("TDC_CHANNEL")>=0) & (col("TDC_CHANNEL")<=63))
    chamber_4 = DF.filter((col("FPGA") == 1) & (col("TDC_CHANNEL")>=64) & (col("TDC_CHANNEL")<=127))

    #Initialize results dictionary
    results = {}
    results["Total Count"] = {}
    chamber_name = ["Chamber_1", "Chamber_2", "Chamber_3", "Chamber_4"]
    for chamber in chamber_name:
        results[chamber] = {}
        results[chamber]["Count"] = {}
        
    #Now we can count the number of events in each chamber
    results["Chamber_1"]["Count"] = chamber_1.count()
    results["Chamber_2"]["Count"] = chamber_2.count()
    results["Chamber_3"]["Count"] = chamber_3.count()
    results["Chamber_4"]["Count"] = chamber_4.count()

    #Total number of events
    results["Total Count"] = results["Chamber_1"]["Count"] + results["Chamber_2"]["Count"] + \
                             results["Chamber_3"]["Count"] + results["Chamber_4"]["Count"]
    

    # Compute histograms for each chamber   
    i=0    
    for chamber in [chamber_1, chamber_2, chamber_3, chamber_4]:
        #Initialize dictionary partitions to save the results
        for hist in ["Hist_1","Hist_2"]:
            results[chamber_name[i]][hist] = {}
            results[chamber_name[i]][hist]["Bins"] = {}
            results[chamber_name[i]][hist]["Counts"] = {}
        
        if(chamber.count()!=0): 
            #Histogram 1
            bins, counts = (
                chamber.select("TDC_CHANNEL")
                .rdd.map(lambda x: x.TDC_CHANNEL)
                .histogram(list(np.arange(0,170,5)))
            )
            
            results[chamber_name[i]]["Hist_1"]["Bins"] = bins
            results[chamber_name[i]]["Hist_1"]["Counts"] = counts
            
            #Histogram 2
            bins, counts = (
            chamber.groupBy("TDC_CHANNEL","ORBIT_CNT")
            .count()
            .select("ORBIT_CNT")
            .rdd.map(lambda x: x.ORBIT_CNT)
            .histogram(list(np.arange(6.e5,1.e7,0.5e6)))
            )
            
            results[chamber_name[i]]["Hist_2"]["Bins"] = bins
            results[chamber_name[i]]["Hist_2"]["Counts"] = counts            
            
        else:
            #Histogram 1
            results[chamber_name[i]]["Hist_1"]["Bins"] = list(np.arange(0,170,5))
            counts = list(np.arange(0,170,5)* 0) 
            results[chamber_name[i]]["Hist_1"]["Counts"] = counts
            
            #Histogram 2
            results[chamber_name[i]]["Hist_2"]["Bins"] = list(np.arange(6.e5,1.e7,0.5e6))
            counts = list(np.arange(6.e5,1.e7,0.5e6)* 0) 
            results[chamber_name[i]]["Hist_2"]["Counts"] = counts
        i +=1
    
    end =time.time()
    print("Time =",end-start)
    producer.send(topic="results", value= str(results).encode('utf-8'))
    producer.flush()
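
Note that `computations` publishes JSON via `json.dumps`, while this version publishes `str(results)`, which is a Python repr (single-quoted keys) rather than JSON. If a downstream consumer expects JSON, one option is to convert NumPy values to plain Python first. The sketch below assumes only that NumPy arrays and scalars expose `tolist()`; the helper names `to_jsonable` and `encode_results` are hypothetical.

```python
import json


def to_jsonable(obj):
    """Recursively convert a results dictionary into JSON-compatible types."""
    if isinstance(obj, dict):
        return {str(key): to_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(value) for value in obj]
    if hasattr(obj, "tolist"):
        # NumPy arrays and NumPy scalars: tolist() yields plain Python values
        return obj.tolist()
    return obj


def encode_results(results):
    """Serialize the results dictionary as UTF-8 JSON bytes for the producer."""
    return json.dumps(to_jsonable(results)).encode("utf-8")


# Usage (not executed here):
# producer.send(topic="results", value=encode_results(results))
```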

In [21]:
def computations_3(DF, epoch):
    start=time.time()
    #This function performs all the operations on the received batch.
    
    #Add a column with the chamber number
    DF_new = DF.filter(col("HEAD")==2).withColumn('chamber',when((col("FPGA") == 0) & (col("TDC_CHANNEL")<=63),1).
                                 when((col("FPGA") == 0) & (col("TDC_CHANNEL")>=64),2).
                                 when((col("FPGA") == 1) & (col("TDC_CHANNEL")<=63),3).
                                 when((col("FPGA") == 1) & (col("TDC_CHANNEL")>=64),4)).\
                                 select([ col('TDC_CHANNEL'), col('ORBIT_CNT'),
                                    col('BX_COUNTER'),col('TDC_MEAS'),
                                    col('chamber')])
    #DF_new.persist()
    #DF_new.show()
    #Initialize results dictionary
    results = {}
    results["Total Count"] = {}
    results["Index"] = time.time()
    chamber_name = ["Chamber_1", "Chamber_2", "Chamber_3", "Chamber_4"]
    for chamber in chamber_name:
        results[chamber] = {}
        results[chamber]["Count"] = {}
        for hist in ["Hist_1","Hist_2"]:
            results[chamber][hist] = {}
            results[chamber][hist]["Bins"] = {}
            results[chamber][hist]["Counts"] = {}
        
    # Compute histograms for each chamber   
    for i in [1,2,3,4]:
        #Now we can count the number of events in each chamber
        chamber = DF_new.filter(col("chamber") == i).persist()
        results[f"Chamber_{i}"]["Count"] = chamber.count()
        
        if(results[f"Chamber_{i}"]["Count"]!=0):
            
            #Histogram 1
            bins, counts = (
            chamber.select("TDC_CHANNEL")
                 .rdd.map(lambda x: x.TDC_CHANNEL)
                 .histogram(list(np.arange(0,170,5)))
            )
            
            results[f"Chamber_{i}"]["Hist_1"]["Bins"] = bins
            results[f"Chamber_{i}"]["Hist_1"]["Counts"] = counts
            
            #Histogram 2
            bins, counts = (
            chamber.groupBy("TDC_CHANNEL","ORBIT_CNT")
            .count()
            .select("ORBIT_CNT")
            .rdd.map(lambda x: x.ORBIT_CNT)
            .histogram(list(np.arange(6.e5,1.e7,0.5e6)))
            )
            
            results[f"Chamber_{i}"]["Hist_2"]["Bins"] = bins
            results[f"Chamber_{i}"]["Hist_2"]["Counts"] = counts            
                
        else:
            #Histogram 1
            results[f"Chamber_{i}"]["Hist_1"]["Bins"] = list(np.arange(0,170,5))
            counts = list(np.arange(0,170,5)* 0) 
            results[f"Chamber_{i}"]["Hist_1"]["Counts"] = counts
            
             #Histogram 2
            results[f"Chamber_{i}"]["Hist_2"]["Bins"] = list(np.arange(6.e5,1.e7,0.5e6))
            counts = list(np.arange(6.e5,1.e7,0.5e6)* 0) 
            results[f"Chamber_{i}"]["Hist_2"]["Counts"] = counts
        chamber.unpersist()
        
    results["Total Count"] = results["Chamber_1"]["Count"] + results["Chamber_2"]["Count"] + \
                             results["Chamber_3"]["Count"] + results["Chamber_4"]["Count"]
    end =time.time()
    print("Time =",end-start)
       
    
    producer.send(topic="results", value= str(results).encode('utf-8'))
    #producer.flush()

In [22]:
def computations_4(df, epoch):
    start=time.time()

    ## TOTAL NUMBER OF PROCESSED HITS
    clean_df = df.filter(df.HEAD == 2)
    total_hits = clean_df.count()
    
    ## CHAMBER FILTERING
    c_fp = clean_df.filter(clean_df.FPGA == 0)
    c_ga = clean_df.filter(clean_df.FPGA == 1)
    c_0 = c_fp.filter(c_fp.TDC_CHANNEL < 64)
    c_1 = c_fp.filter(c_fp.TDC_CHANNEL >= 64)
    c_2 = c_ga.filter(c_ga.TDC_CHANNEL < 64)
    c_3 = c_ga.filter(c_ga.TDC_CHANNEL >= 64)
    
    ## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
    hits_0 = c_0.count()
    hits_1 = c_1.count()
    hits_2 = c_2.count()
    hits_3 = c_3.count()
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER
    hist_0 = c_0.groupBy('TDC_CHANNEL').count().select('TDC_CHANNEL',col('count').alias('COUNT')).collect()
    hist_1 = c_1.groupBy('TDC_CHANNEL').count().select('TDC_CHANNEL',col('count').alias('COUNT')).collect()
    hist_2 = c_2.groupBy('TDC_CHANNEL').count().select('TDC_CHANNEL',col('count').alias('COUNT')).collect()
    hist_3 = c_3.groupBy('TDC_CHANNEL').count().select('TDC_CHANNEL',col('count').alias('COUNT')).collect()
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
    orb_0 = c_0.groupBy('TDC_CHANNEL','ORBIT_CNT').count().select('TDC_CHANNEL','ORBIT_CNT',col('count').alias('COUNT')).collect()
    orb_1 = c_1.groupBy('TDC_CHANNEL','ORBIT_CNT').count().select('TDC_CHANNEL','ORBIT_CNT',col('count').alias('COUNT')).collect()
    orb_2 = c_2.groupBy('TDC_CHANNEL','ORBIT_CNT').count().select('TDC_CHANNEL','ORBIT_CNT',col('count').alias('COUNT')).collect()
    orb_3 = c_3.groupBy('TDC_CHANNEL','ORBIT_CNT').count().select('TDC_CHANNEL','ORBIT_CNT',col('count').alias('COUNT')).collect()

    end =time.time()
    print("Time =",end-start)



In [23]:
from pyspark.sql import functions as F

def computations_5(df, epoch):
    start=time.time()

    ## FILTERING DATA AND SETTING CHAMBER
    clean_df = df.filter(col("HEAD")==2).withColumn('CHAMBER',
                                when(col("FPGA") == 0,
                                     when(col("TDC_CHANNEL")<=63,1).\
                                     otherwise(2)).\
                                                    otherwise(
                                    when(col("TDC_CHANNEL")<=63,3).\
                                    otherwise(4)
                                )).\
                                select([ col('TDC_CHANNEL'), col('ORBIT_CNT'),
                                   col('BX_COUNTER'),col('TDC_MEAS'),
                                   col('CHAMBER')])

    ## TOTAL NUMBER OF PROCESSED HITS
    total_hits = clean_df.count()

    ## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
    chamber_hits = clean_df.groupBy('CHAMBER').count().select(col('CHAMBER'),col('count').alias('COUNT'))#.collect()
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER
    hist_1 = clean_df.groupBy('CHAMBER','TDC_CHANNEL').count().select('CHAMBER','TDC_CHANNEL',col('count').alias('COUNT'))#.collect()

    ## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
    hist_2 = clean_df\
        .groupBy('CHAMBER','ORBIT_CNT')\
        .agg(F.countDistinct('TDC_CHANNEL')\
        .alias('ACTIVE_CHANNELS'))#.collect()

    ## COLLECTING RESULTS
    _chamber_hits = chamber_hits.collect()
    
    _hist_1 = hist_1.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("TDC_CHANNEL", "COUNT"))).alias("COUNT")
        ).collect()

    _hist_2 = hist_2.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("ORBIT_CNT","ACTIVE_CHANNELS"))).alias("COUNT")
        ).collect()

    ## JSON FORMATING OF RESULTS
    _hist_1_dict = {row.CHAMBER: row.COUNT for row in _hist_1}

    _hist_2_dict = {row.CHAMBER: row.COUNT for row in _hist_2}

    results = {f'Chamber_{row.CHAMBER}': {
        'Count': int(row.COUNT),
        'Hist_1': _hist_1_dict[row.CHAMBER],
        'Hist_2': _hist_2_dict[row.CHAMBER]} for row in _chamber_hits}

    results.update({
        'Index': time.time(),#TODO: Better indexing
        'Total Count': int(total_hits)
    })

    end = time.time()
    print("Time =",end-start)



In [24]:
def computations_6(df, epoch):
    start=time.time()

    ## FILTERING DATA AND SETTING CHAMBER
    clean_df = df.filter(col("HEAD")==2).withColumn('CHAMBER',
                                when(col("FPGA") == 0,
                                     when(col("TDC_CHANNEL")<=63,1).\
                                     otherwise(2)).\
                                                    otherwise(
                                    when(col("TDC_CHANNEL")<=63,3).\
                                    otherwise(4)
                                )).\
                                select([ col('TDC_CHANNEL'), col('ORBIT_CNT'),
                                   col('BX_COUNTER'),col('TDC_MEAS'),
                                   col('CHAMBER')])

    results = {}

    # TOTAL NUMBER OF PROCESSED HITS
    total_hits = clean_df.count()
    results = {'total_hits': int(total_hits), 'chambers': {f'chamber_{i}': {} for i in range(1,5)}}
    
    ## This is the most general grouping, from it we will count the other groupings
    ## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
    hist_2 = clean_df.groupBy('CHAMBER','TDC_CHANNEL','ORBIT_CNT').count().select('CHAMBER','TDC_CHANNEL','ORBIT_CNT',col('count').alias('COUNT'))#.collect()
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER
    hist_1 = hist_2.groupby('CHAMBER','TDC_CHANNEL').agg(ssum('COUNT')).select('CHAMBER','TDC_CHANNEL',col('sum(COUNT)').alias('COUNT'))#.collect()

    ## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
    chamber_hits = hist_1.groupBy('CHAMBER').agg(ssum('COUNT')).select(col('CHAMBER'),col('sum(COUNT)').alias('COUNT'))#.collect()

    hist_2.persist()
    hist_1.persist()

    _chamber_hits = chamber_hits.collect()
    
    _hist_1 = hist_1.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("TDC_CHANNEL", "COUNT"))).alias("distribution")
        ).collect()

    _hist_2 = hist_2.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct(
                F.concat_ws('_',"ORBIT_CNT","TDC_CHANNEL"), 
                "COUNT"))).alias("distribution")
        ).collect()
    
#    hist_2.unpersist()
#    hist_1.unpersist()
    
    end =time.time()
    print("Time =",end-start)



In [55]:
def computations_8(df, epoch):
    start=time.time()

    ## FILTERING DATA AND SETTING CHAMBER
    clean_df = df.filter(col("HEAD")==2).withColumn('CHAMBER',
                                when(col("FPGA") == 0,
                                     when(col("TDC_CHANNEL")<=63,1).\
                                     otherwise(2)).\
                                                    otherwise(
                                    when(col("TDC_CHANNEL")<=63,3).\
                                    otherwise(4)
                                )).\
                                select([ col('TDC_CHANNEL'), col('ORBIT_CNT'),
                                   col('BX_COUNTER'),col('TDC_MEAS'),
                                   col('CHAMBER')])

    ## TOTAL NUMBER OF PROCESSED HITS
    total_hits = clean_df.count()
    if not total_hits: return

    ## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
    chamber_hits = clean_df\
        .groupBy('CHAMBER').count()\
        .select(col('CHAMBER'),col('count').alias('COUNT'))#.collect()
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER
    min_v_1 = 0
    max_v_1 = 170
    inc_1 = 5
    hist_1_bins = np.arange(min_v_1,max_v_1,inc_1)
    hist_1 = clean_df\
        .groupBy('CHAMBER','TDC_CHANNEL')\
        .count()\
        .select('CHAMBER','TDC_CHANNEL',col('count').alias('COUNT'))\
        .withColumn('BIN',
                    F.floor((F.col('TDC_CHANNEL')-min_v_1)/inc_1)
                   )\
        .groupBy('CHAMBER','BIN')\
        .agg(F.sum('COUNT')\
            .alias('COUNT')
            )#.collect()

    ## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
    min_v_2 = clean_df.agg(F.min(F.col('ORBIT_CNT')).alias('min')).collect()[-1].min#6.e5
    max_v_2 = clean_df.agg(F.max(F.col('ORBIT_CNT')).alias('max')).collect()[-1].max#1.e7
    inc_2 = 0.5e6
    # extend the upper edge by one increment so the bin holding max_v_2 always exists
    hist_2_bins = np.arange(min_v_2,max_v_2+inc_2,inc_2)
    hist_2 = clean_df\
        .groupBy('CHAMBER','ORBIT_CNT')\
        .agg(F.countDistinct('TDC_CHANNEL')\
            .alias('ACTIVE_CHANNELS')
            )\
        .withColumn('BIN',
                    F.floor((F.col('ORBIT_CNT')-min_v_2)/inc_2)
                   )\
        .groupBy('CHAMBER','BIN')\
        .agg(F.sum('ACTIVE_CHANNELS')\
            .alias('COUNT')
            )#.collect()

    ## COLLECTING RESULTS
    _chamber_hits = chamber_hits.collect()
    
    _hist_1 = hist_1.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("BIN", "COUNT"))).alias("COUNT")
        ).collect()

    _hist_2 = hist_2.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("BIN","COUNT"))).alias("COUNT")
        ).collect()

    ## NUMPIFY RESULTS
    def numpify(bins, pos_count):
        counter = np.zeros(len(bins))
        positions = np.array(list(pos_count.keys()))
        counts = np.array(list(pos_count.values()))
        counter[positions] = counts
        return counter

    ## JSON FORMATING OF RESULTS
    _hist_1_dict = {row.CHAMBER: {
        'Bins': list(hist_1_bins), 'Counts': list(numpify(hist_1_bins,row.COUNT))
    } for row in _hist_1}

    _hist_2_dict = {row.CHAMBER: {
        'Bins': list(hist_2_bins), 'Counts': list(numpify(hist_2_bins,row.COUNT))
    } for row in _hist_2}

    results = {f'Chamber_{row.CHAMBER}': {
        'Count': int(row.COUNT),
        'Hist_1': _hist_1_dict[row.CHAMBER],
        'Hist_2': _hist_2_dict[row.CHAMBER]} for row in _chamber_hits}

    results.update({
        'Index': time.time(),#TODO: Better indexing
        'Total Count': int(total_hits)
    })

    print(results)

    end = time.time()
    print("Time =",end-start)
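
The binning used in `computations_8` can be checked without Spark. The sketch below reproduces the two steps in plain Python: the bin index `floor((value - min) / inc)` and the expansion of the sparse `{bin: count}` map into a dense list, as `numpify` does. The names `bin_index` and `dense_counts` and the sample channels are mine. Unlike `numpify`, this version ignores indices outside the bin range instead of raising an error, which is an assumption about the desired behaviour.

```python
import math


def bin_index(value, min_v, inc):
    """Index of the bin containing value, for bins of width inc starting at min_v."""
    return math.floor((value - min_v) / inc)


def dense_counts(n_bins, sparse_counts):
    """Expand a {bin index: count} mapping into a list with one entry per bin."""
    counts = [0.0] * n_bins
    for index, count in sparse_counts.items():
        if 0 <= index < n_bins:  # assumption: silently skip out-of-range bins
            counts[index] += count
    return counts


# Made-up TDC_CHANNEL values, binned like Hist_1 (min 0, width 5, 34 bins)
channels = [0, 3, 7, 64, 127]
sparse = {}
for channel in channels:
    index = bin_index(channel, 0, 5)
    sparse[index] = sparse.get(index, 0) + 1

dense = dense_counts(34, sparse)
```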



In [56]:
from kafka import KafkaProducer

#Send the results to the kafka topic
#Initialize the producer
producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)#, value_serializer=lambda x: json.dumps(x).encode('utf-8'))


In [57]:
#Trigger the processing
cleanDF.writeStream\
    .foreachBatch(computations_8)\
    .trigger(processingTime='6 second')\
    .start()\
    .awaitTermination()

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[12 13 14 15 16 17 18 19 20 21 22 23 24 25 27]
[12 13 14 15 16 17 18 19 20 21 22 23 24 25 27]
[   0 6773]
[   0 6773]
[   0 6773]
[   0 6773]
Time = 3.7399096488952637
{'Chamber_1': {'Count': 544, 'Hist_1': {'Bins': [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165], 'Counts': [28.0, 34.0, 26.0, 54.0, 40.0, 36.0, 45.0, 36.0, 57.0, 69.0, 35.0, 50.0, 34.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, 'Hist_2': {'Bins': [616953.0, 1116953.0, 1616953.0, 2116953.0, 2616953.0, 3116953.0, 3616953.0, 4116953.0, 4616953.0, 5116953.0, 5616953.0, 6116953.0, 6616953.0, 7116953.0, 7616953.0, 8116953.0, 8616953.0, 9116953.0, 9616953.0, 10116953.0, 10616953.0, 11116953.0, 11616953.0, 12116953.0, 12616953.0, 13116953.0, 13616953.0, 14116953.0, 14616953.0, 15116953

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[12 13 14 15 16 17 18 19 20 21 22 23 24 25 27]
[12 13 14 15 16 17 18 19 20 21 22 23 24 25 27]
[0]
[0]
[0]
[0]
Time = 4.618537902832031
{'Chamber_1': {'Count': 1222, 'Hist_1': {'Bins': [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165], 'Counts': [97.0, 110.0, 68.0, 130.0, 82.0, 73.0, 73.0, 70.0, 107.0, 113.0, 67.0, 120.0, 112.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}, 'Hist_2': {'Bins': [1115542.0], 'Counts': [1177.0]}}, 'Chamber_3': {'Count': 1869, 'Hist_1': {'Bins': [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165], 'Counts': [134.0, 147.0, 151.0, 93.0, 148.0, 138.0, 93.0, 262.0, 162.0, 139.0, 155.0, 141.0, 106.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

KeyboardInterrupt: 

In [47]:
spark.stop()

If you also want to delete all data from your local Kafka environment, including any events you have created along the way, run the command:

`` $ rm -rf /tmp/kafka-logs /tmp/zookeeper `` 