<center><h1>Management and Analysis of Physics Dataset (MOD. B) </h1></center>
<center><h2> Project 5 - Streaming processing of cosmic rays using Drift Tubes detectors</h2></center>
<center><h2>Group 2305</h2></center>

<center><style>
    table {font-size: 24px;}
</style></center>

| Last Name        | First Name         |Student ID|
|:----------------:|:------------------:|:--------------:|
| Bertinelli       | Gabriele           |1219907 (tri)   |
| Bhatti           | Roben              |2091187         |
| Bonato           | Diego              |2091250         |
| Cacciola         | Martina            |2097476         |

<left><h2> Part 2 - Data processing</h2></left>

### Import packages and modules

In [1]:
import os
import pandas as pd
import numpy as np
import json
import boto3
import time

from tqdm        import tqdm

import kafka
from kafka       import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic



from pyspark.sql.functions import from_json, col, countDistinct, count, when, collect_list
from pyspark.sql import SparkSession

from pyspark.sql.types import StructField, StructType, StringType, DoubleType, IntegerType



### Session and Spark Context creation

With the following command we ask the master (and the resource manager) to create an application (Session) with the required resources and configuration.
To test the performance of the system, we varied the following parameters:
- `spark.executor.instances`: the number of executors requested. Executors perform the actual computations on the data: they run tasks in parallel and return the results to the driver program.
- `spark.executor.cores`: the number of CPU cores allocated to each executor.
- `spark.sql.shuffle.partitions`: the number of partitions used when shuffling data for joins or aggregations.
- `spark.sql.execution.arrow.pyspark.enabled`: Apache Arrow is an in-memory columnar data format that Spark uses to transfer data efficiently between JVM and Python processes. Arrow is available as an optimization when converting a Spark DataFrame to a Pandas DataFrame and when creating a Spark DataFrame from a Pandas DataFrame.

The Spark Context is the object through which the driver program connects to the cluster; it is also the entry point for working with RDDs.
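
The parameter scan described above can be organised as a small grid. The following is a minimal sketch in plain Python, not the procedure actually used in this project: the values in `PARAM_GRID` and the helpers `iter_configs` and `run_tag` are hypothetical, and the tag format only imitates the naming of the metrics file saved later in this notebook.

```python
from itertools import product

# Hypothetical grid of values to scan; the numbers are illustrative only.
PARAM_GRID = {
    "spark.executor.instances": [2, 10],
    "spark.executor.cores": [1, 2],
    "spark.sql.shuffle.partitions": [6, 10],
}

def iter_configs(grid):
    """Yield one {option: value} dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def run_tag(conf):
    """Build a short label for a configuration, e.g. '10ex_1core_6sp'."""
    return (f"{conf['spark.executor.instances']}ex_"
            f"{conf['spark.executor.cores']}core_"
            f"{conf['spark.sql.shuffle.partitions']}sp")

configs = list(iter_configs(PARAM_GRID))
for conf in configs:
    print(run_tag(conf), conf)
```

Each dictionary could then be applied to the session builder with one `.config(key, value)` call per entry before `getOrCreate()`.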

In [2]:
spark = SparkSession.builder \
    .master("##")\
    .appName("cosmic_rays_spark")\
    .config("spark.executor.instances", 10)\
    .config("spark.executor.cores",1)\
    .config("spark.sql.shuffle.partitions", 10)\
    .config("spark.executor.memory", "1500m")\
    .config("spark.driver.memory", "1g")\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")\
    .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")\
    .config("spark.sql.adaptive.enabled", "false")\
    .config("spark.sql.adaptive.coalescePartitions.enabled", "false")\
    .config("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2")\
    .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")\
    .config('spark.eventLog.enabled', 'true')\
    .config('spark.sql.streaming.stateStore.stateSchemaCheck', 'false')\
    .config("spark.sql.streaming.numRecentProgressUpdates", 1000)\
    .getOrCreate()

:: loading settings :: url = jar:file:/home/bertinelli/spark/spark/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/bertinelli/.ivy2/cache
The jars for the packages stored in: /home/bertinelli/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9f51bc1d-490e-477c-9f70-41543341391b;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.2 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.2 in central
	found org.apache.kafka#kafka-clients;2.8.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.8.4 in central
	found org.slf4j#slf4j-api;1.7.32 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.2 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.hadoop#hadoop-client-api;3.3.2 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.apache.commons#commons-pool2;2.11.1 in central

23/07/09 18:53:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [None]:
sc = spark.sparkContext
sc

### Producer creation
Here we create the producer. It is later used to send the message containing the “cleaned” data to the dashboard: `producer.send` is called inside the batch function, and `foreachBatch` applies that function to every batch.

In [4]:
# define the list of brokers in the cluster
KAFKA_BOOTSTRAP_SERVER = '##' #kafka broker

In [5]:
# producer definition
producer = KafkaProducer(bootstrap_servers = KAFKA_BOOTSTRAP_SERVER,
                         batch_size=16000, # bytes (about 16 kB)
                         linger_ms=20)     # ms

        
# KAFKA ADMIN is responsible for creating/deleting topics

# connecting to client 
kafka_admin = KafkaAdminClient(bootstrap_servers = KAFKA_BOOTSTRAP_SERVER)

In [6]:
kafka_admin.list_topics()

['data_clean', 'test_clean_1', 'data_raw', '__consumer_offsets']

### Data preprocessing

We create a DataFrame representing the stream of input lines from Kafka by connecting to the appropriate servers and topic.

In [7]:
# read streaming df from kafka
inputDF = spark\
    .readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVER)\
    .option("kafkaConsumer.pollTimeoutMs", 1000)\
    .option('subscribe', 'data_raw')\
    .option("startingOffsets", "latest") \
    .load()

In [8]:
inputDF.isStreaming

True

We extract the value from each Kafka message and use `selectExpr` to build a table with the desired columns.
`selectExpr()` takes a set of SQL expressions, passed as strings, and returns a new DataFrame. This makes it possible to run SQL-like expressions without creating a temporary table or view.
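
To make the field layout explicit, here is a minimal pure-Python sketch of the same split-and-cast step applied to a single CSV line. The helper `parse_hit` and the sample record are made up for illustration. Unlike a Spark `cast`, which yields null for a value it cannot convert, this version raises a `ValueError` on malformed input.

```python
def parse_hit(line):
    """Split one CSV line into the six typed fields used in this notebook."""
    fields = line.strip().split(",")
    return {
        "HEAD": int(fields[0]),
        "FPGA": int(fields[1]),
        "TDC_CHANNEL": int(fields[2]),
        "ORBIT_CNT": int(fields[3]),
        "BX_COUNTER": int(fields[4]),
        "TDC_MEAS": float(fields[5]),
    }

# Made-up example record, not taken from the real data stream.
sample = "2,1,75,3387200947,2922,12.0"
hit = parse_hit(sample)
print(hit)
```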

In [9]:
# extract the value from the kafka message
rawraw_data = inputDF.select(col("value").cast("string")).alias("csv")

# split the csv line in the corresponding fields
raw_data = rawraw_data.selectExpr("cast(split(value, ',')[0] as int) as HEAD",
                                   "cast(split(value, ',')[1] as int) as FPGA",
                                   "cast(split(value, ',')[2] as int) as TDC_CHANNEL",
                                   "cast(split(value, ',')[3] as long) as ORBIT_CNT",
                                   "cast(split(value, ',')[4] as int) as BX_COUNTER",
                                   "cast(split(value, ',')[5] as double) as TDC_MEAS")

In [10]:
#raw_data.printSchema()

We keep only the useful data (`HEAD=2`) and apply the mapping between the data format and the detectors, creating a new column `CHAMBER`. Finally, we remove all rows with null values.

In [11]:
# keep only the rows with HEAD == 2
raw_data = raw_data.filter(raw_data.HEAD == 2)

# add CHAMBER column
raw_data = raw_data.withColumn("CHAMBER", \
                    when((raw_data.FPGA == 0) & (raw_data.TDC_CHANNEL>=0) & (raw_data.TDC_CHANNEL<64), 0) \
                   .when((raw_data.FPGA == 0) & (raw_data.TDC_CHANNEL>=64) & (raw_data.TDC_CHANNEL<128), 1 ) \
                   .when((raw_data.FPGA == 1) & (raw_data.TDC_CHANNEL>=0) & (raw_data.TDC_CHANNEL<64), 2 ) \
                   .when((raw_data.FPGA == 1) & (raw_data.TDC_CHANNEL>=64) & (raw_data.TDC_CHANNEL<128), 3 )\
                  ).na.drop()
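
The chamber mapping above can also be written as a short arithmetic rule. The following pure-Python sketch is only an illustration of that mapping; the helper name `chamber_for` is hypothetical. It returns `None` where the `when` chain produces a null, which is the case removed by `.na.drop()`.

```python
def chamber_for(fpga, tdc_channel):
    """Return the chamber index (0-3) for a hit, or None if it maps to no chamber."""
    if fpga not in (0, 1) or not 0 <= tdc_channel < 128:
        return None
    # Each FPGA serves two chambers of 64 channels each.
    return 2 * fpga + tdc_channel // 64

print(chamber_for(0, 10), chamber_for(0, 64), chamber_for(1, 63), chamber_for(1, 127))
```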



### Data processing

We create a function that will be applied to each batch of data in order to compute the following quantities:
1. total number of processed hits, post-cleansing (1 value per batch) &rarr; `hit_count`
2. total number of processed hits, post-cleansing, per chamber (4 values per batch) &rarr; `hit_count_chamber`
3. histogram of the counts of active `TDC_CHANNEL`, per chamber (4 arrays per batch) &rarr; `ch*_tdc_counts_list`
4. histogram of the total number of active `TDC_CHANNEL` in each `ORBIT_CNT`, per chamber (4 arrays per batch) &rarr; `ch*_orbit_counts_list`

Finally, we create a json message that will be sent to a Kafka topic via `producer.send()`.
Additionally, we print the time taken by each batch from the beginning of the data processing phase until the message is sent to the topic.

We call `.persist()` on the input DataFrame and on the intermediate DataFrames of tasks 3 and 4.
It is good practice to persist a DataFrame that is used repeatedly: its partitions are cached and are not recomputed each time the DataFrame is accessed.
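
As a reference for what the four quantities mean, here is a minimal pure-Python sketch that computes them on a handful of made-up hits. It is not the Spark implementation used below. One deliberate difference: this sketch reports 0 for a chamber with no hits, whereas a `groupby` only returns the chambers that are present in the batch.

```python
from collections import Counter, defaultdict

# Made-up hits as (CHAMBER, TDC_CHANNEL, ORBIT_CNT) tuples.
hits = [
    (0, 10, 100), (0, 10, 100), (0, 11, 100),
    (1, 70, 100), (1, 71, 101),
    (3, 90, 101),
]

# 1. total number of hits in the batch
hit_count = len(hits)

# 2. hits per chamber, ordered by chamber index 0..3
per_chamber = Counter(ch for ch, _, _ in hits)
hit_count_chamber = [per_chamber.get(ch, 0) for ch in range(4)]

# 3. hits per TDC_CHANNEL, per chamber
tdc_counts = defaultdict(Counter)
for ch, tdc, _ in hits:
    tdc_counts[ch][tdc] += 1

# 4. number of distinct TDC_CHANNEL values per ORBIT_CNT, per chamber
channels_seen = defaultdict(lambda: defaultdict(set))
for ch, tdc, orbit in hits:
    channels_seen[ch][orbit].add(tdc)
active_tdc = {ch: {orbit: len(seen) for orbit, seen in orbits.items()}
              for ch, orbits in channels_seen.items()}

print(hit_count, hit_count_chamber)
print(dict(tdc_counts[0]), active_tdc)
```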

In [12]:
from numpyencoder import NumpyEncoder

ID = -1

# function to apply to each batch: writes and sends a kafka message at the end
def batch_processing(df, epoch_id):

    df = df.persist()

    # 1: total number of processed hits, post-cleansing (1 value per batch)
    
    start = time.time()
    
    hit_count = df.count()
    
    
    # 2: total number of processed hits, post-cleansing, per chamber (4 values per batch)


    hit_count_chamber = df.groupby('CHAMBER').agg(count('TDC_CHANNEL').alias('HIT_COUNT'))\
                        .sort("CHAMBER").select('HIT_COUNT')\
                        .agg(collect_list('HIT_COUNT')).collect()



    # 3: histogram of the counts of active TDC_CHANNEL, per chamber (4 arrays per batch)

    tdc_counts = df.groupby(['CHAMBER','TDC_CHANNEL']).agg(count('TDC_CHANNEL').alias('TDC_COUNTS'))
    tdc_counts = tdc_counts.persist()

    # Filter the tdc_counts DataFrame for each chamber 
    
    ch0_tdc_counts = tdc_counts.filter(tdc_counts.CHAMBER == 0).select('TDC_CHANNEL','TDC_COUNTS')\
                    .sort("TDC_CHANNEL").toPandas()
    
    ch1_tdc_counts = tdc_counts.filter(tdc_counts.CHAMBER == 1).select('TDC_CHANNEL','TDC_COUNTS')\
                    .sort("TDC_CHANNEL").toPandas()
    
    ch2_tdc_counts = tdc_counts.filter(tdc_counts.CHAMBER == 2).select('TDC_CHANNEL','TDC_COUNTS')\
                    .sort("TDC_CHANNEL").toPandas()
    
    ch3_tdc_counts = tdc_counts.filter(tdc_counts.CHAMBER == 3).select('TDC_CHANNEL','TDC_COUNTS')\
                    .sort("TDC_CHANNEL").toPandas()

    
    
    #Save it in a list
    
    ch0_tdc_channels_list = list(ch0_tdc_counts['TDC_CHANNEL'])
    ch0_tdc_counts_list   = list(ch0_tdc_counts['TDC_COUNTS'])

    ch1_tdc_channels_list = list(ch1_tdc_counts['TDC_CHANNEL'])
    ch1_tdc_counts_list   = list(ch1_tdc_counts['TDC_COUNTS'])
    
    ch2_tdc_channels_list = list(ch2_tdc_counts['TDC_CHANNEL'])
    ch2_tdc_counts_list   = list(ch2_tdc_counts['TDC_COUNTS'])
    
    ch3_tdc_channels_list = list(ch3_tdc_counts['TDC_CHANNEL'])
    ch3_tdc_counts_list   = list(ch3_tdc_counts['TDC_COUNTS'])
    
    

    # 4: histogram of the total number of active TDC_CHANNEL in each ORBIT_CNT, per chamber (4 arrays per batch)

    orbit_count=df.groupby(['CHAMBER','ORBIT_CNT']).agg(countDistinct("TDC_CHANNEL").alias('TDC_ORBIT'))
    orbit_count = orbit_count.persist()

    ch0_orbit_counts = orbit_count.filter(orbit_count.CHAMBER == 0).select('ORBIT_CNT','TDC_ORBIT')\
                    .sort("ORBIT_CNT").toPandas()
    
    ch1_orbit_counts = orbit_count.filter(orbit_count.CHAMBER == 1).select('ORBIT_CNT','TDC_ORBIT')\
                    .sort("ORBIT_CNT").toPandas()
    
    ch2_orbit_counts = orbit_count.filter(orbit_count.CHAMBER == 2).select('ORBIT_CNT','TDC_ORBIT')\
                    .sort("ORBIT_CNT").toPandas()
    
    ch3_orbit_counts = orbit_count.filter(orbit_count.CHAMBER == 3).select('ORBIT_CNT','TDC_ORBIT')\
                    .sort("ORBIT_CNT").toPandas()
    
    #Save it in a list
    
    ch0_orbit_list          = list(ch0_orbit_counts['ORBIT_CNT'])
    ch0_orbit_counts_list   = list(ch0_orbit_counts['TDC_ORBIT'])

    ch1_orbit_list          = list(ch1_orbit_counts['ORBIT_CNT'])
    ch1_orbit_counts_list   = list(ch1_orbit_counts['TDC_ORBIT'])

    ch2_orbit_list          = list(ch2_orbit_counts['ORBIT_CNT'])
    ch2_orbit_counts_list   = list(ch2_orbit_counts['TDC_ORBIT'])
    
    ch3_orbit_list          = list(ch3_orbit_counts['ORBIT_CNT'])
    ch3_orbit_counts_list   = list(ch3_orbit_counts['TDC_ORBIT'])
    

 
    
    df.unpersist()
    tdc_counts.unpersist()
    orbit_count.unpersist()
        
    
    global ID
    ID += 1

    # prepare message to send to kafka
    
    msg = {

        'msg_ID': ID,
        'hit_count': hit_count,
        'hit_count_chamber': hit_count_chamber[0][0],
           
        'tdc_counts_chamber': {
            '0': {
                'bin_edges': ch0_tdc_channels_list,
                'hist_counts': ch0_tdc_counts_list
            },
            '1': {
                'bin_edges': ch1_tdc_channels_list,
                'hist_counts': ch1_tdc_counts_list
            },
            '2': {
                'bin_edges': ch2_tdc_channels_list,
                'hist_counts': ch2_tdc_counts_list
            },
            '3': {
                'bin_edges': ch3_tdc_channels_list,
                'hist_counts': ch3_tdc_counts_list
            }
        },
        'active_tdc_chamber': {
            '0': {
                'bin_edges': ch0_orbit_list,
                'hist_counts': ch0_orbit_counts_list
            },
            '1': {
                'bin_edges': ch1_orbit_list,
                'hist_counts': ch1_orbit_counts_list
            },
            '2': {
                'bin_edges': ch2_orbit_list,
                'hist_counts': ch2_orbit_counts_list
            },
            '3': {
                'bin_edges': ch3_orbit_list,
                'hist_counts': ch3_orbit_counts_list
            }
        }
    }
   

    
    
    
    producer.send('data_clean', json.dumps(msg).encode('utf-8'))
    producer.flush()
    #producer.poll(0)
    
    stop = time.time() - start
    
    print(f'\nTime: {stop}')
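
The message is sent as UTF-8 encoded JSON, so a consumer can rebuild the same dictionary by reversing the two steps. The sketch below shows this round trip without Kafka; the message content is made up and shortened, and only the key names follow the function above.

```python
import json

# Made-up, shortened message with the same top-level keys as above.
msg = {
    "msg_ID": 0,
    "hit_count": 6,
    "hit_count_chamber": [3, 2, 1],
    "tdc_counts_chamber": {"0": {"bin_edges": [10, 11], "hist_counts": [2, 1]}},
    "active_tdc_chamber": {"0": {"bin_edges": [100], "hist_counts": [2]}},
}

payload = json.dumps(msg).encode("utf-8")       # bytes handed to producer.send()
decoded = json.loads(payload.decode("utf-8"))   # what a consumer would rebuild

print(type(payload).__name__, decoded["hit_count_chamber"])
```

The chamber keys are strings (`'0'` to `'3'`) in the original message as well, which matters because JSON object keys are always strings after decoding.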

### Sending message to Kafka topic

Here, we apply the function defined earlier, `batch_processing()`, to each batch in the streaming dataset `raw_data`, through the `foreachBatch()` function.
`.trigger()` defines when the streaming query is executed and how often it processes the data. To investigate performance differences, we varied the `processingTime` parameter.

In [13]:
query = raw_data.writeStream\
            .outputMode("update")\
            .foreachBatch(batch_processing)\
            .option("checkpointLocation", "checkpoint")\
            .trigger(processingTime='5 seconds')\
            .start()
query.awaitTermination(5)

False

                                                                                


Time: 23.333239793777466
23/07/09 18:54:13 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 26411 milliseconds


23/07/09 18:54:16 WARN TaskSetManager: Lost task 9.0 in stage 56.0 (TID 210) (10.67.22.77 executor 0): java.util.concurrent.TimeoutException: Cannot fetch record for offset 3576 in 1000 milliseconds
	at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.fetch(KafkaDataConsumer.scala:97)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchData(KafkaDataConsumer.scala:552)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchRecord(KafkaDataConsumer.scala:476)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:313)
	at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:618)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:290)
	at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitio



23/07/09 18:54:17 WARN TaskSetManager: Lost task 2.0 in stage 56.0 (TID 203) (10.67.22.111 executor 6): java.util.concurrent.TimeoutException: Cannot fetch record for offset 3468 in 1000 milliseconds
	at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.fetch(KafkaDataConsumer.scala:97)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchData(KafkaDataConsumer.scala:552)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchRecord(KafkaDataConsumer.scala:476)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:313)
	at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:618)
	at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:290)
	at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartiti

                                                                                


Time: 17.259406089782715
23/07/09 18:54:30 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 17611 milliseconds


                                                                                


Time: 7.772867679595947
23/07/09 18:54:39 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 8035 milliseconds

Time: 5.393748760223389
23/07/09 18:54:44 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 5635 milliseconds

Time: 5.390012741088867
23/07/09 18:54:50 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 5624 milliseconds


                                                                                


Time: 5.221779108047485
23/07/09 18:54:55 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 5494 milliseconds


                                                                                


Time: 4.829960346221924
23/07/09 18:55:00 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 5061 milliseconds

Time: 4.542272090911865

Time: 4.629445314407349

Time: 4.434342861175537

Time: 3.8700287342071533


                                                                                


Time: 3.9892547130584717

Time: 3.41048264503479

Time: 3.506680727005005


                                                                                


Time: 3.6059868335723877

Time: 3.1846048831939697


                                                                                


Time: 3.816849708557129

Time: 3.565668821334839

Time: 3.7120273113250732


                                                                                


Time: 3.7680063247680664

Time: 3.149793863296509

Time: 3.757244110107422

Time: 2.896088123321533

Time: 3.446298122406006

Time: 3.2694919109344482

Time: 3.2564494609832764

Time: 3.259594678878784

Time: 2.812570095062256

Time: 3.298490285873413

Time: 2.913853645324707

Time: 3.0727643966674805

Time: 3.1094141006469727

Time: 2.691243886947632

Time: 3.064256429672241

Time: 2.956057071685791

Time: 3.2700514793395996

Time: 2.7510828971862793

Time: 3.21443772315979

Time: 3.1759955883026123

Time: 2.9388937950134277

Time: 3.2907140254974365

Time: 2.747364044189453

Time: 2.9534990787506104

Time: 2.66043758392334

Time: 3.0427260398864746

Time: 3.160946846008301

Time: 2.81296443939209

Time: 3.1592609882354736

Time: 2.575378656387329

Time: 2.9755373001098633

Time: 3.009711980819702

Time: 2.8884880542755127

Time: 3.008955478668213

Time: 2.801079034805298

Time: 2.953573703765869

Time: 3.042083501815796


                                                                                


Time: 2.994786500930786


                                                                                


Time: 2.8760201930999756

Time: 3.0294787883758545


                                                                                


Time: 3.008552312850952

Time: 2.4440202713012695


                                                                                


Time: 3.2959132194519043

Time: 2.6078410148620605

Time: 2.876523017883301

Time: 3.189990520477295

Time: 3.015429735183716


                                                                                


Time: 3.019137382507324

Time: 3.0072250366210938

Time: 2.7554783821105957

Time: 2.823387384414673

Time: 2.3266735076904297

Time: 2.8080697059631348

Time: 2.0749716758728027

Time: 2.62727427482605

Time: 2.05653715133667

Time: 2.4869773387908936

Time: 2.8282835483551025

Time: 2.3100874423980713

Time: 2.729393243789673

Time: 2.22770094871521

Time: 2.6316006183624268

Time: 2.7601494789123535

Time: 2.667609214782715

Time: 2.6650891304016113

Time: 2.3871240615844727


                                                                                


Time: 2.52372670173645

Time: 2.257821798324585

Time: 2.580472707748413

Time: 1.9868428707122803

Time: 2.2676098346710205


                                                                                


Time: 2.5303797721862793

Time: 2.3071255683898926


                                                                                


Time: 2.5183262825012207

Time: 1.9255688190460205

Time: 2.8958358764648438

Time: 2.1098105907440186

Time: 2.2779085636138916


                                                                                


Time: 2.5749142169952393

Time: 2.279555082321167

Time: 2.629796028137207


                                                                                


Time: 2.4965438842773438

Time: 2.2775838375091553

Time: 2.71659779548645

Time: 2.191462993621826

Time: 2.628805637359619

Time: 1.9191901683807373

Time: 2.4842031002044678

Time: 2.5166053771972656

Time: 2.166447877883911

Time: 2.3941640853881836

Time: 1.9022841453552246

Time: 2.4899544715881348

Time: 2.557339668273926

Time: 2.2597131729125977

Time: 2.534168243408203

Time: 2.287539482116699

Time: 2.561946392059326

Time: 2.086841106414795

Time: 2.6206109523773193

Time: 2.513747453689575

Time: 2.3901748657226562

Time: 2.5918385982513428

Time: 2.239313840866089

Time: 2.809548854827881

Time: 2.7469751834869385

Time: 2.2950141429901123

Time: 2.625988006591797

Time: 2.0006887912750244

Time: 2.70859956741333

Time: 1.9605824947357178

Time: 2.3598086833953857


                                                                                


Time: 2.4638166427612305

Time: 2.3313255310058594

Time: 2.5002381801605225

Time: 2.317396402359009


                                                                                


Time: 2.624452590942383

Time: 2.093494176864624

Time: 2.3732099533081055


                                                                                


Time: 2.519057035446167

Time: 2.2385497093200684


                                                                                


Time: 2.518535614013672

Time: 2.1100881099700928


                                                                                


Time: 3.1038897037506104

Time: 1.8902966976165771

Time: 2.3244259357452393


                                                                                


Time: 2.3915998935699463

Time: 2.276236057281494

Time: 2.4216148853302

Time: 1.95577073097229


                                                                                


Time: 2.426583766937256

Time: 1.8471801280975342

Time: 2.4292259216308594

Time: 2.3690011501312256

Time: 2.0698370933532715

Time: 2.3731672763824463

Time: 2.232344150543213

Time: 2.4207046031951904


                                                                                


Time: 2.8191587924957275

Time: 2.2325692176818848

Time: 2.6194610595703125

Time: 2.0561158657073975

Time: 2.3727610111236572

Time: 2.371978759765625

Time: 2.2337653636932373

Time: 2.5663230419158936

Time: 2.0533816814422607

Time: 2.5097880363464355

Time: 2.518064498901367

Time: 2.319547653198242

Time: 2.566708564758301

Time: 2.062512159347534

Time: 2.6225991249084473

Time: 2.5584399700164795

Time: 2.1881372928619385

Time: 2.4177534580230713

Time: 2.248551607131958

Time: 2.371579647064209

Time: 1.8533215522766113

Time: 2.4808788299560547

Time: 2.5733091831207275

Time: 2.0275349617004395

Time: 2.374767780303955

Time: 1.898397445678711

Time: 2.3799285888671875


                                                                                


Time: 2.620708703994751

Time: 2.182457685470581

Time: 2.4468491077423096

Time: 1.9832878112792969


                                                                                


Time: 2.61042857170105


                                                                                


Time: 2.7551894187927246

Time: 2.559438467025757

Time: 2.853097915649414

Time: 2.237277030944824

Time: 2.6790223121643066

Time: 2.065248489379883

Time: 2.392308473587036


                                                                                


Time: 3.1624374389648438

Time: 2.177893877029419

Time: 2.437223434448242

Time: 2.125791549682617


                                                                                


Time: 2.4024598598480225

Time: 1.9754137992858887

Time: 2.276840925216675

Time: 2.552828550338745

Time: 2.161890745162964

Time: 2.434418201446533

Time: 1.8059372901916504


                                                                                


Time: 2.7177393436431885

Time: 2.516848087310791

Time: 2.1920626163482666

Time: 2.4804608821868896

Time: 2.206873893737793


                                                                                


Time: 2.389843225479126

Time: 1.8556489944458008

Time: 2.190796136856079

Time: 2.459735155105591

Time: 2.102717161178589

Time: 2.5428974628448486

Time: 2.069857358932495

Time: 2.551340341567993

Time: 2.471088409423828

Time: 2.4444797039031982


                                                                                


Time: 2.565580368041992

Time: 2.1616575717926025


                                                                                


Time: 2.927673101425171


                                                                                


Time: 2.457263231277466

Time: 2.333871841430664


                                                                                


Time: 2.2513229846954346

Time: 1.9208505153656006


                                                                                


Time: 2.380497694015503


                                                                                


Time: 2.764342784881592


                                                                                


Time: 2.415839672088623
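
The printed times can be summarised with a few lines of Python. The sketch below is only an illustration: it uses seven values copied from the output above rather than the full log, and compares them with the 5 second trigger interval. Note that the printed time covers only the body of `batch_processing()`, so it is slightly smaller than the batch duration reported by Spark (for example 4.83 s printed against 5061 ms in the warning).

```python
import re
import statistics

TRIGGER_SECONDS = 5.0  # matches trigger(processingTime='5 seconds')

# A few lines copied from the output above (not the full log).
log = """
Time: 23.333239793777466
Time: 17.259406089782715
Time: 7.772867679595947
Time: 5.393748760223389
Time: 4.829960346221924
Time: 3.009711980819702
Time: 2.415839672088623
"""

# Extract the numeric value from every "Time: ..." line.
times = [float(m) for m in re.findall(r"^Time: ([0-9.]+)$", log, flags=re.MULTILINE)]
late = [t for t in times if t > TRIGGER_SECONDS]

print(f"batches: {len(times)}, slower than trigger: {len(late)}")
print(f"median: {statistics.median(times):.2f} s")
```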


### Metrics

To analyze the performance of the various parameter combinations, we save the metrics in a .json file so they can be analyzed and compared later.

In [None]:
a = query.recentProgress

# abbreviations used in the file name:
# kp = Kafka partitions
# em = executor memory
# dm = driver memory
# sp = shuffle partitions
# aT = Arrow true
# w  = workers
with open('metriche/10ex_1core_6sp_aT_2w_6kp_5000batch_1secProcTime.json', 'w') as file:
    # Perform file operations
    json.dump(a, file)
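
A saved file can later be reloaded and reduced to a few numbers per batch. The following is a minimal sketch under the assumption that each saved entry is a dictionary with the fields `batchId`, `numInputRows` and `durationMs.triggerExecution`, as in Spark's streaming progress reports. The helper `summarize_progress` is hypothetical and the example entries are made up.

```python
import json

def summarize_progress(progress):
    """Return one (batchId, input rows, trigger time in ms) tuple per batch."""
    rows = []
    for p in progress:
        duration = p.get("durationMs", {})
        rows.append((p.get("batchId"),
                     p.get("numInputRows"),
                     duration.get("triggerExecution")))
    return rows

# Made-up entries that mimic the shape of the saved progress items.
example = [
    {"batchId": 0, "numInputRows": 5000, "durationMs": {"triggerExecution": 26411}},
    {"batchId": 1, "numInputRows": 5000, "durationMs": {"triggerExecution": 17611}},
]

# Round trip through JSON text, as happens when the file is written and re-read.
loaded = json.loads(json.dumps(example))
summary = summarize_progress(loaded)
for batch_id, n_rows, trigger_ms in summary:
    print(batch_id, n_rows, trigger_ms)
```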

In [14]:
print("Print this to stop the code and avoid to stop spark accidentally")

Print this to stop the code and avoid to stop spark accidentally


### Stop workers and master

In [15]:
# stop the running Spark context and Spark session
sc.stop()

spark.stop()