- Andrea De Vita
- Enrico Lupi
- Manfredi Miranda
- Francesco Zane

-----------------------

# Streaming Processing of the QUAX Experiment Data for the Detection of Galactic Axions

## Abstract

The axion is a hypothetical particle introduced to solve the strong CP problem of Quantum Chromodynamics. It is speculated that axions may also constitute the dark matter content of our galaxy. The [QUAX](https://www.pd.infn.it/eng/quax/) (QUaerere AXions) experiment aims to detect this particle using a copper cavity immersed in a static magnetic field of 8.1 T and cooled down to a working temperature of about 150 mK.

The goal of this project is to create a quasi real-time processing chain of the data produced by the QUAX experimental apparatus and a live monitoring system of the detector data, using [Apache Kafka](https://kafka.apache.org/) and [Apache Spark](https://spark.apache.org/).

## Table of Contents

1. [Introduction](#introduction) <br>
    1.1. [Experiment](#intro_experiment) <br>
    1.2. [Data Structure](#intro_data_structure) <br>
    1.3. [Cluster](#intro_cluster) <br>
2. [Data Processing](#processing) <br>
    2.1. [Pipeline Overview](#pipeline) <br>
    2.2. [Kafka - Streaming Data](#kafka) <br> 
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2.1. [Kafka Topics](#kafka_topic) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2.2. [Data Pre-processing](#kafka_preprocessing) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2.3. [Producer](#kafka_producer) <br>
    2.3. [Spark - Distributed Processing](#spark) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3.1. [Spark Structured Streaming](#spark_streaming) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3.2. [FFT and Averaging](#spark_fft) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3.3. [Output Message](#spark_output) <br>
    2.4. [Live Plot and Monitoring](#live_plot) <br>
3. [Performance Study](#performance) <br>
    3.1. [Kafka Message Size](#test_kafka_msgsize) <br>
    3.2. [Number of Partitions](#test_partitions)<br>
    3.3. [Trigger *ProcessingTime* and *maxOffsetsPerTrigger*](#test_trigger)<br>
4. [Conclusion](#conclusion)

## 1. Introduction <a name="introduction"></a>

### 1.1. Experiment <a name="intro_experiment"></a>


![quax lab](Images/labQuax.webp)

The QUAX experiment at the Legnaro INFN Laboratories aims to detect axions using a copper cavity immersed in a static magnetic field of 8.1 T and cooled down to a working temperature of about 150 mK. The axion is expected to couple with the spin of the electron, interacting with the cavity and inducing a radio-frequency signal that can be sensed via a Josephson parametric amplifier. For a given configuration of the RF cavity, a scan of the phase of the electromagnetic field is performed in order to identify a possible localised excess, which would hint at the coupling of an axion with the photon.

The data acquisition (DAQ) system of the QUAX experiment generates two streams of digitized readings from the amplifiers, representing the real and imaginary components of the measured phase. To improve the signal-to-noise ratio, a QUAX data-taking run extends over a long time (up to weeks), repeating the scans many times. Data are saved locally on the DAQ servers as binary files, each containing many consecutive scans over the entire frequency range. A single pair of raw files thus represents only a few seconds of data taking, but already includes several thousand scans.

### 1.2. Data Structure <a name="intro_data_structure"></a>

The dataset is composed of 2 sets of .dat binary files (named duck_i and duck_q respectively), each file containing a continuous series of ADC readings from the amplifier. Each ADC reading is written in the raw files as a 32-bit floating point value. The ADC readout frequency is 2 × 10<sup>6</sup> Hz (2 megasamples per second, or 2 MS/s), resulting in a raw data throughput of 128 Mbps (16 MB/s) for the two streams combined. During data taking the readouts are written to .dat files of 8193 × 2<sup>10</sup> samples each. As a result, a new pair of .dat files (duck_i and duck_q) is produced every 4.2 s.

The dataset is provided on a cloud storage S3 bucket hosted on CloudVeneto.
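The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative and are not taken from the project code.

```python
# Acquisition parameters quoted in the text
SAMPLE_RATE_HZ = 2_000_000          # 2 MS/s per stream
BYTES_PER_SAMPLE = 4                # 32-bit float
N_STREAMS = 2                       # duck_i and duck_q
SAMPLES_PER_FILE = 8193 * 2**10

# Combined throughput of the two streams, in bits per second
throughput_bps = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 8 * N_STREAMS

# Time needed to fill one file, and the size of that file
file_duration_s = SAMPLES_PER_FILE / SAMPLE_RATE_HZ
file_size_mib = SAMPLES_PER_FILE * BYTES_PER_SAMPLE / 2**20

print(f"throughput   : {throughput_bps / 1e6:.0f} Mbps "
      f"({throughput_bps / 8 / 1e6:.0f} MB/s)")
print(f"file duration: {file_duration_s:.2f} s")
print(f"file size    : {file_size_mib:.2f} MiB")
```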

### 1.3. Cluster <a name="intro_cluster"></a>

This project was carried out on a cluster composed of 4 virtual machines, each with 4 VCPUs, 25 GB of disk space and 8 GB of RAM. The virtual machines are hosted on [CloudVeneto](https://cloudveneto.it/), an OpenStack-based cloud managed by the University of Padova and INFN. We use Spark version 3.3.2 (with Scala version 2.12.15) and Kafka version 3.4.0.

## 2. Data Processing <a name="processing"></a>

The processing of the raw data is comprised of two phases:
1. Run a Fourier transform on each scan to move from the time domain to the frequency domain
2. Average (in bins of frequency) all scans in a data-taking run, to extract a single frequency scan
 
This procedure is highly parallelizable, and should be implemented in a quasi-online pipeline for two main reasons:
1. The scans must be monitored during data taking to promptly spot and identify possible issues in the detector setup or instabilities in the conditions of the experiment
2. Data is continuously produced at a very high rate, and the local storage provided by the DAQ server of the QUAX experiment is not suited to large-volume, long-lasting datasets

### 2.1. Pipeline Overview <a name="pipeline"></a>

The data processing pipeline will be implemented as follows:
- Each pair of files is unpacked according to their schema and split into scans.
- Data is produced to a Kafka topic by a stream-emulator script every 5 seconds to simulate the fixed ADC scanning rate and the fixed size of files written to disk. 
- Each file is processed in a distributed framework using PySpark: for each scan, an FFT is executed in parallel and the results of all FFTs are averaged.
- The results are re-injected into a new Kafka topic hosted on the same brokers.
- A final consumer performs the plotting, displaying live updates of the scans and continuously updating the entire "run-wide" scan using Bokeh.

The overall pipeline can thus be summarised as:
![pipeline schema](Images/Pipeline_Schema.png)

### 2.2. Kafka - Streaming Data <a name="kafka"></a>

Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. As discussed previously, in this work it will be used to handle the live streaming of data from the DAQ servers all the way to the final live plot.

#### 2.2.1. Kafka Topics <a name="kafka_topic"></a>

The first step is to create a topic on the broker to hold the data from the DAQ. We name it *chunk_data* and create it with 12 separate partitions and no replication. The meaning of the name will be made clear in the next section.

We also create a second topic, aptly named *results*, on which the results of the data processing (i.e. the FFT and averaging) will be published later.

In [None]:
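# connect to the cluster to run admin functions
from kafka.admin import KafkaAdminClient, NewTopic

kafka_admin = KafkaAdminClient(
    bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
)

# define new topic to hold data
topic_in = NewTopic(name='chunk_data',
                    num_partitions=12, 
                    replication_factor=1)

# create the topic on the broker
kafka_admin.create_topics(new_topics=[topic_in])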

In [None]:
# define new topic for the results
topic_out = NewTopic(name='results',
                     num_partitions=12, 
                     replication_factor=1)

# create the topic on the broker
kafka_admin.create_topics(new_topics=[topic_out])

#### 2.2.2. Data Preprocessing <a name="kafka_preprocessing"></a>

Each file produced by the DAQ is 32 MB in size, so a Kafka message carrying both the real and imaginary components of a whole file would be 64 MB. Unfortunately, the default maximum message size in Kafka is only 1 MB. There are of course ways to raise this limit, namely:
 
- at the broker level, changing the *replica.fetch.max.bytes* in the broker settings and increasing the *max.message.bytes* for the topic to the desired value
- at the consumer level, increasing the *max.partition.fetch.bytes*, otherwise the consumer will fail to fetch these messages and will get stuck on processing
- at the producer level, increasing the *max.request.size* to ensure large messages can be sent

While this solution is possible, it goes against the philosophy of Kafka, which is designed to handle messages that are huge in number but small in size; sending very large messages is considered inefficient.

We thus decided to first unpack the data into slices and send a pair of real and imaginary slices as a single message. Since for each FFT we want *n<sub>bins</sub>* = 3 × 2<sup>10</sup> = 3072 bins and we have a total of 8193 × 2<sup>10</sup> samples per file, the number of slices to compute FFTs on for each file (and thus of messages to be sent) is

$$n_{slices} = \cfrac{n_{samples}}{n_{bins}} = \cfrac{8193 \times 2^{10} }{3 \times 2^{10}} = 2731$$

In [None]:
# read all data from input files
real = bytearray(binary_data_real)
imag = bytearray(binary_data_imm)

# unpack data
for f in range(n_slice):
    r_bin = real[4*n_bins*f:4*n_bins*(f+1)] # one float every 4 bytes
    i_bin = imag[4*n_bins*f:4*n_bins*(f+1)]

    # create kafka message: real slice followed by imaginary slice
    msg = r_bin + i_bin
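As a quick sanity check of the slicing scheme, the sketch below recomputes the number of slices and the size of one message, and shows an equivalent NumPy view of a file as a `(n_slices, n_bins)` matrix. The synthetic `raw` buffer is a stand-in for one real .dat file and is purely illustrative.

```python
import numpy as np

n_bins = 3 * 2**10                 # samples per slice (one FFT each)
n_samples = 8193 * 2**10           # samples per file

# The file divides exactly into slices: no remainder is left over
n_slices, remainder = divmod(n_samples, n_bins)

# One message = real slice + imaginary slice, 4 bytes per float32
msg_bytes = 2 * n_bins * 4

# Synthetic stand-in for the bytes of one raw file
raw = np.arange(n_samples, dtype=np.float32).tobytes()

# Row f of this matrix holds the same samples as
# real[4*n_bins*f : 4*n_bins*(f+1)] in the loop above
slices = np.frombuffer(raw, dtype=np.float32).reshape(n_slices, n_bins)

print(n_slices, remainder, msg_bytes, slices.shape)
```

With 24 KiB per message, each message stays well below the 1 MB default limit.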

#### 2.2.3. Producer <a name="kafka_producer"></a>

Lastly, we initialize the Kafka producer, which is responsible for reading data from the files and sending it to the correct topic. The message value, as described before, consists of two consecutive byte arrays containing the real and imaginary slices. The key, instead, encodes the file number and the number of the slice contained in the message.

In [None]:
import time
from colorama import Fore
from kafka import KafkaProducer

# Create a Kafka producer instance
chunk_producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)

# function to read files, unpack them and send them to Kafka
def send_chunks(file_paths,dirPath,DAQ_period=5):
    
    # returns a list of lists each containing a pair of real and imaginary files 
    partners = sorted(find_partner(file_paths),
                      key=lambda x: get_number_from_filename(x[0]))
    
    startTot = time.time()
    wastedTime = 0
    
    for couple in partners: 
        start_time = time.time()
            
        # read all data from input files
        couple=[dirPath+x for x in couple]
        binary_data_real = read_binary_file(couple[0])
        binary_data_imm = read_binary_file(couple[1])
        
        real = bytearray(binary_data_real)
        imag = bytearray(binary_data_imm)
        
        file_num=int(couple[0][-9:-4])
        
        # unpack data
        for f in range(n_slice):
            r_bin = real[4*n_bins*f:4*n_bins*(f+1)] # one float every 4 bytes
            i_bin = imag[4*n_bins*f:4*n_bins*(f+1)]

            msg = r_bin + i_bin
        
            # key = file + bin number
            key = (file_num).to_bytes(2, "big") + f.to_bytes(2, "big")
           
            print(Fore.RED +"Sending file",file_num,"\tslice number:",f+1,end="\r")
            
            # send to Kafka topic
            chunk_producer.send(topic = "chunk_data",
                                key   = key,
                                value = msg)
        
        end_time1 = time.time()
        deltat = end_time1 - start_time
        print("                                                                 ",end="\r")
        print("File", file_num,"commissioned in", round(deltat,3), "s!")
        
        chunk_producer.flush()  # Flush the producer after sending the entire file
        
        end_time2 = time.time()
        deltat = end_time2 - start_time
        print("                                                                 ")
        print("File", file_num,"completed in ", deltat, " sec!")
        print("------------------------------")
        
        wastedTime+=(end_time2 - end_time1)
        
        # sleep to reproduce DAQ acquisition time
        if deltat < DAQ_period:
            time.sleep(DAQ_period - deltat)
            
    endTot = time.time()
    deltaTot = endTot - startTot
    
    print("                                                                 ")
    print("                                                                 ")
    print("------------------------------")
    print(Fore.GREEN+"Total time", round(deltaTot,3), "s!")
    print(Fore.RED +"Wasted time", round(wastedTime,3), "s!")
    print(Fore.BLACK +"------------------------------")
        
send_chunks(file_paths,folder_path)
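The key layout used by the producer (two big-endian unsigned 16-bit integers, file number first) can be illustrated in isolation. The helper names `make_key` and `parse_key` below are hypothetical and are not part of the project code.

```python
import struct

def make_key(file_num, slice_num):
    """Pack file and slice numbers as two big-endian 2-byte integers."""
    return file_num.to_bytes(2, "big") + slice_num.to_bytes(2, "big")

def parse_key(key):
    """Recover (file_num, slice_num) from a 4-byte key."""
    return struct.unpack(">HH", key)

key = make_key(12, 2730)
print(key, parse_key(key))
```

With two bytes per field, both the file number and the slice number must stay below 65536, which comfortably covers the 2731 slices of a file.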

### 2.3. Spark - Distributed Processing <a name="spark"></a>

Apache Spark is an open-source unified analytics engine for large-scale data processing. As outlined in the overview, Spark will do the "heavy lifting" of the data processing by computing the FFTs for each slice in parallel and averaging them.

The first step is to create a Spark application.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("spark://10.67.22.8:7077") \
        .appName("Spark structured streaming application") \
        .config("spark.executor.memory", "1000m") \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.sql.adaptive.enabled", "false") \
        .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "false") \
        .config("spark.sql.shuffle.partitions", 12) \
        .config("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2") \
        .getOrCreate()

#### 2.3.1. Spark Structured Streaming <a name="spark_streaming"></a>

In order to deal with the quasi-continuous stream of data from Kafka we use Spark Structured Streaming. This lets us use the DataFrame API and treat the incoming data as new rows of an unbounded table.

Given the simplicity of the data and its lack of multiple features, we could also have chosen Spark Streaming, which works by dividing the input data into a sequence of micro-batches (DStream) that can be treated as static datasets. Unfortunately, Spark Streaming is considered deprecated, and the packages necessary to connect it to Kafka are not available in Spark version 3, which we are using.

We first create an input (streaming) DataFrame subscribed to the *chunk_data* topic.

In [None]:
inputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS) \
    .option("kafkaConsumer.pollTimeoutMs", 30_000) \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 2000) \
    .option("subscribe", "chunk_data") \
    .load()

Let's explain what some of these options are:
- *startingOffsets* sets the start point when a query is started: either "earliest", which reads from the earliest offsets, "latest", which reads only from the latest offsets, or a JSON string specifying a starting offset for each TopicPartition.
- *kafkaConsumer.pollTimeoutMs* is the timeout in milliseconds to poll data from Kafka in executors.
- *maxOffsetsPerTrigger* is the rate limit on the maximum number of offsets processed per trigger interval. The specified total number of offsets is proportionally split across topicPartitions of different volume.

Note that *maxOffsetsPerTrigger* is set to 2000, which is less than the 2731 slices in a file: a batch can therefore never be bigger than a single file. It is still possible, however, for a batch to contain slices from different files. This is not a problem, as the division into files depends only on the DAQ system and does not reflect any underlying physics.

The input DataFrame from Kafka has the following schema:

In [None]:
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

#### 2.3.2. FFT and Averaging <a name="spark_fft"></a>

We are only interested in the *key* and *value* columns. We unpack the *value* column into a list of floats and compute its Fourier Transform. Note that the Fourier Transform must be explicitly declared as a UDF (User Defined Function) so that it can act on DataFrame columns.

In [None]:
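import struct as sstruct  # alias assumed from the sstruct.unpack calls below
import numpy as np
from pyspark.sql.functions import udf, col, explode
from pyspark.sql.types import (ArrayType, FloatType, IntegerType,
                               StringType, StructField, StructType)

# Function to convert a byte array into a list of float values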
def bytes_to_float32_list(bytes_value):
    float_list = []
    for i in range(0, len(bytes_value), 4): 
        float_value = sstruct.unpack('f', bytes_value[i:i+4])[0]
        float_list.append(float_value)
    
    return float_list

# Function to compute the Fourier transform of a given array
def Fourier(x):
    x = np.array(x)
    # take real and imaginary components and get complex number  
    z = to_complex(x)
    
    power = np.abs(np.fft.fft(z))**2
    FS = fft_bandwidth
    norm = n_bins * FS * np.sqrt(2)
    normalized_power = power / norm
    power_shifted = np.fft.fftshift(normalized_power)
    
    power_shifted = power_shifted.tolist()
    
    return(power_shifted)

# Function to index elements in a list with file numbers
def indexing(x,file_num):
    k = []
    for i in range(len(x)):
        add = (f'{file_num}_{i}',x[i])
        k.append(add)
    return k

# Function to extract the file number from a byte array (big-endian short)
def extract_file_num(key_bytes):
    return sstruct.unpack('>H', key_bytes[:2])[0]  # Unpack from big-endian short



# schema for indexing UDF
schema = StructType(
        [
                StructField("index", StringType()),
                StructField("x", FloatType())
        ]
)

# Define UDFs
bytes_to_float32_udf = udf(bytes_to_float32_list, ArrayType(FloatType()))
fft_udf = udf(Fourier, ArrayType(FloatType()))
extract_file_num_udf = udf(extract_file_num, IntegerType())
indexing_udf = udf(indexing, ArrayType(schema))
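The per-slice computation can also be tried outside Spark. The following is a minimal NumPy sketch under stated assumptions: `to_complex` is not shown in the code above, so here it is assumed to treat the first half of the array as the real part and the second half as the imaginary part (consistent with `msg = r_bin + i_bin`), and the value of `FFT_BANDWIDTH` is a placeholder, not the project's setting. The byte decoding uses `np.frombuffer` instead of a Python loop over `unpack` calls.

```python
import numpy as np

N_BINS = 3 * 2**10
FFT_BANDWIDTH = 2e6   # placeholder value, assumed for illustration only

def bytes_to_float32_array(bytes_value):
    # Vectorized decoding: one native-endian float32 every 4 bytes
    return np.frombuffer(bytes_value, dtype=np.float32)

def to_complex(x):
    # Assumed layout: real slice followed by imaginary slice
    half = len(x) // 2
    return x[:half] + 1j * x[half:]

def power_spectrum(bytes_value, fs=FFT_BANDWIDTH):
    z = to_complex(bytes_to_float32_array(bytes_value))
    power = np.abs(np.fft.fft(z)) ** 2
    norm = len(z) * fs * np.sqrt(2)     # same normalization as Fourier()
    return np.fft.fftshift(power / norm)

# Synthetic message: a pure complex tone 100 bins above the centre
n = np.arange(N_BINS)
tone = np.exp(2j * np.pi * 100 * n / N_BINS)
msg = (tone.real.astype(np.float32).tobytes()
       + tone.imag.astype(np.float32).tobytes())

spectrum = power_spectrum(msg)
peak_bin = int(np.argmax(spectrum)) - N_BINS // 2
print(spectrum.shape, peak_bin)
```

After `fftshift` the zero-frequency bin sits at index `N_BINS // 2`, so the peak offset is measured relative to the centre of the spectrum.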

In [None]:
# Apply UDFs to transform 'value' column in a list of Fourier transformed value
streaming_df = inputDF.select('key', 'value')
streaming_df = streaming_df.withColumn('float', bytes_to_float32_udf(streaming_df['value']))
streaming_df = streaming_df.withColumn('fft', fft_udf(streaming_df['float']))

# Extract file numbers from 'key' column
streaming_df = streaming_df.withColumn('file_num', extract_file_num_udf(col('key')))

# Apply UDF to index 'fft' column by 'file_num'
streaming_df = streaming_df.withColumn('indexed_fft', indexing_udf(streaming_df['fft'],streaming_df['file_num']) )

# Explode the 'indexed_fft' array to separate rows
exploded_df = streaming_df.select('key', explode('indexed_fft').alias('indexed_fft'))

We then extract the file number information from the *key* and, after some transformations, obtain a row for each bin with the following schema:

In [None]:
root
 |-- key: binary (nullable = true)
 |-- indexed_fft: struct (nullable = true)
 |    |-- index: integer (nullable = true)
 |    |-- x: float (nullable = true)

where *indexed_fft* is a structure containing two elements:
- *index*, the combination of the file number and bin number
- *x*, the value of the FFT in the specific bin for the specific file

Finally, we compute the mean and standard deviation of the FFTs for each bin after grouping by the FFT index.

In [None]:
from pyspark.sql.functions import mean, stddev, count

# Group by 'indexed_fft.index' and calculate statistics
result_df = exploded_df.groupBy("indexed_fft.index").agg(
    mean("indexed_fft.x").alias("mean_x"),
    stddev("indexed_fft.x").alias("stddev_x"),
    count("indexed_fft.x").alias("count_x")
)
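To make the aggregation concrete, here is a small static analogue in pandas, with a toy table standing in for the exploded streaming DataFrame. The values are made up for illustration; like Spark's `stddev`, pandas' `std` computes the sample standard deviation.

```python
import pandas as pd

# Toy stand-in for exploded_df: one row per (file, bin) value of a slice
exploded = pd.DataFrame({
    "index": ["0_0", "0_1", "0_0", "0_1", "0_0"],
    "x":     [1.0,   10.0,  2.0,   14.0,  3.0],
})

# Same statistics as the Spark aggregation, grouped by the FFT index
result = exploded.groupby("index")["x"].agg(
    mean_x="mean",
    stddev_x="std",
    count_x="count",
)
print(result)
```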

#### 2.3.3. Output Message <a name="spark_output"></a>

The only step left on the Spark side is to produce a message and send it to Kafka on the *results* topic. This message is only sent once a whole file has been analyzed, i.e. once all 2731 slices have been used to compute the mean. 

The message will follow this schema:

In [None]:
root
 |-- data: struct (nullable = false)
 |    |-- index: integer (nullable = true)
 |    |-- mean_x: double (nullable = true)
 |    |-- stddev_x: double (nullable = true)
 |    |-- count_x: integer (nullable = true)

and will be sent as a JSON string. The *writeStream* is configured with the following options:
- *trigger*, which sets how often a micro-batch query is run, based on the *ProcessingTime* interval
- *outputMode*, which defines what gets written to the external storage. In particular, *update* means that only the rows of the Result Table that were updated since the last trigger are written out

In [None]:
# Select and structure the data for output as a single JSON message,
# keeping only the means computed from a complete pair of files
result_json_df = result_df.where(col('count_x')==2731) \
    .select(struct("index", "mean_x", "stddev_x","count_x").alias("data"))

# function to send data to Kafka as JSON messages
def send_to_kafka(batch_df, batch_id):
    batch_json = batch_df.toJSON().collect()
    all_data_json = json.dumps([json.loads(row) for row in batch_json])
    
    producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)
    producer.send("results", value=all_data_json.encode("utf-8"))
    producer.close()

# Write the JSON data to Kafka as a single message
query = result_json_df.writeStream \
    .trigger(processingTime="12 seconds")\
    .outputMode("update") \
    .foreachBatch(send_to_kafka) \
    .start()

query.awaitTermination()
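On the receiving side, a message of this form can be decoded back into per-file arrays. The sketch below is a hypothetical helper, not the project's `read_kafka` function; it assumes the message is a JSON list of `{"data": {...}}` objects, as produced by `send_to_kafka`, and that *index* has the `file_bin` form built by `indexing`. The `fake` payload is made up for illustration.

```python
import json
import numpy as np

def decode_results(payload):
    """Turn one 'results' message into {file_num: (mean, std)} arrays."""
    rows = [item["data"] for item in json.loads(payload.decode("utf-8"))]

    per_file = {}
    for row in rows:
        file_num, bin_num = (int(s) for s in row["index"].split("_"))
        per_file.setdefault(file_num, []).append(
            (bin_num, row["mean_x"], row["stddev_x"]))

    decoded = {}
    for file_num, entries in per_file.items():
        entries.sort()                    # order by bin number
        _, mean, std = zip(*entries)
        decoded[file_num] = (np.array(mean), np.array(std))
    return decoded

# Made-up two-bin message, with the bins deliberately out of order
fake = json.dumps([
    {"data": {"index": "7_1", "mean_x": 2.0, "stddev_x": 0.2, "count_x": 2731}},
    {"data": {"index": "7_0", "mean_x": 1.0, "stddev_x": 0.1, "count_x": 2731}},
]).encode("utf-8")

decoded = decode_results(fake)
print(decoded)
```

Sorting by bin number matters because rows coming out of a grouped aggregation carry no ordering guarantee.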

### 2.4. Live Plot and Monitoring <a name="live_plot"></a>

The final step is to plot the results and monitor them in real time. To handle live plotting we use Bokeh. 

We show both the results for the latest batch sent to Kafka and the cumulative average of all the files processed so far, each with its associated errors.

In [None]:
def read_kafka(): # Here we define a consumer to read the Json sent from the Spark notebook 
    ...           # so we can turn it into a dictionary
    
# Then we define four Bokeh structures that can be updated (only one is shown here)
stream_source = ColumnDataSource(data = {'freq': [], 'fft' : []})

# Define a callback function to update the plot data
def update():
    ...
    kafka_data = read_kafka()
    stream_source.data['fft'] = kafka_data['mean']
    ...
    

# Set up a periodic callback to update the plot every 5 seconds
callback = PeriodicCallback(update, 5_000) 


callback.start()

The final result looks like this:

<video width="826" height="504" 
       src="./Images/live_plot.webm"  
       controls>
</video>

Lastly, we develop a method to save the results as a csv file for storage.

In [None]:
# Define a pandas storage for the final results
storage = pd.DataFrame({'freq':xaxis,'fft':np.zeros(n_bins), 'std':np.zeros(n_bins)})

def update():
    ...
    storage['fft'] = cum_y
    storage['std'] = cum_sigma
 

callback.stop()
storage.to_csv('final_results.csv')
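The computation of the cumulative average is elided in the code above. One possible way to maintain it is sketched below; this is an assumption for illustration, not the project's implementation. It averages the per-file means and propagates the error as the standard error of that average, using the per-file standard deviation and slice count. The class name and the toy numbers are hypothetical.

```python
import numpy as np

class RunningAverage:
    """Cumulative per-bin average over files, with its standard error."""

    def __init__(self, n_bins):
        self.n_files = 0
        self.sum_mean = np.zeros(n_bins)
        self.sum_var_of_mean = np.zeros(n_bins)

    def update(self, mean, std, count):
        # Variance of one file's mean is std**2 / count
        self.n_files += 1
        self.sum_mean += mean
        self.sum_var_of_mean += std**2 / count

    @property
    def mean(self):
        return self.sum_mean / self.n_files

    @property
    def sigma(self):
        # Error on the average of n_files independent means
        return np.sqrt(self.sum_var_of_mean) / self.n_files

# Toy example with two bins and two files
avg = RunningAverage(n_bins=2)
avg.update(np.array([1.0, 3.0]), np.array([2.0, 2.0]), count=4)
avg.update(np.array([3.0, 5.0]), np.array([2.0, 2.0]), count=4)
print(avg.mean, avg.sigma)
```

Both properties are only meaningful after at least one call to `update`.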

## 3. Performance Study <a name="performance"></a>

We will now look into how well our pipeline performs and how tuning the different parameters affects its execution.

### 3.1. Kafka Message Size <a name="test_kafka_msgsize"></a>

The first parameter we want to test is the size of the Kafka messages. In order to do so, we change the code reported above to make it more flexible so that we can change the number of slices per message sent from the DAQ-emulator; this number varies from a minimum of 1 to a maximum of 40, so that the final message is still below 1 MB size.

The results in this section have been obtained by running the code for all 16 raw data files and averaging the execution times for each one of them.

In [None]:
slices_per_msg = 1
msg_number = math.ceil(n_slice/slices_per_msg)

# read data from input files
real = bytearray(binary_data_real)
imag = bytearray(binary_data_imm)

# unpack data
for f in range(msg_number):
    start = 4*n_bins*slices_per_msg*f
    end = 4*n_bins*slices_per_msg*(f+1)
    if end > 4*n_samples:
        end = 4*n_samples
        
    # create Kafka message
    r_bin = real[start:end]
    i_bin = imag[start:end]
    msg = r_bin + i_bin

As a preliminary test, we first run the code without actually sending any message to Kafka, in order to gauge how much time is needed to read the files and divide them in slices. *A priori* it could be argued that for very large files this procedure could take a very long time, and thus it should be parallelized as well. In fact, after testing we find that in our concrete case the average execution time is only $\bar{t}_{no \, msg \, sent} = 0.27 \pm 0.02 $ s. Such a low time means that even reading the data on a single machine is completely fine.

We then test the actual time needed to send messages, using 1, 10, 20, 30 and 40 slices per message, which correspond to sizes of 24, 240, 480, 720 and 960 KB respectively. We obtain the following results:
- $\bar{t}_{\,\,1 \, slice/msg} = 16.5 \pm 0.2 $ s
- $\bar{t}_{   10 \, slice/msg} = 15.5 \pm 0.1 $ s
- $\bar{t}_{   20 \, slice/msg} = 15.1 \pm 0.1 $ s
- $\bar{t}_{   30 \, slice/msg} = 15.0 \pm 0.1 $ s
- $\bar{t}_{   40 \, slice/msg} = 15.6 \pm 0.1 $ s

![kafka slices per message](Images/KafkaTest_SlicePerMsg.png)

The data seems to follow a parabolic trend with a minimum around 30 slices per message: neither a small number of large messages nor a large number of small messages is ideal. The difference, though, is not very significant (about 9% between the slowest and fastest configurations), and external factors like the quality of the connection could play a bigger role in the performance.

The most important thing to note, however, is the absolute value of these times: even in our best-case scenario, the execution time is still around 15 seconds, more than three times the time needed to produce new files (4.2 s). This means that the data transmission is a major bottleneck in the pipeline and slows down the rest of the process.
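The measured send times translate directly into message rates and slowdown factors. The short calculation below uses only the numbers reported in this section; the variable names are illustrative.

```python
N_SLICES = 2731            # slices per pair of files
FILE_PERIOD_S = 4.2        # time for the DAQ to produce one pair of files

# Mean time to send one file, keyed by slices per message
send_time_s = {1: 16.5, 10: 15.5, 20: 15.1, 30: 15.0, 40: 15.6}

# Rate at which the DAQ produces slices, to be matched by the pipeline
required_rate = N_SLICES / FILE_PERIOD_S

# Achieved slice rate and slowdown with respect to the DAQ
rates = {k: N_SLICES / t for k, t in send_time_s.items()}
slowdown = {k: t / FILE_PERIOD_S for k, t in send_time_s.items()}

for k in send_time_s:
    print(f"{k:2d} slices/msg: {rates[k]:6.1f} slices/s, "
          f"{slowdown[k]:.2f}x slower than the DAQ")
print(f"required: {required_rate:.1f} slices/s")
```

Even the fastest configuration delivers less than a third of the roughly 650 slices per second that the DAQ produces.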

### 3.2. Number of Partitions <a name="test_partitions"></a>

The second parameter we tested is the number of partitions of the Kafka topic *chunk_data*, which also corresponds to the number of Spark partitions. We first tested 4 and then 12 partitions: *a priori* we expect the second value to be optimal, as it gives a one-to-one correspondence between the partitions and the available cores (four VCPUs on each of three virtual machines). It is important to note that in order to increase the number of partitions we also need to increase *spark.executor.memory* to 1 GB, otherwise the application crashes.

Unfortunately, we do not see any real improvement: as discussed above, Kafka communication is the real limiting factor of the pipeline. Even with a non-ideal number of partitions Spark is still faster than Kafka, so the performance stays roughly the same, around 200 records per second for both the input and processing rates.

### 3.3. Trigger *ProcessingTime* and *maxOffsetsPerTrigger* <a name="test_trigger"></a>

Afterwards, we studied how the performance is affected by changing the *maxOffsetsPerTrigger* value used when creating the input DataFrame and the Trigger *ProcessingTime* used when computing the results. We analyze the following performance metrics:
- Input Rate: how much data is flowing into Structured Streaming from Kafka
- Processing Rate: how quickly we are able to analyze that data
- Input Rows: the total number of records processed in a trigger. The longer the trigger time, the larger the Input Rows
- Batch Duration: the processing duration of each batch
- Operation Duration: the amount of time (in milliseconds) taken to perform the operations below.
    - addBatch: Time taken to read the micro-batch’s input data from the sources, process it, and write the batch’s output to the sink. This should take the bulk of the micro-batch’s time.
    - getBatch: Time taken to prepare the logical query to read the input of the current micro-batch from the sources.
    - latestOffset & getOffset: Time taken to query the maximum available offset for this source.
    - queryPlanning: Time taken to generate the execution plan.
    - walCommit: Time taken to write the offsets to the metadata log.
    - walCommit: Time taken to write the offsets to the metadata log.

We start with a *maxOffsets* of 1000 and a trigger every 20 seconds.
![offset-1000_trigger-20](Images/offset-1000_trigger-20.png)

We can observe that both Input Rate and Input Rows reach a constant plateau, around 50 records/sec and 1000 records respectively. This is because the trigger interval is too long for this *maxOffsets*, so we always reach the maximum number of records allowed in one batch. This is of course not ideal, as it introduces dead time between one batch and the next: a batch is closed and finishes processing, but we still need to wait for the next trigger. We can confirm this by visualizing the batch processing timeline, where each job is clearly separated in time from the others:
![deadtime](Images/deadtime.jpg)

Then we evaluate a *maxOffsets* of 2000 and a trigger every 5 seconds, the opposite situation to the one above:
![offset-2000_trigger-5](Images/offset-2000_trigger-5.png)

In this case we can see that Input Rate and Input Rows are higher but oscillate significantly. The Input Rate, in particular, is once again limited by Kafka: we are very close to the expected limit of 165 records/sec (2731 messages in 16.5 seconds on average). For the Input Rows, instead, we see that at first we have a higher value (still lower than the set maximum), but then Spark automatically reduces it. This is because the processing time for each batch (as we can see in Batch Duration) exceeds the 5-second trigger interval, so Spark tries to reduce the size of each batch to fit into this time window.

The objective is thus to find the right balance between the parameters, so that the batches have the correct size and the trigger is neither too long nor too short. After some trials, the best conditions were reached with a *maxOffsets* of 2000 and a trigger every 12 seconds:
![best_offset-2000_trigger-12](Images/best_offset-2000_trigger-12.jpg)

We can see that once again Input Rate and Input Rows are roughly constant and very close to the maximum values. The Processing Rate is also higher than in the previous cases, reaching almost 250 records/sec.

Using this last configuration, all 16 files are processed in 4 minutes and 30 seconds. This time is mostly due to the low input rate imposed by Kafka. Even setting that aside, though, analyzing one batch takes roughly 8 seconds, so analyzing a whole file takes longer than producing a new one (4.2 s), and the pipeline would still lag behind.
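The behaviour of the three configurations can be summarised with a deliberately simple model, which is an assumption for illustration rather than a description of Spark's internals: the input rate is capped both by the trigger settings (*maxOffsets* divided by the trigger interval) and by the rate at which Kafka delivers records. Processing time is ignored.

```python
N_SLICES = 2731
N_FILES = 16
FILE_PERIOD_S = 4.2
KAFKA_RATE = N_SLICES / 16.5      # records/s measured with 1 slice per message

def trigger_cap(max_offsets, trigger_s):
    """Highest input rate the trigger settings allow, in records/s."""
    return max_offsets / trigger_s

def effective_rate(max_offsets, trigger_s, source_rate=KAFKA_RATE):
    """Input rate under the simple model: the tighter of the two limits."""
    return min(trigger_cap(max_offsets, trigger_s), source_rate)

for max_offsets, trigger_s in [(1000, 20), (2000, 5), (2000, 12)]:
    print(f"maxOffsets={max_offsets}, trigger={trigger_s:2d} s: "
          f"cap {trigger_cap(max_offsets, trigger_s):6.1f}, "
          f"effective {effective_rate(max_offsets, trigger_s):6.1f} records/s")

# Time to push all files through Kafka versus the time to produce them
transfer_time_s = N_FILES * N_SLICES / KAFKA_RATE
production_time_s = N_FILES * FILE_PERIOD_S
print(f"transfer {transfer_time_s:.0f} s, production {production_time_s:.1f} s")
```

Under this model the first configuration is limited by the trigger (50 records/s), while the other two are limited by Kafka. With a 12-second trigger the cap of about 167 records/s almost exactly matches the Kafka rate, which is consistent with the steady plateau observed, and the estimated transfer time of 264 s is close to the measured 4 minutes and 30 seconds.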

## 4. Conclusion <a name="conclusion"></a>

In this work we have developed a data processing pipeline to handle and analyze data produced by the QUAX experiment at the Legnaro INFN Laboratories. In this pipeline the raw data is produced to a Kafka topic and then read by a Spark streaming application, which computes the FFT and averages the results. Finally, these results are published to another topic and used to produce a live plot for monitoring. <br>
The performance of the pipeline was then studied and optimized as a function of several parameters.

The final configuration of the pipeline, which uses 12 Kafka and Spark partitions, a *maxOffsetsPerTrigger* of 2000 and a Trigger *ProcessingTime* of 12 seconds, is able to analyze all 16 files available for testing in 4 minutes and 30 seconds.

This result is far from optimal, as the files are produced in only a quarter of the processing time (~ 67 seconds). The first "culprit" is Kafka, which is not able to send messages fast enough to keep up with the production rate, but Spark is not working at the desired rate either. <br>
A first approach to solving this problem would thus be to link the source to the Spark stream with a method other than Kafka. Moreover, Spark could be further optimized and a more powerful cluster could be used.