- Andrea De Vita
- Enrico Lupi
- Manfredi Miranda
- Francesco Zane

-----------------------

# Streaming Processing of the QUAX Experiment Data for the Detection of Galactic Axions

## Abstract

The axion is a hypothetical particle introduced to solve the strong CP problem of Quantum Chromodynamics. It is speculated that axions may also constitute the dark matter (DM) content of our galaxy. The [QUAX](https://www.pd.infn.it/eng/quax/) (QUaerere AXions) experiment aims to detect this particle using a copper cavity immersed in a static magnetic field of 8.1 T and cooled to a working temperature of about 150 mK.

The goal of this project is to create a quasi real-time processing chain of the data produced by the QUAX experimental apparatus and a live monitoring system of the detector data, using [Apache Kafka](https://kafka.apache.org/) and [Apache Spark](https://spark.apache.org/).

## Table of Contents

1. [Introduction](#introduction) <br>
    1.1. [Experiment](#intro_experiment) <br>
    1.2. [Data Structure](#intro_data_structure) <br>
    1.3. [Cluster](#intro_cluster) <br>
2. [Data Processing](#processing) <br>
    2.1. [Pipeline Overview](#pipeline) <br>
    2.2. [Kafka - Receiving Data from DAQ](#kafka) <br> 
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2.1. [Data Topic](#kafka_topic) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2.2. [Data Pre-processing](#kafka_preprocessing) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2.3. [Producer](#kafka_producer) <br>
    2.3. [Spark - Distributed Processing](#spark) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3.1. [FFT](#spark_fft) <br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3.2. [Output Message](#spark_output) <br>
    2.4. [Live Plot](#live_plot) <br>
3. [Performance Tests](#test) <br>
    3.1. [Kafka](#test_kafka) <br>
    3.2. [Spark](#test_spark) <br>
4. [Conclusion](#conclusion)

## 1. Introduction <a name="introduction"></a>

### 1.1. Experiment <a name="intro_experiment"></a>


![quax lab](Images/labQuax.webp)

The QUAX experiment aims to detect axions using a copper cavity immersed in a static magnetic field of 8.1 T and cooled to a working temperature of about 150 mK. The axion is expected to couple with the spin of the electron, interacting with the cavity and inducing a radio-frequency signal that can be sensed via a Josephson parametric amplifier. For a given configuration of the RF cavity, a scan of the phase of the electromagnetic field is performed in order to identify a possible localised excess, a hint of the coupling of an axion with the photon.

The data acquisition system of the QUAX experiment generates two streams of digitized readings from the amplifiers, representing the real and imaginary components of the measured phase. To improve the signal-to-noise ratio, a QUAX data-taking run extends over a long time (up to weeks), repeating the scans many times. Data are saved locally on the DAQ servers in the form of binary files, each corresponding to a multitude of continuous scans performed over the entire frequency range. A single pair of raw files is thus representative of only a few seconds of data taking, but already includes several thousand scans.

### 1.2. Data Structure <a name="intro_data_structure"></a>

The dataset is composed of 2 sets (named duck_i and duck_q respectively) of .dat binary files, each consisting of a continuous series of ADC readings from the amplifier. Each ADC reading is written in the raw files as a 32-bit floating point value. The ADC readout frequency is 2 × 10<sup>6</sup> Hz (2 megasamples per second, or 2 MS/s), resulting in a raw data throughput of 128 Mbps (16 MB/s) across the two streams. During data taking the readouts are written to .dat files such that each file contains 8193 × 2<sup>10</sup> samples. As a result, a pair of .dat files (duck_i and duck_q) is produced every 4.2 s.
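
The figures quoted above follow directly from the sample rate, the sample width and the file length. The short check below is only an illustrative sketch: the constant names are ours and are not taken from the project code.

```python
# Figures quoted in the text
SAMPLE_RATE_HZ = 2_000_000        # ADC readout frequency (2 MS/s)
BYTES_PER_SAMPLE = 4              # one 32-bit float per reading
N_STREAMS = 2                     # duck_i and duck_q
SAMPLES_PER_FILE = 8193 * 2**10   # samples stored in each .dat file

# Raw throughput summed over both streams
throughput_bytes_per_s = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * N_STREAMS
throughput_mbps = throughput_bytes_per_s * 8 / 1e6

# Size of a single file and time needed to fill it
file_size_bytes = SAMPLES_PER_FILE * BYTES_PER_SAMPLE
file_period_s = SAMPLES_PER_FILE / SAMPLE_RATE_HZ

print(f"throughput: {throughput_mbps:.0f} Mbps ({throughput_bytes_per_s / 1e6:.0f} MB/s)")
print(f"file size:  {file_size_bytes / 2**20:.1f} MiB")
print(f"new file pair every {file_period_s:.1f} s")
```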

The dataset is provided in an S3 cloud storage bucket hosted on CloudVeneto.

### 1.3. Cluster <a name="intro_cluster"></a>

This project was carried out on a cluster composed of 4 virtual machines, each with 4 VCPUs, 25 GB of disk space and 8 GB of RAM. The virtual machines are hosted on [CloudVeneto](https://cloudveneto.it/), an OpenStack-based cloud managed by the University of Padova and INFN.

## 2. Data Processing <a name="processing"></a>

The processing of the raw data consists of two phases:
1. Run a Fourier transform on each scan to move from the time domain to the frequency domain
2. Average (in bins of frequency) all scans in a data-taking run, to extract a single frequency scan
 
This procedure is highly parallelizable, and should be implemented in a quasi-online pipeline for two main reasons:
1. Monitoring the scans during the data taking to promptly spot and identify possible issues in the detector setup or instabilities in the condition of the experiment
2. Data is continuously produced at a very high rate, and the local storage provided by the DAQ server of the QUAX experiment is not suited for large-volume, long-lasting datasets
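
The two processing phases listed above can be sketched on a single machine with NumPy. This is a minimal illustration and not the distributed implementation: the function name `average_spectrum` is hypothetical, and averaging the squared magnitude of each FFT is our assumption about what is averaged.

```python
import numpy as np


def average_spectrum(i_samples, q_samples, n_bins):
    """FFT every scan of n_bins samples, then average the scans bin by bin."""
    n_scans = len(i_samples) // n_bins
    used = n_scans * n_bins  # drop any incomplete trailing scan

    # Combine the real and imaginary streams into one complex signal per scan
    iq = (i_samples[:used] + 1j * q_samples[:used]).reshape(n_scans, n_bins)

    # Phase 1: time domain -> frequency domain, one FFT per scan
    spectra = np.fft.fft(iq, axis=1)

    # Phase 2: average over scans (assumption: average of the power |FFT|^2)
    return (np.abs(spectra) ** 2).mean(axis=0)


# Synthetic check: a pure tone placed exactly on frequency bin 5
n_bins, n_scans, tone_bin = 64, 10, 5
n = np.arange(n_bins * n_scans)
phase = 2 * np.pi * tone_bin * n / n_bins
avg = average_spectrum(np.cos(phase), np.sin(phase), n_bins)
print("peak at bin", int(np.argmax(avg)))
```

Each scan is independent until the final average, which is why the procedure lends itself to parallel execution.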

### 2.1. Pipeline Overview <a name="pipeline"></a>

The data processing pipeline will be implemented as follows:
- Each pair of files is unpacked according to their schema and split into scans.
- Data is produced to a Kafka topic by a stream-emulator script every 5 seconds to simulate the fixed ADC scanning rate and the fixed size of files written to disk. 
- The processing of each file is performed in a distributed framework using PySpark: for each scan, an FFT is executed in parallel and the results of all FFTs are averaged.
- The results are re-injected into a new Kafka topic hosted on the same brokers.
- A final consumer performs the plotting, displaying live updates of the scans and continuously updating the entire "run-wide" scan using bokeh.

The overall pipeline can thus be summarised as:
![pipeline schema](Images/Pipeline_Schema2.png)

### 2.2. Kafka - Receiving Data from DAQ <a name="kafka"></a>

Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. As discussed previously, in this work it will be used to handle the live streaming of data from the DAQ servers all the way to the final live plot.

#### 2.2.1. Data Topic <a name="kafka_topic"></a>

The first step is to create a topic on the broker to hold the data from the DAQ. We create it with 4 separate partitions and no replication. The meaning of the name "chunk_data" will be made clear in the next section.

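In [None]:
from kafka.admin import KafkaAdminClient, NewTopic

# connect to the cluster to run admin functions
# (KAFKA_BOOTSTRAP_SERVERS is assumed to be defined earlier in the notebook)
kafka_admin = KafkaAdminClient(
    bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
)

# define new topic to hold data
topic_in = NewTopic(name='chunk_data',
                    num_partitions=4, 
                    replication_factor=1)

# create the topic on the broker
kafka_admin.create_topics(new_topics=[topic_in])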

#### 2.2.2. Data Pre-processing <a name="kafka_preprocessing"></a>

The size of the files produced by the DAQ is 32 MB, which means that each message handled by Kafka would have to be 64 MB, as we need both the real and imaginary components. Unfortunately, the default maximum message size in Kafka is only about 1 MB. There are ways to raise this limit, namely:
 
- at the broker level, changing the *replica.fetch.max.bytes* in the broker settings and increasing the *max.message.bytes* for the topic to the desired value
- at the consumer level, increasing the *max.partition.fetch.bytes*, otherwise the consumer will fail to fetch these messages and will get stuck on processing
- at the producer level, increasing the *max.request.size* to ensure large messages can be sent
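
As a rough illustration of the settings listed above, the sketch below only collects them in plain dictionaries. The keyword names `max_partition_fetch_bytes` and `max_request_size` are, to our understanding, the kafka-python spellings of the corresponding client options; the 1 MiB of headroom is a hypothetical value and not one taken from the project.

```python
# Size of one unsliced message: real + imaginary file contents, 4 bytes per sample
FULL_MESSAGE_BYTES = 2 * 8193 * 2**10 * 4

# Hypothetical headroom for record and batch overhead (illustrative value)
LIMIT_BYTES = FULL_MESSAGE_BYTES + 1024 * 1024

# Broker level (server configuration) and topic level
broker_settings = {"replica.fetch.max.bytes": LIMIT_BYTES}
topic_settings = {"max.message.bytes": LIMIT_BYTES}

# Client level, written as keyword arguments for the consumer and producer
consumer_kwargs = {"max_partition_fetch_bytes": LIMIT_BYTES}
producer_kwargs = {"max_request_size": LIMIT_BYTES}

print(f"one unsliced message: {FULL_MESSAGE_BYTES / 2**20:.1f} MiB")
```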

While this solution is possible, it goes against the philosophy of Kafka: sending large messages is considered inefficient, as messages are expected to be large in number but small in size.

We thus decided to first unpack the data into slices and send a pair of real and imaginary slices as a message. Since for each FFT we want *n<sub>bins</sub>* = 3 × 2<sup>10</sup> = 3072 bins and we have a total of 8193 × 2<sup>10</sup> samples per file, the number of slices to compute FFTs on for each file (and thus of messages to be sent) is

$$n_{slices} = \cfrac{n_{samples}}{n_{bins}} = \cfrac{8193 \times 2^{10} }{3 \times 2^{10}} = 2731$$
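
The same arithmetic also gives the size of each sliced message. The check below is a sketch with our own constant names; the 1 MiB limit is used as a nominal stand-in for the default of roughly 1 MB mentioned earlier.

```python
N_SAMPLES = 8193 * 2**10   # samples per file
N_BINS = 3 * 2**10         # samples (and FFT bins) per slice
BYTES_PER_SAMPLE = 4       # 32-bit float

# Number of slices per file; the remainder shows that no samples are left over
n_slices, remainder = divmod(N_SAMPLES, N_BINS)

# Each message carries one real slice followed by one imaginary slice
message_bytes = 2 * N_BINS * BYTES_PER_SAMPLE

NOMINAL_LIMIT_BYTES = 1024 * 1024  # nominal stand-in for the ~1 MB default limit

print(f"{n_slices} slices per file, remainder {remainder}")
print(f"{message_bytes} bytes per message ({message_bytes / 1024:.0f} KiB)")
print("fits under the nominal limit:", message_bytes < NOMINAL_LIMIT_BYTES)
```

Each message is therefore about 24 KiB, far below the limit, at the cost of sending 2731 messages per pair of files.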

In [None]:
# read all data from input files
real = bytearray(binary_data_real)
imag = bytearray(binary_data_imm)

# unpack data
for f in range(n_slice):
    r_bin = real[4*n_bins*f:4*n_bins*(f+1)] # one float every 4 bytes
    i_bin = imag[4*n_bins*f:4*n_bins*(f+1)]
    
    # create kafka message
    msg = r_bin + i_bin

#### 2.2.3. Producer <a name="kafka_producer"></a>

Lastly, we can initialize the Kafka producer, which is responsible for reading data from the files and sending it to the correct topic. The message, as described before, consists of two consecutive byte arrays containing the real and imaginary slices. The key, instead, contains the number of the file and the number of the particular slice contained in the message.

In [None]:
from kafka import KafkaProducer

# Create a Kafka producer instance
chunk_producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)


def send_chunks(file_paths, dirPath):
    # pair each real file with its imaginary partner, ordered by file number
    partners = sorted(find_partner(file_paths), key=lambda x: get_number_from_filename(x[0]))

    for couple in partners:
        couple = [dirPath + x for x in couple]
        real = bytearray(read_binary_file(couple[0]))
        imag = bytearray(read_binary_file(couple[1]))

        # file number: the 5 characters preceding the 4-character extension
        file_num = int(couple[0][-9:-4])

        for f in range(n_slice):
            r_bin = real[4*n_bins*f:4*n_bins*(f+1)]  # one float every 4 bytes
            i_bin = imag[4*n_bins*f:4*n_bins*(f+1)]
            msg = r_bin + i_bin

            # key = file number + slice number, 2 big-endian bytes each
            key = file_num.to_bytes(2, "big") + f.to_bytes(2, "big")

            print("Sending file", file_num, "\tslice number:", f+1, end="\r")
            chunk_producer.send(topic="chunk_data", key=key, value=msg)
            chunk_producer.flush()  # flush the producer buffer

        print(" " * 65)  # clear the progress line
        print("File", file_num, "completed!")
        print("------------------------------")
        # time.sleep(5)  # uncomment to mimic the waiting time for new data


send_chunks(file_paths, folder_path)
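
On the receiving side, a message in this layout can be turned back into a file number, a slice number and a complex array. The sketch below is only illustrative: the helper names `decode_key` and `decode_value` are hypothetical, and the samples are assumed to be little-endian 32-bit floats, which the text does not state.

```python
import numpy as np

N_BINS = 3 * 2**10  # samples per slice, as defined in the pre-processing step


def decode_key(key):
    """Split the 4-byte key into (file number, slice number)."""
    file_num = int.from_bytes(key[:2], "big")
    slice_num = int.from_bytes(key[2:4], "big")
    return file_num, slice_num


def decode_value(value, n_bins=N_BINS):
    """Rebuild the complex slice from the real bytes followed by the imaginary bytes."""
    half = 4 * n_bins  # one float every 4 bytes
    # Assumption: little-endian 32-bit floats
    real = np.frombuffer(value[:half], dtype="<f4")
    imag = np.frombuffer(value[half:2 * half], dtype="<f4")
    return real + 1j * imag


# Round trip on a synthetic message built with the producer's layout
r = np.arange(N_BINS, dtype="<f4")
i = -np.arange(N_BINS, dtype="<f4")
key = (7).to_bytes(2, "big") + (42).to_bytes(2, "big")
value = bytes(bytearray(r.tobytes()) + bytearray(i.tobytes()))

file_num, slice_num = decode_key(key)
iq = decode_value(value)
print(file_num, slice_num, iq.shape)
```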

### 2.3. Spark - Distributed Processing <a name="spark"></a>

Apache Spark is an open-source unified analytics engine for large-scale data processing. 