# MAPD-B distributed processing exam
## Project 4: Streaming processing of cosmic rays using Drift Tubes detectors

The goal of this project is to reproduce the real-time processing of real data collected by a
particle physics detector and to publish the results on a dashboard for live monitoring.

### Students:
* [Hilario Capettini](https://github.com/hcapettini2) (2013031)
* [Javier Gerardo Carmona](https://github.com/eigen-carmona/) (2005005)
* [Saverio Monaco](https://github.com/SaverioMonaco/) (2012264)

## The Cluster

5 Cloud Veneto VMs were assigned for this project:
* MAPD-B_Gr17-5 10.67.22.83
* MAPD-B_Gr17-4 10.67.22.136
* MAPD-B_Gr17-3 10.67.22.102
* MAPD-B_Gr17-2 10.67.22.39
* MAPD-B_Gr17-1 10.67.22.137

Each machine runs CentOS and has the following specs:
* RAM:   8GB
* VCPUs: 4
* Disk:  25GB

## Setting the Cluster

(Maybe here we can spend a few words on how to set up a Cluster)
- We chose one VM as the "master node" and established the following pairing:

10.67.22.83 master\
10.67.22.137 slave01\
10.67.22.39 slave02\
10.67.22.102 slave03\
10.67.22.136 slave04

which was added verbatim to the ```/etc/hosts``` file for every VM.

- Then, we generated an SSH key pair on the master VM and added its public key to the authorized keys of each VM. No passphrase was set, so that the master can access any of the VMs without a password.
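
As a minimal sketch, the pairing above can be kept in a single place and rendered into the lines that are appended to ```/etc/hosts```. The helper name `render_hosts_entries` is hypothetical and not part of the project; the addresses and aliases are the ones listed above.

```python
# Hostname pairing from the list above (IP address -> alias).
CLUSTER_HOSTS = {
    "10.67.22.83": "master",
    "10.67.22.137": "slave01",
    "10.67.22.39": "slave02",
    "10.67.22.102": "slave03",
    "10.67.22.136": "slave04",
}


def render_hosts_entries(hosts):
    """Return the text block to append to /etc/hosts (hypothetical helper)."""
    return "\n".join(f"{ip} {alias}" for ip, alias in hosts.items())


print(render_hosts_entries(CLUSTER_HOSTS))
```

Keeping the mapping in one dictionary makes it easy to produce the identical block for every VM.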



### Configuration chosen

### Benchmarks of other configurations

In [1]:
# Here we justify why we chose the current configuration

## Setting Kafka and Spark

(Maybe here we can spend a few words about the commands on how to create the pipeline, in a cluster)
### Spark
#### Installation
- We begin by installing ```java-11-openjdk``` along with ```spark-3.1.2``` in every one of the cluster's virtual machines.
- MISSING STEP spark-env.sh
#### Execution
- ```$SPARK_HOME/sbin/start-master.sh``` on the master machine
- ```$SPARK_HOME/sbin/start-worker.sh spark://master:7077``` on each desired worker (possibly including the master)

### Kafka
#### Installation
- ```wget -c https://dlcdn.apache.org/kafka/2.8.0/kafka_2.13-2.8.0.tgz```
- ```tar -xzf kafka_2.13-2.8.0.tgz```
#### Execution
- ```$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties```
- ```$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties```


### Consumer and Producer

## Data Processing

In [13]:
flatDF.printSchema()

root
 |-- HEAD: integer (nullable = true)
 |-- FPGA: integer (nullable = true)
 |-- TDC_CHANNEL: integer (nullable = true)
 |-- ORBIT_CNT: double (nullable = true)
 |-- BX_COUNTER: integer (nullable = true)
 |-- TDC_MEAS: double (nullable = true)



In [None]:
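## Collection of functions for the main computation
import time

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when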

def chamber_assignment(df):
    '''Assign chamber number and leave the scintillator carriers with chamber == null'''

    return(df.withColumn('CHAMBER',when(col("FPGA") == 0,
                                                when(col("TDC_CHANNEL")<=63,1).\
                                                otherwise(when(col("TDC_CHANNEL")<128,2))).\
                                           otherwise(when(col("TDC_CHANNEL")<=63,3).\
                                                     otherwise(when(col("TDC_CHANNEL")<128,4))
                                           )).\
                                           select([ col('FPGA'),  # kept because scintillator_data filters on it
                                           col('TDC_CHANNEL'), col('ORBIT_CNT'),
                                           col('BX_COUNTER'),col('TDC_MEAS'),
                                           col('CHAMBER')])
          )


def scintillator_data(df):
    '''Define a dataframe containing the relevant information for 
    the scintillator analysis''' 
    
    # First we filter the events encoding the passage time,
    # then we add the PASSAGETIME for each event.
    # Finally, if there are several scintillator hits within the same orbit,
    # we keep the one with the smallest time.
    return(df.filter((col("CHAMBER").isNull()) & (col("FPGA") == 1)).\
                          withColumn("PASSAGETIME", 25 * (col("ORBIT_CNT") * 3564 + col("BX_COUNTER") + col("TDC_MEAS")/30)).\
                          drop("TDC_CHANNEL").drop("BX_COUNTER").\
                          drop("TDC_MEAS").drop("CHAMBER").\
                          groupBy("ORBIT_CNT").min("PASSAGETIME").\
                          withColumnRenamed("ORBIT_CNT","ORBIT_CNT_sci").\
                          withColumnRenamed("min(PASSAGETIME)","PASSAGETIME")
          )


def histogram_a(df,min_v,max_v,inc,key):# TODO: replicate the function generalization to the actual code
    '''This function returns the bins and counts for the first type of requested histogram'''
    hist_bins = np.arange(min_v,max_v,inc)
    hist = df\
        .filter((min_v<=F.col(key)) & (F.col(key)<=max_v))\
        .withColumn('BIN', F.floor((F.col(key)-min_v)/inc))\
        .groupBy('CHAMBER','BIN')\
        .count().select('CHAMBER','BIN', col('count').alias('COUNT'))
    return (hist_bins, hist)


def histogram_b(df,min_v,max_v,inc,key_1,key_2):
    '''This function returns the bins and counts for the second type of requested histogram'''
    hist_bins = np.arange(min_v,max_v,inc)
    hist = df\
        .groupBy('CHAMBER',key_1)\
        .agg(F.countDistinct(key_2).alias('ACTIVE'))\
        .filter((min_v<=F.col(key_1))&(F.col(key_1)<=max_v))\
        .withColumn('BIN',F.floor((F.col(key_1)-min_v)/inc))\
        .groupBy('CHAMBER','BIN')\
        .agg(F.sum('ACTIVE').alias('COUNT'))
    return(hist_bins, hist)


def numpify(bins, pos_count):
    '''Convert a {bin index: count} mapping into a dense array with one count per bin'''
    counter = np.zeros(len(bins))  # one entry per element of bins
    positions = np.array(list(pos_count.keys()))
    counts = np.array(list(pos_count.values()))
    counter[positions] = counts
    return counter


def prepare_results(hist, hist_bins):
    '''COLLECTING RESULTS'''    
    _hist = hist.groupBy('CHAMBER').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("BIN", "COUNT"))).alias("COUNT")
        ).collect()    

    # JSON FORMATTING OF RESULTS
    _hist_dict = {row.CHAMBER: {
        'Bins': list(hist_bins), 'Counts': list(numpify(hist_bins,row.COUNT))
    } for row in _hist}
    
    return _hist_dict
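
The binning performed by `histogram_a` and the dense filling performed by `numpify` can be illustrated without Spark. The following is a plain-Python sketch, not the project's code: `bin_index` and `dense_counts` are hypothetical names, and the sketch uses a half-open range `[min_v, max_v)`, an assumption that differs from the inclusive filter above, so that every bin index stays below the number of bins.

```python
import math


def bin_index(value, min_v, inc):
    """Index of the bin containing `value`: floor((value - min_v) / inc)."""
    return math.floor((value - min_v) / inc)


def dense_counts(values, min_v, max_v, inc):
    """Return one count per bin, including the bins that received no value."""
    # Same number of bins as len(np.arange(min_v, max_v, inc)).
    n_bins = math.ceil((max_v - min_v) / inc)
    counts = [0] * n_bins
    for value in values:
        # Half-open range (assumption): value == max_v is excluded.
        if min_v <= value < max_v:
            counts[bin_index(value, min_v, inc)] += 1
    return counts


# TDC_CHANNEL binning with the parameters used for the first histogram.
counts = dense_counts([0, 4, 5, 126], min_v=0, max_v=127, inc=5)
print(len(counts), counts[0], counts[1], counts[25])
```

With `min_v=0`, `max_v=127` and `inc=5` there are 26 bins, and channel 126 falls into the last one (index 25).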


In [None]:
def computations(df, epoch, log):
    '''This is the main function of the code, it requires a dataframe as input. The dataframe is analysed
       and the results are published in the kafka topic "results" '''
    main_df = chamber_assignment(df)

    scintillator_df = scintillator_data(main_df)
    
    ### Drop the rows with a null CHAMBER (the scintillator hits) from main_df
    hit_df = main_df.na.drop(subset=["CHAMBER"])
    
    ## TOTAL NUMBER OF PROCESSED HITS
    total_hits = hit_df.count()
    if not total_hits: return

    ## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
    chamber_hits = hit_df\
        .groupBy('CHAMBER').count()\
        .select(col('CHAMBER'),col('count').alias('COUNT'))
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER
    min_v_1 = 0
    max_v_1 = 127
    inc_1 = 5
    hist_1_bins, hist_1 = histogram_a(hit_df,min_v_1,max_v_1,inc_1, 'TDC_CHANNEL')
    
    ## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
    min_v_2 = 6.e5 #main_df.agg(F.min(F.col('ORBIT_CNT')).alias('min')).collect()[-1].min
    max_v_2 = 1.e7 #main_df.agg(F.max(F.col('ORBIT_CNT')).alias('max')).collect()[-1].max
    inc_2 = 0.5e6
    hist_2_bins, hist_2 = histogram_b(hit_df,min_v_2,max_v_2,inc_2, 'ORBIT_CNT', 'TDC_CHANNEL')
    
    
    ### keep only the hits with a scintillator signal within the same orbit
    chamber_sci = hit_df.join(scintillator_df, hit_df.ORBIT_CNT == scintillator_df.ORBIT_CNT_sci, "inner")

    ## ADD TIME CORRECTION BY CHAMBER
    chamber_sci = chamber_sci.withColumn('TIME_OFFSET',when(col("CHAMBER") == 1, 93.9).\
                                                       when(col("CHAMBER") == 2, 101.4).\
                                                       when(col("CHAMBER") == 3, 95.5).\
                                                       when(col("CHAMBER") == 4, 92.4))

    ### Add the ABSOLUTETIME and DRIFTIME
    chamber_sci = chamber_sci.withColumn("ABSOLUTETIME",
                             25 * (col("ORBIT_CNT") * 3564 + col("BX_COUNTER") + col("TDC_MEAS")/30)).\
                              withColumn("DRIFTIME",col("ABSOLUTETIME")-col("PASSAGETIME") + col("TIME_OFFSET"))
   

    ## ACTIVE TDC_CHANNEL PER CHAMBER WITHIN SCINTILLATOR SIGNAL
    min_v_3 = 0
    max_v_3= 127
    inc_3 = 5
    hist_3_bins, hist_3 = histogram_a(chamber_sci,min_v_3,max_v_3,inc_3, 'TDC_CHANNEL')
    

    ## HISTOGRAM OF DRIFTIME, PER CHAMBER
    min_v_4 = 0
    max_v_4= 1000
    inc_4 = 10
    hist_4_bins, hist_4 = histogram_a(chamber_sci,min_v_4,max_v_4,inc_4, 'DRIFTIME')


    # PREPARE THE RESULTS
    _chamber_hits = {row.CHAMBER: int(row.COUNT) for row in chamber_hits.collect()}
    _hist_1_dict = prepare_results(hist_1,hist_1_bins)
    _hist_2_dict = prepare_results(hist_2,hist_2_bins)
    _hist_3_dict = prepare_results(hist_3,hist_3_bins)
    _hist_4_dict = prepare_results(hist_4,hist_4_bins)
    
    default = lambda bins: {'Bins': list(bins), 'Counts' : [0]*len(bins)}  # same length as the output of numpify
    
    results = {f'Chamber_{i}': {
        'Count': _chamber_hits.get(i, 0),
        'Hist_1': _hist_1_dict.get(i, default(hist_1_bins)),
        'Hist_2': _hist_2_dict.get(i, default(hist_2_bins)),
        'Hist_3': _hist_3_dict.get(i, default(hist_3_bins)),
        'Hist_4': _hist_4_dict.get(i, default(hist_4_bins))} for i in range(1,5)}

    results.update({
        'Index': time.time(),
        'Total Count': int(total_hits)
    })

    log(results)
    return
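
The overall shape of the `results` message can be sketched in plain Python. This is a reduced illustration under assumptions: the histograms are omitted, `build_message` and `encode_message` are hypothetical helpers, and UTF-8 encoded JSON is assumed as the wire format, which the code above does not confirm.

```python
import json
import time


def build_message(chamber_counts, total_count):
    """Assemble a reduced `results` dictionary (histogram entries omitted)."""
    message = {
        f"Chamber_{i}": {"Count": chamber_counts.get(i, 0)} for i in range(1, 5)
    }
    message.update({"Index": time.time(), "Total Count": int(total_count)})
    return message


def encode_message(message):
    """Serialize the message to bytes (assumed format: UTF-8 JSON)."""
    return json.dumps(message).encode("utf-8")


payload = encode_message(build_message({1: 10, 3: 5}, total_count=15))
print(json.loads(payload)["Chamber_2"])
```

Chambers without hits in a batch still appear in the message with a zero count, so a reader of the message can rely on all four `Chamber_i` keys being present.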


## Results

### Vertical scalability

### Horizontal scalability

### Scaling with amount of data

## Live Plotting

To create a live web dashboard we used [Plotly Dash](https://github.com/plotly/dash), a Python library built on top of Plotly for creating analytical web apps.

The dashboard reports the following information:

**PLOTS**
1. total number of processed hits, post-cleansing (PLOT AND TABLE)
2. total number of processed hits, post-cleansing, per chamber (TABLE)
3. histogram of the counts of active TDC_CHANNEL, per chamber (HISTOGRAM 1)
4. histogram of the total number of active TDC_CHANNEL in each ORBIT_CNT, per chamber (HISTOGRAM 2)

**EXTRA**
1. histogram of the counts of active TDC_CHANNEL, per chamber, ONLY for those orbits with at least one scintillator signal in it (EXTRA 1)
2. histogram of the DRIFTIME, per chamber (EXTRA 2 AND EXTRA 2 (cumulative))

<img src="imgs/dashboard.png"/>

## Backup Information

### Data-cleansing

Data-cleansing: only the entries satisfying $$\text{HEAD} == 2 $$ are kept.
Other entries only provide ancillary information.
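
A minimal plain-Python sketch of this selection follows. It assumes each record is a dictionary keyed by the column names of the schema shown earlier; the function name `cleanse` is hypothetical.

```python
def cleanse(records):
    """Keep only the records whose HEAD field equals 2."""
    return [record for record in records if record.get("HEAD") == 2]


sample = [
    {"HEAD": 2, "FPGA": 0, "TDC_CHANNEL": 12},
    {"HEAD": 1, "FPGA": 0, "TDC_CHANNEL": 12},  # ancillary entry, dropped
]
print(cleanse(sample))
```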

### Chamber mapping

• Chamber 0 → (FPGA = 0) AND (TDC_CHANNEL in [0-63])\
• Chamber 1 → (FPGA = 0) AND (TDC_CHANNEL in [64-127])\
• Chamber 2 → (FPGA = 1) AND (TDC_CHANNEL in [0-63])\
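• Chamber 3 → (FPGA = 1) AND (TDC_CHANNEL in [64-127])

The functions in the Data Processing section number the chambers from 1 to 4, so chamber *n* in this list corresponds to `CHAMBER == n + 1` in the code.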
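
The mapping can be written as a small plain-Python function. This is a sketch that follows the 0 to 3 numbering of the list; the name `chamber_of` is hypothetical, and returning `None` for anything outside the four ranges is an assumption made for illustration.

```python
def chamber_of(fpga, tdc_channel):
    """Return the chamber (0 to 3) of a hit, or None for non-chamber channels."""
    if fpga not in (0, 1) or not 0 <= tdc_channel <= 127:
        return None  # for example the scintillator channel 128
    # FPGA selects the pair of chambers, the channel range selects within the pair.
    return 2 * fpga + (0 if tdc_channel <= 63 else 1)


print(chamber_of(0, 10), chamber_of(0, 64), chamber_of(1, 63), chamber_of(1, 127))
```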

### Driftime

#### Absolute time
For each hit we can associate an absolute time:

$$t_{\mathrm{TDC\ hit}} = 25 \times \left( \mathrm{ORBIT\_CNT} \times 3564 + \mathrm{BX\_COUNTER} + \frac{\mathrm{TDC\_MEAS}}{30} \right) \quad [\mathrm{ns}]$$
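
As a worked example of the formula, the sketch below evaluates it in plain Python. The function name `absolute_time_ns` is hypothetical; the constants are the ones in the formula.

```python
def absolute_time_ns(orbit_cnt, bx_counter, tdc_meas):
    """Absolute hit time in ns: 25 * (ORBIT_CNT * 3564 + BX_COUNTER + TDC_MEAS / 30)."""
    return 25 * (orbit_cnt * 3564 + bx_counter + tdc_meas / 30)


print(absolute_time_ns(1, 0, 0))   # one full orbit: 3564 * 25 = 89100.0 ns
print(absolute_time_ns(0, 1, 15))  # one bunch crossing plus half of one: 37.5 ns
```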


#### Muon passage time
The passage time of any muon is provided by an external scintillator signal, which corresponds to the following selection:

$$\text{(FPGA == 1) AND (TDC_CHANNEL == 128)}$$
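
The selection, together with the rule of keeping the earliest scintillator time in each orbit (as done in `scintillator_data`), can be sketched in plain Python. The names `is_scintillator_hit` and `earliest_passage_per_orbit` are hypothetical, and hits are assumed to be given as simple tuples.

```python
def is_scintillator_hit(fpga, tdc_channel):
    """True for the external scintillator signal: FPGA == 1 and TDC_CHANNEL == 128."""
    return fpga == 1 and tdc_channel == 128


def earliest_passage_per_orbit(scintillator_hits):
    """Map each orbit to its smallest passage time.

    `scintillator_hits` is an iterable of (orbit_cnt, passage_time_ns) pairs.
    """
    earliest = {}
    for orbit_cnt, passage_time in scintillator_hits:
        if orbit_cnt not in earliest or passage_time < earliest[orbit_cnt]:
            earliest[orbit_cnt] = passage_time
    return earliest


print(earliest_passage_per_orbit([(1, 50.0), (1, 20.0), (2, 7.0)]))
```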

#### Scintillator time offset

```python
# scintillator time offset by Chamber
time_offset_by_chamber = {
0: 95.0 - 1.1, # Ch 0
1: 95.0 + 6.4, # Ch 1
2: 95.0 + 0.5, # Ch 2
3: 95.0 - 2.6, # Ch 3
}
```

#### Driftime
For those hits with a scintillator signal within the same orbit, a DRIFTIME can be defined, corresponding to the ABSOLUTETIME difference between each hit and the scintillator (from the same orbit).
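
In the `computations` function the chamber offset is added to this difference, i.e. DRIFTIME = ABSOLUTETIME - PASSAGETIME + TIME_OFFSET. The sketch below reproduces that arithmetic in plain Python, using the 0 to 3 chamber numbering of this section; the names `TIME_OFFSET_NS` and `drift_time_ns` are hypothetical.

```python
# Scintillator time offsets in ns, copied from the table above (chambers 0 to 3).
TIME_OFFSET_NS = {
    0: 95.0 - 1.1,
    1: 95.0 + 6.4,
    2: 95.0 + 0.5,
    3: 95.0 - 2.6,
}


def drift_time_ns(hit_time_ns, passage_time_ns, chamber):
    """Drift time: hit time minus scintillator passage time plus the chamber offset."""
    return hit_time_ns - passage_time_ns + TIME_OFFSET_NS[chamber]


# A hit recorded 100 ns after the scintillator signal, in chamber 2.
print(drift_time_ns(1000.0, 900.0, chamber=2))
```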

### Dashboard

#### Callbacks
The dashboard periodically reads the file ```./board/message.pkl```, which contains the latest message produced by topic_results (written by the consumer).
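
A minimal sketch of such a read is shown below. It is not the project's code: the function name `load_latest_message` and the `fallback` parameter are hypothetical, and returning a fallback when the file is missing or only partially written is an assumption about how the reader might cope with a concurrent writer.

```python
import pickle
from pathlib import Path


def load_latest_message(path="./board/message.pkl", fallback=None):
    """Return the unpickled message, or `fallback` if it cannot be read yet."""
    try:
        with Path(path).open("rb") as handle:
            return pickle.load(handle)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        # The consumer may not have written the file yet, or may be mid-write.
        return fallback


message = load_latest_message(fallback={})
print(type(message))
```

Only unpickle files written by the project's own consumer, since loading a pickle from an untrusted source can execute arbitrary code.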

Whenever a variable is updated while reading the file, the corresponding update function is called and refreshes the figures.\
Each update function is registered with a ```callback decorator``` and is triggered by the `n_intervals` property of the `graph-update` component.

Examples of callback functions:

```python
from dash.dependencies import Input, Output

# `app` (the Dash application) and `hist_getdata` are defined elsewhere.
# Each callback returns the updated figure for one histogram component.

@app.callback(Output('hist1-1', 'figure'),
              [Input('graph-update', 'n_intervals')])
def updateHist1(n):
    return hist_getdata(1, 1)

@app.callback(Output('hist1-2', 'figure'),
              [Input('graph-update', 'n_intervals')])
def updateHist2(n):
    return hist_getdata(1, 2)

@app.callback(Output('hist1-3', 'figure'),
              [Input('graph-update', 'n_intervals')])
def updateHist3(n):
    return hist_getdata(1, 3)
```

Structure of the callbacks:
<img src="./imgs/dashboard_callbacks.png"/>