# Streaming processing of cosmic rays using Drift Tubes detectors
## Final project of Management and Analysis of Physics Dataset (B)
### Project 4

* [Hilario Capettini](https://github.com/hcapettini2)
* [Javier Gerardo Carmona](https://github.com/eigen-carmona/)
* [Saverio Monaco](https://github.com/SaverioMonaco/)

In [1]:
import os

## 1 Data acquisition

In [2]:
%%bash
FILE=$HOME/.s3cfg
if test -f "$FILE"; then
    echo "$FILE exists."
else
    echo "host_base = cloud-areapd.pd.infn.it:5210
host_bucket = cloud-areapd.pd.infn.it:5210
use_https = true
access_key = <your EC2 access key>
secret_key = <your EC2 secret key>" > $FILE
    echo "To find your EC2 credentials, in the Dashboard go to Project → API Access and then click on View Credentials.
https://cloudveneto.ict.unipd.it/dashboard/auth/login/"
fi


To find your EC2 credentials, in the Dashboard go to Project → API Access and then click on View Credentials.
https://cloudveneto.ict.unipd.it/dashboard/auth/login/


In [3]:
!s3cmd ls s3://MAPD_miniDT_stream --no-check-certificate

/bin/bash: s3cmd: command not found


%%bash

BUCKET=MAPD_miniDT_stream
OBJECT=data_000000.txt
DIR=./data/

s3cmd get s3://$BUCKET/$OBJECT $DIR --no-check-certificate

In [None]:
# Let us make it a python function

In [None]:
def download_data(file, BUCKET='MAPD_miniDT_stream', DIR='./data/'):
    # Object names carry a six-digit, zero-padded index, e.g. data_000004.txt
    OBJECT = 'data_' + str(file).zfill(6) + '.txt'

    # Make sure the destination directory exists before downloading
    os.makedirs(DIR, exist_ok=True)
    #print('s3cmd get s3://'+BUCKET+'/'+OBJECT+' '+DIR+' --no-check-certificate')
    os.system('s3cmd get s3://'+BUCKET+'/'+OBJECT+' '+DIR+' --no-check-certificate')

In [None]:
download_data(4)

## 2 Dataset processing
The processing can be performed with either Spark or Dask.

In [1]:
#### SAMPLE DATA PROCESSING
#!pip install pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sample_data = pd.read_csv('sample_data.csv')
#df = spark.read.csv('sample_data.csv', header=True, inferSchema=True) # header=True takes the column names from the first line
df = spark.createDataFrame(sample_data)
df.dtypes

[('HEAD', 'bigint'),
 ('FPGA', 'bigint'),
 ('TDC_CHANNEL', 'bigint'),
 ('ORBIT_CNT', 'bigint'),
 ('BX_COUNTER', 'bigint'),
 ('TDC_MEAS', 'double')]

The following information should be produced for each batch:
1. total number of processed hits, post-cleansing (1 value per batch)
2. total number of processed hits, post-cleansing, per chamber (4 values per batch)
3. histogram of the counts of active TDC_CHANNEL, per chamber (4 arrays per batch)
4. histogram of the total number of active TDC_CHANNEL in each ORBIT_CNT, per chamber (4 arrays per batch)

This information should be wrapped in one message per batch and injected into a new
Kafka topic.
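
As a rough illustration of the "one message per batch" idea, the four results could be packed into a single JSON string. This is only a sketch: the function name `build_batch_message` and the field names are hypothetical, and the numbers are toy values rather than real detector output.

```python
import json

def build_batch_message(total_hits, hits_per_chamber, channel_hists, orbit_hists):
    """Pack the four per-batch results into one JSON string.

    The field names below are illustrative assumptions, not a fixed schema.
    """
    message = {
        "total_hits": total_hits,              # 1 value per batch
        "hits_per_chamber": hits_per_chamber,  # 4 values per batch
        "channel_hist": channel_hists,         # 4 histograms per batch
        "orbit_hist": orbit_hists,             # 4 histograms per batch
    }
    return json.dumps(message)

# Toy values only, to show the shape of the message
msg = build_batch_message(
    total_hits=10,
    hits_per_chamber=[4, 3, 2, 1],
    channel_hists=[{"3": 2, "7": 2}, {"64": 3}, {"1": 2}, {"100": 1}],
    orbit_hists=[{"1": 3, "2": 1}, {"1": 3}, {"2": 1}, {"1": 1}],
)
```

The encoded string (`msg.encode()`) is what a Kafka producer client would then publish; the topic name and the client library are left open here.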

In [2]:
## TOTAL NUMBER OF PROCESSED HITS
clean_df = df.filter(df.HEAD == 2)
total_hits = clean_df.count()

In [3]:
## CHAMBER FILTERING
c_fp = clean_df.filter(clean_df.FPGA == 0)
c_ga = clean_df.filter(clean_df.FPGA == 1)
c_0 = c_fp.filter(c_fp.TDC_CHANNEL < 64)
c_1 = c_fp.filter(c_fp.TDC_CHANNEL >= 64)
c_2 = c_ga.filter(c_ga.TDC_CHANNEL < 64)
c_3 = c_ga.filter(c_ga.TDC_CHANNEL >= 64)
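
The cuts used in this cell amount to a mapping from (FPGA, TDC_CHANNEL) to a chamber index. A plain-Python sketch of that mapping, with the hypothetical helper name `chamber_of`, makes the convention explicit:

```python
def chamber_of(fpga, tdc_channel):
    """Return the chamber index (0 to 3) for one hit, or None for an unknown FPGA.

    Mirrors the cuts of this cell: FPGA 0 holds chambers 0 and 1,
    FPGA 1 holds chambers 2 and 3, and channel 64 is the boundary.
    """
    if fpga not in (0, 1):
        return None
    return 2 * fpga + (0 if tdc_channel < 64 else 1)

# One example hit per chamber
examples = [chamber_of(0, 10), chamber_of(0, 64), chamber_of(1, 63), chamber_of(1, 127)]
```

Expressed as a single column, the same mapping would let one `groupBy` replace the four separate filters, though that variant is not tested here.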

In [4]:
## TOTAL NUMBER OF PROCESSED HITS PER CHAMBER
# count() is an action and already returns a Python int, so no collect() is needed
hits_0 = c_0.count()
hits_1 = c_1.count()
hits_2 = c_2.count()
hits_3 = c_3.count()

In [9]:
## ACTIVE TDC_CHANNEL PER CHAMBER
hist_0 = c_0.groupBy('TDC_CHANNEL').count().collect()
hist_1 = c_1.groupBy('TDC_CHANNEL').count().collect()
hist_2 = c_2.groupBy('TDC_CHANNEL').count().collect()
hist_3 = c_3.groupBy('TDC_CHANNEL').count().collect()

In [10]:
## ACTIVE TDC_CHANNEL PER CHAMBER PER ORBIT_CNT
orb_0 = c_0.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
orb_1 = c_1.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
orb_2 = c_2.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
orb_3 = c_3.groupBy('TDC_CHANNEL','ORBIT_CNT').count().collect()
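
One possible reading of item 4 is: for each ORBIT_CNT, count how many distinct channels fired, then histogram those counts. The sketch below illustrates that reading in plain Python on toy data; the function name `active_channels_histogram` and the input layout are assumptions made for the example.

```python
from collections import Counter, defaultdict

def active_channels_histogram(hits):
    """hits: iterable of (orbit_cnt, tdc_channel) pairs for one chamber.

    Returns a Counter mapping "number of active channels in an orbit"
    to "number of orbits with that many active channels".
    """
    channels_per_orbit = defaultdict(set)
    for orbit, channel in hits:
        # A set keeps each channel once per orbit, even if it fired repeatedly
        channels_per_orbit[orbit].add(channel)
    return Counter(len(channels) for channels in channels_per_orbit.values())

# Toy data: orbit 100 has channels {3, 4}, orbit 101 has {7}, orbit 102 has {1, 2}
toy_hits = [(100, 3), (100, 4), (100, 4), (101, 7), (102, 1), (102, 2)]
hist = active_channels_histogram(toy_hits)
```

Here two orbits have two active channels and one orbit has a single active channel.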

## 3 Live plotting
The results of the processing should be presented in the form of a continuously updating
dashboard.

[Row(TDC_CHANNEL=126, count=1),
 Row(TDC_CHANNEL=124, count=1),
 Row(TDC_CHANNEL=107, count=2),
 Row(TDC_CHANNEL=105, count=1),
 Row(TDC_CHANNEL=127, count=1),
 Row(TDC_CHANNEL=66, count=2),
 Row(TDC_CHANNEL=75, count=1),
 Row(TDC_CHANNEL=117, count=1),
 Row(TDC_CHANNEL=70, count=1),
 Row(TDC_CHANNEL=64, count=1)]
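
A continuously updating dashboard needs some state that is refreshed whenever a new batch message arrives. The following is a minimal sketch of such state, assuming the dashboard shows running totals across batches (it could equally show only the latest batch). The class name `DashboardState` and the message fields are hypothetical.

```python
from collections import Counter

class DashboardState:
    """Running totals that a plotting loop could redraw after every batch."""

    def __init__(self, n_chambers=4):
        self.total_hits = 0
        self.channel_hists = [Counter() for _ in range(n_chambers)]

    def update(self, message):
        # message: dict with "total_hits" (int) and "channel_hist"
        # (one {channel: count} dict per chamber); both names are assumptions
        self.total_hits += message["total_hits"]
        for running, batch in zip(self.channel_hists, message["channel_hist"]):
            running.update(batch)  # Counter.update adds counts, it does not overwrite

state = DashboardState()
state.update({"total_hits": 3, "channel_hist": [{5: 2}, {64: 1}, {}, {}]})
state.update({"total_hits": 2, "channel_hist": [{5: 1}, {}, {}, {100: 1}]})
```

After each `update`, the plotting code would redraw the four histograms from `state.channel_hists`.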

## 4 Extras
Two additional types of results can be added to the list of the processing, and displayed
on the live visualization:
1. histogram of the counts of active TDC_CHANNEL, per chamber, ONLY for those orbits with at least one scintillator signal in it (4 arrays per batch)
2. histogram of the DRIFTIME, per chamber (4 arrays per batch)
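
The first extra can be sketched in two steps: collect the orbits that contain a scintillator hit, then histogram the chamber's channels restricted to those orbits. How a scintillator hit is identified is not stated in this notebook, so the predicate `is_scintillator` below (FPGA 1, TDC_CHANNEL 128) is purely an assumption for illustration, as are the function names.

```python
from collections import Counter

def scintillator_orbits(all_hits, is_scintillator):
    """Return the set of ORBIT_CNT values that contain at least one scintillator hit."""
    return {hit["ORBIT_CNT"] for hit in all_hits if is_scintillator(hit)}

def channel_hist_in_orbits(chamber_hits, good_orbits):
    """Histogram of TDC_CHANNEL for one chamber, restricted to the given orbits."""
    return Counter(
        hit["TDC_CHANNEL"] for hit in chamber_hits if hit["ORBIT_CNT"] in good_orbits
    )

# ASSUMPTION: this rule for tagging scintillator hits is not taken from the notebook
def is_scintillator(hit):
    return hit["FPGA"] == 1 and hit["TDC_CHANNEL"] == 128

# Toy hits: only orbit 10 contains a scintillator signal
all_hits = [
    {"FPGA": 1, "TDC_CHANNEL": 128, "ORBIT_CNT": 10},
    {"FPGA": 0, "TDC_CHANNEL": 5, "ORBIT_CNT": 10},
    {"FPGA": 0, "TDC_CHANNEL": 5, "ORBIT_CNT": 11},
    {"FPGA": 0, "TDC_CHANNEL": 9, "ORBIT_CNT": 10},
]
chamber_0 = [h for h in all_hits if h["FPGA"] == 0 and h["TDC_CHANNEL"] < 64]
good_orbits = scintillator_orbits(all_hits, is_scintillator)
hist_0_scint = channel_hist_in_orbits(chamber_0, good_orbits)
```

The hit on channel 5 in orbit 11 is dropped because that orbit has no scintillator signal.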