# MAPD mod. B - Final Project
##  Streaming processing of cosmic rays using Drift Tubes detectors

The goal of this project is to reproduce the real-time processing of real data collected by a particle physics detector and to publish the results on a dashboard for live monitoring.

### Students:
- Aidin Attar - 2048654
- Ema Baci - 2050726
- Mariam Hergnyan - 2040478

We build a streaming application that monitors basic detector-quality plots online. The data are read from an S3 bucket and converted to JSON, one message per row, following the data format of the muon detector. The format is shown below.

| HEAD | FPGA | TDC_CHANNEL |  ORBIT_CNT | BX_COUNTER | TDC_MEAS |
|:----:|:----:|:-----------:|:----------:|:----------:|:--------:|
|   1  |   1  |      0      | 3387315431 |      0     |    130   |
|   0  |   1  |      2      | 3387315431 |    1119    |    24    |
|   4  |   1  |      0      | 3387315431 |      0     | -0.57373 |
|   5  |   1  |      0      | 3387315431 |      0     |   45.5   |
|   2  |   0  |      75     | 3387200947 |    2922    |     2    |
|   2  |   0  |     105     | 3387200955 |    2227    |    29    |
|  ... |  ... |     ...     |     ...    |     ...    |    ...   |
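
To make the row-to-message conversion concrete, here is a minimal sketch that turns two rows of the table into the JSON payloads sent to Kafka. It assumes the CSV header uses the same column names as the table, and it uses the standard-library `csv` module in place of pandas so that it runs without extra dependencies. Like `dtype='str'` in the producer code, it keeps every field as a string.

```python
import csv
import io
import json

# Two rows copied from the table above, laid out as they might appear in a CSV file
# (assumption: the header matches the table's column names)
csv_text = (
    "HEAD,FPGA,TDC_CHANNEL,ORBIT_CNT,BX_COUNTER,TDC_MEAS\n"
    "2,0,75,3387200947,2922,2\n"
    "2,0,105,3387200955,2227,29\n"
)

# DictReader yields one dict per row and keeps every field as a string
records = list(csv.DictReader(io.StringIO(csv_text)))

# Each record is serialized to UTF-8 JSON bytes: the payload of one Kafka message
messages = [json.dumps(rec).encode("utf-8") for rec in records]

print(messages[0])
```

Because all values are strings, the consumer side has to cast them to numeric types before doing any computation.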

The overall schema of the project is shown below.
![Schematic view of the configuration](Schema1.png)

We set up one Kafka topic to stream the data and one topic to receive the results. For Spark, we used 1 master node with 2 executors and 3 worker nodes with 4 executors each. Every executor has 1 core and 1500 MiB of memory.
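
As a quick check of the resources this layout provides, the totals can be computed as in the sketch below. The node names are hypothetical labels used only for illustration.

```python
# Executor layout as described above (node names are hypothetical labels)
executors_per_node = {"master": 2, "worker-1": 4, "worker-2": 4, "worker-3": 4}
cores_per_executor = 1
memory_per_executor_mib = 1500

total_executors = sum(executors_per_node.values())
total_cores = total_executors * cores_per_executor
total_memory_mib = total_executors * memory_per_executor_mib

print(total_executors, total_cores, total_memory_mib)  # 14 14 21000
```

The total of 14 executors equals the `num_partitions=14` used when the topics are created in Part 1. This is consistent with giving each executor one partition to read, although that is an inference and not something stated here.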

![Schematic view of the virtual machines (VM)](Schema2.png)


## Part 1 - Producer

In [1]:
import kafka
from kafka       import KafkaProducer
from kafka.admin import KafkaAdminClient
from kafka.admin import NewTopic

from tqdm        import tqdm

import pandas as pd
import os
import boto3
import json
import time

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [2]:
# Kafka installation directory and bootstrap server address
KAFKA_HOME = '/usr/local/kafka'
KAFKA_BOOTSTRAP_SERVERS = ['pd-slave3:9092']

In [3]:
# producer definition using the address given before
producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)

# connect to the cluster to run admin functions
kafka_admin = KafkaAdminClient(
    bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
    api_version=(0,10,2))

In [6]:
# delete previous topics

if 'topic_stream' in kafka_admin.list_topics():
    kafka_admin.delete_topics(['topic_stream'])
    
if 'topic_results' in kafka_admin.list_topics():
    kafka_admin.delete_topics(['topic_results'])

kafka_admin.list_topics()

['__consumer_offsets']

In [7]:
# define new topics for stream and results
# (run 2 times: TODO correct this)

stream_topic = NewTopic(name='topic_stream', 
                        num_partitions=14, 
                        replication_factor=1)

results_topic = NewTopic(name='topic_results', 
                         num_partitions=14, 
                         replication_factor=1)

kafka_admin.create_topics(new_topics=[stream_topic, results_topic])

CreateTopicsResponse_v3(throttle_time_ms=0, topic_errors=[(topic='topic_stream', error_code=0, error_message=None), (topic='topic_results', error_code=0, error_message=None)])
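
The comment in the cell above notes that it has to be run twice. One possible explanation, which is an assumption and was not verified here, is that topic deletion completes asynchronously on the broker, so a creation request issued right after deletion may still find the old topics. A minimal sketch of a polling helper that waits for the topics to disappear is given below. `wait_until_absent` is a hypothetical name, and it takes any callable that returns the current topic names, such as `kafka_admin.list_topics`.

```python
import time

def wait_until_absent(list_topics, names, timeout_s=30.0, poll_s=0.5):
    """Poll list_topics() until none of `names` is listed.

    Returns True once all names are gone, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        existing = set(list_topics())
        if not existing.intersection(names):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

# Intended use before creating the topics (hypothetical, not run here):
# wait_until_absent(kafka_admin.list_topics, ['topic_stream', 'topic_results'])
```

If the helper returns `False`, the topics are still listed and creating them again would most likely fail, so the creation step should be skipped or retried later.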

In [8]:
# load data from s3 bucket
# verify=False only disables TLS certificate verification;
# the access keys are still used to sign the requests
s3_client = boto3.client('s3',
                         endpoint_url='https://cloud-areapd.pd.infn.it:5210',
                         aws_access_key_id='1a4543841b844a88bb3f2eba45764d61',
                         aws_secret_access_key='42e2f16592f54668b8421ecf5ca7ba51',
                         verify=False)

list_obj_contents = s3_client.list_objects(Bucket= 'mapd-minidt-stream')['Contents']

# wait time to have ~1000 rows/s
wait_time = .00018
#wait_time = .0
while True:
    try:
        for obj in list_obj_contents:
            file_name = obj['Key']
            row_data = pd.read_csv(s3_client.get_object(Bucket='mapd-minidt-stream',
                                                        Key=file_name).get('Body'), dtype='str')
            # convert the rows to a list of dicts (one per message);
            # JSON serialization happens when each message is sent
            json_data = row_data.to_dict('records')

            # tqdm to visualize progresses
            for msg in tqdm(json_data): 
                # publish message
                producer.send('topic_stream', json.dumps(msg).encode('utf-8'))
                # sleep time
                time.sleep(wait_time)
            # block until all buffered messages of this file have been sent
            producer.flush()
    except KeyboardInterrupt:
        break

100%|████████████████████████████████████████████████████████████████████████| 1310592/1310592 [22:04<00:00, 989.25it/s]
100%|████████████████████████████████████████████████████████████████████████| 1310720/1310720 [24:20<00:00, 897.49it/s]
100%|████████████████████████████████████████████████████████████████████████| 1310720/1310720 [23:40<00:00, 922.74it/s]
 40%|████████████████████████████▉                                            | 518862/1310720 [09:40<14:45, 894.37it/s]
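
The progress bars above show roughly 900 to 990 messages per second, somewhat below the 1000 rows/s target. A fixed `time.sleep` is added on top of the time spent serializing and sending each message, so the effective rate depends on that overhead. One alternative, sketched here under the assumption that a steady average rate is what matters, is to pace the loop against a monotonic clock. `paced` is a hypothetical helper, and its `clock` and `sleep` parameters exist only to make it easy to test.

```python
import time

def paced(items, rate_hz, clock=time.monotonic, sleep=time.sleep):
    """Yield items at an average rate of at most rate_hz items per second."""
    period = 1.0 / rate_hz
    start = clock()
    for n, item in enumerate(items):
        # Time remaining until the n-th item is due, measured from the start
        delay = start + n * period - clock()
        if delay > 0:
            sleep(delay)
        yield item

# Intended use in the producer loop (hypothetical, not run here):
# for msg in paced(json_data, rate_hz=1000):
#     producer.send('topic_stream', json.dumps(msg).encode('utf-8'))
```

Because each deadline is computed from the start time rather than from the previous message, time spent in `producer.send` is absorbed by a shorter sleep instead of accumulating.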


In [None]:
# close the producer
producer.close()