<center><h1>Management and Analysis of Physics Dataset (MOD. B) </h1></center>
<center><h2> Project 5 - Streaming processing of cosmic rays using Drift Tubes detectors</h2></center>
<center><h2>Group 2305</h2></center>

<center><style>
    table {font-size: 24px;}
</style></center>

| Last Name        | First Name         |Student ID|
|:----------------:|:------------------:|:--------------:|
| Bertinelli       | Gabriele           |1219907 (tri)   |
| Bhatti           | Roben              |2091187         |
| Bonato           | Diego              |2091250         |
| Cacciola         | Martina            |2097476         |

<left><h2> Part 1 - Producer</h2></left>

### Import packages and modules 

In [1]:
# import kafka
from kafka       import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic


import pandas as pd
import os
import boto3
import json
import time
from tqdm        import tqdm

### Producer creation

The Producer (Sender) is the Kafka abstraction for publishing data into a Kafka topic.  

In order for a Producer to publish a message, we need to specify at least two things:
1. The location of the cluster over the network &rarr; `KAFKA_BOOTSTRAP_SERVER`

In [2]:
# define the list of brokers in the cluster
KAFKA_BOOTSTRAP_SERVER = '##'

In [3]:
# producer definition
producer = KafkaProducer(bootstrap_servers = KAFKA_BOOTSTRAP_SERVER,
                         batch_size=16000,  # maximum size of a producer batch in bytes (about 16 kB, not 16 MB)
                         linger_ms=20)      # ms the producer is willing to wait before sending a batch out

        
# KAFKA ADMIN is responsible for creating/deleting topics
# connecting to client 
kafka_admin = KafkaAdminClient(bootstrap_servers = KAFKA_BOOTSTRAP_SERVER)

In [4]:
kafka_admin.list_topics()

['data_clean', 'test_clean_1', 'data_raw', '__consumer_offsets']

### Topic creation


2. The topic to which the messages will be published &rarr; `create_topics`. 

In addition to the name, we can specify the number of partitions of a topic, `num_partitions`, and the number of copies kept of each partition, `replication_factor`.  
The replication factor is the number of copies of the data stored across different Kafka brokers. Since we have just one broker, we set `replication_factor=1`.  
We varied the number of partitions of the `data_raw` topic, along with Spark's parameters, to test the performance of the network in different configurations.
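
To give a feel for what the partition count changes, the sketch below simulates how records sent without a key could spread over the partitions of a topic. It assumes that each keyless record goes to a partition chosen uniformly at random, which is our understanding of kafka-python's default partitioner rather than something verified on the cluster. `simulate_partition_load` and its arguments are hypothetical names, and nothing here talks to the broker.

```python
import random
from collections import Counter


def simulate_partition_load(num_records, num_partitions, seed=0):
    """Count how many keyless records land on each partition.

    Assumption: each record goes to a partition chosen uniformly at random.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    load = Counter({p: 0 for p in range(num_partitions)})
    for _ in range(num_records):
        load[rng.randrange(num_partitions)] += 1
    return load


# 10 partitions as for data_raw, 1 partition as for data_clean
raw_load = simulate_partition_load(num_records=10_000, num_partitions=10)
clean_load = simulate_partition_load(num_records=10_000, num_partitions=1)

print(sorted(raw_load.items()))
print(sorted(clean_load.items()))
```

With 10 partitions each one receives roughly a tenth of the records, so several consumers can read the topic in parallel. With a single partition every record ends up in the same place.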

In [5]:
#delete existing topics

if 'data_raw' in kafka_admin.list_topics():
    kafka_admin.delete_topics(['data_raw'])
    
if 'data_clean' in kafka_admin.list_topics():
    kafka_admin.delete_topics(['data_clean'])

kafka_admin.list_topics()


['test_clean_1', '__consumer_offsets']

In [6]:
data_raw = NewTopic(name='data_raw',
                    num_partitions=10,
                    replication_factor=1)  # replication factor is 1 (no replication) because we have one broker

data_clean = NewTopic(name='data_clean',
                      num_partitions=1,
                      replication_factor=1)

kafka_admin.create_topics(new_topics=[data_raw, data_clean])
kafka_admin.list_topics()

['data_clean', 'test_clean_1', 'data_raw', '__consumer_offsets']

### Load data from s3 bucket
We first connect to the S3 bucket containing the data. We loop over the objects in the bucket, access each file by its `Key`, and load it into a Pandas DataFrame. Since the files are read with `sep=' '`, the comma-separated header becomes the name of a single column, which we rename to `values`. We then loop over the DataFrame and pass each row, encoded as bytes, to `producer.send()` for the `data_raw` topic. When the number of queued rows reaches the `batch_size` defined in the loop, we call `producer.flush()` and pause for `wait_time` seconds.
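
The long column name appears because the header line is comma-separated while the file is split on spaces, so each line stays in one piece. The sketch below reproduces this without the bucket, using the standard `csv` module as a stand-in for `pd.read_csv(..., sep=' ')`. The two data rows are made-up values, not taken from the real files.

```python
import csv
import io

# Hypothetical file content: comma-separated header and rows (values are made up)
sample = (
    "HEAD,FPGA,TDC_CHANNEL,ORBIT_CNT,BX_COUNTER,TDC_MEAS\n"
    "2,0,75,3387200947,2922,2\n"
    "2,1,105,3387200947,2227,29\n"
)

# Splitting on spaces finds no separator, so every line stays one field
lines = list(csv.reader(io.StringIO(sample), delimiter=" "))
header, rows = lines[0], lines[1:]

# Each row is a single string, ready to be encoded as a Kafka message value
messages = [row[0].encode() for row in rows]

print(header)
print(messages[0])
```

Each message therefore carries one full comma-separated line, and splitting it into fields is left to the consumer side.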

The `KafkaProducer.send()` method is asynchronous: it places the message in an internal buffer and returns immediately. The actual sending to the broker happens later, in the background, according to tunable parameters such as the producer's `batch_size` (in bytes) and `linger_ms`. We use this method to queue the rows of a batch. 
To make sure a batch has actually been delivered, we call the producer's `flush()` method, which blocks until all outstanding messages have been sent. We call it when the batch is complete, before pausing.
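
The send, flush and pause pattern can be tried out without a broker. The sketch below is a minimal illustration: `FakeProducer` and `stream_rows` are hypothetical names, and the fake only records calls. Unlike the real `KafkaProducer`, which may send buffered records in the background before `flush()` is called, the fake holds everything until `flush()`.

```python
import time


class FakeProducer:
    """Stand-in for KafkaProducer that only records calls (hypothetical)."""

    def __init__(self):
        self.pending = []   # records queued by send() and not yet flushed
        self.flushed = []   # one list of records per flush() call

    def send(self, topic, value):
        self.pending.append((topic, value))

    def flush(self):
        self.flushed.append(self.pending)
        self.pending = []


def stream_rows(rows, producer, topic, batch_size, wait_time, sleep=time.sleep):
    """Queue rows one by one; flush and pause after every batch_size rows."""
    queued = 0
    for row in rows:
        producer.send(topic, row.encode())
        queued += 1
        if queued == batch_size:
            producer.flush()
            sleep(wait_time)
            queued = 0
    producer.flush()  # deliver whatever is left in a partial last batch


producer = FakeProducer()
pauses = []  # collects the requested pauses instead of really sleeping
stream_rows(["a", "b", "c", "d", "e"], producer, "data_raw",
            batch_size=2, wait_time=0.95, sleep=pauses.append)

print([len(batch) for batch in producer.flushed])  # sizes of the flushed groups
print(pauses)
```

Passing `sleep` as an argument is only a convenience for trying the loop quickly; with the default `time.sleep` the function pauses for real.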

In [7]:
# the access keys are not actually needed for this bucket; verify=False disables SSL certificate verification
s3_client = boto3.client('#',
                         endpoint_url='##',
                         aws_access_key_id='##',
                         aws_secret_access_key='##',
                         verify=False)

# bucket containing the data; each object's Key is the name of a txt file we are going to parse
list_obj_contents = s3_client.list_objects(Bucket = '##')['Contents'] 




In [8]:
batch_counter = 0 


### Change these parameters to adjust the number of records per second

wait_time = 0.95  # pause after each batch, in seconds

batch_size = 1000  # number of rows sent per batch


### wait_time is below 1 s by design, so we target slightly more than one batch per second; we could also send 1 batch per second


### loop over the S3 bucket and send batches to the broker, simulating a data stream

for obj in list_obj_contents:
    
    #load each txt file into pandas dataframe
    
    df=pd.read_csv(s3_client.get_object(Bucket='##', Key=obj['Key']).get('Body'), sep=' ')
    df.rename(columns = {'HEAD,FPGA,TDC_CHANNEL,ORBIT_CNT,BX_COUNTER,TDC_MEAS':'values'}, inplace = True)
    
    #df['string'] = df[df.columns].astype(str).apply(lambda x: ', '.join(x), axis = 1)

    print('NewFile') 
    
    for i in tqdm(range(len(df))):
        
        row = df['values'].iloc[i].encode()
        
        #row_bytes = bytes(row.to_csv(lineterminator=',', header=False,index=False), encoding='utf-8')
        
        # queue the record for sending (asynchronous)
        producer.send('data_raw', row)
        
        # increment the batch counter
        batch_counter+=1
        
        # once batch_size rows are queued, wait until they have all been sent
        if batch_counter==batch_size:
            producer.flush()
            # pause before the next batch
            time.sleep(wait_time)
            batch_counter=0
            
    # send last batch 
    producer.flush()



NewFile


 86%|███████████████████████████▍    | 1124999/1310592 [20:26<03:22, 916.99it/s]


KeyboardInterrupt: 
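
The progress bar gives a rough check of the sending rate. A back-of-the-envelope calculation, using only the numbers shown above (1000 rows per batch, a 0.95 s pause, 1124999 rows in 20 min 26 s), might look like the following. Attributing the whole gap to per-batch overhead (row loop, `send()` calls and `flush()`) is an assumption.

```python
batch_size = 1000          # rows per batch, as set in the loop above
wait_time = 0.95           # pause after each batch, in seconds

rows_done = 1_124_999      # rows processed when the run was interrupted
elapsed_s = 20 * 60 + 26   # elapsed time reported by tqdm (20:26)

# Rate if the pause were the only cost of a batch
ideal_rate = batch_size / wait_time

# Average rate actually observed over the run
observed_rate = rows_done / elapsed_s

# Time per batch not explained by the pause (assumed to be overhead)
overhead_per_batch = batch_size / observed_rate - wait_time

print(f"ideal rate:    {ideal_rate:.0f} rows/s")
print(f"observed rate: {observed_rate:.0f} rows/s")
print(f"overhead:      {overhead_per_batch:.2f} s per batch")
```

The ideal rate is about 1053 rows/s, while the observed average is about 918 rows/s, in line with the rate shown by the progress bar. Under the assumption above, this corresponds to roughly 0.14 s per batch spent outside the pause.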