# Tasks Minichallenge 1 HPC

## Part 1: Kafka Cluster and Application Setup

### Data Generator functions
The first function uses the Binance WebSocket API (https://www.binance.com/en/support/faq/binance-options-api-interface-and-websocket-fe0be251ac014a8082e702f83d089e54). It provides data for several assets at an update interval of 1000 ms. The function is implemented in *notebooks/binance_producer.py* and is started with the docker-compose file *notebooks/docker-compose.yml*. It serializes the data as JSON.

The second function calls the Twitter API and retrieves the most recent tweets about cryptocurrencies. The function is implemented in *notebooks/twitter_producer.py* and is also started with the docker-compose file *notebooks/docker-compose.yml*. It serializes the data with pickle.
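
The two serialization choices can be sketched as follows. This is a minimal illustration, not the code from the producer scripts: the function names and the sample payloads are assumptions.

```python
import json
import pickle


def serialize_json(message: dict) -> bytes:
    """Encode a dict as UTF-8 JSON bytes (the approach described for the Binance producer)."""
    return json.dumps(message).encode("utf-8")


def deserialize_json(payload: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


def serialize_pickle(obj) -> bytes:
    """Encode a Python object with pickle (the approach described for the Twitter producer)."""
    return pickle.dumps(obj)


def deserialize_pickle(payload: bytes):
    """Decode pickled bytes. Only use this on data from a trusted source."""
    return pickle.loads(payload)


# Hypothetical sample payloads, for illustration only.
tick = {"symbol": "BTCUSDT", "price": "27000.5"}
tweets = {"data": [{"id": "1", "text": "hello"}]}

tick_roundtrip = deserialize_json(serialize_json(tick))
tweets_roundtrip = deserialize_pickle(serialize_pickle(tweets))
```

Assuming the kafka-python client is used, functions like these could be passed as `value_serializer` to `KafkaProducer` and as `value_deserializer` to `KafkaConsumer`.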

### Data Processor and Data Sinks
The first data processor consumes all messages from the Binance WebSocket producer and calculates either a mean or a sum, or takes the first and last recorded value. The transformation is done in a pandas DataFrame and the results are stored in an HDF5 file.
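
A minimal sketch of such an aggregation with pandas is shown below. The field names (`price`, `volume`) and the sample messages are assumptions made for illustration; the actual message layout and the HDF5 write step are not shown.

```python
import pandas as pd

# Hypothetical messages as they might arrive from the consumer within one interval.
messages = [
    {"symbol": "BTCUSDT", "price": 100.0, "volume": 2.0},
    {"symbol": "BTCUSDT", "price": 110.0, "volume": 3.0},
    {"symbol": "BTCUSDT", "price": 105.0, "volume": 1.0},
]

# A list of dicts (decoded JSON) converts directly into a DataFrame.
df = pd.DataFrame(messages)

# Mean, sum, first and last value, as described in the text above.
summary = {
    "price_mean": float(df["price"].mean()),
    "volume_sum": float(df["volume"].sum()),
    "price_first": float(df["price"].iloc[0]),
    "price_last": float(df["price"].iloc[-1]),
}
print(summary)
```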

The second data processor consumes the messages from the Twitter producer and transforms the text of the retrieved tweets. First the text is preprocessed, then a sentiment analysis is performed. Afterwards, the calculated values and the text are stored in an HDF5 file.
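
The preprocessing step could look roughly like the sketch below. The concrete cleaning rules (lowercasing, removing URLs, mentions and punctuation) and the function name are assumptions, since the exact steps in *twitter_processor.py* are not listed here. The sentiment analysis would then run on the cleaned text and is not shown.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def preprocess_tweet(text: str) -> List[str]:
    """Lowercase a tweet, drop URLs, mentions and punctuation, and split it into tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return text.split()


# Hypothetical example tweet.
tokens = preprocess_tweet("@alice Bitcoin is UP 5% today!! https://example.com/chart #BTC")
print(tokens)
```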

### Application components and data flow
![](Application_Overview.png)

#### What are the tasks of the components?
* twitter_producer.py: Calls the API every 10 seconds to stay within the API limits. Sends the whole response to Kafka using the pickle serializer.
* twitter_processor.py: Consumes the messages in the set time interval. Deserializes the data, applies the text transformation and writes the data to the HDF5 file.
* binance_producer.py: Opens a connection to the Binance WebSocket and maintains it. Serializes the data as JSON with UTF-8 encoding and sends it to Kafka.
* binance_processor.py: Consumes the messages in the set time interval. Deserializes the data, calculates the desired values and stores them in the HDF5 file.
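
The polling behaviour of the Twitter producer can be sketched as a simple loop. This is an assumption-based illustration, not the original code: `fetch` stands in for the API call, `send` for the Kafka producer, and `max_polls` only exists so that the example terminates.

```python
import time
from typing import Any, Callable


def run_polling_producer(
    fetch: Callable[[], Any],
    send: Callable[[Any], None],
    interval_s: float = 10.0,
    max_polls: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() every interval_s seconds and hand each response to send()."""
    sent = 0
    for i in range(max_polls):
        response = fetch()
        send(response)
        sent += 1
        if i < max_polls - 1:
            sleep(interval_s)  # wait between calls to stay within the API limit
    return sent


# Usage with stand-ins: a fake API call, a list instead of Kafka, and no real sleeping.
outbox = []
waits = []
count = run_polling_producer(
    fetch=lambda: {"data": []},
    send=outbox.append,
    sleep=waits.append,
)
```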

#### Which interfaces do the components have?
* twitter_producer.py: The data generator for the Twitter data uses the Twitter REST API and a Kafka producer.
* twitter_processor.py: The data processor for the Twitter data uses a Kafka consumer and HDF5 files.
* binance_producer.py: The data generator for the Binance data uses the Binance WebSocket and a Kafka producer.
* binance_processor.py: The data processor for the Binance data uses a Kafka consumer and HDF5 files.

#### Why did you decide to use these components?
The only free design choice was the data sink, for which I chose HDF5 files. HDF5 files provide the following advantages:
* well-suited for large amounts of data
* fast write and append operations
* ability to create multiple dataframes within one file
* self-describing file format

#### Are there any other design decisions you have made?
I chose pandas to calculate the mean, min and max values for the Binance data. pandas DataFrames are easy and fast to construct from JSON data. Transforming the data into NumPy arrays would have been more complicated.

#### Which requirements (e.g. libraries, hardware, ...) does a component have?
* twitter_producer.py
  * kafka
  * pickle
  * requests
* twitter_processor.py
  * kafka
  * h5py
  * nltk
* binance_producer.py
  * websockets
  * kafka
* binance_processor.py
  * pandas
  * h5py
  * kafka

## Part 2: Communication Patterns

### Rewritten Application and containerization
The rewritten application is in the *zmq* folder and can be started with the *zmq/docker-compose.yml* file. It is written using the ZeroMQ library.

### Questions
#### Which communication pattern is used by Kafka?
Kafka uses a publish-subscribe (pub-sub) communication pattern with three roles: producers, consumers and brokers. Producers publish messages to a topic; consumers subscribe to a topic and receive its messages. The broker receives the messages and persists them on disk. Messages that have been consumed remain on disk, so other consumers can still read them.
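
The persistence behaviour can be illustrated with a small in-memory model. This is a toy sketch in plain Python, not Kafka code: the class `ToyTopic`, its methods and the group names are made up for illustration, and a real broker writes the log to disk and tracks offsets per partition.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one topic partition: an append-only log."""

    def __init__(self):
        self.log = []                     # messages are kept after being read
        self.offsets = defaultdict(int)   # next offset to read, per consumer group

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        """Return all messages the given group has not read yet."""
        start = self.offsets[group]
        batch = self.log[start:]
        self.offsets[group] = len(self.log)
        return batch


topic = ToyTopic()
topic.publish("tick-1")
topic.publish("tick-2")

first_read = topic.poll("processor-a")   # both messages
second_read = topic.poll("processor-a")  # nothing new for this group
late_reader = topic.poll("processor-b")  # a new group still sees both messages
```

Reading a message only advances the offset of the reading group; the log itself is unchanged, which is why a consumer that starts later can still process earlier messages.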

#### What is the difference compared to your chosen pattern?
The implementation with ZeroMQ has no service in the middle. It uses a push-pull messaging pattern: the generator sends the data directly to one or more processors. If the data generator is connected to a port and sends messages while the receiving service is not running, messages can get lost, because they are only held in in-memory queues and are never persisted. Handling bottlenecks or service failures is up to the user.
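
A minimal push-pull sketch, assuming the `pyzmq` binding, is shown below. It is not the code from the *zmq* folder: the endpoint name is hypothetical, both sockets live in one process, and the `inproc` transport is used only to keep the example self-contained (separate services would typically use a `tcp://` endpoint).

```python
import zmq

context = zmq.Context()

# Receiver (processor side): binds a PULL socket.
pull = context.socket(zmq.PULL)
pull.bind("inproc://ticks")

# Sender (generator side): connects a PUSH socket to the same endpoint.
push = context.socket(zmq.PUSH)
push.connect("inproc://ticks")

# The message goes straight from sender to receiver; nothing is written to disk.
push.send_json({"symbol": "BTCUSDT", "price": 27000.5})
received = pull.recv_json()
print(received)

push.close()
pull.close()
context.term()
```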

#### What are advantages and disadvantages of these patterns?
* **High performance:** ZeroMQ is designed for high-performance messaging, with low latency and high throughput.
* **Lightweight:** ZeroMQ is a lightweight library that is easy to use and integrate into applications.
* **Flexibility:** ZeroMQ provides a wide range of messaging patterns and supports multiple programming languages.
* **Decentralized:** ZeroMQ does not require a centralized message broker, making it a more decentralized messaging system.
* **Low latency:** ZeroMQ's design minimizes latency by avoiding the need for a central broker and using efficient message passing.

##### Push-Pull ZeroMQ
**Advantages:**
* High performance: ZeroMQ is designed for high-performance messaging with low latency and high throughput. 
* Lightweight library
* Flexible message patterns

**Disadvantages:**
* Messages get lost when the data processor is not connected to the data generator. There is no message persistence.
* Limited scalability. ZeroMQ's performance can degrade as the number of nodes in a network grows.
* No built-in fault tolerance.
* If multiple pull nodes are started, the messages are distributed among them, so no single node has an overview of all incoming messages.
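
The last point can be illustrated with a plain-Python model of round-robin distribution, which is how a ZeroMQ PUSH socket spreads messages over its connected PULL peers. This is a sketch and not ZeroMQ code; the function and worker names are hypothetical.

```python
from itertools import cycle


def distribute_round_robin(messages, worker_names):
    """Assign each message to exactly one worker, taking the workers in turn."""
    assignments = {name: [] for name in worker_names}
    for message, name in zip(messages, cycle(worker_names)):
        assignments[name].append(message)
    return assignments


result = distribute_round_robin(["m1", "m2", "m3", "m4", "m5"], ["pull-a", "pull-b"])
print(result)
```

Each worker only sees its own share of the stream, so an aggregate over all messages needs an extra step that collects the partial results.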

##### Pub-Sub Kafka
**Advantages:**
* Kafka is designed for scalability and can handle a large number of consumers. 
* Kafka allows stream processing. The data can be processed and analyzed as it is ingested into the system. 

**Disadvantages:**
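* Higher latency, since every message passes through a broker
* Higher complexity, since a broker cluster has to be set up and operated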

#### How can you scale the two different approaches? What are challenges to be considered?
##### Kafka
Since Kafka is designed to be scaled, there are fewer challenges compared to ZeroMQ. However, the following challenges can arise:
* Hardware limitations. As the cluster grows, more resources are required.
* Monitoring. As the cluster grows and is distributed across multiple data centers, monitoring and management become more complex.
* Consumer lag. If consumers can't keep up with the incoming message rate, they start lagging behind, which impairs the performance of the system.
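
Consumer lag is usually expressed as the difference between the newest offset in a partition and the offset the consumer has committed. The sketch below computes it from two dictionaries; the function name and the offset values are hypothetical.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: newest offset in the log minus the committed consumer offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a topic with three partitions.
lag = consumer_lag(
    log_end_offsets={0: 1200, 1: 950, 2: 1010},
    committed_offsets={0: 1200, 1: 900},  # nothing committed yet for partition 2
)
total_lag = sum(lag.values())
```

A lag that keeps growing over time indicates that more consumers (and partitions) or faster processing are needed.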

##### ZeroMQ
* Resource limitations. As the network grows, more resources are required.
* Synchronization. ZeroMQ provides asynchronous messaging, which can be challenging to synchronize across multiple nodes. The message patterns need to be designed accordingly.
* Error handling. As the system grows, it also becomes more prone to errors. Since ZeroMQ has no built-in fault tolerance and no messages are persisted, an error-handling system needs to be implemented.
* Security. If the data is sent over the internet, encryption and authentication need to be implemented.

#### What other 2-3 topologies/patterns do you know used for data processing? Describe the differences and use cases with at least one additional topology.

##### Request/Response
The request and response pattern can be used in distributed systems. Clients send requests to a system, which sends a response back. This pattern is also known as the Client-Server pattern. 

The request and response pattern is very easy to implement and to understand. The pattern is also highly scalable as multiple clients can make requests to the server which can handle them in parallel. This pattern is also very reliable since the client waits for the response before further processing. 

Typical use cases would be web applications and a variety of distributed systems where message delivery is critical. 
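
A minimal request/response exchange can be sketched with the standard library. The example uses a connected socket pair and a thread in one process purely for illustration; the message contents and the function name are made up.

```python
import socket
import threading


def serve_once(conn: socket.socket) -> None:
    """Server side: read one request and send back one response."""
    request = conn.recv(1024)
    conn.sendall(b"echo: " + request)
    conn.close()


client_sock, server_sock = socket.socketpair()
server = threading.Thread(target=serve_once, args=(server_sock,))
server.start()

client_sock.sendall(b"price BTCUSDT?")  # request
response = client_sock.recv(1024)       # the client blocks until the response arrives
server.join()
client_sock.close()
```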

##### Exclusive pair
An exclusive pair consists of exactly one sender and one receiver. It is very simple to implement since only two nodes are necessary for communication.

The exclusive pair also guarantees message delivery and is suitable for applications where message delivery is critical. A disadvantage is the scalability. Since the message pattern can only consist of two nodes, the only possibility for scaling is vertical. 

Typical use cases would be applications where it is critical that sent messages are received. Another use case would be a resource that may only be accessed by one client. The exclusive pair could be used to lock out the other clients.

##### Fan-out messaging
Fan-out messaging is a system with one sender and multiple receivers. The messages are sent in a one-to-many fashion. In this system the sender doesn't know about the receivers and vice versa. The receivers are only responsible for receiving and processing the data.

Fan-out messaging patterns are scalable. More receivers can be added easily and the sender can send a message in parallel to a large number of receivers. The fan-out messaging pattern is very flexible. Receivers can be added and removed without any changes to the sender. During removal or addition of receivers the sender can also continue to send messages. 

Typical use cases would be broadcasting news to a large number of clients simultaneously. Another typical application would be sending out notifications for events like system failures or maintenance updates. 
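
The difference to push-pull is that every receiver gets every message. The sketch below models this with one in-memory queue per receiver; the class `FanOut` and the receiver names are hypothetical and only serve as an illustration.

```python
import queue


class FanOut:
    """Deliver every published message to all currently registered receivers."""

    def __init__(self):
        self.receivers = {}

    def subscribe(self, name):
        self.receivers[name] = queue.Queue()
        return self.receivers[name]

    def unsubscribe(self, name):
        self.receivers.pop(name, None)

    def publish(self, message):
        # The sender does not address receivers individually.
        for inbox in self.receivers.values():
            inbox.put(message)


hub = FanOut()
inbox_a = hub.subscribe("a")
inbox_b = hub.subscribe("b")
hub.publish("maintenance at 22:00")
hub.unsubscribe("b")  # receivers can leave while the sender keeps publishing
hub.publish("maintenance finished")

received_a = [inbox_a.get_nowait() for _ in range(inbox_a.qsize())]
received_b = [inbox_b.get_nowait() for _ in range(inbox_b.qsize())]
```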

#### Which pattern suits your chosen application best?
It's not critical that every message that is sent is also received. The Binance processor calculates the values every minute. If one or a few values are missing, the results are barely affected.
The same goes for the tweets. The query can't retrieve all published tweets, so there will always be missing tweets, and a single missing tweet is not a problem. The push-pull pattern is therefore suited for both applications. If the throughput increased and a single receiver were no longer able to keep up, more receivers could be added.