# Tasks Minichallenge 1 hpc

## Part 1: Kafka Cluster and Application Setup

### Data Generator functions
The first function uses the Binance WebSocket API (https://www.binance.com/en/support/faq/binance-options-api-interface-and-websocket-fe0be251ac014a8082e702f83d089e54). It provides data for several different assets with an update interval of 1000 ms. The function is implemented in *notebooks/binance_producer.py* and is started with the docker-compose file *notebooks/docker-compose.yml*. It serializes the data as JSON.

The second function calls the Twitter API and retrieves the most recent tweets about cryptocurrencies. The function is implemented in *notebooks/twitter_producer.py* and is also started with the docker-compose file *notebooks/docker-compose.yml*. It serializes the data with pickle.
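The two serialization choices can be illustrated with a minimal sketch. The helper names and sample payloads below are hypothetical and not taken from the producer scripts. Assuming the `kafka` dependency is kafka-python, functions like these could be passed to `KafkaProducer` through its `value_serializer` argument.

```python
import json
import pickle


def serialize_json(message: dict) -> bytes:
    """Encode a dict as UTF-8 JSON bytes (the format used for the Binance data)."""
    return json.dumps(message).encode("utf-8")


def deserialize_json(payload: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


def serialize_pickle(obj) -> bytes:
    """Encode an arbitrary Python object with pickle (the format used for the Twitter data)."""
    return pickle.dumps(obj)


def deserialize_pickle(payload: bytes):
    """Decode pickle bytes back into a Python object."""
    return pickle.loads(payload)


# Hypothetical sample payloads; the real message fields may differ.
ticker = {"symbol": "BTCUSDT", "price": "42000.10"}
tweets = {"data": [{"id": "1", "text": "Bitcoin is up today"}]}

# Both formats survive a round trip unchanged.
assert deserialize_json(serialize_json(ticker)) == ticker
assert deserialize_pickle(serialize_pickle(tweets)) == tweets
```

JSON is readable by consumers in any language, while pickle is Python-specific and should only be unpickled from trusted producers, because loading a pickle can execute arbitrary code.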

### Data Processor and Data Sinks
The first data processor consumes all messages from the Binance WebSocket producer and calculates either a mean or a sum, or takes the first and last recorded values. The transformation is done in a pandas DataFrame and the values are stored in an HDF5 file.
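As a rough sketch of this aggregation step, the snippet below builds a DataFrame from a batch of already deserialized messages and computes a mean, a sum, and the first and last values. The field names (`price`, `volume`) and the `aggregate` helper are assumptions for illustration and may not match *binance_processor.py*.

```python
import pandas as pd

# Hypothetical batch of deserialized messages; field names are assumptions.
messages = [
    {"symbol": "BTCUSDT", "price": 100.0, "volume": 1.0},
    {"symbol": "BTCUSDT", "price": 102.0, "volume": 2.0},
    {"symbol": "BTCUSDT", "price": 101.0, "volume": 3.0},
]


def aggregate(batch):
    """Summarize one batch of messages with pandas."""
    df = pd.DataFrame(batch)
    return {
        "price_mean": df["price"].mean(),     # average over the interval
        "volume_sum": df["volume"].sum(),     # total over the interval
        "price_first": df["price"].iloc[0],   # first recorded value
        "price_last": df["price"].iloc[-1],   # last recorded value
    }


summary = aggregate(messages)
```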

The second data processor consumes the messages from the Twitter producer and transforms the text of the tweets. First the text is preprocessed, then a sentiment analysis is performed. Finally, the calculated values and the text are stored in an HDF5 file.
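The exact preprocessing steps are not listed here, so the following is only a sketch of what such a step might look like: it lowercases the text and strips URLs, mentions, digits and punctuation. The `preprocess` function and its rules are assumptions. The cleaned text could then be scored, for instance with NLTK's `SentimentIntensityAnalyzer`, which needs the `vader_lexicon` resource to be downloaded first.

```python
import re

URL_RE = re.compile(r"https?://\S+")      # links
MENTION_RE = re.compile(r"@\w+")          # @user mentions
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # digits, punctuation, '#', ...


def preprocess(text: str) -> str:
    """Normalize a tweet before sentiment analysis (hypothetical rules)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


cleaned = preprocess("@user Bitcoin is UP 5% today!! #BTC https://example.com/x")
# cleaned == "bitcoin is up today btc"
```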

### Application components and data flow
![](Application_Overview.png)

#### What are the tasks of the components?
* twitter_producer.py: Calls the Twitter API every 10 seconds to stay within the API limits. Sends the whole response to Kafka using the pickle serializer.
* twitter_processor.py: Consumes the messages in the set time interval. Deserializes the data, applies the text transformation and writes the data to the HDF5 file.
* binance_producer.py: Opens a connection to the Binance WebSocket and maintains it. Serializes the data as UTF-8 encoded JSON and sends it to Kafka.
* binance_processor.py: Consumes the messages in the set time interval. Deserializes the data, calculates the desired values and stores them in the HDF5 file.

#### Which interfaces do the components have?
* twitter_producer.py: The data generator for the Twitter data uses the Twitter REST API and a Kafka producer.
* twitter_processor.py: The data processor for the Twitter data uses a Kafka consumer and HDF5 files.
* binance_producer.py: The data generator for the Binance data uses the Binance WebSocket and a Kafka producer.
* binance_processor.py: The data processor for the Binance data uses a Kafka consumer and HDF5 files.

#### Why did you decide to use these components?
The only design choice available was the use of HDF5 files. HDF5 files provide the following advantages:
* well-suited for large amounts of data
* fast write and append operations
* ability to store multiple datasets (for example, several DataFrames) within one file
* self-describing file format
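The append and multi-dataset points can be sketched with h5py's resizable datasets. The `append_values` helper, the dataset names and the temporary file path are hypothetical and only illustrate the idea; the processors may organize their files differently.

```python
import os
import tempfile

import h5py
import numpy as np


def append_values(path, name, values):
    """Append float values to a resizable 1-D dataset, creating it if needed."""
    values = np.asarray(values, dtype="f8")
    with h5py.File(path, "a") as f:
        if name not in f:
            # maxshape=(None,) makes the dataset growable along its only axis.
            f.create_dataset(name, shape=(0,), maxshape=(None,),
                             dtype="f8", chunks=True)
        ds = f[name]
        old = ds.shape[0]
        ds.resize((old + len(values),))
        ds[old:] = values
        return ds.shape[0]


# Hypothetical file and dataset names; two datasets share one file.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
append_values(path, "binance/price_mean", [101.0, 102.5])
total_sentiment = append_values(path, "tweets/sentiment", [0.3])
total_prices = append_values(path, "binance/price_mean", [99.0])
```

Each call only writes the new values at the end of the dataset, so earlier data does not have to be rewritten.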

#### Are there any other design decisions you have made?
I chose to use pandas to calculate the mean, min and max values for the Binance data. pandas DataFrames are quick and easy to construct from JSON data. Converting the data into NumPy arrays would have been more complicated.

#### Which requirements (e.g. libraries, hardware, ...) does a component have?
* twitter_producer.py
  * kafka
  * pickle
  * requests
* twitter_processor.py
  * kafka
  * h5py
  * nltk
* binance_producer.py
  * websockets
  * kafka
* binance_processor.py
  * pandas
  * h5py
  * kafka

## Part 2: Communication Patterns

### Rewritten Application and containerization
The rewritten application is in the *zmq* folder and can be started with the *zmq/docker-compose.yml* file. It is written with the ZeroMQ messaging library.
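Which ZeroMQ pattern the rewritten application uses is not stated here. Purely as an illustration, and assuming the pyzmq bindings, the sketch below shows one possible replacement for a Kafka topic: a PUSH/PULL pipeline. The endpoint name and the message fields are made up, and both sockets live in one process over the `inproc` transport only to keep the example self-contained.

```python
import zmq

ctx = zmq.Context()

# Consumer side: binds first so the in-process endpoint exists.
pull = ctx.socket(zmq.PULL)
pull.bind("inproc://ticks")                # hypothetical endpoint name
pull.setsockopt(zmq.RCVTIMEO, 2000)        # fail after 2 s instead of blocking forever

# Producer side: connects to the same endpoint.
push = ctx.socket(zmq.PUSH)
push.connect("inproc://ticks")

# send_json/recv_json serialize with JSON, as the Binance producer does.
push.send_json({"symbol": "BTCUSDT", "price": 101.0})
received = pull.recv_json()

push.close()
pull.close()
ctx.term()
```

In a containerized setup the two sockets would run in separate services and use a `tcp://` endpoint instead. Unlike Kafka, ZeroMQ has no broker that persists messages, so a message sent while no consumer is reachable is only queued in memory.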

###