<div style= "text-align: right">
    <p style= "text-align: right; font-weight: bold; font-size: x-large;">FIT3182 Big Data Management and Processing</p>
    <p style= "text-align: right; font-weight: bold; font-size: large;">Assignment 2</p>
    <p style= "text-align: right">Foo Kai Yan</p>
    <p style= "text-align: right">kfoo0012@student.monash.edu<br><br>33085625<br><br><i>18<sup>th</sup> May 2024</i></p>
</div>
<hr style="border-color: black;">

## Student Statement
The assignment was completed with the assistance of some code obtained from seminar/tutorial/lab/applied class.

### Installing PyMongo

In [1]:
!pip install pymongo



### Import required Libraries

In [2]:
import os
import csv
import random
import pandas as pd
import datetime as dt
from time import sleep
from json import dumps
from pprint import pprint
from pymongo import MongoClient
from kafka3 import KafkaProducer

### Check working directory

In [3]:
os.getcwd()

'/home/student/ASSIGNMENT2'

## Producer 1

In [4]:
# hostip obtained using `ipconfig` command in command prompt
hostip = "192.168.68.58"

# The column names in climate_streaming.csv are known in advance, so each field is read explicitly by name.
def readClimateStreamingCSV():
    '''
    readClimateStreamingCSV function:
    - reads data from 'climate_streaming.csv' using pandas
    - returns a list of dictionaries, one per CSV row
    '''
    # Read the CSV file into a DataFrame
    climate_streaming_data = pd.read_csv('climate_streaming.csv')
    climate_streaming_data_array = []
    
    # Iterate through each row in the DataFrame using the index
    for index in range(len(climate_streaming_data)):
        # Access the row by its index
        row = climate_streaming_data.iloc[index]
        
        # Create a dictionary for each row's data
        climate_streaming_data_point = {
            "latitude": float(row["latitude"]),
            "longitude": float(row["longitude"]),
            "air_temperature_celcius": float(row["air_temperature_celcius"]),
            "relative_humidity": float(row["relative_humidity"]),
            "windspeed_knots": float(row["windspeed_knots"]),
            "max_wind_speed": float(row["max_wind_speed"]),
            "GHI_w/m2": float(row["GHI_w/m2"])
        }
        
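        # Process the precipitation data (e.g. "0.04G" -> amount 0.04, flag "G")
        raw_precipitation = row["precipitation "]  # The column name is referenced with its trailing space
        # pandas reads empty cells as NaN, and str(NaN) is the truthy string 'nan', so check for it explicitly
        precipitation = "" if pd.isna(raw_precipitation) else str(raw_precipitation).strip()
        if precipitation:
            # Split precipitation flag (last character) and amount (everything before it)
            climate_streaming_data_point['precipitation_flag'] = precipitation[-1]
            climate_streaming_data_point["precipitation"] = float(precipitation[:-1])
        else:
            # Handle missing or empty precipitation data
            climate_streaming_data_point['precipitation_flag'] = None
            climate_streaming_data_point["precipitation"] = None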
        
        # Append the dictionary to the array
        climate_streaming_data_array.append(climate_streaming_data_point)

    return climate_streaming_data_array

def publish_message(producer_instance, topic_name, data):
    '''
    publish_message function: 
    - takes a Kafka producer instance, a topic name, and data
    - then attempts to send the data to the specified Kafka topic. 
    - If successful, it prints a confirmation message; otherwise, it prints an error message.
    '''
    try:
        producer_instance.send(topic_name, value=data)
        producer_instance.flush()
        print('Message published successfully. Data: ' + str(data))
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))

def connect_kafka_producer():
    '''
    connect_kafka_producer function: 
    - attempts to connect to a Kafka producer using the host IP initialised beforehand and port 9092
    - returning the producer instance or None if unsuccessful.
    '''
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=[f'{hostip}:9092'],
                                  value_serializer=lambda x: dumps(x).encode('utf-8'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    # Return after the try/except rather than inside `finally`, which would also swallow KeyboardInterrupt
    return _producer

if __name__ == '__main__':
    data = readClimateStreamingCSV()
    topic_name = "Climate"
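    producer = connect_kafka_producer()
    if producer is None:
        # Without a producer every publish attempt would fail, so stop here
        raise SystemExit('Could not connect to Kafka.')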
    latest_date = dt.datetime(2024, 1, 1) # Last date from historic CSV is 1/1/2024
    print("Publishing records...")

    while True:
        random_number = random.randrange(0, len(data))
        selected_data = dict(data[random_number]) # Pick a random climate data point (copied so the source list is not modified)
        latest_date += dt.timedelta(days=1) # Increase date from previous date
        selected_data["latest_date"] = latest_date.isoformat() # Set date to string format (to be stored in JSON)
        selected_data["producer_id"] = "producer1_climate"

        publish_message(producer, topic_name, selected_data) # Publish message

        sleep(10)

Publishing records...
Message published successfully. Data: {'latitude': -37.436, 'longitude': 148.088, 'air_temperature_celcius': 15.0, 'relative_humidity': 41.5, 'windspeed_knots': 17.0, 'max_wind_speed': 28.9, 'GHI_w/m2': 138.0, 'precipitation_flag': 'I', 'precipitation': 0.0, 'latest_date': '2024-01-02T00:00:00', 'producer_id': 'producer1_climate'}
Message published successfully. Data: {'latitude': -36.098, 'longitude': 143.735, 'air_temperature_celcius': 17.0, 'relative_humidity': 58.1, 'windspeed_knots': 11.7, 'max_wind_speed': 19.0, 'GHI_w/m2': 136.0, 'precipitation_flag': 'G', 'precipitation': 0.04, 'latest_date': '2024-01-03T00:00:00', 'producer_id': 'producer1_climate'}


KeyboardInterrupt: 

The code above reads climate data from 'climate_streaming.csv' and publishes it to a Kafka topic.
- The function `readClimateStreamingCSV` reads the data from 'climate_streaming.csv' using pandas.
- Each row is processed to extract its climate attributes, which are stored in a list of dictionaries. The raw precipitation value is split into a numeric amount and a one-character flag.
- The `publish_message` function takes a Kafka producer instance, a topic name, and data, then attempts to send the data to the specified Kafka topic. If successful, it prints a confirmation message; otherwise, it prints an error message.
- The `connect_kafka_producer` function attempts to create a Kafka producer connected to the broker at the host IP initialised beforehand on port 9092, returning the producer instance, or `None` if unsuccessful.
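
The precipitation handling inside `readClimateStreamingCSV` can also be looked at in isolation. The sketch below is illustrative only: `split_precipitation` is a hypothetical helper name that does not appear in the notebook, and it assumes the raw value has the form seen in the published messages, a number followed by a single flag character (for example `0.04G`).

```python
import math


def split_precipitation(raw):
    """Split a raw value such as '0.04G' into (amount, flag).

    Hypothetical helper: returns (None, None) when the value is missing
    (None, NaN, or a blank string).
    """
    # pandas represents an empty CSV cell as float NaN, so test for that first
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None, None
    text = str(raw).strip()
    if not text:
        return None, None
    # The last character is the flag; everything before it is the amount
    return float(text[:-1]), text[-1]


print(split_precipitation(" 0.04G "))
print(split_precipitation(float("nan")))
```

Keeping the parsing in a small function like this would make the missing-value case easy to check without reading the whole CSV.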

In the main execution block, 
```
if __name__ == '__main__':
```
- The climate streaming data is read with `readClimateStreamingCSV`.
- The Kafka producer is connected with `connect_kafka_producer`.
- `latest_date` is initialised to the last date in the historic CSV file.
- An infinite loop then selects a random data point from the climate data, increments the date by one day, converts the date to an ISO-format string so it can be serialised as JSON, adds a producer ID, and publishes the message to the Kafka topic `Climate`.
    - The loop pauses for 10 seconds before repeating, so one record is streamed to the topic every 10 seconds, each stamped with the next simulated day.
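
The record-building part of this loop can be tried without a Kafka broker. The following is a minimal sketch rather than the notebook's code: `generate_records` is a hypothetical generator, the two-item `sample` list stands in for the parsed CSV rows, and the `sleep` and publish steps are left out so the output can be inspected directly.

```python
import datetime as dt
import random
from itertools import islice


def generate_records(data, start_date, producer_id):
    """Yield an endless stream of records, each one simulated day apart."""
    current = start_date
    while True:
        current += dt.timedelta(days=1)        # advance the simulated date
        record = dict(random.choice(data))     # copy a random data point
        record["latest_date"] = current.isoformat()  # JSON-friendly string
        record["producer_id"] = producer_id
        yield record


# Stand-in data (assumption): only the latitude field is included here
sample = [{"latitude": -37.436}, {"latitude": -36.098}]
stream = generate_records(sample, dt.datetime(2024, 1, 1), "producer1_climate")

for record in islice(stream, 3):
    print(record["latest_date"], record["producer_id"])
```

In the producer itself, each yielded record would be passed to `publish_message`, followed by the 10-second pause.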