<a href="https://colab.research.google.com/github/diogocristovao/SPBD/blob/main/docs/labs/projs/spbd2425_tp2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sistemas para Processamento de Big Data
## TP2 - Energy Meter Live Monitoring


The sensor data corresponds to regular readings from 11 residential energy meters. The data covers the month of February 2024.

Each data sample has the following schema:

timestamp | sensor_id | energy
----------|-------------|-----------
timestamp | string  | float

Each energy value (kWh) is the accumulated value of the meter at the time of measurement. As such,
each meter is expected to produce a monotonically increasing series of (timestamp, energy consumed up to that moment) pairs.

The meters do not start at zero or at the same value.
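
Because the readings are cumulative and each meter starts from its own offset, the energy consumed over an interval is the difference between the last and the first reading of that interval, not the sum of the readings. A minimal sketch in plain Python, using a few values for sensors D and E copied from the sample readings in this notebook; the function name `consumed_per_sensor` is only illustrative, and the input is assumed to be sorted by timestamp:

```python
def consumed_per_sensor(readings):
    """Return {sensor_id: last_reading - first_reading} for cumulative meter readings.

    `readings` is an iterable of (timestamp, sensor_id, energy_kwh) tuples,
    assumed to be sorted by timestamp.
    """
    first, last = {}, {}
    for _, sensor_id, energy in readings:
        first.setdefault(sensor_id, energy)  # remember the baseline only once
        last[sensor_id] = energy             # keep overwriting with the latest value
    # Round to tame floating-point noise in the subtraction
    return {s: round(last[s] - first[s], 3) for s in first}

sample = [
    ("2024-02-01 00:00:00", "D", 2615.0),
    ("2024-02-01 00:00:54", "E", 1874.0),
    ("2024-02-01 00:02:00", "E", 1874.1),
    ("2024-02-01 00:04:24", "D", 2615.1),
    ("2024-02-01 00:08:36", "E", 1874.2),
    ("2024-02-01 00:08:48", "D", 2615.2),
]
per_sensor = consumed_per_sensor(sample)
total = round(sum(per_sensor.values()), 3)
print(per_sensor, total)  # {'D': 0.2, 'E': 0.2} 0.4
```

Summing the raw readings instead would mostly reflect how many samples arrived and how large each meter's offset is, rather than the energy used.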


## Questions

For all the sensors combined:

1. For the current month and current day, compute the running total energy consumed so far. The values should be updated every 5 minutes.

2. For the current month and current day, compute the running total energy consumed so far, **as a percentage**, **compared to the same periods in February 2024**. The values should be updated every 5 minutes.
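
One possible reading of question 2 is: at each update, express the running total of the current day as a percentage of the running total that the archived February 2024 data had reached at the same time of day. A minimal sketch of that arithmetic in plain Python, assuming both running totals are already available per 5-minute bucket; the dictionary layout, the sample values, and the name `percent_of_reference` are assumptions made for illustration:

```python
def percent_of_reference(current_total, reference_total):
    """Return current_total as a percentage of reference_total, or None if undefined."""
    if not reference_total:  # reference missing or still at zero consumption
        return None
    return 100.0 * current_total / reference_total

# Hypothetical running totals (kWh), keyed by (day_of_month, end of 5-minute bucket).
reference = {(1, "00:05"): 4.0, (1, "00:10"): 8.0}
current = {(1, "00:05"): 3.0, (1, "00:10"): 10.0, (1, "00:15"): 12.0}

for key in sorted(current):
    print(key, percent_of_reference(current[key], reference.get(key)))
```

Returning `None` when the reference is missing or zero avoids a division by zero for the first buckets of the day, where little or no energy has been consumed yet.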

For each sensor, separately:

3. For the current month and current day, compute the running total energy consumed so far, as a percentage, **comparing the value of each individual sensor, relative to the same results for all the sensors together (as in #1)**. The values should be updated every 5 minutes. (Sorted in descending order by value and sensor.)
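
For question 3, each sensor's share is its own consumption so far divided by the combined consumption of all sensors. A minimal sketch in plain Python, assuming the per-sensor consumption has already been computed; it interprets "descending order by value and sensor" as both keys descending, which is an assumption, and the name `sensor_shares` and the sample values are hypothetical:

```python
def sensor_shares(consumed_by_sensor):
    """Return [(sensor_id, percent_of_total)], sorted by percent then sensor, both descending."""
    total = sum(consumed_by_sensor.values())
    if total <= 0:
        # Nothing consumed yet: every share is reported as 0%
        return [(s, 0.0) for s in sorted(consumed_by_sensor, reverse=True)]
    shares = [(s, 100.0 * v / total) for s, v in consumed_by_sensor.items()]
    return sorted(shares, key=lambda item: (item[1], item[0]), reverse=True)

# Hypothetical per-sensor consumption so far today (kWh).
print(sensor_shares({"A": 1.0, "B": 2.0, "C": 1.0}))
# [('B', 50.0), ('C', 25.0), ('A', 25.0)]
```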

**Note:** For simplicity, it is fine to assume the first reading of each day can be used to start counting how much energy has been consumed so far. There is no need to interpolate/estimate the value of the meters at midnight.
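
The note above suggests a simple piece of state: per sensor, remember the first reading of the day as a baseline and the latest reading seen, and reset both when the day changes. A minimal sketch in plain Python (not Spark), assuming readings arrive roughly in timestamp order; the class name `DailyRunningTotal` and the next-day reading are hypothetical:

```python
from datetime import datetime

class DailyRunningTotal:
    """Track the energy consumed so far today, summed over all sensors."""

    def __init__(self):
        self.day = None
        self.baseline = {}  # sensor_id -> first reading of the day
        self.latest = {}    # sensor_id -> most recent reading of the day

    def update(self, timestamp, sensor_id, energy):
        day = timestamp.date()
        if day != self.day:  # a new day started: drop the previous day's state
            self.day, self.baseline, self.latest = day, {}, {}
        self.baseline.setdefault(sensor_id, energy)
        self.latest[sensor_id] = energy

    def total(self):
        return sum(self.latest[s] - self.baseline[s] for s in self.latest)

tracker = DailyRunningTotal()
fmt = "%Y-%m-%d %H:%M:%S"
for ts, sensor, energy in [
    ("2024-02-01 00:00:00", "D", 2615.0),
    ("2024-02-01 00:04:24", "D", 2615.1),
    ("2024-02-01 00:08:48", "D", 2615.2),
    ("2024-02-02 00:00:10", "D", 2630.0),  # hypothetical first reading of the next day
]:
    tracker.update(datetime.strptime(ts, fmt), sensor, energy)
    print(ts, round(tracker.total(), 3))
```

The total grows to about 0.2 kWh during the first day and drops back to 0.0 when the first reading of the next day arrives, since that reading becomes the new baseline.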




## Requirements

Solve each question using Spark Structured Streaming.

## Other Grading Criteria

+ Grading will also take into account the overall clarity of the code and of the report (the notebook).




### Deadline

December 6.

Penalty of 0.25 grade points per day late.

Penalty accumulates until the grade of the assignment reaches 8.0.

---
### Colab Setup


In [None]:
#@title Install PySpark
!pip install pyspark --quiet

In [1]:
#@title Download Archived February Energy Readings
!wget -q -O /tmp/readings.csv https://raw.githubusercontent.com/smduarte/spbd-2425/refs/heads/main/docs/labs/projs/energy-readings.csv
!grep "2024-02" /tmp/readings.csv > february-energy-readings.csv
!head -2 february-energy-readings.csv


2024-02-01 00:00:00;D;2615.0
2024-02-01 00:00:18;C;1098.8
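
The two lines above show the archive layout: one reading per line, with `;`-separated timestamp, sensor id, and cumulative energy, and no header row. A minimal sketch of parsing that layout with the Python standard library; the function name `parse_readings` is illustrative:

```python
import csv
import io
from datetime import datetime

def parse_readings(text):
    """Yield (timestamp, sensor_id, energy_kwh) tuples from ';'-separated lines."""
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        ts, sensor_id, energy = row
        yield datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), sensor_id, float(energy)

lines = "2024-02-01 00:00:00;D;2615.0\n2024-02-01 00:00:18;C;1098.8\n"
for reading in parse_readings(lines):
    print(reading)
```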


In [2]:
#@title Start the Structured Source
!wget -q -O - https://github.com/smduarte/spbd-2425/raw/main/scripts/json_energy_sender.tgz  | tar xfz - 2> /dev/null

!nohup python json_energy_sender/server.py --filename february-energy-readings.csv --speedup 60 > /dev/null 2> /dev/null &




energy-readings.csv  server.py


In [3]:
!head -n 30 february-energy-readings.csv

2024-02-01 00:00:00;D;2615.0
2024-02-01 00:00:18;C;1098.8
2024-02-01 00:00:25;A;650.5
2024-02-01 00:00:33;J;966.7
2024-02-01 00:00:42;H;2145.4
2024-02-01 00:00:54;E;1874.0
2024-02-01 00:01:52;K;841.2
2024-02-01 00:02:00;E;1874.1
2024-02-01 00:02:20;I;927.2
2024-02-01 00:02:36;K;841.3
2024-02-01 00:03:24;G;833.7
2024-02-01 00:03:32;B;627.5
2024-02-01 00:04:24;D;2615.1
2024-02-01 00:04:40;F;748.0
2024-02-01 00:04:44;H;2145.5
2024-02-01 00:05:26;C;1098.8
2024-02-01 00:05:34;A;650.5
2024-02-01 00:05:42;J;966.7
2024-02-01 00:05:46;F;748.1
2024-02-01 00:06:26;J;966.8
2024-02-01 00:07:04;G;833.8
2024-02-01 00:07:08;E;1874.1
2024-02-01 00:07:28;I;927.2
2024-02-01 00:07:44;K;841.3
2024-02-01 00:08:36;E;1874.2
2024-02-01 00:08:40;B;627.5
2024-02-01 00:08:46;H;2145.6
2024-02-01 00:08:48;D;2615.2
2024-02-01 00:10:34;C;1098.8
2024-02-01 00:10:42;A;650.5





Note: `--speedup 60` means the stream is played 60x faster than real time, so 1 second of real time corresponds to 1 minute of stream data.
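
The conversion between stream time and wall-clock time matters when choosing how often to emit results. A small sketch of the arithmetic; the helper name `wall_clock_seconds` is hypothetical:

```python
def wall_clock_seconds(stream_minutes, speedup=60):
    """Wall-clock seconds needed to replay `stream_minutes` of stream data."""
    return stream_minutes * 60 / speedup

print(wall_clock_seconds(5))        # 5.0    -> a 5-minute update interval is about 5 s of real time
print(wall_clock_seconds(24 * 60))  # 1440.0 -> one full stream day replays in 24 minutes
```

Under this reading, a processing-time trigger of roughly 5 seconds approximates the "updated every 5 minutes" requirement at `--speedup 60`.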


In [4]:
#@title Sample code to process the structured stream...
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("StructuredWebLogExample") \
    .getOrCreate()


# Extract a sample JSON string to infer schema
sample_json = '{"date": "2024-02-01 00:00:00", "sensor": "D", "energy": 2615.0}'
inferred_schema = schema_of_json(sample_json)


# Create a DataFrame representing the stream of input
# lines read from the sender's socket on localhost:7777
try:
  json_lines = spark.readStream.format("socket") \
      .option("host", "localhost") \
      .option("port", 7777) \
      .load()

  # Parse the JSON using the inferred schema
  json_lines = json_lines.withColumn("json_data", from_json(col("value"), inferred_schema)) \
    .select("json_data.*")  # Expand the JSON fields into columns


  query = json_lines \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime='1 seconds') \
    .foreachBatch(lambda df, epoch: df.show(10, False)) \
    .start()

  query.awaitTermination(10)
except Exception as err:
  print(err)
  # Stop the query only if it was created before the error occurred
  if 'query' in locals():
    query.stop()

+----+------+------+
|date|energy|sensor|
+----+------+------+
+----+------+------+

+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:00:00|2615.0|D     |
|2024-02-01 00:00:25|650.5 |A     |
|2024-02-01 00:00:42|2145.4|H     |
|2024-02-01 00:01:52|841.2 |K     |
|2024-02-01 00:02:20|927.2 |I     |
|2024-02-01 00:03:24|833.7 |G     |
|2024-02-01 00:04:24|2615.1|D     |
|2024-02-01 00:00:18|1098.8|C     |
|2024-02-01 00:00:33|966.7 |J     |
|2024-02-01 00:00:54|1874.0|E     |
+-------------------+------+------+
only showing top 10 rows

+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:04:40|748.0 |F     |
|2024-02-01 00:05:26|1098.8|C     |
|2024-02-01 00:05:42|966.7 |J     |
|2024-02-01 00:06:26|966.8 |J     |
|2024-02-01 00:04:44|2145.5|H     |
|2024-02-01 00:05:34|650.5 |A     |
|2024-02-01 00:05:46|748.1 |F     |
+-------------------+----

In [5]:
for stream in spark.streams.active:
    stream.stop()

+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:15:18|748.2 |F     |
|2024-02-01 00:15:50|650.5 |A     |
|2024-02-01 00:15:42|1098.8|C     |
+-------------------+------+------+



In [None]:
#@title Question 1


from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Initialize Spark session
spark = SparkSession \
    .builder \
    .appName("EnergyMeterRunningTotal") \
    .getOrCreate()

# Infer schema from a sample JSON
sample_json = '{"date": "2024-02-01 00:00:00", "sensor": "D", "energy": 2615.0}'
inferred_schema = schema_of_json(sample_json)

try:
    # Create DataFrame representing the stream of input
    json_lines = spark.readStream.format("socket") \
        .option("host", "localhost") \
        .option("port", 7777) \
        .load()

    # Parse the JSON data
    parsed_stream = json_lines.withColumn("json_data", from_json(col("value"), inferred_schema)) \
        .selectExpr("json_data.date as timestamp", "json_data.sensor as sensor_id", "json_data.energy as energy") \
        .withColumn("timestamp", to_timestamp(col("timestamp")))  # Convert timestamp to proper type

    # Filter for the current month (February)
    filtered_stream = parsed_stream.filter(month(col("timestamp")) == 2)

    # Define the update function; the mutable default `state` dict persists across batches
    def update_running_total(batch_df, epoch_id, state={}):
        # Calculate total energy in the current batch
        batch_total = batch_df.agg(sum("energy")).collect()[0][0]
        # Update the state with the current batch total
        state["total_energy"] = state.get("total_energy", 0) + (batch_total if batch_total else 0)
        # Print the running total
        print(f"Running Total Energy: {state['total_energy']}")

    # Process the stream in micro-batches and update the running total.
    # With --speedup 60, 5 seconds of wall-clock time is about 5 minutes of stream time.
    query = filtered_stream.writeStream \
        .foreachBatch(update_running_total) \
        .trigger(processingTime="5 seconds") \
        .start()

    query.awaitTermination(60)
except Exception as err:
    print(f"Error: {err}")
    if 'query' in locals():
        query.stop()




Running Total Energy: 0
Running Total Energy: 25234.9
Running Total Energy: 94453.69999999998


ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 

In [16]:
#@title Question 2

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Initialize Spark session
spark = SparkSession \
    .builder \
    .appName("EnergyComparison") \
    .getOrCreate()

# Infer the schema from a sample JSON
sample_json = '{"date": "2024-02-01 00:00:00", "sensor": "D", "energy": 2615.0}'
inferred_schema = schema_of_json(sample_json)

try:
    # Read the February 2024 CSV file
    february_df = spark.read.csv("february-energy-readings.csv", header=False, sep=";") \
        .withColumnRenamed("_c0", "date") \
        .withColumnRenamed("_c1", "sensor") \
        .withColumnRenamed("_c2", "energy") \
        .withColumn("date", to_timestamp(col("date"))) \
        .withColumn("energy", col("energy").cast("float")) \
        .select("date", "sensor", "energy")

    # Compute the accumulated daily consumption for February 2024
    february_total = february_df.withColumn("day", dayofmonth(col("date"))) \
        .groupBy("day") \
        .agg(sum("energy").alias("total_energy_february"))

    # Create a DataFrame representing the input stream
    json_lines = spark.readStream.format("socket") \
        .option("host", "localhost") \
        .option("port", 7777) \
        .load()

    # Process the stream data
    current_stream = json_lines.withColumn("json_data", from_json(col("value"), inferred_schema)) \
        .selectExpr("json_data.date as date", "json_data.sensor as sensor", "json_data.energy as energy") \
        .withColumn("date", to_timestamp(col("date"))) \
        .withColumn("energy", col("energy").cast("float")) \
        .filter(to_date(col("date")) == current_date())  # Keep only the current day

    # Compute the running total energy accumulated over the current day;
    # the mutable default `state` dict persists across batches
    def update_running_total(batch_df, epoch_id, state={}):
        # Total accumulated in the current batch
        batch_total = batch_df.agg(sum("energy")).collect()[0][0]
        # Update the state with the accumulated total
        state["total_energy_current"] = state.get("total_energy_current", 0) + (batch_total if batch_total else 0)
        print(f"Running Total Energy (Today): {state['total_energy_current']}")
        # Combine with the February data for the same day
        february_day_total = february_total.filter(col("day") == dayofmonth(current_date())).collect()
        if february_day_total:
            february_energy = february_day_total[0]["total_energy_february"]
            percentage_difference = ((state["total_energy_current"] - february_energy) / february_energy) * 100
            print(f"Comparison with February: {state['total_energy_current']} vs {february_energy} ({percentage_difference:.2f}%)")
        else:
            print("No data for February for the current day.")

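    # Process the stream in micro-batches and compute the running total.
    # With --speedup 60, 5 seconds of wall-clock time is about 5 minutes of stream time.
    query = current_stream.writeStream \
        .foreachBatch(update_running_total) \
        .trigger(processingTime="5 seconds") \
        .start()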

    query.awaitTermination(60)
except Exception as err:
    print(f"Error: {err}")
    if 'query' in locals():
        query.stop()


Running Total Energy (Today): 0
No data for February for the current day.
Running Total Energy (Today): 0
No data for February for the current day.
Running Total Energy (Today): 0
No data for February for the current day.


ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 

In [None]:
#@title Question 3

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Initialize Spark session
spark = SparkSession \
    .builder \
    .appName("EnergyMeterRunningTotal") \
    .getOrCreate()

# Infer schema from a sample JSON
sample_json = '{"date": "2024-02-01 00:00:00", "sensor": "D", "energy": 2615.0}'
inferred_schema = schema_of_json(sample_json)

try:
    # Create DataFrame representing the stream of input
    json_lines = spark.readStream.format("socket") \
        .option("host", "localhost") \
        .option("port", 7777) \
        .load()

    # Parse the JSON data
    parsed_stream = json_lines.withColumn("json_data", from_json(col("value"), inferred_schema)) \
        .selectExpr("json_data.date as timestamp", "json_data.sensor as sensor_id", "json_data.energy as energy") \
        .withColumn("timestamp", to_timestamp(col("timestamp")))  # Convert timestamp to proper type

    # Filter for the current month and current day
    current_month = month(current_timestamp())
    current_day = dayofmonth(current_timestamp())
    filtered_stream = parsed_stream.filter((month(col("timestamp")) == current_month) & (dayofmonth(col("timestamp")) == current_day))

    # Define the update function; the mutable default `state` dict persists across batches
    def update_running_total(batch_df, epoch_id, state={}):
        if not batch_df.rdd.isEmpty():
            # Calculate total energy for all sensors in the current batch
            batch_total_all_sensors = batch_df.agg(sum("energy")).collect()[0][0]
            state["total_energy_all_sensors"] = state.get("total_energy_all_sensors", 0) + (batch_total_all_sensors if batch_total_all_sensors else 0)

            # Calculate total energy per sensor in the current batch
            sensor_totals = batch_df.groupBy("sensor_id").agg(sum("energy").alias("sensor_energy")).collect()
            for row in sensor_totals:
                sensor_id = row["sensor_id"]
                sensor_energy = row["sensor_energy"]
                state[sensor_id] = state.get(sensor_id, 0) + (sensor_energy if sensor_energy else 0)

            # Print the running total for all sensors
            print(f"Running Total Energy (All Sensors): {state['total_energy_all_sensors']}")

            # Print the running total for each sensor as a percentage of the total for all sensors
            results = []
            for sensor_id in state:
                if sensor_id != "total_energy_all_sensors":
                    total_energy_sensor = state[sensor_id]
                    total_energy_all = state["total_energy_all_sensors"]
                    percentage = (total_energy_sensor / total_energy_all) * 100 if total_energy_all > 0 else 0
                    results.append((sensor_id, percentage, total_energy_sensor))

            # Sort by percentage (descending), breaking ties by sensor id (ascending)
            sorted_results = sorted(results, key=lambda x: (-x[1], x[0]))
            for sensor_id, percentage, total_energy_sensor in sorted_results:
                print(f"Sensor: {sensor_id}, Running Total Energy: {total_energy_sensor}, Percentage of Total: {percentage:.2f}%")

    # Process the stream in micro-batches and update the running total.
    # With --speedup 60, 5 seconds of wall-clock time is about 5 minutes of stream time.
    query = filtered_stream.writeStream \
        .foreachBatch(update_running_total) \
        .trigger(processingTime="5 seconds") \
        .start()

    query.awaitTermination(120)
except Exception as err:
    print(f"Error: {err}")
    if 'query' in locals():
        query.stop()


+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:42:24|627.6 |B     |
+-------------------+------+------+

+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:43:00|833.8 |G     |
|2024-02-01 00:43:24|927.2 |I     |
+-------------------+------+------+

+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:44:00|2616.1|D     |
|2024-02-01 00:44:24|841.9 |K     |
|2024-02-01 00:44:10|1874.7|E     |
+-------------------+------+------+

+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:44:34|967.3 |J     |
|2024-02-01 00:45:16|1874.8|E     |
|2024-02-01 00:45:00|748.5 |F     |
|2024-02-01 00:45:26|2146.5|H     |
+-------------------+------+------+

+-------------------+------+------+
|date               |ene

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


+-------------------+------+------+
|date               |energy|sensor|
+-------------------+------+------+
|2024-02-01 00:55:02|842.2 |K     |
|2024-02-01 00:55:12|967.4 |J     |
+-------------------+------+------+



KeyboardInterrupt: 