Memory sink (for debugging) - The output is stored as an in-memory table that can be queried with Spark SQL. Both Append and Complete output modes are supported. Use this sink only for debugging on low data volumes, because the entire output is collected and stored in the driver’s memory.

### TODO Recording


- Stream data from a file source (`input/` directory)
- Check that the `input/` directory exists in the current working directory
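
The directory check above can be scripted before the stream is defined. This is a minimal sketch using only the standard library; the helper name `ensure_input_dir` is hypothetical, and creating the directory when it is missing (rather than only checking for it) is an assumption made for convenience.

```python
from pathlib import Path


def ensure_input_dir(path="input"):
    """Return the absolute path of the input directory, creating it if missing."""
    input_dir = Path(path)
    # exist_ok=True makes this a no-op when the directory is already there.
    input_dir.mkdir(parents=True, exist_ok=True)
    return input_dir.resolve()


input_dir = ensure_input_dir()
print(f"Streaming source directory: {input_dir}")
```

Printing the resolved path makes it easy to confirm that the notebook's working directory is the one you expect.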

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, IntegerType, StringType,
                               DoubleType, LongType)
from pyspark.sql.functions import current_timestamp, col

spark = SparkSession.builder.appName("MemorySinkDemo").getOrCreate()

print("SparkSession created successfully!")

25/02/12 19:42:44 WARN Utils: Your hostname, Jananis-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.68.52 instead (on interface en0)
25/02/12 19:42:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/12 19:42:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


SparkSession created successfully!


In [2]:
schema = StructType([
    StructField("Rank", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Manufacturer", StringType(), True),
    StructField("Country", StringType(), True),
    StructField("Year", IntegerType(), True),
    StructField("Segment", StringType(), True),
    StructField("Total_Cores", LongType(), True),
    StructField("Processor_Speed", IntegerType(), True),
    StructField("CoProcessor_Cores", StringType(), True), 
    StructField("Rmax", DoubleType(), True),
    StructField("Rpeak", DoubleType(), True),
    StructField("Power", DoubleType(), True),
    StructField("Power_Efficiency", DoubleType(), True),
    StructField("Architecture", StringType(), True),
    StructField("Processor_Technology", StringType(), True),
    StructField("Operating_System", StringType(), True),
    StructField("OS_Family", StringType(), True),
])

streaming_df = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("input/")

print("Streaming DataFrame created. Monitoring 'input/' directory for new files...")

Streaming DataFrame created. Monitoring 'input/' directory for new files...


Apply Transformations to the Streaming Data

In [3]:
transformed_df = streaming_df \
    .select("Rank", "Name", "Country", "Processor_Speed", "Rmax") \
    .filter(col("Country") == "Japan") \
    .withColumn("Processing_Time", current_timestamp())

print("Transformations applied to streaming DataFrame.")

Transformations applied to streaming DataFrame.


Write the Transformed Data to a Memory Sink and Query the Data

Adaptive Query Execution (AQE), controlled by `spark.sql.adaptive.enabled`, is an optimization feature for batch processing.
AQE is not supported for streaming DataFrames, so Spark disables it automatically when the query starts and logs the warning shown under the next cell.

In [5]:
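# queryName becomes the name of the in-memory table queried later with spark.sql().
# Note: in append mode the memory sink cannot recover from an existing checkpoint,
# so delete checkpoint/ before rerunning this cell.
query = transformed_df \
    .writeStream \
    .format("memory") \
    .queryName("supercomputer_memory_sink") \
    .option("checkpointLocation", "checkpoint/") \
    .outputMode("append") \
    .start()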

25/02/12 19:43:02 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
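
Once the query has started, it can help to confirm that it is running before adding files. The sketch below reads a few standard `StreamingQuery` attributes (`name`, `isActive`, `status`, `lastProgress`); the helper name `summarize_query` and the choice of fields are illustrative assumptions, not part of the original notebook.

```python
def summarize_query(query):
    """Collect a few health indicators from a StreamingQuery-like object."""
    # lastProgress is None until the first micro-batch has completed.
    progress = query.lastProgress
    return {
        "name": query.name,
        "active": query.isActive,
        "status": query.status["message"],
        "last_batch_id": progress["batchId"] if progress else None,
        "last_input_rows": progress["numInputRows"] if progress else None,
    }
```

In the notebook, calling `summarize_query(query)` after each file is added shows whether a new micro-batch picked it up. For debugging, `query.processAllAvailable()` blocks until all data currently available in the source has been processed.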
                                                                                

### TODO Recording

- Add the JSON files to the `input/` folder
- After adding each file, return to the notebook and run the next cell
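
Files can also be dropped into `input/` from code instead of by hand. The sketch below writes records in JSON Lines form (one object per line, which is what Spark's `json` source reads by default) into a staging directory and then moves the finished file into place, so the stream never sees a half-written file. The helper name `drop_json_file`, the `input_staging` directory, and the default file name are hypothetical.

```python
import json
import os
from pathlib import Path


def drop_json_file(records, input_dir="input", name="batch_001.json"):
    """Write records as JSON Lines, then move the finished file into input_dir."""
    input_path = Path(input_dir)
    # Stage next to input_dir so both are on the same filesystem and the
    # final os.replace is a simple rename.
    staging_path = input_path.with_name(input_path.name + "_staging")
    input_path.mkdir(parents=True, exist_ok=True)
    staging_path.mkdir(parents=True, exist_ok=True)

    staged_file = staging_path / name
    with staged_file.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    final_file = input_path / name
    os.replace(staged_file, final_file)
    return final_file
```

For example, `drop_json_file([{"Rank": 1, "Name": "Supercomputer Fugaku", "Country": "Japan", "Processor_Speed": 2200, "Rmax": 442010.0}])` writes a single row; because every field in the schema is nullable, the fields left out are read as null.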

In [9]:
result_df = spark.sql("SELECT * FROM supercomputer_memory_sink")

result_df.show(n=50, truncate=False)

+----+----------------------------------+-------+---------------+--------+-----------------------+
|Rank|Name                              |Country|Processor_Speed|Rmax    |Processing_Time        |
+----+----------------------------------+-------+---------------+--------+-----------------------+
|1   |Supercomputer Fugaku              |Japan  |2200           |442010.0|2025-02-12 19:43:02.247|
|16  |ABCI 2.0                          |Japan  |2400           |22208.72|2025-02-12 19:43:02.247|
|17  |Wisteria/BDEC-01 (Odyssey)        |Japan  |2200           |22121.0 |2025-02-12 19:43:02.247|
|31  |TOKI-SORA                         |Japan  |2200           |16592.0 |2025-02-12 19:43:02.247|
|39  |Oakforest-PACS                    |Japan  |1400           |13554.6 |2025-02-12 19:43:02.247|
|48  |Earth Simulator -SX-Aurora TSUBASA|Japan  |1600           |9990.7  |2025-02-12 19:43:02.247|
|59  |TSUBAME3.0                        |Japan  |2400           |8125.0  |2025-02-12 19:43:02.247|
|63  |Plas

In [10]:
query.stop()