<a href="https://colab.research.google.com/github/carsofferrei/04_data_processing/blob/main/spark_streaming/challenges/final_challenges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This script responds to the challenge for assessment in module #5-real-time-data:** Producer, read and write Stream in the bronze layer and write Stream in the silver layer by applying some transformations. Also, make some reports in the silver data layer.
At the end, it contains the answer to the theoretical question.

In [None]:
# Guarantee that don't have any other data on content folder
!rm -rf /content/lake

# Setting up PySpark

In [21]:
%pip install pyspark



# Context
Message events are coming from platform message broker (kafka, pubsub, kinesis...).
You need to process the data according to the requirements.

Message schema:
- timestamp
- value
- event_type
- message_id
- country_id
- user_id



# Challenge 1

Step 1
- Change exising producer
	- Change parquet location to "/content/lake/bronze/messages/data"
	- Add checkpoint (/content/lake/bronze/messages/checkpoint)
	- Delete /content/lake/bronze/messages and reprocess data
	- For reprocessing, run the streaming for at least 1 minute, then stop it

Step 2
- Implement new stream job to read from messages in bronze layer and split result in two locations
	- "messages_corrupted"
		- logic: event_status is null, empty or equal to "NONE"
    - extra logic: add country name by joining message with countries dataset
		- partition by "date" -extract it from timestamp
		- location: /content/lake/silver/messages_corrupted/data

	- "messages"
		- logic: not corrupted data
		- extra logic: add country name by joining message with countries dataset
		- partition by "date" -extract it from timestamp
		- location: /content/lake/silver/messages/data

	- technical requirements
		- add checkpint (choose location)
		- use StructSchema
		- Set trigger interval to 5 seconds
		- run streaming for at least 20 seconds, then stop it

	- alternatives
		- implementing single streaming job with foreach/- foreachBatch logic to write into two locations
		- implementing two streaming jobs, one for messages and another for messages_corrupted
		- (paying attention on the paths and checkpoints)


  - Check results:
    - results from messages in bronze layer should match with the sum of messages+messages_corrupted in the silver layer

In [22]:
%pip install faker



# Producer

In [23]:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test streaming').getOrCreate()
sc = spark.sparkContext

fake = Faker()
messages = [fake.uuid4() for _ in range(50)]

def enrich_data(df, messages=messages):
  fake = Faker()
  new_columns = {
      'event_type': F.lit(fake.random_element(elements=('OPEN', 'RECEIVED', 'SENT', 'CREATED', 'CLICKED', '', 'NONE'))),
      'message_id': F.lit(fake.random_element(elements=messages)),
      'channel': F.lit(fake.random_element(elements=('CHAT', 'EMAIL', 'SMS', 'PUSH', 'OTHER'))),
      'country_id': F.lit(fake.random_int(min=2000, max=2015)),
      'user_id': F.lit(fake.random_int(min=1000, max=1050)),
  }
  df = df.withColumns(new_columns)
  return df

def insert_messages(df: DataFrame, batch_id):
  enrich = enrich_data(df)
  enrich.write.mode("append").format("parquet").save("/content/lake/bronze/messages/data")

# read stream
df_stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# write stream
query = (df_stream
         .writeStream
         .outputMode('append')
         .option('checkpointLocation', "/content/lake/bronze/checkpoint")
         .trigger(processingTime='1 seconds')
         .foreachBatch(insert_messages)
         .start()
)

query.awaitTermination(60)


False

In [24]:
query.isActive

True

In [25]:
query.stop()

In [26]:
df = spark.read.format("parquet").load("/content/lake/bronze/messages/data")
df.show()
df.count()

+--------------------+-----+----------+--------------------+-------+----------+-------+
|           timestamp|value|event_type|          message_id|channel|country_id|user_id|
+--------------------+-----+----------+--------------------+-------+----------+-------+
|2024-12-15 15:46:...|   35|  RECEIVED|bdf4e074-07c4-4de...|  OTHER|      2014|   1048|
|2024-12-15 15:46:...|   25|  RECEIVED|410d41f5-fcfa-4d0...|   CHAT|      2009|   1049|
|2024-12-15 15:46:...|    9|   CREATED|e49f7aea-7f4c-456...|  EMAIL|      2015|   1001|
|2024-12-15 15:46:...|    5|  RECEIVED|d02d83cb-64a1-4f5...|   CHAT|      2001|   1004|
|2024-12-15 15:46:...|    2|   CLICKED|7bfad84a-bad3-427...|  EMAIL|      2003|   1024|
|2024-12-15 15:46:...|    6|  RECEIVED|89c3e007-2d41-41c...|   CHAT|      2009|   1035|
|2024-12-15 15:47:...|   64|   CLICKED|cfcd006b-4690-4ef...|  OTHER|      2014|   1013|
|2024-12-15 15:46:...|   15|   CREATED|10105805-d6a9-4b6...|  OTHER|      2009|   1001|
|2024-12-15 15:47:...|   56|   C

65

# Additional datasets

In [27]:
countries = [
    {"country_id": 2000, "country": "Brazil"},
    {"country_id": 2001, "country": "Portugal"},
    {"country_id": 2002, "country": "Spain"},
    {"country_id": 2003, "country": "Germany"},
    {"country_id": 2004, "country": "France"},
    {"country_id": 2005, "country": "Italy"},
    {"country_id": 2006, "country": "United Kingdom"},
    {"country_id": 2007, "country": "United States"},
    {"country_id": 2008, "country": "Canada"},
    {"country_id": 2009, "country": "Australia"},
    {"country_id": 2010, "country": "Japan"},
    {"country_id": 2011, "country": "China"},
    {"country_id": 2012, "country": "India"},
    {"country_id": 2013, "country": "South Korea"},
    {"country_id": 2014, "country": "Russia"},
    {"country_id": 2015, "country": "Argentina"}
]

countries = spark.createDataFrame(countries)

In [28]:
df_stream.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- value: long (nullable = true)



In [29]:
df.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- value: long (nullable = true)
 |-- event_type: string (nullable = true)
 |-- message_id: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- country_id: integer (nullable = true)
 |-- user_id: integer (nullable = true)



# Streaming Messages x Messages Corrupted

In [30]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

def clean_message_events(df: DataFrame, batch_id):
    if df.isEmpty():
        print("No data in the current batch.")
        return

    # Join with countries
    df = df.join(countries, "country_id", "left")

    # Filtering corrupted messages
    corrupted_events = df.filter((col("event_type").isNull() |
                              (col("event_type") == "") |
                              (col("event_type") == "NONE")
                              ))

    # Saving corrupted messages
    corrupted_events.write.mode("append").format("parquet").partitionBy("date").save("/content/lake/silver/messages_corrupted/data")

    # Creating dataframe without corrupted messages
    clean_events = df.filter(~(col("event_type").isNull() |
                          (col("event_type") == "") |
                          (col("event_type") == "NONE"))
                        )
    # Saving clean messages
    clean_events.write.mode("append").format("parquet").partitionBy("date").save("/content/lake/silver/messages/data")


print(f'Define streaming schema...')
messages_schema = StructType([
          StructField('timestamp', TimestampType(), True),
          StructField('value', LongType(), True),
          StructField('event_type', StringType(), True),
          StructField('message_id', StringType(), True),
          StructField('channel', StringType(), True),
          StructField('country_id', IntegerType(), True),
          StructField('user_id', IntegerType(), True)
          ])

print(f'Read the streaming data...')
messages_event_data = spark.readStream.format("parquet").schema(messages_schema).load("/content/lake/bronze/messages/data/*")

print(f'Create a new date column that will be the split column in writeStreaming...')
messages_event_data = messages_event_data.withColumn("date", col('timestamp').cast("date"))

print(f'Write Streaming...')
stream_silver_query = (messages_event_data
                          .writeStream
                          .outputMode('append')
                          .option('checkpointLocation', '/content/lake/silver/checkpoint')
                          .trigger(processingTime='5 seconds')
                          .foreachBatch(clean_message_events)
                          .start()
                          )

stream_silver_query.awaitTermination(20)

Define streaming schema...
Read the streaming data...
Create a new date column that will be the split column in writeStreaming...
Write Streaming...


False

In [31]:
print(stream_silver_query.isActive)
print(stream_silver_query.stop())

True
None


## Checking data

In [33]:
def checking_silver_data():
    bronze_count = spark.read.format("parquet").load("/content/lake/bronze/messages/data/*").count()
    clean_count = spark.read.format("parquet").load("/content/lake/silver/messages/data/*").count()
    corrupted_count = spark.read.format("parquet").load("/content/lake/silver/messages_corrupted/data/*").count()

    assert bronze_count == clean_count + corrupted_count, "Dataframes doesn't matches. Did you run the code more than once? Be careful with append mode"
    print(f'Validation passed: Bronze [{bronze_count}] = Clean [{clean_count}] + Corrupted [{corrupted_count}]')

checking_silver_data()

Validation passed: Bronze [65] = Clean [50] + Corrupted [15]


# Challenge 2

- Run business report
- But first, there is a bug in the system which is causing some duplicated messages, we need to exclude these lines from the report

- removing duplicates logic:
  - Identify possible duplicates on message_id, event_type and channel
  - in case of duplicates, consider only the first message (occurrence by timestamp)
  - Ex:
    In table below, the correct message to consider is the second line

```
    message_id | channel | event_type | timestamp
    123        | CHAT    | CREATED    | 10:10:01
    123        | CHAT    | CREATED    | 07:56:45 (first occurrence)
    123        | CHAT    | CREATED    | 08:13:33
```

- After cleaning the data we're able to create the busines report

In [34]:
# dedup data
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.read.format("parquet").load("/content/lake/silver/messages")
dedup = df.withColumn("row_number", F.row_number().over(Window.partitionBy("message_id", "event_type", "channel").orderBy("timestamp"))).filter("row_number = 1").drop("row_number")

In [35]:
silver_rows = df.count()
dedup_rows = dedup.count()

print(f'Clean messages data had {silver_rows - dedup_rows} dupplicated records.')

Clean messages data had 2 dupplicated records.


In [36]:
dedup.limit(3).show()

+----------+--------------------+-----+----------+--------------------+-------+-------+---------+----------+
|country_id|           timestamp|value|event_type|          message_id|channel|user_id|  country|      date|
+----------+--------------------+-----+----------+--------------------+-------+-------+---------+----------+
|      2009|2024-12-15 15:46:...|   15|   CREATED|10105805-d6a9-4b6...|  OTHER|   1001|Australia|2024-12-15|
|      2001|2024-12-15 15:46:...|   26|      OPEN|10105805-d6a9-4b6...|   CHAT|   1049| Portugal|2024-12-15|
|      2008|2024-12-15 15:46:...|   21|      OPEN|10105805-d6a9-4b6...|  OTHER|   1007|   Canada|2024-12-15|
+----------+--------------------+-----+----------+--------------------+-------+-------+---------+----------+



### Report 1
  - Aggregate data by date, event_type and channel
  - Count number of messages
  - pivot event_type from rows into columns
  - schema expected:
  
```
|      date|channel|CLICKED|CREATED|OPEN|RECEIVED|SENT|
+----------+-------+-------+-------+----+--------+----+
|2024-12-03|    SMS|      4|      4|   1|       1|   5|
|2024-12-03|   CHAT|      3|      7|   5|       8|   4|
|2024-12-03|   PUSH|   NULL|      3|   4|       3|   4|
```

In [37]:
print(f'Obtaining the pivot and expected schema')
df.groupBy("date", "channel").pivot("event_type").agg(count("*").alias("event_count")).fillna(0).show()

Obtaining the pivot and expected schema
+----------+-------+-------+-------+----+--------+----+
|      date|channel|CLICKED|CREATED|OPEN|RECEIVED|SENT|
+----------+-------+-------+-------+----+--------+----+
|2024-12-15|  OTHER|      4|      2|   3|       1|   3|
|2024-12-15|   PUSH|      0|      3|   2|       1|   2|
|2024-12-15|  EMAIL|      2|      1|   3|       0|   3|
|2024-12-15|    SMS|      3|      2|   2|       2|   0|
|2024-12-15|   CHAT|      1|      1|   5|       3|   1|
+----------+-------+-------+-------+----+--------+----+



## Report 2

- Identify the most active users by channel (sorted by number of iterations)
- schema expected:

```
+-------+----------+----+-----+-----+----+---+
|user_id|iterations|CHAT|EMAIL|OTHER|PUSH|SMS|
+-------+----------+----+-----+-----+----+---+
|   1022|         5|   2|    0|    1|   0|  2|
|   1004|         4|   1|    1|    1|   1|  0|
|   1013|         4|   0|    0|    2|   1|  1|
|   1020|         4|   2|    0|    1|   1|  0|
```


In [38]:
agg_int_total = df.groupBy("user_id").agg(count("*").alias("iterations")).fillna(0)
agg_channel_user = df.groupBy("user_id").pivot("channel").agg(count("*").alias("iterations")).fillna(0)
agg_int_total.join(agg_channel_user, "user_id", "left").sort(desc("iterations")).show()

+-------+----------+----+-----+-----+----+---+
|user_id|iterations|CHAT|EMAIL|OTHER|PUSH|SMS|
+-------+----------+----+-----+-----+----+---+
|   1007|         5|   1|    0|    2|   2|  0|
|   1035|         3|   2|    0|    0|   0|  1|
|   1028|         2|   0|    0|    0|   1|  1|
|   1002|         2|   1|    0|    0|   0|  1|
|   1050|         2|   1|    1|    0|   0|  0|
|   1048|         2|   1|    0|    1|   0|  0|
|   1036|         2|   0|    1|    1|   0|  0|
|   1001|         2|   0|    1|    1|   0|  0|
|   1049|         2|   2|    0|    0|   0|  0|
|   1040|         2|   0|    0|    1|   1|  0|
|   1033|         2|   0|    1|    1|   0|  0|
|   1013|         2|   1|    0|    1|   0|  0|
|   1037|         2|   0|    0|    2|   0|  0|
|   1005|         1|   0|    0|    0|   0|  1|
|   1031|         1|   0|    1|    0|   0|  0|
|   1008|         1|   0|    0|    0|   1|  0|
|   1047|         1|   0|    0|    0|   1|  0|
|   1021|         1|   0|    0|    0|   0|  1|
|   1010|    

# Challenge 3

**Theoretical question:**

*A new usecase requires the message data to be aggregate in near real time. They want to build a dashboard embedded in the platform website to analyze message data in low latency (few minutes). This application will access directly the data aggregated by streaming process.*

- **Q1: What would be your suggestion to achieve that using Spark Structure Streaming? Or would you choose a different data processing tool?**
- **A1:** By using Spark Struture Streaming, the possible solution would be:
  - **1. Read stream data using Kafka format:** as it enables low-latency integration.
  - **2. Add checkpoint:** as it ensures fault tolerance and state recovery (if, at some point, something fails).
  - **3. Transform and aggregate data given the requirment.**
  - **4. Set a trigger rule:** enables near real-time updates.
  - **5. Write the result using memory** - suitable for low-latency processes.

  - Alternatively, Kafka Streams is another tool that is a viable option for this use case, as it supports real-time data reading, aggregation, and includes built-in dashboard and reporting capabilities.


- **Q2: Which storage would you use and why? (database?, data lake?, kafka?)**

- **A2:** All of them have strengths and weaknesses, and the choice depends on the requirements. I would use for this use case:
  - **Kafka:** given the use case, only the real-time data is needed and does not seems to require historical preservation.

  - **Note:** However, **if there is a need to preserve historical data** and the flexibility to create different layers of data because requirements could change over time, a **data lake is the better choice**. It supports long-term storage and provides scalability for future needs.


