# Description

We have datasets corresponding to a **list of health inspections in establishments** (restaurants, supermarkets, etc.), along with their respective health risk. We have another dataset that shows a **description of said risk**.

**The goal is to load these datasets under specific requirements and manipulate them according to the instructions of each exercise.**

All necessary operations are described in the exercises, although additional tasks carried out by the student on their own initiative will be appreciated. The use of the DataFrame API will also be valued.

# Download Datasets

In [0]:
%sh 
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/food_inspections_lite.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/risk_description.csv'

In [0]:
# Copy the local file food_inspections_lite.csv from the driver node to DBFS (Databricks File System) under /dataset/
dbutils.fs.cp('file:/databricks/driver/food_inspections_lite.csv','dbfs:/dataset/food_inspections_lite.csv')

# Copy the local file risk_description.csv from the driver node to DBFS (Databricks File System) under /dataset/
dbutils.fs.cp('file:/databricks/driver/risk_description.csv','dbfs:/dataset/risk_description.csv')

In [0]:
# List all files and directories under the DBFS directory /dataset/
dbutils.fs.ls('/dataset/')

We use `head` to preview the contents of each dataset:

In [0]:
dbutils.fs.head("dbfs:/dataset/food_inspections_lite.csv")

In [0]:
dbutils.fs.head("dbfs:/dataset/risk_description.csv")

# Exercise 1
---

1. **Create two dataframes, one from the file `food_inspections_lite.csv` and another from `risk_description.csv`.**
2. **Convert these two dataframes to delta tables.**


Create the dataframe from `food_inspections_lite.csv`:

In [0]:
# Load the food_inspections_lite.csv file into a DataFrame
file_path_food = '/dataset/food_inspections_lite.csv'

food_df = spark.read.format("csv") \
  .option("sep", ",") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .load(file_path_food)

Rename the columns and verify that the schema is adequate:

In [0]:
# Rename the "License #" column to "License number".
food_df = food_df.withColumnRenamed("License #", "License number")

# Rename the columns by replacing spaces with underscores.
new_columns = [col_name.replace(" ", "_") for col_name in food_df.columns]

# Rename the columns in the DataFrame
food_df = food_df.toDF(*new_columns)

# Verify new column names
food_df.printSchema()
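
The renaming matters for the Delta conversion later on: as far as I know, Delta tables without column mapping reject column names that contain spaces and a few other characters. The following plain-Python sketch (no Spark required) reproduces the two renaming steps on a handful of header names. `sanitize_columns`, `has_invalid_chars` and `INVALID_CHARS` are hypothetical helpers, and the character set is an assumption, not something taken from this notebook.

```python
# Characters assumed to be rejected in Delta column names (assumption for illustration).
INVALID_CHARS = set(" ,;{}()\n\t=")


def sanitize_columns(columns):
    """Apply the same two steps as the cell above to a list of column names."""
    renamed = ["License number" if name == "License #" else name for name in columns]
    return [name.replace(" ", "_") for name in renamed]


def has_invalid_chars(name):
    """Return True if the name contains any character from INVALID_CHARS."""
    return any(ch in INVALID_CHARS for ch in name)


original = ["Inspection ID", "DBA Name", "License #", "Risk"]
cleaned = sanitize_columns(original)
print(cleaned)
```

Renaming `License #` first avoids ending up with a `#` in the final name, which would be awkward to reference in SQL even if it were accepted.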

In [0]:
# Display the first 3 lines of the DataFrame food_df
food_df.show(3, truncate=False)

We are going to clean up the `Risk` column that will be used later in the exercises. To do this, let's list the values it has and the count of each one:

In [0]:
# Grouping by 'Risk' column and counting the occurrences
risk_counts = food_df.groupBy('Risk').count()

# Showing the results
risk_counts.show()

We see that there are `null` values and `All` values. The correct values are `Risk 1 (High)`, `Risk 2 (Medium)`, and `Risk 3 (Low)`. Therefore, we will make the following modifications:
- We will remove the `null` values since we are unable to trace them to any of the correct values.
- We will change the `All` values to `Risk 1 (High)`, considering that "All" means they have received the highest risk score.

In [0]:
# Drop rows where the 'Risk' column has null values
df_food_drop_risk_nulls = food_df.dropna(subset=['Risk'])

In [0]:
from pyspark.sql.functions import col, when
# Replacing "All" with "Risk 1 (High)"
food_df_clean = df_food_drop_risk_nulls.withColumn('Risk', when(col('Risk') == 'All', 'Risk 1 (High)').otherwise(col('Risk')))

In [0]:
# Group by 'Risk' and count the occurrences
risk_counts = food_df_clean.groupBy('Risk').count()

# Show the results
risk_counts.show()

Create the dataframe from `risk_description.csv`:

In [0]:
# Load the risk_description.csv file in another DataFrame
file_path_risk = '/dataset/risk_description.csv'
risk_df = spark.read.format("csv") \
  .option("sep", ",") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .load(file_path_risk)

In [0]:
# Show all lines of the DataFrame risk_df
risk_df.display()

Conversion of the DataFrame `food_df_clean` to a Delta Table:

In [0]:
from delta.tables import *
from pyspark.sql.functions import *

# Define the Delta Lake path for the 'food' table and remove the directory at FOOD_DELTA_PATH recursively, if it exists
FOOD_DELTA_PATH = "/mnt/delta/food"
dbutils.fs.rm(FOOD_DELTA_PATH, recurse=True)

# Write the DataFrame to the specified Delta path
food_df_clean.write.format("delta").save(FOOD_DELTA_PATH)

# Drop the table if it already exists
spark.sql("DROP TABLE IF EXISTS food")

# Create a Delta table named 'food' using the data saved at FOOD_DELTA_PATH
spark.sql("CREATE TABLE food USING DELTA LOCATION \'" + FOOD_DELTA_PATH + "\'")

In [0]:
%sql
SELECT *
FROM 
  food
LIMIT 3

In [0]:
%sql
SELECT COUNT(*) AS total_rows FROM food

Conversion of the DataFrame `risk_df` to a Delta Table:

In [0]:
# Define the Delta Lake path for the 'risk' table and remove the directory at RISK_DELTA_PATH recursively, if it exists
RISK_DELTA_PATH = "/mnt/delta/risk"
dbutils.fs.rm(RISK_DELTA_PATH, recurse=True)

# Write the DataFrame to the specified Delta path
risk_df.write.format("delta").save(RISK_DELTA_PATH)

# Drop the table if it already exists
spark.sql("DROP TABLE IF EXISTS risk")

# Create a Delta table named 'risk' using the data saved at RISK_DELTA_PATH
spark.sql("CREATE TABLE risk USING DELTA LOCATION \'" + RISK_DELTA_PATH + "\'")

In [0]:
%sql
SELECT *
FROM 
  risk

# Exercise 2
**Obtain the number of distinct inspections with high risk (`Risk 1 (High)`).**

---



In [0]:
%sql
SELECT COUNT(DISTINCT Inspection_ID) AS num_inspections_high_risk
FROM food
WHERE Risk = 'Risk 1 (High)'


# Exercise 3
**From the dataframes loaded above, obtain a table with the following columns:<br>**
1. `DBA Name`
2. `Facility Type`
3. `Risk`
4. `Risk description`

---
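I will use PySpark's `join` function to combine both DataFrames. Since `risk_df` identifies each risk by a numeric `risk_id`, I first derive a matching `risk_id` column from the `Risk` label in `food_df_clean` and then join on that column.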

In [0]:
# Create a new column in food_df_clean to map the values of Risk to risk_id
food_df_mapped = food_df_clean.withColumn("risk_id", 
                                    when(food_df_clean["Risk"].contains("High"), 1)
                                    .when(food_df_clean["Risk"].contains("Medium"), 2)
                                    .when(food_df_clean["Risk"].contains("Low"), 3)
                                    .otherwise(None))

# Perform the join between food_df_mapped and risk_df
combined_df = food_df_mapped.join(risk_df, food_df_mapped["risk_id"] == risk_df["risk_id"], "inner") \
                            .select(food_df_mapped["DBA_Name"],
                                    food_df_mapped["Facility_Type"],
                                    food_df_mapped["Risk"],
                                    risk_df["description"].alias("Risk_description"))

# Show the result
combined_df.show(10, truncate=False)
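
The chained `when` calls above rely on the words "High", "Medium" and "Low" appearing in the label. An alternative, sketched here in plain Python, is to read the numeric identifier directly from the label; in PySpark the same idea could probably be expressed with `regexp_extract`. The function name `risk_id_from_label`, the sample establishments and the descriptions in `risk_rows` are invented placeholders, not the contents of `risk_description.csv`.

```python
import re


def risk_id_from_label(label):
    """Extract the number from labels such as 'Risk 1 (High)'; return None if absent."""
    match = re.match(r"Risk (\d+)", label or "")
    return int(match.group(1)) if match else None


# Placeholder lookup table keyed by risk_id (descriptions are made up for illustration).
risk_rows = {1: "placeholder: high", 2: "placeholder: medium", 3: "placeholder: low"}

inspections = [
    {"DBA_Name": "A", "Risk": "Risk 1 (High)"},
    {"DBA_Name": "B", "Risk": "Risk 3 (Low)"},
    {"DBA_Name": "C", "Risk": "Unknown"},
]

# Inner join: rows whose risk_id has no match in risk_rows are dropped.
joined = [
    {**row, "Risk_description": risk_rows[risk_id_from_label(row["Risk"])]}
    for row in inspections
    if risk_id_from_label(row["Risk"]) in risk_rows
]
print(joined)
```

As with the inner join in the cell above, a row whose label cannot be mapped to a `risk_id` silently disappears from the result, which is worth keeping in mind when comparing row counts.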

# Exercise 4
**Access the Spark UI to view the execution plan of the previous exercise (exercise 3). Describe each of the pieces/boxes that make up the execution plan (a brief one-line description per box will be sufficient).**

---
In Apache Spark, a "**job**" refers to a unit of work that is sent to the cluster to be executed. Each job can consist of one or more "**stages**", where a stage is a processing phase that includes a set of tasks that can be executed in parallel on the cluster nodes.

The execution of exercise 3 shows that Spark ran two **jobs**. Transformations such as `withColumn` and `join` are lazy, so both jobs are triggered by the `show` action:
- **Job 41 (Stages: 1/1)**: Reads `food_inspections_lite.csv`, evaluates the `withColumn`/`when` expressions that define `food_df_mapped`, and shuffles the result in preparation for the join.

- **Job 42 (Stages: 1/1, 1 skipped)**: Corresponds to the join operation (`join`) between `food_df_mapped` and `risk_df`, followed by the column selection. The "1 skipped" indicates that Spark did not recompute a stage because its output (the shuffle data written by the previous job) was already available.


## Job 41 - withColumn

![Details for Job 41](https://github.com/Ubikitina/Spark-Essentials/blob/main/Notebooks/img/04_01.png?raw=true)

![Details for Stage 50](https://github.com/Ubikitina/Spark-Essentials/blob/main/Notebooks/img/04_02.png?raw=true)


- **Scan csv (CSV file scan):** Spark starts by scanning the input CSV file (`food_inspections_lite.csv`) to read the data into an RDD.
  - `FileScanRDD`: represents the RDD created from reading the CSV file.
  - `MapPartitionsRDD`: represents the RDD obtained by applying a function to each partition of the data read from the CSV file.

- **WholeStageCodegen (Whole-stage code generation):** a physical query optimization in Spark SQL that fuses multiple physical operators into a single generated Java function. Simply put, in this step the DataFrame operations (here, the `withColumn` expressions) are compiled into Java code that runs over the underlying RDDs.

- **Exchange (Data exchange or redistribution):** the data is shuffled, that is, redistributed across partitions. The `withColumn` transformation itself does not require a shuffle; the redistribution most likely prepares the rows for the join executed in the next job.




## Job 42 - Join

![Details for Job 42](https://github.com/Ubikitina/Spark-Essentials/blob/main/Notebooks/img/04_03.png?raw=true)

![Details for Stage 52](https://github.com/Ubikitina/Spark-Essentials/blob/main/Notebooks/img/04_04.png?raw=true)



**Stage 51 (skipped):** This stage appears as skipped in the DAG, most likely because its output had already been produced by Job 41. Its DAG is very similar to the one explained earlier.

**Stage 52:**
- **Scan csv:** Reading the `risk_df` file.
  - `FileScanRDD`: Represents the RDD created from reading the CSV file.
  - `MapPartitionsRDD`: Represents the RDD obtained by applying a function to each partition of the data read from the CSV file.
- **ShuffleQueryStage:** Receives the data from Stage 51 and performs a shuffle operation to organize and prepare them for the subsequent Join operation.
- **WholeStageCodegen:** This is the code optimization stage, improving the efficiency of executing the `join`. It includes several `MapPartitionsRDD` stages to organize the data into partitions and a `CartesianRDD` stage corresponding to the Join itself, as the Cartesian transformation generates a Cartesian product of two RDDs.


# Exercise 5
**1. For each establishment (column `DBA Name`) and its result (column `Results`), get the number of inspections it has had.**<br><br>
**2. Get the two establishments (`DBA Name`) that have had the most inspections for each result.**<br><br>
**3. Save the results from point 2 in a new Delta table called `inspections_results`.**

---

**1. For each establishment (column `DBA Name`) and its result (column `Results`), get the number of inspections it has had.**<br><br>
This exercise can be done using either a DataFrame or a Delta Table. Considerations to keep in mind when making the choice:
- If we are working with large volumes of data and need features such as ACID transactions, version management, and Delta Log, then using a Delta Table would be highly recommended. This allows maintaining data integrity and having the capability to perform historical operations and data recovery efficiently.

- On the other hand, if our needs are more oriented towards efficiently manipulating data in memory and we do not require the persistence and advanced management offered by Delta Lake, a DataFrame would be more suitable due to its flexibility and ease of use.

In this case, I will perform the execution using both:

Option 1: Using DataFrame:

In [0]:
# Calculate the number of inspections by `DBA Name` and `Results`.
inspections_count = food_df_clean.groupBy("DBA_Name", "Results") \
                           .agg(count("*").alias("num_inspecciones")) \
                           .orderBy(col("num_inspecciones").desc())

# Show the result
inspections_count.show(truncate=False)

Option 2: Using the Delta Table:

In [0]:
%sql
SELECT `DBA_Name`, `Results`, COUNT(*) AS num_inspecciones
FROM food
GROUP BY `DBA_Name`, `Results`
ORDER BY `num_inspecciones` DESC;

**2. Get the two establishments (`DBA Name`) that have had the most inspections for each result.**<br><br>

In [0]:
from pyspark.sql.window import Window

# Define a window partitioned by 'Results' and sorted by 'num_inspecciones' in descending order
windowSpec = Window.partitionBy("Results").orderBy(col("num_inspecciones").desc())

# Add a ranking column based on the number of inspections per result
# Keep only ranks 1 and 2 for each result (ties share a rank, so more than two rows may remain)
# Sort by 'Results' and 'rank'.
ranked_inspections = inspections_count.withColumn("rank", rank().over(windowSpec))\
      .filter(col("rank") <= 2)\
      .orderBy("Results", "rank")

# Show the final result 
ranked_inspections.show(truncate=False)

In the result above, we see that some results have ties, and therefore, we get three establishments. For example, `No Entry` has `LANS` in the first position with 5 inspections, but in the second position, there is a tie between `FORK` and `LA PENA RESTAURANTE`, as each has 4 inspections with this result.
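
The tie behaviour depends on the ranking function chosen. The plain-Python sketch below mimics the semantics of `rank`, `dense_rank` and `row_number` on the `No Entry` counts quoted above; the fourth entry (`OTHER`, 3 inspections) is a made-up row added only to make the difference between `rank` and `dense_rank` visible, and `rank_values` is a hypothetical helper.

```python
def rank_values(counts, method):
    """Rank (name, count) pairs in descending order using 'rank', 'dense_rank' or 'row_number' semantics."""
    ordered = sorted(counts, key=lambda item: item[1], reverse=True)
    ranks = []
    for position, (name, value) in enumerate(ordered, start=1):
        if method == "row_number":
            current = position  # always unique, ties broken by order
        elif ranks and value == ranks[-1][1]:
            current = ranks[-1][2]  # tie: reuse the previous rank
        elif method == "rank":
            current = position  # leaves gaps after ties
        else:  # dense_rank
            current = ranks[-1][2] + 1 if ranks else 1
        ranks.append((name, value, current))
    return ranks


counts = [("LANS", 5), ("FORK", 4), ("LA PENA RESTAURANTE", 4), ("OTHER", 3)]

# Names that survive a `rank <= 2` filter under each method.
top_two = {
    method: [name for name, _, r in rank_values(counts, method) if r <= 2]
    for method in ("rank", "dense_rank", "row_number")
}
print(top_two)
```

If exactly two rows per result are required, `row_number` would enforce that at the cost of breaking ties arbitrarily; keeping `rank` preserves every tied establishment, which is the choice made in this notebook.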

**3. Save the results from point 2 in a new Delta table called `inspections_results`.**

In [0]:
# Define the Delta Lake path for the 'inspections_results' table and delete the directory in INSPECTION_RESULTS_DELTA_PATH recursively, if it exists.
INSPECTION_RESULTS_DELTA_PATH = "/mnt/delta/inspections_results"
dbutils.fs.rm(INSPECTION_RESULTS_DELTA_PATH, recurse=True)

# Write the DataFrame to the specified Delta path
ranked_inspections.write.format("delta").save(INSPECTION_RESULTS_DELTA_PATH)

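# Drop the table if it already exists, so that the cell can be re-run
spark.sql("DROP TABLE IF EXISTS inspections_results")

# Create a Delta table named "inspections_results" using data stored in INSPECTION_RESULTS_DELTA_PATH
spark.sql("CREATE TABLE inspections_results USING DELTA LOCATION \'" + INSPECTION_RESULTS_DELTA_PATH + "\'")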

In [0]:
%sql
SELECT *
FROM 
  inspections_results

# Exercise 6
1. **Update the delta table of the previous exercise `inspections_results`, specifying `DBA_Name = error`**<br>
2. **Restore the table to its original state**

---



First we look at the history of the `inspections_results` table:

In [0]:
%sql
DESCRIBE HISTORY inspections_results

Now we update the delta table by specifying `DBA_Name = Error`:

In [0]:
%sql
UPDATE inspections_results SET DBA_Name = "Error"

We check that it has done so:

In [0]:
%sql
SELECT *
FROM inspections_results

We verify that this update is reflected in the history:

In [0]:
%sql
DESCRIBE HISTORY inspections_results

We restore the table to its original state:

In [0]:
# Read data from the Delta Lake format in the specified version '0'.
inspections_results_df_v0 = spark.read \
  .format("delta") \
  .option("versionAsOf", "0") \
  .load(INSPECTION_RESULTS_DELTA_PATH)

# Rewrite the DataFrame inspections_results_df_v0 in Delta Lake format, overwriting any existing data
inspections_results_df_v0 \
  .write \
  .format("delta") \
  .mode("overwrite") \
  .save(INSPECTION_RESULTS_DELTA_PATH)
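
Note that restoring does not erase the erroneous update: the overwrite is recorded as one more version on top of the existing history. The toy model below, in plain Python, illustrates that idea only. `VersionedTable` is a hypothetical class, the operation labels are illustrative rather than the ones Delta writes to its log, and the sample row is made up. Delta also offers a built-in `RESTORE` command that could replace the read-and-overwrite approach, assuming the runtime in use supports it.

```python
class VersionedTable:
    """Toy append-only history: every write, update or restore creates a new version."""

    def __init__(self, rows):
        self.versions = [("WRITE", [dict(row) for row in rows])]

    def update(self, column, value):
        """Set `column` to `value` on every row, as a new version."""
        latest = [dict(row, **{column: value}) for row in self.versions[-1][1]]
        self.versions.append(("UPDATE", latest))

    def restore(self, version):
        """Copy an old snapshot to the top of the history instead of deleting newer versions."""
        snapshot = [dict(row) for row in self.versions[version][1]]
        self.versions.append(("RESTORE", snapshot))

    def current(self):
        return self.versions[-1][1]


table = VersionedTable([{"DBA_Name": "LANS", "Results": "No Entry"}])
table.update("DBA_Name", "Error")
table.restore(0)

operations = [operation for operation, _ in table.versions]
print(operations, table.current())
```

The history ends up with three entries, and the intermediate version with `DBA_Name = Error` remains readable.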

Verify that the values have been restored correctly and check that the transaction has been recorded in the history:

In [0]:
%sql
SELECT *
FROM inspections_results

In [0]:
%sql
DESCRIBE HISTORY inspections_results

# Exercise 7

**Create a Structured Streaming application that reads data from the Kafka topic `inspections`. The Kafka server URL is `35.237.99.179:9094`:**

**The data from this topic is exactly the same as what we have been analyzing throughout this notebook, `Food Inspections`, so the schema is the same.**

In [0]:
# Read streaming data from Kafka
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "35.237.99.179:9094") \
  .option("subscribe", "inspections") \
  .load()

# Print the schema of the DataFrame
df.printSchema()

In [0]:

from pyspark.sql.types import StructType, StructField, StringType

# Define the schema to parse JSON data
schema = StructType(
  [
    StructField("Inspection ID", StringType(), True),
    StructField("DBA Name", StringType(), True),
    StructField("AKA Name", StringType(), True),
    StructField("License #", StringType(), True),
    StructField("Facility Type", StringType(), True),
    StructField("Risk", StringType(), True),
    StructField("Address", StringType(), True),
    StructField("City", StringType(), True),
    StructField("State", StringType(), True),
    StructField("Zip", StringType(), True),
    StructField("Inspection Date", StringType(), True),
    StructField("Inspection Type", StringType(), True),
    StructField("Results", StringType(), True),
    StructField("Violations", StringType(), True),
    StructField("Latitude", StringType(), True),
    StructField("Longitude", StringType(), True),
    StructField("Location", StringType(), True)
  ]
)

# Select the columns key, value, and timestamp, converting key and value to strings
# Parse the value column from JSON format using the specified schema
# Select the key, timestamp, and all columns from the parsed JSON value.
dataset = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp") \
    .withColumn("value", from_json("value", schema)) \
    .select(col('key'), col("timestamp"), col('value.*'))

# Print the schema
dataset.printSchema()
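
To make the parsing step more concrete, here is a plain-Python sketch of what happens to a single Kafka message: the raw bytes are decoded, parsed as JSON and projected onto the expected fields. As far as I know, `from_json` behaves similarly in that fields missing from the payload come back as null and unparseable values yield nulls rather than an error. `parse_value` and `FIELDS` are hypothetical names, and the sample messages are made up.

```python
import json

# Subset of the schema above, kept short for the example.
FIELDS = ["Inspection ID", "DBA Name", "Facility Type", "Risk", "Results"]


def parse_value(raw_bytes):
    """Decode a Kafka value and keep only the expected fields; missing fields become None."""
    empty = {field: None for field in FIELDS}
    try:
        payload = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return empty  # malformed message: every field is None
    if not isinstance(payload, dict):
        return empty  # valid JSON, but not an object
    return {field: payload.get(field) for field in FIELDS}


good = parse_value(b'{"Inspection ID": "1", "Risk": "Risk 1 (High)", "Extra": "ignored"}')
bad = parse_value(b"not json")
print(good)
print(bad)
```

Every field is read as a string here, mirroring the all-`StringType` schema; numeric and date columns have to be cast afterwards.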

Start the streaming with `writeStream`. `writeStream` configures a streaming channel in Spark Structured Streaming and initiates the execution of the streaming. The configuration applied in this case includes: append mode, storing results in memory, assigning a query name, etc.

This enables real-time data processing and analysis, making the results immediately available for querying through Spark SQL or other subsequent applications.

In [0]:
# Specify the output mode as 'append' (only new rows added to the results table)
# Define the output sink format as 'memory' (store the results table in memory)
# Option to truncate long strings in the output table (set to 'false' to display the full content)
# Assign a name to the query (to be referenced in Spark SQL)
# Start the streaming query
dataset.writeStream \
 .outputMode("append") \
 .format("memory") \
 .option("truncate", "false") \
 .queryName("inspections_topic") \
 .start()

We check the availability of the data by running a query in Spark SQL:

In [0]:
%sql
SELECT
  *
FROM
  inspections_topic

# Exercise 8
**Based on the data source from the previous exercise, obtain the number of inspections by `Facility Type` every 5 seconds.**

We will group the data by time windows and then count the records by `Facility Type`. We will use the `.display()` method for visualization.

**Note:** The `.display()` method is only available in certain development environments, such as Databricks, and is primarily designed for rapid data exploration and development. It is not recommended for use in production environments.



In [0]:
dataset.groupBy(window(col("timestamp"), "5 seconds"), col("Facility Type")) \
    .count() \
    .display()
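
The grouping performed by `window(col("timestamp"), "5 seconds")` can be pictured as assigning each event to a fixed, non-overlapping 5-second bucket. A minimal plain-Python sketch of that idea follows; `tumbling_counts` is a hypothetical helper, timestamps are plain seconds from an arbitrary origin, and the events are invented.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=5):
    """Count events per (window_start, facility_type); windows are [start, start + window_seconds)."""
    counts = Counter()
    for timestamp, facility_type in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, facility_type)] += 1
    return counts


events = [
    (0.5, "Restaurant"),
    (4.9, "Restaurant"),
    (5.0, "Grocery Store"),
    (7.2, "Restaurant"),
]
result = tumbling_counts(events)
print(result)
```

An event exactly on a boundary (5.0 here) falls into the window that starts there, since windows are closed on the left and open on the right.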

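`.display()` is convenient for exploration, but in production environments the results should be written to a sink with `writeStream`, ideally a persistent one such as a Delta table.

The code below shows the `writeStream` variant. For simplicity it uses the `memory` sink, which keeps the results in an in-memory table and is itself intended for testing rather than production: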

In [0]:
# Specify the output mode as 'update'
# Define the output sink format as 'memory'
# Set the option to truncate long strings in the output table to false
# Assign a name to the query (to be referenced in Spark SQL)
# Start the streaming query
dataset.groupBy(window(col("timestamp"), "5 seconds"), col("Facility Type")) \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("memory") \
    .option("truncate", "false") \
    .queryName("inspections_grouped_topic") \
    .start()

As in exercise 7, we check the availability of the data by running a query in Spark SQL:

In [0]:
%sql
SELECT
  *
FROM
  inspections_grouped_topic

# Exercise 9
**Based on the data source from exercise 7, obtain the number of inspections by `Results` every 5 seconds for the last 30 seconds.**

In [0]:
dataset.groupBy(window(col("timestamp"), "30 seconds", "5 seconds"), col("Results")) \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("memory") \
    .option("truncate", "false") \
    .queryName("inspections_grouped_topic2") \
    .start()
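
With a 30-second window that slides every 5 seconds, the windows overlap, so each event is counted in 30 / 5 = 6 different windows. The plain-Python sketch below lists the windows a single timestamp belongs to; `sliding_windows` is a hypothetical helper and timestamps are plain seconds from an arbitrary origin.

```python
def sliding_windows(timestamp, window_seconds=30, slide_seconds=5):
    """Return the start of every [start, start + window_seconds) window that contains timestamp."""
    # Latest window start that is not after the timestamp.
    start = int(timestamp // slide_seconds) * slide_seconds
    starts = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - window_seconds:
        starts.append(start)
        start -= slide_seconds
    return sorted(starts)


starts = sliding_windows(32)
print(starts)
```

For an event at second 32, the windows start at 5, 10, 15, 20, 25 and 30; the window starting at 0 covers [0, 30) and therefore no longer contains it.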

As in exercise 8, we check the availability of the data by running a query in Spark SQL:

In [0]:
%sql
SELECT
  *
FROM
  inspections_grouped_topic2

# Exercise 10
1. **Update the `Results` column of the Delta table for food inspections created in exercise 1 to the value `No result`.**
2. **Update the data in the modified table from point 1 as new items arrive in Kafka.**
---

It is advisable to stop all previous streams, as the one for this exercise tends to be resource-intensive.

**1. Update the `Results` column of the Delta table for food inspections created in exercise 1 to the value `No result`.**

Before starting, we print the details of the Delta `food` table:

In [0]:
%sql
DESCRIBE FORMATTED food

And we also print an example of the values it contains:

In [0]:
%sql
SELECT *
FROM 
  food
LIMIT 3

We will now update the `Results` column to set the value `No result`:

In [0]:
%sql
-- Update the column Results
UPDATE food SET Results = 'No result';

We check that it has been updated by printing a sample of the data again:

In [0]:
%sql
SELECT *
FROM 
  food
LIMIT 10

**2. Update the data in the modified table from point 1 as new items arrive in Kafka**

First we establish the connection to Kafka by reusing the code from exercise 7:

In [0]:
# Read streaming data from Kafka
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "35.237.99.179:9094") \
  .option("subscribe", "inspections") \
  .load()


# Define the schema for parsing JSON data
schema = StructType(
  [
    StructField("Inspection ID", StringType(), True),
    StructField("DBA Name", StringType(), True),
    StructField("AKA Name", StringType(), True),
    StructField("License #", StringType(), True),
    StructField("Facility Type", StringType(), True),
    StructField("Risk", StringType(), True),
    StructField("Address", StringType(), True),
    StructField("City", StringType(), True),
    StructField("State", StringType(), True),
    StructField("Zip", StringType(), True),
    StructField("Inspection Date", StringType(), True),
    StructField("Inspection Type", StringType(), True),
    StructField("Results", StringType(), True),
    StructField("Violations", StringType(), True),
    StructField("Latitude", StringType(), True),
    StructField("Longitude", StringType(), True),
    StructField("Location", StringType(), True)
  ]
)

# Convert Kafka's data to the schema. To do so:
#   - Select key, value, and timestamp columns, converting key and value to strings
#   - Parse the value column from the JSON format using the specified schema
#   - Select the key, the timestamp and all columns of the parsed JSON value
dataset = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp") \
    .withColumn("value", from_json("value", schema)) \
    .select(col('key'), col("timestamp"), col('value.*'))

In this exercise, it is essential to take into account that the schema of the Delta table to be modified contains typed columns (integers, doubles, etc.), whereas every field read from Kafka is parsed as a string. The data read from Kafka must therefore be cast to the appropriate types:

In [0]:
from pyspark.sql.types import IntegerType, DoubleType, DateType

# Cast the columns to the types used by the Delta table to avoid problems in the merge.
# withColumn returns a new DataFrame, so the result is assigned back to `dataset`.
dataset = dataset \
  .withColumn("Inspection ID", col("Inspection ID").cast(IntegerType())) \
  .withColumn("License #", col("License #").cast(IntegerType())) \
  .withColumn("Zip", col("Zip").cast(IntegerType())) \
  .withColumn("Inspection Date", to_date(col("Inspection Date"), "MM/dd/yyyy")) \
  .withColumn("Latitude", col("Latitude").cast(DoubleType())) \
  .withColumn("Longitude", col("Longitude").cast(DoubleType()))
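
For reference, the effect of these casts on individual values can be sketched in plain Python. With ANSI mode disabled, a Spark cast that fails yields null instead of raising, which the helpers below imitate by returning `None`. The helper names are hypothetical, and `%m/%d/%Y` is the `strptime` counterpart I am assuming for the `MM/dd/yyyy` pattern.

```python
from datetime import datetime


def to_int(text):
    """Return int(text), or None when the text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def to_float(text):
    """Return float(text), or None when the text is not a valid number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def to_date_mdy(text):
    """Parse an 'MM/dd/yyyy' string into a date, or return None when it does not match."""
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except (TypeError, ValueError):
        return None


print(to_int("60614"), to_float("41.92"), to_date_mdy("07/04/2019"), to_int("n/a"))
```

Because failed casts turn into nulls silently, counting nulls per column before and after the cast is a cheap way to spot malformed input.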

Now we will define a function to update the Delta table:

In [0]:
from delta.tables import *

# Set the number of partitions used for shuffle operations (200 is Spark's default; tune it to the cluster size)
spark.conf.set("spark.sql.shuffle.partitions", "200")  


# Function for updating the Delta table
def upsertToDelta(microBatchOutputDF, batchId):

    # Debugging: Print the number of records in the batch
    print(f"Processing batch ID: {batchId} with {microBatchOutputDF.count()} records.")
    
    # Load existing Delta table
    delta_table = DeltaTable.forName(spark, "food")

    # Debugging: Check that the Delta Table has been loaded successfully
    record_count = delta_table.toDF().count()
    print(f"Total number of records in the table 'food': {record_count}")

    # Debugging: Filter the microbatch records that already exist in the 'food' table and count how many there are
    matched_count = microBatchOutputDF.filter("`Inspection ID` IN (SELECT Inspection_ID FROM food)").count()
    print(f"Matching records: {matched_count}")

    # Perform merge operation on Delta table
    delta_table.alias("target").merge(
        microBatchOutputDF.alias("source"),
        "target.Inspection_ID = cast(source.`Inspection ID` as Integer)"
    ).whenMatchedUpdate(set = {
            "target.Results": "source.Results"
    }).whenNotMatchedInsert(values={
            "target.Inspection_ID": "source.`Inspection ID`",
            "target.DBA_Name": "source.`DBA Name`",
            "target.AKA_Name": "source.`AKA Name`",
            "target.License_number": "source.`License #`",
            "target.Facility_Type": "source.`Facility Type`",
            "target.Risk": "source.Risk",
            "target.Address": "source.Address",
            "target.City": "source.City",
            "target.State": "source.State",
            "target.Zip": "source.Zip",
            "target.Inspection_Date": "source.`Inspection date`",
            "target.Results": "source.Results",
            "target.Violations": "source.Violations",
            "target.Latitude": "source.Latitude",
            "target.Longitude": "source.Longitude",
            "target.Location": "source.Location"         
    }).execute()


    # Debugging: Convert the Delta table to a DataFrame and count the total number of records after the merge.
    delta_table_a_df = delta_table.toDF()
    record_count = delta_table_a_df.count()
    print(f"Merge completed. Total number of records in the table 'food' after the merge: {record_count}")

    # Debugging: Count the values in the 'Results' column and display the result
    result_count = delta_table_a_df.groupBy("Results").count()
    result_count.show(truncate=False)
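
The merge logic itself is easier to reason about without the debugging output. The plain-Python sketch below reproduces only its semantics on dictionaries: a matched key updates `Results` and nothing else, an unmatched key inserts a new row. `upsert` is a hypothetical helper, the rows are invented, and only two columns are kept for brevity.

```python
def upsert(target, batch):
    """Merge batch rows into target (a dict keyed by Inspection_ID), mimicking the merge above."""
    for row in batch:
        key = int(row["Inspection ID"])  # same cast as in the merge condition
        if key in target:
            # Matched: update only the Results column.
            target[key]["Results"] = row["Results"]
        else:
            # Not matched: insert a new row.
            target[key] = {"DBA_Name": row["DBA Name"], "Results": row["Results"]}
    return target


food = {1: {"DBA_Name": "LANS", "Results": "No result"}}
batch = [
    {"Inspection ID": "1", "DBA Name": "IGNORED ON UPDATE", "Results": "Pass"},
    {"Inspection ID": "2", "DBA Name": "FORK", "Results": "Fail"},
]
upsert(food, batch)
print(food)
```

One difference worth noting: if a micro-batch contains several rows for the same key, this sketch lets the last one win, whereas a Delta merge is expected to fail when multiple source rows match the same target row, so deduplicating the batch beforehand may be necessary.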

Apply the `upsertToDelta` function to the stream:

In [0]:
# Delete the checkpoints directory for the 'food' table, including all its contents.
dbutils.fs.rm("dbfs:/mnt/delta/checkpoints/food", recurse=True)

# Configure the stream to process data and apply upsertToDelta function
query = dataset.writeStream \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "dbfs:/mnt/delta/checkpoints/food") \
    .start()

We check that the updates are being applied by counting the values of the `Results` column. Running the query twice, the count for `No result` should be higher in the first execution and lower in the second, because matched rows receive the results arriving from Kafka.

In [0]:
%sql
SELECT Results, COUNT(*) as Count
FROM food
GROUP BY Results

In [0]:
%sql
SELECT Results, COUNT(*) as Count
FROM food
GROUP BY Results

We also check the total number of records, to see if new ones have been added:

In [0]:
%sql
SELECT COUNT(*) AS total_lines FROM food

**Clarification on all debugging instructions** incorporated in the `upsertToDelta` method:

During the development of the merge function, I encountered several challenges:
- Mapping the column names correctly was difficult.
- Once the mapping was completed, the execution still did not update the data (due to incorrect filtering), and the cause was not apparent.

To detect the cause of the failures, it was necessary to include various debugging instructions, such as:

```python
# Debugging: Print the number of records in the batch
print(f"Processing batch ID: {batchId} with {microBatchOutputDF.count()} records.")

# Debugging: Check that the Delta Table has been loaded successfully
record_count = delta_table.toDF().count()
print(f"Total number of records in the table 'food': {record_count}")

# Debugging: Filter the microbatch records that already exist in the 'food' table and count how many there are
matched_count = microBatchOutputDF.filter("`Inspection ID` IN (SELECT Inspection_ID FROM food)").count()
print(f"Matching records: {matched_count}")

...
```

The results of these instructions can be viewed in the Standard Output (stdout) of Databricks by accessing the menu:

- `Compute` > Select the cluster number > `Driver Logs` tab > `Standard Output` section.

This has allowed me to monitor the execution and effectively debug the process.
