# Unit 5: DELETE from COW tables

In this unit, we will learn how to perform delete operations on Copy-on-Write (COW) tables.<br>

Apache Hudi supports two types of deletes:<br>
Soft Deletes: This retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.<br>
Hard Deletes: This physically removes any trace of the record from the table. 
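
The difference between the two delete types can be sketched without Spark or Hudi. The snippet below is only a toy model on an in-memory list of dictionaries; the rows, the `KEEP` set and the helper names `soft_delete` and `hard_delete` are illustrative assumptions, not part of the Hudi API.

```python
from copy import deepcopy

# Toy "table": one dict per record (values are made up for illustration).
table = [
    {"trip_id": 1, "trip_date": "2021-01-31",
     "pickup_datetime": "2021-01-31 19:27:28", "taxi_type": "green", "fare_amount": 6.5},
    {"trip_id": 2, "trip_date": "2021-01-31",
     "pickup_datetime": "2021-01-31 20:01:10", "taxi_type": "yellow", "fare_amount": 12.0},
]

# Columns that survive a soft delete: record key, partition key, precombine key.
KEEP = {"trip_id", "trip_date", "pickup_datetime"}


def soft_delete(rows, trip_id):
    """Return a copy where the matching record keeps its keys but loses its values."""
    out = deepcopy(rows)
    for row in out:
        if row["trip_id"] == trip_id:
            for col in row:
                if col not in KEEP:
                    row[col] = None
    return out


def hard_delete(rows, trip_id):
    """Return a copy where the matching record is gone entirely."""
    return [deepcopy(row) for row in rows if row["trip_id"] != trip_id]


soft = soft_delete(table, 1)
hard = hard_delete(table, 1)
print(len(table), len(soft), len(hard))
```

In this model the soft delete leaves the record count unchanged, while the hard delete reduces it by one.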


This unit takes about 5 minutes to complete.

### Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from functools import reduce
from pyspark.sql.types import LongType
import pyspark.sql.functions as F
from datetime import datetime

spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-05-PySpark") \
  .master("yarn")\
  .enableHiveSupport()\
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

spark

23/07/30 02:32:42 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Declare & define base variables

In [2]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_COW_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi-cow"
DATABASE_NAME = "taxi_db"
COW_TABLE_NAME = "nyc_taxi_trips_hudi_cow"
TRIP_DATE='2021-01-31'

Project ID is apache-hudi-lab
Project Number is 623600433888


## 1. "Soft Delete" a record
Soft Delete - retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.<br>


### 1.1. Record count prior to soft delete
Capture the number of records in the partition before the delete, so that it can be compared with the count afterwards.

In [3]:
RECORD_COUNT_PRIOR_TO_DELETE=spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_date=\"{TRIP_DATE}\"").collect()[0][0]
print(f"RECORD_COUNT_PRIOR_TO_DELETE={RECORD_COUNT_PRIOR_TO_DELETE}")

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
23/07/30 02:32:47 WARN GhfsStorageStatistics: Detected potential high latency for operation op_open. latencyMs=110; previousMaxLatencyMs=0; operationCount=1; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/hoodie.properties
[Stage 3:>                                                          (0 + 1) / 1]

RECORD_COUNT_PRIOR_TO_DELETE=32604


                                                                                

### 1.2. Files before delete

In [4]:
# GCS parquet file listing prior to delete
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_date=$TRIP_DATE

     373 B  2023-07-30T01:59:54Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690682394857386  metageneration=1
   1.3 MiB  2023-07-30T01:59:54Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/8c11c11f-7e53-4b28-a223-eebb4bace8fc-0_103-57-15632_20230729055658662.parquet#1690682394893869  metageneration=1
TOTAL: 2 objects, 1358778 bytes (1.3 MiB)
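
To tell whether a later write produced a new version of the data file, it helps to split the file name into its parts. The sketch below assumes the usual Hudi base file naming of `<fileId>_<writeToken>_<instantTime>.parquet` and a 17-digit instant time (`yyyyMMddHHmmssSSS`); the helper name `parse_base_file` is hypothetical. On a COW table, a write that touches the file is expected to produce a file with the same file ID and a newer instant time.

```python
from datetime import datetime, timedelta


def parse_base_file(path):
    """Split a Hudi base file path into file ID, write token and instant time.

    Returns None for hidden files such as .hoodie_partition_metadata.parquet.
    """
    name = path.rsplit("/", 1)[-1]
    name = name.split("#", 1)[0]          # drop the generation suffix added by `gsutil ls -a`
    if name.startswith("."):
        return None
    stem, extension = name.rsplit(".", 1)
    file_id, write_token, instant_time = stem.split("_")
    # First 14 digits are yyyyMMddHHmmss, the remainder is milliseconds.
    committed_at = datetime.strptime(instant_time[:14], "%Y%m%d%H%M%S") \
        + timedelta(milliseconds=int(instant_time[14:]))
    return {
        "file_id": file_id,
        "write_token": write_token,
        "instant_time": instant_time,
        "committed_at": committed_at,
        "extension": extension,
    }


listing = [
    "gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690682394857386",
    "gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/8c11c11f-7e53-4b28-a223-eebb4bace8fc-0_103-57-15632_20230729055658662.parquet#1690682394893869",
]
parsed = [parse_base_file(p) for p in listing]
for info in parsed:
    if info is not None:
        print(info["file_id"], info["instant_time"], info["committed_at"])
```

Comparing the `instant_time` values of files that share a `file_id` before and after a delete shows whether the file was rewritten.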


### 1.3. Identify a record to delete

In [5]:
# Select a trip ID to delete
DELETE_CANDIDATE_TRIP_ID=spark.sql(f"SELECT trip_id  FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_date=\"{TRIP_DATE}\" LIMIT 1").collect()[0][0]
print(f"DELETE_CANDIDATE_TRIP_ID={DELETE_CANDIDATE_TRIP_ID}")

# Create a dataframe with the record
deleteTripDFCow=spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_date=\"{TRIP_DATE}\" AND trip_id={DELETE_CANDIDATE_TRIP_ID}")

# Here is the "before" the soft delete
deleteTripDFCow.show()

                                                                                

DELETE_CANDIDATE_TRIP_ID=438086925068


23/07/30 02:33:16 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
[Stage 7:>                                                          (0 + 1) / 1]

+-------------------+--------------------+------------------+----------------------+--------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|  surcharge|    mta_tax| tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcha

                                                                                

### 1.4. Prepare for soft delete

In [6]:
hudi_soft_delete_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'trip_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_date',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.operation': 'upsert',
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.upsert.shuffle.parallelism': 2, 
            'hoodie.combine.before.delete': 'false'
}

In [7]:
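meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", \
  "_hoodie_partition_path", "_hoodie_file_name"]
# Keep the precombine key, record key and partition key; everything else gets nullified
excluded_columns = meta_columns + ["pickup_datetime", "trip_id", "trip_date"]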

In [8]:
# Prepare for the soft delete by ensuring the appropriate fields are nullified
nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, \
  list(map(lambda field: (field.name, field.dataType), deleteTripDFCow.schema.fields))))

# Drop the Hudi meta columns first, then set every remaining non-key column to a typed null
softDeleteTripDFCow = reduce(lambda df, col: df.withColumn(col[0], lit(None).cast(col[1])), \
  nullify_columns, deleteTripDFCow.drop(*meta_columns))

# Let's look at the record we want to soft delete
softDeleteTripDFCow.show(truncate=False)

+-------------------+-----------------------+------------------+----------------------+-----------------------------------------------------------------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+----------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+---------+-------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+---------+
|_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                            |taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|pickup_datetime    |dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|surcharge|mta
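
The column selection used to prepare the soft delete can be followed more easily on plain `(name, type)` pairs. This is a minimal sketch, assuming the record key, precombine key and partition key are `trip_id`, `pickup_datetime` and `trip_date`; the helper `plan_soft_delete` and the type strings in `fields` are hypothetical and only stand in for the DataFrame schema.

```python
META_COLUMNS = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
                "_hoodie_partition_path", "_hoodie_file_name"]
KEEP_COLUMNS = ["trip_id", "pickup_datetime", "trip_date"]


def plan_soft_delete(schema_fields, keep=KEEP_COLUMNS, meta=META_COLUMNS):
    """Classify (name, dtype) pairs into columns to drop, keep, and nullify."""
    to_drop = [name for name, _ in schema_fields if name in meta]
    to_keep = [name for name, _ in schema_fields if name in keep]
    to_null = [(name, dtype) for name, dtype in schema_fields
               if name not in meta and name not in keep]
    return to_drop, to_keep, to_null


# A shortened, illustrative schema (types are assumptions).
fields = [
    ("_hoodie_commit_time", "string"),
    ("_hoodie_record_key", "string"),
    ("trip_id", "bigint"),
    ("taxi_type", "string"),
    ("pickup_datetime", "timestamp"),
    ("fare_amount", "double"),
    ("trip_date", "date"),
]

to_drop, to_keep, to_null = plan_soft_delete(fields)
print("drop:   ", to_drop)
print("keep:   ", to_keep)
print("nullify:", to_null)
```

Each column in the nullify list is the kind of column that would be replaced by `lit(None).cast(dtype)`, so the written record keeps the table schema.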

In contrast, the original record:

In [9]:
deleteTripDFCow.show()

+-------------------+--------------------+------------------+----------------------+--------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|  surcharge|    mta_tax| tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcha

### 1.5. Execute the soft delete

In [10]:
# Simply upsert the table after setting all the fields to null, except the record key, partition key and precombine key fields 
softDeleteTripDFCow.write.format("hudi"). \
  options(**hudi_soft_delete_options). \
  mode("append"). \
  save(HUDI_COW_BASE_GCS_URI)

23/07/30 02:33:24 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_close_operations. latencyMs=103; previousMaxLatencyMs=0; operationCount=1; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/20230730023322415.deltacommit.requested
23/07/30 02:33:31 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_close_operations. latencyMs=118; previousMaxLatencyMs=103; operationCount=2; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/20230730023322415.deltacommit.inflight
23/07/30 02:33:33 WARN GhfsStorageStatistics: Detected potential high latency for operation op_create. latencyMs=104; previousMaxLatencyMs=82; operationCount=4; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/.temp/20230730023322415/MARKERS.type
23/07/30 02:33:33 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_write_close_operations. latencyMs=145

In [11]:
spark.sql(f"REFRESH TABLE {DATABASE_NAME}.{COW_TABLE_NAME};").show(truncate=False)

++
||
++
++



In [12]:
print(f"RECORD_COUNT_PRIOR_TO_DELETE={RECORD_COUNT_PRIOR_TO_DELETE}")

# Let's do a count after the soft delete - it should be the same as before
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_date=\"{TRIP_DATE}\" ").show(truncate=False)

RECORD_COUNT_PRIOR_TO_DELETE=32604


[Stage 51:>                                                         (0 + 1) / 1]

+----------+
|trip_count|
+----------+
|32604     |
+----------+



                                                                                

In [13]:
# Let's search for the record we attempted to soft delete
spark.sql(f"SELECT trip_id,taxi_type,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id,trip_date " \
          f" FROM {DATABASE_NAME}.{COW_TABLE_NAME} " \
          f" WHERE trip_date=\"{TRIP_DATE}\" AND trip_id={DELETE_CANDIDATE_TRIP_ID}").show(truncate=False)

[Stage 54:>                                                         (0 + 1) / 1]

+------------+---------+---------+-------------------+-------------------+------------------+-------------------+----------+
|trip_id     |taxi_type|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|trip_date |
+------------+---------+---------+-------------------+-------------------+------------------+-------------------+----------+
|438086925068|green    |2        |2021-01-31 19:27:28|2021-01-31 19:32:46|74                |75                 |2021-01-31|
+------------+---------+---------+-------------------+-------------------+------------------+-------------------+----------+



                                                                                

In [14]:
# Let's check to see if there is a new file
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_date=$TRIP_DATE

     373 B  2023-07-30T01:59:54Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690682394857386  metageneration=1
   1.3 MiB  2023-07-30T01:59:54Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/8c11c11f-7e53-4b28-a223-eebb4bace8fc-0_103-57-15632_20230729055658662.parquet#1690682394893869  metageneration=1
TOTAL: 2 objects, 1358778 bytes (1.3 MiB)


#### How does soft delete help?
When you have to scrub data without losing the trace of the record, you can nullify its columns. Aggregations over the nullified columns (for example, `SUM` or `AVG`) then ignore the record, while `COUNT(*)` still includes it.

In [15]:
# Read from source in GCS
spark.read.format("hudi").load(HUDI_COW_BASE_GCS_URI).createOrReplaceTempView("hudi_trips_snapshot")

# This should return the same total count as before
spark.sql(f"SELECT trip_id,trip_date FROM hudi_trips_snapshot WHERE trip_date=\"{TRIP_DATE}\"").count()

# Search for the record
spark.sql(f"SELECT trip_id,taxi_type,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id,trip_date FROM hudi_trips_snapshot " \
          f"WHERE trip_date=\"{TRIP_DATE}\" and trip_id={DELETE_CANDIDATE_TRIP_ID}").show()

                                                                                

+------------+---------+---------+-------------------+-------------------+------------------+-------------------+----------+
|     trip_id|taxi_type|vendor_id|    pickup_datetime|   dropoff_datetime|pickup_location_id|dropoff_location_id| trip_date|
+------------+---------+---------+-------------------+-------------------+------------------+-------------------+----------+
|438086925068|    green|        2|2021-01-31 19:27:28|2021-01-31 19:32:46|                74|                 75|2021-01-31|
+------------+---------+---------+-------------------+-------------------+------------------+-------------------+----------+
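
A quick way to interpret the result of such a lookup is to classify the rows it returns. The helper below is a sketch on plain dictionaries; `delete_state` and the two sample rows are illustrative assumptions, and the kept columns are assumed to be the record key, precombine key and partition key used in this unit.

```python
KEEP_COLUMNS = ("trip_id", "pickup_datetime", "trip_date")


def delete_state(rows, keep=KEEP_COLUMNS):
    """Classify a key lookup: no rows, a fully nullified row, or a live row."""
    if not rows:
        return "hard-deleted"
    scrubbed = all(value is None
                   for row in rows
                   for column, value in row.items()
                   if column not in keep)
    return "soft-deleted" if scrubbed else "live"


# Illustrative rows using the columns queried in this unit.
live_row = {"trip_id": 438086925068, "taxi_type": "green", "vendor_id": 2,
            "pickup_datetime": "2021-01-31 19:27:28", "trip_date": "2021-01-31"}
scrubbed_row = {"trip_id": 438086925068, "taxi_type": None, "vendor_id": None,
                "pickup_datetime": "2021-01-31 19:27:28", "trip_date": "2021-01-31"}

print(delete_state([live_row]))
print(delete_state([scrubbed_row]))
print(delete_state([]))
```

With a Spark DataFrame, the rows could be obtained with `[r.asDict() for r in df.collect()]` before passing them to the helper.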



## 2. "Hard Delete" a record
This physically removes any trace of the record from the table. 

In [16]:
hudi_hard_delete_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'trip_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_date',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.datasource.write.operation': 'delete',
            'hoodie.combine.before.delete': 'false'
}
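
The soft delete and hard delete option sets in this unit share almost every key. As a sketch, they could be built from one helper so that the difference stands out; `build_hudi_options` is a hypothetical function, and the option keys and values are the ones already used in this unit.

```python
def build_hudi_options(database, table, operation, extra=None):
    """Return Hudi write options for a COW table, varying only the write operation."""
    options = {
        'hoodie.database.name': database,
        'hoodie.table.name': table,
        'hoodie.datasource.write.table.name': table,
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.recordkey.field': 'trip_id',
        'hoodie.datasource.write.partitionpath.field': 'trip_date',
        'hoodie.datasource.write.precombine.field': 'pickup_datetime',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.partition.metafile.use.base.format': 'true',
        'hoodie.datasource.write.drop.partition.columns': 'true',
        'hoodie.combine.before.delete': 'false',
        'hoodie.datasource.write.operation': operation,
    }
    options.update(extra or {})
    return options


soft_options = build_hudi_options("taxi_db", "nyc_taxi_trips_hudi_cow", "upsert",
                                  {'hoodie.upsert.shuffle.parallelism': 2})
hard_options = build_hudi_options("taxi_db", "nyc_taxi_trips_hudi_cow", "delete")

# Keys whose values differ between the two option sets
differing_keys = sorted(k for k in soft_options.keys() | hard_options.keys()
                        if soft_options.get(k) != hard_options.get(k))
print(differing_keys)
```

The only settings that differ are the write operation (`upsert` versus `delete`) and the upsert shuffle parallelism.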


In [17]:
# Append to the table with the 'delete' operation - Hudi matches on record key and partition path and physically removes the record
softDeleteTripDFCow.write.format("hudi"). \
  options(**hudi_hard_delete_options). \
  mode("append"). \
  save(HUDI_COW_BASE_GCS_URI)


                                                                                

In [20]:
# Refresh Hive Metastore metadata
spark.sql(f"REFRESH TABLE {DATABASE_NAME}.{COW_TABLE_NAME};").show(truncate=False)

++
||
++
++



In [21]:
# Let's search for the record we attempted to hard delete
spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_date=\"{TRIP_DATE}\" AND trip_id={DELETE_CANDIDATE_TRIP_ID}").show(truncate=False)

[Stage 118:>                                                        (0 + 1) / 1]

+-------------------+-----------------------+------------------+----------------------+-----------------------------------------------------------------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                            |taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|pickup_datetime    |dropoff_datetime   |store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount

                                                                                

In [22]:
# Let's check to see if there is a new file
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_date=$TRIP_DATE

     373 B  2023-07-30T01:59:54Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690682394857386  metageneration=1
   1.3 MiB  2023-07-30T01:59:54Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/8c11c11f-7e53-4b28-a223-eebb4bace8fc-0_103-57-15632_20230729055658662.parquet#1690682394893869  metageneration=1
TOTAL: 2 objects, 1358778 bytes (1.3 MiB)

