# Unit 5: DELETE from COW tables

In this unit, we will learn how to perform delete operations on COW (Copy-on-Write) tables.<br>

Apache Hudi supports two types of deletes:<br>
Soft Deletes: This retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.<br>
Hard Deletes: This physically removes any trace of the record from the table. 
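
The difference between the two delete types can be sketched with a toy in-memory table. This is plain Python, not Hudi code; the helper names `soft_delete` and `hard_delete` and the sample rows (including the fare values) are made up for illustration.

```python
# Toy in-memory model of the two delete types (illustrative only, not Hudi code).
# Fields that a soft delete keeps: record key, partition path, precombine field.
KEY_FIELDS = {"trip_id", "trip_date", "pickup_datetime"}

def soft_delete(table, trip_id):
    """Return a new table where the matching record keeps only its key fields."""
    return [
        {k: (v if k in KEY_FIELDS else None) for k, v in row.items()}
        if row["trip_id"] == trip_id else row
        for row in table
    ]

def hard_delete(table, trip_id):
    """Return a new table without the matching record."""
    return [row for row in table if row["trip_id"] != trip_id]

# Hypothetical sample rows
trips = [
    {"trip_id": 1, "trip_date": "2021-01-31", "pickup_datetime": "2021-01-31 10:17:47", "fare_amount": 12.5},
    {"trip_id": 2, "trip_date": "2021-01-31", "pickup_datetime": "2021-01-31 11:02:10", "fare_amount": 8.0},
]

after_soft = soft_delete(trips, 1)
after_hard = hard_delete(trips, 1)

print(len(after_soft), after_soft[0]["fare_amount"])  # 2 None
print(len(after_hard))                                # 1
```

After the soft delete the row count is unchanged and only the non-key values are gone; after the hard delete the row itself is gone.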


This unit takes about 5 minutes to complete.

### Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from functools import reduce
from pyspark.sql.types import LongType
import pyspark.sql.functions as F
from datetime import datetime

spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-05-PySpark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

spark

23/08/01 03:31:36 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Declare & define base variables

In [2]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

TRIP_DATE='2021-01-31'
LOCATION="us-central1"
HUDI_COW_BASE_GCS_URI = f"gs://gaia_data_bucket-{PROJECT_NBR}/nyc-taxi-trips-hudi-cow"
DATAPROC_METASTORE_THRIFT_URI_LIST=!gcloud metastore services list --location $LOCATION | grep thrift | cut -d' ' -f11
DATAPROC_METASTORE_THRIFT_URI=DATAPROC_METASTORE_THRIFT_URI_LIST[0]

print(f"Project ID is {PROJECT_ID}")
print(f"Project number is {PROJECT_NBR}")
print(f"Project location is {LOCATION}")
print(f"Hudi base Cow table GCS URI is {HUDI_COW_BASE_GCS_URI}")
print(f"Dataproc Metastore Service thrift URI is {DATAPROC_METASTORE_THRIFT_URI}")
print(f"Trip date to be used for deletes is {TRIP_DATE}")

Project ID is apache-hudi-lab
Project number is 623600433888
Project location is us-central1
Hudi base Cow table GCS URI is gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow
Dataproc Metastore Service thrift URI is thrift://10.60.192.28:9080
Trip date to be used for deletes is 2021-01-31


## 1. "Soft Delete" a record
A soft delete retains the record key, partition path, and precombine field, nulls out the values of all the other fields, and persists the result in upsert mode. The nulled-out records remain in storage and are never removed.<br>


### 1.1. Record count prior to soft delete
Capture the record count for the partition so that it can be compared with the count after the soft delete.

In [3]:
RECORD_COUNT_PRIOR_TO_DELETE=spark.sql(f"SELECT COUNT(*) FROM taxi_db.nyc_taxi_trips_hudi_cow WHERE trip_date='2021-01-31'").collect()[0][0]
print(f"RECORD_COUNT_PRIOR_TO_DELETE={RECORD_COUNT_PRIOR_TO_DELETE}")

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
23/08/01 03:31:42 WARN GhfsStorageStatistics: Detected potential high latency for operation op_open. latencyMs=119; previousMaxLatencyMs=0; operationCount=1; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/hoodie.properties

RECORD_COUNT_PRIOR_TO_DELETE=32604


                                                                                

### 1.2. Files before delete

In [4]:
# GCS parquet file listing prior to the delete
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_date=2021-01-31

     373 B  2023-08-01T03:06:41Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690859201526064  metageneration=1
  1.29 MiB  2023-08-01T03:06:41Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_103-57-15447_20230731210155584.parquet#1690859201531538  metageneration=1
TOTAL: 2 objects, 1358270 bytes (1.3 MiB)


### 1.3. Identify a record to delete

In [5]:
# Select a trip ID to delete
DELETE_CANDIDATE_TRIP_ID=spark.sql(f"SELECT trip_id  FROM taxi_db.nyc_taxi_trips_hudi_cow WHERE trip_date='2021-01-31' LIMIT 1").collect()[0][0]
print(f"DELETE_CANDIDATE_TRIP_ID={DELETE_CANDIDATE_TRIP_ID}")

# Create a dataframe with the record
deleteTripDFCow=spark.sql(f"SELECT * FROM taxi_db.nyc_taxi_trips_hudi_cow WHERE trip_date='2021-01-31' AND trip_id={DELETE_CANDIDATE_TRIP_ID}")

# Here is the record "before" the soft delete
deleteTripDFCow.show()

                                                                                

DELETE_CANDIDATE_TRIP_ID=249108210053


23/08/01 03:32:11 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------------------+--------------------+------------------+----------------------+--------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance| fare_amount|  surcharge|    mta_tax| tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surc

                                                                                

### 1.4. Prepare for soft delete

This is achieved by nulling out all columns except:<br>
1. The hoodie metadata columns
2. The record key, precombine key and partition path fields
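
As a rough sketch of the rule above, the columns to nullify are simply the ones left over after the metadata and key columns are set aside. The helper name `columns_to_nullify` and the shortened column list are assumptions made for illustration; the notebook derives the real list from the dataframe schema.

```python
# Columns that must survive a soft delete untouched
META_COLUMNS = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
                "_hoodie_partition_path", "_hoodie_file_name"]
KEY_COLUMNS = ["trip_id", "trip_date", "pickup_datetime"]  # record key, partition path, precombine key

def columns_to_nullify(all_columns):
    """Return, in their original order, the columns that are neither metadata nor key columns."""
    excluded = set(META_COLUMNS) | set(KEY_COLUMNS)
    return [c for c in all_columns if c not in excluded]

# A shortened, illustrative subset of the table's columns
sample_columns = META_COLUMNS + ["taxi_type", "vendor_id", "pickup_datetime",
                                 "dropoff_datetime", "trip_id", "trip_date"]

print(columns_to_nullify(sample_columns))
# ['taxi_type', 'vendor_id', 'dropoff_datetime']
```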

In [6]:
hudi_soft_delete_options = {
            'hoodie.database.name': 'taxi_db',
            'hoodie.table.name': 'nyc_taxi_trips_hudi_cow',
            'hoodie.datasource.write.table.name': 'nyc_taxi_trips_hudi_cow',
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'trip_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_date',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.operation': 'upsert',
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.upsert.shuffle.parallelism': 2, 
            'hoodie.combine.before.delete': 'false',
            'hoodie.datasource.hive_sync.enable': 'true',
            'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.hive.HiveSyncTool',
            'hoodie.datasource.hive_sync.mode':'hms',
            'hoodie.datasource.hive_sync.metastore.uris':DATAPROC_METASTORE_THRIFT_URI,
            'hoodie.datasource.hive_sync.auto_create_database':'true',
            'hoodie.datasource.hive_sync.database': 'taxi_db',
            'hoodie.datasource.hive_sync.table': 'nyc_taxi_trips_hudi_cow',
            'hoodie.datasource.hive_sync.partition_fields': 'trip_date', 
            'hoodie.datasource.hive_sync.partition_extractor_class':'org.apache.hudi.hive.MultiPartKeysValueExtractor',
            'hoodie.datasource.hive_sync.use_jdbc': 'false',
            'hoodie.datasource.hive_sync.support_timestamp': 'true'
            
}

In [7]:
meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", \
  "_hoodie_partition_path", "_hoodie_file_name"]
excluded_columns = meta_columns + ["trip_date","pickup_datetime","trip_id"]

In [8]:
# Prepare for the soft delete by nullifying every field that is not excluded.
# The Hudi metadata columns are in excluded_columns, so they pass through unchanged.
nullify_columns = [(field.name, field.dataType) for field in deleteTripDFCow.schema.fields
                   if field.name not in excluded_columns]

# Replace each remaining column with a typed NULL so the schema stays the same
softDeleteTripDFCow = reduce(lambda df, col: df.withColumn(col[0], lit(None).cast(col[1])),
                             nullify_columns, deleteTripDFCow)

# Let's look at the record we want to soft delete
softDeleteTripDFCow.show(truncate=False)


+-------------------+-----------------------+------------------+----------------------+-----------------------------------------------------------------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+----------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+---------+-------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                            |taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|pickup_datetime    |dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|surcharge|mt

                                                                                

In contrast, here is the original record:

In [9]:
deleteTripDFCow.show()

+-------------------+--------------------+------------------+----------------------+--------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance| fare_amount|  surcharge|    mta_tax| tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surc

### 1.5. Execute the soft delete

#### How does soft delete help?
When you have to scrub data (for various reasons, such as GDPR) without losing the trace of the record, you can nullify its columns so that its values no longer show up in, say, aggregation operations on the table.
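
The effect on aggregations can be sketched without a cluster. The example below uses Python's built-in `sqlite3` as a stand-in for Spark SQL, on the assumption that both follow the standard SQL rule that aggregate functions skip NULL values; the table and its three rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in table; the row with trip_id 3 plays the soft-deleted record
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, fare_amount REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, None)])

# COUNT(*) counts rows; COUNT(col), SUM and AVG skip NULL values
total_rows, fares_counted, fare_sum, fare_avg = conn.execute(
    "SELECT COUNT(*), COUNT(fare_amount), SUM(fare_amount), AVG(fare_amount) FROM trips"
).fetchone()
conn.close()

print(total_rows, fares_counted, fare_sum, fare_avg)  # 3 2 30.0 15.0
```

The row is still counted by `COUNT(*)`, which matches the unchanged record count seen after the soft delete, while the nulled-out value no longer contributes to the sum or the average.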

In [10]:
# Simply upsert the table after setting all the fields to null, except the record key, partition key and precombine key fields 
softDeleteTripDFCow.write.format("hudi"). \
  options(**hudi_soft_delete_options). \
  mode("append"). \
  save(HUDI_COW_BASE_GCS_URI)

In [11]:
# Let's do a count after the soft delete - it should be the same as before
RECORD_COUNT_AFTER_SOFT_DELETE=spark.sql(f"SELECT COUNT(*) FROM taxi_db.nyc_taxi_trips_hudi_cow WHERE trip_date='2021-01-31'").collect()[0][0]
print(f"Record count prior to delete was {RECORD_COUNT_PRIOR_TO_DELETE} and still is {RECORD_COUNT_AFTER_SOFT_DELETE}")

                                                                                

Record count prior to delete was 32604 and still is 32604


In [12]:
# Let's search for the record we attempted to soft-delete
spark.sql(f"SELECT trip_id,taxi_type,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id,trip_date " \
          f" FROM taxi_db.nyc_taxi_trips_hudi_cow " \
          f" WHERE trip_date='2021-01-31' AND trip_id={DELETE_CANDIDATE_TRIP_ID}").show(truncate=False)


+------------+---------+---------+-------------------+----------------+------------------+-------------------+----------+
|trip_id     |taxi_type|vendor_id|pickup_datetime    |dropoff_datetime|pickup_location_id|dropoff_location_id|trip_date |
+------------+---------+---------+-------------------+----------------+------------------+-------------------+----------+
|249108210053|null     |null     |2021-01-31 10:17:47|null            |null              |null               |2021-01-31|
+------------+---------+---------+-------------------+----------------+------------------+-------------------+----------+



                                                                                

In [13]:
# Full record
spark.sql(f"SELECT * " \
          f" FROM taxi_db.nyc_taxi_trips_hudi_cow " \
          f" WHERE trip_date='2021-01-31' AND trip_id={DELETE_CANDIDATE_TRIP_ID}").show(truncate=False)

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---------+---------+----------+--------+---------+-----------+---------+-------------------+----------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+---------+-------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|pickup_datetime    |dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|surcharge|mta_tax|tip_

In [14]:
# Let's check whether there is a new file - there is, and it holds the soft-deleted record along with the other records of the partition
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_date=2021-01-31

     373 B  2023-08-01T03:06:41Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690859201526064  metageneration=1
  1.29 MiB  2023-08-01T03:32:31Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-24-2890_20230801033217250.parquet#1690860751993321  metageneration=1
  1.29 MiB  2023-08-01T03:06:41Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_103-57-15447_20230731210155584.parquet#1690859201531538  metageneration=1
TOTAL: 3 objects, 2714307 bytes (2.59 MiB)
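
The data file names in this listing appear to follow the pattern `<fileId>_<writeToken>_<instantTime>.parquet`, where the instant time matches the name of the corresponding `.commit` file. Under that assumption, a small helper (the name `parse_base_file_name` is made up here) can split a listed path into its parts:

```python
from typing import NamedTuple

class BaseFileName(NamedTuple):
    file_id: str
    write_token: str
    instant_time: str

def parse_base_file_name(path: str) -> BaseFileName:
    """Split a data file path into file ID, write token and instant time.

    Assumes the layout <fileId>_<writeToken>_<instantTime>.parquet seen in the listing.
    """
    name = path.rsplit("/", 1)[-1]   # drop the bucket and partition prefix
    name = name.split("#", 1)[0]     # drop the generation suffix printed by `gsutil ls -a`
    if name.endswith(".parquet"):
        name = name[: -len(".parquet")]
    file_id, write_token, instant_time = name.rsplit("_", 2)
    return BaseFileName(file_id, write_token, instant_time)

parsed = parse_base_file_name(
    "gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/"
    "59efa789-ea60-4693-8268-59c3357fffc7-0_0-24-2890_20230801033217250.parquet#1690860751993321"
)
print(parsed.file_id)       # 59efa789-ea60-4693-8268-59c3357fffc7-0
print(parsed.write_token)   # 0-24-2890
print(parsed.instant_time)  # 20230801033217250
```

Both data files in the listing share the same file ID and differ only in write token and instant time, which is how the older and the newer version of the same file can be told apart.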


### 1.6. Study the commit log

In [15]:
LOG_FILE_LIST=!gsutil ls $HUDI_COW_BASE_GCS_URI/.hoodie/*.commit | tail -n 1 
LOG_FILE=LOG_FILE_LIST[0]
print(f"Log file FQP is {LOG_FILE}")

Log file FQP is gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/20230801033217250.commit


In [16]:
!gsutil cat $LOG_FILE

{
  "partitionToWriteStats" : {
    "trip_date=2021-01-31" : [ {
      "fileId" : "59efa789-ea60-4693-8268-59c3357fffc7-0",
      "path" : "trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-24-2890_20230801033217250.parquet",
      "prevCommit" : "20230731210155584",
      "numWrites" : 32604,
      "numDeletes" : 0,
      "numUpdateWrites" : 1,
      "numInserts" : 0,
      "totalWriteBytes" : 1356037,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "trip_date=2021-01-31",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 1356037,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"nyc_taxi_trips_hudi_cow_record\",\"namespac

## 2. Hard Delete
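This physically removes the record from the table: the latest version of the data file no longer contains it. Older file versions that still hold the record remain in storage, as the listing in section 2.4 shows.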

### 2.1. Hudi Hard Delete Options

In [17]:
hudi_hard_delete_options = {
            'hoodie.database.name': 'taxi_db',
            'hoodie.table.name': 'nyc_taxi_trips_hudi_cow',
            'hoodie.datasource.write.table.name': 'nyc_taxi_trips_hudi_cow',
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'trip_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_date',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.datasource.write.operation': 'delete',
            'hoodie.combine.before.delete': 'false',
            'hoodie.datasource.hive_sync.enable': 'true',
            'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.hive.HiveSyncTool',
            'hoodie.datasource.hive_sync.mode':'hms',
            'hoodie.datasource.hive_sync.metastore.uris':DATAPROC_METASTORE_THRIFT_URI,
            'hoodie.datasource.hive_sync.auto_create_database':'true',
            'hoodie.datasource.hive_sync.database': 'taxi_db',
            'hoodie.datasource.hive_sync.table': 'nyc_taxi_trips_hudi_cow',
            'hoodie.datasource.hive_sync.partition_fields': 'trip_date', 
            'hoodie.datasource.hive_sync.partition_extractor_class':'org.apache.hudi.hive.MultiPartKeysValueExtractor',
            'hoodie.datasource.hive_sync.use_jdbc': 'false',
            'hoodie.datasource.hive_sync.support_timestamp': 'true'
}
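
The soft delete and hard delete option dictionaries in this unit are nearly identical. As a sketch, the shared settings could be built once and the operation passed in; the helper name `build_hudi_options` and the metastore URI below are placeholders, not part of the lab.

```python
def build_hudi_options(operation, metastore_uri, extra=None):
    """Return Hudi write options for the COW table with the given write operation."""
    options = {
        'hoodie.database.name': 'taxi_db',
        'hoodie.table.name': 'nyc_taxi_trips_hudi_cow',
        'hoodie.datasource.write.table.name': 'nyc_taxi_trips_hudi_cow',
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.recordkey.field': 'trip_id',
        'hoodie.datasource.write.partitionpath.field': 'trip_date',
        'hoodie.datasource.write.precombine.field': 'pickup_datetime',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.partition.metafile.use.base.format': 'true',
        'hoodie.datasource.write.drop.partition.columns': 'true',
        'hoodie.datasource.write.operation': operation,
        'hoodie.combine.before.delete': 'false',
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.hive.HiveSyncTool',
        'hoodie.datasource.hive_sync.mode': 'hms',
        'hoodie.datasource.hive_sync.metastore.uris': metastore_uri,
        'hoodie.datasource.hive_sync.auto_create_database': 'true',
        'hoodie.datasource.hive_sync.database': 'taxi_db',
        'hoodie.datasource.hive_sync.table': 'nyc_taxi_trips_hudi_cow',
        'hoodie.datasource.hive_sync.partition_fields': 'trip_date',
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
        'hoodie.datasource.hive_sync.use_jdbc': 'false',
        'hoodie.datasource.hive_sync.support_timestamp': 'true',
    }
    options.update(extra or {})  # operation-specific additions
    return options

METASTORE_URI = "thrift://metastore-host:9083"  # placeholder, not the lab's real URI

soft = build_hudi_options("upsert", METASTORE_URI, {'hoodie.upsert.shuffle.parallelism': 2})
hard = build_hudi_options("delete", METASTORE_URI)

# Keys whose values differ between the two dictionaries
differing = sorted(k for k in set(soft) | set(hard) if soft.get(k) != hard.get(k))
print(differing)
# ['hoodie.datasource.write.operation', 'hoodie.upsert.shuffle.parallelism']
```

Seen this way, the write operation (`upsert` versus `delete`) is the setting that turns the same append-mode write into a soft or a hard delete.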


### 2.2. Execute the deletion

In [18]:
# Simply append to the table - the delete operation set in the options will physically remove the record
deleteTripDFCow=spark.sql(f"SELECT * FROM taxi_db.nyc_taxi_trips_hudi_cow WHERE trip_date='2021-01-31' AND trip_id={DELETE_CANDIDATE_TRIP_ID}")


deleteTripDFCow.write.format("hudi"). \
  options(**hudi_hard_delete_options). \
  mode("append"). \
  save(HUDI_COW_BASE_GCS_URI)


                                                                                

### 2.3. Validate deletion

In [19]:
# Let's search for the record we attempted to hard delete - it should not exist
spark.sql(f"SELECT * FROM taxi_db.nyc_taxi_trips_hudi_cow WHERE trip_date='2021-01-31' AND trip_id={DELETE_CANDIDATE_TRIP_ID}").show(truncate=False)

                                                                                

+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+----------+--------+---------+-----------+---------+---------------+----------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+---------+-------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+-------+---------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|taxi_type|trip_year|trip_month|trip_day|trip_hour|trip_minute|vendor_id|pickup_datetime|dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|d

### 2.4. Let's review the DFS

In [20]:
# Let's check whether there is a new file - there is a third data file; it does not contain the deleted record but has all the other records (in our case) for the partition path
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_date=2021-01-31

     373 B  2023-08-01T03:06:41Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/.hoodie_partition_metadata.parquet#1690859201526064  metageneration=1
  1.29 MiB  2023-08-01T03:32:31Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-24-2890_20230801033217250.parquet#1690860751993321  metageneration=1
  1.29 MiB  2023-08-01T03:33:20Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-87-9977_20230801033306138.parquet#1690860800075996  metageneration=1
  1.29 MiB  2023-08-01T03:06:41Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_103-57-15447_20230731210155584.parquet#1690859201531538  metageneration=1
TOTAL: 4 objects, 4070297 bytes (3.88 MiB)
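
The partition now holds three versions of the same data file. As a hedged sketch of how the current version can be picked out, the helper below (a made-up name, `latest_base_files`) groups the listed files by file ID and keeps the one with the highest instant time. It assumes the `<fileId>_<writeToken>_<instantTime>.parquet` naming seen in the listing and is not Hudi's own reader logic.

```python
def latest_base_files(paths):
    """Map each file ID to the name of its most recent data file."""
    latest = {}
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if name.startswith("."):      # skip .hoodie_partition_metadata.parquet
            continue
        stem = name[: -len(".parquet")] if name.endswith(".parquet") else name
        file_id, _write_token, instant_time = stem.rsplit("_", 2)
        # Instant times are fixed-width timestamps, so string comparison orders them correctly
        if file_id not in latest or instant_time > latest[file_id][0]:
            latest[file_id] = (instant_time, name)
    return {file_id: name for file_id, (_, name) in latest.items()}

# File names taken from the listing above (bucket prefix and generation suffix omitted)
partition_files = [
    "trip_date=2021-01-31/.hoodie_partition_metadata.parquet",
    "trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-24-2890_20230801033217250.parquet",
    "trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-87-9977_20230801033306138.parquet",
    "trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_103-57-15447_20230731210155584.parquet",
]

current = latest_base_files(partition_files)
print(current["59efa789-ea60-4693-8268-59c3357fffc7-0"])
# 59efa789-ea60-4693-8268-59c3357fffc7-0_0-87-9977_20230801033306138.parquet
```

The file written by the hard delete (instant `20230801033306138`) is the latest version, while the two older versions, which still contain the record, remain in the bucket.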


### 2.5. Study the commit log

In [21]:
LOG_FILE_LIST=!gsutil ls $HUDI_COW_BASE_GCS_URI/.hoodie/*.commit | tail -n 1 
LOG_FILE=LOG_FILE_LIST[0]
print(f"Log file FQP is {LOG_FILE}")

Log file FQP is gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/20230801033306138.commit


In [22]:
!gsutil cat $LOG_FILE

{
  "partitionToWriteStats" : {
    "trip_date=2021-01-31" : [ {
      "fileId" : "59efa789-ea60-4693-8268-59c3357fffc7-0",
      "path" : "trip_date=2021-01-31/59efa789-ea60-4693-8268-59c3357fffc7-0_0-87-9977_20230801033306138.parquet",
      "prevCommit" : "20230801033217250",
      "numWrites" : 32603,
      "numDeletes" : 1,
      "numUpdateWrites" : 0,
      "numInserts" : 0,
      "totalWriteBytes" : 1355990,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "trip_date=2021-01-31",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 1355990,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"nyc_taxi_trips_hudi_cow_record\",\"namespac

This concludes the unit. Please proceed to the next notebook.