# Unit 5: CRUD with COW tables

In the previous units, we created CoW and MoR table with the exact same data.<br>
In this unit, we will learn CRUD operations against COW tables.<br>


This unit takes about 15 minutes to complete.

In [59]:
from pyspark.sql.functions import lit
from functools import reduce

### Initialize Spark Session

In [1]:
spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-05-PySpark") \
  .master("yarn")\
  .enableHiveSupport()\
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/26 20:00:02 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
spark

### Declare & define variables

In [3]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]

In [4]:
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

In [5]:
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

Project ID is apache-hudi-lab
Project Number is 623600433888


In [6]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_COW_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi-cow"
DATABASE_NAME = "taxi_db"
COW_TABLE_NAME = "nyc_taxi_trips_hudi_cow"

## 1. Insert into CoW table

### 1.1. Review the Hudi metadata & note the version

In [7]:
! gsutil cat $HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties

#Properties saved on 2023-07-26T19:58:22.262893Z
#Wed Jul 26 19:58:22 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"long","

### 1.2. Study the stats of the partition - trip_year = 2022, trip_month=1, trip_day=31 

a) Layout and size

In [8]:
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690401166202282  metageneration=1
     373 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690401166492261  metageneration=1
  3.98 MiB  2023-07-26T19:52:52Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_304-78-16489_20230726194847630.parquet#1690401172531505  metageneration=1
TOTAL: 3 objects, 4174159 bytes (3.98 MiB)


b) Record count

In [9]:
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
[Stage 7:>                                                          (0 + 1) / 1]

+----------+
|trip_count|
+----------+
|88188     |
+----------+



                                                                                

### 1.3. Create a record / trip that we will use for our insert trial
We'll grab a record and change the hour of pickup and dropoff to be 5 hours later

In [10]:
# This query returns exactly one record
newTripDFCow=spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'")

In [11]:
newTripDFCow.show(truncate=False)

23/07/26 20:00:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
[Stage 10:>                                                         (0 + 1) / 1]

+-------------------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-----------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno       |_hoodie_record_key                                                                                                                                                              

                                                                                

In [12]:
newTripDFCow.printSchema

<bound method DataFrame.printSchema of DataFrame[_hoodie_commit_time: string, _hoodie_commit_seqno: string, _hoodie_record_key: string, _hoodie_partition_path: string, _hoodie_file_name: string, taxi_type: string, trip_hour: int, trip_minute: int, vendor_id: string, pickup_datetime: timestamp, dropoff_datetime: timestamp, store_and_forward: string, rate_code: string, pickup_location_id: string, dropoff_location_id: string, passenger_count: bigint, trip_distance: decimal(38,9), fare_amount: decimal(38,9), surcharge: decimal(38,9), mta_tax: decimal(38,9), tip_amount: decimal(38,9), tolls_amount: decimal(38,9), improvement_surcharge: decimal(10,0), total_amount: decimal(38,9), payment_type_code: string, congestion_surcharge: decimal(10,0), trip_type: string, ehail_fee: decimal(10,0), partition_date: date, distance_between_service: decimal(38,9), time_between_service: bigint, trip_year: string, trip_month: string, trip_day: string]>

In [13]:
import pyspark.sql.functions as F
from datetime import datetime

#Add 5 hours to an existing row - we will insert this as a new trip; While at it, lets also drop any _hoodie fields as well
newTripDFCow1 = newTripDFCow.withColumn('pickup_datetime_5', newTripDFCow.pickup_datetime + F.expr('INTERVAL 5 HOURS')).drop(F.col("pickup_datetime")).withColumnRenamed("pickup_datetime_5","pickup_datetime") \
.withColumn('dropoff_datetime_5', newTripDFCow.dropoff_datetime + F.expr('INTERVAL 5 HOURS')).drop(F.col("dropoff_datetime")).withColumnRenamed("dropoff_datetime_5","dropoff_datetime") \
.withColumn('trip_hour_5', newTripDFCow.trip_hour + 5).drop(F.col("trip_hour")).withColumnRenamed("trip_hour_5","trip_hour") \
.drop("_hoodie_commit_time").drop("_hoodie_commit_seqno").drop("_hoodie_record_key").drop("_hoodie_partition_path").drop("_hoodie_file_name")

In [14]:
newTripDFCow1.printSchema

<bound method DataFrame.printSchema of DataFrame[taxi_type: string, trip_minute: int, vendor_id: string, store_and_forward: string, rate_code: string, pickup_location_id: string, dropoff_location_id: string, passenger_count: bigint, trip_distance: decimal(38,9), fare_amount: decimal(38,9), surcharge: decimal(38,9), mta_tax: decimal(38,9), tip_amount: decimal(38,9), tolls_amount: decimal(38,9), improvement_surcharge: decimal(10,0), total_amount: decimal(38,9), payment_type_code: string, congestion_surcharge: decimal(10,0), trip_type: string, ehail_fee: decimal(10,0), partition_date: date, distance_between_service: decimal(38,9), time_between_service: bigint, trip_year: string, trip_month: string, trip_day: string, pickup_datetime: timestamp, dropoff_datetime: timestamp, trip_hour: int]>

In [15]:
# Reorder the columns to be what they were
finalTripDFCow=newTripDFCow1.select("taxi_type", "trip_hour", "trip_minute","vendor_id","pickup_datetime","dropoff_datetime","store_and_forward","rate_code","pickup_location_id","dropoff_location_id","passenger_count","trip_distance","fare_amount","surcharge","mta_tax","tip_amount","tolls_amount","improvement_surcharge","total_amount",
                                  "payment_type_code","congestion_surcharge","trip_type","ehail_fee","partition_date","distance_between_service",
                                  "time_between_service","trip_year","trip_month","trip_day")



In [16]:
# Original record
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'").show(truncate=False)

[Stage 11:>                                                         (0 + 1) / 1]

+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 07:59:48|2022-01-31 08:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+



                                                                                

In [17]:
# The record we want to insert - note its pickup_datetime and dropoff_datetime are different
finalTripDFCow.select("taxi_type","trip_year","trip_month","trip_day","vendor_id","pickup_datetime","dropoff_datetime","pickup_location_id","dropoff_location_id").show(truncate=False)

[Stage 12:>                                                         (0 + 1) / 1]

+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 12:59:48|2022-01-31 13:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+



                                                                                

In [18]:
# The full record we will insert
finalTripDFCow.show(truncate=False)

+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|taxi_type|trip_hour|trip_minute|vendor_id|pickup_datetime    |dropoff_datetime   |store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount |surcharge  |mta_tax    |tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|distance_between_service|time_between_service|trip_year|trip_month|trip_day|
+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+----------------

### 1.4. Insert the record

In [36]:
hudi_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_year,trip_month,trip_day',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.datasource.write.operation': 'insert'
        }

In [37]:
# Append to dataset in GCS
finalTripDFCow.write.format("hudi"). \
                options(**hudi_options). \
                mode("append"). \
                save(HUDI_COW_BASE_GCS_URI)

                                                                                

In [38]:
!gsutil cat $HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties

#Properties saved on 2023-07-26T19:58:22.262893Z
#Wed Jul 26 19:58:22 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"long","

In [39]:
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690401166202282  metageneration=1
     373 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690401166492261  metageneration=1
  3.98 MiB  2023-07-26T20:45:53Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-111-4256_20230726204543351.parquet#1690404353589695  metageneration=1
  3.98 MiB  2023-07-26T20:01:04Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-18-2982_20230726200050243.parquet#1690401664258994  metageneration=1
  3.98 MiB  2023-07-26T20:01:32Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729

In [40]:
# Expecting 88188+1 = 88189
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31").show(truncate=False)


+----------+
|trip_count|
+----------+
|88188     |
+----------+



In [42]:
# Original record
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'").show(truncate=False)


+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 07:59:48|2022-01-31 08:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+



In [43]:
# New record inserted
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id,_hoodie_commit_time,_hoodie_commit_seqno FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and to_timestamp(pickup_datetime)=to_timestamp('2022-01-31 12:59:48')").show(truncate=False)


+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+-------------------+--------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime|dropoff_datetime|pickup_location_id|dropoff_location_id|_hoodie_commit_time|_hoodie_commit_seqno|
+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+-------------------+--------------------+
+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+-------------------+--------------------+



### 1.5. Insert the record again - Hudi should dedupe and there should be no record count change

In [26]:
# Append to dataset in GCS
finalTripDFCow.write.format("hudi"). \
                options(**hudi_options). \
                mode("append"). \
                save(HUDI_COW_BASE_GCS_URI)

                                                                                

In [27]:
# Expecting 88188 - Hudi should have deduped
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31").show(truncate=False)


+----------+
|trip_count|
+----------+
|88188     |
+----------+



                                                                                

In [28]:
# There is a new file
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690401166202282  metageneration=1
     373 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690401166492261  metageneration=1
  3.98 MiB  2023-07-26T20:01:04Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-18-2982_20230726200050243.parquet#1690401664258994  metageneration=1
  3.98 MiB  2023-07-26T20:01:32Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-45-3007_20230726200120188.parquet#1690401692339028  metageneration=1
  3.98 MiB  2023-07-26T19:52:52Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-

In [29]:
# Lets query the record we wanted to insert
spark.sql(f"SELECT _hoodie_commit_time,_hoodie_commit_seqno,taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and to_timestamp(pickup_datetime)=to_timestamp('2022-01-31 12:59:48')").show(truncate=False)


[Stage 66:>                                                         (0 + 1) / 1]

+-------------------+--------------------+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime|dropoff_datetime|pickup_location_id|dropoff_location_id|
+-------------------+--------------------+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+
+-------------------+--------------------+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+



                                                                                

#### Potential bug in Hudi OR my code: Insert did not work

The insert did not take. It has been reported to the Hudi community.
https://github.com/apache/hudi/issues/9294https://github.com/apache/hudi/issues/9294

## 2. Upsert into CoW table

Lets attempt an insert of a record via the upsert operation:

In [30]:
hudi_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.ComplexKeyGenerator',
            'hoodie.datasource.write.recordkey.field': 'taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_year,trip_month,trip_day',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.datasource.write.operation': 'upsert'
        }

In [31]:
# Original record
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'").show(truncate=False)


+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 07:59:48|2022-01-31 08:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+



In [32]:
# Append to dataset in GCS
finalTripDFCow.write.format("hudi"). \
                options(**hudi_options). \
                mode("append"). \
                save(HUDI_COW_BASE_GCS_URI)

                                                                                

In [33]:
# Expecting 88189 
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31").show(truncate=False)


[Stage 103:>                                                        (0 + 1) / 1]

+----------+
|trip_count|
+----------+
|88188     |
+----------+



                                                                                

In [34]:
# There is a new file
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690401166202282  metageneration=1
     373 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690401166492261  metageneration=1
  3.98 MiB  2023-07-26T20:01:04Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-18-2982_20230726200050243.parquet#1690401664258994  metageneration=1
  3.98 MiB  2023-07-26T20:01:32Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-45-3007_20230726200120188.parquet#1690401692339028  metageneration=1
  3.98 MiB  2023-07-26T20:33:12Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-

In [35]:
# Lets query the record we wanted to insert
spark.sql(f"SELECT _hoodie_commit_time,_hoodie_commit_seqno,taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and to_timestamp(pickup_datetime)=to_timestamp('2022-01-31 12:59:48')").show(truncate=False)


[Stage 106:>                                                        (0 + 1) / 1]

+-------------------+--------------------+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime|dropoff_datetime|pickup_location_id|dropoff_location_id|
+-------------------+--------------------+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+
+-------------------+--------------------+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+



                                                                                

#### Potential bug in Hudi OR my code: Upsert (insert aspect) did not work

## 3. Delete a record

Apache Hudi supports two types of deletes:<br>
Soft Deletes: This retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.<br>
Hard Deletes: This physically removes any trace of the record from the table. Check out the deletion section for more details.<br>

### 3.1. Soft Delete
This retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.

In [52]:
# Lets do a count before we soft delete
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2021 AND trip_month=1 AND trip_day=31").show(truncate=False)


+----------+
|trip_count|
+----------+
|32604     |
+----------+



In [51]:
# Lets select a record to delete - the query below returns exactly one record
deleteTripDFCow=spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2021 AND trip_month=1 AND trip_day=31 and vendor_id=1 AND pickup_datetime='2021-01-31 18:16:08'")

# Here is the "before" the soft delete
deleteTripDFCow.show()

+-------------------+--------------------+--------------------+----------------------+--------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount|  surcharge|    mta_tax| tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|distance_be

In [65]:
hudi_soft_delete_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_year,trip_month,trip_day',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.operation': 'upsert',
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.upsert.shuffle.parallelism': 2, 
            'hoodie.insert.shuffle.parallelism': 2,
            'hoodie.combine.before.delete': 'false'
}

In [66]:
meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", \
  "_hoodie_partition_path", "_hoodie_file_name"]
excluded_columns = meta_columns + ["pickup_datetime"]

In [67]:
# prepare the soft deletes by ensuring the appropriate fields are nullified
nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, \
  list(map(lambda field: (field.name, field.dataType), deleteTripDFCow.schema.fields))))

In [68]:
softDeleteTripDFCow = reduce(lambda df,col: df.withColumn(col[0], lit(None).cast(col[1])), \
  nullify_columns, reduce(lambda df,col: df.drop(col[0]), meta_columns, deleteTripDFCow))

In [69]:
# Lets look at the record we want to soft delete
softDeleteTripDFCow.show(truncate=False)

+-------------------+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-----------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+----------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+---------+-------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno    |_hoodie_record_key                                                                                                                                                                  |_hoodie_part

                                                                                

In [70]:
# Simply upsert the table after setting all the fields to null, except the record key, partition key and precombine key fields 
softDeleteTripDFCow.write.format("hudi"). \
  options(**hudi_soft_delete_options). \
  mode("append"). \
  save(HUDI_COW_BASE_GCS_URI)

                                                                                

In [72]:
# Lets do a count after we soft delete
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2021 AND trip_month=1 AND trip_day=31").show(truncate=False)


+----------+
|trip_count|
+----------+
|32604     |
+----------+



In [76]:
spark.sql(f"REFRESH TABLE {DATABASE_NAME}.{COW_TABLE_NAME};").show(truncate=False)

                                                                                

++
||
++
++



In [77]:
# Lets search for the record we attempted to soft-delete
spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2021 AND trip_month=1 AND trip_day=31 and vendor_id=1 AND pickup_datetime='2021-01-31 18:16:08'").show(truncate=False)

                                                                                

+-------------------+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-----------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno    |_hoodie_record_key                                                                                                                                                                  |_h

In [81]:
# Lets check to see if there is a new file
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690401166202282  metageneration=1
     373 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690401166492261  metageneration=1
  3.98 MiB  2023-07-26T20:45:53Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-111-4256_20230726204543351.parquet#1690404353589695  metageneration=1
  3.98 MiB  2023-07-26T20:01:04Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-18-2982_20230726200050243.parquet#1690401664258994  metageneration=1
  3.98 MiB  2023-07-26T20:01:32Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729

#### How does this help?
When you have to scrub data but not lose the trace of it, you can nullify the columns, and avoid reflection of it in say, aggregation operations on the table. 

#### Potential bug in Hudi or my code: Soft delete too did not work

### 3.2. Hard Delete
This physically removes any trace of the record from the table. Check out the deletion section for more details.

In [78]:
hudi_hard_delete_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.recordkey.field': 'taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_year,trip_month,trip_day',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.operation': 'delete',
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.upsert.shuffle.parallelism': 2, 
            'hoodie.insert.shuffle.parallelism': 2,
            'hoodie.combine.before.delete': 'false'
}

In [79]:
# Simply append to the table - the delete setting in the options will remove physical trace of the record
softDeleteTripDFCow.write.format("hudi"). \
  options(**hudi_hard_delete_options). \
  mode("append"). \
  save(HUDI_COW_BASE_GCS_URI)

                                                                                

In [80]:
# Lets search for the record we attempted to hard-delete
spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2021 AND trip_month=1 AND trip_day=31 and vendor_id=1 AND pickup_datetime='2021-01-31 18:16:08'").show(truncate=False)

[Stage 281:>                                                        (0 + 1) / 1]

+-------------------+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-----------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno    |_hoodie_record_key                                                                                                                                                                  |_h

                                                                                

In [82]:
# Lets check to see if there is a new file
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690401166202282  metageneration=1
     373 B  2023-07-26T19:52:46Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690401166492261  metageneration=1
  3.98 MiB  2023-07-26T20:45:53Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-111-4256_20230726204543351.parquet#1690404353589695  metageneration=1
  3.98 MiB  2023-07-26T20:01:04Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729-ad22-4b4a-860e-3dd7d475473d-0_0-18-2982_20230726200050243.parquet#1690401664258994  metageneration=1
  3.98 MiB  2023-07-26T20:01:32Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/5ee09729

#### Potential bug in Hudi or my code: Hard delete too did not work

## Will a fresh read work?

In [86]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi-cow/"

brandNewDF=spark.read.format("hudi").load(HUDI_BASE_GCS_URI)

                                                                                

In [87]:
brandNewDF.createOrReplaceTempView("temp_taxi_trips")

In [88]:
# Lets search for the record we attempted to hard-delete
spark.sql(f"SELECT * FROM temp_taxi_trips WHERE trip_year=2021 AND trip_month=1 AND trip_day=31 and vendor_id=1 AND pickup_datetime='2021-01-31 18:16:08'").show(truncate=False)

[Stage 289:>                                                        (0 + 1) / 1]

+-------------------+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-----------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno    |_hoodie_record_key                                                                                                                                                                  |_h

                                                                                