# Unit 5: CRUD with COW tables

In the previous units, we created CoW and MoR tables with exactly the same data.<br>
In this unit, we will learn CRUD operations against CoW tables.<br>


This unit takes about 15 minutes to complete.

### Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-05-PySpark") \
  .master("yarn")\
  .enableHiveSupport()\
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/26 03:06:41 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
spark

### Declare & define variables

In [3]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]

In [4]:
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

In [5]:
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

Project ID is apache-hudi-lab
Project Number is 623600433888


In [6]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_COW_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi-cow"
DATABASE_NAME = "taxi_db"
COW_TABLE_NAME = "nyc_taxi_trips_hudi_cow"

## 1. Insert into CoW table

### 1.1. Review the Hudi metadata & note the version

In [7]:
! gsutil cat $HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties

#Properties saved on 2023-07-26T02:20:55.112345Z
#Wed Jul 26 02:20:55 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"long","
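The properties file is plain `key=value` text, so it can be inspected programmatically once downloaded. Below is a minimal sketch, assuming the file content has been read into a string; `parse_properties` is a hypothetical helper (not part of Hudi), and the sample text is copied from the output above with the truncated schema line left out. It does not handle Java-properties escapes such as `\:`.

```python
# Sample lines copied from the hoodie.properties output above (schema line omitted).
SAMPLE = """\
#Properties saved on 2023-07-26T02:20:55.112345Z
#Wed Jul 26 02:20:55 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping comment and blank lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(SAMPLE)
partition_fields = props["hoodie.table.partition.fields"].split(",")
print(props["hoodie.table.type"])
print(partition_fields)
```

Looking up keys this way makes it easy to compare the table type, precombine field, and partition fields before and after a write.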

### 1.2. Study the stats of the partition - trip_year=2022, trip_month=1, trip_day=31

a) Layout and size

In [8]:
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T00:33:39Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690331619722217  metageneration=1
     373 B  2023-07-26T00:33:40Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690331620360312  metageneration=1
  4.16 MiB  2023-07-26T00:34:01Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/d1744139-a721-4363-9e1e-0a7a3e01cc7b-0_304-127-15249_20230726002529414.parquet#1690331641975958  metageneration=1
TOTAL: 3 objects, 4362057 bytes (4.16 MiB)


b) Record count

In [9]:
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31").show(truncate=False)

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
                                                                                

+----------+
|trip_count|
+----------+
|87798     |
+----------+



### 1.3. Quick visual of the table

In [10]:
spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 LIMIT 1").show(truncate=False)

23/07/26 03:07:52 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+------------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key                                                                                                                                                                   |_

                                                                                

### 1.4. Create a record / trip that we will use for our insert trial
We'll grab an existing record and shift its pickup and dropoff times (and its trip_hour) 5 hours later.

In [48]:
# This query returns exactly one record
newTripDFCow=spark.sql(f"SELECT * FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'")

In [49]:
newTripDFCow.show(truncate=False)

+-------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+------------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key                                                                                                                                                                   |_

In [50]:
newTripDFCow.printSchema

<bound method DataFrame.printSchema of DataFrame[_hoodie_commit_time: string, _hoodie_commit_seqno: string, _hoodie_record_key: string, _hoodie_partition_path: string, _hoodie_file_name: string, taxi_type: string, trip_hour: int, trip_minute: int, vendor_id: string, pickup_datetime: timestamp, dropoff_datetime: timestamp, store_and_forward: string, rate_code: string, pickup_location_id: string, dropoff_location_id: string, passenger_count: bigint, trip_distance: decimal(38,9), fare_amount: decimal(38,9), surcharge: decimal(38,9), mta_tax: decimal(38,9), tip_amount: decimal(38,9), tolls_amount: decimal(38,9), improvement_surcharge: decimal(10,0), total_amount: decimal(38,9), payment_type_code: string, congestion_surcharge: decimal(10,0), trip_type: string, ehail_fee: decimal(10,0), partition_date: date, distance_between_service: decimal(38,9), time_between_service: bigint, trip_year: string, trip_month: string, trip_day: string]>

In [59]:
import pyspark.sql.functions as F
from datetime import datetime

# Add 5 hours to an existing row; we will insert this as a new trip. While at it, let's also drop the _hoodie metadata fields.
newTripDFCow1 = newTripDFCow.withColumn('pickup_datetime_5', newTripDFCow.pickup_datetime + F.expr('INTERVAL 5 HOURS')).drop(F.col("pickup_datetime")).withColumnRenamed("pickup_datetime_5","pickup_datetime") \
.withColumn('dropoff_datetime_5', newTripDFCow.dropoff_datetime + F.expr('INTERVAL 5 HOURS')).drop(F.col("dropoff_datetime")).withColumnRenamed("dropoff_datetime_5","dropoff_datetime") \
.withColumn('trip_hour_5', newTripDFCow.trip_hour + 5).drop(F.col("trip_hour")).withColumnRenamed("trip_hour_5","trip_hour") \
.drop("_hoodie_commit_time").drop("_hoodie_commit_seqno").drop("_hoodie_record_key").drop("_hoodie_partition_path").drop("_hoodie_file_name")

In [60]:
newTripDFCow1.printSchema

<bound method DataFrame.printSchema of DataFrame[taxi_type: string, trip_minute: int, vendor_id: string, store_and_forward: string, rate_code: string, pickup_location_id: string, dropoff_location_id: string, passenger_count: bigint, trip_distance: decimal(38,9), fare_amount: decimal(38,9), surcharge: decimal(38,9), mta_tax: decimal(38,9), tip_amount: decimal(38,9), tolls_amount: decimal(38,9), improvement_surcharge: decimal(10,0), total_amount: decimal(38,9), payment_type_code: string, congestion_surcharge: decimal(10,0), trip_type: string, ehail_fee: decimal(10,0), partition_date: date, distance_between_service: decimal(38,9), time_between_service: bigint, trip_year: string, trip_month: string, trip_day: string, pickup_datetime: timestamp, dropoff_datetime: timestamp, trip_hour: int]>

In [61]:
# Restore the original column order
finalTripDFCow=newTripDFCow1.select("taxi_type", "trip_hour", "trip_minute","vendor_id","pickup_datetime","dropoff_datetime","store_and_forward","rate_code","pickup_location_id","dropoff_location_id","passenger_count","trip_distance","fare_amount","surcharge","mta_tax","tip_amount","tolls_amount","improvement_surcharge","total_amount",
                                  "payment_type_code","congestion_surcharge","trip_type","ehail_fee","partition_date","distance_between_service",
                                  "time_between_service","trip_year","trip_month","trip_day")



In [62]:
# Original record
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'").show(truncate=False)

+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 07:59:48|2022-01-31 08:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+



In [63]:
# The record we want to insert - note its pickup_datetime and dropoff_datetime are different
finalTripDFCow.select("taxi_type","trip_year","trip_month","trip_day","vendor_id","pickup_datetime","dropoff_datetime","pickup_location_id","dropoff_location_id").show(truncate=False)

+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 12:59:48|2022-01-31 13:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
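The 5-hour shift can be checked with plain Python, independent of Spark. This is a small sketch using the two timestamps from the original record; the `shift` helper is hypothetical and only illustrates the arithmetic.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
SHIFT = timedelta(hours=5)


def shift(ts):
    """Return the timestamp string moved forward by SHIFT."""
    return (datetime.strptime(ts, FMT) + SHIFT).strftime(FMT)


# Pickup and dropoff values taken from the original record shown above.
new_pickup = shift("2022-01-31 07:59:48")
new_dropoff = shift("2022-01-31 08:43:10")
print(new_pickup, new_dropoff)

# Both shifted values stay on the same calendar day, so the partition
# columns (trip_year, trip_month, trip_day) do not need to change.
same_day = new_pickup[:10] == new_dropoff[:10] == "2022-01-31"
```

Because the shifted trip stays on 2022-01-31, the new record targets the same partition as the original one.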



In [64]:
# The full record we will insert
finalTripDFCow.show(truncate=False)

+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|taxi_type|trip_hour|trip_minute|vendor_id|pickup_datetime    |dropoff_datetime   |store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance|fare_amount |surcharge  |mta_tax    |tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|distance_between_service|time_between_service|trip_year|trip_month|trip_day|
+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+----------------

In [65]:
hudi_options = {
            'hoodie.database.name': DATABASE_NAME,
            'hoodie.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.name': COW_TABLE_NAME,
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.CustomKeyGenerator',
            'hoodie.datasource.write.recordkey.field': 'taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id',
            'hoodie.datasource.write.partitionpath.field': 'trip_year:SIMPLE,trip_month:SIMPLE,trip_day:SIMPLE',
            'hoodie.datasource.write.precombine.field': 'pickup_datetime',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.partition.metafile.use.base.format': 'true', 
            'hoodie.datasource.write.drop.partition.columns': 'true',
            'hoodie.datasource.write.operation': 'insert'
    
        }
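With `hive_style_partitioning` set to `true`, the partition fields appear in the storage path as `name=value` segments, which matches the `gsutil` listing earlier in this unit. The sketch below only illustrates that observed layout; `hive_style_partition_path` is a hypothetical helper and not Hudi's implementation.

```python
# Same value as 'hoodie.datasource.write.partitionpath.field' above.
PARTITION_FIELD_SPEC = "trip_year:SIMPLE,trip_month:SIMPLE,trip_day:SIMPLE"


def hive_style_partition_path(spec, row):
    """Build a name=value/... path from a 'field:TYPE,...' spec and a row dict."""
    # Keep only the field name; the ':SIMPLE' suffix is the key type.
    names = [part.split(":")[0] for part in spec.split(",")]
    return "/".join(f"{name}={row[name]}" for name in names)


# Partition values of the record we are about to insert.
row = {"trip_year": "2022", "trip_month": "1", "trip_day": "31"}
path = hive_style_partition_path(PARTITION_FIELD_SPEC, row)
print(path)
```

The resulting relative path is the same directory that was listed under `HUDI_COW_BASE_GCS_URI` in section 1.2.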

In [66]:
# Append to GCS
finalTripDFCow.write.format("hudi"). \
                options(**hudi_options). \
                mode("append"). \
                save(HUDI_COW_BASE_GCS_URI)

                                                                                

In [67]:
!gsutil cat $HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties

#Properties saved on 2023-07-26T02:20:55.112345Z
#Wed Jul 26 02:20:55 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"long","

In [68]:
!gsutil ls -alh $HUDI_COW_BASE_GCS_URI/trip_year=2022/trip_month=1/trip_day=31

       0 B  2023-07-26T00:33:39Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/#1690331619722217  metageneration=1
     373 B  2023-07-26T00:33:40Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/.hoodie_partition_metadata.parquet#1690331620360312  metageneration=1
  4.16 MiB  2023-07-26T03:26:25Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/d1744139-a721-4363-9e1e-0a7a3e01cc7b-0_0-84-3052_20230726032612735.parquet#1690341985503707  metageneration=1
  4.16 MiB  2023-07-26T00:34:01Z  gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/trip_year=2022/trip_month=1/trip_day=31/d1744139-a721-4363-9e1e-0a7a3e01cc7b-0_304-127-15249_20230726002529414.parquet#1690331641975958  metageneration=1
TOTAL: 4 objects, 8722354 bytes (8.32 MiB)
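The listing now shows two Parquet files in the partition. A short sketch can compare them, assuming the names follow the pattern `<file id>_<write token>_<commit instant>.parquet`, which is my understanding of Hudi's base file naming. `parse_base_file` is a hypothetical helper; the names and byte totals are copied from the two listings in this unit.

```python
# File names and byte totals copied from the gsutil listings before and after the write.
OLD_FILE = "d1744139-a721-4363-9e1e-0a7a3e01cc7b-0_304-127-15249_20230726002529414.parquet"
NEW_FILE = "d1744139-a721-4363-9e1e-0a7a3e01cc7b-0_0-84-3052_20230726032612735.parquet"
TOTAL_BEFORE = 4362057          # 3 objects
TOTAL_AFTER = 8722354           # 4 objects
PARTITION_METADATA_BYTES = 373  # .hoodie_partition_metadata.parquet


def parse_base_file(name):
    """Split a base file name into file id, write token, and commit instant."""
    stem = name.rsplit(".", 1)[0]
    file_id, write_token, instant_time = stem.split("_")
    return {"file_id": file_id, "write_token": write_token, "instant_time": instant_time}


old = parse_base_file(OLD_FILE)
new = parse_base_file(NEW_FILE)

# Same file id with a later instant: a new version of the same file.
same_file_id = old["file_id"] == new["file_id"]
newer_instant = new["instant_time"] > old["instant_time"]

# The 0 B object is the directory placeholder, so the rest is the data file.
old_file_bytes = TOTAL_BEFORE - PARTITION_METADATA_BYTES
new_file_bytes = TOTAL_AFTER - TOTAL_BEFORE
print(same_file_id, newer_instant, old_file_bytes, new_file_bytes)
```

Both files share the same file id and are almost the same size, which is consistent with copy-on-write behavior: the write produced a full new version of the file rather than a small delta.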


In [69]:
# Expecting 87798+1 = 87799
spark.sql(f"SELECT COUNT(*) as trip_count FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31").show(truncate=False)

+----------+
|trip_count|
+----------+
|87798     |
+----------+



In [73]:
# Original record
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 07:59:48'").show(truncate=False)


+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime    |dropoff_datetime   |pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+
|yellow   |2022     |1         |31      |1        |2022-01-31 07:59:48|2022-01-31 08:43:10|139               |16                 |
+---------+---------+----------+--------+---------+-------------------+-------------------+------------------+-------------------+



In [77]:
# The new record is not returned, and no error was raised during the write. What happened here?
spark.sql(f"SELECT taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_datetime,dropoff_datetime,pickup_location_id,dropoff_location_id FROM {DATABASE_NAME}.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 and vendor_id=1 and pickup_datetime='2022-01-31 12:59:48'").show(truncate=False)


+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+
|taxi_type|trip_year|trip_month|trip_day|vendor_id|pickup_datetime|dropoff_datetime|pickup_location_id|dropoff_location_id|
+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+
+---------+---------+----------+--------+---------+---------------+----------------+------------------+-------------------+

