# Unit 3: Creating/Inserting into Hudi tables

In unit 2, we learned about the Hudi table types.

In this unit, we will:
1. Alter a Hudi table name and observe metadata changes
2. Insert fresh data into our existing Hudi Copy on Write table from prior units
3. Create a Hudi Merge on Read (MoR) table using CTAS with Spark SQL
4. Insert fresh data into the Hudi Merge on Read table we created

### Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession

# Build (or reuse) a Hive-enabled Spark session with the Hudi catalog and SQL extensions
spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-03-PySpark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/07 20:37:18 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
spark

### Declare & define variables

In [16]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]

In [17]:
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

In [18]:
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

Project ID is apache-hudi-lab
Project Number is 623600433888


In [26]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_COW_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi/"
DATABASE_NAME = "taxi_db"
COW_TABLE_NAME = "nyc_taxi_trips_hudi_cow"
MOR_TABLE_NAME = "nyc_taxi_trips_hudi_mor"
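
The variables above are combined later into fully qualified table names and Cloud Storage paths. A minimal sketch of that composition follows. The helper names `qualified_table_name` and `hoodie_properties_uri` are hypothetical and not part of the notebook, and the project number is hard-coded from the output printed earlier so that the snippet runs on its own.

```python
# Values mirror the notebook; the project number is the one printed earlier.
PROJECT_NBR = "623600433888"
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_COW_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi/"
DATABASE_NAME = "taxi_db"
COW_TABLE_NAME = "nyc_taxi_trips_hudi_cow"


def qualified_table_name(database, table):
    """Hypothetical helper: build the database.table identifier used in SQL."""
    return f"{database}.{table}"


def hoodie_properties_uri(base_uri):
    """Hypothetical helper: locate hoodie.properties under a table base path."""
    return base_uri.rstrip("/") + "/.hoodie/hoodie.properties"


print(qualified_table_name(DATABASE_NAME, COW_TABLE_NAME))
print(hoodie_properties_uri(HUDI_COW_BASE_GCS_URI))
```

Building these strings in one place keeps the SQL statements and the gsutil commands pointed at the same table.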

## 1. Insert into Copy on Write (CoW) table

In unit 2, we created a CoW table. In this section, we will rename the table to include the suffix _cow and then insert fresh data into it.

### 1.1. Review the database and table entry in the managed Hive Metastore

In [6]:
spark.sql("SHOW databases;").show()

+---------+
|namespace|
+---------+
|  default|
|  taxi_db|
+---------+



In [28]:
spark.sql("SHOW tables IN taxi_db;").show()

+---------+-------------------+-----------+
|namespace|          tableName|isTemporary|
+---------+-------------------+-----------+
|  taxi_db|nyc_taxi_trips_hudi|      false|
+---------+-------------------+-----------+



Now that we know the name of the original table in the Hive Metastore, let's take a quick peek at the Hoodie metadata properties file in Cloud Storage, where Hudi colocates the metadata with the data.

In [29]:
! gsutil cat "gs://gaia_data_bucket-$PROJECT_NBR/nyc-taxi-trips-hudi/.hoodie/hoodie.properties"

#Properties saved on 2023-07-07T21:19:00.611214Z
#Fri Jul 07 21:19:00 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"
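
The properties file is plain `key=value` text, so individual settings can be pulled out programmatically instead of read by eye. The sketch below is a simplified parser written for illustration: `parse_properties` is a hypothetical helper, it handles only comment lines, `key=value` pairs, and the `\:` escape visible in the output above, and it is not a full implementation of the Java properties format. The sample text reuses a few lines from the output above.

```python
def parse_properties(text):
    """Parse simple key=value properties text into a dict (simplified sketch)."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        # Undo the escaped colons seen in the schema value
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


sample = """#Properties saved on 2023-07-07T21:19:00.611214Z
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
"""

props = parse_properties(sample)
partition_fields = props["hoodie.table.partition.fields"].split(",")
print(props["hoodie.table.type"])
print(partition_fields)
```

In the notebook, the input text could come from capturing the gsutil output into a Python list (the same `!` capture pattern used for the project ID) and joining the lines with newlines.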

### 1.2. Rename the table taxi_db.nyc_taxi_trips_hudi to taxi_db.nyc_taxi_trips_hudi_cow

Let's run the table rename in the Hive Metastore. This should also trigger metadata edits in Cloud Storage, in the Hudi metadata colocated with the Hudi dataset.

In [30]:
spark.sql(f"ALTER table taxi_db.nyc_taxi_trips_hudi rename to taxi_db.{COW_TABLE_NAME};").show()

++
||
++
++



In [33]:
spark.sql("SHOW tables IN taxi_db;").show(truncate=False)

+---------+-----------------------+-----------+
|namespace|tableName              |isTemporary|
+---------+-----------------------+-----------+
|taxi_db  |nyc_taxi_trips_hudi_cow|false      |
+---------+-----------------------+-----------+
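
Rather than reading the listing by eye, the rename can be checked in code. The following is a minimal sketch, assuming the `tableName` column shown in the output above. `table_exists` is a hypothetical helper, not a Hudi or Spark API, and it expects the notebook's `spark` session to be passed in.

```python
def table_exists(spark, database, table):
    """Hypothetical helper: True if `table` appears in SHOW TABLES for `database`."""
    rows = spark.sql(f"SHOW TABLES IN {database}").collect()
    # Each row exposes the tableName column seen in the SHOW TABLES output
    return any(row["tableName"] == table for row in rows)
```

In the notebook, `table_exists(spark, DATABASE_NAME, COW_TABLE_NAME)` should return True after the rename, and the same call with the old name `nyc_taxi_trips_hudi` should return False.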



Notice that the operation takes a couple of minutes to complete even though it is just a rename, applied both in the Hive Metastore and in Cloud Storage.

Let's first make sure we can still query the table, and then review the metadata change in Cloud Storage, paying specific attention to the property hoodie.table.name.

In [37]:
spark.sql(f"SELECT * FROM taxi_db.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 LIMIT 1;").show(truncate=False)

+-------------------+------------------------+------------------------------------------------------------------------------------------------------------------+---------------------------------------+------------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno    |_hoodie_record_key                                                                                                |_hoodie_partition_path                 |_hoodie_file_name                                                             |taxi_type|trip

In [24]:
! gsutil cat "gs://gaia_data_bucket-$PROJECT_NBR/nyc-taxi-trips-hudi/.hoodie/hoodie.properties"

#Properties saved on 2023-07-07T21:03:21.191736Z
#Fri Jul 07 21:03:21 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"
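
To see exactly which properties a rename touches, the before and after snapshots of hoodie.properties can be compared key by key. The sketch below assumes both snapshots have already been parsed into dictionaries. `changed_properties` is a hypothetical helper, and the `hoodie.table.name` values are illustrative, based on the table names used in this unit rather than copied from the truncated output above.

```python
def changed_properties(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Illustrative snapshots only; not captured from the notebook output
before = {
    "hoodie.table.type": "COPY_ON_WRITE",
    "hoodie.table.name": "nyc_taxi_trips_hudi",
}
after = {
    "hoodie.table.type": "COPY_ON_WRITE",
    "hoodie.table.name": "nyc_taxi_trips_hudi_cow",
}

for key, (old, new) in changed_properties(before, after).items():
    print(f"{key}: {old} -> {new}")
```

A key that exists in only one snapshot is reported with None on the missing side.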

This concludes unit 3. Proceed to the next notebook.

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<IPython.core.display.Javascript object>