# Unit 3: Altering Hudi tables in the Hive Metastore

In Unit 2, we learned about the Hudi table types. In this unit, we will rename a Hudi table and observe the resulting metadata changes.

### Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession

# Build (or reuse) a Hive-enabled Spark session with the Hudi catalog and SQL extensions
spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-03-PySpark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/25 16:22:33 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
spark

### Declare & define variables

In [3]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]

In [4]:
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

In [5]:
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

Project ID is apache-hudi-lab
Project Number is 623600433888


In [6]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
HUDI_COW_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi-cow"
DATABASE_NAME = "taxi_db"
COW_TABLE_NAME = "nyc_taxi_trips_hudi_cow"

## 1. Alter the Hudi table name to include the cow suffix

In Unit 2, we created a CoW (Copy-on-Write) table. In this section, we will rename the table so that its name ends with the cow suffix.

### 1.1. Review the database and table entry in the managed Hive Metastore

In [7]:
spark.sql("SHOW databases;").show()

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used


+---------+
|namespace|
+---------+
|  default|
|  taxi_db|
+---------+



In [8]:
spark.sql("SHOW tables IN taxi_db;").show(truncate=False)

+---------+-----------------------+-----------+
|namespace|tableName              |isTemporary|
+---------+-----------------------+-----------+
|taxi_db  |nyc_taxi_trips_hudi_mor|false      |
+---------+-----------------------+-----------+



Now that we know the name of the original table in the Hive Metastore, let's take a quick look at the Hoodie metadata properties file in Cloud Storage, where Hudi colocates the metadata with the data.

### 1.2. Review the hoodie.properties file colocated with the Hudi dataset

Study the structure of the properties file carefully and note the database and table name.

In [9]:
print(HUDI_COW_BASE_GCS_URI)

gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow


In [10]:
! gsutil cat $HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties

#Properties saved on 2023-07-25T03:46:19.627732Z
#Tue Jul 25 03:46:19 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"long","

### 1.3. Rename the Hudi table taxi_db.nyc_taxi_trips_hudi to taxi_db.nyc_taxi_trips_hudi_cow

Let's run the table rename in the Hive Metastore. This should also trigger metadata edits in Cloud Storage, in the Hudi metadata colocated with the Hudi dataset.

In [12]:
spark.sql(f"ALTER table taxi_db.nyc_taxi_trips_hudi rename to taxi_db.{COW_TABLE_NAME};").show()

23/07/25 16:24:59 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


++
||
++
++



In [13]:
spark.sql("SHOW tables IN taxi_db;").show(truncate=False)

+---------+-----------------------+-----------+
|namespace|tableName              |isTemporary|
+---------+-----------------------+-----------+
|taxi_db  |nyc_taxi_trips_hudi_cow|false      |
|taxi_db  |nyc_taxi_trips_hudi_mor|false      |
+---------+-----------------------+-----------+



Notice that the operation takes a couple of minutes to complete even though it is only a rename, applied both in the Hive Metastore and in Cloud Storage. After confirming that the table is still queryable, we will review the metadata change in Cloud Storage, paying specific attention to the property hoodie.table.name.

### 1.4. Query the table & study the first five non-data columns
Check whether you can still query the table. Then study the Hudi-specific metadata columns:

- _hoodie_commit_time - Timestamp column reflecting the commit time
- _hoodie_commit_seqno - Commit sequence number
- _hoodie_record_key - Record key
- _hoodie_partition_path - Materialized Hive partition path
- _hoodie_file_name - Hudi Parquet file name

Learn more about these columns from the Hudi documentation.
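Because `SELECT *` returns a very wide row, it can help to project only the five metadata columns listed above. The following is a minimal sketch of one way to do that; the helper name `build_meta_query` and the constant `HUDI_META_COLUMNS` are hypothetical names introduced for illustration and are not part of Hudi or of this lab.

```python
# Hypothetical helper (not part of Hudi or this lab): builds a query that
# selects only the Hudi metadata columns described above.
HUDI_META_COLUMNS = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
]


def build_meta_query(database, table, where=None, limit=5):
    """Return a SQL string projecting only the Hudi metadata columns."""
    cols = ", ".join(HUDI_META_COLUMNS)
    query = f"SELECT {cols} FROM {database}.{table}"
    if where:
        query += f" WHERE {where}"
    return f"{query} LIMIT {int(limit)}"


query = build_meta_query(
    "taxi_db",
    "nyc_taxi_trips_hudi_cow",
    where="trip_year=2022 AND trip_month=1 AND trip_day=31",
)
print(query)

# In the notebook, the string can then be passed to the existing session:
# spark.sql(query).show(truncate=False)
```

Filtering on the partition columns, as the lab's own query does, should keep the scan limited to a single partition.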

In [14]:
spark.sql(f"SELECT * FROM taxi_db.{COW_TABLE_NAME} WHERE trip_year=2022 AND trip_month=1 AND trip_day=31 LIMIT 1;").show(truncate=False)

23/07/25 16:25:07 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+------------------------------------------------------------------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+-----------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno   |_hoodie_record_key                                                                                                                                                                  |_hoo

### 1.5. Review the hoodie.properties file to ensure it reflects the table name change 

In [15]:
! gsutil cat "$HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties"

#Properties saved on 2023-07-25T16:24:37.815856Z
#Tue Jul 25 16:24:37 UTC 2023
hoodie.table.type=COPY_ON_WRITE
hoodie.table.metadata.partitions=files
hoodie.table.precombine.field=pickup_datetime
hoodie.table.partition.fields=trip_year,trip_month,trip_day
hoodie.archivelog.folder=archived
hoodie.table.create.schema={"type"\:"record","name"\:"topLevelRecord","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"taxi_type","type"\:["string","null"]},{"name"\:"trip_hour","type"\:["int","null"]},{"name"\:"trip_minute","type"\:["int","null"]},{"name"\:"vendor_id","type"\:["string","null"]},{"name"\:"pickup_datetime","type"\:[{"type"\:"long","logicalType"\:"timestamp-micros"},"null"]},{"name"\:"dropoff_datetime","type"\:[{"type"\:"long","
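The schema line dominates the raw output, so the table name is easy to miss. As a rough sketch, the properties text can be parsed into a dictionary and the relevant keys read directly. The function `parse_properties` is a hypothetical helper, and the `sample` text is illustrative rather than captured output; `hoodie.table.name` is the property named earlier in this unit, while the `hoodie.database.name` key is an assumption about what the file contains.

```python
# Hypothetical helper: parses simple key=value lines of a Java-style
# .properties file. It does not handle line continuations or ':' separators.
def parse_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines ('#' or '!')
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Undo the '\:' escaping that the properties format applies to colons
            props[key.strip()] = value.strip().replace("\\:", ":")
    return props


# Illustrative sample only (assumed content, not real output from the lab).
sample = """# illustrative sample
hoodie.table.type=COPY_ON_WRITE
hoodie.table.name=nyc_taxi_trips_hudi_cow
hoodie.database.name=taxi_db
"""

props = parse_properties(sample)
print(props.get("hoodie.database.name"), props.get("hoodie.table.name"))

# In the notebook, the text could come from the same gsutil command, e.g.:
# raw = !gsutil cat "$HUDI_COW_BASE_GCS_URI/.hoodie/hoodie.properties"
# props = parse_properties("\n".join(raw))
```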

This concludes Unit 3. Proceed to the next notebook.

In [None]:
%%javascript
Jupyter.notebook.session.delete();