# Unit 6: Reading Hudi "Merge On Read" datasets with PySpark
In Module 2, we created a Hudi "Merge On Read" (MoR) dataset. We also registered the dataset in a Hive Metastore/Dataproc Metastore Service as an external table.

In this unit, we will:

- Read Hudi MoR datasets from the data lake using the Spark DataFrame API
- Read the previously registered external table in the Apache Hive Metastore/Dataproc Metastore Service directly with Spark SQL

At the end of this unit, you should know how to read Hudi MoR datasets from Spark.

MoR tables support several query types (snapshot, read optimized, and incremental), which we will cover subsequently.

In [2]:
from pyspark.sql import SparkSession

# Create a Hive-enabled Spark session with the Hudi catalog and SQL extensions
spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-06-pyspark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()
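
The Hudi-related settings in the cell above can also be kept in one dictionary, which makes it easier to see what each one is for. This is only a minimal sketch: `HUDI_SPARK_CONF` and `apply_conf` are hypothetical names introduced here for illustration and are not part of Spark or Hudi.

```python
# Hypothetical helper: the Hudi-related Spark settings from the cell above, in one place.
HUDI_SPARK_CONF = {
    # Kryo serializer, as configured in the session above
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Routes the default session catalog through Hudi's catalog implementation
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    # Registers Hudi's Spark SQL extensions
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
}


def apply_conf(builder, conf):
    """Apply every key/value pair in conf to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark available, this could be used as `apply_conf(SparkSession.builder.appName("Hudi-Learning-Unit-06-pyspark").master("yarn").enableHiveSupport(), HUDI_SPARK_CONF).getOrCreate()`, which should produce the same session settings as the cell above.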

In [3]:
spark

In [5]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]
HUDI_BASE_GCS_URI = f"gs://gaia_data_bucket-{PROJECT_NBR}/nyc-taxi-trips-hudi-mor/"

print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")
print(f"Base path of Hudi dataset is {HUDI_BASE_GCS_URI}")

Project ID is apache-hudi-lab
Project Number is 623600433888
Base path of Hudi dataset is gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-mor/


## 1. Read the Hudi dataset from source files in Cloud Storage with the Spark DataFrame API, and analyze it with Spark SQL against a temporary view

In [6]:
tripsDF = spark.read.format("hudi").load(HUDI_BASE_GCS_URI)

23/08/01 14:44:19 WARN GhfsStorageStatistics: Detected potential high latency for operation op_open. latencyMs=102; previousMaxLatencyMs=0; operationCount=1; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-mor/.hoodie/hoodie.properties
23/08/01 14:44:20 WARN GhfsStorageStatistics: Detected potential high latency for operation stream_read_operations. latencyMs=220; previousMaxLatencyMs=0; operationCount=1; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-mor/.hoodie/hoodie.properties
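
The load above passes no reader options, so it runs a snapshot query, which is Hudi's default and merges base files with log files. The other MoR query types can be selected through reader options. The sketch below builds those options; `hudi_read_options` is a hypothetical helper, and the option keys are the ones documented for Hudi 0.x releases, so check them against the Hudi version on your cluster.

```python
def hudi_read_options(query_type="snapshot", begin_instant=None):
    """Build Spark reader options for a Hudi table (hypothetical helper).

    query_type:    "snapshot" (base files merged with log files),
                   "read_optimized" (base files only), or
                   "incremental" (records changed after begin_instant).
    begin_instant: commit instant such as "20230801000000";
                   required for incremental queries.
    """
    allowed = {"snapshot", "read_optimized", "incremental"}
    if query_type not in allowed:
        raise ValueError(f"query_type must be one of {sorted(allowed)}")

    options = {"hoodie.datasource.query.type": query_type}
    if query_type == "incremental":
        if begin_instant is None:
            raise ValueError("incremental queries need begin_instant")
        options["hoodie.datasource.read.begin.instanttime"] = begin_instant
    return options


# Usage in this notebook (requires the Spark session and HUDI_BASE_GCS_URI):
# roDF = (spark.read.format("hudi")
#         .options(**hudi_read_options("read_optimized"))
#         .load(HUDI_BASE_GCS_URI))
```

A read-optimized query skips the log files, so it may be faster but can miss updates that have not been compacted yet.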


In [7]:
tripsDF.count()

                                                                                

186263929

In [8]:
tripsDF.createOrReplaceTempView("hudi_taxi_trips_snapshot")

23/08/01 14:45:18 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [9]:
# Aggregate by trip_year, which is not the partition key
spark.sql("select trip_year,count(*) as trip_count from hudi_taxi_trips_snapshot group by trip_year").show()



+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2021|  31972637|
|     2019|  90897542|
|     2022|  37023925|
|     2020|  26369825|
+---------+----------+
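
The same aggregation can be written with the DataFrame API instead of SQL against the temporary view. A minimal sketch, where `trips_per_year` is a hypothetical helper name:

```python
def trips_per_year(df):
    """Return trip counts per trip_year, mirroring the SQL query above."""
    return (
        df.groupBy("trip_year")
          .count()                                    # adds a column named "count"
          .withColumnRenamed("count", "trip_count")   # match the SQL alias
    )


# Usage in this notebook (requires tripsDF from the earlier cell):
# trips_per_year(tripsDF).show()
```

Both forms should compile to the same Spark plan, so the choice is mostly a matter of style.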



                                                                                

## 2. Read the previously registered external table on the same Hudi dataset in Hive Metastore/Dataproc Metastore and analyze it with Spark SQL

In [10]:
# Aggregate by trip_date, the partition key
spark.sql("select trip_date,count(*) as trip_count from taxi_db.nyc_taxi_trips_hudi_mor group by trip_date order by trip_date desc").show()

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used

+----------+----------+
| trip_date|trip_count|
+----------+----------+
|2022-12-07|         3|
|2022-12-06|         3|
|2022-12-01|        66|
|2022-11-30|    120020|
|2022-11-29|    121600|
|2022-11-28|    108743|
|2022-11-27|     92095|
|2022-11-26|    101256|
|2022-11-25|     88323|
|2022-11-24|     71200|
|2022-11-23|    107921|
|2022-11-22|    116825|
|2022-11-21|    110717|
|2022-11-20|     82719|
|2022-11-19|     96767|
|2022-11-18|     97693|
|2022-11-17|     97458|
|2022-11-16|     94731|
|2022-11-15|     92818|
|2022-11-14|     84078|
+----------+----------+
only showing top 20 rows
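
Because the query above groups on the partition key, adding a filter on `trip_date` should let Spark skip partitions outside the requested range. The sketch below builds such a query. It assumes `trip_date` is the partition column (as the cell comment indicates) and holds `yyyy-MM-dd` values, as in the output above; `trips_by_date_sql` is a hypothetical helper.

```python
from datetime import date


def trips_by_date_sql(table, start, end):
    """Build a per-day trip count query limited to start <= trip_date <= end.

    start and end are ISO dates ("yyyy-MM-dd"); parsing them first rejects
    malformed input before it is placed in the SQL text.
    """
    start_d = date.fromisoformat(start)
    end_d = date.fromisoformat(end)
    if start_d > end_d:
        raise ValueError("start must not be after end")
    return (
        f"select trip_date, count(*) as trip_count from {table} "
        f"where trip_date between '{start_d.isoformat()}' and '{end_d.isoformat()}' "
        "group by trip_date order by trip_date desc"
    )


# Usage in this notebook (requires the Spark session):
# spark.sql(trips_by_date_sql("taxi_db.nyc_taxi_trips_hudi_mor",
#                             "2022-11-01", "2022-11-30")).show()
```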



                                                                                

This concludes this unit. Proceed to the next notebook.