# Unit 1: Reading Hudi datasets with PySpark
In Module 2, we created a Hudi dataset and registered it as an external table in Hive Metastore/Dataproc Metastore Service.

In this unit, we will:

- Review reading Hudi datasets from your data lake with the Spark DataFrame API
- Review reading the previously registered external table in Apache Hive Metastore/Dataproc Metastore Service directly with Spark SQL

At the end of this module, you should know how to read Hudi datasets from Spark.

In [1]:
from pyspark.sql import SparkSession

# Register Hudi's catalog and SQL extension so Spark SQL can work with Hudi tables.
spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-01-pyspark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/30 02:09:54 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
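
The warning above indicates that the kernel had already created a Spark session, so `getOrCreate()` returned that session and, as the message says, only runtime SQL configurations from the builder took effect. Whether the Hudi settings are active therefore depends on how the existing session was configured. The following is a minimal sketch of one way to compare the requested values with the live ones. The helper `find_conf_mismatches` and the `fake_conf` stand-in are hypothetical names introduced here for illustration; they are not part of the lab.

```python
# Settings requested by the builder in this notebook.
EXPECTED_HUDI_CONF = {
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
}


def find_conf_mismatches(get_conf, expected=EXPECTED_HUDI_CONF):
    """Return {key: (expected, actual)} for every setting that is not active.

    get_conf is any callable (key, default) -> value, for example spark.conf.get.
    """
    mismatches = {}
    for key, want in expected.items():
        actual = get_conf(key, None)
        # A setting such as spark.sql.extensions may hold a comma-separated list.
        found = actual is not None and want in [v.strip() for v in actual.split(",")]
        if not found:
            mismatches[key] = (want, actual)
    return mismatches


# Illustrative stand-in for a session where only the catalog was configured.
fake_conf = {
    "spark.sql.catalog.spark_catalog": EXPECTED_HUDI_CONF["spark.sql.catalog.spark_catalog"]
}
print(find_conf_mismatches(fake_conf.get))
```

In the notebook, the call would be `find_conf_mismatches(spark.conf.get)`; an empty dictionary suggests that both settings are present on the running session.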


In [2]:
spark

In [3]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]

In [4]:
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]
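
The `!gcloud ...` syntax in the two cells above is IPython shell capture and only works inside a notebook. If the same lookups are needed in a plain Python script, a rough equivalent can be sketched with `subprocess`. The helper names below (`gcloud_value`, `get_project_id`, `get_project_number`) are hypothetical, and the sketch assumes the `gcloud` CLI is installed and authenticated on the machine that runs it.

```python
import subprocess


def gcloud_value(args, run=subprocess.run):
    """Run `gcloud <args>` and return the first non-empty line of its stdout."""
    result = run(["gcloud", *args], capture_output=True, text=True, check=True)
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError(f"gcloud {' '.join(args)} returned no output")
    return lines[0]


def get_project_id(run=subprocess.run):
    # Same command as cell In [3].
    return gcloud_value(["config", "get-value", "core/project"], run)


def get_project_number(project_id, run=subprocess.run):
    # Same command as cell In [4], with the format flag passed as one argument.
    return gcloud_value(
        ["projects", "describe", project_id, "--format=value(projectNumber)"], run
    )
```

Usage would look like `PROJECT_ID = get_project_id()` followed by `PROJECT_NBR = get_project_number(PROJECT_ID)`. The `run` parameter exists only so the command runner can be swapped out, for example in a test.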

In [5]:
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

Project ID is apache-hudi-lab
Project Number is 623600433888


In [6]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
PARQUET_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-parquet/"
HUDI_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi-cow/"
DATABASE_NAME = "taxi_db"
TABLE_NAME = "nyc_taxi_trips_hudi_cow"

## 1. Read the Hudi dataset from source files in Cloud Storage with the Spark DataFrame API, and analyze it with Spark SQL against a temporary view

In [7]:
tripsDF = spark.read.format("hudi").load(HUDI_BASE_GCS_URI)

23/07/30 02:09:57 WARN GhfsStorageStatistics: Detected potential high latency for operation op_open. latencyMs=106; previousMaxLatencyMs=0; operationCount=1; context=gs://gaia_data_bucket-623600433888/nyc-taxi-trips-hudi-cow/.hoodie/hoodie.properties
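
The plain `load()` above runs what Hudi calls a snapshot query, which is the default. As a hedged sketch, the same reader can be pointed at other query types through reader options. The option keys below (`hoodie.datasource.query.type`, `hoodie.datasource.read.begin.instanttime`, `as.of.instant`) are the names documented for Hudi 0.x releases and should be checked against the Hudi version on your cluster. The helpers `hudi_read_options` and `read_hudi` are hypothetical names introduced for illustration.

```python
VALID_QUERY_TYPES = ("snapshot", "incremental", "read_optimized")


def hudi_read_options(query_type="snapshot", begin_instant=None, as_of_instant=None):
    """Build the options dict for spark.read.format("hudi")."""
    if query_type not in VALID_QUERY_TYPES:
        raise ValueError(f"query_type must be one of {VALID_QUERY_TYPES}")
    options = {"hoodie.datasource.query.type": query_type}
    if query_type == "incremental":
        # Incremental queries need a starting commit instant.
        if begin_instant is None:
            raise ValueError("incremental queries need begin_instant")
        options["hoodie.datasource.read.begin.instanttime"] = begin_instant
    if as_of_instant is not None:
        # Time travel reads the table as of a past commit instant.
        if query_type != "snapshot":
            raise ValueError("as_of_instant only applies to snapshot queries")
        options["as.of.instant"] = as_of_instant
    return options


def read_hudi(spark, base_uri, **kwargs):
    """Load a Hudi table from base_uri using the options built above."""
    return spark.read.format("hudi").options(**hudi_read_options(**kwargs)).load(base_uri)


print(hudi_read_options())
```

With no extra arguments, `read_hudi(spark, HUDI_BASE_GCS_URI)` should behave like the cell above. Passing `query_type="incremental"` together with a `begin_instant` taken from the table's own commit timeline would return only records written after that instant.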


In [8]:
tripsDF.count()

186263929

In [9]:
tripsDF.createOrReplaceTempView("hudi_taxi_trips_snapshot")

23/07/30 02:10:48 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
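
The temporary view exposes every column of `tripsDF`. For a Hudi table this usually includes Hudi's own metadata columns, whose names start with `_hoodie_`, next to the trip fields. The sketch below separates the two groups so that queries and previews can focus on the trip data. The helper `split_hudi_columns` is a hypothetical name, and the sample column list is illustrative rather than the full schema of this dataset.

```python
HUDI_META_PREFIX = "_hoodie_"


def split_hudi_columns(columns):
    """Split column names into (hudi_metadata_columns, data_columns)."""
    meta = [c for c in columns if c.startswith(HUDI_META_PREFIX)]
    data = [c for c in columns if not c.startswith(HUDI_META_PREFIX)]
    return meta, data


# Illustrative list; in the notebook, pass tripsDF.columns instead.
sample_columns = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
    "trip_year",
    "trip_date",
]
meta_cols, data_cols = split_hudi_columns(sample_columns)
print(data_cols)
```

In the notebook, `tripsDF.select(*data_cols)` would then give a DataFrame without the metadata columns, while `meta_cols` remains handy for inspecting which commit or file a record came from.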


In [10]:
# Without partition key
spark.sql("select trip_year,count(*) as trip_count from hudi_taxi_trips_snapshot group by trip_year").show()



+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2020|  26369825|
|     2019|  90897542|
|     2022|  37023925|
|     2021|  31972637|
+---------+----------+
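
As a quick consistency check, the per-year counts above should add up to the total returned by `tripsDF.count()`, provided both ran against the same snapshot of the table. The sketch below repeats that arithmetic with the numbers copied from the outputs in this notebook and also derives each year's share of the total.

```python
# Values copied from the outputs of cells In [8] and In [10].
total_rows = 186_263_929
trips_by_year = {
    2019: 90_897_542,
    2020: 26_369_825,
    2021: 31_972_637,
    2022: 37_023_925,
}

grouped_total = sum(trips_by_year.values())
assert grouped_total == total_rows, "grouped counts do not match the total"

# Percentage of all trips that fall in each year, rounded to one decimal place.
share_by_year = {
    year: round(100 * count / total_rows, 1)
    for year, count in sorted(trips_by_year.items())
}
print(share_by_year)
```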



## 2. Read the previously registered external table on the same Hudi dataset in Hive Metastore/Dataproc Metastore and analyze it with Spark SQL

In [11]:
# With partition key
spark.sql(f"select trip_date,count(*) as trip_count from {DATABASE_NAME}.{TABLE_NAME} group by trip_date order by trip_date desc").show()

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used

+----------+----------+
| trip_date|trip_count|
+----------+----------+
|2022-12-07|         3|
|2022-12-06|         3|
|2022-12-01|        66|
|2022-11-30|    120020|
|2022-11-29|    121600|
|2022-11-28|    108743|
|2022-11-27|     92095|
|2022-11-26|    101256|
|2022-11-25|     88323|
|2022-11-24|     71200|
|2022-11-23|    107921|
|2022-11-22|    116825|
|2022-11-21|    110717|
|2022-11-20|     82719|
|2022-11-19|     96767|
|2022-11-18|     97693|
|2022-11-17|     97458|
|2022-11-16|     94731|
|2022-11-15|     92818|
|2022-11-14|     84078|
+----------+----------+
only showing top 20 rows
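
The query above scans every date in the table. When only a date range is of interest, a filter on `trip_date` narrows the result, and if `trip_date` is indeed the partition key (as the cell comment suggests), Spark may be able to skip partitions outside the range. The following is a minimal sketch of building such a query. The helper `trip_count_query` is a hypothetical name, and the sketch assumes `trip_date` compares correctly against `YYYY-MM-DD` literals.

```python
from datetime import date


def trip_count_query(database, table, start_date, end_date):
    """Build a daily trip-count query limited to [start_date, end_date]."""
    # Parsing the inputs as ISO dates rejects malformed values and keeps
    # arbitrary text out of the SQL string.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return (
        f"select trip_date, count(*) as trip_count "
        f"from {database}.{table} "
        f"where trip_date between '{start.isoformat()}' and '{end.isoformat()}' "
        f"group by trip_date order by trip_date desc"
    )


query = trip_count_query("taxi_db", "nyc_taxi_trips_hudi_cow", "2022-11-24", "2022-11-30")
print(query)
```

In the notebook, `spark.sql(query).show()` runs the filtered query, and `spark.sql(query).explain()` prints the physical plan, which is one way to check how the date filter is applied.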



This concludes Unit 1. Proceed to the next notebook.

In [12]:
%%javascript
Jupyter.notebook.session.delete();

<IPython.core.display.Javascript object>