# Unit 1: Reading Hudi datasets with PySpark
In Module 2, we created a Hudi dataset and registered it as an external table in the Apache Hive Metastore/Dataproc Metastore Service.

In this module, we will:

1. Read the Hudi dataset from your data lake with the Spark DataFrame API.
2. Read the previously registered external table in the Apache Hive Metastore/Dataproc Metastore Service directly with Spark SQL.

At the end of this module, you should know how to read Hudi datasets from Spark.

In [1]:
from pyspark.sql import SparkSession

# Build (or reuse) a Hive-enabled Spark session with the Hudi catalog and SQL extensions.
spark = SparkSession.builder \
  .appName("Hudi-Learning-Unit-01-pyspark") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/07 14:29:55 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
spark

In [11]:
PROJECT_ID_OUTPUT=!gcloud config get-value core/project
PROJECT_ID=PROJECT_ID_OUTPUT[0]

In [17]:
PROJECT_NBR_OUTPUT=!gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NBR=PROJECT_NBR_OUTPUT[0]

In [18]:
print(f"Project ID is {PROJECT_ID}")
print(f"Project Number is {PROJECT_NBR}")

Project ID is apache-hudi-lab
Project Number is 623600433888
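The `!gcloud` lines above rely on IPython shell magic, so they only work inside a notebook. If the same lookups had to run from a plain Python script, one possible sketch uses `subprocess` instead. The helper name `gcloud_value` and its `runner` parameter are hypothetical, introduced here for illustration only; the sketch assumes the `gcloud` CLI is installed and authenticated.

```python
import subprocess


def gcloud_value(*args, runner=subprocess.run):
    """Run `gcloud <args>` and return its stdout with surrounding whitespace removed.

    `runner` is a hypothetical hook (defaults to subprocess.run) so the helper
    can be exercised without a real gcloud installation.
    """
    result = runner(["gcloud", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Script equivalents of the two notebook cells above (require gcloud on PATH):
# PROJECT_ID = gcloud_value("config", "get-value", "core/project")
# PROJECT_NBR = gcloud_value("projects", "describe", PROJECT_ID,
#                            "--format=value(projectNumber)")
```

With `check=True`, a failing `gcloud` call raises `subprocess.CalledProcessError` rather than silently returning an empty value.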


In [22]:
PERSIST_TO_BUCKET = f"gs://gaia_data_bucket-{PROJECT_NBR}"
PARQUET_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-parquet/"
HUDI_BASE_GCS_URI = f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi/"
DATABASE_NAME = "taxi_db"
TABLE_NAME = "nyc_taxi_trips_hudi"

## 1. Read the Hudi dataset from source files in Cloud Storage with the Spark DataFrame API, and analyze it with Spark SQL

In [27]:
tripsDF = spark.read.format("hudi").load(HUDI_BASE_GCS_URI)


In [28]:
tripsDF.count()


20939415
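The `load` call above runs a snapshot query, which is Hudi's default query type. Hudi's Spark datasource also accepts a `hoodie.datasource.query.type` option for read-optimized and incremental reads. The sketch below builds the option dictionary for each case. The helper name `hudi_read_options` and the example instant time are hypothetical, and an incremental read only returns rows if the table has commits after the chosen instant.

```python
def hudi_read_options(query_type="snapshot", begin_instant=None):
    """Build Hudi read options for the given query type.

    query_type: "snapshot" (Hudi's default), "read_optimized", or "incremental".
    begin_instant: commit instant time string; required for incremental reads.
    """
    allowed = {"snapshot", "read_optimized", "incremental"}
    if query_type not in allowed:
        raise ValueError(f"query_type must be one of {sorted(allowed)}")

    options = {"hoodie.datasource.query.type": query_type}
    if query_type == "incremental":
        if begin_instant is None:
            raise ValueError("incremental reads need a begin_instant")
        # Only records written after this instant are returned.
        options["hoodie.datasource.read.begin.instanttime"] = begin_instant
    return options


# In the notebook (requires the Spark session and HUDI_BASE_GCS_URI from above;
# the instant time below is a made-up example value):
# incrementalDF = (spark.read.format("hudi")
#                  .options(**hudi_read_options("incremental", "20230101000000"))
#                  .load(HUDI_BASE_GCS_URI))
```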

In [29]:
tripsDF.createOrReplaceTempView("hudi_taxi_trips_snapshot")

23/07/07 14:47:31 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [30]:
spark.sql("select trip_year,count(*) as trip_count from hudi_taxi_trips_snapshot group by trip_year").show()



+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2019|   8023712|
|     2020|   4179576|
|     2022|   4022129|
|     2021|   4713998|
+---------+----------+
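Note that the rows above come back out of order (2022 before 2021): without an `ORDER BY` clause, Spark SQL makes no ordering guarantee for a `GROUP BY` result. A minimal sketch of the same aggregation with a stable ordering follows. The helper `trips_per_year_sql` is hypothetical and simply formats the query for whichever relation name is passed in.

```python
def trips_per_year_sql(relation):
    """Return the per-year trip count query for `relation`, sorted by year."""
    return (
        "SELECT trip_year, COUNT(*) AS trip_count "
        f"FROM {relation} "
        "GROUP BY trip_year "
        "ORDER BY trip_year"
    )


# In the notebook (requires the Spark session and the temp view created above):
# spark.sql(trips_per_year_sql("hudi_taxi_trips_snapshot")).show()
# spark.sql(trips_per_year_sql(f"{DATABASE_NAME}.{TABLE_NAME}")).show()
```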




## 2. Read the previously registered external table on the same Hudi dataset in the Hive Metastore/Dataproc Metastore and analyze it with Spark SQL

In [31]:
spark.sql("select trip_year,count(*) as trip_count from taxi_db.nyc_taxi_trips_hudi group by trip_year").show()

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used

+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2019|   8023712|
|     2020|   4179576|
|     2021|   4713998|
|     2022|   4022129|
+---------+----------+
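Both read paths should return the same data, since the external table points at the same Hudi files. One way to check this, sketched below, is to collect each aggregate into a dictionary and compare. The helper `to_year_counts` is hypothetical; the sample values are copied from the two outputs above, and in the notebook the inputs would come from `spark.sql(...).collect()`, whose `Row` objects support positional indexing.

```python
def to_year_counts(rows):
    """Map trip_year -> trip_count from (trip_year, trip_count) rows."""
    return {row[0]: row[1] for row in rows}


# Values copied from the two query outputs shown in this notebook.
snapshot_rows = [(2019, 8023712), (2020, 4179576), (2022, 4022129), (2021, 4713998)]
metastore_rows = [(2019, 8023712), (2020, 4179576), (2021, 4713998), (2022, 4022129)]

snapshot_counts = to_year_counts(snapshot_rows)
metastore_counts = to_year_counts(metastore_rows)

# Row order differs between the two outputs, but the per-year counts agree.
counts_match = snapshot_counts == metastore_counts

# The per-year counts should also add up to the result of tripsDF.count().
total_trips = sum(snapshot_counts.values())
print(counts_match, total_trips)
```

Here the yearly counts sum to 20939415, which matches the `tripsDF.count()` result from section 1.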




This concludes Unit 1. Proceed to the next notebook.

In [None]:
%%javascript
Jupyter.notebook.session.delete();

<IPython.core.display.Javascript object>