# Unit 1: Reading Hudi datasets with Spark-Scala

In Module 2, we created a Hudi dataset and registered it in a Hive Metastore/Dataproc Metastore Service as an external table.

In this module, we will:
1. Read Hudi datasets from your data lake using the Spark DataFrame API
2. Read the previously registered external table in the Apache Hive Metastore/Dataproc Metastore Service directly with Spark SQL


At the end of this module, you should know how to read Hudi datasets from Spark.

In [1]:
import sys.process._
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.common.model.HoodieRecord
import org.apache.spark.sql.SparkSession

In [2]:
// Scala has no backslash line continuation; end each line with a dot so the chain continues.
val spark = SparkSession.builder.
  appName("Hudi-Learning-Unit-01-scala").
  master("yarn").
  enableHiveSupport().
  config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog").
  config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").
  getOrCreate()


In [3]:
spark

Waiting for a Spark session to start...

org.apache.spark.sql.SparkSession@7b2fa00a

In [None]:
// `!!` returns the command's stdout with a trailing newline; trim it before reuse.
val PROJECT_ID = "gcloud config get-value core/project".!!.trim

In [None]:
// Trim the trailing newline so the project number can be embedded safely in bucket URIs.
val PROJECT_NBR = s"gcloud projects describe $PROJECT_ID --format=value(projectNumber)".!!.trim

In [6]:
println(s"Project ID is $PROJECT_ID")
println(s"Project number is $PROJECT_NBR")

Project ID is apache-hudi-lab

Project number is 623600433888
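
The blank lines in the output above come from the way `scala.sys.process` returns command output: `!!` captures stdout including its trailing newline. The following is a minimal sketch of that behavior, assuming a Unix-like shell where `echo` is available; the bucket name `example-bucket` is hypothetical and only illustrates how an untrimmed value would leak a newline into an interpolated URI.

```scala
import scala.sys.process._

// `!!` runs the command and returns everything it wrote to stdout,
// including the newline that most CLI tools print at the end.
val raw = "echo apache-hudi-lab".!!

// Trimming removes the trailing newline before the value is reused.
val cleaned = raw.trim

// Hypothetical bucket name, used only to show the interpolated result.
val exampleUri = s"gs://example-bucket-$cleaned/data/"
println(exampleUri)
```

Trimming at the point of capture keeps every value derived from it, such as bucket and table paths, free of embedded whitespace.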



In [None]:
val PERSIST_TO_BUCKET = s"gs://gaia_data_bucket-$PROJECT_NBR"
val PARQUET_BASE_GCS_URI = s"$PERSIST_TO_BUCKET/nyc-taxi-trips-parquet/"
val HUDI_BASE_GCS_URI = s"$PERSIST_TO_BUCKET/nyc-taxi-trips-hudi/"
val DATABASE_NAME = "taxi_db"
val TABLE_NAME = "nyc_taxi_trips_hudi"

## 1. Read Hudi dataset from source files in Cloud Storage with the Spark DataFrame API, and analyze with Spark SQL

In [None]:
val tripsDF = spark.
  read.
  format("hudi").
  load(HUDI_BASE_GCS_URI)

In [9]:
tripsDF.count()

20939415
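
The read above relies on Hudi's default query type. As a hedged sketch, the same read can state the snapshot query type explicitly and display a few of the metadata columns Hudi stores with each record. It assumes the `spark` session and `HUDI_BASE_GCS_URI` defined in the earlier cells, and that the dataset has a `trip_year` column as used later in this notebook; the name `snapshotDF` is introduced here for illustration only.

```scala
// Assumes `spark` and HUDI_BASE_GCS_URI from the cells above.
// "snapshot" is the default query type; setting it makes the intent explicit.
val snapshotDF = spark.read.
  format("hudi").
  option("hoodie.datasource.query.type", "snapshot").
  load(HUDI_BASE_GCS_URI)

// Hudi adds metadata columns prefixed with `_hoodie_` to every record.
// Show a few of them next to a regular data column.
snapshotDF.
  select("_hoodie_commit_time", "_hoodie_record_key", "_hoodie_partition_path", "trip_year").
  show(5, truncate = false)
```

A snapshot query returns the latest committed view of the table, so its row count should match `tripsDF.count()` as long as no new commits land in between.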

In [10]:
tripsDF.createOrReplaceTempView("hudi_taxi_trips_snapshot")

In [11]:
spark.sql("select trip_year,count(*) as trip_count from hudi_taxi_trips_snapshot group by trip_year").show()

+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2019|   8023712|
|     2020|   4179576|
|     2022|   4022129|
|     2021|   4713998|
+---------+----------+



## 2. Read the previously registered external table on the same Hudi dataset in Hive Metastore/Dataproc Metastore and analyze with Spark SQL

In [12]:
spark.sql("select trip_year,count(*) as trip_count from taxi_db.nyc_taxi_trips_hudi group by trip_year").show()

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2019|   8023712|
|     2020|   4179576|
|     2022|   4022129|
|     2021|   4713998|
+---------+----------+
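
The query above hardcodes the database and table names. A minimal sketch of the same aggregation through the DataFrame API, reusing the `DATABASE_NAME` and `TABLE_NAME` values defined earlier, could look like the following. It assumes the `spark` session from the cells above and that the external table is visible in the metastore; `tripsByYearDF` is a name introduced here for illustration.

```scala
// Assumes `spark`, DATABASE_NAME and TABLE_NAME from the cells above.
// `spark.table` resolves the external table through the metastore.
val tripsByYearDF = spark.
  table(s"$DATABASE_NAME.$TABLE_NAME").
  groupBy("trip_year").
  count().
  withColumnRenamed("count", "trip_count").
  orderBy("trip_year")

tripsByYearDF.show()
```

Sorting by `trip_year` makes the row order deterministic, which the unsorted SQL query does not guarantee.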



This concludes Unit 1. Proceed to the next notebook.

In [13]:
%%javascript
Jupyter.notebook.session.delete();