# Explore the NYC taxi trip Hudi dataset

This notebook explores NYC taxi trips (yellow and green) stored in Hudi format in Cloud Storage. A table named taxi_db.nyc_taxi_trips_hudi is already registered in the Apache Hive Metastore (Dataproc Metastore Service) and can be used for Spark SQL analytics.

Executed as scripted, this module takes 10 minutes or less. However, it is intended to be a playground where you can explore and familiarize yourself with the dataset as you see fit.

### 1. Get or create Spark Session with requisite Hudi configs

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("NYC Taxi Hudi Data Explorer-PySpark") \
  .master("yarn")\
  .enableHiveSupport()\
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/07/10 18:52:02 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
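
The warning above suggests that a session already existed, in which case static settings passed to the builder may not have been applied. A minimal sketch for comparing the requested Hudi settings with what a session reports is shown below. The helper name `find_conf_mismatches` and the `fake_session_conf` dictionary are hypothetical and used only for illustration; in the notebook you would pass `spark.conf.get` instead. A matching value only shows what the session reports, not that the extension was actually loaded.

```python
# Hudi settings requested in the builder call above.
EXPECTED_HUDI_CONFS = {
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
}


def find_conf_mismatches(get_conf, expected=EXPECTED_HUDI_CONFS):
    """Return {key: (expected, actual)} for every setting that differs.

    get_conf is any callable with the signature get_conf(key, default),
    for example spark.conf.get in the notebook or dict.get in a test.
    """
    mismatches = {}
    for key, want in expected.items():
        have = get_conf(key, None)
        if have != want:
            mismatches[key] = (want, have)
    return mismatches


# Hypothetical stand-in for a session that lacks the Hudi extension setting.
# In the notebook, pass spark.conf.get instead of fake_session_conf.get.
fake_session_conf = {
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}
conf_mismatches = find_conf_mismatches(fake_session_conf.get)
for key, (want, have) in conf_mismatches.items():
    print(f"{key}: expected {want}, found {have}")
```

If a setting is reported as missing, restarting the kernel so that the builder creates a fresh session is one option to try.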


In [None]:
spark

### 2. Variables

In [3]:
import os

PROJECT_ID = ""
PROJECT_NBR = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    project_id_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = project_id_output[0]
    print("PROJECT_ID: ", PROJECT_ID)
    
    
    project_nbr_output = !gcloud projects describe $PROJECT_ID --format='value(projectNumber)'
    PROJECT_NBR = project_nbr_output[0]
    print("PROJECT_NBR: ", PROJECT_NBR)
    


DATA_BUCKET=f"gs://gaia_data_bucket-{PROJECT_NBR}"
print("DATA_BUCKET: ", DATA_BUCKET)

PARQUET_BASE_GCS_URI=f"{DATA_BUCKET}/nyc-taxi-trips-parquet/"
HUDI_BASE_GCS_URI=f"{DATA_BUCKET}/nyc-taxi-trips-hudi-cow"


DATABASE_NAME="taxi_db"
TABLE_NAME="nyc_taxi_trips_hudi"

PROJECT_ID:  apache-hudi-lab
PROJECT_NBR:  623600433888
DATA_BUCKET:  gs://gaia_data_bucket-623600433888


### 3. Explore the Hudi taxi trips with Spark SQL
The Dataproc cluster was created with an existing Dataproc Metastore Service attached as its Apache Hive Metastore.

In [9]:
spark.sql(f"SELECT count(*) as trip_count FROM {DATABASE_NAME}.{TABLE_NAME}").show()



+----------+
|trip_count|
+----------+
| 185550246|
+----------+



                                                                                

In [5]:
spark.sql(f"SELECT * FROM {DATABASE_NAME}.{TABLE_NAME} LIMIT 2").show()

23/07/10 18:53:27 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------------------+--------------------+--------------------+----------------------+--------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+-----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance| fare_amount|  surcharge|    mta_tax| tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|distance_

                                                                                

In [6]:
spark.sql(f"SHOW PARTITIONS {DATABASE_NAME}.{TABLE_NAME}").show(truncate=False)

+---------------------------------------+
|partition                              |
+---------------------------------------+
|trip_year=2019/trip_month=1/trip_day=1 |
|trip_year=2019/trip_month=1/trip_day=10|
|trip_year=2019/trip_month=1/trip_day=11|
|trip_year=2019/trip_month=1/trip_day=12|
|trip_year=2019/trip_month=1/trip_day=13|
|trip_year=2019/trip_month=1/trip_day=14|
|trip_year=2019/trip_month=1/trip_day=15|
|trip_year=2019/trip_month=1/trip_day=16|
|trip_year=2019/trip_month=1/trip_day=17|
|trip_year=2019/trip_month=1/trip_day=18|
|trip_year=2019/trip_month=1/trip_day=19|
|trip_year=2019/trip_month=1/trip_day=2 |
|trip_year=2019/trip_month=1/trip_day=20|
|trip_year=2019/trip_month=1/trip_day=21|
|trip_year=2019/trip_month=1/trip_day=22|
|trip_year=2019/trip_month=1/trip_day=23|
|trip_year=2019/trip_month=1/trip_day=24|
|trip_year=2019/trip_month=1/trip_day=25|
|trip_year=2019/trip_month=1/trip_day=26|
|trip_year=2019/trip_month=1/trip_day=27|
+---------------------------------
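
The partition values above are listed in string order, so trip_day=10 appears before trip_day=2. A minimal sketch for ordering them chronologically follows. The sample list is copied from the output above, and the helper name `partition_sort_key` is hypothetical.

```python
import re

# A few partition strings copied from the SHOW PARTITIONS output above.
partitions = [
    "trip_year=2019/trip_month=1/trip_day=1",
    "trip_year=2019/trip_month=1/trip_day=10",
    "trip_year=2019/trip_month=1/trip_day=11",
    "trip_year=2019/trip_month=1/trip_day=2",
    "trip_year=2019/trip_month=1/trip_day=20",
]

PARTITION_PATTERN = re.compile(r"trip_year=(\d+)/trip_month=(\d+)/trip_day=(\d+)")


def partition_sort_key(partition):
    """Turn a partition path into a (year, month, day) tuple of ints."""
    match = PARTITION_PATTERN.fullmatch(partition)
    if match is None:
        raise ValueError(f"unexpected partition format: {partition!r}")
    return tuple(int(part) for part in match.groups())


# Sorting on the integer tuple gives calendar order instead of string order.
sorted_partitions = sorted(partitions, key=partition_sort_key)
for partition in sorted_partitions:
    print(partition)
```

In the notebook, the input list could be built from the query result instead, for example by collecting the rows and reading the partition column shown in the output header.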

In [8]:
spark.sql(f"SELECT  trip_year, count(*) trip_count FROM {DATABASE_NAME}.{TABLE_NAME} GROUP BY trip_year ORDER BY trip_year asc ").show()



+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2019|  90690529|
|     2020|  26192443|
|     2021|  31845761|
|     2022|  36821513|
+---------+----------+



                                                                                

### 4. Explore the Parquet taxi trip count with Spark SQL


In [10]:
tripsDF=spark.read.format("parquet").load(PARQUET_BASE_GCS_URI)
tripsDF.createOrReplaceTempView("parquet_trips")

23/07/10 19:45:54 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.


In [11]:
spark.sql(f"SELECT  trip_year, count(*) parquet_trip_count FROM parquet_trips GROUP BY trip_year ORDER BY trip_year asc ").show()



+---------+------------------+
|trip_year|parquet_trip_count|
+---------+------------------+
|     2019|          90897542|
|     2020|          26369825|
|     2021|          31972637|
|     2022|          37023925|
+---------+------------------+
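
To make the comparison with the Hudi table concrete, the sketch below computes the per-year gap between the two sets of counts. The numbers are copied from the two query outputs in this notebook; the variable names are illustrative.

```python
# Per-year counts copied from the two query outputs above.
hudi_counts = {2019: 90690529, 2020: 26192443, 2021: 31845761, 2022: 36821513}
parquet_counts = {2019: 90897542, 2020: 26369825, 2021: 31972637, 2022: 37023925}

# Difference in row counts (Parquet minus Hudi) for each year.
count_gap = {year: parquet_counts[year] - hudi_counts[year] for year in hudi_counts}

for year, gap in count_gap.items():
    pct = 100 * gap / parquet_counts[year]
    print(f"{year}: {gap} fewer rows in Hudi ({pct:.2f}% of Parquet rows)")

total_gap = sum(count_gap.values())
print(f"total: {total_gap} fewer rows in Hudi")
```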



                                                                                

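The Hudi counts are slightly lower than the Parquet counts because of the author's choice of composite record key (a combination of columns) and precombine field: source rows that share a record key are collapsed into one, with the precombine field deciding which row is kept.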
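
The effect of a record key and a precombine field on row counts can be illustrated with a small pure-Python sketch. This is not Hudi code, and the field names `record_key` and `precombine` and the sample rows are hypothetical; the actual key columns and precombine field used for this table are not shown in this notebook.

```python
# Toy rows; the field names and values are illustrative only.
rows = [
    {"record_key": "yellow|1|2019-01-01T00:10:00", "precombine": 1, "fare_amount": 7.5},
    {"record_key": "yellow|1|2019-01-01T00:10:00", "precombine": 2, "fare_amount": 8.0},
    {"record_key": "green|2|2019-01-01T00:12:00", "precombine": 1, "fare_amount": 5.0},
]


def deduplicate(rows):
    """Keep one row per record key, preferring the largest precombine value.

    On a tie, the row seen first is kept.
    """
    latest = {}
    for row in rows:
        kept = latest.get(row["record_key"])
        if kept is None or row["precombine"] > kept["precombine"]:
            latest[row["record_key"]] = row
    return list(latest.values())


deduped = deduplicate(rows)
print(len(rows), "input rows ->", len(deduped), "rows after deduplication")
```

Two of the three toy rows share a key, so only two rows survive, which mirrors how a source with repeated key combinations can yield a smaller table.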

In [None]:
%%javascript
Jupyter.notebook.session.delete();