# Generate NYC taxi trip data in Hudi format

This notebook reads NYC taxi trips (yellow and green) from a Parquet dataset in Cloud Storage and writes them back to Cloud Storage as a Hudi table. The run captured below took about 40 minutes to complete.

### 1. Get or create Spark Session with requisite Hudi configs

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("NYC Taxi Hudi Data Generator") \
  .master("yarn") \
  .enableHiveSupport() \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
  .getOrCreate()

23/06/28 23:22:25 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
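
The warning above suggests that the notebook reused an existing session, in which case settings passed to the builder are not guaranteed to be applied. A minimal sketch of a check follows. `missing_hudi_configs` is a hypothetical helper, not part of Spark or Hudi, and it assumes each setting holds exactly the single value shown in section 1.

```python
# Settings that section 1 passes to the session builder.
EXPECTED_HUDI_CONFIGS = {
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
}


def missing_hudi_configs(get_conf, expected=EXPECTED_HUDI_CONFIGS):
    """Return the expected keys whose current value differs (hypothetical helper).

    get_conf is any callable that maps a config key to its value, or None if unset.
    """
    return sorted(key for key, value in expected.items() if get_conf(key) != value)


# Stand-in for a session in which only one of the two settings took effect.
example_conf = {
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
}
print(missing_hudi_configs(example_conf.get))
```

In the notebook, the same helper could be called as `missing_hudi_configs(lambda key: spark.conf.get(key, None))`. An empty list would indicate that both settings are in effect.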


### 2. Variables

In [2]:
import os

PROJECT_ID = ""
PROJECT_NBR = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    project_id_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = project_id_output[0]
    print("Project ID: ", PROJECT_ID)
    
    
    project_nbr_output = !gcloud projects describe $PROJECT_ID --format='value(projectNumber)'
    PROJECT_NBR = project_nbr_output[0]
    print("Project Number: ", PROJECT_NBR)
    


PERSIST_TO_BUCKET=f"gs://gaia_data_bucket-{PROJECT_NBR}"
print("PERSIST_TO_BUCKET: ",PERSIST_TO_BUCKET)

PARQUET_BASE_GCS_URI=f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-parquet/"
HUDI_BASE_GCS_URI=f"{PERSIST_TO_BUCKET}/nyc-taxi-trips-hudi/"


DATABASE_NAME="taxi_db"
TABLE_NAME="nyc_taxi_trips_hudi"



Project ID:  apache-hudi-lab
Project Number:  623600433888
PERSIST_TO_BUCKET:  gs://gaia_data_bucket-623600433888
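
If `IS_TESTING` is set, or the `gcloud` lookup returns nothing, `PROJECT_NBR` stays empty and the bucket name becomes `gs://gaia_data_bucket-`. The sketch below derives the same URIs as the cell above and fails early on an empty project number. `build_gcs_uris` is a hypothetical helper introduced only for illustration.

```python
def build_gcs_uris(project_nbr):
    """Derive the bucket and dataset URIs used in this notebook (hypothetical helper)."""
    if not project_nbr:
        raise ValueError("project number is empty; check the gcloud configuration")
    bucket = f"gs://gaia_data_bucket-{project_nbr}"
    return {
        "bucket": bucket,
        "parquet": f"{bucket}/nyc-taxi-trips-parquet/",
        "hudi": f"{bucket}/nyc-taxi-trips-hudi/",
    }


# Project number copied from the cell output above.
uris = build_gcs_uris("623600433888")
print(uris["parquet"])
print(uris["hudi"])
```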


### 3. Create database in Apache Hive Metastore
The Dataproc cluster was created with an existing Dataproc Metastore Service referenced as its Apache Hive Metastore.

In [3]:
# Create database
spark.sql(f"create database if not exists {DATABASE_NAME};")

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used


DataFrame[]

In [4]:
# Drop the table if it already exists
spark.sql(f"drop table if exists {DATABASE_NAME}.{TABLE_NAME}")

DataFrame[]

### 4. Read taxi trips in Parquet format from Cloud Storage and persist as Hudi

In [5]:
import datetime
startTime = datetime.datetime.now()
print(f"Started at {startTime}")

Started at 2023-06-28 23:22:31.176740


#### 4.1. Read Parquet from Cloud Storage

In [6]:
tripsDF=spark.read.format("parquet").load(PARQUET_BASE_GCS_URI)

23/06/28 23:23:19 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
                                                                                

In [7]:
tripsDF.show(2)

23/06/28 23:23:25 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|taxi_type|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance| fare_amount|  surcharge|    mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|distance_between_service|time_between_service|trip_year|trip_month|trip_day|
+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+----------------

                                                                                

#### 4.2. Persist as Hudi to Cloud Storage

In [8]:
hudi_options = {
    'hoodie.database.name': DATABASE_NAME,
    'hoodie.table.name': TABLE_NAME,
    'hoodie.datasource.write.table.name': TABLE_NAME,
    'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'taxi_type,trip_year,trip_month,trip_day,vendor_id,pickup_location_id,dropoff_location_id',
    'hoodie.datasource.write.partitionpath.field': 'trip_year:SIMPLE,trip_month:SIMPLE,trip_day:SIMPLE',
    'hoodie.datasource.write.precombine.field': 'pickup_datetime',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.partition.metafile.use.base.format': 'true', 
    'hoodie.datasource.write.drop.partition.columns': 'true'
}

tripsDF.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(HUDI_BASE_GCS_URI)

23/06/28 23:23:28 WARN HoodieSparkSqlWriter$: hoodie table at gs://gaia_data_bucket-623600433888/nyc-taxi-trips/hudi-base already exists. Deleting existing data & overwriting with new data.
23/06/28 23:23:31 WARN HoodieBackedTableMetadata: Metadata table was not found at path gs://gaia_data_bucket-623600433888/nyc-taxi-trips/hudi-base//.hoodie/metadata
23/06/29 00:00:53 WARN DAGScheduler: Broadcasting large task binary with size 1099.7 KiB
23/06/29 00:02:52 WARN DAGScheduler: Broadcasting large task binary with size 1100.4 KiB
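
The partition path option lists three `SIMPLE` fields, and hive-style partitioning is enabled, so each partition directory is named `field=value`. The sketch below only illustrates that layout in plain Python. It is not Hudi's key generator code, and both function names are hypothetical.

```python
def parse_partition_spec(spec):
    """Split 'field:TYPE,field:TYPE' into a list of (field, type) pairs."""
    pairs = []
    for item in spec.split(","):
        field, _, key_type = item.strip().partition(":")
        pairs.append((field, key_type))
    return pairs


def hive_style_path(record, spec):
    """Build a 'field=value/field=value' path for SIMPLE partition fields."""
    parts = []
    for field, key_type in parse_partition_spec(spec):
        if key_type != "SIMPLE":
            raise ValueError(f"this sketch only handles SIMPLE fields, got {key_type!r}")
        parts.append(f"{field}={record[field]}")
    return "/".join(parts)


# Same spec string as 'hoodie.datasource.write.partitionpath.field' above.
spec = "trip_year:SIMPLE,trip_month:SIMPLE,trip_day:SIMPLE"
record = {"trip_year": 2019, "trip_month": 1, "trip_day": 10}
print(hive_style_path(record, spec))
```

The printed path has the same shape as the entries that `SHOW PARTITIONS` returns in section 6.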
                                                                                

In [9]:
completionTime = datetime.datetime.now()
print(f"Completed at {completionTime}")

Completed at 2023-06-29 00:03:05.584754
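
The two timestamps printed by the timing cells give the duration of the read and write steps. A small worked example, using the values from the outputs above:

```python
import datetime

# Timestamps copied from the "Started at" and "Completed at" outputs above.
started = datetime.datetime.fromisoformat("2023-06-28 23:22:31.176740")
completed = datetime.datetime.fromisoformat("2023-06-29 00:03:05.584754")

elapsed = completed - started
minutes, seconds = divmod(elapsed.total_seconds(), 60)
print(f"Elapsed: {int(minutes)} min {seconds:.0f} s")
```

In the notebook itself, `completionTime - startTime` yields the same `timedelta` without retyping the values.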


#### 4.3. A quick review of the schema

In [10]:
tripsDF.printSchema()

root
 |-- taxi_type: string (nullable = true)
 |-- trip_hour: integer (nullable = true)
 |-- trip_minute: integer (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- store_and_forward: string (nullable = true)
 |-- rate_code: string (nullable = true)
 |-- pickup_location_id: string (nullable = true)
 |-- dropoff_location_id: string (nullable = true)
 |-- passenger_count: long (nullable = true)
 |-- trip_distance: decimal(38,9) (nullable = true)
 |-- fare_amount: decimal(38,9) (nullable = true)
 |-- surcharge: decimal(38,9) (nullable = true)
 |-- mta_tax: decimal(38,9) (nullable = true)
 |-- tip_amount: decimal(38,9) (nullable = true)
 |-- tolls_amount: decimal(38,9) (nullable = true)
 |-- improvement_surcharge: decimal(10,0) (nullable = true)
 |-- total_amount: decimal(38,9) (nullable = true)
 |-- payment_type_code: string (nullable = true)
 |-- congestion_surcharge: decimal

### 5. Register table in Dataproc Metastore Service
The Terraform provisioning automation created a managed Hive Metastore for you: a Dataproc Metastore Service with a Thrift endpoint.

In [11]:
spark.sql("SHOW DATABASES;").show(truncate=False)

+---------+
|namespace|
+---------+
|default  |
|taxi_db  |
+---------+



In [12]:
# Create an external table on the Hudi files in the data lake in Cloud Storage
spark.sql(f"CREATE TABLE IF NOT EXISTS {DATABASE_NAME}.{TABLE_NAME} USING hudi LOCATION \"{HUDI_BASE_GCS_URI}\";").show()

23/06/29 00:03:08 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


++
||
++
++



### 6. Explore the table and data with Spark SQL
This requires the table to be registered in the Apache Hive Metastore (Dataproc Metastore Service).

In [13]:
spark.sql(f"SELECT count(*) FROM {DATABASE_NAME}.{TABLE_NAME}").show()



+--------+
|count(1)|
+--------+
|20939415|
+--------+



                                                                                

In [14]:
spark.sql(f"SELECT * FROM {DATABASE_NAME}.{TABLE_NAME} LIMIT 2").show()



+-------------------+--------------------+--------------------+----------------------+--------------------+---------+---------+-----------+---------+-------------------+-------------------+-----------------+---------+------------------+-------------------+---------------+-------------+------------+-----------+-----------+----------+------------+---------------------+------------+-----------------+--------------------+---------+---------+--------------+------------------------+--------------------+---------+----------+--------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|taxi_type|trip_hour|trip_minute|vendor_id|    pickup_datetime|   dropoff_datetime|store_and_forward|rate_code|pickup_location_id|dropoff_location_id|passenger_count|trip_distance| fare_amount|  surcharge|    mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type_code|congestion_surcharge|trip_type|ehail_fee|partition_date|distance_be

                                                                                

In [15]:
spark.sql(f"SHOW PARTITIONS {DATABASE_NAME}.{TABLE_NAME}").show(truncate=False)

+---------------------------------------+
|partition                              |
+---------------------------------------+
|trip_year=2019/trip_month=1/trip_day=1 |
|trip_year=2019/trip_month=1/trip_day=10|
|trip_year=2019/trip_month=1/trip_day=11|
|trip_year=2019/trip_month=1/trip_day=12|
|trip_year=2019/trip_month=1/trip_day=13|
|trip_year=2019/trip_month=1/trip_day=14|
|trip_year=2019/trip_month=1/trip_day=15|
|trip_year=2019/trip_month=1/trip_day=16|
|trip_year=2019/trip_month=1/trip_day=17|
|trip_year=2019/trip_month=1/trip_day=18|
|trip_year=2019/trip_month=1/trip_day=19|
|trip_year=2019/trip_month=1/trip_day=2 |
|trip_year=2019/trip_month=1/trip_day=20|
|trip_year=2019/trip_month=1/trip_day=21|
|trip_year=2019/trip_month=1/trip_day=22|
|trip_year=2019/trip_month=1/trip_day=23|
|trip_year=2019/trip_month=1/trip_day=24|
|trip_year=2019/trip_month=1/trip_day=25|
|trip_year=2019/trip_month=1/trip_day=26|
|trip_year=2019/trip_month=1/trip_day=27|
+---------------------------------
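
The listing above appears to be ordered as text, which is why `trip_day=10` follows `trip_day=1` and `trip_day=2` comes after `trip_day=19`. A minimal sketch of sorting the partition strings numerically instead, assuming every partition value is an integer. `partition_sort_key` is a hypothetical helper.

```python
def partition_sort_key(partition):
    """Turn 'trip_year=2019/trip_month=1/trip_day=10' into (2019, 1, 10)."""
    return tuple(int(part.split("=", 1)[1]) for part in partition.split("/"))


# A few partition strings copied from the output above.
partitions = [
    "trip_year=2019/trip_month=1/trip_day=1",
    "trip_year=2019/trip_month=1/trip_day=10",
    "trip_year=2019/trip_month=1/trip_day=11",
    "trip_year=2019/trip_month=1/trip_day=2",
]
ordered = sorted(partitions, key=partition_sort_key)
for partition in ordered:
    print(partition)
```

In the notebook, the input list could be collected with `[row.partition for row in spark.sql(f"SHOW PARTITIONS {DATABASE_NAME}.{TABLE_NAME}").collect()]`.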

In [16]:
spark.sql(f"SELECT  trip_year, count(*) trip_count FROM {DATABASE_NAME}.{TABLE_NAME} GROUP BY trip_year").show()



+---------+----------+
|trip_year|trip_count|
+---------+----------+
|     2019|   8023712|
|     2020|   4179576|
|     2022|   4022129|
|     2021|   4713998|
+---------+----------+



                                                                                

In [17]:
spark.sql(f"SELECT  taxi_type, count(*) trip_count FROM {DATABASE_NAME}.{TABLE_NAME} GROUP BY taxi_type").show()



+---------+----------+
|taxi_type|trip_count|
+---------+----------+
|    green|   3967204|
|   yellow|  16972211|
+---------+----------+
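
As a quick consistency check, the per-year and per-type counts above should each add up to the total row count from `In [13]`. A worked example with the numbers copied from the outputs:

```python
# Total from the count(*) query in In [13].
total_trips = 20939415

# Counts copied from the GROUP BY outputs above.
trips_by_year = {2019: 8023712, 2020: 4179576, 2021: 4713998, 2022: 4022129}
trips_by_type = {"green": 3967204, "yellow": 16972211}

# Both breakdowns must account for every row exactly once.
assert sum(trips_by_year.values()) == total_trips
assert sum(trips_by_type.values()) == total_trips

yellow_share = trips_by_type["yellow"] / total_trips
print(f"Yellow share of trips: {yellow_share:.1%}")
```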



                                                                                

### 7. Delete the table metadata in the Dataproc Metastore Service (Hive metastore)
The table metadata is dropped here because we will later run an OSS utility that reads the Hudi table and populates its metadata into a Hive metastore.

In [18]:
# Drop the table definition from the metastore
spark.sql(f"drop table if exists {DATABASE_NAME}.{TABLE_NAME}")

DataFrame[]

This concludes the data generation. Proceed to the next module in GitHub.