Make sure you change the start date to today's date so that there is data for today.

In [8]:
# Import NYC yellow cab data from Azure Open Datasets
from azureml.opendatasets import NycTlcYellow

from datetime import datetime
from dateutil import parser

from pyspark.sql.functions import *

end_date = parser.parse('2018-07-08 23:59:59')
start_date = parser.parse('2018-01-01 00:00:00')

nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()

We're using the NYC Yellow Taxi dataset as our baseline dataset. The public samples of this dataset only cover up to 2018, so we need to manipulate the data for our purposes:
1. We add 4 years to the pickup and drop-off date/time columns (and to the pickup year) to make the data appear current.
2. We derive a column that holds just the pickup date (year/month/day) for partitioning purposes later on.
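
As a plain-Python illustration of the two manipulations listed above, the sketch below shifts one sample pickup timestamp by four years and derives its date-only value. The helper name `shift_years` and the sample value are assumptions made for illustration; the notebook itself performs these steps with Spark column functions.

```python
from datetime import datetime

def shift_years(ts: datetime, years: int) -> datetime:
    """Return ts moved forward by `years`, keeping the time of day (hypothetical helper)."""
    try:
        return ts.replace(year=ts.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year; fall back to Feb 28.
        return ts.replace(year=ts.year + years, day=28)

# Sample value chosen for illustration only.
pickup_2018 = datetime(2018, 7, 8, 23, 59, 59)
pickup_shifted = shift_years(pickup_2018, 4)  # step 1: make the data appear current
pu_date = pickup_shifted.date()               # step 2: keep year/month/day only

print(pickup_shifted.isoformat(), pu_date.isoformat())
```

Shifting by a whole number of years leaves the month and day unchanged, which is why only the year column needs adjusting alongside the timestamps.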

In [9]:
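## Shift the dates from 2018 -> 2022, and create a field with pickup date
# add_months() returns a DateType and would drop the time of day, so add an
# interval instead to keep both columns as timestamps.
nyc_tlc_df = nyc_tlc_df.withColumn('tpepPickupDateTime', expr('tpepPickupDateTime + INTERVAL 4 YEARS'))
nyc_tlc_df = nyc_tlc_df.withColumn('tpepDropoffDateTime', expr('tpepDropoffDateTime + INTERVAL 4 YEARS'))
# Keep the existing year partition column in step with the shifted timestamps
nyc_tlc_df = nyc_tlc_df.withColumn('puYear', col('puYear') + 4)
# Date-only column (year/month/day) used for partitioning later on
nyc_tlc_df = nyc_tlc_df.withColumn('puDate', to_date(col('tpepPickupDateTime')))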

#display(nyc_tlc_df)

Now that we have loaded and manipulated the dataset, we need to persist it in our lake. Here we persist the data partitioned by year, month, and full date, and save it in both the Parquet and Delta formats.

We'll use both of these formats with the Serverless SQL Endpoint in Synapse to illustrate how the storage format affects the performance of the query engine.

In [10]:
nyc_tlc_df.write.mode("overwrite").partitionBy("puYear","puMonth","puDate").parquet("/output/nycyellow")
nyc_tlc_df.write.mode("overwrite").partitionBy("puYear","puMonth","puDate").format("delta").save("/output/nycyellowdelta")
#nyc_tlc_df.write.mode("overwrite").format("delta").save("/output/nycyellowdelta")
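
Spark's `partitionBy` writes one Hive-style directory level per partition column, in the form `column=value`. To make the resulting layout concrete, the sketch below builds the directory that a single row would land in. The helper `partition_dir` and the sample values are hypothetical, and the file names inside each directory are generated by Spark.

```python
from datetime import date

def partition_dir(base: str, pu_year: int, pu_month: int, pu_date: date) -> str:
    """Build the Hive-style partition directory for one row (hypothetical helper)."""
    return f"{base}/puYear={pu_year}/puMonth={pu_month}/puDate={pu_date.isoformat()}"

# Sample values chosen for illustration only.
sample = partition_dir("/output/nycyellow", 2022, 7, date(2022, 7, 8))
print(sample)
```

Because the pickup date already determines the year and month, the three levels are nested: a query engine that supports partition pruning can skip every directory whose values cannot match a filter on these columns.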