## Data discovery: Load and query Yellow Taxi data
> Download the dataset from [the official TLC Trip Record Data website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

---

### Example only: how to document code in a cell
```python
# Load file
local_file = 'datasets/your-downloaded-from-TLC-taxis-file-here.parquet'

# Show data
spark.read.parquet(local_file).show()
```

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

### What is `master("local[N]")`?
The `--master` option sets the master URL of the cluster to connect to. Use `local` to run Spark locally with one worker thread, or `local[N]` to run locally with N worker threads.

<b>Source</b>: See the [Spark docs](https://spark.apache.org/docs/latest/) and the full list of [master URL options](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls).
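
As a rough illustration of the master URL forms described above, the sketch below builds a local master URL and passes it to `SparkSession.builder.master(...)`. The helper names `local_master_url` and `build_local_session` are hypothetical and not part of this notebook or of PySpark; `local[*]` is the documented form that uses as many worker threads as there are logical cores.

```python
def local_master_url(threads=None):
    """Build a master URL for Spark local mode.

    None -> "local[*]"  (as many worker threads as logical cores)
    1    -> "local"     (a single worker thread)
    N    -> "local[N]"  (N worker threads)
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return "local" if threads == 1 else f"local[{threads}]"


def build_local_session(app_name, threads=None):
    """Hypothetical helper: create (or reuse) a session running in local mode."""
    # Imported lazily so local_master_url() works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(threads))
        .appName(app_name)
        .getOrCreate()
    )
```

For example, `build_local_session("spark-app-version-x", threads=4)` would request four local worker threads. Note that `getOrCreate()` returns an already running session if one exists, in which case the requested master URL may not take effect.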

In [2]:
import pyspark
print(pyspark.__version__)


3.5.0


In [3]:
# Create SparkSession
spark = SparkSession.builder\
             .appName("spark-app-version-x")\
             .getOrCreate()

In [4]:
# Read taxi data
local_files = 'datasets/parquet/'
df = spark.read.parquet(local_files)

In [5]:
# A DataFrame is like a relational table held in memory. Let's see its columns
df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp_ntz (nullable = true)
 |-- tpep_dropoff_datetime: timestamp_ntz (nullable = true)
 |-- passenger_count: long (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: long (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- Airport_fee: double (nullable = true)



In [6]:
# Query sample:
df.select('VendorID','total_amount', 'PULocationID').show(n=5)

+--------+------------+------------+
|VendorID|total_amount|PULocationID|
+--------+------------+------------+
|       1|         9.4|         142|
|       2|        -5.5|          71|
|       2|         5.5|          71|
|       1|       74.65|         132|
|       2|        25.3|         161|
+--------+------------+------------+
only showing top 5 rows



In [7]:
# Query sample, using Spark SQL
df.createOrReplaceTempView('tbl_raw_yellow_taxis')

In [8]:
# SQL Statement
# PULocationID = 188, 379 rows out of 3,066,766
spark.sql('''
          select min(tpep_pickup_datetime), max(tpep_dropoff_datetime)
          from tbl_raw_yellow_taxis
          ''').show(n=5)

+-------------------------+--------------------------+
|min(tpep_pickup_datetime)|max(tpep_dropoff_datetime)|
+-------------------------+--------------------------+
|      2001-01-01 00:06:49|       2023-05-03 23:19:31|
+-------------------------+--------------------------+



In [9]:
# SQL Statement
spark.sql('''
          select extract(year from tpep_pickup_datetime), count(1)
          from tbl_raw_yellow_taxis
          group by extract(year from tpep_pickup_datetime)
          having count(1) > 100
          ''').show(n=100)

+---------------------------------------+--------+
|extract(year FROM tpep_pickup_datetime)|count(1)|
+---------------------------------------+--------+
|                                   2023| 9605947|
+---------------------------------------+--------+



In [10]:
# SQL Statement example, using a subquery to clean the data
# Use case example: imagine our business users asked us to delete all rows from any pickup year that has 100 rows or fewer.
df_clean_s1 = spark.sql('''
          select *
          from tbl_raw_yellow_taxis
          where extract(year from tpep_pickup_datetime) in
                        (select extract(year from tpep_pickup_datetime)
                        from tbl_raw_yellow_taxis
                        group by extract(year from tpep_pickup_datetime)
                        having count(1) > 100
                        )
          ''')
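
The same cleaning rule can also be written with the DataFrame API instead of SQL. The sketch below is one possible equivalent, not the method this notebook uses: the function name `keep_years_with_min_rows`, its parameters, and the temporary `_pickup_year` column are hypothetical, and a left semi join stands in for the `IN` subquery.

```python
def keep_years_with_min_rows(df, min_rows=100, ts_col="tpep_pickup_datetime"):
    """Keep only rows whose pickup year has more than `min_rows` rows."""
    # Imported lazily so this definition loads without PySpark installed.
    from pyspark.sql import functions as F

    # Tag every row with its pickup year (temporary helper column).
    with_year = df.withColumn("_pickup_year", F.year(F.col(ts_col)))

    # Years that pass the threshold, mirroring `having count(1) > 100`.
    valid_years = (
        with_year.groupBy("_pickup_year")
        .count()
        .filter(F.col("count") > min_rows)
        .select("_pickup_year")
    )

    # A left semi join keeps the left-hand rows that have a matching year
    # and adds no columns from the right-hand side.
    return (
        with_year.join(valid_years, on="_pickup_year", how="left_semi")
        .drop("_pickup_year")
    )
```

Calling `keep_years_with_min_rows(df)` should select the same rows as the SQL statement above. Which form to use is mostly a matter of taste: SQL is often easier to share with analysts, while the DataFrame form is easier to parameterize and reuse.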

In [11]:
# Register a new temp view backed by the cleansed DataFrame
df_clean_s1.createOrReplaceTempView('tbl_raw_yellow_taxis_clean_s1')

In [12]:
# SQL Statement
spark.sql('''
          select min(tpep_pickup_datetime), max(tpep_dropoff_datetime)
          from tbl_raw_yellow_taxis_clean_s1
          ''').show(n=5)

+-------------------------+--------------------------+
|min(tpep_pickup_datetime)|max(tpep_dropoff_datetime)|
+-------------------------+--------------------------+
|      2023-01-31 23:49:00|       2023-05-03 23:19:31|
+-------------------------+--------------------------+



---
### Writing the output, for example partitioned by date

In [13]:
# Create new partition key
df_sink = df_clean_s1.withColumn("p_date",to_date(col('tpep_pickup_datetime')))

In [14]:
# Write to local storage; mode("overwrite") replaces any existing output at this path
df_sink.write.partitionBy("p_date").mode("overwrite").parquet("datasets/yellow_taxis_daily/")
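
`partitionBy("p_date")` writes one subdirectory per distinct date, named in the Hive style `p_date=<value>`. The sketch below shows that layout and how a single day could be read back. The helper names `partition_dir` and `read_one_day` are hypothetical and used only for illustration; filtering on the partition column lets Spark skip the directories for all other dates.

```python
from datetime import date


def partition_dir(base_path, day):
    """Return the Hive-style directory that holds one p_date value."""
    return f"{base_path.rstrip('/')}/p_date={day.isoformat()}"


def read_one_day(spark, base_path, day):
    """Hypothetical helper: load one day from the partitioned dataset."""
    # Imported lazily so partition_dir() works without PySpark installed.
    from pyspark.sql.functions import col, lit

    # Reading the base path lets Spark discover p_date from the directory
    # names; the filter on p_date then prunes the non-matching partitions.
    return spark.read.parquet(base_path).filter(col("p_date") == lit(day))
```

For example, `partition_dir("datasets/yellow_taxis_daily/", date(2023, 2, 1))` gives the directory where that day's files would be stored, and `read_one_day(spark, "datasets/yellow_taxis_daily/", date(2023, 2, 1))` would return only that day's trips, provided the session is still running.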

In [15]:
# Stop the session
spark.stop()