# Exploratory Data Analysis with PySpark and Spark SQL

This notebook uses New York City taxi data from [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

## Instructions

- Load and explore NYC taxi data from January 2019. The exercises can be completed using PySpark or Spark SQL (a subset of the questions will be re-answered using the language not chosen for the main work).
- Load the zone lookup table to answer the questions about the NYC boroughs.
- Load NYC taxi data from January 2025 and compare the two datasets.
- With any remaining time, work on the "Where to go from here" section.
- The lab due date is TBD (due dates will be updated in the README for the class repo).

In [0]:
# Define the name of the new catalog
catalog = 'taxi_eda_db'

# define variables for the trips data
schema = 'yellow_taxi_trips'
volume = 'data'
file_name = 'yellow_tripdata_2019-01.parquet'
table_name = 'tbl_yellow_taxi_trips'
path_volume = '/Volumes/' + catalog + "/" + schema + '/' + volume
path_table =  catalog + "." + schema
download_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-01.parquet'

In [0]:
# create the catalog/schema/volume
spark.sql('create catalog if not exists ' + catalog)
spark.sql('create schema if not exists ' + catalog + '.' + schema)
spark.sql('create volume if not exists ' + catalog + '.' + schema + '.' + volume)

DataFrame[]

In [0]:
# Get the data
dbutils.fs.cp(download_url, f"{path_volume}/{file_name}")

True

In [0]:
# create the dataframe (Parquet files carry their own schema, so CSV-style
# options such as header, inferSchema, and sep are not needed)
df_trips = spark.read.parquet(f"{path_volume}/{file_name}")

In [0]:
# Show the dataframe
df_trips.show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       1| 2019-01-01 00:46:40|  2019-01-01 00:53:20|            1.0|          1.5|       1.0|                 N|         151|         239|           1|        7.0|  0.5|    0.5|      1.6

## Lab

### Part 1
This section can be completed using either PySpark commands or SQL commands. A later section re-answers a self-chosen subset of the questions in the language not used for the main section (i.e., if PySpark is chosen for the main lab, SQL should be used to repeat some of the questions).

- Add a column that creates a unique key to identify each record, in order to answer questions about individual trips.
- Which trip has the highest passenger count?
- What is the average passenger count?
- What are the shortest and longest trips by distance? By time?
- What are the busiest and slowest single days?
- What are the busiest and slowest times of day? (You may want to bucket these by hour or create times such as morning, afternoon, evening, late night.)
- On average, which day of the week is the slowest? The busiest?
- Does trip distance or number of passengers affect the tip amount?
- What was the highest "extra" charge, and on which trip?
- Are there any data points that seem strange or look like outliers? (Make sure to explain your reasoning in a markdown cell.)

Step 1: Add a column that creates a unique key to identify each record in order to answer questions about individual trips

In [0]:
# let's import a function that generates a column with monotonically increasing 64-bit integers
from pyspark.sql.functions import monotonically_increasing_id

# here we create the column and apply the function
df_trips = df_trips.withColumn("trip_id", monotonically_increasing_id())

display(df_trips)

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,trip_id
1,2019-01-01T00:46:40.000,2019-01-01T00:53:20.000,1.0,1.5,1.0,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,,,0
1,2019-01-01T00:59:47.000,2019-01-01T01:18:59.000,1.0,2.6,1.0,N,239,246,1,14.0,0.5,0.5,1.0,0.0,0.3,16.3,,,1
2,2018-12-21T13:48:30.000,2018-12-21T13:52:40.000,3.0,0.0,1.0,N,236,236,1,4.5,0.5,0.5,0.0,0.0,0.3,5.8,,,2
2,2018-11-28T15:52:25.000,2018-11-28T15:55:45.000,5.0,0.0,1.0,N,193,193,2,3.5,0.5,0.5,0.0,0.0,0.3,7.55,,,3
2,2018-11-28T15:56:57.000,2018-11-28T15:58:33.000,5.0,0.0,2.0,N,193,193,2,52.0,0.0,0.5,0.0,0.0,0.3,55.55,,,4
2,2018-11-28T16:25:49.000,2018-11-28T16:28:26.000,5.0,0.0,1.0,N,193,193,2,3.5,0.5,0.5,0.0,5.76,0.3,13.31,,,5
2,2018-11-28T16:29:37.000,2018-11-28T16:33:43.000,5.0,0.0,2.0,N,193,193,2,52.0,0.0,0.5,0.0,0.0,0.3,55.55,,,6
1,2019-01-01T00:21:28.000,2019-01-01T00:28:37.000,1.0,1.3,1.0,N,163,229,1,6.5,0.5,0.5,1.25,0.0,0.3,9.05,,,7
1,2019-01-01T00:32:01.000,2019-01-01T00:45:39.000,1.0,3.7,1.0,N,229,7,1,13.5,0.5,0.5,3.7,0.0,0.3,18.5,,,8
1,2019-01-01T00:57:32.000,2019-01-01T01:09:32.000,2.0,2.1,1.0,N,141,234,1,10.0,0.5,0.5,1.7,0.0,0.3,13.0,,,9
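
A note on the generated key: `monotonically_increasing_id()` guarantees unique, increasing values, but not consecutive ones. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, so IDs can jump when the data spans several partitions. The plain-Python sketch below decodes an ID under that documented layout. `decode_trip_id` is a hypothetical helper written for illustration, not a PySpark function.

```python
from typing import Tuple

# Layout assumed from the Spark docs for monotonically_increasing_id():
# upper 31 bits = partition ID, lower 33 bits = record number in the partition.
RECORD_BITS = 33


def decode_trip_id(trip_id: int) -> Tuple[int, int]:
    """Split a generated ID into (partition_id, record_number)."""
    partition_id = trip_id >> RECORD_BITS
    record_number = trip_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# IDs that appear in the outputs of this notebook
for trip_id in (9, 949956, 6074091):
    print(trip_id, decode_trip_id(trip_id))
```

The key is therefore fine for identifying individual trips, but it should not be read as a row number.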


- Which trip has the highest passenger count?


In [0]:
# the desc function returns a sort expression for the target column in descending order
from pyspark.sql.functions import desc

# order by passenger_count descending and keep the first row, which corresponds to the highest passenger count
trip_max_passenger = df_trips.orderBy(desc("passenger_count")).limit(1).select("passenger_count", "trip_id")
display(trip_max_passenger)

passenger_count,trip_id
9.0,949956


- What is the average passenger count?


In [0]:
# Calculate the average passenger count
from pyspark.sql.functions import avg

# Compute the average and give the result column a descriptive alias
avg_passenger_count = df_trips.agg(avg("passenger_count").alias("average_passenger_count"))

display(avg_passenger_count)

average_passenger_count
1.5670317144945614


- Shortest/longest trip by distance? By time?


In [0]:
from pyspark.sql.functions import col, unix_timestamp

# Calculate trip duration in seconds
df_trips = df_trips.withColumn(
    "trip_duration_sec",
    unix_timestamp("tpep_dropoff_datetime") - unix_timestamp("tpep_pickup_datetime")
)

# Shortest and longest trip by distance
shortest_trip_distance = df_trips.orderBy(
    col("trip_distance").asc()
).limit(1).select("trip_distance", "trip_id")

longest_trip_distance = df_trips.orderBy(
    col("trip_distance").desc()
).limit(1).select("trip_distance", "trip_id")

# Shortest and longest trip by duration
shortest_trip_time = df_trips.orderBy(
    col("trip_duration_sec").asc()
).limit(1).select("trip_duration_sec", "trip_id")

longest_trip_time = df_trips.orderBy(
    col("trip_duration_sec").desc()
).limit(1).select("trip_duration_sec", "trip_id")

#  display the results with .show()
print("the shortest trip by distance is :")
shortest_trip_distance.show()

print("the longest trip by distance is :")
longest_trip_distance.show()

print("the shortest trip by time is :")
shortest_trip_time.show()

print("the longest trip by time is:")
longest_trip_time.show()

the shortest trip by distance is :
+-------------+-------+
|trip_distance|trip_id|
+-------------+-------+
|          0.0|      2|
+-------------+-------+

the longest trip by distance is :
+-------------+-------+
|trip_distance|trip_id|
+-------------+-------+
|        831.8|6074091|
+-------------+-------+

the shortest trip by time is :
+-----------------+-------+
|trip_duration_sec|trip_id|
+-----------------+-------+
|         -5056830|1203184|
+-----------------+-------+

the longest trip by time is:
+-----------------+-------+
|trip_duration_sec|trip_id|
+-----------------+-------+
|          2618881|  68267|
+-----------------+-------+



The negative trip duration shows that the dataset contains invalid records: a drop-off cannot happen before its pickup, so such rows should be treated as outliers.
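
To get a feel for how extreme the two duration values are, it helps to convert them from seconds to days. The sketch below is plain Python and uses only the two values printed above. `seconds_to_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds: int) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


# Extremes reported in the output above
shortest_sec = -5056830
longest_sec = 2618881

print(f"shortest: {seconds_to_days(shortest_sec):.1f} days")
print(f"longest:  {seconds_to_days(longest_sec):.1f} days")
```

That works out to roughly -58.5 days and 30.3 days, neither of which is plausible for a single taxi ride, so both extremes look like recording errors rather than real trips.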


- Busiest / Slowest single day


In [0]:
from pyspark.sql.functions import to_date, col, count

# Extract pickup date
df_trips = df_trips.withColumn("pickup_date", to_date(col("tpep_pickup_datetime")))

# Count trips per day
trips_per_day = df_trips.groupBy("pickup_date").agg(count("*").alias("num_trips"))

# Busiest and slowest days
busiest_day = trips_per_day.orderBy(col("num_trips").desc()).limit(1)
slowest_day = trips_per_day.orderBy(col("num_trips").asc()).limit(1)

busiest_day.show()
slowest_day.show()

+-----------+---------+
|pickup_date|num_trips|
+-----------+---------+
| 2019-01-25|   292499|
+-----------+---------+

+-----------+---------+
|pickup_date|num_trips|
+-----------+---------+
| 2018-12-21|        1|
+-----------+---------+
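
The "slowest day" above, 2018-12-21 with a single trip, falls outside the month this file is supposed to cover, so it is probably a mis-dated record rather than a genuinely slow day. One option is to restrict the data to January 2019 before counting. The sketch below shows the window check in plain Python on pickup timestamps copied from the first rows displayed earlier. The half-open window and the helper name `in_window` are assumptions made for illustration. In Spark, the same condition could be expressed as a filter on `pickup_date`.

```python
from datetime import datetime

# Half-open window covering January 2019 (an assumption about the intended scope)
WINDOW_START = datetime(2019, 1, 1)
WINDOW_END = datetime(2019, 2, 1)


def in_window(pickup: str) -> bool:
    """Return True if the pickup timestamp falls inside January 2019."""
    ts = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")
    return WINDOW_START <= ts < WINDOW_END


# Pickup timestamps copied from the rows displayed earlier in the notebook
sample_pickups = [
    "2019-01-01 00:46:40",
    "2018-12-21 13:48:30",
    "2018-11-28 15:52:25",
    "2019-01-01 00:21:28",
]

kept = [p for p in sample_pickups if in_window(p)]
print(kept)
```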



- Busiest / Slowest time of day

In [0]:
from pyspark.sql.functions import hour

# Extract the hour from the pickup datetime
df_trips = df_trips.withColumn("pickup_hour", hour(col("tpep_pickup_datetime")))

# Count how many trips started in each hour
trips_per_hour = df_trips.groupBy("pickup_hour").agg(count("*").alias("num_trips"))

# Find the busiest and slowest hours
busiest_hour = trips_per_hour.orderBy(col("num_trips").desc()).limit(1)
slowest_hour = trips_per_hour.orderBy(col("num_trips").asc()).limit(1)

# Show the results
busiest_hour.show()
slowest_hour.show()


+-----------+---------+
|pickup_hour|num_trips|
+-----------+---------+
|         18|   515390|
+-----------+---------+

+-----------+---------+
|pickup_hour|num_trips|
+-----------+---------+
|          4|    61424|
+-----------+---------+
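
The lab also suggests grouping hours into named periods such as morning, afternoon, evening, and late night. A minimal sketch of such a mapping is shown below in plain Python. The bucket boundaries and the function name `time_of_day` are illustrative assumptions, not part of the assignment.

```python
def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to a coarse bucket; the boundaries are illustrative."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "late night"


# Busiest (18) and slowest (4) hours from the output above
print(time_of_day(18), "/", time_of_day(4))
```

With these boundaries the busiest hour falls in the evening and the slowest in the late-night bucket.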



- Find the busiest and slowest day of the week (on average)

In [0]:
from pyspark.sql.functions import dayofweek

# Extract the day of the week from the pickup date
df_trips = df_trips.withColumn("day_of_week", dayofweek(col("pickup_date")))

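# Count total trips for each day of the week (dayofweek: 1 = Sunday ... 7 = Saturday).
# Note: these are monthly totals, not per-day averages, so weekdays that occur
# five times in the month are favored over those that occur four times.
avg_trips_by_day = df_trips.groupBy("day_of_week").agg(count("*").alias("num_trips"))

# Show the days of the week ordered from most to fewest trips
avg_trips_by_day.orderBy(col("num_trips").desc()).show()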


+-----------+---------+
|day_of_week|num_trips|
+-----------+---------+
|          5|  1357043|
|          4|  1265264|
|          3|  1209084|
|          6|  1087215|
|          7|  1009985|
|          2|   908121|
|          1|   859905|
+-----------+---------+
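
The table above lists monthly totals, while the question asks for an average per day of the week. Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), and January 2019 contains five Tuesdays, Wednesdays, and Thursdays but only four of every other weekday. The plain-Python sketch below divides each total by the number of times that weekday occurs in the month. It assumes all counted trips fall in January 2019, which ignores the handful of mis-dated pickups seen earlier.

```python
from datetime import date, timedelta

# Totals per Spark day_of_week (1 = Sunday ... 7 = Saturday), from the output above
totals = {5: 1357043, 4: 1265264, 3: 1209084, 6: 1087215,
          7: 1009985, 2: 908121, 1: 859905}
names = {1: "Sun", 2: "Mon", 3: "Tue", 4: "Wed", 5: "Thu", 6: "Fri", 7: "Sat"}

# Count how often each weekday occurs in January 2019
occurrences = {d: 0 for d in names}
day = date(2019, 1, 1)
while day.month == 1:
    spark_dow = day.isoweekday() % 7 + 1  # ISO Mon=1..Sun=7 -> Spark Sun=1..Sat=7
    occurrences[spark_dow] += 1
    day += timedelta(days=1)

# Average trips per calendar day for each weekday
averages = {names[d]: totals[d] / occurrences[d] for d in totals}
for name, avg_trips in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {avg_trips:,.0f}")
```

Under these assumptions Friday comes out slightly ahead of Thursday on a per-day basis, and Sunday remains the slowest day.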



- Check if distance or passengers affect the tip amount

In [0]:
# Calculate correlation between trip distance and tip amount
print("Correlation (distance vs tip):", df_trips.stat.corr("trip_distance", "tip_amount"))

# Calculate correlation between passenger count and tip amount
print("Correlation (passengers vs tip):", df_trips.stat.corr("passenger_count", "tip_amount"))

# Show average tip amount by passenger count
from pyspark.sql.functions import avg

df_trips.groupBy("passenger_count").agg(avg("tip_amount").alias("avg_tip")).orderBy("passenger_count").show()

Correlation (distance vs tip): 0.5269200663652669
Correlation (passengers vs tip): 0.004431051585116288
+---------------+--------------------+
|passenger_count|             avg_tip|
+---------------+--------------------+
|           NULL|0.061789899553571406|
|            0.0|  1.7869007761051638|
|            1.0|  1.8283524429075058|
|            2.0|  1.8339324029045228|
|            3.0|  1.7955889568213272|
|            4.0|  1.7027097823846875|
|            5.0|  1.8698681146978595|
|            6.0|  1.8568302035247934|
|            7.0|   6.542631578947368|
|            8.0|   6.480689655172414|
|            9.0|  3.1166666666666667|
+---------------+--------------------+
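
Reading these numbers: `DataFrame.stat.corr` computes the Pearson correlation coefficient, which ranges from -1 to 1. A value of about 0.53 suggests a moderate positive linear relationship between distance and tip, while about 0.004 suggests essentially no linear relationship between passenger count and tip. The sketch below implements the same coefficient in plain Python and applies it to the ten rows displayed earlier. It is meant only to show what is being computed, and a ten-row sample will not reproduce the full-dataset value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant
    return cov / sqrt(var_x * var_y)


# trip_distance and tip_amount from the ten rows displayed earlier
distances = [1.5, 2.6, 0.0, 0.0, 0.0, 0.0, 0.0, 1.3, 3.7, 2.1]
tips = [1.65, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.25, 3.7, 1.7]

print(round(pearson(distances, tips), 3))
```

Correlation alone does not show that distance causes higher tips. Longer trips also have higher fares, and tips are often a share of the fare.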



- What was the highest "extra" charge and which trip


In [0]:
from pyspark.sql.functions import max as spark_max

# Find the maximum extra charge (max is aliased to avoid shadowing Python's built-in max)
max_extra = df_trips.agg(spark_max("extra").alias("max_extra")).collect()[0]["max_extra"]

# Show the trip(s) that had this highest extra charge
df_trips.filter(col("extra") == max_extra).select("trip_id", "extra", "tpep_pickup_datetime", "trip_distance").show()

+-------+------+--------------------+-------------+
|trip_id| extra|tpep_pickup_datetime|trip_distance|
+-------+------+--------------------+-------------+
|5323483|535.38| 2019-01-23 08:58:09|          0.0|
+-------+------+--------------------+-------------+



- Are there any datapoints that seem to be strange/outliers (make sure to explain your reasoning in a markdown cell)?

In [0]:
df_trips.filter(
    (col("trip_distance") <= 0) |      # distance cannot be 0 or negative
    (col("fare_amount") < 0) |         # fare cannot be negative
    (col("passenger_count") > 6)       # more than 6 passengers is unusual
).show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+-------+-----------------+-----------+-----------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|trip_id|trip_duration_sec|pickup_date|pickup_hour|day_of_week|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+-------+-----------------+-----------+-----------+-----------+
|

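Several kinds of records look unrealistic: trips with a distance of 0 or less, trips with a negative fare, and trips with more than six passengers.
The earlier results point to further anomalies: negative trip durations, a trip of 831.8 miles, an "extra" charge of 535.38 on a 0.0-mile trip, and pickup dates outside January 2019 (for example 2018-12-21).
These data points are likely caused by data entry or recording errors and should be treated as outliers.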
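
The filter above returns the suspicious rows but does not say which rule each row violated. A minimal sketch of labeling each record with its reasons is shown below in plain Python, using two rows copied from the displayed data. The function name `outlier_reasons` and the reason strings are assumptions made for illustration.

```python
def outlier_reasons(trip: dict) -> list:
    """Return the rules from the filter above that a trip record violates."""
    reasons = []
    if trip["trip_distance"] <= 0:
        reasons.append("non-positive distance")
    if trip["fare_amount"] < 0:
        reasons.append("negative fare")
    if trip["passenger_count"] is not None and trip["passenger_count"] > 6:
        reasons.append("more than 6 passengers")
    return reasons


# Two rows copied from the displayed data (trip_id 0 and trip_id 4)
trips = [
    {"trip_id": 0, "trip_distance": 1.5, "fare_amount": 7.0, "passenger_count": 1.0},
    {"trip_id": 4, "trip_distance": 0.0, "fare_amount": 52.0, "passenger_count": 5.0},
]

for trip in trips:
    print(trip["trip_id"], outlier_reasons(trip))
```

Keeping the reason alongside each row makes it easier to count how many records each rule catches and to decide rule by rule whether to drop or keep them.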


### Part 2

- Using the code for loading the first dataset as an example, load the taxi zone lookup and answer the following questions.
- Which borough had the most pickups? Dropoffs?
- What are the busy/slow times by borough?
- What are the busiest days of the week by borough?
- What is the average trip distance by borough?
- What is the average trip fare by borough?
- What are the highest/lowest fare amounts for a trip, and which borough is associated with each?
- Load the dataset from the most recently available January. Is there a change in any of the average metrics?

Load the taxi zone lookup dataset


In [0]:


# Define variables for the taxi zone lookup file
zone_schema = 'taxi_zone_lookup'
zone_volume = 'data'
zone_file_name = 'taxi+_zone_lookup.csv'
zone_table_name = 'tbl_taxi_zone_lookup'
zone_path_volume = '/Volumes/' + catalog + "/" + zone_schema + '/' + zone_volume
zone_path_table = catalog + "." + zone_schema
zone_download_url = 'https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv'

# Create the schema and volume if they don't exist
spark.sql('create schema if not exists ' + catalog + '.' + zone_schema)
spark.sql('create volume if not exists ' + catalog + '.' + zone_schema + '.' + zone_volume)

# Download the lookup CSV file into your volume
dbutils.fs.cp(f"{zone_download_url}", f"{zone_path_volume}/{zone_file_name}")

# Load the lookup table into a Spark DataFrame
df_zones = spark.read.csv(f"{zone_path_volume}/{zone_file_name}", header=True, inferSchema=True)

df_zones.show(5)


+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
+----------+-------------+--------------------+------------+
only showing top 5 rows


Join trips with zone lookup


In [0]:

# Join for pickup borough. Only the lookup columns that are needed are selected,
# so the second join does not add duplicate Zone/service_zone columns.
df_trips = df_trips.join(
    df_zones.select(
        col("LocationID").alias("PULocationID_lookup"),
        col("Borough").alias("pickup_borough"),
    ),
    df_trips["PULocationID"] == col("PULocationID_lookup"),
    "left"
)

# Join for dropoff borough
df_trips = df_trips.join(
    df_zones.select(
        col("LocationID").alias("DOLocationID_lookup"),
        col("Borough").alias("dropoff_borough"),
    ),
    df_trips["DOLocationID"] == col("DOLocationID_lookup"),
    "left"
)
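
A left join keeps every trip, even when its location ID has no match in the lookup table; the borough is then null. The plain-Python sketch below mimics that behavior with a dictionary built from the five lookup rows shown above. The two trips and the helper `left_join_borough` are hypothetical and exist only to illustrate the matched and unmatched cases.

```python
# First rows of the zone lookup, copied from the output above
zone_borough = {1: "EWR", 2: "Queens", 3: "Bronx", 4: "Manhattan", 5: "Staten Island"}


def left_join_borough(trips, lookup):
    """Attach a pickup_borough to every trip; unmatched IDs get None."""
    return [
        {**trip, "pickup_borough": lookup.get(trip["PULocationID"])}
        for trip in trips
    ]


# Hypothetical trips: ID 4 is in the partial lookup, ID 151 is not
trips = [{"trip_id": 0, "PULocationID": 4}, {"trip_id": 1, "PULocationID": 151}]

joined = left_join_borough(trips, zone_borough)
for row in joined:
    print(row)
```

With the full lookup file most IDs will match, so it is worth checking how many rows end up with a null borough before aggregating.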


- Which borough had the most pickups? Dropoffs?


In [0]:

from pyspark.sql.functions import count

# Count pickups per borough
pickup_counts = df_trips.groupBy("pickup_borough").agg(count("*").alias("num_pickups")).orderBy(col("num_pickups").desc())
pickup_counts.show()

# Count dropoffs per borough
dropoff_counts = df_trips.groupBy("dropoff_borough").agg(count("*").alias("num_dropoffs")).orderBy(col("num_dropoffs").desc())
dropoff_counts.show()


+--------------+-----------+
|pickup_borough|num_pickups|
+--------------+-----------+
|     Manhattan|    6950965|
|        Queens|     471173|
|       Unknown|     159815|
|      Brooklyn|      91905|
|         Bronx|      18062|
|           N/A|       3890|
|           EWR|        446|
| Staten Island|        361|
+--------------+-----------+

+---------------+------------+
|dropoff_borough|num_dropoffs|
+---------------+------------+
|      Manhattan|     6817355|
|         Queens|      340972|
|       Brooklyn|      301105|
|        Unknown|      149097|
|          Bronx|       58085|
|            N/A|       16904|
|            EWR|       10914|
|  Staten Island|        2185|
+---------------+------------+



- What are the busy/slow times by borough?


In [0]:

# Group by pickup_borough and hour
trips_by_borough_hour = df_trips.groupBy("pickup_borough", "pickup_hour").agg(count("*").alias("num_trips"))

# Sort by borough, then by trip count descending (show() prints only the first 20 rows)
trips_by_borough_hour.orderBy(col("pickup_borough"), col("num_trips").desc()).show()


+--------------+-----------+---------+
|pickup_borough|pickup_hour|num_trips|
+--------------+-----------+---------+
|         Bronx|          7|     1803|
|         Bronx|          8|     1445|
|         Bronx|          6|     1301|
|         Bronx|          9|     1158|
|         Bronx|         10|     1079|
|         Bronx|         14|     1014|
|         Bronx|         12|      956|
|         Bronx|         13|      918|
|         Bronx|         15|      897|
|         Bronx|         11|      885|
|         Bronx|         17|      812|
|         Bronx|         16|      756|
|         Bronx|         18|      741|
|         Bronx|          5|      736|
|         Bronx|         19|      519|
|         Bronx|         20|      440|
|         Bronx|         23|      410|
|         Bronx|         21|      408|
|         Bronx|          4|      403|
|         Bronx|         22|      353|
+--------------+-----------+---------+
only showing top 20 rows


- What are the busiest days of the week by borough?


In [0]:

trips_by_borough_day = df_trips.groupBy("pickup_borough", "day_of_week").agg(count("*").alias("num_trips"))

# Show results
trips_by_borough_day.orderBy(col("pickup_borough"), col("num_trips").desc()).show()


+--------------+-----------+---------+
|pickup_borough|day_of_week|num_trips|
+--------------+-----------+---------+
|         Bronx|          5|     3121|
|         Bronx|          3|     3059|
|         Bronx|          4|     2999|
|         Bronx|          6|     2666|
|         Bronx|          2|     2177|
|         Bronx|          1|     2112|
|         Bronx|          7|     1928|
|      Brooklyn|          3|    15779|
|      Brooklyn|          5|    15714|
|      Brooklyn|          4|    15101|
|      Brooklyn|          6|    13092|
|      Brooklyn|          7|    11604|
|      Brooklyn|          1|    11099|
|      Brooklyn|          2|     9516|
|           EWR|          4|       83|
|           EWR|          3|       77|
|           EWR|          6|       74|
|           EWR|          1|       68|
|           EWR|          5|       58|
|           EWR|          7|       55|
+--------------+-----------+---------+
only showing top 20 rows


- Highest/lowest fare amounts for a trip: which borough is associated with each?


In [0]:


# Alias the Spark functions so they do not shadow Python's built-in max and min
from pyspark.sql.functions import max as spark_max, min as spark_min

# Highest fare per pickup borough
max_fares = df_trips.groupBy("pickup_borough").agg(spark_max("fare_amount").alias("max_fare"))
max_fares.show()

# Lowest fare per pickup borough
min_fares = df_trips.groupBy("pickup_borough").agg(spark_min("fare_amount").alias("min_fare"))
min_fares.show()


+--------------+---------+
|pickup_borough| max_fare|
+--------------+---------+
|      Brooklyn|    412.0|
|         Bronx|    679.5|
|     Manhattan|623259.86|
|        Queens|   655.35|
|           N/A|  1079.15|
|       Unknown|  36090.3|
| Staten Island|   355.55|
|           EWR|    342.0|
+--------------+---------+

+--------------+--------+
|pickup_borough|min_fare|
+--------------+--------+
|      Brooklyn|  -165.0|
|         Bronx|  -300.0|
|     Manhattan|  -252.0|
|        Queens|  -362.0|
|           N/A|  -320.0|
|       Unknown|  -224.0|
| Staten Island|   -12.0|
|           EWR| -142.06|
+--------------+--------+
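
The two tables give the extremes per borough, while the question asks for the single highest and lowest fare and the borough each belongs to. The plain-Python sketch below picks the overall extremes from the values printed above. The boroughs here are pickup boroughs, since that is how the tables were grouped.

```python
# Per-borough extremes copied from the two tables above
max_fare = {"Brooklyn": 412.0, "Bronx": 679.5, "Manhattan": 623259.86,
            "Queens": 655.35, "N/A": 1079.15, "Unknown": 36090.3,
            "Staten Island": 355.55, "EWR": 342.0}
min_fare = {"Brooklyn": -165.0, "Bronx": -300.0, "Manhattan": -252.0,
            "Queens": -362.0, "N/A": -320.0, "Unknown": -224.0,
            "Staten Island": -12.0, "EWR": -142.06}

# Borough whose maximum is the largest, and whose minimum is the smallest
highest_borough = max(max_fare, key=max_fare.get)
lowest_borough = min(min_fare, key=min_fare.get)

print("highest fare:", highest_borough, max_fare[highest_borough])
print("lowest fare: ", lowest_borough, min_fare[lowest_borough])
```

The highest fare is therefore associated with a Manhattan pickup and the lowest with a Queens pickup. Both values look like outliers: a fare above 600,000 and negative fares are not plausible charges for a ride.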



- What is the average trip distance by borough?


In [0]:
from pyspark.sql.functions import avg

# Average distance per pickup borough
avg_distance = df_trips.groupBy("pickup_borough").agg(avg("trip_distance").alias("avg_distance"))
avg_distance.show()

+--------------+------------------+
|pickup_borough|      avg_distance|
+--------------+------------------+
|      Brooklyn| 4.787677275447492|
|         Bronx| 7.233194552098303|
|     Manhattan|2.2286693358402596|
|        Queens|11.283218499361993|
|           N/A| 3.193850899742941|
|       Unknown| 2.415464130400774|
| Staten Island|12.503601108033246|
|           EWR| 2.641098654708519|
+--------------+------------------+



- What is the average trip fare by borough?


In [0]:

from pyspark.sql.functions import avg

# Average fare per pickup borough
avg_fare = df_trips.groupBy("pickup_borough").agg(avg("fare_amount").alias("avg_fare"))
avg_fare.show()


+--------------+------------------+
|pickup_borough|          avg_fare|
+--------------+------------------+
|      Brooklyn|18.649132800172286|
|         Bronx| 26.26890543682963|
|     Manhattan|10.792468572351568|
|        Queens| 35.14462651722029|
|           N/A|  59.5731593830335|
|       Unknown|14.944423051653523|
| Staten Island|45.289861495844896|
|           EWR| 76.24024663677126|
+--------------+------------------+



- Load the dataset from the most recently available January. Is there a change in any of the average metrics?


In [0]:

latest_file = 'yellow_tripdata_2025-01.parquet'
latest_download_url = f'https://d37ci6vzurychx.cloudfront.net/trip-data/{latest_file}'

# Copy and load the new data
dbutils.fs.cp(latest_download_url, f"{path_volume}/{latest_file}")

# Read the latest dataset (Parquet carries its own schema, so no CSV-style options are needed)
df_trips_latest = spark.read.parquet(f"{path_volume}/{latest_file}")



# Join with the taxi zone lookup to get borough names
# Join for pickup borough
df_trips_latest = df_trips_latest.join(
    df_zones.withColumnRenamed("LocationID", "PULocationID_lookup"),
    df_trips_latest["PULocationID"] == col("PULocationID_lookup"),
    "left"
).withColumnRenamed("Borough", "pickup_borough")
  
# Average trip distance per pickup borough
avg_distance_latest = (
    df_trips_latest.groupBy("pickup_borough")
    .agg(avg("trip_distance").alias("avg_trip_distance"))
    .orderBy(col("avg_trip_distance").desc())
)

print("Average Trip Distance by Borough (January 2025):")
avg_distance_latest.show()

# Average fare amount per pickup borough
avg_fare_latest = (
    df_trips_latest.groupBy("pickup_borough")
    .agg(avg("fare_amount").alias("avg_fare_amount"))
    .orderBy(col("avg_fare_amount").desc())
)

print("Average Fare Amount by Borough (January 2025):")
avg_fare_latest.show()


Average Trip Distance by Borough (January 2025):
+--------------+------------------+
|pickup_borough| avg_trip_distance|
+--------------+------------------+
|         Bronx| 65.91340478936304|
|           N/A|28.235833333333336|
|      Brooklyn|24.807041925231157|
|        Queens|13.360874380478473|
| Staten Island| 8.305703124999999|
|     Manhattan|4.4439188903536015|
|       Unknown| 3.202056258444915|
|           EWR|0.8962864721485414|
+--------------+------------------+

Average Fare Amount by Borough (January 2025):
+--------------+------------------+
|pickup_borough|   avg_fare_amount|
+--------------+------------------+
|           EWR| 81.47989389920424|
|           N/A| 78.32668840579711|
|        Queens| 48.35070962689943|
|         Bronx|27.772212197272925|
| Staten Island|23.920820312500005|
|      Brooklyn|  23.4369228091425|
|       Unknown|18.669125414568267|
|     Manhattan|13.869130184261868|
+--------------+------------------+



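- **Trip distances** increased sharply in 2025, especially in the **Bronx** (about 7.2 to 65.9 miles) and **Brooklyn** (about 4.8 to 24.8 miles). Averages this large more plausibly reflect extreme outliers or data recording changes than genuinely longer routes.
- **Fares** rose in most boroughs, most sharply in **Queens** (about 35.14 to 48.35), while **EWR (Newark Airport)** rose more modestly (about 76.24 to 81.48) and remains the most expensive pickup area. This possibly reflects higher fuel costs or fare adjustments.
- Among the five boroughs, **Manhattan** still has the **shortest and cheapest trips**, consistent with its dense urban layout.
- **Staten Island** shows a notable **drop in fare and distance** (fare about 45.29 to 23.92, distance about 12.5 to 8.3 miles), possibly due to fewer long-distance trips. With only 361 Staten Island pickups in the 2019 data, its averages are sensitive to a few unusual trips.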
- The **N/A and Unknown** categories likely represent incomplete location data and should be interpreted cautiously.
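
To make the comparison concrete, the change in average fare can be expressed as a percentage. The plain-Python sketch below uses the per-borough averages printed above, rounded to two decimals, and leaves out the N/A and Unknown categories. `pct_change` is a hypothetical helper name.

```python
# Average fare per pickup borough, rounded from the 2019 and 2025 outputs above
avg_fare_2019 = {"Bronx": 26.27, "Brooklyn": 18.65, "EWR": 76.24,
                 "Manhattan": 10.79, "Queens": 35.14, "Staten Island": 45.29}
avg_fare_2025 = {"Bronx": 27.77, "Brooklyn": 23.44, "EWR": 81.48,
                 "Manhattan": 13.87, "Queens": 48.35, "Staten Island": 23.92}


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


changes = {b: pct_change(avg_fare_2019[b], avg_fare_2025[b]) for b in avg_fare_2019}

# Print boroughs from the largest increase to the largest decrease
for borough, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{borough:<14}{change:+.1f}%")
```

Because neither dataset was cleaned of outliers first, these percentages describe the raw averages and should be read with the same caution as the tables they come from.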


### Part 3

- Choose 3 questions from above and re-answer them using the language you did not use for the main notebook (i.e., if you completed the exercise in Python, redo 3 questions in pure SQL). At least one of the questions to be redone must involve a join.

In [0]:
# Register the DataFrame as a temporary view for SQL queries
df_trips_latest.createOrReplaceTempView("yellow_trips_latest")
df_zones.createOrReplaceTempView("taxi_zones")


Busy/slow times by borough (hour)

In [0]:
%sql
SELECT
    pickup_borough,
    HOUR(tpep_pickup_datetime) AS pickup_hour,
    COUNT(*) AS trip_count
FROM yellow_trips_latest
WHERE pickup_borough IS NOT NULL
GROUP BY pickup_borough, HOUR(tpep_pickup_datetime)
ORDER BY pickup_borough, trip_count DESC;


pickup_borough,pickup_hour,trip_count
Bronx,7,1725
Bronx,8,1497
Bronx,6,1353
Bronx,9,935
Bronx,14,708
Bronx,5,690
Bronx,10,668
Bronx,11,659
Bronx,12,655
Bronx,13,632


Busiest days of the week by borough (with JOIN)

In [0]:
%sql
SELECT
    z.Borough AS pickup_borough,
    date_format(t.tpep_pickup_datetime, 'E') AS day_of_week,
    COUNT(*) AS trip_count
FROM yellow_trips_latest AS t
LEFT JOIN taxi_zones AS z
    ON t.PULocationID = z.LocationID
WHERE z.Borough IS NOT NULL
GROUP BY z.Borough, date_format(t.tpep_pickup_datetime, 'E')
ORDER BY z.Borough, trip_count DESC;


pickup_borough,day_of_week,trip_count
Bronx,Fri,2741
Bronx,Wed,2652
Bronx,Thu,2416
Bronx,Sun,1816
Bronx,Tue,1776
Bronx,Mon,1760
Bronx,Sat,1580
Brooklyn,Wed,11427
Brooklyn,Fri,11333
Brooklyn,Thu,9906


Average trip distance by borough

In [0]:
%sql
SELECT
    pickup_borough,
    ROUND(AVG(trip_distance), 2) AS avg_trip_distance
FROM yellow_trips_latest
WHERE pickup_borough IS NOT NULL
GROUP BY pickup_borough
ORDER BY avg_trip_distance DESC;


pickup_borough,avg_trip_distance
Bronx,65.91
,28.24
Brooklyn,24.81
Queens,13.36
Staten Island,8.31
Manhattan,4.44
Unknown,3.2
EWR,0.9



### Part 4

As of Spark 4, DataFrames have native visualization support. Choose at least 3 questions from above and provide visualizations.


In [0]:
from pyspark.sql.functions import avg, col

# Calculate average trip distance per pickup borough
avg_distance_by_borough = (
    df_trips_latest.groupBy("pickup_borough")
    .agg(avg("trip_distance").alias("avg_trip_distance"))
    .orderBy(col("avg_trip_distance").desc())
)


display(avg_distance_by_borough) 


pickup_borough,avg_trip_distance
Bronx,65.91340478936304
,28.235833333333336
Brooklyn,24.80704192523116
Queens,13.360874380478473
Staten Island,8.305703124999999
Manhattan,4.4439188903536015
Unknown,3.202056258444915
EWR,0.8962864721485414


Databricks visualization. Run in Databricks to view.

In [0]:
# Calculate average fare per pickup borough
avg_fare_by_borough = (
    df_trips_latest.groupBy("pickup_borough")
    .agg(avg("fare_amount").alias("avg_fare"))
    .orderBy(col("avg_fare").desc())
)


display(avg_fare_by_borough)  


pickup_borough,avg_fare
EWR,81.47989389920424
,78.32668840579711
Queens,48.35070962689943
Bronx,27.772212197272925
Staten Island,23.920820312500005
Brooklyn,23.4369228091425
Unknown,18.669125414568267
Manhattan,13.869130184261868


Databricks visualization. Run in Databricks to view.

In [0]:
# Calculate average fare per pickup borough
avg_fare_by_borough = (
    df_trips_latest.groupBy("pickup_borough")
    .agg(avg("fare_amount").alias("avg_fare"))
    .orderBy(col("avg_fare").desc())
)

display(avg_fare_by_borough) 


pickup_borough,avg_fare
EWR,81.47989389920424
,78.32668840579711
Queens,48.35070962689943
Bronx,27.772212197272925
Staten Island,23.920820312500005
Brooklyn,23.4369228091425
Unknown,18.669125414568267
Manhattan,13.869130184261868


Databricks visualization. Run in Databricks to view.

## Where to go from here

- Continue building the dataset by loading more data. Start by completing the data for 2019 and calculating the busiest season (fall, winter, spring, summer).
- Explore one or more datasets of your choosing.