## PROJECT 1 - NYC Taxi Data Analysis

This notebook demonstrates how to analyze a sample of NYC Taxi trip data using PySpark and Shapely.  
We will:

1. Load the CSV data
2. Filter out outliers (invalid or overly long trips)  
3. Enrich the data with borough names using GeoJSON and Shapely  
4. Compute several queries:
   - **Query 1**: Taxi utilization  
   - **Query 2**: Average time to find the next fare (per destination borough)  
   - **Query 3**: Number of trips starting and ending in the same borough  
   - **Query 4**: Number of trips that start in one borough and end in another  
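The borough enrichment in step 3 comes down to a point-in-polygon test for each pickup and dropoff coordinate. The sketch below shows the idea with a dependency-free ray-casting test, used here as a stand-in for Shapely's containment check. The rectangles in `TOY_BOROUGHS` and the helper names are hypothetical; real boundaries would come from the GeoJSON file.

```python
from typing import Dict, List, Optional, Tuple

Ring = List[Tuple[float, float]]  # polygon outline as (lon, lat) vertices

# Toy rectangular "boroughs" (assumption, not real NYC boundaries)
TOY_BOROUGHS: Dict[str, Ring] = {
    "A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def point_in_ring(lon: float, lat: float, ring: Ring) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def lookup_borough(lon: float, lat: float,
                   boroughs: Dict[str, Ring] = TOY_BOROUGHS) -> Optional[str]:
    """Return the first borough whose outline contains the point, else None."""
    for name, ring in boroughs.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None
```

With borough names attached to both ends of a trip, Queries 3 and 4 reduce to comparing the pickup and dropoff borough of each row.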


In [1]:
%pip install kafka-python

Note: you may need to restart the kernel to use updated packages.


## 1. Imports and Spark Session

In [51]:
import json
# from shapely.geometry import shape, Point

# PySpark imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [52]:
schema = StructType([
    StructField("medallion", StringType(), True),
    StructField("hack_license", StringType(), True),
    StructField("pickup_datetime", StringType(), True),
    StructField("dropoff_datetime", StringType(), True),
    StructField("trip_time_in_secs", IntegerType(), True),
    StructField("trip_distance", DoubleType(), True),
    StructField("pickup_longitude", DoubleType(), True),
    StructField("pickup_latitude", DoubleType(), True),
    StructField("dropoff_longitude", DoubleType(), True),
    StructField("dropoff_latitude", DoubleType(), True),
    StructField("payment_type", StringType(), True),
    StructField("fare_amount", DoubleType(), True),
    StructField("surcharge", DoubleType(), True),
    StructField("mta_tax", StringType(), True),      # kept as string: the data contains malformed values such as "0.50.1"
    StructField("tip_amount", DoubleType(), True),
    StructField("tolls_amount", StringType(), True), # kept as string for the same reason
    StructField("total", DoubleType(), True)
])
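The schema keeps `mta_tax` and `tolls_amount` as strings because the sample contains malformed values such as `0.50.1`. If these columns are needed as numbers later, one option is to parse them defensively and treat unparseable values as missing. The sketch below shows that in plain Python; `parse_amount` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional


def parse_amount(raw: Optional[str]) -> Optional[float]:
    """Return the value as a float, or None if it is missing or malformed."""
    if raw is None:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        # e.g. "0.50.1" or an empty string
        return None
```

Rows where the parsed value is `None` can then be dropped or counted separately, depending on how much the query relies on that column.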

In [53]:
spark = SparkSession.builder \
    .appName("debs_grand_challenge") \
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,"
        "org.apache.spark:spark-token-provider-kafka-0-10_2.12:3.5.1"
    ) \
    .getOrCreate()

25/03/26 13:47:08 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [54]:

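df_spark = spark.read \
    .format("csv") \
    .option("header", "false") \
    .schema(schema) \
    .load("../data/sample.csv")

# Publish each row as a JSON message; topic and broker are assumed to match the read cells below
df_spark.selectExpr("to_json(struct(*)) AS value").write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("topic", "nyc-taxi-clean") \
    .save()

print("✅ Finished sending data to Kafka (nyc-taxi-clean)!")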


✅ Finished sending data to Kafka (nyc-taxi-clean)!
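The read cells that follow parse each Kafka message with `from_json` and the schema defined above, so each message value is expected to be a JSON object keyed by column name. The sketch below shows that payload shape in plain Python, with no Spark or Kafka required. The helper name `csv_line_to_json` and the sample row in the usage note are illustrative and not taken from the dataset.

```python
import csv
import io
import json

# Column names and Python-side types mirroring the Spark schema above
FIELDS = [
    ("medallion", str), ("hack_license", str),
    ("pickup_datetime", str), ("dropoff_datetime", str),
    ("trip_time_in_secs", int), ("trip_distance", float),
    ("pickup_longitude", float), ("pickup_latitude", float),
    ("dropoff_longitude", float), ("dropoff_latitude", float),
    ("payment_type", str), ("fare_amount", float),
    ("surcharge", float), ("mta_tax", str),
    ("tip_amount", float), ("tolls_amount", str), ("total", float),
]


def csv_line_to_json(line: str) -> str:
    """Convert one header-less CSV line into a JSON object keyed by column name."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = {name: cast(value) for (name, cast), value in zip(FIELDS, values)}
    return json.dumps(record)
```

For example, `csv_line_to_json("M1,H1,2013-01-01 00:00:09,2013-01-01 00:00:36,26,0.1,-73.99,40.72,-73.98,40.73,CSH,2.5,0.5,0.5,0.0,0.0,3.5")` yields a JSON string whose `trip_time_in_secs` is the number 26 and whose `mta_tax` stays a string, matching the schema.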


In [56]:

kafka_df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "nyc-taxi-clean") \
    .load()

kafka_df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [57]:
# Convert the Kafka binary "value" to string
raw_str_df = kafka_df.selectExpr("CAST(value AS STRING) AS raw_string")

In [58]:
from pyspark.sql.functions import col, to_json, struct, from_json, to_timestamp, year, month, dayofmonth, regexp_extract

# Parse JSON with the same schema
parsed_df = raw_str_df.select(from_json(col("raw_string"), schema).alias("data")).select("data.*")

print("Parsed DataFrame from Kafka (preview):")
parsed_df.show(5, truncate=False)

Parsed DataFrame from Kafka (preview):


25/03/26 13:47:09 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.


+--------------------------------+--------------------------------+-------------------+-------------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----+
|medallion                       |hack_license                    |pickup_datetime    |dropoff_datetime   |trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total|
+--------------------------------+--------------------------------+-------------------+-------------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----+
|5EE2C4D3BF57BDB455E74B03B89E43A7|E96EF8F6E6122591F9465376043B946D|2013-01-01 00:00:09|2013-01-01 00:00:36|26               |0.1          |-73.99221       |4

In [59]:
df_clean = parsed_df.dropna()

df_clean = df_clean.filter(
    (col("trip_time_in_secs") > 0) &
    (col("trip_distance") > 0) &
    (col("pickup_longitude") != 0.0) &
    (col("pickup_latitude") != 0.0) &
    (col("dropoff_longitude") != 0.0) &
    (col("dropoff_latitude") != 0.0)
)
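The filter above removes rows with non-positive duration or distance and rows with zero coordinates, but it does not cap trip duration, although the introduction also mentions overly long trips. One way to express the combined rule is sketched below in plain Python. The four-hour threshold `MAX_TRIP_SECS` and the function name are assumptions, not values from the notebook. In PySpark the cap would be one more condition on `col("trip_time_in_secs")` inside the same `filter` call.

```python
# Assumed upper bound for a plausible trip (4 hours); tune to the dataset
MAX_TRIP_SECS = 4 * 60 * 60

COORD_COLUMNS = (
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
)


def is_plausible_trip(trip: dict) -> bool:
    """Mirror the notebook's filter and add an upper bound on duration."""
    return (
        0 < trip["trip_time_in_secs"] <= MAX_TRIP_SECS
        and trip["trip_distance"] > 0
        and all(trip[c] != 0.0 for c in COORD_COLUMNS)
    )
```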

In [60]:
df_clean.show(5)

+--------------------+--------------------+-------------------+-------------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----+
|           medallion|        hack_license|    pickup_datetime|   dropoff_datetime|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total|
+--------------------+--------------------+-------------------+-------------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----+
|5EE2C4D3BF57BDB45...|E96EF8F6E6122591F...|2013-01-01 00:00:09|2013-01-01 00:00:36|               26|          0.1|       -73.99221|      40.725124|       -73.991646|       40.726658|         CSH|        2.5|      0.5| 0.50.1|   


In [61]:
df_clean.count()

29175

In [62]:


# Derive time columns from the pickup timestamp
df_time = df_clean \
    .withColumn("pickup_ts", to_timestamp("pickup_datetime", "yyyy-MM-dd HH:mm:ss")) \
    .withColumn("year", year("pickup_ts")) \
    .withColumn("month", month("pickup_ts")) \
    .withColumn("day", dayofmonth("pickup_ts"))
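To check what the derived columns should contain for a given row, the same parsing can be reproduced with the standard library. This is a small sketch: the Spark pattern `yyyy-MM-dd HH:mm:ss` corresponds to the `strptime` format `%Y-%m-%d %H:%M:%S`, and `time_parts` is a hypothetical helper name.

```python
from datetime import datetime


def time_parts(pickup_datetime: str) -> dict:
    """Parse a pickup timestamp string and extract year, month and day."""
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {"pickup_ts": ts, "year": ts.year, "month": ts.month, "day": ts.day}
```

Unlike the Spark expression, `strptime` raises `ValueError` when the string does not match the format, which is useful for spotting malformed timestamps in a small sample.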


In [63]:

# Write the cleaned data as Parquet, partitioned by year and month
df_time.write \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .parquet("data/kafka_cleaned_partitioned")

spark.stop()
print("✅ Done reading from Kafka, cleaning, and writing partitioned data!")

✅ Done reading from Kafka, cleaning, and writing partitioned data!
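`partitionBy("year", "month")` writes Hive-style directories such as `year=2013/month=1`. The partition values are stored in the directory names rather than inside the Parquet files. That is why reading a single part file directly returns the data columns plus `pickup_ts` and `day`, but not `year` or `month`, while reading the dataset root lets Spark recover them from the paths. The sketch below illustrates the path layout; the helper names and the shortened file name in the usage note are hypothetical.

```python
import re


def partition_dir(root: str, **parts) -> str:
    """Build the directory for one partition, e.g. root/year=2013/month=1."""
    segments = [root.rstrip("/")] + [f"{key}={value}" for key, value in parts.items()]
    return "/".join(segments)


def partition_values(path: str) -> dict:
    """Recover partition column values (as strings) from key=value path segments."""
    return dict(re.findall(r"([^/=]+)=([^/]+)", path))
```

For example, `partition_values("data/kafka_cleaned_partitioned/year=2013/month=1/part-00000.snappy.parquet")` recovers `year` and `month` from the path alone.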


In [64]:
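# Create a new session to verify the written output
new_spark = SparkSession.builder.appName("CheckParquet").getOrCreate()

# Read a single part file directly; the partition columns (year, month) are encoded
# in the directory names, so they do not appear as columns in this DataFrame
df_check = new_spark.read.parquet("data/kafka_cleaned_partitioned/year=2013/month=1/part-00000-b700166a-b17a-4a71-8686-e17a7e547570.c000.snappy.parquet")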
df_check.show(10, truncate=False)

new_spark.stop()

25/03/26 13:47:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


+--------------------------------+--------------------------------+-------------------+-------------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----+-------------------+---+
|medallion                       |hack_license                    |pickup_datetime    |dropoff_datetime   |trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total|pickup_ts          |day|
+--------------------------------+--------------------------------+-------------------+-------------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+----------+------------+-----+-------------------+---+
|5EE2C4D3BF57BDB455E74B03B89E43A7|E96EF8F6E6122591F9465376043B946D|2013-01-01 00:00:0