<a href="https://colab.research.google.com/github/aavarela/SPBD_Labs/blob/main/docs/labs/projs/SPBD2526_Proj1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SPBD 2526 Project 1

 version 0.1 (5 Nov 2025)

# Context
The project scenario involves a dataset of taxi rides, collected circa 2013, in the New York City area.

Each taxi ride corresponds to a line in the dataset, comprising the passenger pick-up and drop-off points and their respective timestamps, as well as information related to the payment, the taxi and its driver.

This project scenario is inspired by the [ACM DEBS 2015 Grand Challenge](http://www.debs2015.org/call-grand-challenge.html).

### Taxi Rides

Each completed taxi ride comprises a number of attributes, separated by commas, as follows:

| Attribute   | Description |
| :---        |        :--- |
|medallion| an md5sum of the taxi identifier (bound to the vehicle)|
|hack_license| an md5sum of the identifier for the taxi license|
|pickup_datetime| time when the passenger(s) were picked up|
|dropoff_datetime| time when the passenger(s) were dropped off|
|trip_time_in_secs| duration of the trip in seconds|
|trip_distance| trip distance in miles|
|pickup_longitude| longitude coordinate of the pickup location|
|pickup_latitude| latitude coordinate of the pickup location|
|dropoff_longitude| longitude coordinate of the drop-off location|
|dropoff_latitude| latitude coordinate of the drop-off location|
|payment_type| the payment method - credit card or cash|
|fare_amount| fare amount in dollars|
|surcharge| surcharge in dollars|
|mta_tax| tax in dollars|
|tip_amount| tip in dollars|
|tolls_amount| bridge and tunnel tolls in dollars|
|total_amount| total paid amount in dollars|

---

## Dataset

The dataset is available in several forms:

* Sample of 1% of the available data, for the whole period (roughly 1.7 million rides, ~120 MB) [download](https://www.dropbox.com/scl/fi/v8ei5laqcalrx30z3lsty/taxi_rides_1pc.csv.gz?rlkey=q1lq7l56c4j97h9kymsdroau5&st=iurdwnwj&dl=0);
* The whole year of 2013 (~ 173 million events) (~ 12 GB) (~33 GB expanded) [download](https://drive.google.com/file/d/0B4zFfvIVhcMzcWV5SEQtSUdtMWc/view?usp=sharing);

* (Soon) Sample of the 10 days leading up to Christmas;

* (Soon) Sample corresponding to rides taken on a given day of the week;

* More samples will be provided if necessary.

---

* Events are reported at the end of the trip, i.e., upon arrival, in the order of their drop-off timestamps.

* Events with the same *dropoff_datetime* are in random order.

* Quality of the data is **not perfect**.

 + Some events might be missing information, such as the *drop-off* or *pickup* data;

 + Moreover, some information, such as the *fare price*, might have been entered incorrectly by the taxi drivers, thus introducing additional skew.
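
The data-quality issues listed above suggest filtering rides before any analysis. The following is a minimal sketch in plain Python, not a prescribed cleaning procedure: the function name `is_usable` and every rule in it (including treating a coordinate of exactly 0.0 as a missing GPS fix) are assumptions made for illustration. The same conditions could be expressed as DataFrame filters.

```python
from datetime import datetime

COORDINATE_FIELDS = (
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
)
REQUIRED_FIELDS = COORDINATE_FIELDS + (
    "pickup_datetime", "dropoff_datetime",
    "trip_time_in_secs", "trip_distance", "fare_amount",
)

def is_usable(ride):
    """Return True if a ride (a dict keyed by column name) passes basic sanity checks."""
    # Reject rides with any missing required attribute.
    if any(ride.get(field) is None for field in REQUIRED_FIELDS):
        return False
    # Assumption: a coordinate of exactly 0.0 stands for a missing GPS fix.
    if any(ride[field] == 0.0 for field in COORDINATE_FIELDS):
        return False
    # The drop-off must happen after the pickup.
    if ride["dropoff_datetime"] <= ride["pickup_datetime"]:
        return False
    # Assumption: zero-length or zero-duration trips are data-entry errors.
    if ride["trip_time_in_secs"] <= 0 or ride["trip_distance"] <= 0:
        return False
    # Assumption: a completed ride always has a positive fare.
    return ride["fare_amount"] > 0

# Made-up ride used only to exercise the checks.
good_ride = {
    "pickup_datetime": datetime(2013, 1, 1, 8, 0),
    "dropoff_datetime": datetime(2013, 1, 1, 8, 10),
    "pickup_longitude": -73.9855, "pickup_latitude": 40.7580,
    "dropoff_longitude": -73.9700, "dropoff_latitude": 40.7640,
    "trip_time_in_secs": 600, "trip_distance": 1.2, "fare_amount": 7.5,
}
missing_dropoff = dict(good_ride, dropoff_latitude=None)
zero_fare = dict(good_ride, fare_amount=0.0)

print(is_usable(good_ride), is_usable(missing_dropoff), is_usable(zero_fare))
```

Whatever rules are chosen, reporting how many rides each rule discards is itself a useful observation about the dataset.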

# Geographic Coordinates

For simplicity, the dataset coordinates can be mapped to a grid of 300x300 cells, each cell being a square of 500 m x 500 m.

All trips starting or ending outside this area are treated as outliers and should not be considered.

You can assume that moving 500 meters south corresponds to a change of 0.004491556 degrees in latitude, and that moving 500 meters east corresponds to a change of 0.005986 degrees in longitude. See the Helper Code at the bottom.
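
As a worked example of the arithmetic above, the sketch below derives the bounding box of the grid. It reuses the constants given in the Helper Code section and assumes, as the comment there states, that `MIN_LON` and `MAX_LAT` mark the upper-left corner of the grid. The names `GRID_SIZE` and `grid_bounds` are hypothetical.

```python
# Constants as given in the Helper Code section of this document.
MIN_LON = -74.916578
MAX_LAT = 41.47718278
LON_DELTA = 0.005986      # degrees of longitude per 500 meters
LAT_DELTA = 0.004491556   # degrees of latitude per 500 meters
GRID_SIZE = 300           # cells per side (hypothetical name)

def grid_bounds():
    """Return (min_lat, max_lat, min_lon, max_lon) covered by the grid."""
    min_lat = MAX_LAT - GRID_SIZE * LAT_DELTA  # 300 cells to the south
    max_lon = MIN_LON + GRID_SIZE * LON_DELTA  # 300 cells to the east
    return min_lat, MAX_LAT, MIN_LON, max_lon

min_lat, max_lat, min_lon, max_lon = grid_bounds()
print(f"latitude:  {min_lat:.6f} .. {max_lat:.6f}")
print(f"longitude: {min_lon:.6f} .. {max_lon:.6f}")
```

Under these assumptions the grid spans roughly 40.13 to 41.48 degrees of latitude and -74.92 to -73.12 degrees of longitude, so a simple range check on the raw coordinates is an alternative way to discard out-of-area rides.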

# Goals

The goal is to perform analytics on the dataset, using Spark SQL (DataFrames and/or SQL).

The focus will not be on the actual result numbers, but on the process. Namely, the depth and consistency of the analytics presented in the report will be more important. As such, reports will not be graded on the code alone, but on how the tools were used to support insights and observations about the dataset.

---

# Analytics Suggestions

Below is a sample of possible questions we might want to ask about the dataset. Their order is not relevant.

You can develop and expand these questions or present your own choices (novelty will be appreciated).

### Do taxi rides exhibit geographic and/or temporal patterns?

+ A route is represented by a starting grid cell and an ending grid cell;
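
To illustrate the route definition, here is a small sketch in plain Python that maps rides to routes and counts the most frequent one. The grid constants come from the Helper Code section; the names `to_cell` and `route_of` and the sample coordinates are made up for illustration. In the project itself the equivalent would be a group-by over the two cell columns.

```python
import math
from collections import Counter

# Grid constants as given in the Helper Code section.
MIN_LON, MAX_LAT = -74.916578, 41.47718278
LON_DELTA, LAT_DELTA = 0.005986, 0.004491556

def to_cell(lat, lon):
    """Map a coordinate to a (row, column) grid cell."""
    return (math.floor((MAX_LAT - lat) / LAT_DELTA),
            math.floor((lon - MIN_LON) / LON_DELTA))

def route_of(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """A route is the pair (starting cell, ending cell)."""
    return (to_cell(pickup_lat, pickup_lon), to_cell(dropoff_lat, dropoff_lon))

# Made-up rides: (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon).
rides = [
    (40.7580, -73.9855, 40.6413, -73.7781),
    (40.7581, -73.9856, 40.6414, -73.7782),  # same cells as the first ride
    (40.7128, -74.0060, 40.7580, -73.9855),
]

route_counts = Counter(route_of(*ride) for ride in rides)
most_common_route, count = route_counts.most_common(1)[0]
print(most_common_route, count)
```

Adding a time bucket (hour of day, day of week) to the grouping key is one way to look at temporal patterns over the same routes.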

### Which areas are more profitable?

+ The profitability of an area is determined by dividing the area profit by the number of empty taxis in that area within the last 15 minutes.
    
+ The profit that originates from an area is computed by calculating the average fare + tip for trips that started in the area and ended within the last 15 minutes.

+ The number of empty taxis in an area is the number of taxis that had a drop-off location in that area less than 30 minutes ago and have had no subsequent pickup.

### Which times of day are more profitable?

+ Does the overall profit change in the course of the day?
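
One hedged way to approach this question is to bucket rides by pickup hour and compare the average fare plus tip per bucket. The sketch below uses plain Python and made-up rides; the function name `average_revenue_by_hour` and the choice of fare + tip as the "profit" of a ride are assumptions.

```python
from collections import defaultdict
from datetime import datetime

def average_revenue_by_hour(rides):
    """Map each pickup hour (0-23) to the average fare + tip of rides starting in that hour."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of fare + tip, number of rides]
    for pickup_time, fare, tip in rides:
        bucket = totals[pickup_time.hour]
        bucket[0] += fare + tip
        bucket[1] += 1
    return {hour: total / n for hour, (total, n) in sorted(totals.items())}

# Made-up rides: (pickup_datetime, fare_amount, tip_amount).
rides = [
    (datetime(2013, 1, 1, 8, 5), 10.0, 2.0),
    (datetime(2013, 1, 1, 8, 40), 20.0, 0.0),
    (datetime(2013, 1, 1, 23, 10), 15.0, 5.0),
]

hourly = average_revenue_by_hour(rides)
print(hourly)
```

Averages alone can mislead when some hours have few rides, so the ride count per hour is worth reporting alongside them.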

### Are certain routes or areas more prone than others to delays?

### Are all rides fair?

+ Are there rides that cost more than expected?

### Are there drivers that seem to deviate from the pack in some way?

+ Do any drivers exhibit anomalous behavior?


---

# Requirements

Do not try to solve all these questions. One is enough if developed with enough depth.

For code, you need to present solutions based on Spark DataFrames or Spark SQL, mixing both if necessary. Do not use Spark Core.

---
### Execution

Groups of up to 3 members: 2 humans and 1 AI agent.

### Delivery Format

The solution should be delivered as a pair of Google Colab Jupyter notebooks.

One notebook should be developed as a presentation report focusing on the analytics. To that end, it can include text, code and graphs.

The other notebook should focus on the technical aspects of the work. In particular, you should explain how AI was used to aid in the development of the solution. Detailed prompts used with AI agents should be included in this document.

A Google Form will be provided closer to the deadline for delivery purposes.

---

# Grading

Grading will take into consideration the overall presentation quality of the report and its technical merit.

Use of AI tools is allowed and even recommended, for example, to enrich the presentation with visual elements. However, all prompts used need to be reported.

Showing effective use of AI agents will be graded positively.

---

### Deadline

December 7, 2025. A penalty of 0.1/20 applies for each day late, accumulating until the grade reaches 9.5/20.

Deliveries past December 31, 2025 will not be considered.

## Suggestions

* Get familiar with the sample data;

* Use AI to help you plot the dataset in some graphical representation, by mapping coordinates to grid cells;

* Sanitize the data, i.e., exclude incomplete or unused data and out-of-area rides.



# Addendum

```python
#@title Java Setup (needed for pyspark)
!apt-get install -y openjdk-17-jre 2>/dev/null > /dev/null
```

```python
#@title Download 1% sample
# The URL is quoted so that the shell does not treat '&' as a command separator
!wget -q -O taxi_rides_1pc.csv.gz "https://www.dropbox.com/scl/fi/v8ei5laqcalrx30z3lsty/taxi_rides_1pc.csv.gz?rlkey=q1lq7l56c4j97h9kymsdroau5&st=iurdwnwj&dl=0"
```

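```python
#@title Dataset Schema
from pyspark.sql import SparkSession
# Namespaced imports avoid shadowing Python built-ins such as sum, min and max
from pyspark.sql import functions as F, types as T

# Local Spark session using all available cores
spark = SparkSession.builder.master('local[*]') \
    .appName('taxis').getOrCreate()

try:
    # inferSchema makes Spark scan the data to guess the column types
    data = spark.read.csv('taxi_rides_1pc.csv.gz', sep=',', header=True, inferSchema=True)
    data.printSchema()
except Exception as err:
    print(err)
```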

```
root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- trip_time_in_secs: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- total_amount: double (nullable = true)
```



### Some Helper Code

The following helper functions can be used in the assignment,
as is or changed as needed.


#### Convert GPS coordinates to grid cell coordinates

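```python
import math

# Longitude and latitude of the upper-left corner of the grid
MIN_LON = -74.916578
MAX_LAT = 41.47718278

# Changes in longitude and latitude that correspond to a shift of 500 meters
LON_DELTA = 0.005986
LAT_DELTA = 0.004491556

def latlon_to_grid(lat, lon):
    # Returns (row, column): rows grow southwards, columns grow eastwards.
    # math.floor (rather than int) keeps points just north or west of the grid
    # in negative cells instead of truncating them into row or column 0.
    return (math.floor((MAX_LAT - lat) / LAT_DELTA), math.floor((lon - MIN_LON) / LON_DELTA))
```

#### In Bounds check

You can use cell coordinates to exclude invalid rides.

```python
def inBounds(cell):
    # Valid rows and columns run from 0 to 299 (a 300x300 grid)
    return 0 <= cell[0] < 300 and 0 <= cell[1] < 300
```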