<a href="https://colab.research.google.com/github/d-vinha/SPBD/blob/main/project_taxi_rides/SPBD2324_Proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SPBD-2324 Project Assignment

#### version 0.1 - 18/10

The project scenario involves a dataset of taxi rides collected in December 2022 in the New York City area.

Each completed taxi ride corresponds to an event in the dataset. A ride comprises several items of information, including the pick-up and drop-off zones/regions within NY City, their respective timestamps, as well as information related to the payment and number of passengers reported by the driver. The full explanation of the available data is provided [here](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

A table to convert zone identifiers into proper names is found [here](https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv).

The project assignment comprises a set of queries. All of them must be solved using the Spark SQL DataFrame API. One query **of your choice** must be solved twice more: once using Spark Core (mandatory) and once using either the SQL flavor of SparkSQL or MRJob.

# Queries

## Q1 - Basic Statistics

For each day of the week, compute the total number of rides and the average ride duration, cost and distance travelled.

## Q2 - Top-5 New York **boroughs**

Compute the top-5 most popular New York **boroughs** for pick-ups and for drop-offs, both for the whole month and for each day of the week separately.

## Q3 - Compute a list of anomalous rides.

Anomalous rides are those that deviate significantly, in cost or in distance travelled, from rides that started and ended in the same zone.
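
The assignment leaves "significantly" open. One way to pin it down, shown here in plain Python so the rule is easy to check by hand, is to compare each ride with the other rides sharing its pick-up and drop-off zones and flag those whose cost is more than `k` standard deviations from the group mean. The grouping by zone pair, the tuple layout, and the default `k = 2.5` are all assumptions of this sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def anomalous_rides(rides, k=2.5):
    """Return rides whose cost lies more than k standard deviations from the
    mean cost of rides sharing the same (pickup, dropoff) zone pair.

    `rides` is a list of (pickup_zone, dropoff_zone, cost) tuples; this layout
    and the default k are illustrative assumptions.
    """
    groups = defaultdict(list)
    for pickup, dropoff, cost in rides:
        groups[(pickup, dropoff)].append(cost)

    # Per-group mean and population standard deviation.
    stats = {key: (mean(costs), pstdev(costs)) for key, costs in groups.items()}

    flagged = []
    for ride in rides:
        pickup, dropoff, cost = ride
        mu, sigma = stats[(pickup, dropoff)]
        # sigma == 0 means every ride in the group costs the same: nothing deviates.
        if sigma > 0 and abs(cost - mu) > k * sigma:
            flagged.append(ride)
    return flagged


# Nine ordinary rides and one expensive ride between the same two zones.
sample = [(10, 20, 10.0)] * 9 + [(10, 20, 100.0)]
print(anomalous_rides(sample))
```

In Spark, the same per-group statistics could be obtained with a `groupBy` over both zone columns using `avg` and `stddev_pop`, then joined back to the rides. The same test applies to distance by swapping the cost for the trip distance.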

## Q4 - Find which zones tend to generate shorter rides and which tend to generate longer rides.

Consider a ride short or long, respectively, if its distance is more than 30% below or above the average distance of rides that originate in the same zone.
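
The 30% rule can be checked on a few numbers before it is expressed in Spark. The sketch below uses plain Python; the `(pickup_zone, distance)` tuple layout and the `regular` label for rides inside the band are assumptions introduced for illustration.

```python
from collections import defaultdict


def classify_rides(rides, tolerance=0.30):
    """Label each (pickup_zone, distance) ride as 'short', 'long' or 'regular'
    relative to the average distance of rides from the same pickup zone."""
    by_zone = defaultdict(list)
    for zone, distance in rides:
        by_zone[zone].append(distance)
    zone_avg = {zone: sum(d) / len(d) for zone, d in by_zone.items()}

    labels = []
    for zone, distance in rides:
        avg = zone_avg[zone]
        if distance < (1 - tolerance) * avg:
            labels.append("short")
        elif distance > (1 + tolerance) * avg:
            labels.append("long")
        else:
            labels.append("regular")
    return labels


# Zone 7 has an average distance of 4.0, so the band is roughly 2.8 to 5.2.
sample = [(7, 1.0), (7, 4.0), (7, 7.0)]
print(classify_rides(sample))
```

Counting the `short` and `long` labels per zone then indicates which zones lean towards shorter or longer rides.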

## Q5 - To be defined...

# Deadline
 + 8th December - 23h59
 + For each day late, a penalty of 0.5/20 grade points applies.

# Helper Code

The cells below show how to download the dataset and
start processing it using Spark Core and SparkSQL.

In [None]:
#@title Download Dataset
!wget -q -O taxirides.csv.gz https://shorturl.at/mzHKY

In [None]:
#@title Install PySpark
!pip install --quiet pyspark

In [None]:
#@title Spark Core Example
import pyspark

from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('NYCtaxis').getOrCreate()
sc = spark.sparkContext

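try:
  # Read the gzip-compressed CSV as an RDD of raw text lines.
  rides = sc.textFile('taxirides.csv.gz')

  # Print the first 10 lines (the first one is the CSV header).
  for ride in rides.take(10):
    print(ride)
except Exception as e:
  print(e)
finally:
  # Stop the context whether or not the read succeeded.
  sc.stop()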
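
The lines printed by the cell above are raw CSV text, so a Spark Core solution has to parse them itself. The following is a minimal sketch for a per-weekday ride count and average distance. It assumes the header contains `tpep_pickup_datetime` and `trip_distance`, that timestamps look like `2022-12-05 08:00:00`, and that no field contains a comma or is missing. The helper names are hypothetical, and the helpers are plain functions so they can be tried without a cluster.

```python
from datetime import datetime


def make_parser(header_line):
    """Build a line parser that locates fields by column name in the CSV header."""
    names = header_line.split(",")
    pickup_idx = names.index("tpep_pickup_datetime")
    distance_idx = names.index("trip_distance")

    def parse(line):
        fields = line.split(",")
        weekday = datetime.fromisoformat(fields[pickup_idx]).strftime("%A")
        # Value is (ride count, distance sum) so pairs can be merged by addition.
        return weekday, (1, float(fields[distance_idx]))

    return parse


def merge(a, b):
    """Combine two (count, distance_sum) pairs; associative and commutative."""
    return a[0] + b[0], a[1] + b[1]


def weekday_distance_stats(rides):
    """rides: an RDD of CSV lines, e.g. sc.textFile('taxirides.csv.gz')."""
    header = rides.first()
    parse = make_parser(header)
    return (
        rides.filter(lambda line: line != header)  # drop the header line
        .map(parse)
        .reduceByKey(merge)
        .mapValues(lambda cs: (cs[0], cs[1] / cs[0]))  # (rides, average distance)
        .collect()
    )
```

Carrying a `(count, sum)` pair through `reduceByKey` rather than an average keeps the merge associative, which is what lets Spark combine partial results from different partitions in any order.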

In [None]:
#@title SparkSQL Example
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('NYCtaxis').getOrCreate()

try :
  trips = spark.read.csv(path = "taxirides.csv.gz", header= True, inferSchema= True )
  trips.printSchema()
  trips.show(5)

except Exception as err:
  print(err)

### QUESTION 1 - BASIC STATISTICS

In [None]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('NYCtaxis').getOrCreate()

try:
  trips = spark.read.csv(path="taxirides.csv.gz", header=True, inferSchema=True)
  trips.printSchema()
  # Register the DataFrame as a temporary view so it can be queried with SQL.
  trips.createOrReplaceTempView("TAXI_TRIPS_NYC")
except Exception as err:
  print(err)

In [None]:
# Register the DataFrame as a temporary view
trips.createOrReplaceTempView("TAXI_TRIPS_NYC")

# Execute SQL query to add a weekday column (full day name, e.g. 'Monday')
# derived from the pickup timestamp
trips_weekdays = spark.sql("SELECT *, DATE_FORMAT(tpep_pickup_datetime, 'EEEE') AS weekday FROM TAXI_TRIPS_NYC")

# Register the result as a view for use in later SQL statements
trips_weekdays.createOrReplaceTempView("TRIPS_WEEKDAYS")

# SQL query to keep only the columns needed for Q1
columns_to_keep = (
    "tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, "
    "total_amount, congestion_surcharge, airport_fee, weekday"
)
query = f"SELECT {columns_to_keep} FROM TRIPS_WEEKDAYS"

# Create a new DataFrame selecting specific columns
trips_weekdays_redux = spark.sql(query)

trips_weekdays_redux.createOrReplaceTempView("TRIPS_WEEKDAYS_REDUX")

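# SQL query to add ride_cost as a new column.
# The TLC data dictionary describes total_amount as the total charged to the
# passenger, so congestion_surcharge and airport_fee are assumed to be included
# already; adding them again would count them twice.
query = """
    SELECT *,
           total_amount AS ride_cost
    FROM TRIPS_WEEKDAYS_REDUX
"""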

# Create a new DataFrame with the ride_cost column
trips_weekdays_ride_cost = spark.sql(query)

# Show the DataFrame with the new ride_cost column
trips_weekdays_ride_cost.show()

