# Homework - Spark Programming on Taxicab Report Dataset 

The purpose of this exercise is to write `pyspark` code that performs computations over a large dataset. Specifically, your Spark program will analyze a dataset of New York City taxi trip reports from 2013. The dataset was released under the Freedom of Information Law (FOIL) and made public by Chris Whong (https://chriswhong.com/open-data/foiling-nycs-boro-taxi-trip-data/).

The dataset is a simple `csv` file. Each taxi trip report is a separate line in the file. Among
other things, each trip report includes the starting point, the drop-off point, the corresponding timestamps, and
information related to the payment. The data are reported at the end of each trip, i.e., upon arrival, in
the order of the drop-off timestamps.
The attributes present on each line of the file are, in order:

| attribute       | description                                                       |
| ----------------|-------------------------------------------------------------------|
| medallion       | an md5sum of the identifier of the taxi - vehicle bound (Taxi ID) |
| hack_license    | an md5sum of the identifier for the taxi license (driver ID)      |
| vendor_id       | identifies the vendor                                             |
| pickup_datetime | time when the passenger(s) were picked up                         |
| payment_type    | the payment method (credit card or cash)                          |
| fare_amount     | fare amount in dollars                                            |
| surcharge       | surcharge in dollars                                              |
| mta_tax         | tax in dollars                                                    |
| tip_amount      | tip in dollars                                                    |
| tolls_amount    | bridge and tunnel tolls in dollars                                |
| total_amount    | total paid amount in dollars                                      |

Data files:
* `taxi_small_subset.csv` - This is a subset of the entire big file. You can examine this file to see what the data look like. You can also use it to run and debug your code on a single-node platform (e.g., in Vocareum) before running your code on the big file in the cluster.
* `2013_weekdays.csv` - This file lists the 365 dates of the year 2013 with their corresponding day of the week. It is used in task 4 to perform a join.
* S3 URI `s3://comp643bucket/homework/spark_taxicab/trip*` - This is the address of the entire dataset available in S3, which is a big file (18.4 GB). Once you have debugged your code on the small subset, your final task is to run your code on this big file on an EMR cluster in AWS.

**For this homework, you need to complete 5 tasks described below.** 

**For tasks 1 through 4, write your Spark code in this Jupyter Notebook and run your code on the small subset of data, i.e., `taxi_small_subset.csv`, in Vocareum. This makes it easier to debug your Spark program, since you're running it on an interactive single-node platform and on a small dataset.**

**Once you've debugged your code on a small dataset, for task 5, you need to execute your Spark code for tasks 1 through 4, in an AWS EMR cluster on the big dataset that is stored in S3 (`s3://comp643bucket/homework/spark_taxicab/trip*`).** 

In [None]:
import pyspark

In [None]:
# pyspark works best with Java 8
# set the JAVA_HOME environment variable to the Java 8 path
%env JAVA_HOME = /usr/lib/jvm/java-8-openjdk-amd64

In [None]:
sc = pyspark.SparkContext()

**Read the data file into an RDD**

In [None]:
taxi = sc.textFile('data/taxi_small_subset.csv')

In [None]:
taxi.count()

In [None]:
taxi.take(3)

## Task 1 - clean the dataset (20 pts)

Write a Spark program that reads the dataset into an RDD, splits each line by `,` to extract field values, and cleans the RDD through the following steps:
* Remove lines with any missing value indicated by `NULL` 
* Validate the type of the following fields and remove lines with any invalid field value:
    * `pickup_datetime` must match this pattern 'YYYY-MM-DD HH:MM:SS'
    * All fields in dollars (`fare_amount`, `surcharge`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`) must be non-negative numbers (with or without a decimal point)
    
After each step of cleaning, run `count()` on your RDD, to see how many lines have been left. 

Below, we give you a set of cells you can use to walk through the analysis process. You are also welcome to simply write all of your code in one cell, following your own logic.

### Split each line by `,` to extract field values

### Clean the RDD

**Remove lines with any `NULL` value**
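As a rough illustration of the split and `NULL`-removal steps, here is a minimal sketch. The helper names `split_line` and `has_no_null` are hypothetical, and it assumes a missing value appears as a field that is exactly the string `NULL`; check this against `taxi.take(3)` before relying on it.

```python
def split_line(line):
    """Split one CSV line into a list of field strings."""
    return line.split(',')


def has_no_null(fields):
    """Return True when no field is exactly the string 'NULL' (assumed marker)."""
    return 'NULL' not in fields


# With the `taxi` RDD created earlier in this notebook, the helpers could be
# applied roughly like this (shown as comments, not executed here):
#   fields_rdd = taxi.map(split_line)
#   no_null_rdd = fields_rdd.filter(has_no_null)

# Plain-Python check on two made-up lines
sample_lines = [
    'a,b,2013-01-01 00:00:00,CRD,5.0',
    'a,NULL,2013-01-01 00:00:00,CSH,5.0',
]
kept = [fields for fields in map(split_line, sample_lines) if has_no_null(fields)]
```

Keeping the predicates as small named functions makes them easy to test locally before passing them to `map` and `filter`.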

**Run `count()` on your RDD to see how many lines have been left**

**Remove lines whose `pickup_datetime` does not match this pattern 'YYYY-MM-DD HH:MM:SS'**

For this task, you can use the Python `re` module along with your Spark code.

In [None]:
import re
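One possible way to express the timestamp check with `re` is sketched below. The names `DATETIME_RE` and `is_valid_datetime` are hypothetical, and the pattern only checks the shape of the string (digit counts and separators), not whether the date is a real calendar date.

```python
import re

# Four digits, dash, two digits, dash, two digits, a space, then HH:MM:SS.
# Assumption: a shape-only check is what the task intends.
DATETIME_RE = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')


def is_valid_datetime(value):
    """Return True when the whole string has the form YYYY-MM-DD HH:MM:SS."""
    return DATETIME_RE.fullmatch(value) is not None
```

Using `fullmatch` rather than `match` rejects values with trailing characters, such as an extra space or fractional seconds.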

**Run `count()` on your RDD to see how many lines have been left**

**All the fields indicating an amount in dollars (`fare_amount`, `surcharge`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`) must be non-negative numeric values (with or without a decimal point). Remove lines with any value that does not match this pattern.**

For this task, you can use the Python `re` module along with your Spark code.
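The dollar-amount check could be sketched along the following lines. `AMOUNT_RE`, `AMOUNT_COLUMNS`, and the two helpers are hypothetical names. The sketch assumes that a valid amount is one or more digits, optionally followed by a point and more digits, and that the six dollar fields sit at positions 5 through 10, as implied by the attribute table above; verify both assumptions against the actual data.

```python
import re

# Digits, optionally followed by a decimal point and more digits.
# A leading minus sign is not allowed, so negative values are rejected.
AMOUNT_RE = re.compile(r'\d+(\.\d+)?')

# Assumed positions of fare_amount .. total_amount in a split line.
AMOUNT_COLUMNS = range(5, 11)


def is_valid_amount(value):
    """Return True when the whole string is a non-negative number."""
    return AMOUNT_RE.fullmatch(value) is not None


def all_amounts_valid(fields):
    """Return True when every dollar field of a record passes the check."""
    return all(is_valid_amount(fields[i]) for i in AMOUNT_COLUMNS)
```

A predicate such as `all_amounts_valid` can be passed directly to the RDD `filter` operation.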

**Run `count()` on your RDD to see how many lines have been left**

## Task 2 - compute total revenue by dates (20 pts)

Write a Spark program on your derived cleaned RDD (from task 1) that computes the total amount of revenue (`total_amount` field) for each date (the `pickup_datetime` field with the time portion removed). Then, sort your RDD by the total revenue in ascending order and print out the 5 lines with the smallest total revenue. These are the 5 dates with the least total revenue.
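The general shape of such a key-based aggregation is sketched below. The function names are hypothetical, and the column positions (`PICKUP_COL = 3`, `TOTAL_COL = 10`) are assumptions taken from the attribute order listed at the top of this notebook. `cleaned_rdd` stands for an RDD of already-split, cleaned records.

```python
from operator import add

PICKUP_COL = 3   # assumed position of pickup_datetime
TOTAL_COL = 10   # assumed position of total_amount


def date_revenue_pair(fields):
    """Map a cleaned record to a (date, total_amount) pair."""
    date = fields[PICKUP_COL].split(' ')[0]  # drop the time portion
    return (date, float(fields[TOTAL_COL]))


def least_revenue_dates(cleaned_rdd, n=5):
    """Sum revenue per date and return the n dates with the smallest totals."""
    return (cleaned_rdd
            .map(date_revenue_pair)
            .reduceByKey(add)                              # (date, total revenue)
            .sortBy(lambda pair: pair[1], ascending=True)  # smallest first
            .take(n))
```

The same map, `reduceByKey`, and `sortBy` pattern applies to any other grouping key; only the key extracted in the mapping step and the sort direction change.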

## Task 3 - compute total revenue by taxi drivers  (20 pts)

Write a Spark program on your derived cleaned RDD (from task 1) that computes the total amount of revenue (`total_amount` field) for each taxi driver (`hack_license`). Then, sort your RDD by the total revenue in descending order and print out the top 5 lines with the largest total revenue. These are the 5 taxi drivers with the most total revenue.

## Task 4 - compute total revenue by weekday through join operation (20 pts)

Write a Spark program on your derived cleaned RDD (from task 1) that computes the total amount of revenue (`total_amount` field) for each of the 7 days of the week (Sunday through Saturday).

To extract the weekdays and to gain more experience with Spark, we suggest that you use the `join` RDD operation to join the taxi dataset with the provided `2013_weekdays.csv` file, which contains the dates of the 365 days of the year 2013 and their corresponding weekdays.

First, read `2013_weekdays.csv` into an RDD, and split each line by `,` to extract the field values.

Then, manipulate this RDD and your derived cleaned RDD of taxi dataset (from task 1), to be able to join the two and compute the total revenue by weekday.  

Finally, sum the total amount per weekday, and return the result in descending order of the total revenue.
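The three steps above might be wired together roughly as follows. This is a sketch under several assumptions: each split line of `2013_weekdays.csv` is taken to hold the date first and the weekday name second, the dates there are taken to use the same `YYYY-MM-DD` form as `pickup_datetime`, and the column positions follow the attribute table at the top of this notebook. All function names are hypothetical.

```python
from operator import add

PICKUP_COL = 3   # assumed position of pickup_datetime
TOTAL_COL = 10   # assumed position of total_amount


def date_revenue_pair(fields):
    """Map a cleaned taxi record to (date, total_amount)."""
    return (fields[PICKUP_COL].split(' ')[0], float(fields[TOTAL_COL]))


def date_weekday_pair(fields):
    """Map a split weekday record to (date, weekday); column order is assumed."""
    return (fields[0], fields[1])


def revenue_by_weekday(cleaned_rdd, weekdays_rdd):
    """Join daily revenue with the calendar and total it per weekday."""
    daily = cleaned_rdd.map(date_revenue_pair).reduceByKey(add)  # (date, revenue)
    calendar = weekdays_rdd.map(date_weekday_pair)               # (date, weekday)
    joined = daily.join(calendar)                # (date, (revenue, weekday))
    return (joined
            .map(lambda kv: (kv[1][1], kv[1][0]))  # (weekday, revenue)
            .reduceByKey(add)
            .sortBy(lambda pair: pair[1], ascending=False)
            .collect())
```

Summing revenue per date before the join keeps the joined RDD to at most one record per calendar day, which is cheaper than joining every trip record.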

## Task 5 - run on a big file in EMR cluster (20 pts)

For the last part of this homework, you need to run your Spark code for tasks 1 through 4, on a big file in S3, in an AWS EMR cluster. 

Follow the instructions in `Lab - Spark Intro (AWS)` to create and connect to an EMR cluster in AWS and run Spark programs there.

**For better efficiency, in the hardware configuration of your cluster, choose `m5.xlarge` as the instance type, and enter 4 as the number of instances.**

The big file exists in this S3 URI: `s3://comp643bucket/homework/spark_taxicab/trip*.csv`

To read the big file from S3 into an RDD, use the code below:

`taxi = sc.textFile("s3://comp643bucket/homework/spark_taxicab/trip*.csv")`

Repeat tasks 1 through 4 on this `taxi` RDD created from the big file, and print your results in the markdown cells below (keep the results from the small subset above). 

**Repeat task 1 on the big file in your EMR cluster - print the number of lines (`count()`) of your cleaned RDD from the big file, here:** 

Copy your result in this markdown cell ...

**Repeat task 2 on the big file in your EMR cluster - copy your result, which is the 5 dates with least total revenue, from the big file, here:** 

Copy your result in this markdown cell ...

**Repeat task 3 on the big file in your EMR cluster - copy your result, which is the top 5 drivers with the most revenue, from the big file, here:**  

Copy your result in this markdown cell ...

**Repeat task 4 on the big file in your EMR cluster. `2013_weekdays.csv` is also available in S3 through this URI `s3://comp643bucket/homework/spark_taxicab/2013_weekdays.csv`. Copy your result, which is the sum of revenue per weekday in descending order of total revenue, from the big file, here:**  

Copy your result in this markdown cell ...