# Homework - Spark Programming on Taxicab Report Dataset 

The purpose of this exercise is to write `pyspark` code that performs computations over a large dataset. Specifically, your Spark program will analyze a dataset of New York City taxi trip reports from the year 2013. The dataset was released under the Freedom of Information Law (FOIL) and made public by Chris Whong (https://chriswhong.com/open-data/foiling-nycs-boro-taxi-trip-data/).

The dataset is a simple `csv` file. Each taxi trip report is a separate line in the file. Among
other things, each trip report includes the starting point, the drop-off point, the corresponding timestamps, and
information related to the payment. The data are reported by the time that the trip ended, i.e., in
the order of the drop-off timestamps.
The attributes present on each line of the file are, in order:

| attribute       | description                                                       |
| --------------- | ----------------------------------------------------------------- |
| medallion       | an md5sum of the identifier of the taxi - vehicle bound (Taxi ID) |
| hack_license    | an md5sum of the identifier for the taxi license (driver ID)      |
| vendor_id       | identifies the vendor                                             |
| pickup_datetime | time when the passenger(s) were picked up                         |
| payment_type    | the payment method - credit card or cash                          |
| fare_amount     | fare amount in dollars                                            |
| surcharge       | surcharge in dollars                                              |
| mta_tax         | tax in dollars                                                    |
| tip_amount      | tip in dollars                                                    |
| tolls_amount    | bridge and tunnel tolls in dollars                                |
| total_amount    | total paid amount in dollars                                      |
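
As a quick illustration of the field order, the sketch below pairs each value on a line with its attribute name. `FIELDS` and `parse_trip` are hypothetical names introduced here for illustration, and the sample row is taken from the small subset; this is only one possible way to inspect a record, not a required part of the solution.

```python
# Field names in file order, taken from the attribute table.
FIELDS = [
    "medallion", "hack_license", "vendor_id", "pickup_datetime",
    "payment_type", "fare_amount", "surcharge", "mta_tax",
    "tip_amount", "tolls_amount", "total_amount",
]

def parse_trip(line):
    """Split one CSV line and pair each value with its field name (hypothetical helper)."""
    return dict(zip(FIELDS, line.split(",")))

# A row from taxi_small_subset.csv.
sample = ("7DD1D6A5E432ACBD68A734587B589B9B,EF3FD28F7D39F614BF68B51F0256B050,"
          "CMT,2013-08-28 06:53:33,CSH,12,0,0.5,0,0,12.5")
trip = parse_trip(sample)
print(trip["pickup_datetime"], trip["total_amount"])
```

All values are still strings at this point; converting the dollar fields to `float` is a separate step.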

Data files:
* `taxi_small_subset.csv` - A subset of the entire big file. You can examine this file to see what the data look like. You can also use it to run and debug your code on a single-node platform (e.g., in Vocareum) before running your code on the big file in the cluster.
* `2013_weekdays.csv` - A file listing the dates of all 365 days of the year 2013 with their corresponding week day. This file is used in task 4 for the join.
* S3 URI `s3://comp643bucket/homework/spark_taxicab/trip*` - The address of the entire dataset in S3, which is a big file (18.4 GB). Once you have debugged your code on the small subset, your final task is to run your code on this big file on an EMR cluster in AWS.

**For this homework, you need to complete 5 tasks described below.** 

**For tasks 1 through 4, write your Spark code in this Jupyter Notebook and run your code on the small subset of data, i.e., `taxi_small_subset.csv`, in Vocareum. This helps you debug your Spark program more easily, since you are running it on an interactive single-node platform and on a small dataset.**

**Once you've debugged your code on a small dataset, for task 5, you need to execute your Spark code for tasks 1 through 4, in an AWS EMR cluster on the big dataset that is stored in S3 (`s3://comp643bucket/homework/spark_taxicab/trip*`).** 

In [1]:
import pyspark

In [2]:
# pyspark works best with Java 8
# set the JAVA_HOME environment variable to the Java 8 path
%env JAVA_HOME = /usr/lib/jvm/java-8-openjdk-amd64

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


In [3]:
sc = pyspark.SparkContext()

**Read the data file into an RDD**

In [4]:
taxi = sc.textFile('data/taxi_small_subset.csv')

In [5]:
taxi.count()

71153

In [6]:
taxi.take(4)

['medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount',
 '7DD1D6A5E432ACBD68A734587B589B9B,EF3FD28F7D39F614BF68B51F0256B050,CMT,2013-08-28 06:53:33,CSH,12,0,0.5,0,0,12.5',
 'CEBDF34FE2DA2E9233B87C2E703004FF,D9EA31E70BE082F423D42860FD4BD240,CMT,NULL,CSH,7,1,0.5,0,0,8.5',
 'A6E8AD830F49F7B358D52419084D42A0,B1F1E21144EC5D9EC144AF9E4FBF320E,CMT,2013-08-29 12:59:08,CSH,6,0,0.5,0,0,6.5']

## Task 1 - clean the dataset (20 pts)

Write a Spark program that reads the dataset into an RDD, splits each line by `,` to extract field values, and cleans the RDD through the following steps:
* Remove lines with any missing value indicated by `NULL` 
* Validate the type of the following fields and remove lines with any invalid field value:
    * `pickup_datetime` must match this pattern 'YYYY-MM-DD HH:MM:SS'
    * All fields in dollars (`fare_amount`, `surcharge`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`) must be non-negative numbers (with or without a decimal point)
    
After each step of cleaning, run `count()` on your RDD, to see how many lines have been left. 
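
The cleaning rules above can be sketched as small predicate functions that are testable in plain Python before being passed to `filter`. The function names below are hypothetical, and the use of `fullmatch` (which rejects trailing characters) is a design choice of this sketch rather than a requirement of the task. The timestamps in the sample rows use colons in the time part, e.g. `2013-08-28 06:53:33`, so the pattern follows that form.

```python
import re

# Compiled once so the same objects can be reused inside Spark lambdas.
DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
AMOUNT_RE = re.compile(r"\d+(\.\d+)?")  # non-negative, optional decimal part

def has_no_null(fields):
    """True when no field is the literal string 'NULL'."""
    return "NULL" not in fields

def has_valid_datetime(fields):
    """True when pickup_datetime (index 3) is exactly 'YYYY-MM-DD HH:MM:SS'."""
    return DATETIME_RE.fullmatch(fields[3]) is not None

def has_valid_amounts(fields):
    """True when every dollar field (indices 5 through 10) is a non-negative number."""
    return all(AMOUNT_RE.fullmatch(v) for v in fields[5:11])

# Rows copied from the small subset, split the same way as in the notebook.
header = ("medallion, hack_license, vendor_id, pickup_datetime, payment_type, "
          "fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount").split(",")
good = ("7DD1D6A5E432ACBD68A734587B589B9B,EF3FD28F7D39F614BF68B51F0256B050,"
        "CMT,2013-08-28 06:53:33,CSH,12,0,0.5,0,0,12.5").split(",")
with_null = ("CEBDF34FE2DA2E9233B87C2E703004FF,D9EA31E70BE082F423D42860FD4BD240,"
             "CMT,NULL,CSH,7,1,0.5,0,0,8.5").split(",")

for row in (header, good, with_null):
    print(has_no_null(row), has_valid_datetime(row), has_valid_amounts(row))
```

Each predicate can then be used as `rdd.filter(has_no_null)` and so on, with a `count()` after each step. Note that the header line contains no `NULL`, so it is only removed by the datetime check.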

Below, we give you a set of cells you can use to walk through the analysis process. You are also welcome to simply write all of your code in one cell, following your own logic.

**Split each line by `,` to extract field values**

In [7]:
my_taxi = sc.textFile('data/taxi_small_subset.csv').map(lambda line: line.split(","))

**Clean the RDD**

**Remove lines with any `NULL` value**

In [8]:
my_taxi = my_taxi.filter(lambda x: 'NULL' not in x)

In [9]:
my_taxi.count()

71141

**Remove lines with a `pickup_datetime` that does not match this pattern 'YYYY-MM-DD HH:MM:SS'**

For this task, you can use Python `re` module along with your Spark code.

In [10]:
import re

In [11]:
my_taxi = my_taxi.filter(lambda x: re.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', x[3]))

In [12]:
my_taxi.count()

71140

**All the fields indicating an amount in dollars (`fare_amount`, `surcharge`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`) must be non-negative numeric values (with or without a decimal point). Remove lines with any value that does not match this pattern.**

For this task, you can use Python `re` module along with your Spark code.

In [13]:
# Non-negative number, with or without a decimal part
amount_pattern = r'^\d+(\.\d+)?$'
# Indices 5 through 10: fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
my_taxi = my_taxi.filter(lambda x: all(re.match(amount_pattern, v) for v in x[5:11]))

In [14]:
my_taxi.count()

71138

## Task 2 - compute total revenue by date (20 pts)

Write a Spark program on your derived cleaned RDD (from task 1) that computes the total amount of revenue (`total_amount` field) for each date (`pickup_datetime` field without time portions - only dates). Print out your result RDD, sorted by date in ascending order. 
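
To make the expected computation concrete, here is a plain-Python sketch of what a key-wise sum followed by a sort on the key produces. It is not Spark code; `revenue_by_date` is a hypothetical helper, and the rows use shortened placeholder IDs (`m1`, `d1`, ...) with made-up values for illustration only.

```python
from collections import defaultdict

def revenue_by_date(rows):
    """Sum total_amount (index 10) per pickup date (first 10 chars of index 3)."""
    totals = defaultdict(float)
    for fields in rows:
        date = fields[3][:10]          # 'YYYY-MM-DD' part of the timestamp
        totals[date] += float(fields[10])
    return sorted(totals.items())      # ascending by date string

# Illustrative rows (placeholder IDs, made-up amounts).
rows = [
    ["m1", "d1", "CMT", "2013-08-29 12:59:08", "CSH", "6", "0", "0.5", "0", "0", "6.5"],
    ["m2", "d2", "CMT", "2013-08-28 06:53:33", "CSH", "12", "0", "0.5", "0", "0", "12.5"],
    ["m3", "d3", "CMT", "2013-08-28 09:10:00", "CSH", "8", "0", "0.5", "0", "0", "8.5"],
]
daily = revenue_by_date(rows)
print(daily)
```

Because the dates are in `YYYY-MM-DD` form, sorting them as strings is the same as sorting them chronologically.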

In [15]:
my_taxi2 = my_taxi.map(lambda line: line[:3]+[line[3][:10]]+[line[4]]+[float(x) for x in line[5:]])

In [16]:
dates_dollars = my_taxi2.map(lambda line: (line[3], line[-1]))

In [17]:
dates_dollars.reduceByKey(lambda x, y: x+y).sortByKey().collect()

[('2013-01-01', 5830.230000000001),
 ('2013-01-02', 5461.040000000004),
 ('2013-01-03', 5542.82),
 ('2013-01-04', 6224.220000000002),
 ('2013-01-05', 5188.200000000002),
 ('2013-01-06', 4207.72),
 ('2013-01-07', 1333.8799999999997),
 ('2013-01-08', 2119.2599999999998),
 ('2013-01-09', 1863.539999999999),
 ('2013-01-10', 2529.2199999999975),
 ('2013-01-11', 2256.6599999999994),
 ('2013-01-12', 1981.5599999999993),
 ('2013-01-13', 4581.5099999999975),
 ('2013-01-14', 5289.370000000003),
 ('2013-01-15', 5317.680000000002),
 ('2013-01-16', 6234.550000000004),
 ('2013-01-17', 6544.0999999999985),
 ('2013-01-18', 6329.929999999998),
 ('2013-01-19', 5918.4),
 ('2013-01-20', 5118.680000000002),
 ('2013-01-21', 1394.5399999999995),
 ('2013-01-22', 1063.7800000000002),
 ('2013-01-23', 8.5),
 ('2013-01-25', 34.56),
 ('2013-02-01', 1828.8100000000002),
 ('2013-02-02', 3733.060000000001),
 ('2013-02-03', 3333.709999999999),
 ('2013-02-04', 3074.9599999999996),
 ('2013-02-05', 122.3),
 ('2013-02-06'

## Task 3 - find the 5 taxi drivers with most total revenue (20 pts)

Write Spark code on your derived cleaned RDD (from task 1) that finds the 5 taxi drivers (`hack_license`) who had the most revenue (`total_amount`) in the dataset.

Finally, sort your result RDD in descending order of total amount and print out the top 5 taxi drivers.
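
Selecting the largest few items by a key does not require sorting the whole collection. The plain-Python sketch below shows the idea with `heapq.nlargest`; `top_drivers` is a hypothetical helper and the driver IDs are placeholders, so treat it as an illustration of the selection step rather than the Spark solution itself.

```python
import heapq

def top_drivers(driver_totals, n=5):
    """Return the n (driver, revenue) pairs with the highest revenue, largest first."""
    return heapq.nlargest(n, driver_totals, key=lambda pair: pair[1])

# Placeholder driver IDs with illustrative per-driver totals.
totals = [
    ("driver_a", 335.3),
    ("driver_b", 508.5),
    ("driver_c", 333.32),
    ("driver_d", 356.9),
]
top2 = top_drivers(totals, n=2)
print(top2)
```

The result is already in descending order of revenue, so no extra sort is needed after the selection.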

In [18]:
driver_revenue = my_taxi2.map(lambda line: (line[1], line[-1]))
driver_revenue.take(3)

[('EF3FD28F7D39F614BF68B51F0256B050', 12.5),
 ('B1F1E21144EC5D9EC144AF9E4FBF320E', 6.5),
 ('B0A352468D7D2FEDC1BFD85A74BEE0A9', 11.5)]

In [19]:
driver_by_rev = driver_revenue.reduceByKey(lambda x, y: x+y)
t3 = driver_by_rev.top(5, key=lambda p: p[1])
t3

[('CFCD208495D565EF66E7DFF9F98764DA', 508.5),
 ('178C58D2C909125EE599C388CC1A311C', 356.9),
 ('83DDCD2CC7035BEBED7AC4255688308A', 355.0),
 ('B9E81BA07F0DDA5B2FBCA9B33CCC7C9A', 335.3),
 ('98949EA21D9A4DA151ADEE27E4DEDE7C', 333.32)]

## Task 4 - compute total revenue by week day through join operation (20 pts)

Write a Spark program on your derived cleaned RDD (from task 1) that computes the total amount of revenue (`total_amount` field) for each of the 7 days of the week (Sunday through Saturday).

To extract the week days and to experiment more with Spark, we suggest that you use the `join` RDD operation to join the taxi dataset with the provided `2013_weekdays.csv` file, which contains the dates of all 365 days of the year 2013 and their corresponding week days.

First, read `2013_weekdays.csv` into an RDD, and split each line by `,` to extract the field values.

Then, manipulate this RDD and your derived cleaned RDD of taxi dataset (from task 1), to be able to join the two and compute the total revenue by week day.  

Finally, sum the total amount per week day, and return the result. 
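
The join-then-sum logic can be sketched in plain Python as follows. Both inputs are lists of pairs keyed by date, which is the shape the two RDDs need before they can be joined. `revenue_by_weekday` is a hypothetical helper and the amounts are made up; the weekday rows match the first lines of `2013_weekdays.csv`, including its header.

```python
from collections import defaultdict

def revenue_by_weekday(date_amounts, date_weekdays):
    """Attach a weekday to each (date, amount) pair, then sum amounts per weekday."""
    lookup = dict(date_weekdays)           # date -> weekday
    totals = defaultdict(float)
    for date, amount in date_amounts:
        weekday = lookup.get(date)         # None if the date is missing, as in a left outer join
        totals[weekday] += amount
    return dict(totals)

# First lines of 2013_weekdays.csv after splitting on ',' (the header is included).
date_weekdays = [("Date", "WeekDay"), ("2013-01-01", "Tuesday"), ("2013-01-02", "Wednesday")]
# Illustrative (date, total_amount) pairs.
date_amounts = [("2013-01-01", 12.5), ("2013-01-02", 6.5), ("2013-01-01", 8.5)]

by_weekday = revenue_by_weekday(date_amounts, date_weekdays)
print(by_weekday)
```

The header pair `('Date', 'WeekDay')` matches no taxi date, so it contributes nothing to the result when the taxi data is the left side of the join.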

In [20]:
weekdays = sc.textFile('data/2013_weekdays.csv').map(lambda line: line.split(","))
weekdays.take(3)

[['Date', 'WeekDay'], ['2013-01-01', 'Tuesday'], ['2013-01-02', 'Wednesday']]

In [21]:
wkdy_rev = my_taxi2.map(lambda line: (line[3], line[-1])).leftOuterJoin(weekdays).map(lambda x: (x[1][1], x[1][0]))

In [22]:
rev_by_weekday = wkdy_rev.reduceByKey(lambda x, y: x+y)
t4 = rev_by_weekday.collect()
t4

[('Saturday', 160588.6800000002),
 ('Thursday', 161180.15000000014),
 ('Sunday', 137148.43000000014),
 ('Tuesday', 155412.58000000007),
 ('Monday', 138941.37000000017),
 ('Friday', 160147.96000000028),
 ('Wednesday', 154230.5700000003)]

## Task 5 - run on a big file in EMR cluster (20 pts)

For the last part of this homework, you need to run your Spark code for tasks 1 through 4, on a big file in S3, in an AWS EMR cluster. 

Follow the instructions in `Lab - Spark Intro (AWS)` to create and connect to an EMR cluster in AWS and run Spark programs there.

**For better efficiency, in the hardware configuration of your cluster, choose `m5.xlarge` as the instance type and set the number of instances to 4.**

The big file exists in this S3 URI: `s3://comp643bucket/homework/spark_taxicab/trip*.csv`

To read the big file from S3 into an RDD, use the code below:

`taxi = sc.textFile("s3://comp643bucket/homework/spark_taxicab/trip*.csv")`

Repeat tasks 1 through 4 on this `taxi` RDD created from the big file, and print your results in the markdown cells below (keep the results from the small subset above). 

**Repeat task 1 on the big file in your EMR cluster - print the number of lines (`count()`) of your cleaned RDD from the big file, here:** 

173176128

**Repeat task 2 on the big file in your EMR cluster - copy your result RDD sorted by date in ascending order, from the big file, here:** 

[('2013-01-01', 6155808.069999878), ('2013-01-02', 5597586.190000281), ('2013-01-03', 6063858.900000158), ('2013-01-04', 6436856.210000263), ('2013-01-05', 6284187.200000161), ('2013-01-06', 5713649.1699998425), ('2013-01-07', 5709195.480000229), ('2013-01-08', 6086401.
280000457), ('2013-01-09', 6340631.1300003035), ('2013-01-10', 6923892.180000302), ('2013-01-11', 7309754.250000272), ('2013-01-12', 6794477.000000234), ('2013-01-13', 6337345.52000048), ('2013-01-14', 6247518.030000352), ('2013-01-15', 6904804.110000175), ('2013-01-16'
, 7099953.430000353), ('2013-01-17', 7369913.010000459), ('2013-01-18', 7753118.539999796), ('2013-01-19', 6861240.610000081), ('2013-01-20', 6308348.879999912), ('2013-01-21', 5409171.5100002745), ('2013-01-22', 6859188.610000231), ('2013-01-23', 7141795.610000406), ('
2013-01-24', 7394285.480000175), ('2013-01-25', 7591697.220000246), ('2013-01-26', 7380616.140000381), ('2013-01-27', 6691499.989999998), ('2013-01-28', 6318050.780000444), ('2013-01-29', 6661626.400000415), ('2013-01-30', 6819942.910000342), ('2013-01-31', 7611473.5700
00863), ('2013-02-01', 8077711.2600005865), ('2013-02-02', 7480307.670000356), ('2013-02-03', 6532121.09000059), ('2013-02-04', 6465668.770000457), ('2013-02-05', 6728588.430000471), ('2013-02-06', 7124564.740000489), ('2013-02-07', 7810428.980000304), ('2013-02-08', 56
42667.740000293), ('2013-02-09', 5013328.540000186), ('2013-02-10', 6491805.029999988), ('2013-02-11', 6600228.5000001285), ('2013-02-12', 7256108.230000392), ('2013-02-13', 7541222.270000545), ('2013-02-14', 7739731.940000535), ('2013-02-15', 7872468.330000801), ('2013
-02-16', 7286060.730000604), ('2013-02-17', 6845261.530000539), ('2013-02-18', 5719261.550000126), ('2013-02-19', 6530409.830000578), ('2013-02-20', 7137929.220000626), ('2013-02-21', 7594376.810000505), ('2013-02-22', 7967313.82000071), ('2013-02-23', 7741647.640000548
), ('2013-02-24', 6734030.250000311), ('2013-02-25', 6477388.620000317), ('2013-02-26', 6990771.510000228), ('2013-02-27', 7278291.40000001), ('2013-02-28', 7666926.940000321), ('2013-03-01', 8139945.37000051), ('2013-03-02', 7891883.259999683), ('2013-03-03', 7164542.6
80000489), ('2013-03-04', 6701235.600000434), ('2013-03-05', 7033434.130000641), ('2013-03-06', 7299922.870000678), ('2013-03-07', 7653469.560000306), ('2013-03-08', 7736197.170000744), ('2013-03-09', 7843434.580000532), ('2013-03-10', 6897237.450000262), ('2013-03-11',
 6603637.270000128), ('2013-03-12', 7151880.270000354), ('2013-03-13', 7676765.990000417), ('2013-03-14', 8332600.810000342), ('2013-03-15', 8362542.520000496), ('2013-03-16', 8052750.150000363), ('2013-03-17', 7213525.390000293), ('2013-03-18', 6668808.510000563), ('20
13-03-19', 6856158.650000535), ('2013-03-20', 7273280.520000198), ('2013-03-21', 7368950.29000031), ('2013-03-22', 8026893.840000602), ('2013-03-23', 7645868.220000483), ('2013-03-24', 6851763.680000435), ('2013-03-25', 6242181.4300001), ('2013-03-26', 6632680.740000226
), ('2013-03-27', 7014765.390000305), ('2013-03-28', 7653224.730000071), ('2013-03-29', 7433572.330000302), ('2013-03-30', 6864070.360000197), ('2013-03-31', 6430424.840000064), ('2013-04-01', 6263647.910000262), ('2013-04-02', 7035845.220000231), ('2013-04-03', 7471364
.470000258), ('2013-04-04', 7892240.7500001285), ('2013-04-05', 8132437.820000378), ('2013-04-06', 7761401.1300002225), ('2013-04-07', 7094588.600000518), ('2013-04-08', 6456388.9500001585), ('2013-04-09', 6927704.380000168), ('2013-04-10', 7324328.760000352), ('2013-04
-11', 7942809.060000086), ('2013-04-12', 8256000.260000243), ('2013-04-13', 7625592.250000177), ('2013-04-14', 7221397.510000267), ('2013-04-15', 6829723.23000014), ('2013-04-16', 7040197.220000152), ('2013-04-17', 7426764.830000158), ('2013-04-18', 7921211.370000119), 
('2013-04-19', 8108821.010000275), ('2013-04-20', 7889826.78000008), ('2013-04-21', 7110201.060000311), ('2013-04-22', 6715859.190000011), ('2013-04-23', 7268278.210000118), ('2013-04-24', 7381586.470000213), ('2013-04-25', 7976303.670000426), ('2013-04-26', 8162598.720
000325), ('2013-04-27', 7701326.850000167), ('2013-04-28', 7073965.570000053), ('2013-04-29', 6769091.820000637), ('2013-04-30', 7036539.730000621), ('2013-05-01', 7515615.840000653), ('2013-05-02', 7915287.520000829), ('2013-05-03', 8211510.980000824), ('2013-05-04', 7
753913.520000289), ('2013-05-05', 7308989.850000454), ('2013-05-06', 7035808.31000062), ('2013-05-07', 7272448.350000593), ('2013-05-08', 7715748.71000058), ('2013-05-09', 8170779.440000861), ('2013-05-10', 8200131.330000792), ('2013-05-11', 7465048.090000614), ('2013-0
5-12', 6934527.7700003), ('2013-05-13', 6921766.400000596), ('2013-05-14', 7265246.020000659), ('2013-05-15', 7616556.480000711), ('2013-05-16', 8141080.420000397), ('2013-05-17', 8332180.800000798), ('2013-05-18', 7942858.820000636), ('2013-05-19', 7409152.300000489), 
('2013-05-20', 7175166.400000304), ('2013-05-21', 7657783.020000313), ('2013-05-22', 7754988.380000447), ('2013-05-23', 7831608.789999997), ('2013-05-24', 7725439.820000488), ('2013-05-25', 6327689.160000245), ('2013-05-26', 5560882.860000107), ('2013-05-27', 5203819.58
0000215), ('2013-05-28', 6764673.440000377), ('2013-05-29', 7179023.220000526), ('2013-05-30', 7773610.690000678), ('2013-05-31', 7954878.62000055), ('2013-06-01', 7481803.270000773), ('2013-06-02', 6883789.750000542), ('2013-06-03', 6793589.04000054), ('2013-06-04', 72
37697.450000897), ('2013-06-05', 7649636.100000773), ('2013-06-06', 7990652.720000529), ('2013-06-07', 8028643.020000682), ('2013-06-08', 6933761.410000347), ('2013-06-09', 6888553.510000557), ('2013-06-10', 7023471.3800002895), ('2013-06-11', 7115482.590000292), ('2013
-06-12', 7494382.580000324), ('2013-06-13', 7822149.280000367), ('2013-06-14', 7748373.750000307), ('2013-06-15', 7076438.010000244), ('2013-06-16', 6449233.110000103), ('2013-06-17', 6601498.39000023), ('2013-06-18', 7004796.280000517), ('2013-06-19', 7476068.830000438
), ('2013-06-20', 7727357.320000678), ('2013-06-21', 7688887.060000587), ('2013-06-22', 6790382.830000447), ('2013-06-23', 6570007.290000318), ('2013-06-24', 6540394.860000331), ('2013-06-25', 7055273.16000026), ('2013-06-26', 7454240.260000319), ('2013-06-27', 7748210.
200000362), ('2013-06-28', 7823081.670000392), ('2013-06-29', 7143670.710000018), ('2013-06-30', 6409428.470000108), ('2013-07-01', 6251501.600000273), ('2013-07-02', 6362945.36000043), ('2013-07-03', 6418768.160000468), ('2013-07-04', 4575369.319999915), ('2013-07-05',
 4803163.4600001), ('2013-07-06', 4871393.060000102), ('2013-07-07', 5353565.250000056), ('2013-07-08', 6300703.240000136), ('2013-07-09', 6757271.849999977), ('2013-07-10', 7053278.450000233), ('2013-07-11', 7508138.830000423), ('2013-07-12', 7513843.170000294), ('2013
-07-13', 6587855.010000009), ('2013-07-14', 6487142.680000196), ('2013-07-15', 6608855.280000178), ('2013-07-16', 6914721.6200002255), ('2013-07-17', 7342898.770000178), ('2013-07-18', 7727945.430000244), ('2013-07-19', 7717193.2900004145), ('2013-07-20', 6743033.680000
139), ('2013-07-21', 6250172.660000268), ('2013-07-22', 6489995.520000474), ('2013-07-23', 6953065.740000048), ('2013-07-24', 7237750.570000603), ('2013-07-25', 7461431.37000058), ('2013-07-26', 7313389.780000589), ('2013-07-27', 6562832.940000334), ('2013-07-28', 61878
79.040000292), ('2013-07-29', 6252667.090000435), ('2013-07-30', 6500185.240000397), ('2013-07-31', 6922351.720000497), ('2013-08-01', 4567602.710000372), ('2013-08-02', 3530800.6300003207), ('2013-08-03', 3182263.760000205), ('2013-08-04', 3073608.610000252), ('2013-08
-05', 6657935.770000243), ('2013-08-06', 6577890.380000218), ('2013-08-07', 6926720.310000213), ('2013-08-08', 6513403.630000377), ('2013-08-09', 7254503.710000245), ('2013-08-10', 6393553.63000018), ('2013-08-11', 3071194.3099999935), ('2013-08-12', 5995842.190000141),
 ('2013-08-13', 6663907.490000272), ('2013-08-14', 7553822.43000033), ('2013-08-15', 7311746.650000263), ('2013-08-16', 7139247.39000035), ('2013-08-17', 6526074.540000162), ('2013-08-18', 6063420.590000081), ('2013-08-19', 6165393.170000159), ('2013-08-20', 6470846.060
00012), ('2013-08-21', 6850371.850000195), ('2013-08-22', 7186063.9800003655), ('2013-08-23', 7094220.640000271), ('2013-08-24', 6514903.470000389), ('2013-08-25', 6194814.22000038), ('2013-08-26', 5950471.929999909), ('2013-08-27', 6235638.869999969), ('2013-08-28', 66
68984.810000051), ('2013-08-29', 7071593.720000107), ('2013-08-30', 7004371.110000109), ('2013-08-31', 6241374.149999882), ('2013-09-01', 5619155.650000589), ('2013-09-02', 5098490.359999916), ('2013-09-03', 6491695.110000019), ('2013-09-04', 6761444.71000005), ('2013-0
9-05', 6849652.47000005), ('2013-09-06', 7543070.9400001345), ('2013-09-07', 7410219.290000037), ('2013-09-08', 7025360.529999996), ('2013-09-09', 6881140.260000368), ('2013-09-10', 7333647.010000184), ('2013-09-11', 7512565.320000286), ('2013-09-12', 7916396.880000429)
, ('2013-09-13', 8187016.470000336), ('2013-09-14', 7305007.470000164), ('2013-09-15', 7059224.600000268), ('2013-09-16', 6851916.240000076), ('2013-09-17', 7184226.680000102), ('2013-09-18', 7477339.480000223), ('2013-09-19', 7814325.180000222), ('2013-09-20', 8145315.
140000067), ('2013-09-21', 7786891.859999995), ('2013-09-22', 7227919.220000006), ('2013-09-23', 6616708.470000274), ('2013-09-24', 6763584.100000315), ('2013-09-25', 7240695.170000259), ('2013-09-26', 7627537.790000347), ('2013-09-27', 7851086.620000188), ('2013-09-28'
, 7703734.320000196), ('2013-09-29', 7122464.470000364), ('2013-09-30', 6582569.77000078), ('2013-10-01', 6836625.290000099), ('2013-10-02', 7193015.640000195), ('2013-10-03', 7691451.380000072), ('2013-10-04', 7939072.640000345), ('2013-10-05', 7557050.500000456), ('20
13-10-06', 7112979.200000342), ('2013-10-07', 6508816.200000122), ('2013-10-08', 6966873.490000178), ('2013-10-09', 7426570.660000391), ('2013-10-10', 7928990.740000216), ('2013-10-11', 8081373.9700003285), ('2013-10-12', 7753415.920000272), ('2013-10-13', 7190288.14000
0472), ('2013-10-14', 6096785.550000258), ('2013-10-15', 6582799.590000266), ('2013-10-16', 7309766.840000517), ('2013-10-17', 7926713.43000032), ('2013-10-18', 8151395.510000412), ('2013-10-19', 7869540.080000158), ('2013-10-20', 7308765.270000379), ('2013-10-21', 6678
712.330000319), ('2013-10-22', 7000042.790000219), ('2013-10-23', 7525775.5900003435), ('2013-10-24', 7993033.550000335), ('2013-10-25', 8123852.1100004595), ('2013-10-26', 7677650.650000172), ('2013-10-27', 7194001.18000027), ('2013-10-28', 6610361.540000609), ('2013-1
0-29', 6912935.060000349), ('2013-10-30', 7278987.04000053), ('2013-10-31', 7462751.2800004035), ('2013-11-01', 8580527.570000838), ('2013-11-02', 7668629.240000647), ('2013-11-03', 7192686.050000586), ('2013-11-04', 6920247.070000112), ('2013-11-05', 6896607.980000235)
, ('2013-11-06', 7394702.590000292), ('2013-11-07', 8044201.770000227), ('2013-11-08', 8284354.440000417), ('2013-11-09', 7774557.650000313), ('2013-11-10', 7213840.790000227), ('2013-11-11', 6455445.770000205), ('2013-11-12', 7127845.610000043), ('2013-11-13', 7411807.
630000045), ('2013-11-14', 7789295.340000255), ('2013-11-15', 7970699.0000004), ('2013-11-16', 7721349.4400003515), ('2013-11-17', 7033678.51000013), ('2013-11-18', 6505814.620000014), ('2013-11-19', 7072677.400000162), ('2013-11-20', 7469779.110000047), ('2013-11-21', 
7738555.950000253), ('2013-11-22', 8034678.870000381), ('2013-11-23', 7790661.060000331), ('2013-11-24', 7030430.580000415), ('2013-11-25', 6486676.800000241), ('2013-11-26', 6882872.590000114), ('2013-11-27', 6483025.480000323), ('2013-11-28', 4748377.520000041), ('201
3-11-29', 4963309.899999993), ('2013-11-30', 6057614.490000117), ('2013-12-01', 6253203.700000689), ('2013-12-02', 6788031.6400004625), ('2013-12-03', 7134861.170000515), ('2013-12-04', 7452656.220000421), ('2013-12-05', 7814472.960000547), ('2013-12-06', 8341385.070000
742), ('2013-12-07', 7961737.730000368), ('2013-12-08', 7332464.940000491), ('2013-12-09', 6924651.610000456), ('2013-12-10', 7101578.310000417), ('2013-12-11', 7817996.160000437), ('2013-12-12', 8256267.2600003285), ('2013-12-13', 8580331.660000643), ('2013-12-14', 728
1088.110000422), ('2013-12-15', 6771079.61000035), ('2013-12-16', 7031713.510000614), ('2013-12-17', 7176713.170000481), ('2013-12-18', 7730028.7100007115), ('2013-12-19', 7922613.040000445), ('2013-12-20', 8183023.320000751), ('2013-12-21', 7176444.250000424), ('2013-1
2-22', 6072180.56000036), ('2013-12-23', 5657237.399999956), ('2013-12-24', 4961798.929999984), ('2013-12-25', 3465875.309999937), ('2013-12-26', 4601206.359999929), ('2013-12-27', 5608426.559999961), ('2013-12-28', 5782093.000000121), ('2013-12-29', 5918205.399999994),
 ('2013-12-30', 5956747.680000547), ('2013-12-31', 6272547.9800005)]   

**Repeat task 3 on the big file in your EMR cluster - copy your result RDD, which is the top 5 drivers based on their sum of revenue, from the big file, here:**  

[('664927CDE376A32789BA48BF55DFB7E3', 728594.3300000002),
 ('CFCD208495D565EF66E7DFF9F98764DA', 615220.1499999997),
 ('E4F99C9ABE9861F18BCD38BC63D007A9', 563445.9400000001),
 ('D85749E8852FCC66A990E40605607B2F', 246374.58999999997),
 ('1EDF99EE9DAC182027330EF48828B54A', 242656.10000000015)]

**Repeat task 4 on the big file in your EMR cluster. `2013_weekdays.csv` is also available in S3 through this URI `s3://comp643bucket/homework/spark_taxicab/2013_weekdays.csv`. Copy your result RDD, which is the sum of revenue per week day, from the big file, here:**  

[('Sunday', 341485998.91002834),
 ('Monday', 334818004.12003356),
 ('Friday', 394663376.4300214),
 ('Wednesday', 372733935.33000827),
 ('Thursday', 386085572.9700065),
 ('Saturday', 368934554.2300191),
 ('Tuesday', 362666230.20001215)]