Dataset: https://github.com/fivethirtyeight/uber-tlc-foil-response

This dataset contains data on more than 4.5 million Uber trips in New York City from April to September 2014, and 14.3 million more from January to June 2015.<br />
Trip data from 10 other for-hire vehicle (FHV) companies is also included, as well as aggregated data for 329 FHV companies.<br />
All files were received on August 3, September 15 and September 22, 2015.

1 - How many Uber bases are there, and which are they?<br />
2 - What is the total number of vehicles that passed through the B02617 base?<br />
3 - What is the total number of trips per base? Present the result in descending order.

In [1]:
from pandas import read_csv 

In [2]:
dfUber = read_csv('aux/datasets/uber.csv')
type(dfUber)

pandas.core.frame.DataFrame

In [3]:
dfUber.head(5)

| | dispatching_base_number | date | active_vehicles | trips |
|---|---|---|---|---|
| 0 | B02512 | 1/1/2015 | 190 | 1132 |
| 1 | B02765 | 1/1/2015 | 225 | 1765 |
| 2 | B02764 | 1/1/2015 | 3427 | 29421 |
| 3 | B02682 | 1/1/2015 | 945 | 7679 |
| 4 | B02617 | 1/1/2015 | 1228 | 9537 |


**Converting to a Spark DataFrame**

In [4]:
dfUber = sqlContext.createDataFrame(dfUber)
type(dfUber)

pyspark.sql.dataframe.DataFrame

**Creating an RDD**

In [5]:
rddUber = sc.textFile('aux/datasets/uber.csv')
type(rddUber)

pyspark.rdd.RDD

In [6]:
rddUber.count()

355

In [7]:
rddUber.first()

'dispatching_base_number,date,active_vehicles,trips'

In [8]:
rddUber.take(5)

['dispatching_base_number,date,active_vehicles,trips',
 'B02512,1/1/2015,190,1132',
 'B02765,1/1/2015,225,1765',
 'B02764,1/1/2015,3427,29421',
 'B02682,1/1/2015,945,7679']

In [9]:
rddUberSplit = rddUber.map(lambda line: line.split(','))
rddUberSplit.take(5)

[['dispatching_base_number', 'date', 'active_vehicles', 'trips'],
 ['B02512', '1/1/2015', '190', '1132'],
 ['B02765', '1/1/2015', '225', '1765'],
 ['B02764', '1/1/2015', '3427', '29421'],
 ['B02682', '1/1/2015', '945', '7679']]
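After the split, every field is still a string, so any numeric work needs a conversion step. The following is a minimal sketch of one way to do that, run here on the sample lines shown above rather than on the RDD. The names `TripRecord` and `parse_row` are illustrative and not part of the notebook.

```python
from typing import NamedTuple, Optional


class TripRecord(NamedTuple):
    base: str
    date: str             # kept as text; the source does not state the date format
    active_vehicles: int
    trips: int


def parse_row(line: str) -> Optional[TripRecord]:
    """Split one CSV line and convert the numeric columns.

    Returns None for the header line so it can be filtered out afterwards.
    """
    fields = line.split(',')
    if fields[0] == 'dispatching_base_number':
        return None
    return TripRecord(fields[0], fields[1], int(fields[2]), int(fields[3]))


# Sample lines copied from the rddUber.take(5) output above.
sample_lines = [
    'dispatching_base_number,date,active_vehicles,trips',
    'B02512,1/1/2015,190,1132',
    'B02765,1/1/2015,225,1765',
]

records = [r for r in map(parse_row, sample_lines) if r is not None]
print(records[0].trips)  # 1132
```

Assuming the same `rddUber` as above, the function could be applied in Spark as `rddUber.map(parse_row).filter(lambda r: r is not None)`.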

**Answer to Question 1**

In [10]:
# The header row is still in the RDD, so subtract 1 from the distinct count
rddUberSplit.map(lambda line: line[0]).distinct().count() - 1

6

In [11]:
rddUberSplit.map(lambda line: line[0]).distinct().collect()

['dispatching_base_number',
 'B02765',
 'B02682',
 'B02598',
 'B02512',
 'B02764',
 'B02617']
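The collected list still contains the header string, which is why the count needs a manual `- 1`. A sketch of an alternative is to drop the header before taking the distinct values. It is shown here in plain Python on the five rows from `dfUber.head(5)`; the helper name `is_data_row` is hypothetical.

```python
HEADER = ['dispatching_base_number', 'date', 'active_vehicles', 'trips']


def is_data_row(fields):
    """True for every row except the CSV header."""
    return fields != HEADER


# Header plus the five sample rows shown earlier in the notebook.
rows = [
    HEADER,
    ['B02512', '1/1/2015', '190', '1132'],
    ['B02765', '1/1/2015', '225', '1765'],
    ['B02764', '1/1/2015', '3427', '29421'],
    ['B02682', '1/1/2015', '945', '7679'],
    ['B02617', '1/1/2015', '1228', '9537'],
]

# A set keeps one entry per base; sorting makes the output deterministic.
bases = sorted({fields[0] for fields in filter(is_data_row, rows)})
print(len(bases), bases)
# 5 ['B02512', 'B02617', 'B02682', 'B02764', 'B02765']
```

The sample covers only five of the six bases. On the full RDD the equivalent chain would presumably be `rddUberSplit.filter(is_data_row).map(lambda f: f[0]).distinct()`, which needs no header correction.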

**Answer to Question 2**

In [12]:
# Count the records for base B02617 (exact match rather than substring test)
rddUberSplit.filter(lambda line: line[0] == 'B02617').count()

59
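Note that 59 is the number of records for the base, not a number of vehicles. If the question is read as asking about the `active_vehicles` column, one possible interpretation is to sum that column for the base. The sketch below does this in plain Python on the sample rows; `vehicle_days` is a hypothetical helper, and because the same vehicle can be active on several days, the sum counts vehicle-days rather than unique vehicles.

```python
def vehicle_days(rows, base):
    """Sum the active_vehicles column (index 2) over all rows of one base."""
    return sum(int(fields[2]) for fields in rows if fields[0] == base)


# Header plus the five sample rows from dfUber.head(5).
rows = [
    ['dispatching_base_number', 'date', 'active_vehicles', 'trips'],
    ['B02512', '1/1/2015', '190', '1132'],
    ['B02765', '1/1/2015', '225', '1765'],
    ['B02764', '1/1/2015', '3427', '29421'],
    ['B02682', '1/1/2015', '945', '7679'],
    ['B02617', '1/1/2015', '1228', '9537'],
]

print(vehicle_days(rows, 'B02617'))  # 1228 (only one B02617 row in the sample)
```

With the notebook's RDD, the same idea could be written as `rddUberSplit.filter(lambda f: f[0] == 'B02617').map(lambda f: int(f[2])).sum()`.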

**Answer to Question 3**

In [13]:
header = rddUberSplit.first()
rddUberNoHeader = rddUberSplit.filter(lambda line: line != header)

In [14]:
# Build (base, trips) pairs, sum the trips per base, then take the six largest totals
rddUberNoHeader \
    .map(lambda row: (row[0], int(row[3]))) \
    .reduceByKey(lambda n1, n2: n1 + n2) \
    .takeOrdered(6, key=lambda kv: -kv[1])

[('B02764', 1914449),
 ('B02617', 725025),
 ('B02682', 662509),
 ('B02598', 540791),
 ('B02765', 193670),
 ('B02512', 93786)]
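To make the two Spark steps easier to follow, here is a rough plain-Python analogue of `reduceByKey` followed by `takeOrdered` with a negated key. The helpers `reduce_by_key` and `take_ordered` are illustrative stand-ins written for this sketch, not Spark functions. The input pairs come from the five sample rows, where each base appears only once, so the totals equal the single-day values.

```python
import heapq


def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func, like a local reduceByKey."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return list(totals.items())


def take_ordered(items, n, key):
    """Return the n smallest items by key, in ascending key order."""
    return heapq.nsmallest(n, items, key=key)


# (base, trips) pairs taken from the sample rows for 1/1/2015.
pairs = [
    ('B02512', 1132),
    ('B02765', 1765),
    ('B02764', 29421),
    ('B02682', 7679),
    ('B02617', 9537),
]

totals = reduce_by_key(pairs, lambda a, b: a + b)
# Negating the trip count turns "smallest first" into "largest first".
top3 = take_ordered(totals, 3, key=lambda kv: -kv[1])
print(top3)
# [('B02764', 29421), ('B02617', 9537), ('B02682', 7679)]
```

The negated key is what produces the descending order in the answer above: the ordering is ascending by key, so the largest trip totals come first.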