NYCTaxi 2013
===========

In 2014 Chris Whong obtained the records of all New York City taxi rides for 2013 through a Freedom of Information Law (FOIL) request.

http://chriswhong.com/open-data/foil_nyc_taxi/

We use [dask.dataframe](http://dask.pydata.org/en/latest/dataframe.html) and [castra](https://github.com/blaze/castra) to interact with this data.

### Download and untar

This should take less than two minutes when running on Google Compute Engine (via Binder). If you are running on your own hardware, please avoid repeated downloads: each transfer of this dataset out of the cloud costs us about $1.

In [None]:
!wget https://storage.googleapis.com/blaze-data/nyc-taxi/castra/nyc-taxi-2013.castra.tar

In [None]:
# This takes about a minute
!tar -xf nyc-taxi-2013.castra.tar

### Wrap with dask.dataframe

In [None]:
import dask.dataframe as dd
df = dd.from_castra('tripdata.castra/')
df.head()

### Set up progress bar

This lets us know how long our `dask.dataframe` computations take.

In [None]:
from dask.diagnostics import ProgressBar
progress_bar = ProgressBar()
progress_bar.register()

## Play



### How many passengers per ride?

In [None]:
df.passenger_count.value_counts().compute()

### How many medallions on the road per day?

In [None]:
%matplotlib inline
df.medallion.resample('1d', how='nunique').compute().plot()
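
To make the `nunique` resample concrete, here is a minimal plain-Python sketch of the same idea: bucket rides by calendar day and count the distinct medallions in each bucket. The ride records, the medallion IDs, and the helper name `unique_per_day` are made up for illustration and are not taken from the dataset or from dask.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (pickup time, medallion) pairs; not real records.
rides = [
    (datetime(2013, 1, 1, 0, 15), "A"),
    (datetime(2013, 1, 1, 9, 30), "B"),
    (datetime(2013, 1, 1, 23, 59), "A"),  # same cab again on the same day
    (datetime(2013, 1, 2, 8, 0), "A"),
]

def unique_per_day(rides):
    """Return {date: number of distinct medallions seen on that date}."""
    seen = defaultdict(set)
    for pickup_time, medallion in rides:
        seen[pickup_time.date()].add(medallion)
    return {day: len(cabs) for day, cabs in sorted(seen.items())}

daily = unique_per_day(rides)
```

Cab "A" appears twice on 1 January but is counted once for that day, which is the difference between counting unique cabs and counting rides.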

### Let's consider this per hour

We'll switch from matplotlib to bokeh so that we can zoom in on the plot.

In [None]:
hourly = df.medallion.resample('1h', how='nunique').compute()

In [None]:
from bokeh.charts import TimeSeries, output_notebook, show
output_notebook()

fig = TimeSeries(hourly.values, hourly.index, title='Cabs on the road',
                 xlabel='Time', ylabel='Number of cabs')
show(fig)

Next we will visualize the pickup locations. 

In [None]:
df2 = df.sample(frac=0.0001)
# Remove some outliers
df3 = df2[(df2.pickup_latitude > 40) &
          (df2.pickup_latitude < 42) & 
          (df2.pickup_longitude > -75) & 
          (df2.pickup_longitude < -72)]
pickup = df3[['pickup_latitude', 'pickup_longitude']]


result = pickup.compute()
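
The outlier filter above is a bounding-box test on the pickup coordinates. As a rough sketch, the same test in plain Python looks like this; the helper name `in_nyc_box` and the sample points are hypothetical and chosen only to show one point being dropped.

```python
def in_nyc_box(lat, lon):
    """True if the point lies strictly inside the box used by the filter above."""
    return 40 < lat < 42 and -75 < lon < -72

# Made-up coordinates for illustration.
points = [
    (40.75, -73.99),  # plausible Manhattan pickup
    (0.0, 0.0),       # an obviously invalid reading, assumed for this example
    (40.64, -73.78),  # plausible Queens pickup
]

kept = [p for p in points if in_nyc_box(*p)]
```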

In [None]:
from bokeh.plotting import figure, show

p = figure(title="Pickup Locations")
# output_notebook() was already called above, so show() renders inline
p.scatter(result.pickup_longitude, result.pickup_latitude, size=3, alpha=0.2)
show(p)

Next we plot a histogram of the distances travelled.

In [None]:
km_per_degree_latitude = 110.0  # at 40 degrees 
km_per_degree_longitude = 85.0  # from http://www.csgnetwork.com/degreelenllavcalc.html

with ProgressBar():
    distance = ((km_per_degree_latitude * (df3.pickup_latitude - df3.dropoff_latitude))**2
              + (km_per_degree_longitude * (df3.pickup_longitude - df3.dropoff_longitude))**2)**0.5
    hist = (distance // 1).value_counts().compute().sort_index()  # truncate then count
hist.head()
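
The binning step `(distance // 1).value_counts()` floors each distance to a whole kilometre and counts how many trips land in each bin. A small stand-alone sketch of that step, using made-up distances and the standard library instead of dask, might look like this:

```python
import math
from collections import Counter

# Made-up trip distances in km; not taken from the dataset.
distances_km = [0.2, 0.9, 1.5, 2.1, 2.7, 2.9]

# Flooring a non-negative distance is the same as truncating it.
bins = Counter(math.floor(d) for d in distances_km)

# Sort by bin so the result reads like a histogram.
hist_toy = dict(sorted(bins.items()))
```

Here the bins are integers, whereas `distance // 1` on a float column keeps floating-point labels such as `0.0` and `1.0`.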

In [None]:
p = figure(title="Binned distance frequencies (km)",
           y_axis_type="log")
p.line(hist.loc[:100].index.values, hist.loc[:100].values)
show(p)

Notice the spikes around 20 km; these are probably trips to the airport.
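
As a sanity check on the airport guess, the sketch below applies the same flat-earth distance formula to one pair of points and compares it with a great-circle (haversine) distance. The coordinates for Times Square and JFK are approximate values assumed for illustration, the function names are hypothetical, and both results are straight-line distances rather than driving distances.

```python
import math

KM_PER_DEG_LAT = 110.0  # same constants as the distance cell above
KM_PER_DEG_LON = 85.0

def flat_distance_km(lat1, lon1, lat2, lon2):
    """Distance from scaled latitude/longitude differences (local flat approximation)."""
    dy = KM_PER_DEG_LAT * (lat1 - lat2)
    dx = KM_PER_DEG_LON * (lon1 - lon2)
    return math.hypot(dx, dy)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance on a sphere, used here only as a reference."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates, assumed for this example.
times_square = (40.7580, -73.9855)
jfk = (40.6413, -73.7781)

flat = flat_distance_km(*times_square, *jfk)
great_circle = haversine_km(*times_square, *jfk)
```

Under these assumptions both functions give a little over 20 km for a Midtown to JFK trip, which is consistent with the airport explanation, and the two results differ by well under 2 percent over this short range.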