NYC Uber/Lyft Rides
===================

The NYC Taxi dataset is a timeless classic.  

Interestingly, there is a new variant.  The NYC Taxi and Limousine Commission (TLC) requires trip data from all for-hire ride services in the city of New York.  This includes private limousine services, van services, and a new category, "High Volume For Hire Vehicle" services: those that dispatch 10,000 rides per day or more.  This category was defined with Uber and Lyft in mind.

This data is available here:

In [None]:
import coiled
cluster = coiled.Cluster(
    n_workers=10,
    package_sync=True,
    backend_options={"region": "us-east-1"},
    # account="...",
)

from dask.distributed import Client
client = Client(cluster)
client

## Load Uber/Lyft dataset into distributed memory

In [None]:
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://coiled-datasets/mrocklin/nyc-taxi-fhv",
    storage_options={"anon": True},
)
df.head()

In [None]:
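# Convert string columns to PyArrow-backed strings, which use less memory,
# then persist the result in distributed memory

dtypes = {}
for column, dtype in df.dtypes.items():
    if dtype == "string":
        dtypes[column] = "string[pyarrow]"

df = df.astype(dtypes)

# Start loading the data onto the workers and keep it there for later queries
df = df.persist()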

Play time
---------

We don't actually know what to expect from this dataset.  No one on our team has spent much time inspecting it.  We'd like your help, as a new Dask user, in uncovering some interesting insights.

Care to explore and report your findings?

In [None]:
df.columns

In [None]:
# Total base passenger fares, in billions of dollars
df.base_passenger_fare.sum().compute() / 1e9

In [None]:
# Total driver pay, in billions of dollars
df.driver_pay.sum().compute() / 1e9

In [None]:
# Fraction of rides with a nonzero tip
(df.tips != 0).mean().compute()
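
As one possible next step, the sketch below combines the quantities above into two ratios: driver pay as a share of base passenger fares, and the fraction of rides with a tip.  It runs on a tiny made-up pandas frame so it can be tried without a cluster.  Only the three column names come from this notebook; the sample values and the `fare_summary` helper are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Tiny invented sample; only the column names come from the notebook above.
sample = pd.DataFrame(
    {
        "base_passenger_fare": [10.0, 20.0, 30.0, 40.0],
        "driver_pay": [8.0, 15.0, 20.0, 32.0],
        "tips": [0.0, 2.0, 0.0, 5.0],
    }
)


def fare_summary(frame):
    """Return the driver-pay share of fares and the fraction of tipped rides."""
    total_fare = frame["base_passenger_fare"].sum()
    total_pay = frame["driver_pay"].sum()
    # Mean of a boolean column is the fraction of True values
    tipped_fraction = (frame["tips"] != 0).mean()
    return {
        "driver_pay_share": total_pay / total_fare,
        "tipped_fraction": tipped_fraction,
    }


summary = fare_summary(sample)
print(summary)
```

The same expressions should work on the Dask DataFrame `df`, but each value would then be lazy and would need a `.compute()` call, as in the cells above.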