
NYC Uber/Lyft Rides
===================

<img src="https://docs.dask.org/en/stable/_images/dask-dataframe.svg"
     align="right"
     width="40%"/>

The NYC Taxi dataset is a timeless classic.  

Interestingly, there is a newer variant.  The NYC Taxi and Limousine Commission (TLC) requires data from all ride-share services in the city of New York.  This includes private limousine services, van services, and a new category, "High Volume For Hire Vehicle" services: those that dispatch 10,000 rides per day or more.  This is a special category defined for Uber and Lyft.  

This data is available here:

In [None]:
import dask.distributed
import coiled

cluster = coiled.Cluster(
    n_workers=30,
    region="us-east-2",  # start workers close to data to minimize costs
)

client = cluster.get_client()

In [None]:
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)
df.head()

Play time
---------

We don't actually know what to expect from this dataset.  No one on our team has spent much time inspecting it.  We'd like to ask you, new Dask user, to help uncover some interesting insights.  

Care to explore and report your findings?

In [None]:
df = df.persist()

df.columns

## Tipping Practices

In [None]:
# How often do New Yorkers tip?

(df.tips != 0).mean().compute()

## Broken down by carrier

In [None]:
# Uber / Lyft / Via / ... different carriers
df.hvfhs_license_num.value_counts().compute()
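
The license numbers above are codes rather than company names. Here is a minimal sketch for making the output readable. The mapping is an assumption based on the TLC's published data dictionary, and `CARRIER_NAMES` and `carrier_name` are hypothetical names introduced only for this illustration, so check the codes against the current dictionary before relying on them.

```python
# Assumed mapping from TLC license codes to carrier names.
# Verify against the TLC data dictionary before relying on it.
CARRIER_NAMES = {
    "HV0002": "Juno",
    "HV0003": "Uber",
    "HV0004": "Via",
    "HV0005": "Lyft",
}


def carrier_name(license_num: str) -> str:
    """Return a readable carrier name, falling back to the raw code."""
    return CARRIER_NAMES.get(license_num, license_num)


# Possible use with the Dask result from the cell above (not run here):
# counts = df.hvfhs_license_num.value_counts().compute()
# counts.index = counts.index.map(carrier_name)
```

Unknown codes pass through unchanged, so a carrier that is missing from the mapping still shows up in the result under its raw code.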

In [None]:
df["tipped"] = df.tips != 0

df.groupby("hvfhs_license_num").tipped.mean().compute()
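
Why does the mean of `tipped` give a tipping rate? Booleans count as 0 and 1, so their mean is the fraction of rides with a non-zero tip. Here is a small sketch in plain pandas, using a made-up six-row table (the values are invented for illustration and are not from the dataset). Dask DataFrame mirrors this part of the pandas API, so the same expression works on the full data with `.compute()` added.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "hvfhs_license_num": ["HV0003", "HV0003", "HV0005", "HV0005", "HV0005", "HV0005"],
        "tips": [0.0, 2.5, 0.0, 0.0, 0.0, 1.0],
    }
)

# True where the rider left any tip at all
toy["tipped"] = toy.tips != 0

# Mean of a boolean column = fraction of True values in each group
rates = toy.groupby("hvfhs_license_num").tipped.mean()
print(rates)
```

In the toy table, one of the two `HV0003` rides and one of the four `HV0005` rides include a tip, so the rates come out to 0.5 and 0.25.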

## Dask TV

We run this at conference events just to make the dashboard go and draw a crowd.  Colloquially we call this "Dask TV".  Enjoy!

In [None]:
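import dask
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

# Runs until interrupted; stop it by interrupting the kernel.
while True:
    client.restart()  # restart workers so each pass reloads the data from scratch

    df = dd.read_parquet(
        "s3://coiled-datasets/uber-lyft-tlc/",
        storage_options={"anon": True},  # anonymous S3 access, no AWS credentials needed
    ).persist()

    # Repeat the per-carrier tipping query to keep the dashboard busy
    for _ in range(10):
        df["tipped"] = df.tips != 0

        df.groupby("hvfhs_license_num").tipped.mean().compute()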