NYC Uber/Lyft Rides
===================

The NYC Taxi dataset is a timeless classic.  

Interestingly, there is a new variant.  The NYC Taxi and Limousine Commission (TLC) requires data from all ride-share services in the city of New York.  This includes private limousine services, van services, and a new category, "High Volume For Hire Vehicle" services: those that dispatch 10,000 rides per day or more.  This special category was defined for companies like Uber and Lyft.  

This data is available here:

In [None]:
import coiled
cluster = coiled.Cluster(
    n_workers=20,
    backend_options={"region_name": "us-east-2"},
)

client = cluster.get_client()

In [None]:
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    storage_options={"anon": True},
)
df.head()

Play time
---------

We don't actually know what to expect from this dataset.  No one on our team has spent much time inspecting it.  We'd like to ask you, new Dask user, to help uncover some interesting insights.  

Care to explore and report your findings?

In [None]:
# Keep the loaded data in cluster memory so later queries don't re-read it from S3
df = df.persist()

df.columns

## Basic statistics

In [None]:
columns = ["base_passenger_fare", "driver_pay", "tips", "trip_miles"]
total = df[columns].sum()
average = df[columns].mean()

# Compute both results in a single call so Dask can share work between them
total, average = dask.compute(total, average)
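Passing several lazy results to one `dask.compute` call lets Dask merge their task graphs, so any shared steps run only once. Below is a minimal sketch of that behavior using `dask.delayed` on toy data. The `load` function, the `calls` list, and the `toy_*` names are hypothetical and exist only for this illustration.

```python
import dask

calls = []  # records how many times the shared step actually runs


@dask.delayed
def load():
    # Stand-in for an expensive shared step (hypothetical, for illustration)
    calls.append(1)
    return [1, 2, 3, 4]


data = load()
lazy_total = dask.delayed(sum)(data)
lazy_count = dask.delayed(len)(data)

# One compute call evaluates both results from a single merged graph
toy_total, toy_count = dask.compute(lazy_total, lazy_count)

print(toy_total, toy_count, len(calls))
```

Calling `.compute()` on each result separately would instead run `load` once per call.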

In [None]:
total

In [None]:
average

## Tipping Practices

In [None]:
(df.tips != 0).mean().compute()
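The expression above works because the mean of a boolean column is the fraction of `True` values. Here is a minimal sketch of the idiom on a tiny pandas DataFrame. The values are invented for illustration and are not taken from the TLC data. The same expression works on a Dask DataFrame with a trailing `.compute()`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"tips": [0.0, 2.5, 0.0, 1.0]})

# Comparing against zero yields a boolean Series; its mean is the
# fraction of rows where the comparison is True
tipped_fraction = (toy.tips != 0).mean()
print(tipped_fraction)  # 0.5
```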

## Broken down by carrier

In [None]:
df.hvfhs_license_num.value_counts().compute()

In [None]:
%%time
# Flag each ride as tipped or not, then average the flag per license number
df["tipped"] = df.tips != 0

df.groupby("hvfhs_license_num").tipped.mean().compute()
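To see what the grouped result looks like without a cluster, here is a small pandas sketch of the same pattern. The rows are made up, and the license numbers are used only as example group labels. It illustrates the groupby step under those assumptions and says nothing about the real data.

```python
import pandas as pd

# Toy rides, invented for illustration only
toy = pd.DataFrame({
    "hvfhs_license_num": ["HV0003", "HV0003", "HV0005", "HV0005"],
    "tips": [0.0, 2.0, 0.0, 0.0],
})

# Same two steps as above: flag tipped rides, then average per group
toy["tipped"] = toy.tips != 0
tipped_by_carrier = toy.groupby("hvfhs_license_num").tipped.mean()
print(tipped_by_carrier)
```

The result is a Series indexed by license number, where each value is the share of that group's rides that included a tip.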