NYC Uber/Lyft Rides
===================

The NYC Taxi dataset is a timeless classic.  

Interestingly, there is a new variant.  The NYC Taxi and Limousine Commission (TLC) requires trip data from all for-hire vehicle services in New York City.  This includes private limousine services, van services, and a new category, "High Volume For-Hire Vehicle" services, which dispatch 10,000 or more rides per day.  This special category was defined for Uber and Lyft.

This data is available here:

In [None]:
import coiled

name = "your_name"  # avoid reusing a teammate's cluster

cluster = coiled.Cluster(
    n_workers=30,
    account="events",
    name=f"uber-lyft_pydata-seattle_{name}",
    shutdown_on_close=False,
)

client = cluster.get_client()

In [None]:
client

In [None]:
import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    storage_options={"anon": True},
)
df.head()

Play time
---------

We don't actually know what to expect from this dataset.  No one on our team has spent much time inspecting it.  We'd like to ask you, as a new Dask user, to help uncover some interesting insights.

Care to explore and report your findings?

In [None]:
df = df.persist()

df.columns

In [None]:
%time df.base_passenger_fare.sum().compute()

In [None]:
df.driver_pay.sum().compute()

## Basic statistics

In [None]:
total = df[["base_passenger_fare", "driver_pay", "tips", "trip_miles"]].sum()
average = df[["base_passenger_fare", "driver_pay", "tips", "trip_miles"]].mean()



In [None]:
total

In [None]:
average

In [None]:
total, average = dask.compute(total, average)

## Tipping Practices

In [None]:
df[df.tips != 0].tips.mean().compute()

In [None]:
(df.tips != 0).mean().compute()

In [None]:
df.base_passenger_fare.sum().compute()

In [None]:
df.driver_pay.sum().compute()

In [None]:
(df.driver_pay > df.base_passenger_fare).mean().compute()
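
The cells in this section rely on one idiom: the mean of a boolean column is the fraction of rows where the condition holds. Here is a minimal sketch on a tiny pandas frame. The values are made up for illustration and are not TLC data. It separates the tip rate from the average tip among rides that were tipped:

```python
import pandas as pd

# Hypothetical toy data, not real TLC records.
toy = pd.DataFrame({"tips": [0.0, 2.0, 0.0, 4.0]})

tipped = toy.tips != 0                         # boolean mask: True where a tip was left
tip_rate = tipped.mean()                       # True counts as 1, so this is the share of tipped rides
avg_tip_when_tipped = toy[tipped].tips.mean()  # average over tipped rides only
```

The same expressions work on the Dask DataFrame above, except that each result stays lazy until `.compute()` is called.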

## Broken down by carrier

In [None]:
df.hvfhs_license_num.value_counts().compute()

In [None]:
df["tipped"] = df.tips != 0

tip_by_provider = df.groupby("hvfhs_license_num").tipped.mean().compute()
tip_by_provider

In [None]:
tip_by_provider

In [None]:
provider = {"HV0002": "Juno", "HV0005": "Lyft", "HV0003": "Uber", "HV0004": "Via"}
tip_by_provider = tip_by_provider.to_frame().set_index(tip_by_provider.index.map(provider))

tip_by_provider
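
One alternative for relabeling is `Series.rename` with a dictionary, sketched here on made-up rates (the numbers are hypothetical and not computed from the dataset). Unlike `index.map`, which turns unmapped codes into missing values, `rename` leaves any code absent from the dictionary unchanged:

```python
import pandas as pd

# Hypothetical tip rates keyed by license number, for illustration only.
rates = pd.Series({"HV0003": 0.25, "HV0005": 0.5}, name="tipped")
provider = {"HV0002": "Juno", "HV0005": "Lyft", "HV0003": "Uber", "HV0004": "Via"}

# Relabel the index; codes missing from `provider` would keep their original label.
named = rates.rename(index=provider)
```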

In [None]:
# Runs until interrupted: stop the kernel to break out of the loop.
while True:
    df["tipped"] = df.tips != 0

    df.groupby("hvfhs_license_num").tipped.mean().compute()

In [None]:
client.restart()

In [None]:
df["tipped"] = df.tips != 0

x = df.groupby("hvfhs_license_num").tipped.mean().persist()

In [None]:
client.restart()

In [None]:
cluster.shutdown()

## Dask TV

In [None]:
# Runs until interrupted: stop the kernel to break out of the loop.
while True:
    client.restart()
    dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

    df = dd.read_parquet(
        "s3://coiled-datasets/uber-lyft-tlc/",
        storage_options={"anon": True},
    ).persist()

    for _ in range(10):
        df["tipped"] = df.tips != 0

        df.groupby("hvfhs_license_num").tipped.mean().compute()