# Create descriptive statistics for NYC Yellow Cab data set

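In this notebook, we launch a Dask cluster on Coiled, load the 2019 NYC Yellow Cab trip records from S3, and compute descriptive statistics for tip amount, trip distance, and fare amount.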

## Launch a cluster

The first step is to spin up a Dask cluster. In Coiled, this is done by creating a `coiled.Cluster` instance. There are [several keyword arguments](https://docs.coiled.io/user_guide/api.html#coiled.Cluster) you can use to specify the details of your cluster. See the [cluster creation documentation](https://docs.coiled.io/user_guide/cluster_creation.html) to learn more.

Note that we give this cluster a name. If you don't specify the `name` keyword argument, the cluster is given a unique, randomly generated name.

In [21]:
import coiled

cluster = coiled.Cluster(name="taxi-analysis", n_workers=10)

Found software environment build




Once a cluster has been created (you can see the status on your [Coiled dashboard](https://cloud.coiled.io/)), you can connect Dask to the cluster by creating a `distributed.Client` instance.

In [22]:
from dask.distributed import Client

client = Client(cluster)
client


+---------+---------------+---------------+---------------+
| Package | client        | scheduler     | workers       |
+---------+---------------+---------------+---------------+
| blosc   | None          | 1.10.2        | 1.10.2        |
| lz4     | None          | 3.1.3         | 3.1.3         |
| python  | 3.9.0.final.0 | 3.9.4.final.0 | 3.9.4.final.0 |
+---------+---------------+---------------+---------------+


Client: Scheduler: tls://ec2-34-227-242-212.compute-1.amazonaws.com:8786, Dashboard: http://ec2-34-227-242-212.compute-1.amazonaws.com:8787
Cluster: Workers: 10, Cores: 20, Memory: 80.00 GiB


## Analyze data in the cloud

Now that we have our cluster running and Dask connected to it, let's run a computation. This example will run the computation on about 84 million rows.

In [23]:
import dask.dataframe as dd

taxi_full = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

In [26]:
# print(taxi_full.tip_amount.mean().compute())
# print(taxi_full.trip_distance.mean().compute())
# print(taxi_full.fare_amount.mean().compute())
# print(len(taxi_full))

taxi_full[['tip_amount', 'trip_distance', 'fare_amount']].describe().compute()

statistic,tip_amount,trip_distance,fare_amount
count,84399020.0,84399020.0,84399020.0
mean,2.195064,3.000928,13.34399
std,15.65706,8.091114,174.3749
min,-221.0,-37264.53,-1856.0
25%,0.5,1.27,8.0
50%,2.06,2.32,12.5
75%,3.36,11.77,42.01
max,141492.0,45977.22,943274.8
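
To make the rows of the summary above concrete, here is a minimal plain-Python sketch of the same statistics (count, mean, sample standard deviation, min, quartiles, max). The `describe` helper and the `tips` values are made up for illustration and are not part of the notebook. On large partitioned data Dask may approximate the percentiles, so this shows what the fields mean rather than how Dask computes them.

```python
import statistics


def describe(values):
    """Summarize a list of numbers with the fields shown in the table above."""
    ordered = sorted(values)
    # "inclusive" quartiles use linear interpolation between data points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (n - 1)
        "min": ordered[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": ordered[-1],
    }


# Hypothetical tip amounts, for illustration only.
tips = [0.0, 1.0, 2.0, 3.0, 4.0]
summary = describe(tips)
for name, value in summary.items():
    print(f"{name}: {value}")
```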


In [27]:
# Keep only the trips where the rider left a positive tip
taxi_full_wtips = taxi_full[taxi_full['tip_amount'] > 0]

In [28]:
# Fraction of all trips that have a positive tip (each len() triggers a computation)
len(taxi_full_wtips) / len(taxi_full)

0.6899768941627153

In [29]:
# print(taxi_full_wtips.tip_amount.mean().compute(), taxi_full_wtips.tip_amount.std().compute())
# print(taxi_full_wtips.trip_distance.mean().compute())
# print(taxi_full_wtips.fare_amount.mean().compute())
# print(len(taxi_full_wtips))

taxi_full_wtips[['tip_amount', 'trip_distance', 'fare_amount']].describe().compute()

statistic,tip_amount,trip_distance,fare_amount
count,58233370.0,58233370.0,58233370.0
mean,3.181497,2.998858,13.29481
std,18.76556,3.844208,124.1712
min,0.01,-13.79,-400.0
25%,1.96,1.32,7.5
50%,2.66,2.3,11.0
75%,13.54,15.0825,65.0
max,141492.0,831.8,943274.8
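
As a quick sanity check, the share of tipped trips reported earlier (about 0.69) can be recomputed from the `count` rows of the two summaries. The sketch below uses the counts as they appear in the printed output. Those printed counts may be rounded for display, so a tiny difference from the reported fraction is expected.

```python
# Counts copied from the two describe() outputs in this notebook.
total_trips = 84399020    # all trips
tipped_trips = 58233370   # trips with tip_amount > 0

# Fraction printed by len(taxi_full_wtips) / len(taxi_full).
reported_fraction = 0.6899768941627153

recomputed = tipped_trips / total_trips
difference = abs(recomputed - reported_fraction)
print(f"recomputed: {recomputed:.6f}, difference: {difference:.2e}")
```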


In [30]:
# Disconnect the client first, then shut down the cluster to stop incurring costs
client.close()
coiled.delete_cluster(name="taxi-analysis")

## Wrapping up

