# Create NYC TLC Parquet files

The New York City Taxi and Limousine Commission (TLC) Trip Record Data is stored in S3; see the [Registry of Open Data on AWS](https://registry.opendata.aws/nyc-tlc-trip-records-pds/).

This notebook reads the CSV data and writes it out as Parquet files, which are easier to work with and faster to query.

In [1]:
import coiled
import dask
import dask.dataframe as dd
import pandas as pd

In [2]:
from coiled.v2 import Cluster

cluster = Cluster(
    name="nyc-tlc-cleaning",
    n_workers=25,
    software="coiled-examples/dask-dataframes",
)

In [3]:
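# import Client explicitly; `import dask` alone does not guarantee that dask.distributed is loaded
from dask.distributed import Client

client = Client(cluster)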


+---------+----------------+---------------+---------------+
| Package | client         | scheduler     | workers       |
+---------+----------------+---------------+---------------+
| msgpack | 1.0.3          | 1.0.2         | 1.0.2         |
| python  | 3.9.12.final.0 | 3.9.7.final.0 | 3.9.7.final.0 |
+---------+----------------+---------------+---------------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


In [17]:
client.restart()

Connection method: Cluster object
Cluster type: coiled.ClusterBeta
Dashboard: http://44.201.64.146:8787

Dashboard: http://44.201.64.146:8787
Workers: 16
Total threads: 32
Total memory: 60.41 GiB

Comm: tls://172.18.0.2:8786
Dashboard: http://172.18.0.2:8787/status
Workers: 16
Total threads: 32
Started: 8 minutes ago
Total memory: 60.41 GiB

Comm,Dashboard,Nanny,Total threads,Memory,Local directory
tls://10.4.11.30:42273,http://10.4.11.30:44809/status,tls://10.4.11.30:40283,2,3.78 GiB,/scratch/dask-worker-space/worker-at5mmo_6
tls://10.4.6.3:40221,http://10.4.6.3:36069/status,tls://10.4.6.3:33077,2,3.78 GiB,/scratch/dask-worker-space/worker-qbqzjwfl
tls://10.4.11.211:36947,http://10.4.11.211:39639/status,tls://10.4.11.211:37359,2,3.78 GiB,/scratch/dask-worker-space/worker-topn9gaj
tls://10.4.7.42:36537,http://10.4.7.42:36813/status,tls://10.4.7.42:45987,2,3.78 GiB,/scratch/dask-worker-space/worker-flqvhf_p
tls://10.4.9.133:34783,http://10.4.9.133:32853/status,tls://10.4.9.133:38339,2,3.78 GiB,/scratch/dask-worker-space/worker-3x47q6vl
tls://10.4.1.167:34123,http://10.4.1.167:36877/status,tls://10.4.1.167:41471,2,3.78 GiB,/scratch/dask-worker-space/worker-t8ww57v_
tls://10.4.8.1:37683,http://10.4.8.1:45515/status,tls://10.4.8.1:33001,2,3.78 GiB,/scratch/dask-worker-space/worker-xpom3hcb
tls://10.4.12.124:36675,http://10.4.12.124:33199/status,tls://10.4.12.124:39701,2,3.78 GiB,/scratch/dask-worker-space/worker-85ohi1r9
tls://10.4.1.234:35417,http://10.4.1.234:46541/status,tls://10.4.1.234:33431,2,3.78 GiB,/scratch/dask-worker-space/worker-bauwoa_x
tls://10.4.12.55:45685,http://10.4.12.55:40569/status,tls://10.4.12.55:41417,2,3.78 GiB,/scratch/dask-worker-space/worker-xl8w2jrr
tls://10.4.5.172:46235,http://10.4.5.172:46811/status,tls://10.4.5.172:43471,2,3.78 GiB,/scratch/dask-worker-space/worker-78vcbrf4
tls://10.4.8.105:37235,http://10.4.8.105:34793/status,tls://10.4.8.105:44077,2,3.78 GiB,/scratch/dask-worker-space/worker-hri5z44k
tls://10.4.6.90:33341,http://10.4.6.90:36317/status,tls://10.4.6.90:35321,2,3.78 GiB,/scratch/dask-worker-space/worker-xnm_0hjj
tls://10.4.0.246:37487,http://10.4.0.246:42595/status,tls://10.4.0.246:42141,2,3.78 GiB,/scratch/dask-worker-space/worker-fwr_vamh
tls://10.4.5.110:40213,http://10.4.5.110:40015/status,tls://10.4.5.110:36315,2,3.78 GiB,/scratch/dask-worker-space/worker-38nssb1v
tls://10.4.15.154:40349,http://10.4.15.154:45249/status,tls://10.4.15.154:46847,2,3.78 GiB,/scratch/dask-worker-space/worker-2tfcqty6


## 2009 data create

In [18]:
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2009-*.csv",
    parse_dates=["Trip_Pickup_DateTime", "Trip_Dropoff_DateTime"],
    dtype={
        "Tolls_Amt": "float64",
        "store_and_forward": "object",
    },
)
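
The `dtype` argument above pins two columns explicitly. Dask infers CSV dtypes from a sample at the start of the data, so a column that looks like integers in the sample but holds decimals in a later file can yield partitions with mismatched types. The pandas-only sketch below illustrates the idea with two invented CSV chunks standing in for two partitions; the chunk contents are made up and are not taken from the TLC files.

```python
import io

import pandas as pd

# Two hypothetical CSV chunks standing in for two partitions of one dataset.
chunk_a = "Tolls_Amt\n0\n0\n"
chunk_b = "Tolls_Amt\n0\n4.15\n"

# Without a pinned dtype, each chunk is inferred on its own.
inferred = [
    pd.read_csv(io.StringIO(chunk))["Tolls_Amt"].dtype
    for chunk in (chunk_a, chunk_b)
]

# With a pinned dtype, every chunk gets the same type.
pinned = [
    pd.read_csv(io.StringIO(chunk), dtype={"Tolls_Amt": "float64"})["Tolls_Amt"].dtype
    for chunk in (chunk_a, chunk_b)
]

print(inferred)  # the two chunks disagree (int64 vs float64)
print(pinned)    # both chunks are float64
```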

In [19]:
ddf.head()

Unnamed: 0,vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,Start_Lon,Start_Lat,Rate_Code,store_and_forward,End_Lon,End_Lat,Payment_Type,Fare_Amt,surcharge,mta_tax,Tip_Amt,Tolls_Amt,Total_Amt
0,VTS,2009-01-04 02:52:00,2009-01-04 03:02:00,1,2.63,-73.991957,40.721567,,,-73.993803,40.695922,CASH,8.9,0.5,,0.0,0.0,9.4
1,VTS,2009-01-04 03:31:00,2009-01-04 03:38:00,3,4.55,-73.982102,40.73629,,,-73.95585,40.76803,Credit,12.1,0.5,,2.0,0.0,14.6
2,VTS,2009-01-03 15:43:00,2009-01-03 15:57:00,5,10.35,-74.002587,40.739748,,,-73.869983,40.770225,Credit,23.7,0.0,,4.74,0.0,28.44
3,DDS,2009-01-01 20:52:58,2009-01-01 21:14:00,1,5.0,-73.974267,40.790955,,,-73.996558,40.731849,CREDIT,14.9,0.5,,3.05,0.0,18.45
4,DDS,2009-01-24 16:18:23,2009-01-24 16:24:56,1,0.4,-74.00158,40.719382,,,-74.008378,40.72035,CASH,3.7,0.0,,0.0,0.0,3.7


#### Clean Rate_Code NaNs

In [20]:
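# # clean rate_code column NaNs: treat a missing rate code as 1
# import math

# def rate_code_to_one(something):
#     if math.isnan(something):
#         return 1
#     else:
#         return something

# # meta is (output name, output dtype); cast afterwards to reach the int64 target dtype
# ddf = ddf.assign(
#     Rate_Code=ddf.Rate_Code.apply(
#         rate_code_to_one, meta=("Rate_Code", "float64")
#     ).astype("int64")
# )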
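
A vectorized alternative to the `apply` call in the commented-out cell above is `fillna` followed by a cast. This is a small pandas sketch on made-up values; treating a missing rate code as 1 follows the commented-out cell, and Dask series offer the same two methods.

```python
import pandas as pd

# Made-up sample of rate codes with gaps (None becomes NaN in a float64 Series).
rate_code = pd.Series([1.0, None, 2.0, None], name="Rate_Code")

# Replace missing values with 1, then cast so the column matches the int64 target dtype.
cleaned = rate_code.fillna(1).astype("int64")

print(cleaned.tolist())
print(cleaned.dtype)
```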

In [21]:
# vendor_id                     object
# pickup_datetime       datetime64[ns]
# dropoff_datetime      datetime64[ns]
# passenger_count                int64
# trip_distance                float64
# pickup_longitude             float64
# pickup_latitude              float64
# rate_code                      int64
# store_and_fwd_flag            object
# dropoff_longitude            float64
# dropoff_latitude             float64
# payment_type                  object
# fare_amount                  float64
# surcharge                    float64
# mta_tax                      float64
# tip_amount                   float64
# tolls_amount                 float64
# total_amount                 float64
# dtype: object
ddf.dtypes

vendor_name                      object
Trip_Pickup_DateTime     datetime64[ns]
Trip_Dropoff_DateTime    datetime64[ns]
Passenger_Count                   int64
Trip_Distance                   float64
Start_Lon                       float64
Start_Lat                       float64
Rate_Code                       float64
store_and_forward                object
End_Lon                         float64
End_Lat                         float64
Payment_Type                     object
Fare_Amt                        float64
surcharge                       float64
mta_tax                         float64
Tip_Amt                         float64
Tolls_Amt                       float64
Total_Amt                       float64
dtype: object

In [22]:
# rename columns to standardize schema
ddf = ddf.rename(
    columns={
        "vendor_name": "vendor_id",
        "Trip_Pickup_DateTime": "pickup_datetime",
        "Trip_Dropoff_DateTime": "dropoff_datetime",
        "Passenger_Count": "passenger_count",
        "Trip_Distance": "trip_distance",
        "Start_Lon": "pickup_longitude",
        "Start_Lat": "pickup_latitude",
        "Rate_Code": "rate_code",
        "store_and_forward": "store_and_fwd_flag",
        "End_Lon": "dropoff_longitude",
        "End_Lat": "dropoff_latitude",
        "Payment_Type": "payment_type",
        "Fare_Amt": "fare_amount",
        "surcharge": "surcharge",
        "mta_tax": "mta_tax",
        "Tip_Amt": "tip_amount",
        "Tolls_Amt": "tolls_amount",
        "Total_Amt": "total_amount",
    }
)
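
By default, `rename` ignores mapping keys that do not match an existing column, so a typo in the mapping passes silently. The sketch below checks a mapping against a column list before renaming; `check_rename_mapping` is a hypothetical helper (not part of Dask or pandas), and the column list and mapping are shortened and deliberately contain a typo for illustration.

```python
def check_rename_mapping(columns, mapping):
    """Return (mapping keys missing from columns, columns not covered by the mapping)."""
    missing_keys = sorted(set(mapping) - set(columns))
    unmapped_columns = sorted(set(columns) - set(mapping))
    return missing_keys, unmapped_columns


# Shortened, illustrative inputs; "Fare_Amount" is an intentional typo for "Fare_Amt".
columns = ["vendor_name", "Fare_Amt", "Tip_Amt"]
mapping = {
    "vendor_name": "vendor_id",
    "Fare_Amount": "fare_amount",
    "Tip_Amt": "tip_amount",
}

missing_keys, unmapped_columns = check_rename_mapping(columns, mapping)
print(missing_keys)      # keys that rename would silently ignore
print(unmapped_columns)  # columns that would keep their old name
```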

In [23]:
ddf.dtypes

vendor_id                     object
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
rate_code                    float64
store_and_fwd_flag            object
dropoff_longitude            float64
dropoff_latitude             float64
payment_type                  object
fare_amount                  float64
surcharge                    float64
mta_tax                      float64
tip_amount                   float64
tolls_amount                 float64
total_amount                 float64
dtype: object

In [24]:
len(ddf.dtypes)

18

In [25]:
ddf.head()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,VTS,2009-01-04 02:52:00,2009-01-04 03:02:00,1,2.63,-73.991957,40.721567,,,-73.993803,40.695922,CASH,8.9,0.5,,0.0,0.0,9.4
1,VTS,2009-01-04 03:31:00,2009-01-04 03:38:00,3,4.55,-73.982102,40.73629,,,-73.95585,40.76803,Credit,12.1,0.5,,2.0,0.0,14.6
2,VTS,2009-01-03 15:43:00,2009-01-03 15:57:00,5,10.35,-74.002587,40.739748,,,-73.869983,40.770225,Credit,23.7,0.0,,4.74,0.0,28.44
3,DDS,2009-01-01 20:52:58,2009-01-01 21:14:00,1,5.0,-73.974267,40.790955,,,-73.996558,40.731849,CREDIT,14.9,0.5,,3.05,0.0,18.45
4,DDS,2009-01-24 16:18:23,2009-01-24 16:24:56,1,0.4,-74.00158,40.719382,,,-74.008378,40.72035,CASH,3.7,0.0,,0.0,0.0,3.7


In [26]:
ddf.npartitions

478

In [27]:
ddf.known_divisions

False

In [28]:
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

KeyboardInterrupt: 

## 2009 data query

In [12]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009",
    engine="pyarrow",
)

In [13]:
ddf.head()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,VTS,2009-01-04 02:52:00,2009-01-04 03:02:00,1,2.63,-73.991957,40.721567,1,,-73.993803,40.695922,CASH,8.9,0.5,,0.0,0.0,9.4
1,VTS,2009-01-04 03:31:00,2009-01-04 03:38:00,3,4.55,-73.982102,40.73629,1,,-73.95585,40.76803,Credit,12.1,0.5,,2.0,0.0,14.6
2,VTS,2009-01-03 15:43:00,2009-01-03 15:57:00,5,10.35,-74.002587,40.739748,1,,-73.869983,40.770225,Credit,23.7,0.0,,4.74,0.0,28.44
3,DDS,2009-01-01 20:52:58,2009-01-01 21:14:00,1,5.0,-73.974267,40.790955,1,,-73.996558,40.731849,CREDIT,14.9,0.5,,3.05,0.0,18.45
4,DDS,2009-01-24 16:18:23,2009-01-24 16:24:56,1,0.4,-74.00158,40.719382,1,,-74.008378,40.72035,CASH,3.7,0.0,,0.0,0.0,3.7


In [14]:
dtypes_2009 = ddf.dtypes

In [15]:
dtypes_2009

vendor_id                     object
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
rate_code                      int64
store_and_fwd_flag            object
dropoff_longitude            float64
dropoff_latitude             float64
payment_type                  object
fare_amount                  float64
surcharge                    float64
mta_tax                      float64
tip_amount                   float64
tolls_amount                 float64
total_amount                 float64
dtype: object

In [38]:
len(ddf)

170896055

In [42]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/nyc-tlc/2009", 
    engine="pyarrow", 
    columns=["Fare_Amt"]
)

In [44]:
%%time
ddf.Fare_Amt.mean().compute()

CPU times: user 124 ms, sys: 7.65 ms, total: 131 ms
Wall time: 4.79 s


9.905162372589585

## 2010 data create

In [40]:
# read the 2010 CSVs, skipping malformed lines and pinning dtypes for two columns
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    on_bad_lines="skip",
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)
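
`on_bad_lines` is forwarded to pandas, where `"skip"` drops lines that have more fields than the header instead of raising a parser error. A small pandas sketch with an invented CSV string (the argument requires pandas 1.3 or later):

```python
import io

import pandas as pd

# Invented CSV: the second data line has one field too many.
raw = "vendor_id,fare_amount\nVTS,4.5\nDDS,15.3,unexpected\nCMT,5.3\n"

# Default behaviour: the malformed line raises a ParserError.
try:
    pd.read_csv(io.StringIO(raw))
    raised = False
except pd.errors.ParserError:
    raised = True

# With on_bad_lines="skip" the malformed line is dropped.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

print(raised)
print(df["vendor_id"].tolist())
```

Skipped lines are dropped without a record, so comparing row counts against the source can be worthwhile when this option is used.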

In [41]:
ddf.head()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,VTS,2010-01-26 07:41:00,2010-01-26 07:45:00,1,0.75,-73.956778,40.76775,1,,-73.965957,40.765232,CAS,4.5,0.0,0.5,0.0,0.0,5.0
1,DDS,2010-01-30 23:31:00,2010-01-30 23:46:12,1,5.9,-73.996118,40.763932,1,,-73.981512,40.741193,CAS,15.3,0.5,0.5,0.0,0.0,16.3
2,DDS,2010-01-18 20:22:20,2010-01-18 20:38:12,1,4.0,-73.979673,40.78379,1,,-73.917852,40.87856,CAS,11.7,0.5,0.5,0.0,0.0,12.7
3,VTS,2010-01-09 01:18:00,2010-01-09 01:35:00,2,4.7,-73.977922,40.763997,1,,-73.923908,40.759725,CAS,13.3,0.5,0.5,0.0,0.0,14.3
4,CMT,2010-01-18 19:10:14,2010-01-18 19:17:07,1,0.6,-73.990924,40.734682,1,0.0,-73.995511,40.739088,Cre,5.3,0.0,0.5,0.87,0.0,6.67


In [29]:
%%time
# calculate .describe() to check for dtype issues
ddf.describe().compute()

CPU times: user 1.76 s, sys: 173 ms, total: 1.94 s
Wall time: 1min 55s


Unnamed: 0,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
count,168994400.0,168994400.0,168994400.0,168994400.0,168994400.0,168994200.0,168994200.0,168994400.0,168994400.0,168994400.0,168994400.0,168994400.0,168994400.0
mean,1.674221,5.864681,-72.39051,39.88347,1.032463,-72.41661,39.89821,9.844589,0.3221727,0.495594,0.7605548,-0.07496016,11.35107
std,1.300666,5409.394,11.02328,7.053625,0.4236116,10.89876,7.027043,1664.848,0.3693486,0.1345498,173.5413,2336.194,2873.962
min,0.0,-21474830.0,-3509.015,-3579.139,0.0,-3579.139,-3538.432,-21474810.0,-1.0,-1.0,-1677720.0,-21474840.0,-21474830.0
25%,1.0,1.09,-73.99139,40.73673,1.0,-73.99071,40.73553,5.7,0.0,0.5,0.0,0.0,7.0
50%,1.0,1.8,-73.98101,40.75392,1.0,-73.97925,40.75437,8.1,0.5,0.5,0.0,0.0,9.54
75%,3.0,3.25,-73.96582,40.76844,1.0,-73.96257,40.76894,12.1,0.5,0.5,1.6,0.0,13.8
max,255.0,16201630.0,3569.931,3377.993,221.0,3443.651,3510.381,93960.07,615.78,1311.22,938.02,5510.07,93960.57
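
The summary above contains values that cannot describe real trips, such as latitudes far outside the range of -90 to 90 and large negative distances and fares. Below is a minimal sketch of flagging such rows on a made-up pandas frame; the rules (coordinate bounds, non-negative distance and fare) are assumptions chosen for illustration and are not checks this notebook performs.

```python
import pandas as pd

# Made-up rows; the second and third mimic the kinds of outliers seen in describe().
trips = pd.DataFrame(
    {
        "trip_distance": [1.8, -21474830.0, 3.25],
        "pickup_latitude": [40.75, 40.73, 3377.993],
        "pickup_longitude": [-73.98, -73.99, -73.96],
        "fare_amount": [8.1, 5.7, 12.1],
    }
)

# Assumed plausibility rules: valid coordinates and non-negative distance and fare.
plausible = (
    trips["pickup_latitude"].between(-90, 90)
    & trips["pickup_longitude"].between(-180, 180)
    & (trips["trip_distance"] >= 0)
    & (trips["fare_amount"] >= 0)
)

print(plausible.tolist())
print(int((~plausible).sum()), "implausible rows")
```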


In [45]:
# verify dtypes and column names are identical for 2009 and 2010
dtypes_2010 = ddf.dtypes
dtypes_2009 == dtypes_2010

vendor_id             True
pickup_datetime       True
dropoff_datetime      True
passenger_count       True
trip_distance         True
pickup_longitude      True
pickup_latitude       True
rate_code             True
store_and_fwd_flag    True
dropoff_longitude     True
dropoff_latitude      True
payment_type          True
fare_amount           True
surcharge             True
mta_tax               True
tip_amount            True
tolls_amount          True
total_amount          True
dtype: bool

In [47]:
# repartition and write 2010 data to Parquet 
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

KeyboardInterrupt: 

In [None]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2010", 
    engine="pyarrow"
)

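In [None]:
# check datatypes are stored correctly in Parquet file
# keep the dtypes read back from Parquet; later cells compare against actual_dtypes_2010
actual_dtypes_2010 = ddf.dtypes
dtypes_2010 == actual_dtypes_2010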

## 2011 data create

In [37]:
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2011-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    dtype={
        "tip_amount": "float64",
        "tolls_amount": "float64",
        # "vendor_name": "string[pyarrow]",
        # "Payment_Type": "string[pyarrow]",
    },
)

In [38]:
ddf.dtypes

vendor_id                     object
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
rate_code                      int64
store_and_fwd_flag            object
dropoff_longitude            float64
dropoff_latitude             float64
payment_type                  object
fare_amount                  float64
surcharge                    float64
mta_tax                      float64
tip_amount                   float64
tolls_amount                 float64
total_amount                 float64
dtype: object

In [39]:
# verify data types are identical to those in 2010 dataset
dtypes_2010 == ddf.dtypes

vendor_id             True
pickup_datetime       True
dropoff_datetime      True
passenger_count       True
trip_distance         True
pickup_longitude      True
pickup_latitude       True
rate_code             True
store_and_fwd_flag    True
dropoff_longitude     True
dropoff_latitude      True
payment_type          True
fare_amount           True
surcharge             True
mta_tax               True
tip_amount            True
tolls_amount          True
total_amount          True
dtype: bool

In [17]:
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2011",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

[None]

In [12]:
ddf = dd.read_parquet("s3://coiled-datasets/nyc-tlc/2011", engine="pyarrow")

In [13]:
actual_dtypes_2011 = ddf.dtypes

In [15]:
type(actual_dtypes_2011)

pandas.core.series.Series

In [21]:
actual_dtypes_2011

vendor_id                     object
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
rate_code                      int64
store_and_fwd_flag            object
dropoff_longitude            float64
dropoff_latitude             float64
payment_type                  object
fare_amount                  float64
surcharge                    float64
mta_tax                      float64
tip_amount                   float64
tolls_amount                 float64
total_amount                 float64
dtype: object

In [20]:
actual_dtypes_2011 == actual_dtypes_2010

vendor_id             True
pickup_datetime       True
dropoff_datetime      True
passenger_count       True
trip_distance         True
pickup_longitude      True
pickup_latitude       True
rate_code             True
store_and_fwd_flag    True
dropoff_longitude     True
dropoff_latitude      True
payment_type          True
fare_amount           True
surcharge             True
mta_tax               True
tip_amount            True
tolls_amount          True
total_amount          True
dtype: bool

## 2012 Data Create

In [22]:
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2012-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    dtype={
        "tip_amount": "float64",
        "tolls_amount": "float64",
    },
)

In [23]:
ddf.dtypes

vendor_id                     object
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
rate_code                      int64
store_and_fwd_flag            object
dropoff_longitude            float64
dropoff_latitude             float64
payment_type                  object
fare_amount                  float64
surcharge                    float64
mta_tax                      float64
tip_amount                   float64
tolls_amount                 float64
total_amount                 float64
dtype: object

In [24]:
ddf.dtypes == actual_dtypes_2010

vendor_id             True
pickup_datetime       True
dropoff_datetime      True
passenger_count       True
trip_distance         True
pickup_longitude      True
pickup_latitude       True
rate_code             True
store_and_fwd_flag    True
dropoff_longitude     True
dropoff_latitude      True
payment_type          True
fare_amount           True
surcharge             True
mta_tax               True
tip_amount            True
tolls_amount          True
total_amount          True
dtype: bool

In [25]:
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/nyc-tlc/2012",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

[None]

## 2013 Data Create

In [26]:
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2013-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    dtype={
        "tip_amount": "float64",
        "tolls_amount": "float64",
    },
)

In [27]:
ddf.dtypes == actual_dtypes_2010

vendor_id             True
pickup_datetime       True
dropoff_datetime      True
passenger_count       True
trip_distance         True
pickup_longitude      True
pickup_latitude       True
rate_code             True
store_and_fwd_flag    True
dropoff_longitude     True
dropoff_latitude      True
payment_type          True
fare_amount           True
surcharge             True
mta_tax               True
tip_amount            True
tolls_amount          True
total_amount          True
dtype: bool

In [28]:
len(ddf.dtypes)

18

In [29]:
ddf.dtypes

vendor_id                     object
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
trip_distance                float64
pickup_longitude             float64
pickup_latitude              float64
rate_code                      int64
store_and_fwd_flag            object
dropoff_longitude            float64
dropoff_latitude             float64
payment_type                  object
fare_amount                  float64
surcharge                    float64
mta_tax                      float64
tip_amount                   float64
tolls_amount                 float64
total_amount                 float64
dtype: object

In [30]:
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/nyc-tlc/2013",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

[None]

## 2019 Data Create

In [4]:
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "RatecodeID": "float64",
        "VendorID": "float64",
        "passenger_count": "float64",
        "payment_type": "object",
    },
)

In [6]:
ddf.dtypes

VendorID                        float64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                     object
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object

In [16]:
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/nyc-tlc/2019",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)

[None]

In [7]:
ddf.repartition(partition_size="100MB").to_parquet(
    "s3://coiled-datasets/nyc-tlc-with-metadata/2019",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=True,
)

[None]

## 2020 Data

In [17]:
ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2020-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "RatecodeID": "float64",
        "VendorID": "float64",
        "passenger_count": "float64",
        "payment_type": "object",
    },
)

In [18]:
ddf.dtypes

VendorID                        float64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                     object
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                      int64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object
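
In the 2019 and 2020 listings above, `tolls_amount` is `float64` in one year and `int64` in the other. Comparing two dtype Series with `==` works only when both carry the same labels; pandas raises a `ValueError` otherwise. The sketch below reports differences in either case; `dtype_mismatches` is a hypothetical helper, and the two Series are shortened stand-ins for `ddf.dtypes`.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Return the columns whose dtype differs or that exist on one side only."""
    # Align on column name; a column missing on one side becomes NaN.
    both = pd.concat(
        [left.astype(str), right.astype(str)], axis=1, keys=["left", "right"]
    )
    # NaN never equals a dtype name, so one-sided columns are reported too.
    return both[both["left"] != both["right"]]


# Shortened stand-ins for the ddf.dtypes of two different years.
dtypes_a = pd.Series({"fare_amount": "float64", "tolls_amount": "float64"})
dtypes_b = pd.Series(
    {"fare_amount": "float64", "tolls_amount": "int64", "extra": "float64"}
)

diff = dtype_mismatches(dtypes_a, dtypes_b)
print(diff)
```

An empty result means the two schemas agree, which is the precondition for combining years with `dd.concat`.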

## Data queries

In [86]:
ddf2009 = dd.read_parquet(
    "s3://coiled-datasets/nyc-tlc/2009",
    engine="pyarrow",
)

In [87]:
len(ddf2009)

OSError: [Errno 22] Bad Request

In [80]:
ddf2010 = dd.read_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
)

In [81]:
ddf2011 = dd.read_parquet(
    "s3://coiled-datasets/nyc-tlc/2011",
    engine="pyarrow",
)

In [82]:
ddf2012 = dd.read_parquet(
    "s3://coiled-datasets/nyc-tlc/2012",
    engine="pyarrow",
)

In [83]:
ddf2013 = dd.read_parquet(
    "s3://coiled-datasets/nyc-tlc/2013",
    engine="pyarrow",
)

In [84]:
ddf = dd.concat([ddf2009, ddf2010, ddf2011, ddf2012, ddf2013])

In [85]:
len(ddf)

PermissionError: The provided token has expired.