### Example 4 - Map of chlorophyll content in the North West Atlantic - Dask approach

This example shows you how to read and manipulate Argo profile data stored in parquet format on your machine, using [dask](https://docs.dask.org/en/stable/) to handle larger-than-memory data.

The data are stored across multiple files: we will load into memory only what we need by applying some filters, and we will create a map showing the chlorophyll content in the North West Atlantic.

##### Note on performance

I have developed this example on WHOI's HPC cluster. If you have access to an HPC machine, I recommend using it. If not, filter more aggressively or skip the parts where a lot of data is loaded (the first example loads ~45 GB).

Because I access the data on a cluster, read times depend on my network connection as well as on the file format.

##### Note on parquet files

The original netCDF Argo files have been converted to parquet format, which provides faster read operations.

There are a couple of ways to read parquet files in Python. In this example we will use Dask. Dask uses lazy evaluation: it first builds a graph of the computations it needs to carry out, then optimizes the graph and computes only what is necessary. This means that when we call a function on a dataframe (e.g. `mean()`), a lazy object is created instead of a result. To execute the computation, we need to call the `compute()` method of the object we created.

To access data with pyarrow and pandas see Example 1 notebook. Note that Dask, too, uses pyarrow behind the scenes, and that dask dataframes are almost identical to pandas'.

Generally speaking, you'll want to use Dask only if you need to operate on a large amount of data so that you can benefit from its parallelization capabilities. You should avoid Dask whenever the data fits in your RAM.

#### Getting started

We start by importing the necessary modules and setting the path and filenames of the parquet files.  For a list of modules that you need to install, you can look at the [README.md file in the repository](https://github.com/boom-lab/nc2parquet).

We also provide the schema (e.g. column names and data types) that pyarrow will need to read the parquet database. This speeds up read operations, as pyarrow does not need to guess the schema from the files.

In [1]:
from datetime import datetime
import xarray as xr
import pyarrow as pa
import pyarrow.parquet as pq
from pprint import pprint
import numpy as np

import dask
import dask.dataframe as dd

# Paths on Poseidon cluster
parquet_dir = '/vortexfs1/home/enrico.milanese/projects/ARGO/nc2parquet/data_test/parquet/bgc'

# Setting up parquet schema
schema_path = '../schemas/ArgoBGC_schema.metadata'
BGC_schema = pq.read_schema(schema_path)

We now set up our filter to read only the data from the NWA (here, latitude between 34$^\circ$ and 80$^\circ$, longitude between -78$^\circ$ and -50$^\circ$).

In [2]:
filter_coords = [("LATITUDE",">",34), ("LATITUDE","<",80),
                 ("LONGITUDE",">",-78), ("LONGITUDE","<",-50)]

In [3]:
%%time
ddf = dd.read_parquet(
    parquet_dir,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True} ,
    schema=BGC_schema,
    filters = filter_coords
    )

CPU times: user 100 ms, sys: 49.2 ms, total: 150 ms
Wall time: 177 ms


You probably noticed how fast this `read_parquet()` call was. This is because `ddf` is a lazy object: it stores the graph of operations it needs to execute, but it has not executed them yet. If we display `ddf`, we get the basic information (columns and data types), but the dataframe otherwise looks empty.

In [4]:
ddf

Unnamed: 0_level_0,PRES_ADJUSTED_QC,TEMP_ADJUSTED_QC,DOXY,PLATFORM_NUMBER,PSAL_dPRES,DOXY_ADJUSTED_ERROR,TEMP,LATITUDE,TEMP_dPRES,PSAL_ADJUSTED_ERROR,DOXY_ADJUSTED,CYCLE_NUMBER,PRES_QC,PRES_ADJUSTED_ERROR,TEMP_ADJUSTED_ERROR,PSAL_ADJUSTED,PSAL_ADJUSTED_QC,DOXY_ADJUSTED_QC,DOXY_QC,PSAL_QC,JULD,TEMP_QC,LONGITUDE,PSAL,TEMP_ADJUSTED,PRES,PRES_ADJUSTED,DOXY_dPRES,N_PROF,N_LEVELS,CHLA_QC,CHLA_ADJUSTED_QC,BBP700_ADJUSTED,NITRATE_ADJUSTED,NITRATE_dPRES,CHLA_dPRES,PH_IN_SITU_TOTAL_ADJUSTED_QC,NITRATE_QC,NITRATE_ADJUSTED_ERROR,PH_IN_SITU_TOTAL_dPRES,NITRATE_ADJUSTED_QC,NITRATE,BBP700_ADJUSTED_ERROR,PH_IN_SITU_TOTAL_QC,PH_IN_SITU_TOTAL,PH_IN_SITU_TOTAL_ADJUSTED,CHLA,BBP700_QC,CHLA_ADJUSTED,CHLA_ADJUSTED_ERROR,PH_IN_SITU_TOTAL_ADJUSTED_ERROR,BBP700,BBP700_ADJUSTED_QC,BBP700_dPRES,CDOM_dPRES,DOWN_IRRADIANCE380_ADJUSTED_ERROR,DOWNWELLING_PAR_QC,CDOM_ADJUSTED_QC,DOWN_IRRADIANCE412_ADJUSTED_ERROR,CDOM,DOWNWELLING_PAR_ADJUSTED_ERROR,DOWNWELLING_PAR,DOWN_IRRADIANCE412_ADJUSTED_QC,DOWN_IRRADIANCE412_ADJUSTED,DOWN_IRRADIANCE380_dPRES,DOWNWELLING_PAR_dPRES,CDOM_QC,DOWN_IRRADIANCE490_QC,DOWN_IRRADIANCE490_dPRES,DOWN_IRRADIANCE490,DOWN_IRRADIANCE490_ADJUSTED_ERROR,CDOM_ADJUSTED,DOWN_IRRADIANCE380_ADJUSTED_QC,CDOM_ADJUSTED_ERROR,DOWN_IRRADIANCE490_ADJUSTED,DOWNWELLING_PAR_ADJUSTED_QC,DOWN_IRRADIANCE490_ADJUSTED_QC,DOWN_IRRADIANCE380_ADJUSTED,DOWN_IRRADIANCE380,DOWN_IRRADIANCE412,DOWNWELLING_PAR_ADJUSTED,DOWN_IRRADIANCE412_QC,DOWN_IRRADIANCE412_dPRES,DOWN_IRRADIANCE380_QC,BBP532_ADJUSTED,BBP532_QC,BBP532,BBP532_dPRES,BBP532_ADJUSTED_QC,BBP532_ADJUSTED_ERROR,DOWN_IRRADIANCE443,DOWN_IRRADIANCE443_ADJUSTED_QC,DOWN_IRRADIANCE443_ADJUSTED_ERROR,DOWN_IRRADIANCE443_dPRES,DOWN_IRRADIANCE443_ADJUSTED,DOWN_IRRADIANCE443_QC,DOWN_IRRADIANCE555_QC,DOWN_IRRADIANCE555,DOWN_IRRADIANCE555_ADJUSTED,DOWN_IRRADIANCE555_dPRES,DOWN_IRRADIANCE555_ADJUSTED_QC,BBP470,CP660_ADJUSTED_QC,CP660_QC,BBP470_QC,BBP470_dPRES,BBP470_ADJUSTED_QC,CP660,BBP470_ADJUSTED,CP660_ADJUSTED,BBP470_ADJUSTED_ERROR,CP660_ADJUSTED_ERROR,DOWN_IRRADIANCE555_ADJUSTED_ERROR,CP660_dPRES,BISULFIDE_ADJUSTED,BISULFIDE_dPRES,BISULFIDE_ADJUSTED_QC,BISULFIDE_ADJUSTED_ERROR,BISULFIDE_QC,BISULFIDE,TURBIDITY_dPRES,TURBIDITY_ADJUSTED,TURBIDITY,TURBIDITY_ADJUSTED_QC,TURBIDITY_ADJUSTED_ERROR,TURBIDITY_QC
npartitions=48,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1
,Int64,Int64,float32,Int64,float32,float32,float32,float64,float32,float32,float32,Int64,Int64,float32,float32,float32,Int64,float64,float64,Int64,datetime64[ns],Int64,float64,float32,float32,float32,float32,float32,Int64,Int64,float64,float64,float32,float32,float32,float32,float64,float64,float32,float32,float64,float32,float32,float64,float32,float32,float32,float64,float32,float32,float32,float32,float64,float32,float32,float32,float64,float64,float32,float32,float32,float32,float64,float32,float32,float32,float64,float64,float32,float32,float32,float32,float64,float32,float32,float64,float64,float32,float32,float32,float32,float64,float32,float64,float32,float64,float32,float32,float64,float32,float32,float64,float32,float32,float32,float64,float64,float32,float32,float32,float64,float32,float64,float64,float64,float32,float64,float32,float32,float32,float32,float32,float32,float32,float32,float32,float64,float32,float64,float32,float32,float32,float32,float64,float32,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


We can get a sneak peek of the data with the method `head()`, which loads the first 5 rows of the dataframe:

In [5]:
%%time
ddf.head()

CPU times: user 267 ms, sys: 154 ms, total: 421 ms
Wall time: 1.57 s


Unnamed: 0,PRES_ADJUSTED_QC,TEMP_ADJUSTED_QC,DOXY,PLATFORM_NUMBER,PSAL_dPRES,DOXY_ADJUSTED_ERROR,TEMP,LATITUDE,TEMP_dPRES,PSAL_ADJUSTED_ERROR,...,BISULFIDE_ADJUSTED_QC,BISULFIDE_ADJUSTED_ERROR,BISULFIDE_QC,BISULFIDE,TURBIDITY_dPRES,TURBIDITY_ADJUSTED,TURBIDITY,TURBIDITY_ADJUSTED_QC,TURBIDITY_ADJUSTED_ERROR,TURBIDITY_QC
0,1,1,,1901378,0.0,,26.016001,34.025,0.0,0.01,...,,,,,,,,,,
1,1,1,,1901378,0.0,,26.017,34.025,0.0,0.01,...,,,,,,,,,,
2,1,8,188.885849,1901378,0.2,4.907148,26.017,34.025,0.2,0.01,...,,,,,,,,,,
3,1,1,,1901378,0.0,,26.017,34.025,0.0,0.01,...,,,,,,,,,,
4,1,1,,1901378,0.0,,26.017,34.025,0.0,0.01,...,,,,,,,,,,


This was still really fast, about a second and a half! Let's now load the whole dataframe into memory with the `compute()` method.

In [6]:
%%time
ddf = ddf.compute()
ddf

CPU times: user 8.2 s, sys: 24.9 s, total: 33.1 s
Wall time: 2.87 s


Unnamed: 0,PRES_ADJUSTED_QC,TEMP_ADJUSTED_QC,DOXY,PLATFORM_NUMBER,PSAL_dPRES,DOXY_ADJUSTED_ERROR,TEMP,LATITUDE,TEMP_dPRES,PSAL_ADJUSTED_ERROR,...,BISULFIDE_ADJUSTED_QC,BISULFIDE_ADJUSTED_ERROR,BISULFIDE_QC,BISULFIDE,TURBIDITY_dPRES,TURBIDITY_ADJUSTED,TURBIDITY,TURBIDITY_ADJUSTED_QC,TURBIDITY_ADJUSTED_ERROR,TURBIDITY_QC
0,1,1,,1901378,0.0,,26.016001,34.02500,0.0,0.01,...,,,,,,,,,,
1,1,1,,1901378,0.0,,26.017000,34.02500,0.0,0.01,...,,,,,,,,,,
2,1,8,188.885849,1901378,0.2,4.907148,26.017000,34.02500,0.2,0.01,...,,,,,,,,,,
3,1,1,,1901378,0.0,,26.017000,34.02500,0.0,0.01,...,,,,,,,,,,
4,1,1,,1901378,0.0,,26.017000,34.02500,0.0,0.01,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33093,0,0,,3901669,,,,52.46687,,,...,,,,,,,,,,
33094,0,0,,3901669,,,,52.46687,,,...,,,,,,,,,,
33095,0,0,,3901669,,,,52.46687,,,...,,,,,,,,,,
33096,0,0,,3901669,,,,52.46687,,,...,,,,,,,,,,


It took a few seconds to load the whole dataframe, much faster than using pyarrow as we did in Example 1!

In [11]:
%%time
from datetime import datetime, timedelta
t0 = datetime(2023, 7, 1)
t1 = datetime(2023, 10, 31)

ref_var = 'CHLA_ADJUSTED'
cols = [ref_var,"LATITUDE","LONGITUDE","PRES_ADJUSTED","JULD"]
filter_coords_time_pres = [("LATITUDE",">",34), ("LATITUDE","<",80),
                           ("LONGITUDE",">",-78), ("LONGITUDE","<",-50),
                           ("JULD",">=",t0),("JULD","<=",t1),
                           ("PRES_ADJUSTED",">=",0),("PRES_ADJUSTED","<=",50),
                           (ref_var,">=",-1e30),(ref_var,"<=",+1e30),
                           (ref_var+"_QC",">=",1.0),(ref_var+"_QC","<=",2.0)]

ddf = dd.read_parquet(
    parquet_dir,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True} ,
    schema=BGC_schema,
    filters=filter_coords_time_pres,
    columns= cols
    )

CPU times: user 118 ms, sys: 3.61 ms, total: 122 ms
Wall time: 123 ms


In [12]:
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
from matplotlib import colormaps

# Group by 'LATITUDE' and 'LONGITUDE', and average 'PRES_ADJUSTED' and 'CHLA_ADJUSTED'.
# 'JULD' is not aggregated: dask's groupby().agg() does not accept plain lambdas
# (it raises "ValueError: unknown aggregate lambda"), and the time values are
# not used in the plot below.
grouped = ddf.groupby(['LATITUDE', 'LONGITUDE']).agg({
    'PRES_ADJUSTED': 'mean',  # Take the mean depth
    ref_var: 'mean'  # Take the mean intensity
}).reset_index().compute()

# Plotting using Cartopy
plt.figure(figsize=(10, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

# Scatter plot
cbar_min = ddf[ref_var].quantile(q=0.1).compute()
cbar_max = ddf[ref_var].quantile(q=0.9).compute()
plt.scatter(grouped['LONGITUDE'], grouped['LATITUDE'], c=grouped[ref_var], vmin=cbar_min, vmax=cbar_max, cmap='cividis', transform=ccrs.PlateCarree())
plt.colorbar(label='Average ' + ref_var)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('North-West Atlantic average ' + ref_var)
plt.grid(True)
plt.xlim([-78, -45])
plt.ylim([30, 50])
plt.show()

Now we time how long it takes to load the filtered data with each partitioning scheme.

### pyarrow only

We start by using only pyarrow. While dask will likely improve the performance, we first want to see how pyarrow performs on its own. Note that pyarrow is the same engine used by dask, and that it supports multi-threaded column reads natively and by default.

#### 100 MB in-memory partitions

In [3]:
%%time
argo_ds = pq.ParquetDataset(
    pqt_100, 
    schema=PHY_schema,
    filters=filter_to_apply
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 12.8 s, sys: 2.67 s, total: 15.5 s
Wall time: 3.73 s


#### 300 MB in-memory partitions

In [4]:
%%time
argo_ds = pq.ParquetDataset(
    pqt_300, 
    schema=PHY_schema,
    filters=filter_to_apply
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 17.1 s, sys: 2.67 s, total: 19.8 s
Wall time: 1.93 s


#### YYYY-MM on-disk partitions (filtering on JULD)

In [5]:
%%time
argo_ds = pq.ParquetDataset(
    pqt_juld, 
    schema=PHY_schema,
    filters=filter_to_apply
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 8min 2s, sys: 1min 18s, total: 9min 20s
Wall time: 6min 45s


#### YYYY-MM on-disk partitions (filtering on partitioned parameter JULD_D)

In [6]:
%%time
filter_to_apply_D = [("JULD_D",">=",time0),("JULD_D","<",time1),
                      ("LATITUDE",">=",lat0), ("LATITUDE","<=",lat1),
                      ("LONGITUDE",">=",lon0), ("LONGITUDE","<=",lon1),
                      ("PRES_ADJUSTED",">=",0),("PRES_ADJUSTED","<=",50),
                      (ref_var,">=",-1e30),(ref_var,"<=",+1e30)]
argo_ds = pq.ParquetDataset(
    pqt_juld, 
    schema=PHY_schema,
    filters=filter_to_apply_D
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 48.3 s, sys: 25.2 s, total: 1min 13s
Wall time: 2min 54s


### pyarrow+dask

We now repeat the same reads with dask dataframes. As noted earlier, dask uses pyarrow as its engine.

In [7]:
import dask
import dask.dataframe as dd

#### 100 MB in-memory partitions

In [8]:
%%time
ddf = dd.read_parquet(
    pqt_100,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True} ,
    schema=PHY_schema,
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 9.06 s, sys: 3.42 s, total: 12.5 s
Wall time: 2.4 s


Unnamed: 0_level_0,TEMP_ADJUSTED,LATITUDE,LONGITUDE,PRES_ADJUSTED
npartitions=287,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float32,float64,float64,float32
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


#### 300 MB in-memory partitions

In [9]:
%%time
ddf = dd.read_parquet(
    pqt_300,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 12.1 s, sys: 6.06 s, total: 18.1 s
Wall time: 1.82 s


Unnamed: 0_level_0,TEMP_ADJUSTED,LATITUDE,LONGITUDE,PRES_ADJUSTED
npartitions=184,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float32,float64,float64,float32
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


#### YYYY-MM on-disk partitions (JULD_D)

In [10]:
%%time
ddf = dd.read_parquet(
    pqt_juld,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply_D
    )
ddf.persist()

ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (string, timestamp[us])

### pyarrow+dask cluster

Finally, we repeat the reads with dask running on a local distributed cluster.

In [17]:
from dask.distributed import Client
client = Client(n_workers=10, threads_per_worker=10, processes=True, memory_limit='auto')
client

0,1
Connection method: Cluster object,Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status,

0,1
Dashboard: http://127.0.0.1:8787/status,Workers: 10
Total threads: 100,Total memory: 271.27 GiB
Status: running,Using processes: True

0,1
Comm: tcp://127.0.0.1:41829,Workers: 10
Dashboard: http://127.0.0.1:8787/status,Total threads: 100
Started: Just now,Total memory: 271.27 GiB

0,1
Comm: tcp://127.0.0.1:36951,Total threads: 10
Dashboard: http://127.0.0.1:45728/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:45057,
Local directory: /tmp/dask-scratch-space/worker-pdmo8x6o,Local directory: /tmp/dask-scratch-space/worker-pdmo8x6o

0,1
Comm: tcp://127.0.0.1:39923,Total threads: 10
Dashboard: http://127.0.0.1:38399/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:34358,
Local directory: /tmp/dask-scratch-space/worker-7pyrh80j,Local directory: /tmp/dask-scratch-space/worker-7pyrh80j

0,1
Comm: tcp://127.0.0.1:36189,Total threads: 10
Dashboard: http://127.0.0.1:37029/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:45548,
Local directory: /tmp/dask-scratch-space/worker-gtt7_pnj,Local directory: /tmp/dask-scratch-space/worker-gtt7_pnj

0,1
Comm: tcp://127.0.0.1:36177,Total threads: 10
Dashboard: http://127.0.0.1:32889/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:45108,
Local directory: /tmp/dask-scratch-space/worker-i8lcypwn,Local directory: /tmp/dask-scratch-space/worker-i8lcypwn

0,1
Comm: tcp://127.0.0.1:33037,Total threads: 10
Dashboard: http://127.0.0.1:41220/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:43500,
Local directory: /tmp/dask-scratch-space/worker-6npu0don,Local directory: /tmp/dask-scratch-space/worker-6npu0don

0,1
Comm: tcp://127.0.0.1:37455,Total threads: 10
Dashboard: http://127.0.0.1:44008/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:44119,
Local directory: /tmp/dask-scratch-space/worker-p7a95l9m,Local directory: /tmp/dask-scratch-space/worker-p7a95l9m

0,1
Comm: tcp://127.0.0.1:46391,Total threads: 10
Dashboard: http://127.0.0.1:34447/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:41901,
Local directory: /tmp/dask-scratch-space/worker-ecpi9km9,Local directory: /tmp/dask-scratch-space/worker-ecpi9km9

0,1
Comm: tcp://127.0.0.1:41003,Total threads: 10
Dashboard: http://127.0.0.1:33902/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:39356,
Local directory: /tmp/dask-scratch-space/worker-qyfgcc7w,Local directory: /tmp/dask-scratch-space/worker-qyfgcc7w

0,1
Comm: tcp://127.0.0.1:35997,Total threads: 10
Dashboard: http://127.0.0.1:36859/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:34234,
Local directory: /tmp/dask-scratch-space/worker-4howf87x,Local directory: /tmp/dask-scratch-space/worker-4howf87x

0,1
Comm: tcp://127.0.0.1:34726,Total threads: 10
Dashboard: http://127.0.0.1:41676/status,Memory: 27.13 GiB
Nanny: tcp://127.0.0.1:34242,
Local directory: /tmp/dask-scratch-space/worker-bt2rcph1,Local directory: /tmp/dask-scratch-space/worker-bt2rcph1


#### 100 MB in-memory partitions

In [18]:
%%time
ddf = dd.read_parquet(
    pqt_100,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True} ,
    schema=PHY_schema,
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 155 ms, sys: 30.9 ms, total: 186 ms
Wall time: 171 ms


Unnamed: 0_level_0,TEMP_ADJUSTED,LATITUDE,LONGITUDE,PRES_ADJUSTED
npartitions=287,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float32,float64,float64,float32
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


#### 300 MB in-memory partitions

In [19]:
%%time
ddf = dd.read_parquet(
    pqt_300,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 145 ms, sys: 28.5 ms, total: 174 ms
Wall time: 166 ms


Unnamed: 0_level_0,TEMP_ADJUSTED,LATITUDE,LONGITUDE,PRES_ADJUSTED
npartitions=184,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float32,float64,float64,float32
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


#### YYYY-MM on-disk partitions (on JULD)

In [20]:
%%time
ddf = dd.read_parquet(
    pqt_juld,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 45 s, sys: 35.7 s, total: 1min 20s
Wall time: 5min 15s


Unnamed: 0_level_0,TEMP_ADJUSTED,LATITUDE,LONGITUDE,PRES_ADJUSTED
npartitions=54,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float32,float64,float64,float32
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


#### YYYY-MM on-disk partitions (on JULD_D)

In [21]:
%%time
ddf = dd.read_parquet(
    pqt_juld,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply_D
    )
ddf.persist()

ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (string, timestamp[us])

In [22]:
client.close()