# Read test - Map of Argo and GLODAP temperature measurements in the North West Atlantic (Poseidon version)

This notebook tests some basic filtering on three different partitioning schemes of the Argo Core parquet database, namely:
* reorganizing the dataframes so that each takes up around 100 MB in memory (the minimum recommended by dask);
* reorganizing the dataframes so that each takes up around 300 MB in memory (the maximum recommended by dask);
* saving to disk so that the data is split by year and month, as users will most likely be interested in a specific time range.

The data are stored across multiple files: we will load into memory only what we need by applying some filters, and we will create a map showing the temperature measurements in the North West Atlantic.

##### Note on Poseidon

In this example we will access data stored on WHOI's **Poseidon cluster**. Reading data from WHOI's Amazon S3 data lake is slightly different, and we refer you to the dedicated examples (manipulating the data once it is loaded into memory does not change).

NB: to use this example you need to have access to WHOI's VPN or network, **and** to Boom lab's shared storage at `/vortexfs1/share/boom`. The notebook should also be executed from Poseidon.

#### Getting started

We first load all the modules we need, and define the geographical coordinates that limit the area we are interested in.

In [1]:
from datetime import datetime
import xarray as xr
import pyarrow as pa
import pyarrow.parquet as pq
from pprint import pprint
import numpy as np

# Paths on Poseidon cluster
pqt_dir = '/vortexfs1/share/boom/data/nc2pqt_test/pqt2/'

pqt_100 = pqt_dir + 'partition100MB/'
pqt_300 = pqt_dir + 'partition300MB/'
pqt_juld = pqt_dir + 'partitionYYYYMM/'

lat0 = 34
lat1 = 80
lon0 = -78
lon1 = -50

# pre-load schema
schema_path = "/vortexfs1/share/boom/data/nc2pqt_test/pqt/data/metadata/ArgoPHY_schema.metadata"
PHY_schema = pq.read_schema(schema_path)
todrop = ["DOXY","DOXY_ADJUSTED","DOXY_ADJUSTED_QC","DOXY_ADJUSTED_ERROR","DOXY_QC"]
for name in todrop:
    idx = PHY_schema.get_field_index(name)
    PHY_schema = PHY_schema.remove(idx)

PHY_schema = PHY_schema.append(
    pa.field('JULD_D', 
             pa.from_numpy_dtype(np.dtype('datetime64[ns]'))
            )
)

## Timing tests

The geographical coordinates are stored in the variables 'LATITUDE' and 'LONGITUDE'. We then generate the filter, whose syntax is `[[(column, op, val), …],…]`, where `column` is the variable name and `val` is the value that the operator `op` compares against; `op` accepts `[==, =, >, >=, <, <=, !=, in, not in]`. Similarly, we also filter by time through 'JULD' and by depth through the pressure values in 'PRES_ADJUSTED', to restrict our selection to the first 50m of the ocean.

Let's set up the filters first:

In [2]:
from datetime import datetime, timedelta
time0 = datetime.utcnow() - timedelta(days=365)
time1 = time0 + timedelta(days=90)

ref_var = 'TEMP_ADJUSTED'
cols = [ref_var,"LATITUDE","LONGITUDE","PRES_ADJUSTED"]
filter_to_apply = [("JULD",">=",time0),("JULD","<",time1),
                      ("LATITUDE",">=",lat0), ("LATITUDE","<=",lat1),
                      ("LONGITUDE",">=",lon0), ("LONGITUDE","<=",lon1),
                      ("PRES_ADJUSTED",">=",0),("PRES_ADJUSTED","<=",50),
                      (ref_var,">=",-1e30),(ref_var,"<=",+1e30)]

Now we time how long it takes to load the filtered data with each different partitioning scheme.
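The cells below rely on the `%%time` magic, which is only available in IPython and Jupyter. As a rough equivalent for a plain Python script, the sketch below measures both wall time and the CPU time of the current process; the helper name `timed` is hypothetical. When a read is multi-threaded, the CPU time can be larger than the wall time, as in the outputs reported in this notebook. CPU time spent in separate worker processes is not counted by this helper.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()  # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return result, wall, cpu


# Usage on a trivial workload; replace `sum` with the read to be timed
result, wall, cpu = timed(sum, range(1_000_000))
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```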

### pyarrow only

We start using only pyarrow. While dask will likely improve the performance, we first want to see how pyarrow performs on its own. Note that pyarrow is the same engine used by dask, and that it supports multi-threaded column reads natively and by default.

#### 100 MB in-memory partitions

In [3]:
%%time
argo_ds = pq.ParquetDataset(
    pqt_100, 
    schema=PHY_schema,
    filters=filter_to_apply
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 12.8 s, sys: 2.67 s, total: 15.5 s
Wall time: 3.73 s


#### 300 MB in-memory partitions

In [4]:
%%time
argo_ds = pq.ParquetDataset(
    pqt_300, 
    schema=PHY_schema,
    filters=filter_to_apply
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 17.1 s, sys: 2.67 s, total: 19.8 s
Wall time: 1.93 s


#### YYYY-MM on-disk partitions (filtering on JULD)

In [5]:
%%time
argo_ds = pq.ParquetDataset(
    pqt_juld, 
    schema=PHY_schema,
    filters=filter_to_apply
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 8min 2s, sys: 1min 18s, total: 9min 20s
Wall time: 6min 45s


#### YYYY-MM on-disk partitions (filtering on partitioned parameter JULD_D)

In [6]:
%%time
filter_to_apply_D = [("JULD_D",">=",time0),("JULD_D","<",time1),
                      ("LATITUDE",">=",lat0), ("LATITUDE","<=",lat1),
                      ("LONGITUDE",">=",lon0), ("LONGITUDE","<=",lon1),
                      ("PRES_ADJUSTED",">=",0),("PRES_ADJUSTED","<=",50),
                      (ref_var,">=",-1e30),(ref_var,"<=",+1e30)]
argo_ds = pq.ParquetDataset(
    pqt_juld, 
    schema=PHY_schema,
    filters=filter_to_apply_D
)
argo_df = argo_ds.read(columns=cols).to_pandas()

CPU times: user 48.3 s, sys: 25.2 s, total: 1min 13s
Wall time: 2min 54s


### pyarrow+dask

We now repeat the same tests reading the data with dask, which uses pyarrow as its engine.

In [7]:
import dask
import dask.dataframe as dd

#### 100 MB in-memory partitions

In [8]:
%%time
ddf = dd.read_parquet(
    pqt_100,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True} ,
    schema=PHY_schema,
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 9.06 s, sys: 3.42 s, total: 12.5 s
Wall time: 2.4 s


| npartitions=287 | TEMP_ADJUSTED | LATITUDE | LONGITUDE | PRES_ADJUSTED |
|---|---|---|---|---|
| | float32 | float64 | float64 | float32 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


#### 300 MB in-memory partitions

In [9]:
%%time
ddf = dd.read_parquet(
    pqt_300,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 12.1 s, sys: 6.06 s, total: 18.1 s
Wall time: 1.82 s


| npartitions=184 | TEMP_ADJUSTED | LATITUDE | LONGITUDE | PRES_ADJUSTED |
|---|---|---|---|---|
| | float32 | float64 | float64 | float32 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


#### YYYY-MM on-disk partitions (JULD_D)

In [10]:
%%time
ddf = dd.read_parquet(
    pqt_juld,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply_D
    )
ddf.persist()

ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (string, timestamp[us])

### pyarrow+dask cluster

Finally, we repeat the tests on a local dask distributed cluster, which spreads the work across several worker processes. Note that with a distributed client `persist()` returns as soon as the tasks have been submitted and the computation continues in the background, so the wall times reported below may not include the time needed to actually load the data.

In [17]:
from dask.distributed import Client
client = Client(n_workers=10, threads_per_worker=10, processes=True, memory_limit='auto')
client

Client connected through a Cluster object (cluster type: distributed.LocalCluster), dashboard at http://127.0.0.1:8787/status.

| Status | Using processes | Workers | Threads per worker | Total threads | Memory per worker | Total memory |
|---|---|---|---|---|---|---|
| running | True | 10 | 10 | 100 | 27.13 GiB | 271.27 GiB |


#### 100 MB in-memory partitions

In [18]:
%%time
ddf = dd.read_parquet(
    pqt_100,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True} ,
    schema=PHY_schema,
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 155 ms, sys: 30.9 ms, total: 186 ms
Wall time: 171 ms


| npartitions=287 | TEMP_ADJUSTED | LATITUDE | LONGITUDE | PRES_ADJUSTED |
|---|---|---|---|---|
| | float32 | float64 | float64 | float32 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


#### 300 MB in-memory partitions

In [19]:
%%time
ddf = dd.read_parquet(
    pqt_300,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 145 ms, sys: 28.5 ms, total: 174 ms
Wall time: 166 ms


| npartitions=184 | TEMP_ADJUSTED | LATITUDE | LONGITUDE | PRES_ADJUSTED |
|---|---|---|---|---|
| | float32 | float64 | float64 | float32 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


#### YYYY-MM on-disk partitions (on JULD)

In [20]:
%%time
ddf = dd.read_parquet(
    pqt_juld,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply
    )
ddf.persist()

CPU times: user 45 s, sys: 35.7 s, total: 1min 20s
Wall time: 5min 15s


| npartitions=54 | TEMP_ADJUSTED | LATITUDE | LONGITUDE | PRES_ADJUSTED |
|---|---|---|---|---|
| | float32 | float64 | float64 | float32 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


#### YYYY-MM on-disk partitions (on JULD_D)

In [21]:
%%time
ddf = dd.read_parquet(
    pqt_juld,
    engine="pyarrow",
    storage_options={"anon": True, "use_ssl": True},
    columns = cols,
    filters = filter_to_apply_D
    )
ddf.persist()

ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (string, timestamp[us])

In [22]:
client.close()