### Example 3 - Temperature-Salinity profiles in the North West Atlantic

This example shows how to read and manipulate QCed Argo observations stored in parquet format. The data are stored across multiple files: we load into memory only what we need by applying filters, then plot the temperature-salinity (T-S) diagram for 2019 in the North West Atlantic, followed by monthly mean temperature and salinity profiles against pressure.

This example uses the QCed Argo dataset, i.e. data with a QC flag equal to 1, 2, 5, or 8, taken in delayed mode when available and in real-time mode otherwise. For more details on how the dataset is built, see: 

Example 1 shows how to use the full Argo dataset, and Example 2 shows CrocoLake, which contains QCed data from different datasets (Argo, GLODAP, Spray Gliders as of today).

##### Note on parquet files
There are several ways to load parquet files into a dataframe in Python, and a few are illustrated in Examples 1 and 2. This notebook uses dask, which is optimized for larger-than-memory data.

#### Getting started

If you haven't already, install the required packages by running `pip install .` at the root of the repository.

We also need the dataset! In this example we use the Argo PHY dataset: you can uncomment and run the cell below, or copy-paste the command (without the leading `!`) in your command line. If you are interested in biogeochemical quantities instead, you can replace 'PHY' with 'BGC' here and throughout the notebook.

The script downloads the dataset to the default directory `./CrocoLake`; if you want to specify a different path, you can use the `--destination` flag.

NB: the download might take a while, as the PHY dataset is ~20GB and the BGC one is ~6GB.

In [None]:
# !download_db -d Argo -t PHY --qc

We then import the necessary modules and set up the path to the dataset (update the `parquet_dir` variable below if you have specified a different location in the previous cell or have moved the dataset).

In [None]:
import datetime
import glob
from pprint import pprint

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

import matplotlib.pyplot as plt

# Path to Argo PHY 'QC'
parquet_dir = './CrocoLake/1003_PHY_ARGO-QC/'
# Setting up parquet schema
PHY_schema = pq.read_schema(parquet_dir+"_common_metadata")
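
Before loading anything, it can help to confirm that `parquet_dir` actually points at the downloaded dataset. The following is a minimal sketch using only the standard library; `summarize_parquet_dir` is a hypothetical helper (not part of the repository), and it assumes that the data files end in `.parquet` and sit directly inside the directory, next to the `_common_metadata` file read above. The demo builds a throwaway directory so that it runs without the real dataset.

```python
import tempfile
from pathlib import Path


def summarize_parquet_dir(parquet_dir):
    """Return (has_common_metadata, number_of_parquet_files) for a directory.

    Hypothetical helper: assumes data files end in '.parquet' and live
    at the top level of `parquet_dir`.
    """
    root = Path(parquet_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    has_metadata = (root / "_common_metadata").is_file()
    n_files = sum(1 for _ in root.glob("*.parquet"))
    return has_metadata, n_files


# Demo on a temporary directory that mimics the expected layout
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "_common_metadata").touch()
    (Path(tmp) / "part.0.parquet").touch()
    demo_summary = summarize_parquet_dir(tmp)

print(demo_summary)
```

With the real dataset you would call `summarize_parquet_dir(parquet_dir)` instead; a missing `_common_metadata` file or a count of zero suggests the path needs updating.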

In [None]:
%%time
cols = ["JULD","LATITUDE","LONGITUDE","PRES","TEMP","PSAL","PLATFORM_NUMBER"]

lat0 = 37
lat1 = 42
lon0 = -70
lon1 = -65

date0 = datetime.datetime(2019, 1, 1, 0, 0, 0)
date1 = datetime.datetime(2020, 1, 1, 0, 0, 0)

myfilter = [
    ("LATITUDE",">",lat0), ("LATITUDE","<",lat1),
    ("LONGITUDE",">",lon0), ("LONGITUDE","<",lon1),
    ("JULD",">",date0), ("JULD","<",date1)
]

ddf = dd.read_parquet(
    parquet_dir,
    engine="pyarrow",
    schema=PHY_schema,
    filters=myfilter,
    columns= cols
)

ddf['MONTH'] = ddf['JULD'].dt.strftime('%B')
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

# Setting up figure
plt.figure(figsize=(10, 6))
custom_colors = ['#000000', '#7f7f7f', '#2ca02c', '#FFF300', '#ff7f0e', '#d62728', '#9467bd', '#e377c2', '#bcbd22', '#8c564b', '#17becf', '#1f77b4']
from cycler import cycler
plt.rcParams['axes.prop_cycle'] = cycler(color=custom_colors)

# Group by month and plot each group
cols_for_ts = ["TEMP","PSAL","MONTH"]
df_for_ts =  ddf[cols_for_ts].compute()
df_for_ts['MONTH'] = pd.Categorical(df_for_ts['MONTH'], categories=month_order, ordered=True)
df_for_ts = df_for_ts.sort_values(["MONTH"])

for month, group in df_for_ts.groupby('MONTH',observed=False):
    plt.plot(
        group['PSAL'],
        group['TEMP'],
        linestyle='None',
        marker='o',
        markersize=2.5,
        fillstyle='full',
        alpha=0.6,
        label=f'{month}'
    )

plt.xlabel('PSAL')
plt.ylabel('TEMP')
plt.title('T-S plot')
plt.legend()
plt.show()

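
The `filters` argument used above is a flat list of `(column, operator, value)` tuples, which pyarrow interprets as a logical AND of all conditions. The sketch below mimics that behaviour in plain Python on individual rows, to make the selection explicit; `row_matches` is a hypothetical helper for illustration only and is not part of dask or pyarrow.

```python
import datetime
import operator

# Map the operator strings used in the filter tuples to Python functions
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def row_matches(row, filters):
    """Return True if `row` (a dict) satisfies every (column, op, value) tuple."""
    return all(OPS[op](row[col], value) for col, op, value in filters)


# Same bounding box and time window as in the notebook
myfilter = [
    ("LATITUDE", ">", 37), ("LATITUDE", "<", 42),
    ("LONGITUDE", ">", -70), ("LONGITUDE", "<", -65),
    ("JULD", ">", datetime.datetime(2019, 1, 1)),
    ("JULD", "<", datetime.datetime(2020, 1, 1)),
]

mid_2019 = datetime.datetime(2019, 6, 15)
inside = {"LATITUDE": 40.0, "LONGITUDE": -68.0, "JULD": mid_2019}
too_far_east = {"LATITUDE": 40.0, "LONGITUDE": -60.0, "JULD": mid_2019}
on_edge = {"LATITUDE": 37.0, "LONGITUDE": -68.0, "JULD": mid_2019}

for name, row in [("inside", inside), ("too_far_east", too_far_east), ("on_edge", on_edge)]:
    print(name, row_matches(row, myfilter))
```

Because the inequalities are strict, a measurement lying exactly on an edge of the box (such as `on_edge`) is excluded; use `>=` and `<=` if the boundaries should be included.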

In the following, we average temperature and salinity in pressure bins of width 10 for each month, and plot the resulting mean profiles against pressure.

In [None]:
%%time

# Setting up figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Group by month and plot each group
cols_for_p = ["PRES","TEMP","PSAL","MONTH"]
df_for_p =  ddf[cols_for_p].compute()
df_for_p['MONTH'] = pd.Categorical(df_for_p['MONTH'], categories=month_order, ordered=True)
df_for_p = df_for_p.sort_values(["MONTH"])

binwidth = 10
start = -binwidth/2
end = 2505 + binwidth/2
bins = np.arange(start,end,binwidth)
labels = bins[:-1]+binwidth/2

df_for_p["PRES_BINNED"] = pd.cut(
    df_for_p["PRES"],
    bins=bins,
    labels=labels
)

colors_iterator = iter(custom_colors)

for month, group in df_for_p.groupby('MONTH',observed=False):

    # print(month)
    # print(group)
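    # Average TEMP and PSAL within each pressure bin; observed=False keeps
    # empty bins (as NaN) and matches the setting of the outer groupby
    average_df = group.groupby(
        ["PRES_BINNED"],
        as_index=False,
        observed=False
    ).aggregate(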
        {
            "TEMP": "mean",
            "PSAL": "mean"
        }
    )
    
    current_color = next(colors_iterator)
    
    ax1.plot(
        average_df["TEMP"],
        average_df["PRES_BINNED"],
        linestyle='None',
        color=current_color,
        marker='o',
        markersize=2.5,
        fillstyle='full',
        alpha=0.6,
        label=f'{month}'
    )

    ax2.plot(
        average_df["PSAL"],
        average_df["PRES_BINNED"],
        linestyle='None',
        color=current_color,
        marker='o',
        markersize=2.5,
        fillstyle='full',
        alpha=0.6,
        label=f'{month}'
    )

# y axis
yl = "pressure"
ax1.invert_yaxis()
ax2.invert_yaxis()
ax1.set_ylabel(yl)
ax2.set_ylabel(yl)

# x axis
ax1.set_xlabel("temperature")
ax2.set_xlabel("salinity")

# title, legend
ax1.set_title("Temperature-Pressure")
ax2.set_title("Salinity-Pressure")
ax1.legend()
ax2.legend()

plt.subplots_adjust(wspace=0.5)
plt.show()
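
To see what the pressure binning above does, the sketch below applies the same `bins` and `labels` to a handful of made-up values (the sample pressures and temperatures are invented for illustration and are not Argo data). Each bin is labelled by its centre, and `pd.cut` uses right-closed intervals by default, so a pressure of exactly 5 falls in the bin centred at 0.

```python
import numpy as np
import pandas as pd

# Same bin definition as in the notebook: width 10, centred on 0, 10, 20, ...
binwidth = 10
bins = np.arange(-binwidth / 2, 2505 + binwidth / 2, binwidth)
labels = bins[:-1] + binwidth / 2

# Made-up sample values, for illustration only
sample = pd.DataFrame({
    "PRES": [2.0, 4.9, 5.0, 5.1, 14.0, 998.0],
    "TEMP": [20.0, 19.0, 18.0, 17.0, 16.0, 4.0],
})

# Assign each pressure to the bin whose centre is given by `labels`
sample["PRES_BINNED"] = pd.cut(sample["PRES"], bins=bins, labels=labels)

# Mean temperature per bin; observed=True drops bins with no data
means = sample.groupby("PRES_BINNED", observed=True)["TEMP"].mean()

print(sample)
print(means)
```

Here `observed=True` is used only to keep the printed output short; the notebook keeps all bins, so empty ones show up as NaN and are simply not drawn.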