### Example 3 - Temperature-Salinity profiles in the North West Atlantic

This example shows how to read and manipulate QCed Argo observations stored in parquet format. The data are stored across multiple files, so we apply filters to load into memory only what we need. We then plot the temperature-salinity profiles recorded in the North West Atlantic in 2019, along with a map showing the average position of the profiles.

The QCed Argo dataset used here contains only data with QC flag equal to 1, 2, 5, or 8. Delayed-mode data are used where available; where they are missing, the real-time data are provided instead. For more details on how the dataset is built, see: 

Example 1 shows how to use the full Argo dataset, and Example 2 shows CrocoLake, which contains QCed data from different datasets (Argo, GLODAP, Spray Gliders as of today).
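
To make the selection rule described above concrete, here is a minimal pure-Python sketch. It is not how the dataset is actually built: `select_value` and its arguments are hypothetical names, and it assumes for simplicity a single QC flag per measurement.

```python
# QC flags kept in the QCed dataset, as listed in the text above.
GOOD_QC_FLAGS = {1, 2, 5, 8}


def select_value(real_time, delayed, qc_flag):
    """Return the value to keep for one measurement, or None to drop it.

    Hypothetical helper: the measurement is dropped if its QC flag is not
    in GOOD_QC_FLAGS; otherwise the delayed-mode value is preferred and the
    real-time value is used only when the delayed-mode one is missing.
    """
    if qc_flag not in GOOD_QC_FLAGS:
        return None
    return delayed if delayed is not None else real_time


print(select_value(10.0, 10.1, 1))  # delayed-mode value is kept
print(select_value(10.0, None, 2))  # falls back to the real-time value
print(select_value(10.0, 10.1, 4))  # flag 4 is not accepted: dropped
```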

##### Note on parquet files
There are several ways to load parquet files into a dataframe in Python, and a few are illustrated in Examples 1 and 2. This notebook uses dask, which is optimized to work with larger-than-memory data.

#### Getting started

If you haven't already, install the required packages by running `pip install .` at the root of the repository.

We also need the dataset! In this example we use the Argo PHY dataset: you can run the cell below, or copy-paste the command (without the leading `!`) into your command line. If you are interested in biogeochemical quantities instead, you can replace 'PHY' with 'BGC' here and throughout the notebook.

In [None]:
!download_db -d Argo -t PHY --qc

We then import the necessary modules and set up the path to the dataset (update the `parquet_dir` variable below if you have specified a different location in the previous cell or have moved the dataset).

In [None]:
import datetime
import glob
from pprint import pprint

import dask
import dask.dataframe as dd
import pandas as pd
import pyarrow.parquet as pq

import matplotlib.pyplot as plt

# Path to the QCed Argo PHY dataset
parquet_dir = './CrocoLake/1002_PHY_ARGO-QC-DEV/'
# Read the parquet schema from the dataset's common metadata file
PHY_schema = pq.read_schema(parquet_dir + "_common_metadata")
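
The next cell selects the region and time window with a flat list of `(column, operator, value)` tuples. With the pyarrow engine, a flat list like this is read as a conjunction: a row is kept only if it satisfies every tuple. The sketch below mimics that behaviour in plain Python on two made-up rows; `row_matches`, `OPS`, and the sample rows are hypothetical and only illustrate the semantics, not how pyarrow evaluates filters internally.

```python
import operator
from datetime import datetime

# Comparison operators that may appear in a filter tuple.
OPS = {
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def row_matches(row, filters):
    """Return True if `row` (a dict) satisfies every (column, op, value) tuple."""
    return all(OPS[op](row[col], value) for col, op, value in filters)


# Same bounds as in the notebook; strict inequalities exclude the boundaries.
example_filter = [
    ("LATITUDE", ">", 37), ("LATITUDE", "<", 42),
    ("LONGITUDE", ">", -70), ("LONGITUDE", "<", -65),
    ("JULD", ">", datetime(2019, 1, 1)), ("JULD", "<", datetime(2020, 1, 1)),
]

# Two made-up rows: the second one lies east of the selected box.
inside = {"LATITUDE": 40.0, "LONGITUDE": -68.0, "JULD": datetime(2019, 6, 15)}
outside = {"LATITUDE": 40.0, "LONGITUDE": -60.0, "JULD": datetime(2019, 6, 15)}

print(row_matches(inside, example_filter))   # True
print(row_matches(outside, example_filter))  # False
```

Because the filters are passed to `dd.read_parquet`, rows that fail them do not need to be loaded into memory at all.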

In [None]:
%%time
cols = ["LATITUDE","LONGITUDE","JULD","TEMP","PSAL","PLATFORM_NUMBER"]

lat0 = 37
lat1 = 42
lon0 = -70
lon1 = -65

date0 = datetime.datetime(2019, 1, 1, 0, 0, 0)
date1 = datetime.datetime(2020, 1, 1, 0, 0, 0)

myfilter = [
    ("LATITUDE",">",lat0), ("LATITUDE","<",lat1),
    ("LONGITUDE",">",lon0), ("LONGITUDE","<",lon1),
    ("JULD",">",date0), ("JULD","<",date1)
]

ddf = dd.read_parquet(
    parquet_dir,
    engine="pyarrow",
    schema=PHY_schema,
    filters=myfilter,
    columns=cols
)

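# Full month name of each observation (e.g. 'January'), used to group the data
ddf['MONTH'] = ddf['JULD'].dt.strftime('%B')
# Calendar order of the months, used later to sort the groups and the legend
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']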

# Set up the figure and a 12-color cycle, one color per month
from cycler import cycler

plt.figure(figsize=(10, 6))
custom_colors = [
    '#000000', '#7f7f7f', '#2ca02c', '#FFF300', '#ff7f0e', '#d62728',
    '#9467bd', '#e377c2', '#bcbd22', '#8c564b', '#17becf', '#1f77b4'
]
plt.rcParams['axes.prop_cycle'] = cycler(color=custom_colors)

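# Load into memory only the columns needed for the T-S plot
cols_for_ts = ["TEMP", "PSAL", "MONTH"]
df_for_ts = ddf[cols_for_ts].compute()
# Make MONTH an ordered categorical so that groups follow the calendar order
df_for_ts['MONTH'] = pd.Categorical(df_for_ts['MONTH'], categories=month_order, ordered=True)
df_for_ts = df_for_ts.sort_values(["MONTH"])

# Group by month and plot each group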

for month, group in df_for_ts.groupby('MONTH',observed=False):
    plt.plot(
        group['PSAL'],
        group['TEMP'],
        linestyle='None',
        marker='o',
        markersize=2.5,
        fillstyle='full',
        alpha=0.6,
        label=f'{month}'
    )

plt.xlabel('Practical salinity, PSAL [psu]')
plt.ylabel('Temperature, TEMP [°C]')
plt.title('T-S plot')
plt.legend()
plt.show()
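
The cell above converts `MONTH` to an ordered categorical before sorting and grouping. The reason is that month names sort alphabetically by default, which would scramble the legend. The following pure-Python sketch shows the difference on a few made-up month names; it uses `sorted` with a key rather than `pd.Categorical`, so it illustrates the idea rather than the pandas mechanism.

```python
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Made-up sample of month names, in arbitrary order.
observed_months = ['October', 'April', 'January', 'July']

# Default string sorting is alphabetical, not chronological.
alphabetical = sorted(observed_months)

# Sorting by position in month_order restores the calendar order.
chronological = sorted(observed_months, key=month_order.index)

print(alphabetical)   # ['April', 'January', 'July', 'October']
print(chronological)  # ['January', 'April', 'July', 'October']
```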


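Next, we visualize the average location of the float(s) that recorded the previous profiles: for each month we plot the mean position, with error bars showing the standard deviation in latitude and longitude.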

In [None]:
# Map projection used for the plot
import cartopy.crs as ccrs

plt.figure(figsize=(16, 12))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

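# Load into memory only the columns needed for the map
cols_for_map = ["LATITUDE", "LONGITUDE", "MONTH"]
df_for_map = ddf[cols_for_map].compute()
# Make MONTH an ordered categorical so that groups follow the calendar order
df_for_map['MONTH'] = pd.Categorical(df_for_map['MONTH'], categories=month_order, ordered=True)
df_for_map = df_for_map.sort_values(["MONTH"])

# Group by month and plot the mean position of each group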

for month, group in df_for_map.groupby('MONTH',observed=False):
    # Compute mean and standard deviation
    mean_lat = group['LATITUDE'].mean()
    mean_lon = group['LONGITUDE'].mean()
    std_lat = group['LATITUDE'].std()
    std_lon = group['LONGITUDE'].std()

    plt.errorbar(
        mean_lon,
        mean_lat,
        xerr=std_lon,
        yerr=std_lat,
        fmt='o',
        elinewidth=2,
        capsize=4,
        alpha=0.8,
        label=f'{month}',
        transform=ccrs.PlateCarree()
    )

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Location of Argo floats used for T-S profiles')
plt.grid(True)
plt.xlim([lon0, lon1])
plt.ylim([lat0, lat1])
plt.legend()
plt.show()
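
The marker and error bars drawn for each month are the mean and standard deviation of the positions in that month. As a sanity check on what these numbers mean, the sketch below computes them by hand for three made-up positions using the standard library. `statistics.stdev` is the sample standard deviation, which matches the default behaviour of pandas' `Series.std()`. A plain arithmetic mean of longitudes is only reasonable away from the antimeridian, which is the case for the region selected here.

```python
from statistics import mean, stdev

# Made-up (latitude, longitude) pairs standing in for one month of profiles.
positions = [(38.0, -69.0), (39.0, -68.0), (40.0, -67.0)]

lats = [lat for lat, _ in positions]
lons = [lon for _, lon in positions]

# Mean position: centre of the marker on the map.
mean_lat, mean_lon = mean(lats), mean(lons)

# Sample standard deviations: half-lengths of the error bars.
std_lat, std_lon = stdev(lats), stdev(lons)

print(mean_lat, mean_lon)  # 39.0 -68.0
print(std_lat, std_lon)    # 1.0 1.0
```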