# Download matchup data from CMEMS

**Last updated: 29/04/2024**

**Copernicus Marine Environment Monitoring Service** (**CMEMS**, or Copernicus Marine Service for short) stores **L3** and **L4** satellite data and **reanalysis** products from various data providers. This script runs in the conda environment `copernicusmarine` and uses the Copernicus Marine Toolbox, developed by Copernicus Marine Service, to download **matchup** data from the **Copernicus Marine Data Store**. I developed it with the assistance of the Copernicus Marine Service helpdesk officers.

This script is configured to download L3 and reanalysis matchup data for the following example datasets:
* **HPLC dataset** for the North Sea, provided by CEFAS
* **pH dataset** from the SSB and UKOA programs, provided by Naomi Greenwood (CEFAS)
* **Oxford-CEFAS ship survey dataset** for the Endurance site area

**You can use this script as a template and modify the code sections below to customise it for your own datasets.**

As of the latest update, certain CMEMS [products](https://marine.copernicus.eu/user-corner/user-notification-service/datasets-process-being-formatted-service-subset) (4km-global-CMEMSmulti and 300m-global-OLCI) are unavailable via the `copernicusmarine` toolbox and have therefore been commented out of the product lists.

## Import libraries, functions and define paths

In [50]:
import os 
from pathlib import Path
import copernicusmarine
from datetime import datetime
import pandas as pd

In [51]:
# Latitudes or longitudes may be stored in inverted (descending) order,
# so we define the following function, which is called during the download

def sort_dimension(dataset, dim_name):
    """
    Get the values for the specified dimension and verify if they are unsorted. If so, the function sorts them.
    """
    # Get the coordinate values for the specified dimension.
    coords = dataset[dim_name].values

    # Check if the coordinates are unsorted, i.e. at least one value is
    # smaller than the value before it. If so, sort in ascending order.
    if (coords[1:] < coords[:-1]).any():
        dataset = dataset.sortby(dim_name, ascending=True)
        
    return dataset

In [52]:
# Create a download directory for our outputs

PATH_ROOT_DIR = Path.cwd().resolve().parents[1] # /absolute/path/to/two/levels/up

NAME_DOWNLOAD_DIR_HPLC_MATCHUPS = 'data_matchups_HPLC_CMT_csv'
NAME_DOWNLOAD_DIR_PH_MATCHUPS = 'data_matchups_pH_CMT_csv'
NAME_DOWNLOAD_DIR_SHIP_MATCHUPS = 'data_matchups_ship_CMT_csv'

# Combine ROOT_DIR with the directory name
full_path_download_dir_hplc = os.path.join(PATH_ROOT_DIR,"data","raw","CMEMS_data",NAME_DOWNLOAD_DIR_HPLC_MATCHUPS)
full_path_download_dir_ph = os.path.join(PATH_ROOT_DIR,"data","raw","CMEMS_data",NAME_DOWNLOAD_DIR_PH_MATCHUPS)
full_path_download_dir_ship = os.path.join(PATH_ROOT_DIR,"data","raw","CMEMS_data",NAME_DOWNLOAD_DIR_SHIP_MATCHUPS)

# Create the directory at the specified path
os.makedirs(full_path_download_dir_hplc, exist_ok=True)
os.makedirs(full_path_download_dir_ph, exist_ok=True)
os.makedirs(full_path_download_dir_ship, exist_ok=True)
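Since `pathlib` is already imported, the same folder layout could also be built with `Path` objects alone. The following is a minimal sketch, not the method used in this notebook: the helper `make_download_dir` is a hypothetical name, and a temporary directory stands in for `PATH_ROOT_DIR`.

```python
import tempfile
from pathlib import Path

def make_download_dir(root_dir, dir_name):
    """Create <root_dir>/data/raw/CMEMS_data/<dir_name> and return its path."""
    download_dir = Path(root_dir) / "data" / "raw" / "CMEMS_data" / dir_name
    # parents=True creates the intermediate folders;
    # exist_ok=True makes it safe to rerun the cell
    download_dir.mkdir(parents=True, exist_ok=True)
    return download_dir

# Hypothetical root, used only for this demonstration
demo_root = Path(tempfile.mkdtemp())
demo_dir = make_download_dir(demo_root, "data_matchups_HPLC_CMT_csv")
print(demo_dir.is_dir())
```

Calling the helper a second time with the same arguments returns the same path without raising an error.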

## Read in our in situ HPLC observations

In [53]:
# This file was created by the Matlab function prepareHPLCdata.m
NAME_HPLC_DATA_FILE = 'cefasHPLCfiltered.csv'
full_path_hplc_data_dir = os.path.join(PATH_ROOT_DIR,'data','processed',NAME_HPLC_DATA_FILE)
matchup_hplc_locations_list = pd.read_csv(full_path_hplc_data_dir, sep = ',')

# Converting date column into right format
matchup_hplc_locations_list["DateTime"] = pd.to_datetime(matchup_hplc_locations_list["DateTime"])
matchup_hplc_locations_list

Unnamed: 0,idd,Survey_name,Station_number,Prime_number,DateTime,Latitude,Longitude,Smartbuoy,Sample_depth,TP_ug_L,...,Lut_ug_L,Myxo_ug_L,Croc_ug_L,x19_Keto_Hex_fuco_ug_L,Hexkfuco_ug_L,HexkfucoL_ug_L,x4keto_hex_ug_L,x4keto_hexL_ug_L,bathymetry_m,season
0,625,CEND19_17,230.0,102.0,2017-10-28 01:18:00,48.350783,-5.750117,,4,0.701378,...,0.000000,0.000000,0.00000,,,,0.000000,,-116.982548,Autumn
1,671,CEND17_18,169.0,102.0,2018-10-25 03:42:00,48.366280,-5.725660,,6,0.796561,...,0.000000,,0.00000,,,,0.004086,,-116.179468,Autumn
2,672,CEND17_18,176.0,106.0,2018-10-25 20:40:00,48.547920,-4.928820,,6,0.628834,...,0.000662,,0.00000,,,,0.002824,,-74.689037,Autumn
3,626,CEND19_17,231.0,106.0,2017-10-28 05:48:00,48.552300,-4.915950,,4,0.858928,...,0.000000,0.000000,0.00000,,,,0.000000,,-77.795798,Autumn
4,55,CEND09_11,44.0,,2011-05-22 15:50:00,48.778933,-4.390117,,5,1.906309,...,0.004340,0.003500,0.00701,,,,,,-88.030648,Spring
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
669,499,CEND15_13,191.0,72.0,2013-08-29 12:40:00,61.232000,-0.401000,,4,0.620383,...,0.001790,0.000000,,0.00000,0.0,0.0,,,-164.470400,Summer
670,166,CEND13_10,160.0,74.0,2010-09-01 04:06:00,61.251333,1.393667,,3,6.715475,...,0.024246,0.013840,,0.11529,,,,,-148.207900,Summer
671,445,CEND18_15,125.0,73.0,2015-08-30 14:47:00,61.285900,0.488133,,4,2.892239,...,0.014518,0.000000,0.00000,,,,0.016483,,-167.815296,Summer
672,533,CEND18_16,120.0,73.0,2016-08-29 09:35:00,61.288283,0.500183,,4,2.378788,...,0.001587,0.006305,0.00000,,,,0.000000,,-168.083801,Summer


## Read in our in situ pH observations

In [54]:
# This file was created by the Matlab function prepareInsituPhData.m
NAME_PH_DATA_FILE = 'greenwoodPhFiltered.csv'
full_path_ph_data_dir = os.path.join(PATH_ROOT_DIR,'data','processed',NAME_PH_DATA_FILE)
matchup_ph_locations_list = pd.read_csv(full_path_ph_data_dir, sep = ',')

# Convert the date column to datetime format (the column is overwritten in place)
matchup_ph_locations_list["DateTime"] = pd.to_datetime(matchup_ph_locations_list["DateTime"])
matchup_ph_locations_list

Unnamed: 0,Program,Cruise,DateTime,Latitude_degN,Longitude_degE,Sample_depth_m,pH,bathymetry_m,season
0,ssb,CEND_04_15,2015-03-09 08:16:00,48.077000,-4.895833,4.0,8.006081,-70.028300,Winter
1,ssb,CEND_04_15,2015-03-16 14:12:00,48.230833,-6.816000,4.0,8.022130,-155.254465,Winter
2,ssb,CEND_04_15,2015-03-10 06:12:00,48.950833,-5.444500,4.0,8.026358,-107.897395,Winter
3,ssb,CEND_04_14,2014-02-19 09:28:00,48.956667,-3.194167,4.0,8.050431,-61.779111,Winter
4,ssb,CEND_04_14,2014-02-19 09:28:00,48.956667,-3.194167,60.0,8.058609,-61.779111,Winter
...,...,...,...,...,...,...,...,...,...
1431,ukoa,CEND_15_13,2013-08-29 07:30:00,61.222333,0.610667,160.0,8.028847,-162.936500,Summer
1432,ukoa,CEND_15_13,2013-08-29 07:22:00,61.222333,0.610667,4.0,8.112115,-162.936500,Summer
1433,ukoa,CEND_13_12,2012-08-31 08:57:00,61.226833,0.628000,4.0,7.973876,-158.735400,Summer
1434,ukoa,CEND_15_13,2013-08-29 11:50:00,61.233667,-0.399167,165.0,7.995566,-164.833000,Summer


## Read in our ship survey data

In [19]:
NAME_SHIP_DATA_FILE = 'C8611 Oxford Biogeochemistry Results v2.xlsx'
full_path_ship_data_dir = os.path.join(PATH_ROOT_DIR,'data','raw','CEFAS_Oxford_shipboard_survey',NAME_SHIP_DATA_FILE)
matchup_ship_locations_list = pd.read_excel(full_path_ship_data_dir, sheet_name='Results', header=2)

# Converting date column into right format in a new column
matchup_ship_locations_list["DateTime"] = pd.to_datetime(matchup_ship_locations_list["Sample\nDate"])
matchup_ship_locations_list

Unnamed: 0,Station\nNumber,Sample\nDepth\n(m),Sample\nDate,Sample\nTime \n(UTC),Latitude,Longitude,Notes,Salinity,TOxN \n(umol/l),Nitrite \n(umol/l),Phospahte \n(umol/l),Silicate \n(umol/l),Ammonia \n(umol/l),SPM \n(umol/l),Chlorophyll \n(umol/l),Phaeopigments \n(umol/l),DateTime
0,18,4,2023-04-23,03:07:00,54.1823,1.1192,Oxford 6 Pre,34.595,2.5,0.26,0.27,<0.1,0.2,0.68,1.52,0.53,2023-04-23
1,20,4,2023-04-23,04:01:00,54.1986,1.107,Oxford 6 Post,34.621,1.8,0.15,0.24,<0.1,0.5,1.47,1.98,0.47,2023-04-23
2,21,4,2023-04-23,04:31:00,54.216,1.0416,Oxford 3 Pre,34.624,2.0,0.13,0.28,<0.1,0.5,0.92,1.98,0.7,2023-04-23
3,23,4,2023-04-23,05:04:00,54.2178,1.0381,Oxford 3 Post,34.621,2.3,0.14,0.28,<0.1,0.2,0.95,1.65,0.62,2023-04-23
4,24,4,2023-04-23,05:29:00,54.2256,1.0055,Oxford 1 Pre,34.627,2.9,0.15,0.34,<0.1,0.4,0.88,1.74,0.64,2023-04-23
5,26,4,2023-04-23,05:53:00,54.221,1.0036,Oxford 1 Post,34.514,3.1,0.14,0.35,<0.1,0.4,1.44,1.28,0.5,2023-04-23
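The `DateTime` column above holds only the sampling date, so a nearest-time lookup would treat every sample as taken at midnight. If the sampling time should be taken into account, the date and time columns could be combined first. This is a minimal sketch on a synthetic two-row table: it assumes the time column holds strings or `datetime.time` values, which has not been checked against how `read_excel` parses the real file.

```python
import pandas as pd

# Synthetic stand-in for the ship survey table (same column names)
demo_ship = pd.DataFrame({
    "Sample\nDate": pd.to_datetime(["2023-04-23", "2023-04-23"]),
    "Sample\nTime \n(UTC)": ["03:07:00", "04:01:00"],
})

# Build "YYYY-MM-DD HH:MM:SS" strings and parse them in one call
date_text = demo_ship["Sample\nDate"].dt.strftime("%Y-%m-%d")
time_text = demo_ship["Sample\nTime \n(UTC)"].astype(str)
demo_ship["DateTime"] = pd.to_datetime(date_text + " " + time_text)

print(demo_ship["DateTime"])
```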


## Set the download parameters

**Modify as needed.**

In [2]:
# ===============================================================================
# List of datasets for matchup with HPLC data
# ===============================================================================

LIST_DATASET_IDS_HPLC_MATCHUP = [
      
# L3 satellite observations, global, various resolutions, daily, Copernicus-GlobColour algorithm
# Product ID: OCEANCOLOUR_GLO_BGC_L3_MY_009_103
    
    # 4 km res (multiple sensors merged)
    #"cmems_obs-oc_glo_bgc-plankton_my_l3-multi-4km_P1D",  
    # 4 km resolution (OLCI sensor) 
    "cmems_obs-oc_glo_bgc-plankton_my_l3-olci-4km_P1D",   
    # 300 m resolution (OLCI sensor)
    #"cmems_obs-oc_glo_bgc-plankton_my_l3-olci-300m_P1D",  
    
# L3 satellite observations, Atlantic-European NWS, various resolutions, daily
# Product ID: OCEANCOLOUR_ATL_BGC_L3_MY_009_113
    
    # 1 km resolution (multiple sensors merged)
    "cmems_obs-oc_atl_bgc-plankton_my_l3-multi-1km_P1D", 
    # 300 m resolution (OLCI sensor)
    "cmems_obs-oc_atl_bgc-plankton_my_l3-olci-300m_P1D",
    
# Biogeochemical reanalysis, Atlantic-European NWS, 7 km horizontal resolution, daily
# Product ID: NWSHELF_MULTIYEAR_BGC_004_011
    "cmems_mod_nws_bgc-chl_my_7km-3D_P1D-m"  
]

# ===============================================================================
# List of output file names (should correspond to the dataset names listed above)
# ===============================================================================

LIST_OUTPUT_NAMES_HPLC_MATCHUP = [
    #"obs_satell_glob_cmems_multi_4km_plk",
    "obs_satell_glob_cmems_olci_4km_plk",
    #"obs_satell_glob_cmems_olci_300m_plk",
    "obs_satell_reg_cmems_multi_1km_plk",
    "obs_satell_reg_cmems_olci_300m_plk",
    "mod_bgc_reg_chl"
]

# ===============================================================================
# List of variable names to download (upper- and lower-case variants are both listed)
# ===============================================================================

LIST_VARIABLES_HPLC_MATCHUP = [
    "CHL",
    "chl"
]
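The download loops pair dataset IDs with output names using `zip`, which silently stops at the shorter list, so a commented-out entry in one list but not the other would go unnoticed. A small check like the one below could catch that early. It is a sketch only: `check_matching_lengths` is a hypothetical helper, and the example lists are shortened copies of the ones above.

```python
def check_matching_lengths(dataset_ids, output_names):
    """Return (dataset_id, output_name) pairs, or raise if the lists differ in length."""
    if len(dataset_ids) != len(output_names):
        raise ValueError(
            f"{len(dataset_ids)} dataset IDs but {len(output_names)} output names"
        )
    return list(zip(dataset_ids, output_names))

# Shortened example values copied from the lists above
example_ids = [
    "cmems_obs-oc_glo_bgc-plankton_my_l3-olci-4km_P1D",
    "cmems_mod_nws_bgc-chl_my_7km-3D_P1D-m",
]
example_names = [
    "obs_satell_glob_cmems_olci_4km_plk",
    "mod_bgc_reg_chl",
]

pairs = check_matching_lengths(example_ids, example_names)
for dataset_id, output_name in pairs:
    print(dataset_id, "->", output_name)
```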

In [78]:
# ===============================================================================
# List of datasets for matchup with pH data
# ===============================================================================

# Biogeochemical reanalysis, Atlantic-European NWS, 7 km horizontal resolution, daily
# Product ID: NWSHELF_MULTIYEAR_BGC_004_011
LIST_DATASET_IDS_PH_MATCHUP = [
    "cmems_mod_nws_bgc-ph_my_7km-3D_P1D-m" 
]

# ===============================================================================
# List of output file names (should correspond to the dataset names listed above)
# ===============================================================================

LIST_OUTPUT_NAMES_PH_MATCHUP = [
    "mod_bgc_reg_ph"
] 

# ===============================================================================
# List of variable names to download
# ===============================================================================

LIST_VARIABLES_PH_MATCHUP = [
    "ph"
]

In [18]:
# ===============================================================================
# List of datasets for matchup with ship survey data
# ===============================================================================

# Biogeochemical reanalysis, Atlantic-European NWS, 7 km horizontal resolution, daily
# Product ID: NWSHELF_MULTIYEAR_BGC_004_011
LIST_DATASET_IDS_SHIP_MATCHUP = [
    "cmems_mod_nws_bgc-chl_my_7km-3D_P1D-m",             # Chlorophyll a (mg chla m-3)
    "cmems_mod_nws_bgc-kd_my_7km-3D_P1D-m",              # Attenuation coefficient kd (m-1)
    "cmems_mod_nws_bgc-no3_my_7km-3D_P1D-m",             # Nitrate (mmol m-3)  
    "cmems_mod_nws_phy-s_my_7km-3D_P1D-m"                # Salinity (PSU)
]

# ===============================================================================
# List of output file names (should correspond to the dataset names listed above)
# ===============================================================================

LIST_OUTPUT_NAMES_SHIP_MATCHUP = [
    "mod_bgc_reg_chl",
    "mod_bgc_reg_kd",
    "mod_bgc_reg_no3",
    "mod_phy_reg_sal"
] 

# ===============================================================================
# List of variable names to download
# ===============================================================================

LIST_VARIABLES_SHIP_MATCHUP = [
    "chl",
    "attn",
    "no3",
    "so"  
]

## Exploratory analysis of one of the datasets

In [8]:
DS = copernicusmarine.open_dataset(
    dataset_id = LIST_DATASET_IDS_HPLC_MATCHUP[3]
)
DS

INFO - 2024-04-30T08:56:26Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-04-30T08:56:26Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-04-30T08:56:28Z - Service was not specified, the default one was selected: "arco-geo-series"


Notice that, for each variable in `Data variables` the arrangement is time x lat x lon, thus `time=row[0]`, `lat=row[1]` and `lon=row[2]`.

Because this is a daily dataset timestamped at midnight, selecting with `method="nearest"` returns data for the same day when the time in the `DateTime` column falls between 12:00 AM and 12:00 PM, and for the following day otherwise. The cell below shows the matched dataset time for the first two dates in the CSV file:

In [58]:
print("Time from csv:    ", matchup_hplc_locations_list.DateTime[0])
print("Time from dataset:", DS.sel(time=matchup_hplc_locations_list.DateTime[0], method="nearest").time.values)
print(" ")
print("Time from csv:    ", matchup_hplc_locations_list.DateTime[1])
print("Time from dataset:", DS.sel(time=matchup_hplc_locations_list.DateTime[1], method="nearest").time.values)

Time from csv:     2017-10-28 01:18:00
Time from dataset: 2017-10-28T00:00:00.000000000
 
Time from csv:     2018-10-25 03:42:00
Time from dataset: 2018-10-25T00:00:00.000000000


**For a sampling time such as 19:04:00, it makes more sense to take the data from the following day at 12:00 AM, because that is closer in time than the same day at 12:00 AM.**
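The nearest-midnight behaviour can be reproduced without opening the dataset. The sketch below uses only the standard library. The function names are hypothetical, and sending an exact 12:00 PM tie to the same day is an assumption rather than documented xarray behaviour. The two sample times are taken from the HPLC table above.

```python
from datetime import datetime, timedelta

def same_day_midnight(sample_time):
    """Midnight at the start of the sample's calendar day."""
    return sample_time.replace(hour=0, minute=0, second=0, microsecond=0)

def nearest_midnight(sample_time):
    """Midnight closest in time to the sample (same day or following day)."""
    same_day = same_day_midnight(sample_time)
    next_day = same_day + timedelta(days=1)
    # Ties at exactly 12:00 PM go to the same day (an assumption)
    if sample_time - same_day <= next_day - sample_time:
        return same_day
    return next_day

morning = datetime(2018, 10, 25, 3, 42)
evening = datetime(2018, 10, 25, 20, 40)

print(nearest_midnight(morning))   # 2018-10-25 00:00:00
print(nearest_midnight(evening))   # 2018-10-26 00:00:00
print(same_day_midnight(evening))  # 2018-10-25 00:00:00
```

The last line shows the alternative: if a matchup must come from the same calendar day, flooring the sample time before the lookup would give that instead.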

## Download data for HPLC matchups

In [3]:
%%time

selected_columns = ['DateTime', 'Latitude', 'Longitude', 'Sample_depth'] # in the .csv file

list_paths_output_filenames_hplc = []

# Loop for datasets in LIST_DATASET_IDS_HPLC_MATCHUP
for dataset_id, output_name in zip(LIST_DATASET_IDS_HPLC_MATCHUP, LIST_OUTPUT_NAMES_HPLC_MATCHUP):
    print("Downloading dataset: ", dataset_id)
    
    # Read dataset with CMT
    ds = copernicusmarine.open_dataset(dataset_id = dataset_id)

    # Rename the 'lon'/'lat' coordinates to 'longitude'/'latitude' where needed
    for coords in ds.coords:
        if coords=='lon':
            ds = ds.rename({'lon': 'longitude'})
        if coords=='lat':
            ds = ds.rename({'lat': 'latitude'})
            
    # Sort any inverted axis into ascending order
    ds = sort_dimension(ds, 'latitude')
    ds = sort_dimension(ds, 'longitude')
    
    # Copy the input dataframe
    df_valid = matchup_hplc_locations_list[selected_columns].copy()
    
    # The following code subsets the data with .sel in two steps:
    # (1) time (and depth, for 3D datasets) is selected with method="nearest", leaving a latitude x longitude slice
    # (2) the single grid point closest to the sample's latitude/longitude is selected with method="nearest"

    for variable_name in LIST_VARIABLES_HPLC_MATCHUP:

        if variable_name in ds.data_vars:
           
            print("Downloading variable: ", variable_name)

            # Download data for 3D datasets
            if "depth" in ds.dims:
                df_valid = df_valid.assign(**{
                    variable_name : [float(ds[variable_name].sel(time=row[0], depth=row[3], method='nearest')\
                        .sel(latitude=row[1], longitude=row[2], method='nearest'))\
                        for row in zip(df_valid['DateTime'], df_valid['Latitude'], df_valid['Longitude'], df_valid['Sample_depth'])]            
                })

            # Download data for 2D datasets 
            else:
                df_valid = df_valid.assign(**{
                    variable_name : [float(ds[variable_name].sel(time=row[0], method='nearest')\
                        .sel(latitude=row[1], longitude=row[2], method='nearest'))\
                        for row in zip(df_valid['DateTime'], df_valid['Latitude'], df_valid['Longitude'])]
            })
    
            # Add the corresponding date from the dataset (for checking purpose)
            df_valid = df_valid.assign(**{
                "Date_dataset" : [ ds.sel(time=date, method='nearest').time.values for date in df_valid['DateTime'] ]
            })

            # Save the dataframe with downloaded variable(s)
            csvfilename = os.path.join(full_path_download_dir_hplc, f"{output_name}_{variable_name}.csv")
            df_valid.to_csv(csvfilename)
            list_paths_output_filenames_hplc.append(csvfilename)

print("Download completed!")


### Create a matchup table for HPLC data

In [57]:
# Initialise a new dataframe from the input one
df_coords_with_data = matchup_hplc_locations_list[selected_columns].copy()

# Pick up and add the variable from every saved dataframe
for output_filename_path, output_name in zip(list_paths_output_filenames_hplc, LIST_OUTPUT_NAMES_HPLC_MATCHUP):

    df_variable = pd.read_csv(output_filename_path)
    
    for variable_name in LIST_VARIABLES_HPLC_MATCHUP: 
    
        if variable_name in df_variable.columns:
            variable_column = df_variable[variable_name]
            print(f"Adding {variable_name} from {output_filename_path}")
            
            # Add the column to the output DataFrame, named after the output file
            df_coords_with_data[output_name] = variable_column

# Save new dataframe with all data
csvfilename = os.path.join(full_path_download_dir_hplc, "cmems_hplc_matchups.csv")
df_coords_with_data.to_csv(csvfilename)
df_coords_with_data

Adding CHL from /Users/Anna/LocalDocuments/Academic/Projects/Agile/matlab-jupyter-EBA-toolbox/data/raw/CMEMS_data/data_matchups_HPLC_CMT_csv/obs_satell_glob_cmems_olci_4km_plk_CHL.csv
Adding CHL from /Users/Anna/LocalDocuments/Academic/Projects/Agile/matlab-jupyter-EBA-toolbox/data/raw/CMEMS_data/data_matchups_HPLC_CMT_csv/obs_satell_reg_cmems_multi_1km_plk_CHL.csv
Adding CHL from /Users/Anna/LocalDocuments/Academic/Projects/Agile/matlab-jupyter-EBA-toolbox/data/raw/CMEMS_data/data_matchups_HPLC_CMT_csv/obs_satell_reg_cmems_olci_300m_plk_CHL.csv
Adding chl from /Users/Anna/LocalDocuments/Academic/Projects/Agile/matlab-jupyter-EBA-toolbox/data/raw/CMEMS_data/data_matchups_HPLC_CMT_csv/mod_bgc_reg_chl_chl.csv


Unnamed: 0,DateTime,Latitude,Longitude,Sample_depth,obs_satell_glob_cmems_olci_4km_plk,obs_satell_reg_cmems_multi_1km_plk,obs_satell_reg_cmems_olci_300m_plk,mod_bgc_reg_chl
0,2017-10-28 01:18:00,48.350783,-5.750117,4,,,,0.497997
1,2018-10-25 03:42:00,48.366280,-5.725660,6,,,,0.241997
2,2018-10-25 20:40:00,48.547920,-4.928820,6,,0.628634,,0.325996
3,2017-10-28 05:48:00,48.552300,-4.915950,4,,,,0.283997
4,2011-05-22 15:50:00,48.778933,-4.390117,5,,0.678203,,0.401997
...,...,...,...,...,...,...,...,...
669,2013-08-29 12:40:00,61.232000,-0.401000,4,,,,0.241997
670,2010-09-01 04:06:00,61.251333,1.393667,3,,,,1.073997
671,2015-08-30 14:47:00,61.285900,0.488133,4,,,,0.325996
672,2016-08-29 09:35:00,61.288283,0.500183,4,,0.803364,,0.261997


## Download data for pH matchups

In [79]:
%%time

selected_columns_ph = ['DateTime', 'Latitude_degN', 'Longitude_degE', 'Sample_depth_m']

# Loop for datasets in LIST_DATASET_IDS_PH_MATCHUP
for dataset_id, output_name in zip(LIST_DATASET_IDS_PH_MATCHUP, LIST_OUTPUT_NAMES_PH_MATCHUP):
    print("Downloading dataset: ", dataset_id)
    
    # Read dataset with CMT
    ds = copernicusmarine.open_dataset(dataset_id = dataset_id)

    # Rename the 'lon'/'lat' coordinates to 'longitude'/'latitude' where needed
    for coords in ds.coords:
        if coords=='lon':
            ds = ds.rename({'lon': 'longitude'})
        if coords=='lat':
            ds = ds.rename({'lat': 'latitude'})
            
    # Sort any inverted axis into ascending order
    ds = sort_dimension(ds, 'latitude')
    ds = sort_dimension(ds, 'longitude')
    
    # Copy the input dataframe
    df_valid = matchup_ph_locations_list[selected_columns_ph].copy()
    
    # The following code subsets the data with .sel in two steps:
    # (1) time (and depth, for 3D datasets) is selected with method="nearest", leaving a latitude x longitude slice
    # (2) the single grid point closest to the sample's latitude/longitude is selected with method="nearest"

    for variable_name in LIST_VARIABLES_PH_MATCHUP:
        
        if variable_name in ds.data_vars: # If the variable exists, do something with it
            
            print("Downloading variable: ", variable_name)

            # Download data for 3D datasets
            if "depth" in ds.dims:
                df_valid = df_valid.assign(**{
                    variable_name : [float(ds[variable_name].sel(time=row[0], depth=row[3], method='nearest')\
                        .sel(latitude=row[1], longitude=row[2], method='nearest'))\
                        for row in zip(df_valid['DateTime'], df_valid['Latitude_degN'], df_valid['Longitude_degE'], df_valid['Sample_depth_m'])]            
                })

            # Download data for 2D datasets 
            else:
                df_valid = df_valid.assign(**{
                    variable_name : [float(ds[variable_name].sel(time=row[0], method='nearest')\
                        .sel(latitude=row[1], longitude=row[2], method='nearest'))\
                        for row in zip(df_valid['DateTime'], df_valid['Latitude_degN'], df_valid['Longitude_degE'])]
            })
    
            # Add the corresponding date from the dataset (for checking purpose)
            df_valid = df_valid.assign(**{
                "Date_dataset" : [ ds.sel(time=date, method='nearest').time.values for date in df_valid['DateTime'] ]
            })

            # Save the dataframe with downloaded variable(s)
            csvfilename = os.path.join(full_path_download_dir_ph, f"{output_name}_{variable_name}.csv")
            df_valid.to_csv(csvfilename)

print("Download completed!")

Downloading dataset:  cmems_mod_nws_bgc-ph_my_7km-3D_P1D-m
INFO - 2024-04-25T08:58:55Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-04-25T08:58:55Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-04-25T08:58:56Z - Service was not specified, the default one was selected: "arco-geo-series"
Downloading variable:  ph
Download completed!
CPU times: user 21.1 s, sys: 2.76 s, total: 23.9 s
Wall time: 2min 48s
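The loop above calls `.sel` once per sample, which may account for part of the wall time. xarray also supports vectorised pointwise selection, where every indexer shares a new dimension and one value is returned per sample. The sketch below shows the idea on a small synthetic 2D dataset. The dimension name `points` and all values are made up for illustration, and whether this is faster against the remote CMEMS store has not been tested here.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a daily dataset (time x latitude x longitude)
times = pd.date_range("2018-10-24", periods=4, freq="D")
lats = np.array([48.0, 48.5, 49.0])
lons = np.array([-6.0, -5.5, -5.0])
values = np.arange(36, dtype=float).reshape(4, 3, 3)
demo_ds = xr.Dataset(
    {"CHL": (("time", "latitude", "longitude"), values)},
    coords={"time": times, "latitude": lats, "longitude": lons},
)

# Two sample locations in the same layout as the HPLC table
samples = pd.DataFrame({
    "DateTime": pd.to_datetime(["2018-10-25 03:42:00", "2018-10-25 20:40:00"]),
    "Latitude": [48.36628, 48.54792],
    "Longitude": [-5.72566, -4.92882],
})

# Indexers that share the dimension "points" pick one value per sample
# instead of the full time x latitude x longitude cross product
matched = demo_ds["CHL"].sel(
    time=xr.DataArray(samples["DateTime"].values, dims="points"),
    latitude=xr.DataArray(samples["Latitude"].values, dims="points"),
    longitude=xr.DataArray(samples["Longitude"].values, dims="points"),
    method="nearest",
)
samples["CHL"] = matched.values
print(samples)
```

For a 3D dataset, a `depth` indexer on the same `points` dimension could be added in the same call.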


### Create a matchup table for pH data

In [80]:
# Initialise a new dataframe from the input one
df_coords_with_data = matchup_ph_locations_list[selected_columns_ph].copy()

# Pick up and add the variable from every saved dataframe
for output_name in LIST_OUTPUT_NAMES_PH_MATCHUP:
    for variable_name in LIST_VARIABLES_PH_MATCHUP: 
        csvfilename = os.path.join(full_path_download_dir_ph, f"{output_name}_{variable_name}.csv")
        if os.path.exists(csvfilename):
            df_variable = pd.read_csv(csvfilename)
            variable_column = df_variable[variable_name]
            df_coords_with_data[f"{variable_name}_{output_name}"] = variable_column
        
# Save new dataframe with all data
csvfilename = os.path.join(full_path_download_dir_ph, "cmems_ph_matchups.csv")
df_coords_with_data.to_csv(csvfilename)
df_coords_with_data

Unnamed: 0,DateTime,Latitude_degN,Longitude_degE,Sample_depth_m,ph_mod_bgc_reg_ph
0,2015-03-09 08:16:00,48.077000,-4.895833,4.0,8.100006
1,2015-03-16 14:12:00,48.230833,-6.816000,4.0,8.080017
2,2015-03-10 06:12:00,48.950833,-5.444500,4.0,8.100006
3,2014-02-19 09:28:00,48.956667,-3.194167,4.0,8.100006
4,2014-02-19 09:28:00,48.956667,-3.194167,60.0,8.100006
...,...,...,...,...,...
1431,2013-08-29 07:30:00,61.222333,0.610667,160.0,8.040009
1432,2013-08-29 07:22:00,61.222333,0.610667,4.0,8.220001
1433,2012-08-31 08:57:00,61.226833,0.628000,4.0,8.200012
1434,2013-08-29 11:50:00,61.233667,-0.399167,165.0,8.040009


## Download data for ship matchups

In [21]:
%%time

selected_columns_ship = ['DateTime', 'Latitude', 'Longitude', 'Sample\nDepth\n(m)']

# Loop for datasets in LIST_DATASET_IDS_SHIP_MATCHUP
for dataset_id, output_name in zip(LIST_DATASET_IDS_SHIP_MATCHUP, LIST_OUTPUT_NAMES_SHIP_MATCHUP):
    print("Downloading dataset: ", dataset_id)
    
    # Read dataset with CMT
    ds = copernicusmarine.open_dataset(dataset_id = dataset_id)

    # Rename the 'lon'/'lat' coordinates to 'longitude'/'latitude' where needed
    for coords in ds.coords:
        if coords=='lon':
            ds = ds.rename({'lon': 'longitude'})
        if coords=='lat':
            ds = ds.rename({'lat': 'latitude'})
            
    # Sort any inverted axis into ascending order
    ds = sort_dimension(ds, 'latitude')
    ds = sort_dimension(ds, 'longitude')
    
    # Copy the input dataframe
    df_valid = matchup_ship_locations_list[selected_columns_ship].copy()
    
    # The following code subsets the data with .sel in two steps:
    # (1) time (and depth, for 3D datasets) is selected with method="nearest", leaving a latitude x longitude slice
    # (2) the single grid point closest to the sample's latitude/longitude is selected with method="nearest"

    for variable_name in LIST_VARIABLES_SHIP_MATCHUP:
        
        if variable_name in ds.data_vars: # If the variable exists, do something with it
            
            print("Downloading variable: ", variable_name)

            # Download data for 3D datasets
            if "depth" in ds.dims:
                df_valid = df_valid.assign(**{
                    variable_name : [float(ds[variable_name].sel(time=row[0], depth=row[3], method='nearest')\
                        .sel(latitude=row[1], longitude=row[2], method='nearest'))\
                        for row in zip(df_valid['DateTime'], df_valid['Latitude'], df_valid['Longitude'], df_valid['Sample\nDepth\n(m)'])]            
                })

            # Download data for 2D datasets 
            else:
                df_valid = df_valid.assign(**{
                    variable_name : [float(ds[variable_name].sel(time=row[0], method='nearest')\
                        .sel(latitude=row[1], longitude=row[2], method='nearest'))\
                        for row in zip(df_valid['DateTime'], df_valid['Latitude'], df_valid['Longitude'])]
            })
    
            # Add the corresponding date from the dataset (for checking purpose)
            df_valid = df_valid.assign(**{
                "Date_dataset" : [ ds.sel(time=date, method='nearest').time.values for date in df_valid['DateTime'] ]
            })

            # Save the dataframe with downloaded variable(s)
            csvfilename = os.path.join(full_path_download_dir_ship, f"{output_name}_{variable_name}.csv")
            df_valid.to_csv(csvfilename)

print("Download completed!")

Downloading dataset:  cmems_mod_nws_bgc-chl_my_7km-3D_P1D-m
INFO - 2024-04-28T17:48:44Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-04-28T17:48:44Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-04-28T17:48:45Z - Service was not specified, the default one was selected: "arco-geo-series"
Downloading variable:  chl
Downloading dataset:  cmems_mod_nws_bgc-kd_my_7km-3D_P1D-m
INFO - 2024-04-28T17:48:49Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-04-28T17:48:49Z - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-04-28T17:48:50Z - Service was not specified, the default one was selected: "arco-geo-series"
Downloading variable:  attn
Downloading dataset:  cmems_mod_nws_bgc-no3_my_7km-3D_P1D-m
INFO - 2024-04-28T17:48:54Z - Dataset version was not specified, the latest one was selected: "202012"
INFO - 2024-04-28T17:48:54Z - Dataset part was 

### Create a matchup table for ship data

In [22]:
# Initialise a new dataframe from the input one
df_coords_with_data = matchup_ship_locations_list[selected_columns_ship].copy()

# Pick up and add the variable from every saved dataframe
for output_name in LIST_OUTPUT_NAMES_SHIP_MATCHUP:
    for variable_name in LIST_VARIABLES_SHIP_MATCHUP: 
        csvfilename = os.path.join(full_path_download_dir_ship, f"{output_name}_{variable_name}.csv")
        if os.path.exists(csvfilename):
            df_variable = pd.read_csv(csvfilename)
            variable_column = df_variable[variable_name]
            df_coords_with_data[f"{variable_name}_{output_name}"] = variable_column
        
# Save new dataframe with all data
csvfilename = os.path.join(full_path_download_dir_ship, "cmems_ship_matchups.csv")
df_coords_with_data.to_csv(csvfilename)
df_coords_with_data

Unnamed: 0,DateTime,Latitude,Longitude,Sample\nDepth\n(m),chl_mod_bgc_reg_chl,attn_mod_bgc_reg_kd,no3_mod_bgc_reg_no3,so_mod_phy_reg_sal
0,2023-04-23,54.1823,1.1192,4,1.137997,0.171997,3.480011,34.498
1,2023-04-23,54.1986,1.107,4,1.137997,0.171997,3.480011,34.498
2,2023-04-23,54.216,1.0416,4,0.969997,0.169998,3.770004,34.473
3,2023-04-23,54.2178,1.0381,4,0.969997,0.169998,3.770004,34.473
4,2023-04-23,54.2256,1.0055,4,0.969997,0.169998,3.770004,34.473
5,2023-04-23,54.221,1.0036,4,0.969997,0.169998,3.770004,34.473
