# 1 - Request OOI water sampling data from BCO-DMO

We will import the OOI Irminger Sea CTD Cast and Discrete Water Sample Summary dataset from a comma-separated values (CSV) file on the BCO-DMO ERDDAP server.

More info about the BCO-DMO ERDDAP server can be found at: https://guide.bco-dmo.org/access-and-reuse/erddap

In [1]:
import pandas as pd

### Load OOI data into workspace
To access the data in comma-separated values (CSV) format, either:
1. Access the file directly through its URL under ERDDAP > Files (shown below).
2. Generate a URL using the [ERDDAP Data Access Form](https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_911407_v1.html).
   - The file type `.csvp` provides a single header row.
   - The form also lets you subset the data (e.g., select variables of interest) before loading it.
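
For the second option, the request URL can also be assembled by hand instead of through the form. The sketch below assumes the standard ERDDAP tabledap pattern (`<server>/tabledap/<datasetID>.<fileType>?<variables>&<constraints>`) and assumes that the ERDDAP variable names match the CSV column names; both should be checked against the Data Access Form. The helper `build_tabledap_url` is a hypothetical name introduced here for illustration.

```python
from urllib.parse import quote

# Server and dataset ID taken from the Data Access Form link above
ERDDAP_TABLEDAP = "https://erddap.bco-dmo.org/erddap/tabledap"
DATASET_ID = "bcodmo_dataset_911407_v1"


def build_tabledap_url(variables, constraints=(), file_type="csvp"):
    """Build a tabledap request URL (hypothetical helper, not part of ERDDAP)."""
    # Requested variables are a comma-separated list after the "?"
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode everything except "=" (e.g. double quotes become %22)
        query += "&" + quote(constraint, safe="=")
    return f"{ERDDAP_TABLEDAP}/{DATASET_ID}.{file_type}?{query}"


# Assumption: these variable names exist in the ERDDAP dataset as written
subset_url = build_tabledap_url(
    ["Cruise", "Cast", "CTD_Depth", "CTD_Oxygen"],
    constraints=['Cruise="KN221-04"'],
)
print(subset_url)
```

If the assumptions hold, `pd.read_csv(subset_url)` would load only the requested columns for that cruise. With `.csvp`, ERDDAP writes each header as `name (units)`, so the column names would differ from those in the full CSV file.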

In [2]:
# ERDDAP link to the public dataset CSV
erddap_url = "https://erddap.bco-dmo.org/erddap/files/bcodmo_dataset_911407_v1/911407_v1_ooi_irminger_sea_discrete_water_sampling_data.csv"

# Load the OOI data into a dataframe
ooi_irm = pd.read_csv(erddap_url)

# Call head() to check that the data was imported correctly
ooi_irm.head()

Unnamed: 0,Cruise,Station,Target_Asset,Start_Latitude,Start_Longitude,Start_Time,Cast,Cast_Flag,Bottom_Depth_at_Start_Position,CTD_File,...,Discrete_pH_Replicate_Flag,Calculated_Alkalinity,Calculated_DIC,Calculated_pCO2,Calculated_pH,Calculated_CO2aq,Calculated_Bicarb,Calculated_CO3,Calculated_Omega_C,Calculated_Omega_A
0,KN221-04,1,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,,KN22104001.hex,...,,,,,,,,,,
1,KN221-04,1,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,,KN22104001.hex,...,,,,,,,,,,
2,KN221-04,1,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,,KN22104001.hex,...,,,,,,,,,,
3,KN221-04,1,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,,KN22104001.hex,...,,,,,,,,,,
4,KN221-04,1,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,,KN22104001.hex,...,,,,,,,,,,


##### List available parameters
The dataframe has 80 columns, comprising the measured parameters and their associated quality flags.

In [3]:
# Print the list of column names
ooi_irm.columns

Index(['Cruise', 'Station', 'Target_Asset', 'Start_Latitude',
       'Start_Longitude', 'Start_Time', 'Cast', 'Cast_Flag',
       'Bottom_Depth_at_Start_Position', 'CTD_File', 'CTD_File_Flag',
       'Niskin_Bottle_Position', 'Niskin_Flag', 'CTD_Bottle_Closure_Time',
       'CTD_Pressure', 'CTD_Pressure_Flag', 'CTD_Depth', 'CTD_Latitude',
       'CTD_Longitude', 'CTD_Temperature_1', 'CTD_Temperature_1_Flag',
       'CTD_Temperature_2', 'CTD_Temperature_2_Flag', 'CTD_Conductivity_1',
       'CTD_Conductivity_1_Flag', 'CTD_Conductivity_2',
       'CTD_Conductivity_2_Flag', 'CTD_Salinity_1', 'CTD_Salinity_2',
       'CTD_Oxygen', 'CTD_Oxygen_Flag', 'CTD_Oxygen_Saturation',
       'CTD_Fluorescence', 'CTD_Fluorescence_Flag', 'CTD_Beam_Attenuation',
       'CTD_Beam_Transmission', 'CTD_Transmissometer_Flag', 'CTD_pH',
       'CTD_pH_Flag', 'Discrete_Oxygen', 'Discrete_Oxygen_Flag',
       'Discrete_Oxygen_Replicate_Flag', 'Discrete_Chlorophyll',
       'Discrete_Phaeopigment', 'Discrete_Fo_
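
To see how the columns split between parameters and quality flags, the names can be partitioned by suffix. This is a minimal sketch that assumes every quality-flag column ends in `_Flag`, which holds for the names visible above but has not been checked against the full list. `example_columns` is a small hand-picked sample of those names.

```python
# Small sample of column names copied from the listing above
example_columns = [
    "Cruise", "Cast", "Cast_Flag",
    "CTD_Pressure", "CTD_Pressure_Flag",
    "CTD_Oxygen", "CTD_Oxygen_Flag",
    "Discrete_Oxygen", "Discrete_Oxygen_Flag",
    "Discrete_Oxygen_Replicate_Flag",
]

# Assumption: quality-flag columns are exactly those ending in "_Flag"
flag_columns = [col for col in example_columns if col.endswith("_Flag")]
data_columns = [col for col in example_columns if not col.endswith("_Flag")]

print(len(data_columns), "parameter columns:", data_columns)
print(len(flag_columns), "flag columns:", flag_columns)
```

In the notebook, replacing `example_columns` with `ooi_irm.columns` would apply the same split to the full dataframe.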

##### Subset available data
Next, we select the subset of columns that hold either cruise and water-sampling metadata or one of the biogeochemical parameters of interest: temperature, salinity, nutrients, oxygen, pH, chlorophyll, and fluorescence.

In [5]:
# Select a subset of parameters from the summary dataset using the column names printed above.
# Parameter definitions are listed near the bottom of the dataset page under
# "More Information about this dataset" > "Parameters".

# Note: several entries are shortened strings (e.g., "Fluor") that are matched
# as substrings of the full column names in the next cell.
subset_vars = [
    "Cruise", "Target_Asset", "Start_Latitude",
    "Start_Longitude", "Start_Time", "Cast",
    "CTD_Bottle_Closure_Time", "CTD_Latitude",
    "CTD_Longitude", "CTD_Pressure", "CTD_Depth",
    "pH", "Nitrate", "Nutrients", "Fluor",
    "Temperature", "Salinity", "Oxygen",
    "Chlorophyll", "Phosphate"
]

In [19]:
# Collect every column whose name contains one of the strings in subset_vars
# (cruise/sampling metadata and the BGC variables of interest).
# Matching is case-sensitive.
var_columns = []
for var in subset_vars:
    columns = [col for col in ooi_irm.columns if var in col]
    var_columns.extend(columns)

# Keep only the selected columns
subset_irm = ooi_irm[var_columns]
subset_irm.head()

Unnamed: 0,Cruise,Target_Asset,Start_Latitude,Start_Longitude,Start_Time,Cast,Cast_Flag,CTD_Bottle_Closure_Time,CTD_Latitude,CTD_Longitude,...,Discrete_Salinity_Flag,Discrete_Salinity_Replicate_Flag,CTD_Oxygen,CTD_Oxygen_Flag,CTD_Oxygen_Saturation,Discrete_Oxygen,Discrete_Oxygen_Flag,Discrete_Oxygen_Replicate_Flag,Discrete_Chlorophyll,Discrete_Phosphate
0,KN221-04,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,2014-09-08T12:07:58.000Z,62.10702,-31.38174,...,,,6.3251,*0000000000000100,7.30825,,,,,
1,KN221-04,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,2014-09-08T12:08:10.000Z,62.10702,-31.38176,...,,,6.3326,*0000000000000100,7.30829,,,,,
2,KN221-04,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,2014-09-08T12:08:22.000Z,62.10702,-31.38176,...,,,6.3246,*0000000000000100,7.30832,,,,,
3,KN221-04,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,2014-09-08T12:08:30.000Z,62.10702,-31.38176,...,,,6.2751,*0000000000000100,7.30832,,,,,
4,KN221-04,Test Site #1,62.107,-31.381667,2014-09-08T11:39:06.000Z,1,*0000000000000100,2014-09-08T12:08:38.000Z,62.10702,-31.38174,...,,,6.2838,*0000000000000100,7.30835,,,,,
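
The substring matching used in the cell above can be sketched on a small list of column names. Two behaviours are worth noting: the `in` test is case-sensitive, and a column that matches more than one search string would be selected twice. The helper `match_columns` is a hypothetical name, and the overlapping search strings `"Oxygen"` and `"Discrete_Oxygen"` are chosen only to show the duplicate case; they are not part of `subset_vars`.

```python
def match_columns(columns, substrings):
    """Return columns containing any substring, in match order, without repeats."""
    matched = []
    for sub in substrings:
        matched.extend(col for col in columns if sub in col)
    # dict.fromkeys keeps the first occurrence of each name and preserves order
    return list(dict.fromkeys(matched))


# Sample of column names copied from the listing earlier in this notebook
example_columns = [
    "Cruise", "Cast", "Cast_Flag",
    "CTD_Oxygen", "CTD_Oxygen_Flag", "Discrete_Oxygen",
    "CTD_pH", "Discrete_Phaeopigment",
]

selected = match_columns(
    example_columns, ["Cast", "Oxygen", "Discrete_Oxygen", "pH"]
)
print(selected)
```

Here `"pH"` selects `CTD_pH` but not `Discrete_Phaeopigment`, and `Discrete_Oxygen` appears once even though two search strings match it.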


##### Save the subset dataset

In [17]:
subset_irm.to_csv("../data/interim/irminger_sea_subset.csv", index=False)

**Note**: The interim file can be used in subsequent notebooks.
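
Writing to `../data/interim/` fails if that directory does not exist yet. A minimal sketch of one way to guard against this, assuming it is acceptable to create the folder on the fly, is shown below. `ensure_parent_dir` is a hypothetical helper, and the demonstration writes a two-line placeholder file into a temporary directory rather than the real project path.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory, so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = ensure_parent_dir(
        Path(tmp_dir) / "data" / "interim" / "irminger_sea_subset.csv"
    )
    demo_path.write_text("Cruise,Cast\nKN221-04,1\n")
    parent_created = demo_path.parent.is_dir()
    file_written = demo_path.is_file()

print(parent_created, file_written)
```

In the notebook, the same helper could wrap the output path: `subset_irm.to_csv(ensure_parent_dir("../data/interim/irminger_sea_subset.csv"), index=False)`.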