# Data Collection
---

For this project, I'm building a model to identify periods of coastal upwelling off the coast of Oregon using data collected by the Ocean Observatories Initiative (OOI). I intend to use environmental variables, such as seawater temperature, salinity, and dissolved oxygen, as features in a classification model, and I'll be labeling my target variable using the CUTI upwelling index. The OOI has several instrument packages off the Washington and Oregon coasts; for this project, I'll be focusing on the Oregon Offshore site, located offshore from Newport, Oregon. The instrument packages found here include a surface mooring that has a bulk meteorology package, a shallow profiler that collects data in the upper ~200 meters of the water column, a stationary platform located at a depth of 200 meters, and a deep profiler that collects data in the lower portion of the water column.

In [1]:
# Imports
import numpy as np
import sys, os
import xarray as xr
import pandas as pd
import cmocean.cm as cmo
import requests
import re
import datetime as dt
import seaborn as sns

from netCDF4 import Dataset, num2date, date2num 
from datetime import datetime, timedelta
from numpy import datetime64 as dt64, timedelta64 as td64
from matplotlib import pyplot as plt

### OOI API

In order to run this notebook, you'll need to set up an account with the OOI and get a username and temporary token to use for data requests. You can do this here: https://ooinet.oceanobservatories.org/.

Once you've made an account, copy and paste your username and token into the cell below. 

In [2]:
# enter your OOI API username and token 
API_USERNAME = 'OOIAPI-xx'          # this will be similar to U6ZIZ5UNB1LIMA
API_TOKEN = 'xx'                    # this will be similar to VUO6PXYMNLE

Make sure you don't upload your API username and token combination to a public repository! If you accidentally do, you can go to the OOI website and get a new token - do this as soon as possible to prevent your credentials being used without your consent.
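
One way to lower the risk of committing credentials is to keep them out of the notebook entirely and read them from environment variables. The sketch below is only an illustration: `load_ooi_credentials` is a hypothetical helper, and the variable names `OOI_API_USERNAME` and `OOI_API_TOKEN` are my own choice, not something the OOI defines.

```python
import os

def load_ooi_credentials(env=os.environ):
    """Return (username, token) read from environment variables.

    OOI_API_USERNAME and OOI_API_TOKEN are hypothetical names chosen
    for this example; set them in your shell before starting Jupyter.
    """
    username = env.get('OOI_API_USERNAME')
    token = env.get('OOI_API_TOKEN')
    if not username or not token:
        raise RuntimeError('Set OOI_API_USERNAME and OOI_API_TOKEN first.')
    return username, token
```

With that in place, the credentials cell could become `API_USERNAME, API_TOKEN = load_ooi_credentials()`, and nothing secret is stored in the notebook file.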

---
### Create output directory

Set up an output directory to store the data pulled by this notebook - these files are fairly large, so they won't be saved to this repository. Instead, they'll be stored in a directory called `coastal_upwelling_output` that will be parallel to this repository on your local machine.

In [3]:
parent_dir = os.path.dirname(os.getcwd())
grandparent_dir = os.path.dirname(parent_dir)
output_dir = os.path.join(grandparent_dir, 'coastal_upwelling_output')

# Create the directory if it doesn't already exist
os.makedirs(output_dir, exist_ok=True)

print(f'Data will be stored in {output_dir}.')

Data will be stored in C:\Users\Derya\Documents\GitHub\coastal_upwelling_output.


---
### Pull data 

Start by looking at just a small selection of the data available:
* pull data from the Oregon Offshore location (CE04)
* use the surface mooring, 200m platform, and shallow profiler
* I originally planned to start with March-June 2017, but ended up pulling data for all of 2017
* 2017 had poor continuity for the shallow profiler, so I also pulled data for all of 2018

The following two functions were provided by the OOI for requesting and downloading data.

`request_data` takes your API username and temporary token and submits a request for data to the OOI. This function returns the URL where your requested data will be stored, but the URL is not populated right away because these requests take time, especially if you request an entire year's worth of data! If you pass these URLs to the `get_data` function right away, you might get nothing but errors because the data isn't ready yet. When it is ready, you'll get an email notification containing the same URL that `request_data` returned. Then you'll know it's time to run the next function!

The URLs don't expire, so you can keep using them if you fetch the data but don't save it to your machine (though I highly recommend saving it locally). I've saved all the data requests I've made in the file `data_urls.txt` for use again later.
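
If you want to keep a similar log of your own requests, a small pair of helpers is enough. This is a minimal sketch and not the code I used: `save_url` and `load_urls` are hypothetical names, and the tab-separated `label<TAB>url` layout is an assumption made for the example rather than the actual layout of `data_urls.txt`.

```python
def save_url(path, label, url):
    """Append one 'label<TAB>url' line to a plain-text log file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(f'{label}\t{url}\n')

def load_urls(path):
    """Read the log back into a {label: url} dictionary."""
    urls = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            label, url = line.rstrip('\n').split('\t', 1)
            urls[label] = url
    return urls
```

Calling `save_url('my_urls.txt', 'METBK_2017', METBK_url)` right after each request means the URL survives even if the notebook kernel is restarted.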

In [4]:
def request_data(reference_designator, method, stream, start_date=None, end_date=None):
    site = reference_designator[:8]
    node = reference_designator[9:14]
    instrument = reference_designator[15:]

    # Create the request URL
    api_base_url = 'https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv'
    data_request_url = '/'.join((api_base_url, site, node, instrument, method, stream))
    print(data_request_url)
    # All of the following are optional, but you should specify a date range
    params = {
        'format': 'application/netcdf',
        'include_provenance': 'true',
        'include_annotations': 'true'
    }
    if start_date:
        params['beginDT'] = start_date
    if end_date:
        params['endDT'] = end_date

    # Make the data request
    r = requests.get(data_request_url, params=params, auth=(API_USERNAME, API_TOKEN))
    r.raise_for_status()  # stop with a clear error if the request was rejected
    data = r.json()

    # Return just the THREDDS URL
    return data['allURLs'][0]
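
The slicing at the top of `request_data` relies on the site always being 8 characters long and the node 5. A slightly more forgiving way to get the same three pieces is to split on the first two hyphens only. This is a sketch with a hypothetical helper name, not part of the OOI-provided function.

```python
def split_reference_designator(reference_designator):
    """Split 'SITE-NODE-INSTRUMENT' on the first two hyphens only,
    so the hyphen inside the instrument part is left alone."""
    site, node, instrument = reference_designator.split('-', 2)
    return site, node, instrument

print(split_reference_designator('CE04OSSM-SBD11-06-METBKA000'))
# ('CE04OSSM', 'SBD11', '06-METBKA000')
```

For the reference designators used in this notebook, both approaches give the same result.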

`get_data` takes a URL returned by `request_data`, finds the `.nc` entries listed in that catalog, and opens each one through the OPeNDAP server. These files are in the standard NetCDF format and are opened with xarray, but the function returns the combined data as a pandas dataframe, along with a dictionary of each variable's units. Running `get_data` can take a while if you are getting a lot of data at once. 

In [5]:
def get_data(url, variables, deployments=None):
    # Function to grab all data from specified directory
    tds_url = 'https://opendap.oceanobservatories.org/thredds/dodsC'
    dataset = requests.get(url).text
    ii = re.findall(r'href=[\'"]?([^\'" >]+)', dataset)
    # x = re.findall(r'(ooi/.*?.nc)', dataset)
    # keep only the .nc data files, whose names end in a digit before the extension
    # (filtering in one pass avoids removing items from a list while looping over it)
    x = [y for y in ii if y.endswith('.nc') and y[-4:-3].isdigit()]
    # dataset = [os.path.join(tds_url, i) for i in x]
    datasets = [os.path.join(tds_url, i.split('=')[-1]).replace("\\","/") for i in x]

    # remove deployments not in deployment list, if given
    if deployments is not None:
        deploy = ['deployment{:04d}'.format(j) for j in deployments]
        datasets = [k for k in datasets if k.split('/')[-1].split('_')[0] in deploy]

    # remove collocated data files if necessary
    catalog_rms = url.split('/')[-2][20:]
    selected_datasets = []
    for d in datasets:
        if catalog_rms == d.split('/')[-1].split('_20')[0][15:]:
            selected_datasets.append(d)

    # create a dictionary to populate with data from the selected datasets
    data_dict = {'time': np.array([], dtype='datetime64[ns]')}
    unit_dict = {}
    for v in variables:
        data_dict.update({v: np.array([])})
        unit_dict.update({v: []})
    print('Appending data from files')

    for sd in selected_datasets:
        try:
            url_with_fillmismatch = f'{sd}#fillmismatch'  # I had to add this line to get the function to work
            ds = xr.open_dataset(url_with_fillmismatch, mask_and_scale=False)
            data_dict['time'] = np.append(data_dict['time'], ds['time'].values)
            for var in variables:
                data_dict[var] = np.append(data_dict[var], ds[var].values)
                units = ds[var].units
                if units not in unit_dict[var]:
                    unit_dict[var].append(units)
        except Exception:
            # skip any file that fails to open or is missing a variable
            pass

    # convert dictionary to a dataframe
    df = pd.DataFrame(data_dict)
    df.sort_values(by=['time'], inplace=True)  # make sure the timestamps are in ascending order

    return df, unit_dict
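
The "remove collocated data files" step in `get_data` is easier to follow in isolation. The sketch below reproduces the same string comparison in a hypothetical helper, `is_primary_file`, using a catalog URL and two file names that appear in this notebook. The character offsets (20 and 15) are taken from the function above and assume the naming pattern shown in the comments.

```python
def is_primary_file(catalog_url, file_name):
    """True if file_name belongs to the instrument the catalog was requested for."""
    # Catalog folder: '<timestamp>Z-<refdes>-<method>-<stream>'.
    # The timestamp plus its trailing hyphen is 20 characters long.
    requested = catalog_url.split('/')[-2][20:]
    # File name: 'deploymentNNNN_<refdes>-<method>-<stream>_<dates>.nc'.
    # 'deploymentNNNN_' is 15 characters and the dates start with '_20'.
    found = file_name.split('_20')[0][15:]
    return requested == found

metbk_catalog = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210422T030752259Z-CE04OSSM-SBD11-06-METBKA000-recovered_host-metbk_a_dcl_instrument_recovered/catalog.html'
profiler_catalog = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210422T030848056Z-CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample/catalog.html'

velpt_file = 'deployment0006_CE04OSSM-SBD11-04-VELPTA000-recovered_host-velpt_ab_dcl_instrument_recovered_20180403T183000-20180403T183000.nc'
ctd_file = 'deployment0004_CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample_20170801T160709.510843-20170916T121340.481090.nc'

print(is_primary_file(metbk_catalog, velpt_file))   # False: a velocity file stored alongside the METBK data
print(is_primary_file(profiler_catalog, ctd_file))  # True: the CTD file matches the requested stream
```

Files from other instruments can sit in the same catalog folder, and this comparison is what keeps them out of the final dataframe.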

You can uncomment the three cells below and run the requests, but you'll need to have entered your own API credentials near the start of the notebook. You only need to run the requests once, because the resulting URLs don't expire. However, requesting a full year's worth of data takes several minutes! The cells below will output a URL right away, but the `get_data()` function won't work until the request is actually fulfilled - you'll get an email from the OOI when your request is completed, and then you'll be able to continue.
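
If you'd rather not watch your inbox, you could poll the catalog page until data files show up. This is only a sketch under an assumption I haven't verified: that an unfinished request shows up as a page with no `.nc` links. `wait_until_ready` is a hypothetical helper, and `fetch` is any function that returns the page HTML for a URL, for example `lambda u: requests.get(u).text`.

```python
import re
import time

def wait_until_ready(url, fetch, interval=60, max_tries=30, sleep=time.sleep):
    """Poll a catalog page until it lists at least one .nc file.

    Returns True as soon as a .nc link appears, or False if
    max_tries checks pass without one.
    """
    for attempt in range(max_tries):
        html = fetch(url)
        # same link pattern that get_data uses
        links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
        if any(link.endswith('.nc') for link in links):
            return True
        if attempt < max_tries - 1:
            sleep(interval)  # wait before checking again
    return False
```

Passing `fetch` and `sleep` in as arguments keeps the helper easy to try out with canned HTML before pointing it at the real server.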

In [6]:
# Request data from the bulk meteorology package on the surface mooring

# METBK_url = request_data('CE04OSSM-SBD11-06-METBKA000', 'recovered_host', 
#                          'metbk_a_dcl_instrument_recovered',
#                          '2017-01-01T00:00:00.000Z', '2017-12-31T12:00:00.000Z')
# print('METBK_url: %s' %METBK_url)

In [7]:
# Request data from the CTD-O on the shallow profiler

# profiler_url = request_data('CE04OSPS-SF01B-2A-CTDPFA107', 'streamed', 'ctdpf_sbe43_sample',
#                         '2017-01-01T00:00:00.000Z', '2017-12-31T12:00:00.000Z')
# print('profiler_url: %s' %profiler_url)

In [8]:
# Request data from the CTD-O on the 200 meter platform

# platform_url = request_data('CE04OSPS-PC01B-4A-CTDPFA109', 'streamed', 
#                          'ctdpf_optode_sample',
#                          '2017-01-01T00:00:00.000Z', '2017-12-31T12:00:00.000Z')
# print('platform_url: %s' %platform_url)

Since I used my own credentials to get these URLs, I'm not sure they'll work for you. You may need to enter your own credentials, run the `request_data()` cells above, and replace the URLs below with the output.

Here are three URLs that have data for the year 2017. We can use these to load in data files. Putting these URLs into your browser window will bring you to the OPeNDAP server where you can see variable names and descriptions. There are a lot of folders to navigate through, but [here](https://opendap.oceanobservatories.org/thredds/dodsC/ooi/deryag@uw.edu/20210422T030848056Z-CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample/deployment0004_CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample_20170801T160709.510843-20170916T121340.481090.nc.html) is an example of the CTD data, and [here](https://opendap.oceanobservatories.org/thredds/dodsC/ooi/deryag@uw.edu/20210422T030752259Z-CE04OSSM-SBD11-06-METBKA000-recovered_host-metbk_a_dcl_instrument_recovered/deployment0006_CE04OSSM-SBD11-04-VELPTA000-recovered_host-velpt_ab_dcl_instrument_recovered_20180403T183000-20180403T183000.nc.html) is an example of the METBK data. You can navigate to these examples by using the URLs below, selecting a .nc folder, and then clicking on the OPeNDAP link. 

In [9]:
METBK_2017_url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210422T030752259Z-CE04OSSM-SBD11-06-METBKA000-recovered_host-metbk_a_dcl_instrument_recovered/catalog.html'
profiler_2017_url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210422T030848056Z-CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample/catalog.html'
platform_2017_url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210428T021551666Z-CE04OSPS-PC01B-4A-CTDPFA109-streamed-ctdpf_optode_sample/catalog.html'

### Get 2017 data

Time to actually get the data! The `get_data` function returns a pandas dataframe, so if you'd rather work in xarray you can either convert the resulting dataframe or alter `get_data` to return an xarray dataset instead. 

In [10]:
# Specify the variable(s) of interest
METBK_2017_var = ['sea_surface_temperature', 'met_windavg_mag_corr_east', 'met_windavg_mag_corr_north']
profiler_2017_var = ['seawater_pressure', 'density', 'practical_salinity', 'seawater_temperature', 'corrected_dissolved_oxygen']
platform_2017_var = ['seawater_pressure', 'density', 'practical_salinity', 'seawater_temperature', 'dissolved_oxygen']

The cell below takes a few minutes to run because the datasets we're getting from the OOI are quite large!

In [11]:
# Get the data! 
METBK_2017_data, METBK_2017_units = get_data(METBK_2017_url, METBK_2017_var)
profiler_2017_data, profiler_2017_units = get_data(profiler_2017_url, profiler_2017_var)
platform_2017_data, platform_2017_units = get_data(platform_2017_url, platform_2017_var)

# Check the variable units
print(METBK_2017_units)
print(profiler_2017_units)
print(platform_2017_units)

Appending data from files
Appending data from files
Appending data from files
{'sea_surface_temperature': ['ºC'], 'met_windavg_mag_corr_east': ['m s-1'], 'met_windavg_mag_corr_north': ['m s-1']}
{'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'corrected_dissolved_oxygen': ['µmol kg-1']}
{'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'dissolved_oxygen': ['µmol kg-1']}


In [12]:
# Save the unit dictionaries above in case you accidentally overwrite the output: 

METBK_2017_units = {'sea_surface_temperature': ['ºC'], 'met_windavg_mag_corr_east': ['m s-1'], 'met_windavg_mag_corr_north': ['m s-1']}
profiler_2017_units = {'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'corrected_dissolved_oxygen': ['µmol kg-1']}
platform_2017_units = {'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'dissolved_oxygen': ['µmol kg-1']}

Save these data files as `.csv`s so we can use them in the rest of the notebooks. This will take a few minutes! 

In [13]:
# Save 2017 dataframes to the output folder parallel to this GitHub repo 

METBK_2017_data.to_csv(os.path.join(output_dir, 'metbk_data_2017.csv'), index=False)
profiler_2017_data.to_csv(os.path.join(output_dir, 'profiler_data_2017.csv'), index=False)
platform_2017_data.to_csv(os.path.join(output_dir, 'platform_data_2017.csv'), index=False)
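
One thing to keep in mind when these files are read back in later notebooks: a CSV stores timestamps as plain text, so the `time` column has to be parsed again on load. The example below shows the difference on a tiny made-up dataframe written to an in-memory buffer rather than to disk; the temperature values are invented for the demonstration.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2017-01-01 00:00:00', '2017-01-01 00:01:00']),
    'sea_surface_temperature': [10.1, 10.2],  # made-up values
})

# Write to an in-memory buffer the same way the cell above writes to disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Without parse_dates the time column comes back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates it comes back as timestamps
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['time'])

print(pd.api.types.is_datetime64_any_dtype(plain['time']))
print(pd.api.types.is_datetime64_any_dtype(parsed['time']))
```

The same `parse_dates=['time']` argument works when reading the saved files from the output directory.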

---
### Get 2018 data

The data availability in 2017 wasn't very good for the shallow profiler (it spent quite a number of months stuck near 200 meters), so I want to pull in the 2018 data to see if it's any better. The code below is all the same as the code above - the only differences are the dates that I used in the data requests. 

In [15]:
# METBK_url = request_data('CE04OSSM-SBD11-06-METBKA000', 'recovered_host', 
#                          'metbk_a_dcl_instrument_recovered',
#                          '2018-01-01T00:00:00.000Z', '2018-12-31T12:00:00.000Z')
# print('METBK_url: %s' %METBK_url)

https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv/CE04OSSM/SBD11/06-METBKA000/recovered_host/metbk_a_dcl_instrument_recovered
METBK_url: https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210502T005210982Z-CE04OSSM-SBD11-06-METBKA000-recovered_host-metbk_a_dcl_instrument_recovered/catalog.html


In [16]:
# profiler_url = request_data('CE04OSPS-SF01B-2A-CTDPFA107', 'streamed', 'ctdpf_sbe43_sample',
#                         '2018-01-01T00:00:00.000Z', '2018-12-31T12:00:00.000Z')

# print('profiler_url: %s' %profiler_url)

https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv/CE04OSPS/SF01B/2A-CTDPFA107/streamed/ctdpf_sbe43_sample
profiler_url: https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210502T005211652Z-CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample/catalog.html


In [17]:
# platform_url = request_data('CE04OSPS-PC01B-4A-CTDPFA109', 'streamed', 
#                          'ctdpf_optode_sample',
#                          '2018-01-01T00:00:00.000Z', '2018-12-31T12:00:00.000Z')
# print('platform_url: %s' %platform_url)

https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv/CE04OSPS/PC01B/4A-CTDPFA109/streamed/ctdpf_optode_sample
platform_url: https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210502T005215562Z-CE04OSPS-PC01B-4A-CTDPFA109-streamed-ctdpf_optode_sample/catalog.html


Again, all of these URLs and their associated `request_data` inputs are saved in the `data_urls.txt` file in the repo in case you lose them.

In [16]:
METBK_2018_url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210502T005210982Z-CE04OSSM-SBD11-06-METBKA000-recovered_host-metbk_a_dcl_instrument_recovered/catalog.html'
profiler_2018_url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210502T005211652Z-CE04OSPS-SF01B-2A-CTDPFA107-streamed-ctdpf_sbe43_sample/catalog.html'
platform_2018_url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/deryag@uw.edu/20210502T005215562Z-CE04OSPS-PC01B-4A-CTDPFA109-streamed-ctdpf_optode_sample/catalog.html'

In [17]:
# Specify the variable(s) of interest
METBK_2018_var = ['sea_surface_temperature', 'met_windavg_mag_corr_east', 'met_windavg_mag_corr_north']
profiler_2018_var = ['seawater_pressure', 'density', 'practical_salinity', 'seawater_temperature', 'corrected_dissolved_oxygen']
platform_2018_var = ['seawater_pressure', 'density', 'practical_salinity', 'seawater_temperature', 'dissolved_oxygen']

One of the platform `.nc` files raised an error inside `get_data()`, so I went back and wrapped the file-loading loop in a try/except block. Any file that fails to load is now skipped silently, which means the platform data collected by this code may not be all of the data available. I haven't yet tracked down what causes the error.
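
To find out which file is failing and why, the loop could record each failure instead of discarding it. Here is a minimal sketch of that pattern, separate from `get_data` itself: `load_all` is a hypothetical helper, and `loader` stands in for whatever opens one file (in `get_data` that would be the `xr.open_dataset` call).

```python
def load_all(sources, loader):
    """Apply loader to each source, collecting failures instead of hiding them.

    Returns (results, failures), where failures is a list of
    (source, 'ErrorType: message') tuples.
    """
    results, failures = [], []
    for source in sources:
        try:
            results.append(loader(source))
        except Exception as err:
            failures.append((source, f'{type(err).__name__}: {err}'))
    return results, failures

# Toy example: a loader that fails on one input
def toy_loader(name):
    if name == 'bad.nc':
        raise ValueError('could not decode file')
    return name.upper()

results, failures = load_all(['a.nc', 'bad.nc', 'b.nc'], toy_loader)
print(results)   # ['A.NC', 'B.NC']
print(failures)  # [('bad.nc', 'ValueError: could not decode file')]
```

Printing `failures` after a run shows how much data was skipped and gives a starting point for debugging.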

In [18]:
# get the data! 
METBK_data_2018, METBK_2018_units = get_data(METBK_2018_url, METBK_2018_var)
profiler_data_2018, profiler_2018_units = get_data(profiler_2018_url, profiler_2018_var)
platform_data_2018, platform_2018_units = get_data(platform_2018_url, platform_2018_var)

# check the variable units
print(METBK_2018_units)
print(profiler_2018_units)
print(platform_2018_units)

Appending data from files
Appending data from files
Appending data from files
{'sea_surface_temperature': ['ºC'], 'met_windavg_mag_corr_east': ['m s-1'], 'met_windavg_mag_corr_north': ['m s-1']}
{'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'corrected_dissolved_oxygen': ['µmol kg-1']}
{'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'dissolved_oxygen': ['µmol kg-1']}


In [24]:
# Save the unit dictionaries above in case you accidentally overwrite the output: 

METBK_2018_units = {'sea_surface_temperature': ['ºC'], 'met_windavg_mag_corr_east': ['m s-1'], 'met_windavg_mag_corr_north': ['m s-1']}
profiler_2018_units = {'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'corrected_dissolved_oxygen': ['µmol kg-1']}
platform_2018_units = {'seawater_pressure': ['dbar'], 'density': ['kg m-3'], 'practical_salinity': ['1'], 'seawater_temperature': ['ºC'], 'dissolved_oxygen': ['µmol kg-1']}

In [19]:
METBK_data_2018

Unnamed: 0,time,sea_surface_temperature,met_windavg_mag_corr_east,met_windavg_mag_corr_north
122325,2018-04-03 18:26:39.426999808,10.135,-1.700006,-1.983426
122326,2018-04-03 18:27:43.659999744,10.136,-2.854736,-2.457026
122327,2018-04-03 18:28:48.444000256,10.141,-1.223021,-2.973268
122328,2018-04-03 18:29:52.679000064,10.141,-2.237504,-2.168889
122329,2018-04-03 18:30:27.203999744,10.136,-1.361928,-2.531334
...,...,...,...,...
122320,2018-12-31 09:35:21.017000448,11.750,2.496349,-6.370490
122321,2018-12-31 09:36:26.114000384,11.751,1.183581,-7.435216
122322,2018-12-31 09:37:30.170999808,11.758,2.250494,-7.588503
122323,2018-12-31 09:38:37.085000192,11.753,2.812786,-6.963421


In [20]:
profiler_data_2018

Unnamed: 0,time,seawater_pressure,density,practical_salinity,seawater_temperature,corrected_dissolved_oxygen
6733799,2018-07-17 15:38:50.061576192,81.279138,1026.516091,33.675824,8.613400,92.728326
6733800,2018-07-17 15:38:51.061581824,81.182077,1026.515845,33.675754,8.611768,92.708655
6733801,2018-07-17 15:38:52.061271552,81.084999,1026.515921,33.676071,8.610009,92.688996
6733802,2018-07-17 15:38:53.061588992,80.987921,1026.517571,33.678387,8.608188,92.679191
6733803,2018-07-17 15:38:54.061177344,80.889759,1026.518862,33.680522,8.607749,92.633305
...,...,...,...,...,...,...
475185,2018-12-31 11:59:55.458120192,61.875097,1025.117886,32.660133,11.666675,258.166072
475186,2018-12-31 11:59:56.458231296,61.828144,1025.117640,32.660090,11.666675,258.112745
475187,2018-12-31 11:59:57.458030592,61.794006,1025.117535,32.660169,11.666744,258.100726
475188,2018-12-31 11:59:58.457932288,61.756694,1025.117198,32.660002,11.666952,258.181654


In [21]:
platform_data_2018

Unnamed: 0,time,seawater_pressure,density,practical_salinity,seawater_temperature,dissolved_oxygen
8142569,2018-07-17 15:16:20.566366208,196.741005,1027.523294,33.981238,7.006586,83.921237
8142570,2018-07-17 15:16:21.565952512,196.743134,1027.522706,33.980552,7.007007,83.928797
8142571,2018-07-17 15:16:22.564911616,196.737798,1027.522188,33.980022,7.007548,83.935476
8142572,2018-07-17 15:16:23.566372352,196.736736,1027.521963,33.979785,7.007788,83.963296
8142573,2018-07-17 15:16:24.565125632,196.732468,1027.521513,33.979280,7.008028,84.000929
...,...,...,...,...,...,...
388786,2018-12-31 11:59:55.363662848,197.703089,1027.157078,33.863666,8.853098,113.586892
388787,2018-12-31 11:59:56.363566080,197.698820,1027.157456,33.863887,8.851699,113.587973
388788,2018-12-31 11:59:57.363363840,197.698817,1027.158236,33.864546,8.850046,113.583926
388789,2018-12-31 11:59:58.363266048,197.696683,1027.159124,33.865642,8.849791,113.620820


Looking at the start and end dates in the dataframes displayed above, none of these instrument packages covered the full year of 2018. How unfortunate! I think the best bet will be to build a model with the 2017 data first, and then come back to the 2018 data afterwards to see whether it can add anything.
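
Rather than reading coverage off the dataframe display, the same check can be done in code. The helper below is a rough sketch (`coverage_summary` is a hypothetical name) that reports the first and last timestamps and the longest gap between consecutive samples. The timestamps in the example are invented.

```python
import pandas as pd

def coverage_summary(df, time_col='time'):
    """Return the first/last timestamp and the longest gap between samples."""
    times = pd.to_datetime(df[time_col]).sort_values()
    gaps = times.diff()  # the first entry is NaT and is ignored by max()
    return {
        'start': times.iloc[0],
        'end': times.iloc[-1],
        'longest_gap': gaps.max(),
    }

# Invented timestamps, just to show the output shape
demo = pd.DataFrame({'time': ['2018-04-03', '2018-04-04', '2018-04-10']})
print(coverage_summary(demo))
```

Running this on each of the 2017 and 2018 dataframes would give a quick way to compare how continuous each record is.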

Save these data files as `.csv`s so we can use them in the rest of the notebooks.

In [22]:
# Save 2018 dataframes to the output folder parallel to this GitHub repo 

METBK_data_2018.to_csv(os.path.join(output_dir, 'metbk_data_2018.csv'), index=False)
profiler_data_2018.to_csv(os.path.join(output_dir, 'profiler_data_2018.csv'), index=False)
platform_data_2018.to_csv(os.path.join(output_dir, 'platform_data_2018.csv'), index=False)