# Request and Download OOI Data

**Purpose:** This notebook creates requests for data and QARTOD QC test results that are available from OOINet and from the OOI dev1 server. QC tests associated with datasets from OOINet have already been implemented in production by the Data Team. The dev1 server is where datasets with results of QARTOD tests in development are hosted. Access to dev1 is restricted to OOI personnel on the internal network.

The requests built below include the retrieval method, data stream, and either the reference designator or site, node, and sensor combination for a specific instrument to request data through the OOI M2M API. The requested datasets can also be limited to a time period defined by start datetime and end datetime parameters.
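
As a minimal sketch of how these pieces relate: a reference designator is the site, node, and sensor codes joined by hyphens, so it can be split and rejoined with plain string methods. The helper names `split_refdes` and `dataset_name` are illustrative and not part of any OOI library; the joined name follows the pattern of the dataset names that appear in the THREDDS URLs in this notebook.

```python
def split_refdes(refdes):
    """Split a reference designator into (site, node, sensor)."""
    # maxsplit=2 keeps the hyphen inside the sensor code, e.g. "03-CTDBPC000"
    site, node, sensor = refdes.split("-", 2)
    return site, node, sensor


def dataset_name(site, node, sensor, method, stream):
    """Join the request components in the order used by THREDDS dataset names."""
    return "-".join((site, node, sensor, method, stream))


site, node, sensor = split_refdes("CP03ISSM-RID27-03-CTDBPC000")
print(site, node, sensor)
print(dataset_name(site, node, sensor, "recovered_inst", "ctdbp_cdef_instrument_recovered"))
```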

After downloading the datasets and performing preprocessing to prepare the data for analysis, the datasets are saved locally to an interim data folder for the next step in testing and analyzing QARTOD test results.

### Import modules used in this notebook

In [1]:
# Import libraries available from main conda channels or conda-forge
import io
import os
import re
import warnings

import requests
import xarray as xr
from tqdm import tqdm

warnings.filterwarnings("ignore")

# Import functions from ooi-data-explorations library
from ooi_data_explorations import common

# Import OOINet library
# SESSION is taken from ooinet.M2M (ooi_data_explorations.common also provides one)
from ooinet import M2M
from ooinet.M2M import AUTH, SESSION
from ooinet.Instrument.common import process_file

# Import qartod_testing project modules
import qartod_testing.data_processing as dp

### QARTOD in Production: Request data from the OOINet THREDDS catalog

The next four subsections record different attempts at requesting data from OOINet.
Downloading data with `M2M.download_netCDF_files()` succeeded for a couple of PHSEN instruments, although it does no preprocessing before saving the datasets. For other datasets, the download usually fails with a file-or-directory-not-found error for the local directory where the data should be written.

##### Define data parameters

In [2]:
# Setup parameters needed to request data
refdes = "GA01SUMO-RII11-02-CTDBPP032"              # Global Argentine Basin Array - Apex Surface Mooring, bottom-pumped CTD; the refdes is the site, node, and sensor joined by hyphens
method = "recovered_inst"                           # non-decimated data from recovered instrument
stream = "ctdbp_cdef_instrument_recovered"          # name of data stream

# Site, node, and sensor info from deconstructed reference designator
[site, node, sensor] = refdes.split('-', 2)

login, password = AUTH[0], AUTH[2]

##### Using OOINet module

In [10]:
# Use the gold copy THREDDS datasets
thredds_url = M2M.get_thredds_url(refdes, method, stream, goldCopy=True)

# Get the THREDDS catalog
thredds_catalog = M2M.get_thredds_catalog(thredds_url)
deployments = M2M.get_deployments(refdes)

In [11]:
# Clean the THREDDS catalog
# This step drops catalog entries that do not match the requested stream. These ancillary files are usually
# provided because they are used to calculate a derived variable from the measured variable stream.
sensor_files = M2M.clean_catalog(thredds_catalog, stream, deployments)

# Now build the url to access the data
# sensor_files = [re.sub(r"catalog\.html\?dataset=", M2M.URLS["goldCopy_fileServer"], file) for file in sensor_files]
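
To make the cleaning step concrete, here is a rough sketch of filtering catalog entries by file name. It is a hypothetical stand-in for what `M2M.clean_catalog` is described as doing, not its actual implementation: `filter_catalog` is an invented helper, and the ancillary entry in the example is made up for illustration.

```python
import re


def filter_catalog(entries, refdes, method, stream):
    """Keep catalog entries whose file name belongs to the requested data stream."""
    # File names look like deploymentNNNN_<refdes>-<method>-<stream>_<start>-<end>.nc.
    # Requiring a timestamp right after the stream name avoids matching a longer
    # stream name that merely starts with the requested one.
    pattern = re.compile(
        r"deployment\d{4}_"
        + re.escape(f"{refdes}-{method}-{stream}")
        + r"_\d{8}T\d{6}[^/]*\.nc$"
    )
    return [entry for entry in entries if pattern.search(entry)]


folder = ("catalog.html?dataset=ooigoldcopy/public/"
          "CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered/")
entries = [
    # sensor file, taken from the catalog listing shown in this notebook
    folder + "deployment0001_CP01CNSM-RID27-04-DOSTAD000-recovered_host-"
             "dosta_abcdjm_dcl_instrument_recovered_20131121T182017.889000-20140217T132558.909000.nc",
    # hypothetical ancillary CTD file, made up for illustration
    folder + "deployment0001_CP01CNSM-RID27-03-CTDBPC000-recovered_host-"
             "ctdbp_cdef_dcl_instrument_recovered_20131121T181601-20140217T132711.nc",
]

kept = filter_catalog(entries, "CP01CNSM-RID27-04-DOSTAD000", "recovered_host",
                      "dosta_abcdjm_dcl_instrument_recovered")
print(len(kept))
```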

In [10]:
sensor_files

['catalog.html?dataset=ooigoldcopy/public/CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered/deployment0001_CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered_20131121T182017.889000-20140217T132558.909000.nc',
 'catalog.html?dataset=ooigoldcopy/public/CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered/deployment0003_CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered_20150507T174515.211000-20151023T193331.593000.nc',
 'catalog.html?dataset=ooigoldcopy/public/CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered/deployment0004_CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered_20151023T190008.653000-20160402T041511.024000.nc',
 'catalog.html?dataset=ooigoldcopy/public/CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abcdjm_dcl_instrument_recovered/deployment0005_CP01CNSM-RID27-04-DOSTAD000-recovered_host-dosta_abc

In [4]:
# build path to folder where data will be saved
folder_path = os.path.join(os.path.abspath('../data'), 'external', method, stream, refdes)
# make folder if it does not already exist
# if not os.path.exists(folder_path):
#     os.makedirs(folder_path)

In [13]:
M2M.download_netCDF_files(sensor_files, goldCopy=True, saveDir=folder_path)

----- Downloading files -----


Downloading https://thredds.dataexplorer.oceanobservatories.org/thredds/fileServer/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0001_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20131121T181601-20140217T132711.nc to c:\Users\kylene.cooley\Documents\GitHub\qartod_testing\data\external\recovered_inst\ctdbp_cdef_instrument_recovered\CP01CNSM-RID27-03-CTDBPC000\deployment0001_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20131121T181601-20140217T132711.nc 
Downloading https://thredds.dataexplorer.oceanobservatories.org/thredds/fileServer/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0004_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20151023T191528-20160402T034848.nc to c:\Users\kylene.cooley\Documents\GitHub\qartod_testing\data\external\recovered_inst\ctdbp_cdef_instrument_recovered\CP01

Exception in thread Thread-26:
Traceback (most recent call last):
  File "c:\Users\kylene.cooley\AppData\Local\anaconda3\envs\qartod_test\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\kylene.cooley\Documents\GitHub\OOINet\ooinet\Download.py", line 60, in run
    download_file(directory, link)
  File "C:\Users\kylene.cooley\Documents\GitHub\OOINet\ooinet\Download.py", line 30, in download_file
    urlretrieve(link, download_path)
  File "c:\Users\kylene.cooley\AppData\Local\anaconda3\envs\qartod_test\Lib\urllib\request.py", line 251, in urlretrieve
    tfp = open(filename, 'wb')
          ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\kylene.cooley\\Documents\\GitHub\\qartod_testing\\data\\external\\recovered_inst\\ctdbp_cdef_instrument_recovered\\CP01CNSM-RID27-03-CTDBPC000\\deployment0005_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20160513T135001-20161013T193001.nc'
Exception 
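
`open(filename, 'wb')` raises `FileNotFoundError` when the parent folder does not exist, and on Windows it can also fail this way when the full path reaches the legacy limit of about 260 characters and long-path support is not enabled. The traceback alone does not establish which one applies here. The sketch below checks both before a download; `prepare_target` is a hypothetical helper, not part of OOINet.

```python
import os
import tempfile

WINDOWS_MAX_PATH = 260  # legacy Windows limit on the full path length


def prepare_target(folder, file_name):
    """Create the folder if needed and report whether the full path is long."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    target = os.path.abspath(os.path.join(folder, file_name))
    return target, len(target) >= WINDOWS_MAX_PATH


# Example with a temporary directory standing in for the notebook's save folder
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "external", "recovered_inst")
    target, too_long = prepare_target(folder, "deployment0001_example.nc")
    print(os.path.isdir(folder), too_long)
```

If the length check is the one that trips, saving under a shorter folder path or shortening the file name are the simplest workarounds.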

#### Try same process as dev1 data download

In [3]:
# Use the gold copy OPeNDAP (dodsC) URL in place of the Dev1 data catalog URL
api_base_url = M2M.URLS['goldCopy_dodsC']
api_base_url = re.sub("https", "http", api_base_url) 

# Use the fileServer URL for downloading data files from the thredds server
tds_url = M2M.URLS['goldCopy_fileServer']
tds_url = re.sub("https", "http", tds_url) 

# Create the request URL
data_request_url = ''.join((api_base_url, '-'.join((site, node, sensor, method, stream))))

# Build and send the data request
# Note: r.json() only succeeds when the server replies with a JSON body
r = requests.get(data_request_url, auth=(login, password))
data_request = r.json()

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [47]:
api_base_url

'http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/'

In [48]:
data_request_url

'http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered'

In [37]:
# Checking contents of request 
r.content

b'Error {\n    code = 400;\n    message = "Unrecognized request";\n};\n'
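
The `JSONDecodeError` arises because the response body is a plain-text error message rather than JSON. A minimal sketch of a more defensive parse, using only the standard library, is below. `parse_json_body` is a hypothetical helper, and passing 400 as the status assumes the HTTP status matches the code reported in the body.

```python
import json


def parse_json_body(body, status_code):
    """Return decoded JSON, or raise ValueError that includes the server's message."""
    text = body.decode("utf-8", errors="replace")
    if status_code != 200:
        raise ValueError(f"request failed with status {status_code}: {text.strip()}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"response is not JSON: {text[:200]!r}") from err


# The error body returned by the request in this section
dap_error = b'Error {\n    code = 400;\n    message = "Unrecognized request";\n};\n'
try:
    parse_json_body(dap_error, 400)
except ValueError as err:
    print(err)
```

With `requests`, this would be called as `parse_json_body(r.content, r.status_code)`, so the server's own message is visible instead of a decoding error.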

#### Using some M2M module and some xarray 

Xarray would be used at the step where we download the data once the request is successful, but so far my attempts have failed before I can try downloading any data.

In [3]:
# Routine in data_processing module from this project to download the gold copy THREDDS datasets

files = dp.ooinet_gold_copy_request(refdes, method, stream)

Downloading and Processing Data Files:   0%|          | 0/2 [00:02<?, ?it/s]


PermissionError: [Errno 13] Permission denied: b'c:\\Users\\kylene.cooley\\Documents\\GitHub\\qartod_testing\\data\\external\\recovered_inst\\ctdbp_cdef_instrument_recovered\\GA01SUMO-RII11-02-CTDBPP032\\deployment0002_GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered_20151114T220003-20160606T220003.nc'

In [5]:
txt = "c:\\Users\\kylene.cooley"
txt = r'%s' % txt
print(txt)

c:\Users\kylene.cooley


In [3]:
# Same as the routine in this project, but step by step (this cell repeats the all-M2M request above)
# Use the gold copy THREDDS datasets
thredds_url = M2M.get_thredds_url(refdes, method, stream, goldCopy=True)

# Get the THREDDS catalog
thredds_catalog = M2M.get_thredds_catalog(thredds_url)
deployments = M2M.get_deployments(refdes)

# Clean the THREDDS catalog
# This step drops catalog entries that do not match the requested stream. These ancillary files are usually
# provided because they are used to calculate a derived variable from the measured variable stream.
sensor_files = M2M.clean_catalog(thredds_catalog, stream, deployments) 

In [4]:
file = sensor_files[0]
from ooi_data_explorations.common import process_file

In [5]:
file

'catalog.html?dataset=ooigoldcopy/public/GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0002_GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered_20151114T220003-20160606T220003.nc'

In [6]:
ds = process_file(file, gc=True)
ds

In [5]:
# Now build the url to access the data
sensor_files = [re.sub(r"catalog\.html\?dataset=", M2M.URLS["goldCopy_dodsC"], file) for file in sensor_files]
sensor_files = [re.sub("https", "http", file) for file in sensor_files]

# build path to folder where data will be saved
folder_path = os.path.join(os.path.relpath('../data'), 'external', method, stream, refdes)
# make folder if it does not already exist
os.makedirs(folder_path, exist_ok=True)

In [6]:
sensor_files

['http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0001_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20131121T181601-20140217T132711.nc',
 'http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0004_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20151023T191528-20160402T034848.nc',
 'http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0005_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20160513T135001-20161013T193001.nc',
 'http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_c

In [8]:
streams = M2M.get_datastreams(refdes)
streams

Unnamed: 0,refdes,method,stream
0,CP01CNSM-RID27-03-CTDBPC000,recovered_host,ctdbp_cdef_dcl_instrument_recovered
1,CP01CNSM-RID27-03-CTDBPC000,recovered_inst,ctdbp_cdef_instrument_recovered
2,CP01CNSM-RID27-03-CTDBPC000,telemetered,ctdbp_cdef_dcl_instrument


In [10]:
# Try data download with just one file
file = sensor_files[0]

file_name = re.findall(r"deployment.*\.nc$", file)[0]
r = SESSION.get(file, timeout=(3.05, 120), auth=(login, password))
r.ok

MissingSchema: Invalid URL 'catalog.html?dataset=ooigoldcopy/public/GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0002_GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered_20151114T220003-20160606T220003.nc': No scheme supplied. Perhaps you meant https://catalog.html?dataset=ooigoldcopy/public/GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0002_GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered_20151114T220003-20160606T220003.nc?

In [10]:
# Figure out why r.ok is false
r.content

b'Error {\n    code = 400;\n    message = "Unrecognized request";\n};\n'

In [28]:
r.json()

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [50]:
# What is in file?
file

'http://thredds.dataexplorer.oceanobservatories.org/thredds/dodsC/ooigoldcopy/public/CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered/deployment0001_CP01CNSM-RID27-03-CTDBPC000-recovered_inst-ctdbp_cdef_instrument_recovered_20131121T181601-20140217T132711.nc'
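
The URLs in this notebook use two different THREDDS services: `dodsC` is the OPeNDAP endpoint, while `fileServer` (seen in the download log earlier) serves the netCDF file itself over HTTP. As far as I understand, a plain GET on a bare `dodsC` dataset URL is not a complete OPeNDAP request, which would be consistent with the "Unrecognized request" reply, whereas `xr.open_dataset` can usually take a `dodsC` URL directly. The sketch below only builds the two URL forms from a catalog entry; `catalog_entry_to_url` is a hypothetical helper, and the base URL is copied from the outputs above.

```python
import re

THREDDS_BASE = "https://thredds.dataexplorer.oceanobservatories.org/thredds/"


def catalog_entry_to_url(entry, service="fileServer"):
    """Turn a 'catalog.html?dataset=...' entry into a fileServer or dodsC URL."""
    if service not in ("fileServer", "dodsC"):
        raise ValueError(f"unknown THREDDS service: {service}")
    # Strip the catalog prefix so only the dataset path remains
    dataset_path = re.sub(r"^catalog\.html\?dataset=", "", entry)
    return f"{THREDDS_BASE}{service}/{dataset_path}"


entry = ("catalog.html?dataset=ooigoldcopy/public/"
         "GA01SUMO-RII11-02-CTDBPP032-recovered_inst-ctdbp_cdef_instrument_recovered/"
         "deployment0002_GA01SUMO-RII11-02-CTDBPP032-recovered_inst-"
         "ctdbp_cdef_instrument_recovered_20151114T220003-20160606T220003.nc")

print(catalog_entry_to_url(entry, "fileServer"))  # bytes of the .nc file over HTTP
print(catalog_entry_to_url(entry, "dodsC"))       # OPeNDAP endpoint for the same file
```

The `fileServer` form is the one to pair with `SESSION.get(...)` and `io.BytesIO(r.content)`, since that code expects the raw bytes of a netCDF file.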

In [None]:
use_dask = False    # set to True to open the dataset lazily with dask

if r.ok:
    # load the data file
    if use_dask:
        ds = xr.open_dataset(io.BytesIO(r.content), decode_cf=False, chunks=10000)
    else:
        ds = xr.load_dataset(io.BytesIO(r.content), decode_cf=False)

    # ds = M2M.get_api(ds)
    # r = SESSION.get(ds, timeout=(3.05, 120))
    # ds = xr.open_dataset(ds, chunks={})

    # preprocess and save the dataset regardless of how it was loaded
    ds = process_file(ds)
    file_path = os.path.join(folder_path, file_name)
    ds.to_netcdf(file_path)
else:
    print("bad request")

##### Using ooi_data_explorations modules

In [9]:
# Load data with common module

data = common.load_gc_thredds(site, node, sensor, method, stream, use_dask=True)    # Request the gold copy data through the THREDDS catalog

# The OOINet module method appears to avoid collecting ancillary files along with the requested sensor files,
# which could otherwise add time to the download and open-dataset steps.
# load_gc_thredds() also calls process_file() within gc_collect(), so it applies the same preprocessing as process_file() in the cells above.

Downloading 15 data file(s) from the OOI Gold Copy THREDSS catalog
Downloading and Processing Data Files: 100%|██████████| 15/15 [03:29<00:00, 13.94s/it]


In [None]:
# Make a copy of the data with a unique name

ds_prod = data.copy()
ds_prod

##### Save datasets for test in production to interim data folder for further processing

In [None]:
# folder='external' was added while updating this notebook; whether requested data
# should go in the interim or external folder going forward is still undecided
prod_path = dp.build_data_path(refdes, method, stream, 'prod', folder='external')

ds_prod.to_netcdf(path=prod_path)                           # write ds_prod to a netCDF file at prod_path

### QARTOD in Development: Request data from dev1 server

We may also want to examine new QARTOD tests that are staged in the Dev-1 environment before they are moved to production. The development environment is hosted at ooinet-dev1-west.intra.oceanobservatories.org. To access data on Dev-1, you need to be granted access and be connected to the CI-West VPN (vpn-west.oceanobservatories.org) at Oregon State.

In [None]:
# Set up the parameters needed to request data
# Check that the instrument parameters match an available OOI dataset on the dev1 server
# Maybe change this section to look for datasets programmatically with ooi-data-explorations functions (list platforms/sites, list methods, list streams,...)

refdes = "CP03ISSM-RID27-03-CTDBPC000"              # Coastal Pioneer Array (NES) - Inshore Surface Mooring Near Surface Instrument Frame - Bottom-pumped CTD
method = "recovered_inst"                           # non-decimated data from recovered instrument
stream = "ctdbp_cdef_instrument_recovered"          # name of data stream

# Site, node, and sensor info from deconstructed reference designator
[site, node, sensor] = refdes.split('-', 2)

# Set optional parameters 
# We specify a date range to control the size of the dataset requested 
params = {
  'beginDT':'2019-09-26T13:50:00.000Z',
  'endDT':'2020-11-01T13:16:00.000Z',
  'format':'application/netcdf',
  'include_provenance':'true',
  'include_annotations':'true'
}

The Dev-1 environment has no "gold copy" equivalent THREDDS catalog. Instead, we have to make a normal request and wait for the datasets to be assembled and made available for download.

We use a different process for downloading data than in the OOINet section because the default URLs set within the other functions connect to OOINet. Functions for requesting non-gold-copy datasets from OOINet do exist in the OOINet and ooi-data-explorations modules.


Our choice of URL is similar to the URL used in the M2M example notebook here: https://github.com/ooi-data-review/2018-data-workshops/blob/master/chemistry/examples/quickstart_python.ipynb 
The rest of the data request process through this section is modeled after the linked tutorial above. 
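
For orientation, here is a sketch of how such a request URL could be assembled. It assumes that Dev-1 exposes the same M2M sensor inventory path (`/api/m2m/12576/sensor/inv`) as production OOINet; that assumption, and the helper name `build_m2m_request_url`, are mine and are not taken from `dp.dev1_request`. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Assumption: Dev-1 uses the same sensor inventory path as production OOINet
DEV1_HOST = "ooinet-dev1-west.intra.oceanobservatories.org"
SENSOR_INV_PATH = "/api/m2m/12576/sensor/inv"


def build_m2m_request_url(site, node, sensor, method, stream, params=None, host=DEV1_HOST):
    """Assemble an M2M data request URL with optional query parameters."""
    path = "/".join((SENSOR_INV_PATH, site, node, sensor, method, stream))
    url = f"https://{host}{path}"
    if params:
        # urlencode percent-encodes characters such as ':' and '/'
        url = f"{url}?{urlencode(params)}"
    return url


example_params = {
    "beginDT": "2019-09-26T13:50:00.000Z",
    "endDT": "2020-11-01T13:16:00.000Z",
    "format": "application/netcdf",
}
print(build_m2m_request_url("CP03ISSM", "RID27", "03-CTDBPC000", "recovered_inst",
                            "ctdbp_cdef_instrument_recovered", example_params))
```

In the linked tutorial, the JSON reply to this kind of request lists the URLs where the assembled files will appear, and the client polls until they are ready before downloading.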

In [None]:
# Connect to the CI-West VPN before running this cell
data = dp.dev1_request(site, node, sensor, method, stream, params)

In [None]:
# Make a copy of the data with a unique name

ds_dev = data.copy()
ds_dev

##### Save datasets for test in development to interim data folder for further processing

In [None]:
interim_data = os.path.relpath('../data/interim')           # path to interim data folder from notebook folder

dev_filename = f"dev-{ds_dev.id}.nc"                        # build ds_dev filename from dataset attributes

dev_path = os.path.join(interim_data, dev_filename)         # build full relative path with ds_dev filename

ds_dev.to_netcdf(path=dev_path)                             # provide both relative path and filename for ds_dev in path parameter