## Aggregation and Resampling
*Written by Friedrich Knuth, Rutgers University, May 21, 2018, as revised by Sage Lichtenwalner June 13, 2018*

In this example we will learn how to programmatically download and work with OOI NetCDF data from within a notebook. 

We will use data from the 3D Thermistor Array deployed in the ASHES Vent field at Axial Seamount for this example, but the mechanics apply to all datasets that are processed through the OOI Cyberinfrastructure (CI) system. 

In this example, you will learn:
* how to find the data you are looking for
* how to use the machine to machine API to request data
* how to load the NetCDF data into your notebook, once the data request has completed
* how to explore and plot data

A great resource for data wrangling and exploration in Python can be found at https://chrisalbon.com/. Tip: add "albon" to your Google search when looking for a common technique, and chances are Chris Albon has written a post on how to do it.

The difference between a NetCDF and JSON data request is that NetCDF files are served asynchronously and delivered to a THREDDS server, while the JSON data response is synchronous (instantaneous) and served as a JSON object in the GET response. NetCDF data is undecimated (full data set), while the JSON response is decimated down to a maximum of 20,000 data points.

*Note: given the size of the dataset, parts of this notebook are rather processing intensive and may not run well on Google Colab or your local machine. High-performance computing solutions (e.g. [pangeo](http://pangeo-data.org)) might be a better option.*

## Set up your API information
Log in at https://ooinet.oceanobservatories.org/ and obtain your <b>API username and API token</b> under your profile (top right corner), or use the credentials provided below.

In [None]:
username = ''
token = ''

## Find and request the data

In [None]:
import requests
import time

The ingredients used to build the `data_request_url` can be found at the link below. For this example, we will use the data from the 3D Thermistor Array (TMPSF).
http://ooi.visualocean.net/instruments/view/RS03ASHS-MJ03B-07-TMPSFA301

![RS03ASHS-MJ03B-07-TMPSFA301](https://github.com/ooi-data-review/ooi_datateam_notebooks/raw/master/images/RS03ASHS-MJ03B-07-TMPSFA301.png)

In [None]:
subsite = 'RS03ASHS'
node = 'MJ03B'
sensor = '07-TMPSFA301'
method = 'streamed'
stream = 'tmpsf_sample'
beginDT = '2014-09-27T01:01:01.000Z'  # beginning of first deployment
endDT = None
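
The `subsite`, `node` and `sensor` values above are the three parts of the instrument identifier shown in the link (OOI commonly calls this a reference designator). As a minimal sketch, assuming the identifier always has the form `SUBSITE-NODE-SENSOR` and that the sensor part keeps its own hyphen, a hypothetical helper can split it for you:

```python
def split_reference_designator(refdes):
    """Split 'SUBSITE-NODE-SENSOR' into three parts.

    Only the first two hyphens are split on, so the sensor part
    (e.g. '07-TMPSFA301') keeps its internal hyphen.
    """
    subsite, node, sensor = refdes.split('-', 2)
    return subsite, node, sensor

parts = split_reference_designator('RS03ASHS-MJ03B-07-TMPSFA301')
print(parts)
```

The function name is illustrative only and is not part of any OOI library.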

In [None]:
# No trailing slash here: the join below adds the separators
base_url = 'https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv'

data_request_url = '/'.join((base_url, subsite, node, sensor, method, stream))
params = {
    'beginDT':beginDT,
    'endDT':endDT,   
}
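
To see what the assembled request looks like before sending it, the sketch below builds the full URL with the standard library only. The helper name `build_data_request_url` is hypothetical; it assumes parameters set to `None` (such as `endDT` here) should simply be left out of the query string.

```python
from urllib.parse import urlencode

def build_data_request_url(base_url, subsite, node, sensor, method, stream, **params):
    """Join the path pieces and append any parameters that are not None."""
    path = '/'.join([base_url.rstrip('/'), subsite, node, sensor, method, stream])
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return path + ('?' + query if query else '')

example_url = build_data_request_url(
    'https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv',
    'RS03ASHS', 'MJ03B', '07-TMPSFA301', 'streamed', 'tmpsf_sample',
    beginDT='2014-09-27T01:01:01.000Z', endDT=None)
print(example_url)
```

Printing the URL this way is only a debugging aid; the request itself is still sent with `requests.get` as in the next cell.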

Send the data request.

In [None]:
# Uncomment to submit a new request; later cells that read `data` depend on it.
# r = requests.get(data_request_url, params=params, auth=(username, token))
# data = r.json()

The first URL in the response is the location on THREDDS where the data is being served. We will come back to the THREDDS location later.

In [None]:
# print(data['allURLs'][0])

The second URL in the response is the regular Apache server location for the data.

In [None]:
# print(data['allURLs'][1])

We will use this second location to programmatically check for a status.txt file to be written, containing the text 'request completed'. This indicates that the request is completed and the system has finished writing out the data to this location. This step may take a few minutes.

In [None]:
%%time
check_complete = data['allURLs'][1] + '/status.txt'
for i in range(1800): 
    r = requests.get(check_complete)
    if r.status_code == requests.codes.ok:
        print('request completed')
        break
    else:
        time.sleep(1)
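
The loop above treats any successful response as completion. As a hedged variation, the sketch below also checks that the file text contains 'request completed' and reports how many attempts were needed. `wait_for_status` and `fetch_status` are hypothetical names; `fetch_status` stands in for whatever function returns the text of `status.txt`, or `None` while the file does not exist yet.

```python
import time

def wait_for_status(fetch_status, max_tries=1800, delay=1.0, sleep=time.sleep):
    """Poll fetch_status() until its text contains 'request completed'.

    Returns the number of attempts used, or None if max_tries ran out.
    """
    for attempt in range(1, max_tries + 1):
        text = fetch_status()
        if text is not None and 'request completed' in text:
            return attempt
        sleep(delay)
    return None

# Stand-in for the server: the file is missing twice, then complete.
fake_responses = iter([None, None, 'request completed'])
attempts = wait_for_status(lambda: next(fake_responses), sleep=lambda s: None)
print(attempts)
```

With `requests`, `fetch_status` could be a small function that requests `check_complete` and returns `r.text` when `r.status_code` is `requests.codes.ok`, and `None` otherwise.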

## Load the dataset into the notebook

Copy the THREDDS URL (from `print(data['allURLs'][0])`) and add it here so we can use it again later without having to regenerate the dataset.

In [None]:
url = 'https://opendap.oceanobservatories.org/thredds/catalog/ooi/ooidatateam@gmail.com/20180517T182049-RS03ASHS-MJ03B-07-TMPSFA301-streamed-tmpsf_sample/catalog.html'

In [None]:
!pip install netcdf4 dask xarray
import os
import re
import requests
import pandas as pd
import xarray as xr

Use the THREDDS URL from your own data request, or the one provided above.

We will parse the HTML at the location where the files are being delivered to get the list of the NetCDF files written to THREDDS. Note that separate NetCDF files are created at 500 MB intervals and when there is a new deployment.

In [None]:
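tds_url = 'https://opendap.oceanobservatories.org/thredds/dodsC'
catalog_html = requests.get(url).text
# Find every NetCDF path listed on the catalog page (the dot before 'nc' is escaped)
nc_files = re.findall(r'(ooi/.*?\.nc)', catalog_html)
# Keep only files whose last character before '.nc' is a digit.
# A list comprehension avoids removing items from a list while iterating over it.
nc_files = [f for f in nc_files if f.endswith('.nc') and f[-4].isdigit()]
# Build the OPeNDAP URLs with '/' (os.path.join is meant for file paths, not URLs)
datasets = ['/'.join((tds_url, f)) for f in nc_files]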
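
To make the filtering step easier to follow, here is a small offline sketch that runs the same idea on a made-up catalog snippet. The HTML and file names are hypothetical and only mimic the general shape of a catalog page; `list_nc_urls` is an illustrative helper, not part of any library.

```python
import re

def list_nc_urls(catalog_html, tds_url):
    """Return OPeNDAP URLs for '.nc' paths whose name ends in a digit."""
    paths = re.findall(r'(ooi/.*?\.nc)', catalog_html)
    # Drop duplicates while keeping the order in which paths appear
    paths = list(dict.fromkeys(paths))
    keep = [p for p in paths if p[-4].isdigit()]
    return ['/'.join((tds_url, p)) for p in keep]

# Hypothetical catalog content: one timestamped data file, one other file
sample_html = (
    '<a href="catalog.html?dataset=ooi/user/request/'
    'deployment0001_example_20140929T190312-20141001T000000.nc">data</a>\n'
    '<a href="catalog.html?dataset=ooi/user/request/'
    'deployment0001_example_notes.nc">notes</a>\n'
)
sample_urls = list_nc_urls(sample_html, 'https://opendap.oceanobservatories.org/thredds/dodsC')
print(sample_urls)
```

Only the first file survives the filter, because the character just before `.nc` in the second name is a letter.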

In [None]:
datasets

Use xarray to open all the NetCDF files as a single xarray dataset, swap the dimension from obs to time, and examine the content.

In [None]:
# Note this may take a while
ds = xr.open_mfdataset(datasets)
ds = ds.swap_dims({'obs': 'time'})
ds = ds.chunk({'time': 100})
ds

## Explore the dataset

In [None]:
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import numpy as np

Use the built-in xarray plotting functions to create a simple line plot.

In [None]:
# Given the amount of data, this may take a while to plot
ds['temperature12'].plot()
plt.show();

We can tell that the peak temperature is increasing, but this simple line plot does not reveal the internal data distribution. Let's convert to a pandas DataFrame and downsample from 1 Hz to 1/60 Hz by averaging over each minute. This step may take 5-10 minutes.

In [None]:
%%time
from dask.diagnostics import ProgressBar
with ProgressBar():
    df = ds['temperature12'].to_dataframe()
    df = df.resample('min').mean()
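
For intuition about what the minute-mean downsampling computes, here is a plain-Python sketch on synthetic 1 Hz samples. It illustrates the idea only and is not how pandas implements `resample`; unlike pandas, it produces no entry for minutes that contain no samples. The name `minute_means` and the sample values are made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def minute_means(samples):
    """Average (timestamp, value) pairs into one mean per calendar minute."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate each timestamp to the start of its minute
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}

# Two minutes of synthetic 1 Hz data with values 0, 1, 2, ..., 119
start = datetime(2014, 9, 27, 1, 1, 0)
samples = [(start + timedelta(seconds=s), float(s)) for s in range(120)]
means = minute_means(samples)
print(means)
```

The 120 one-second samples collapse to two values, one per minute, which is the same 60-fold reduction applied to `temperature12` above.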

In [None]:
%%time
plt.close()
fig, ax = plt.subplots()
fig.set_size_inches(16, 6)
df['temperature12'].plot(ax=ax)
df['temperature12'].resample('H').mean().plot(ax=ax)
df['temperature12'].resample('D').mean().plot(ax=ax)
plt.show()

Now we are getting a better sense of the data. Let's convert time to ordinal values, grab the temperature values, and re-examine the data using hexagonal bivariate binning. Again, this step may take a few minutes.

In [None]:
%%time
# Convert the timestamps to matplotlib date numbers for plotting.
# Note: the name `time` shadows the `time` module imported earlier.
time_pd = pd.to_datetime(ds.time.values)
time = dates.date2num(time_pd.to_pydatetime())
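
The date numbers produced here are floating-point day counts from a fixed origin. The sketch below shows the arithmetic with the standard library only, assuming the 1970-01-01 origin that recent matplotlib versions use by default (older versions counted from a different origin, so the absolute numbers may differ). `to_day_number` is a hypothetical helper, not a matplotlib function.

```python
from datetime import datetime

UNIX_EPOCH = datetime(1970, 1, 1)  # assumed origin; see the note above

def to_day_number(dt, epoch=UNIX_EPOCH):
    """Return the number of days (with fraction) between epoch and dt."""
    return (dt - epoch).total_seconds() / 86400.0

noon_next_day = to_day_number(datetime(1970, 1, 2, 12, 0))
first_sample = to_day_number(datetime(2014, 9, 27, 1, 1, 1))
print(noon_next_day, first_sample)
```

Because the values are plain floats, they can be passed directly to plotting functions that expect numeric x coordinates, such as `hexbin`.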

In [None]:
temperature = ds['temperature12'].values.tolist()

In [None]:
plt.close()
fig, ax = plt.subplots()
fig.set_size_inches(16, 6)

hb1 = ax.hexbin(time, temperature, bins='log', vmin=0.4, vmax=3, gridsize=(1100, 100), mincnt=1, cmap='Blues')
fig.colorbar(hb1)
ax.yaxis.grid(True)
ax.xaxis.grid(True)
# ax.set_xlim(datetime.datetime(2015, 12, 1, 0, 0),datetime.datetime(2016, 7, 25, 0, 0))
# ax.set_ylim(2,11)
years = dates.YearLocator()
months = dates.MonthLocator()
yearsFmt = dates.DateFormatter('\n\n\n%Y')
monthsFmt = dates.DateFormatter('%b')
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(monthsFmt)
ax.xaxis.set_minor_locator(years)
ax.xaxis.set_minor_formatter(yearsFmt)
plt.tight_layout()
plt.setp(ax.xaxis.get_majorticklabels(), rotation=90)
plt.ylabel(r'Temperature $^\circ$C')  # raw string so '\c' is not read as an escape
plt.xlabel('Time')
plt.show()