# OISST Datetime Timestamp Management

In the `oisst_mainstays` data stream, netCDF files come from two distribution centers:
 * The NOAA Physical Sciences Laboratory (NOAA PSL)
 * The National Centers for Environmental Information (NCEI)

While the source data are the same, the two centers record different timestamps for each day. This difference is **usually** not an issue, because there is only one measurement per day.

However, when these two data sources are combined, duplicate dates can appear. This happens because one source labels measurements with a noon `12:00` timestamp (NCEI) and the other uses a midnight `00:00:00` timestamp (PSL).
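A minimal sketch of why the mismatch hides duplicates, using made-up timestamps rather than values read from the actual files: a midnight and a noon label on the same calendar day are distinct values, so a check on the full timestamp reports nothing, while a check on the normalized date does.

```python
import pandas as pd

# Hypothetical timestamps: the same day labeled at midnight (PSL-style)
# and at noon (NCEI-style), followed by one unrelated day.
times = pd.to_datetime([
    "2021-01-01 00:00:00",
    "2021-01-01 12:00:00",
    "2021-01-02 12:00:00",
])

# Comparing full timestamps: no duplicates are reported.
full_dupes = times.duplicated()

# Comparing calendar dates only (time of day reset to midnight):
# the second entry is flagged as a repeat of the first.
date_dupes = times.normalize().duplicated()

print(full_dupes)
print(date_dupes)
```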

This is particularly an issue in the regional timeseries being produced, where some duplicate dates are slipping through in the 2020-2022 period. This notebook isolates how to check for them so that those steps can be implemented in `oisstools.py`.

In [4]:
# Load Libraries
import xarray as xr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import netCDF4
import datetime
import os
import oisstools as ot


# Set the workspace - local/ docker
workspace = "local"
box_root = ot.set_workspace(workspace)
print(f"Working via {workspace} directory at: {box_root}")

Working via local directory at: /Users/akemberling/Library/CloudStorage/Box-Box/


# Load a dataset with known problems to check against

In [9]:
# 1. Load the Gulf of Maine timeseries
# Get the region names from the lookup catalog
region_collection = "gmri_sst_focal_areas"
region_names = ot.get_region_names(region_group = region_collection)
region_paths = ot.get_timeseries_paths(
    box_root = box_root, 
    region_group = region_collection, 
    region_list = region_names)


gom_sst = pd.read_csv(region_paths[0])
gom_sst.head()

| Unnamed: 0 | time | sst | area_wtd_sst | modified_ordinal_day | sst_clim | area_wtd_clim | clim_sd | sst_anom | area_wtd_anom |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1981-09-01 | 15.780159 | 15.81786 | 245 | 16.430416 | 16.46841 | 2.270726 | -0.650257 | -0.65055 |
| 1 | 1981-09-02 | 15.787786 | 15.823025 | 246 | 16.356068 | 16.393608 | 2.24565 | -0.568282 | -0.570583 |
| 2 | 1981-09-03 | 15.494051 | 15.525661 | 247 | 16.299295 | 16.336779 | 2.232613 | -0.805244 | -0.811118 |
| 3 | 1981-09-04 | 14.993513 | 15.02563 | 248 | 16.2613 | 16.298191 | 2.201674 | -1.267787 | -1.272561 |
| 4 | 1981-09-05 | 14.843195 | 14.874094 | 249 | 16.145683 | 16.18252 | 2.176677 | -1.302488 | -1.308426 |


In [None]:
# Interrogate the dates:


# Find duplicates


# See if they exist in the current Anomaly netCDF files
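One possible way to fill in the "find duplicates" step is sketched here. It assumes the timeseries has a `time` column like the one in `gom_sst`; the helper name `find_duplicate_dates` and the small `demo` frame are hypothetical stand-ins, not part of `oisstools.py` or the real data.

```python
import pandas as pd


def find_duplicate_dates(df, time_col="time"):
    """Return every row whose calendar date occurs more than once.

    Hypothetical helper: times are parsed, reset to midnight, and then
    compared, so a 00:00 and a 12:00 label on one day count as the same date.
    """
    dates = pd.to_datetime(df[time_col]).dt.normalize()
    return df[dates.duplicated(keep=False)]


# Made-up stand-in for a regional timeseries with one doubled day.
demo = pd.DataFrame({
    "time": ["2021-06-01 00:00:00", "2021-06-01 12:00:00", "2021-06-02 12:00:00"],
    "sst": [14.1, 14.1, 14.3],
})

# All rows involved in a duplicated date (both copies are returned).
dupes = find_duplicate_dates(demo)

# Keep only the first row for each calendar date.
dates = pd.to_datetime(demo["time"]).dt.normalize()
deduped = demo[~dates.duplicated(keep="first")]

print(dupes)
print(deduped)
```

Returning both copies with `keep=False` makes it easier to confirm that the paired rows hold the same measurement before choosing which one to drop.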

## Debugging in xarray

This only becomes an issue when netCDF files from both sources are combined into a single `xr.Dataset`. When this happens, both timestamps end up in the same object, and everything downstream has to deal with the consequences.

In [1]:
# Load a PSL file


# Load an NCEI file



# Force a timestamp?


# Checking for duplicate dates, but not times:



# Picking one timestamp over another:
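The placeholder steps above could be approached along these lines. This is a sketch under stated assumptions: the two datasets are built in memory as stand-ins for a PSL file and an NCEI file (no real files are opened), `make_ds` is a hypothetical helper, and flooring to midnight is just one choice of convention, not necessarily the one `oisstools.py` should adopt.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_ds(times, value):
    """Hypothetical helper: a tiny 1-D dataset standing in for an OISST file."""
    times = pd.to_datetime(times)
    return xr.Dataset(
        {"sst": (("time",), np.full(len(times), value))},
        coords={"time": times},
    )


# Stand-ins: midnight labels (PSL-style) and noon labels (NCEI-style),
# overlapping on 2021-01-02.
psl_like = make_ds(["2021-01-01 00:00", "2021-01-02 00:00"], 10.0)
ncei_like = make_ds(["2021-01-02 12:00", "2021-01-03 12:00"], 10.0)

# Combining the sources puts both timestamp conventions on one time axis.
combined = xr.concat([psl_like, ncei_like], dim="time")

# Force a single timestamp: floor every label to the start of its day.
combined = combined.assign_coords(time=combined["time"].dt.floor("D").values)

# Check for duplicate dates now that the time of day no longer differs.
is_dupe = combined.get_index("time").duplicated(keep="first")

# Pick one record per date: here, the first occurrence is kept.
deduped = combined.isel(time=~is_dupe)

print(is_dupe)
print(deduped["time"].values)
```

Keeping the first occurrence favors whichever source was listed first in `xr.concat`, so the order of the inputs decides which center's record survives.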

## Debugging in pandas

Currently, across all regional timeseries, there are dates that passed the duplicated() test because they were not considered duplicate 