# Build Dataset A

## Description of approach A:
The idea behind this type of dataset is to represent the current state of a certain area (defined by the selected stations) in one data record. The ocean is a highly intricate and interrelated system, so this approach might help a neural network detect patterns of ocean currents and increase forecasting accuracy.

The dataset is a table that can be stored as a .csv file and read as a pd.DataFrame. Within a given timeframe, there is exactly one record for each full hour, so each record is indexed by its timestamp. Each feature (column) of the dataset represents one of the 9 measurements ["WDIR", "WSPD", "WVHT", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP"] at a certain station. If ERA5=true, the corresponding ERA5 measurements at each station's location are added as additional features.
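As a rough illustration of this layout, the sketch below builds an empty table with an hourly timestamp index and one column per measurement and station. The `FEATURE_STATION` and `FEATURE_STATION_ERA5` column names follow the output shown further down in this notebook. The stations, features and timeframe are arbitrary example values, and this is not how `mL.get_data_A` builds the table internally.

```python
import numpy as np
import pandas as pd

# Example values, chosen only for illustration.
stations = ["42001", "42002"]
features = ["WDIR", "WSPD"]

# One record per full hour within the timeframe.
index = pd.date_range(
    "2022-01-01 00:00", "2022-01-01 05:00", freq=pd.Timedelta(hours=1)
)

# One column per (feature, station) pair, named FEATURE_STATION.
columns = [f"{feature}_{station}" for station in stations for feature in features]

# With ERA5 enabled, each buoy column gets a counterpart with an "_ERA5" suffix.
columns += [f"{column}_ERA5" for column in columns]

# Empty table: every cell is NaN until measurements are filled in.
table = pd.DataFrame(np.nan, index=index, columns=columns)
print(table.shape)
print(list(table.columns))
```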

To train a neural network on the created data, the data must first be converted into a supervised-learning format. The function myLibrary.DataProcessing.data_to_supervised(data, n_in, n_out, dropnan), which returns a pd.DataFrame, does this.
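The library function itself is not shown in this notebook. The sketch below illustrates the usual sliding-window idea behind such a transformation, assuming that `n_in` is the number of past time steps used as input, `n_out` is the number of time steps to predict, and `dropnan` removes the incomplete rows at the edges. The name `to_supervised_sketch` and the column suffixes are hypothetical and may differ from what `data_to_supervised` does.

```python
import pandas as pd


def to_supervised_sketch(
    data: pd.DataFrame, n_in: int = 1, n_out: int = 1, dropnan: bool = True
) -> pd.DataFrame:
    """Turn a time-indexed table into lagged inputs plus future targets."""
    frames = []

    # Input columns: the values at t-n_in, ..., t-1.
    for lag in range(n_in, 0, -1):
        frames.append(data.shift(lag).add_suffix(f"(t-{lag})"))

    # Target columns: the values at t, t+1, ..., t+n_out-1.
    for step in range(n_out):
        suffix = "(t)" if step == 0 else f"(t+{step})"
        frames.append(data.shift(-step).add_suffix(suffix))

    supervised = pd.concat(frames, axis=1)

    # Shifting creates NaN rows at the start and end; optionally drop them.
    return supervised.dropna() if dropnan else supervised


toy = pd.DataFrame({"WSPD_42001": [1.0, 2.0, 3.0, 4.0, 5.0]})
result = to_supervised_sketch(toy, n_in=2, n_out=1)
print(result)
```

Note that with `dropnan=True`, any row whose window contains a missing measurement is dropped as well, which matters for a table with many NaN values.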

Advantages:
* Represents the current state of an area, not just of a single location
* Includes every timestamp, which should make it possible to detect seasonal patterns.

Disadvantages:
* No good solution for NaN imputation
* Since all features and timestamps must have a value, there is also no good way to remove NaN values.
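To make the NaN problem concrete, the sketch below computes the share of missing values per column, drops columns above a threshold, and fills only short gaps by time-based interpolation. This is one possible workaround, not a recommendation. How `mL.get_data_A` actually applies its `nan_threshold` parameter is not shown in this notebook, so the column-wise rule here is an assumption, and the values are made up.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-01", periods=6, freq=pd.Timedelta(hours=1))
frame = pd.DataFrame(
    {
        "WSPD_42001": [9.3, np.nan, 9.4, 9.1, np.nan, 9.0],
        "WVHT_42001": [np.nan, np.nan, np.nan, np.nan, 1.2, np.nan],
    },
    index=index,
)

# Assumed rule: drop a column if its NaN share exceeds the threshold.
nan_threshold = 0.66
nan_fraction = frame.isna().mean()
kept = frame.loc[:, nan_fraction <= nan_threshold]

# Fill gaps of at most one hour; longer gaps stay NaN.
filled = kept.interpolate(method="time", limit=1)

print(nan_fraction)
print(filled)
```

Interpolation of this kind is only plausible for short gaps in slowly varying measurements. It does not solve the problem of stations that are offline for weeks.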

In [1]:
import ipywidgets as widgets
from IPython.display import display
import pickle

import myLibrary as mL
NDBC = mL.NDBC_lib
ERA5 = mL.ERA5_lib
Models = mL.Models
DP = mL.DataProcessor
Experiment = mL.Experiment

In [2]:
def build_UI():
    #STATIONS -----------------------------------------------------------------------------------
    # create a list of checkbox widgets for each station
    stations = []
    for station in NDBC.cleaned_stations_GOM:
        checkbox = widgets.Checkbox(value=False, description=station, disabled=False, indent=False)
        stations.append(checkbox)

    # create a GridBox layout container with three columns
    global stations_grid
    stations_grid = widgets.GridBox(stations, layout=widgets.Layout(grid_template_columns="repeat(3, 300px)"))

    # wrap the GridBox inside a Box layout container with a fixed height and scrollable overflow
    stations_box = widgets.Box(children=[stations_grid], layout=widgets.Layout(height="200px", overflow="scroll"))

    # display the checkboxes
    print("STATIONS")
    display(stations_box)

    #Years --------------------------------------------------------------------------------------
    # create a range slider widget for selecting a time range
    global time_range_slider
    time_range_slider = widgets.SelectionRangeSlider(
        options=list(range(1970, 2023)),  # range of years to select from
        index=(51, 52),  # initial range selected (2021-2022)
        description='Time Range:',
        orientation='horizontal',
        layout={'width': '500px'}
    )

    # display the range slider widget
    display(time_range_slider)

    #NaN_Threshold-------------------------------------------------------------------------------
    # create a FloatSlider widget for a value between 0 and 1
    print("NaN-Threshold:")
    global nan_threshold_slider
    nan_threshold_slider = widgets.FloatSlider(
        value=0.5,
        min=0,
        max=1,
        step=0.01,
        description='',
        readout_format='.2f',
        orientation='horizontal',
        layout={'width': '500px'}
    )

    # display the FloatSlider widget
    display(nan_threshold_slider)

    #Features--------------------------------------------------------------------------------
    features = []
    for feature in ["WDIR", "WSPD", "WVHT", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP"]:
        checkbox = widgets.Checkbox(value=False, description=feature, disabled=False, indent=False)
        features.append(checkbox)

    print("Features:")
    global feature_container
    feature_container = widgets.HBox(features)
    display(feature_container)

    #ERA5------------------------------------------------------------------------------------
    global era5_checkbox
    era5_checkbox = widgets.Checkbox(value=False, description="Add ERA5 model data", disabled=False, indent=False)
    print("Model Data:")
    display(era5_checkbox)


build_UI()

STATIONS


Box(children=(GridBox(children=(Checkbox(value=False, description='41117', indent=False), Checkbox(value=False…

SelectionRangeSlider(description='Time Range:', index=(51, 52), layout=Layout(width='500px'), options=(1970, 1…

NaN-Threshold:


FloatSlider(value=0.5, layout=Layout(width='500px'), max=1.0, step=0.01)

Features:


HBox(children=(Checkbox(value=False, description='WDIR', indent=False), Checkbox(value=False, description='WSP…

Model Data:


Checkbox(value=False, description='Add ERA5 model data', indent=False)

In [3]:
# Read variables from UI
STATIONS = [checkbox.description for checkbox in stations_grid.children if checkbox.value]

# get the selected time range
start_year, end_year = time_range_slider.value
YEARS = [str(year) for year in range(start_year, end_year + 1)]
NAN_THRESHOLD = nan_threshold_slider.value
FEATURES =  [checkbox.description for checkbox in feature_container.children if checkbox.value]
ADD_ERA5 = era5_checkbox.value

## Optional: use hardcoded variables instead

In [3]:
STATIONS = [ '42001',
             '42002',
             '42003',
             '42007',
             '42012',
             '42019',
             '42020',
             '42035',
             '42036',
             '42038',
             '42039',
             '42040',
             '42041',
             '42055']

YEARS = ['2002',
         '2003',
         '2004',
         '2005',
         '2006',
         '2007',
         '2008',
         '2009',
         '2010',
         '2011',
         '2012',
         '2013',
         '2014',
         '2015',
         '2016',
         '2017',
         '2018',
         '2019',
         '2020',
         '2021',
         '2022']


# STATIONS = ["41117"]
# YEARS = ["2022"]
NAN_THRESHOLD = 0.66
FEATURES = mL.measurements     # ["WDIR", "WSPD", "WVHT", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP"]
ADD_ERA5 = True

In [4]:
print(f"Stations: {STATIONS}")
print(f"Years: {YEARS}")
print(f"NaN_Threshold: {NAN_THRESHOLD}")
print(f"Features: {FEATURES}")
print(f"ADD_ERA5: {ADD_ERA5}")

Stations: ['42001', '42002', '42003', '42007', '42012', '42019', '42020', '42035', '42036', '42038', '42039', '42040', '42041', '42055']
Years: ['2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022']
NaN_Threshold: 0.66
Features: ['WDIR', 'WSPD', 'WVHT', 'APD', 'MWD', 'PRES', 'ATMP', 'WTMP', 'DEWP']
ADD_ERA5: True


In [5]:
data = mL.get_data_A(
    stations=STATIONS,
    years=YEARS,
    nan_threshold=NAN_THRESHOLD,
    features=FEATURES,
    era5=ADD_ERA5
    )

data

Started with  2002 . Previous year took:   0.00522303581237793 seconds
from disc
from disc
from disc
from disc
Failed to get 42012h2002.txt.gz: HTTP Error 404: Not Found
from disc
from disc
from disc
from disc
Failed to get 42038h2002.txt.gz: HTTP Error 404: Not Found
from disc
from disc
from disc
Failed to get 42055h2002.txt.gz: HTTP Error 404: Not Found
Started with  2003 . Previous year took:   27.299588918685913 seconds
from disc
from disc
from disc
from disc
Failed to get 42012h2003.txt.gz: HTTP Error 404: Not Found
from disc
from disc
from disc
from disc
Failed to get 42038h2003.txt.gz: HTTP Error 404: Not Found
from disc
from disc
from disc
Failed to get 42055h2003.txt.gz: HTTP Error 404: Not Found
Started with  2004 . Previous year took:   26.762537956237793 seconds
from disc
from disc
from disc
from disc
Failed to get 42012h2004.txt.gz: HTTP Error 404: Not Found
from disc
from disc
from disc
from disc
from disc
from disc
from disc
from disc
Failed to get 42055h2004.txt.gz: HTT

Unnamed: 0,WDIR_42001,WSPD_42001,PRES_42001,ATMP_42001,WTMP_42001,DEWP_42001,WDIR_42002,WSPD_42002,PRES_42002,ATMP_42002,...,WDIR_42039_ERA5,WSPD_42039_ERA5,ATMP_42039_ERA5,WSPD_42035_ERA5,WSPD_42001_ERA5,DEWP_42020_ERA5,ATMP_42019_ERA5,WTMP_42039_ERA5,WSPD_42002_ERA5,PRES_42039_ERA5
2002-01-01 00:00:00,66.0,9.3,1017.1,22.3,25.5,16.8,39.0,10.5,1016.1,21.7,...,246.007357,5.756333,13.882608,8.031200,9.867456,10.834305,11.708612,21.781113,9.820263,1019.426223
2002-01-01 01:00:00,66.0,9.3,1017.1,22.3,25.5,16.8,39.0,10.5,1016.1,21.7,...,247.678051,5.579721,14.020573,8.216895,9.782997,10.975658,11.926516,21.781113,10.465795,1019.792677
2002-01-01 02:00:00,67.0,9.4,1017.2,21.9,25.5,16.6,36.0,10.9,1016.1,21.7,...,250.591891,5.582730,14.070538,8.454808,9.517146,11.111871,12.166319,21.781113,11.760698,1019.725358
2002-01-01 03:00:00,69.0,9.1,1017.2,22.4,25.5,16.9,32.0,12.7,1015.9,20.8,...,253.468273,5.633966,14.058979,8.471692,8.911373,11.239089,12.361607,21.781113,11.910608,1019.833394
2002-01-01 04:00:00,70.0,9.0,1017.1,22.5,25.5,16.3,33.0,12.7,1015.8,21.0,...,251.493918,5.638108,13.986641,8.698506,8.481407,11.332468,12.482585,21.781113,11.716782,1019.804620
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-31 19:00:00,195.0,0.5,1015.0,25.6,24.7,24.5,22.0,0.6,1015.9,25.1,...,28.180851,8.602918,22.730194,2.533178,0.601025,19.323104,21.847184,24.392457,1.062895,1016.581840
2022-12-31 20:00:00,210.0,1.0,1015.0,25.3,24.5,24.3,88.0,1.2,1015.4,24.7,...,22.969408,7.734290,22.564177,2.669909,0.979714,19.245361,22.052563,24.392457,1.330008,1016.363260
2022-12-31 21:00:00,231.0,0.6,1014.7,26.4,24.7,24.4,87.0,1.7,1014.9,24.6,...,14.655430,6.485218,22.377552,3.147255,2.067327,19.338348,22.260412,24.392457,1.770915,1016.186435
2022-12-31 22:00:00,18.0,0.7,1014.9,25.4,24.7,24.1,90.0,2.5,1014.9,24.5,...,20.371840,3.294807,24.068445,3.469271,1.554299,18.567779,22.101541,24.885927,2.483432,1017.067575


# Save Dataset

In [6]:
# create a text input widget for the filename
filename_widget = widgets.Text(
    value='',
    placeholder='Enter filename',
    description='Filename:',
    disabled=False
)
# show the '.pickle' extension next to the input field
extension_label = widgets.Label('.pickle')

# display the widget
widgets.HBox([filename_widget, extension_label])

HBox(children=(Text(value='', description='Filename:', placeholder='Enter filename'), Label(value='.pickle')))

In [7]:
filename = filename_widget.value
if filename == "":
    print("Enter a valid filename!")

else:
    dataset = {
        "stations": STATIONS,
        "years": YEARS,
        "nan_threshold": NAN_THRESHOLD,
        "features": FEATURES,
        "add_era5": ADD_ERA5,
        "data": data
    }

    # open a file for writing in binary mode
    filepath = f'data/datasets/type_A/{filename}_A.pickle'
    with open(filepath, 'wb') as f:
        # write the object to the file using pickle.dump()
        pickle.dump(dataset, f)

    print("File successfully saved:")
    print(filepath)

File successfully saved:
data/datasets/type_A/dataset_GOM_1_A_A.pickle
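To use a saved dataset later, the dictionary can be read back with `pickle.load`. The sketch below writes a small stand-in dictionary with the same keys to a temporary file and reads it back. The file name and values are placeholders, so replace the path with the one printed above when loading a real dataset.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the dictionary saved above (placeholder values).
dataset = {
    "stations": ["42001"],
    "years": ["2022"],
    "nan_threshold": 0.66,
    "features": ["WSPD"],
    "add_era5": False,
    "data": pd.DataFrame({"WSPD_42001": [9.3, 9.4]}),
}

with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / "example_A.pickle"

    # Write the object in binary mode, as in the cell above.
    with open(filepath, "wb") as f:
        pickle.dump(dataset, f)

    # Read it back; the metadata and the DataFrame are restored together.
    with open(filepath, "rb") as f:
        loaded = pickle.load(f)

print(loaded["stations"], loaded["years"])
print(loaded["data"])
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.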
