# Build Dataset A

## Description of approach A:
The idea behind this type of dataset is to represent the current state of a certain area (defined by the selected stations) in a single data record. The ocean is a highly intricate and interrelated system, so this approach might help a neural network detect patterns in ocean currents and improve forecasting accuracy.

The dataset is a table that can be stored as a .csv file and read as a pd.DataFrame. Within a given timeframe, there is exactly one record for each full hour, so each record has a timestamp as its index. Each feature (column) of the dataset represents one of the 9 measurements ["WDIR", "WSPD", "WVHT", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP"] of a certain station. If era5=True, the corresponding ERA5 measurements at each station's location are added as additional features.
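
As a minimal sketch of this layout (not the output of `get_data_A`), the following builds a tiny table by hand. The column naming `<FEATURE>_<STATION>` and the `_ERA5` suffix follow the output shown further down in this notebook; the values and the second station ID `42001` are placeholders chosen only for illustration.

```python
import numpy as np
import pandas as pd

# One record per full hour; the timestamp is the index.
index = pd.date_range(
    "2022-01-01 00:00", periods=4, freq=pd.Timedelta(hours=1), name="timestamp"
)

stations = ["41117", "42001"]  # the second ID is a placeholder for this sketch
features = ["ATMP", "WTMP"]

# One column per (feature, station) pair.
columns = [f"{feature}_{station}" for station in stations for feature in features]

rng = np.random.default_rng(0)
values = rng.normal(20.0, 1.0, size=(len(index), len(columns))).round(1)
table = pd.DataFrame(values, index=index, columns=columns)

# With ERA5 enabled, every station feature gets a companion "_ERA5" column.
for column in columns:
    table[f"{column}_ERA5"] = table[column] + 0.5  # made-up model values

print(table.shape)
print(list(table.columns))
```

Each row then describes all selected stations at one hour, which is what lets a single record stand for the state of the whole area.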

To train a neural network on the created data, it first needs to be transformed into a supervised learning dataset. The function myLibrary.DataProcessing.data_to_supervised(data, n_in, n_out, dropnan): pd.DataFrame can do that.
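
The implementation of `data_to_supervised` is not shown in this notebook, so the following is only a sketch of the usual sliding-window technique behind such a function, assuming `n_in` is the number of past time steps used as input and `n_out` the number of steps to predict. The name `to_supervised_sketch` and the column suffixes are hypothetical, and the library function may behave differently.

```python
import pandas as pd


def to_supervised_sketch(data: pd.DataFrame, n_in: int = 1, n_out: int = 1,
                         dropnan: bool = True) -> pd.DataFrame:
    """Frame a time-indexed table as (past inputs -> future targets)."""
    frames = []
    # Input columns: values at t-n_in, ..., t-1.
    for lag in range(n_in, 0, -1):
        frames.append(data.shift(lag).add_suffix(f"(t-{lag})"))
    # Target columns: values at t, ..., t+n_out-1.
    for step in range(n_out):
        suffix = "(t)" if step == 0 else f"(t+{step})"
        frames.append(data.shift(-step).add_suffix(suffix))
    supervised = pd.concat(frames, axis=1)
    # Shifting creates NaN rows at the edges; optionally drop them.
    if dropnan:
        supervised = supervised.dropna()
    return supervised


index = pd.date_range(
    "2022-01-01", periods=5, freq=pd.Timedelta(hours=1), name="timestamp"
)
demo = pd.DataFrame({"WTMP_41117": [20.6, 20.6, 20.5, 20.5, 20.4]}, index=index)

supervised = to_supervised_sketch(demo, n_in=2, n_out=1)
print(supervised)
```

With `n_in=2` and `n_out=1`, the first two hours have no complete history, so three of the five rows remain.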

Advantages:
* Represents the current state of an area, not just of a single location
* Includes every timestamp, which should make it possible to detect seasonal patterns.

Disadvantages:
* No good solution for NaN imputation
* Since every feature must have a value at every timestamp, there is also no good way to remove NaN values.
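
To illustrate the NaN problem, here is a small sketch with made-up values. It shows that a single missing value anywhere in a row invalidates the whole record in this wide layout, so dropping incomplete rows breaks the hourly sequence. The time-based interpolation with a gap limit is only one possible workaround and is an assumption here, not the method used by `get_data_A`.

```python
import numpy as np
import pandas as pd

index = pd.date_range(
    "2022-01-01", periods=6, freq=pd.Timedelta(hours=1), name="timestamp"
)
wide = pd.DataFrame(
    {
        "ATMP_41117": [20.8, np.nan, 20.8, 20.8, 20.7, 20.6],
        "WTMP_41117": [20.6, 20.6, 20.5, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Share of missing values per column.
nan_share = wide.isna().mean()
print(nan_share)

# Dropping every row with at least one NaN leaves gaps in the hourly index.
complete_rows = wide.dropna()
print(len(wide), "rows before,", len(complete_rows), "rows after dropna()")

# One possible workaround: fill only short gaps that are enclosed by valid values.
filled = wide.interpolate(method="time", limit=1, limit_area="inside")
print(filled.isna().sum())
```

The short gap in `ATMP_41117` is filled, while the trailing gap in `WTMP_41117` stays missing because there is no later value to interpolate towards.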

In [1]:
import ipywidgets as widgets
from IPython.display import display
import pickle

import myLibrary as mL
NDBC = mL.NDBC_lib
ERA5 = mL.ERA5_lib
Models = mL.Models
DP = mL.DataProcessor
Experiment = mL.Experiment

In [3]:
def build_UI():
    #STATIONS -----------------------------------------------------------------------------------
    # create a list of checkbox widgets for each station
    stations = []
    for station in NDBC.cleaned_stations_GOM:
        checkbox = widgets.Checkbox(value=False, description=station, disabled=False, indent=False)
        stations.append(checkbox)

    # create a GridBox layout container with three columns
    global stations_grid
    stations_grid = widgets.GridBox(stations, layout=widgets.Layout(grid_template_columns="repeat(3, 300px)"))

    # wrap the GridBox inside a Box layout container with a fixed height and scrollable overflow
    stations_box = widgets.Box(children=[stations_grid], layout=widgets.Layout(height="200px", overflow="scroll"))

    # display the checkboxes
    print("STATIONS")
    display(stations_box)

    #Years --------------------------------------------------------------------------------------
    # create a range slider widget for selecting a time range
    global time_range_slider
    time_range_slider = widgets.SelectionRangeSlider(
        options=list(range(1970, 2023)),  # selectable years: 1970 to 2022
        index=(51, 52),  # initial range selected (2021-2022)
        description='Time Range:',
        orientation='horizontal',
        layout={'width': '500px'}
    )

    # display the range slider widget
    display(time_range_slider)

    #NaN_Threshold-------------------------------------------------------------------------------
    # create a FloatSlider widget for a value between 0 and 1
    print("NaN-Threshold:")
    global nan_threshold_slider
    nan_threshold_slider = widgets.FloatSlider(
        value=0.5,
        min=0,
        max=1,
        step=0.01,
        description='',
        readout_format='.2f',
        orientation='horizontal',
        layout={'width': '500px'}
    )

    # display the FloatSlider widget
    display(nan_threshold_slider)

    #Features--------------------------------------------------------------------------------
    features = []
    for feature in ["WDIR", "WSPD", "WVHT", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP"]:
        checkbox = widgets.Checkbox(value=False, description=feature, disabled=False, indent=False)
        features.append(checkbox)

    print("Features:")
    global feature_container
    feature_container = widgets.HBox(features)
    display(feature_container)

    #ERA5------------------------------------------------------------------------------------
    global era5_checkbox
    era5_checkbox = widgets.Checkbox(value=False, description="Add ERA5 model data", disabled=False, indent=False)
    print("Model Data:")
    display(era5_checkbox)


build_UI()

STATIONS


Box(children=(GridBox(children=(Checkbox(value=False, description='41117', indent=False), Checkbox(value=False…

SelectionRangeSlider(description='Time Range:', index=(51, 52), layout=Layout(width='500px'), options=(1970, 1…

NaN-Threshold:


FloatSlider(value=0.5, layout=Layout(width='500px'), max=1.0, step=0.01)

Features:


HBox(children=(Checkbox(value=False, description='WDIR', indent=False), Checkbox(value=False, description='WSP…

Model Data:


Checkbox(value=False, description='Add ERA5 model data', indent=False)

In [42]:
# Read variables from UI
STATIONS = [checkbox.description for checkbox in stations_grid.children if checkbox.value]

# get the selected time range
start_year, end_year = time_range_slider.value
YEARS = [str(year) for year in range(start_year, end_year + 1)]
NAN_THRESHOLD = nan_threshold_slider.value
FEATURES =  [checkbox.description for checkbox in feature_container.children if checkbox.value]
ADD_ERA5 = era5_checkbox.value

## Optional: use hardcoded variables instead

In [43]:
# STATIONS = ["41117"]
# YEARS = ["2022"]
# NAN_THRESHOLD = 0.5
# FEATURES = ['WTMP']     # ["WDIR", "WSPD", "WVHT", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP"]
# ADD_ERA5 = True

In [44]:
print(f"Stations: {STATIONS}")
print(f"Years: {YEARS}")
print(f"NaN_Threshold: {NAN_THRESHOLD}")
print(f"Features: {FEATURES}")
print(f"ADD_ERA5: {ADD_ERA5}")

Stations: ['41117']
Years: ['2022']
NaN_Threshold: 0.5
Features: ['ATMP', 'WTMP']
ADD_ERA5: True


In [45]:
data = mL.get_data_A(
    stations=STATIONS,
    years=YEARS,
    nan_threshold=NAN_THRESHOLD,
    features=FEATURES,
    era5=ADD_ERA5
    )

data

Started with  2022 . Previous year took:   0.0005838871002197266 seconds
from disc
Finished downloading - now merging it together!
Started with  2022 . Previous year took:   0.0 seconds
Finished downloading - now merging it together!


| timestamp | ATMP_41117 | WTMP_41117 | ATMP_41117_ERA5 | WTMP_41117_ERA5 |
|---|---|---|---|---|
| 2022-01-01 00:00:00 | 20.8 | 20.6 | 21.629668 | 20.254283 |
| 2022-01-01 01:00:00 | 20.8 | 20.6 | 21.576521 | 20.254283 |
| 2022-01-01 02:00:00 | 20.8 | 20.5 | 21.481707 | 20.254283 |
| 2022-01-01 03:00:00 | 20.8 | 20.5 | 21.298031 | 20.254283 |
| 2022-01-01 04:00:00 | 20.7 | 20.4 | 21.101175 | 20.254283 |
| ... | ... | ... | ... | ... |
| 2022-12-31 19:00:00 | 18.1 | 17.4 | 20.968520 | 18.848540 |
| 2022-12-31 20:00:00 | 18.3 | 17.4 | 21.142417 | 18.848540 |
| 2022-12-31 21:00:00 | 18.7 | 17.4 | 21.182383 | 18.848540 |
| 2022-12-31 22:00:00 | 18.7 | 17.4 | 19.681513 | 19.500180 |


# Save DataFrame as .pickle

In [46]:
# create a text input widget for the filename
filename_widget = widgets.Text(
    value='',
    placeholder='Enter filename',
    description='Filename:',
    disabled=False
)
# show the '.pickle' extension next to the input field
extension_label = widgets.Label('.pickle')

# display the widget
widgets.HBox([filename_widget, extension_label])

HBox(children=(Text(value='', description='Filename:', placeholder='Enter filename'), Label(value='.pickle')))

In [49]:
filename = filename_widget.value
if filename == "":
    print("Enter a valid filename!")

else:
    dataset = {
        "stations": STATIONS,
        "years": YEARS,
        "nan_threshold": NAN_THRESHOLD,
        "features": FEATURES,
        "add_era5": ADD_ERA5,
        "data": data
    }

    # open a file for writing in binary mode
    filepath = f'data/datasets/type_A/{filename}_A.pickle'
    with open(filepath, 'wb') as f:
        # write the object to the file using pickle.dump()
        pickle.dump(dataset, f)

    print("File successfully saved:")
    print(filepath)

File successfully saved:
data/datasets/type_A/dataset_1_A.pickle
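
To use a saved dataset later, the file can be read back with `pickle.load`. The sketch below round-trips a small stand-in dictionary with the same keys as above through a temporary directory, so it runs without the real data; for an actual dataset, the path would be the one printed by the cell above.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the dictionary written above (same keys, tiny made-up table).
dataset = {
    "stations": ["41117"],
    "years": ["2022"],
    "nan_threshold": 0.5,
    "features": ["ATMP", "WTMP"],
    "add_era5": True,
    "data": pd.DataFrame({"ATMP_41117": [20.8], "WTMP_41117": [20.6]}),
}

with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / "dataset_1_A.pickle"

    # write the object in binary mode
    with open(filepath, "wb") as f:
        pickle.dump(dataset, f)

    # read it back in binary mode
    with open(filepath, "rb") as f:
        loaded = pickle.load(f)

print(loaded["stations"], loaded["years"], loaded["features"])
print(loaded["data"])
```

Because the settings are stored next to the table, a loaded file documents how its data was built. Pickle files can execute arbitrary code when loaded, so only open files from a trusted source.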
