# Preparing past buoy data and reanalyses for use in model training

## 1. Concatenate all past buoy data into a single dataframe
This section collects all of the cleaned past buoy data and combines it into a single dataframe. A column holding the day of year (DOY) truncated to an integer is also added. These data will be used (along with weather reanalyses) as training data for the machine learning model.

In [None]:
# Concatenate multiple cleaned buoy CSV files into a single DataFrame and add a new column with the Day of Year (DOY) as an integer

import pandas as pd
import glob
import os

# Define the path to the folder containing the CSV files
folder_path = '../data/cleaned/buoydata/past'

# Use glob to get all CSV files in the folder
csv_files = glob.glob(os.path.join(folder_path, '*.csv'))

# Initialize an empty list to store DataFrames
dfs = []

# Loop through the list of CSV files and read each one into a DataFrame
for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Make a new column in the dataframe of DOY truncated to an integer
combined_df['DOY_int'] = combined_df['DOY'].astype(int)

# Rename the lat and lon columns to Latitude and Longitude
combined_df.rename(columns={'Lat': 'Latitude', 'Lon': 'Longitude'}, inplace=True)

# Display the combined DataFrame
combined_df.head()
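
To illustrate what the cell above produces, here is a minimal sketch on two made-up frames standing in for per-buoy CSV files. The frame names and all values are hypothetical; only the column names `DOY`, `Lat`, and `Lon` come from the cell above.

```python
import pandas as pd

# Two toy frames standing in for per-buoy CSV files (all values are made up)
frame_a = pd.DataFrame({
    "BuoyID": [1, 1],
    "DOY": [301.625, 302.04],
    "Lat": [70.2, 70.3],
    "Lon": [232.4, 232.5],
})
frame_b = pd.DataFrame({
    "BuoyID": [2],
    "DOY": [301.5268],
    "Lat": [71.4],
    "Lon": [200.6],
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
toy_df = pd.concat([frame_a, frame_b], ignore_index=True)

# astype(int) truncates toward zero, so 301.625 becomes 301
toy_df["DOY_int"] = toy_df["DOY"].astype(int)

toy_df = toy_df.rename(columns={"Lat": "Latitude", "Lon": "Longitude"})
print(toy_df)
```

Because the conversion truncates rather than rounds, every observation taken during a given day maps to the same `DOY_int`, whatever the time of day.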

## 2. Cleaned buoy data geospatial bounds confirmation
To confirm the resulting dataframe only contains buoy data within the area of interest (Arctic Ocean), this cell will analyze and display the minimum and maximum values of the latitude and longitude fields of the data. 

In [None]:
# Confirm the latitude and longitude ranges

min_latitude = combined_df['Latitude'].min()
max_latitude = combined_df['Latitude'].max()
min_longitude = combined_df['Longitude'].min()
max_longitude = combined_df['Longitude'].max()

print(f"Latitude: min = {min_latitude}, max = {max_latitude}")
print(f"Longitude: min = {min_longitude}, max = {max_longitude}")
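
A printed range still has to be compared by eye. As a sketch, the same check can be made explicit by listing the rows that fall outside a chosen limit. The helper name `rows_outside_latitude` and the 60.0 threshold are illustrative assumptions, not values defined by this project.

```python
import pandas as pd

def rows_outside_latitude(df, min_lat, lat_col="Latitude"):
    """Return the rows whose latitude lies south of min_lat (hypothetical helper)."""
    return df[df[lat_col] < min_lat]

# Made-up positions; the 60.0 threshold is only an example value
toy_positions = pd.DataFrame({
    "Latitude": [70.2242, 71.43801, 58.9],
    "Longitude": [232.4734, 200.66946, 210.0],
})

outliers = rows_outside_latitude(toy_positions, min_lat=60.0)
print(f"{len(outliers)} row(s) south of the threshold")
```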

## 3. Convert the NCEP reanalysis data from netCDF to numpy arrays
This cell converts the NCEP reanalysis data from netCDF to 3-dimensional numpy arrays and converts the time dimension into day of year (DOY) values, so that each reanalysis layer can be matched to the buoy data. The script prints the bounds of the latitude and longitude coordinates of the arrays to confirm that they cover the bounds of the buoy data.

In [None]:
# Convert the uwnd and vwnd NetCDF files to 3D numpy arrays and convert the time variable to day of year (DOY)

import netCDF4 as nc
import numpy as np
import datetime

# Open the NetCDF file
file_path = '../data/raw/reanalyses/ncep/uwnd.sfc.2024.nc'  # forward slashes work on all platforms
dataset = nc.Dataset(file_path, 'r')

# Extract the uwnd variable
uwnd_var = dataset.variables['uwnd']

# Convert the uwnd variable to a 3D numpy array
uwnd_3d_array = uwnd_var[:]

# Open the vwnd NetCDF file
vwnd_file_path = '../data/raw/reanalyses/ncep/vwnd.sfc.2024.nc'  # forward slashes work on all platforms
vwnd_dataset = nc.Dataset(vwnd_file_path, 'r')

# Extract the vwnd variable
vwnd_var = vwnd_dataset.variables['vwnd']

# Convert the vwnd variable to a 3D numpy array
vwnd_3d_array = vwnd_var[:]

# Extract the latitudes and longitudes
latitudes = vwnd_dataset.variables['lat'][:]
longitudes = vwnd_dataset.variables['lon'][:]

# Convert the time variable to day of year (DOY)
# num2date applies the unit declared in the file's units attribute (which may be hours rather than days)
time_var = vwnd_dataset.variables['time']
dates = nc.num2date(time_var[:], units=time_var.units)
doy = [d.timetuple().tm_yday for d in dates]

# Ensure the DOY values are within the valid range (2024 is a leap year, so DOY runs from 1 to 366)
doy = np.array(doy)
valid_indices = (doy >= 1) & (doy <= 366)
doy = doy[valid_indices]
uwnd_3d_array = uwnd_3d_array[valid_indices]
vwnd_3d_array = vwnd_3d_array[valid_indices]

# Print the shape of the uwnd and vwnd arrays
print(uwnd_3d_array.shape)
print(vwnd_3d_array.shape)

# Close the NetCDF files
dataset.close()
vwnd_dataset.close()

# Print the minimum and maximum values of the lat and lon coordinates
# (read from the vwnd file; the uwnd file is assumed to use the same grid)
print(f"NCEP grid lat min: {latitudes.min()}, lat max: {latitudes.max()}")
print(f"NCEP grid lon min: {longitudes.min()}, lon max: {longitudes.max()}")
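
The DOY conversion depends on the unit named in the time variable's `units` attribute. The sketch below, using only the standard library, reads that unit from the string instead of assuming days. The function name `offsets_to_doy` and the example units string are assumptions for illustration; check the attribute in the actual file.

```python
import datetime

# Seconds per supported time unit
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}

def offsets_to_doy(offsets, units):
    """Convert numeric offsets (e.g. 'hours since 1800-01-01 00:00:0.0') to DOY."""
    unit_name, _, reference_str = units.partition("since")
    # Drop fractional seconds so strptime can parse the reference date
    reference_str = reference_str.strip().split(".")[0]
    reference = datetime.datetime.strptime(reference_str, "%Y-%m-%d %H:%M:%S")
    scale = SECONDS_PER_UNIT[unit_name.strip().lower()]
    return [
        (reference + datetime.timedelta(seconds=float(o) * scale)).timetuple().tm_yday
        for o in offsets
    ]

# Example: offsets for 1 January and 31 December 2024, expressed in hours
start = datetime.datetime(1800, 1, 1)
hours = [
    (datetime.datetime(2024, 1, 1) - start).total_seconds() / 3600,
    (datetime.datetime(2024, 12, 31) - start).total_seconds() / 3600,
]
example_doy = offsets_to_doy(hours, "hours since 1800-01-01 00:00:0.0")
print(example_doy)
```

Since 2024 is a leap year, 31 December maps to DOY 366, so a range check on the result needs an upper bound of 366 for that year.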


## 4. Assign NCEP reanalysis to buoy data (interpolation and discrete values)
This section will assign the u-component and v-component (uwnd and vwnd) of the NCEP reanalysis to the past buoy data, with steps taken to ensure that the data from the correct day of year is assigned to the buoy data. Both interpolation and discrete value assignment are used and stored in separate columns for more options in model training. 

In [None]:
# Assign the interpolated uwnd and vwnd values to the combined DataFrame

import pandas as pd
import numpy as np
from scipy.interpolate import griddata

# Assuming combined_df, uwnd_3d_array, vwnd_3d_array, latitudes, and longitudes are already defined

# Step 1: Get a list of the unique DOY_int values in the data frame
unique_doy = combined_df['DOY_int'].unique()

# Build the (lat, lon) coordinates of every grid point once, outside the loop
# The order matches layer.flatten() for layers indexed as [lat, lon]
points = np.array([(lat, lon) for lat in latitudes for lon in longitudes])

# Step 2: Iterate through each unique value
for doy in unique_doy:
    # Step 3: Select the corresponding layers of the 3D arrays for the same DOY
    # Assumes one layer per day, with index 0 holding DOY 1
    uwnd_layer = uwnd_3d_array[doy - 1]
    vwnd_layer = vwnd_3d_array[doy - 1]

    # Get the buoy positions for the current DOY
    doy_mask = combined_df['DOY_int'] == doy
    buoy_points = combined_df.loc[doy_mask, ['Latitude', 'Longitude']].values

    # Step 4: Interpolate the uwnd and vwnd layers to the buoy positions
    interpolated_uwnd = griddata(points, uwnd_layer.flatten(), buoy_points, method='linear')
    interpolated_vwnd = griddata(points, vwnd_layer.flatten(), buoy_points, method='linear')

    # Step 5: Append the results as new columns in the dataframe
    combined_df.loc[doy_mask, 'uwnd_ncep_interp'] = interpolated_uwnd
    combined_df.loc[doy_mask, 'vwnd_ncep_interp'] = interpolated_vwnd

# Ensure the new columns are of float type
combined_df['uwnd_ncep_interp'] = combined_df['uwnd_ncep_interp'].astype(float)
combined_df['vwnd_ncep_interp'] = combined_df['vwnd_ncep_interp'].astype(float)

# Display the updated DataFrame head
combined_df.head()
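
`griddata` expects one coordinate pair per grid value, in the same order as the flattened layer. The sketch below builds those pairs with `np.meshgrid(..., indexing="ij")` on a made-up 3×3 grid and checks the result against a field whose exact value is known. The grid, the field, and the query point are all illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

# Made-up grid coordinates (descending latitude, ascending longitude)
grid_lats = np.array([80.0, 77.5, 75.0])
grid_lons = np.array([0.0, 2.5, 5.0])

# Layer indexed as [lat, lon]; a made-up field that is linear in both coordinates
lat_grid, lon_grid = np.meshgrid(grid_lats, grid_lons, indexing="ij")
toy_layer = 2.0 * lat_grid + lon_grid

# One (lat, lon) pair per grid cell, in the same row-major order as toy_layer.flatten()
grid_points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])

# Interpolate to a single made-up buoy position
query = np.array([[78.0, 1.0]])
interpolated = griddata(grid_points, toy_layer.flatten(), query, method="linear")
print(interpolated)
```

Because the toy field is linear, linear interpolation should reproduce `2 * 78.0 + 1.0` up to floating-point error. With `method="linear"`, positions outside the convex hull of the grid points come back as NaN.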

In [None]:
# Assign the uwnd and vwnd values to the combined DataFrame using the closest grid point (discrete values)

import pandas as pd
import numpy as np

# Assuming combined_df, uwnd_3d_array, vwnd_3d_array, latitudes, and longitudes are already defined

# Step 1: Get a list of the unique DOY_int values in the data frame
unique_doy = combined_df['DOY_int'].unique()

# Step 2: Iterate through each unique value
for doy in unique_doy:
    # Step 3: Select the corresponding layers of the 3D arrays for the same DOY
    uwnd_layer = uwnd_3d_array[doy - 1]  # Assuming DOY starts from 1
    vwnd_layer = vwnd_3d_array[doy - 1]  # Assuming DOY starts from 1
    
    # Get the buoy data for the current DOY
    buoy_data = combined_df[combined_df['DOY_int'] == doy]
    
    # Step 4: Assign the discrete values from uwnd and vwnd layers to the buoy points
    buoy_points = buoy_data[['Latitude', 'Longitude']].values
    assigned_uwnd = []
    assigned_vwnd = []
    
    for lat, lon in buoy_points:
        # Find the closest grid point
        lat_idx = (np.abs(latitudes - lat)).argmin()
        lon_idx = (np.abs(longitudes - lon)).argmin()
        
        # Assign the discrete values
        assigned_uwnd.append(uwnd_layer[lat_idx, lon_idx])
        assigned_vwnd.append(vwnd_layer[lat_idx, lon_idx])
    
    # Convert lists to numpy arrays
    assigned_uwnd = np.array(assigned_uwnd)
    assigned_vwnd = np.array(assigned_vwnd)
    
    # Step 5: Append the results as new columns in the dataframe
    combined_df.loc[combined_df['DOY_int'] == doy, 'uwnd_ncep_discrete'] = assigned_uwnd
    combined_df.loc[combined_df['DOY_int'] == doy, 'vwnd_ncep_discrete'] = assigned_vwnd

# Ensure the new columns are of float type
combined_df['uwnd_ncep_discrete'] = combined_df['uwnd_ncep_discrete'].astype(float)
combined_df['vwnd_ncep_discrete'] = combined_df['vwnd_ncep_discrete'].astype(float)

# Display the updated DataFrame head
combined_df.head()
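
The per-buoy loop above can also be written with numpy broadcasting. This is a sketch of that variation on a made-up grid, not a replacement for the cell; the function name `nearest_grid_values` and all values are hypothetical.

```python
import numpy as np

def nearest_grid_values(layer, grid_lats, grid_lons, lats, lons):
    """Look up layer[lat, lon] at the grid point closest to each position."""
    # Distance from every grid coordinate to every query coordinate,
    # then the index of the smallest distance for each query
    lat_idx = np.abs(grid_lats[:, None] - lats[None, :]).argmin(axis=0)
    lon_idx = np.abs(grid_lons[:, None] - lons[None, :]).argmin(axis=0)
    return layer[lat_idx, lon_idx]

# Made-up 3x3 grid and layer
grid_lats = np.array([80.0, 77.5, 75.0])
grid_lons = np.array([0.0, 2.5, 5.0])
toy_layer = np.arange(9.0).reshape(3, 3)

# Two made-up buoy positions
buoy_lats = np.array([78.0, 75.2])
buoy_lons = np.array([1.0, 4.9])

assigned = nearest_grid_values(toy_layer, grid_lats, grid_lons, buoy_lats, buoy_lons)
print(assigned)
```

Like the loop in the cell above, this picks the nearest latitude and the nearest longitude independently and does not account for longitude wrapping between the last grid column and 360°.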

In [7]:
# Save the combined DataFrame to a CSV file for validation
combined_df.to_csv('../data/cleaned/buoydata/combined_buoy_data.csv', index=False)

# Preparing current buoy data and forecasts for use in model prediction

## 1. Concatenate all current buoy data into a single dataframe
This section collects all of the cleaned current buoy data and combines it into a single dataframe. A column holding the day of year (DOY) truncated to an integer is also added. Finally, the buoy data is subset to only the most recent position for each `BuoyID`.

In [14]:
# Concatenate multiple current buoy CSV files into a single DataFrame and add a new column with the Day of Year (DOY) as an integer

import pandas as pd
import glob
import os

# Define the path to the folder containing the CSV files
folder_path = '../data/cleaned/buoydata/current'

# Use glob to get all CSV files in the folder
csv_files = glob.glob(os.path.join(folder_path, '*.csv'))

# Initialize an empty list to store DataFrames
dfs = []

# Loop through the list of CSV files and read each one into a DataFrame
for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
combined_df_current = pd.concat(dfs, ignore_index=True)

# Make a new column in the dataframe of DOY truncated to an integer
combined_df_current['DOY_int'] = combined_df_current['DOY'].astype(int)

# Rename the lat and lon columns to Latitude and Longitude
combined_df_current.rename(columns={'Lat': 'Latitude', 'Lon': 'Longitude'}, inplace=True)

# For each unique BuoyID, keep only the most current position based on the DOY column
combined_df_current = combined_df_current.loc[combined_df_current.groupby('BuoyID')['DOY'].idxmax()]

# Display the combined DataFrame
combined_df_current.head()

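| | BuoyID | Year | Hour | Min | DOY | POS_DOY | Latitude | Longitude | BP | Ts | Ta | DOY_int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 183 | 300234065495190 | 2024 | 15 | 0 | 301.625 | 301.625 | 70.2242 | 232.4734 | 1010.6 | -4.7 | -52.6 | 301 |
| 222 | 300434065882720 | 2024 | 12 | 38 | 301.5268 | 301.5268 | 71.43801 | 200.66946 | -999.0 | -0.88 | NaN | 301 |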
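
The "most recent position" step relies on `groupby(...).idxmax()`, which returns the index label of the largest `DOY` within each `BuoyID` group. Here is a minimal sketch on made-up rows; the buoy IDs and values are hypothetical.

```python
import pandas as pd

# Made-up observations for two buoys, deliberately out of time order
toy_current = pd.DataFrame({
    "BuoyID": [10, 10, 20, 20, 20],
    "DOY": [300.5, 301.625, 301.1, 301.5268, 299.0],
    "Latitude": [70.1, 70.2, 71.3, 71.4, 71.0],
})

# idxmax gives one index label per BuoyID; .loc then selects those rows
latest_idx = toy_current.groupby("BuoyID")["DOY"].idxmax()
latest = toy_current.loc[latest_idx]
print(latest)
```

The comparison uses `DOY` alone, so it assumes all rows for a buoy fall within the same year.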


## 2. Convert the GFS forecast data from netCDF to arrays for interpolation with current buoy positions

This cell converts the GFS netCDF file into arrays for the `ugrd` and `vgrd` variables at a single pressure level (currently the second item in the `pfull` list, which is taken to correspond to near-surface conditions). The script checks that there is exactly one file in the specified directory, extracts the two variables from the netCDF dataset at the first time step, and prints the shapes of the resulting arrays to verify the extraction. The arrays will be interpolated to the current buoy positions and used as initial conditions for prediction.

In [21]:
# Convert the GFS netCDF file into arrays for the ugrd and vgrd variables at a specific pressure level (currently set to the second item in the list 
# which corresponds to near surface conditions)

import netCDF4 as nc
import os

# Define the directory path to the dataset in the data/raw/forecasts/gfs folder
directory_path = '../data/raw/forecasts/gfs'

# Get the list of files in the directory
files = os.listdir(directory_path)

# Ensure there is exactly one file in the directory
if len(files) != 1:
    raise ValueError("There should be exactly one file in the directory.")

# Get the file path
file_path = os.path.join(directory_path, files[0])

# Open the dataset
dataset = nc.Dataset(file_path, 'r')

# Index of the second item in the pfull list
pfull_index = 1

# Create arrays for ugrd and vgrd with pfull set to the second item
ugrd_array = dataset.variables['ugrd'][0, pfull_index, :, :]
vgrd_array = dataset.variables['vgrd'][0, pfull_index, :, :]

# Print the shapes of the arrays to verify
print("ugrd_array shape:", ugrd_array.shape)
print("vgrd_array shape:", vgrd_array.shape)

# Close the dataset
dataset.close()

ugrd_array shape: (1536, 3072)
vgrd_array shape: (1536, 3072)
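
Whether index 1 really is the near-surface level depends on how the file orders `pfull`. As a sketch, assuming `pfull` holds pressures where larger values are closer to the surface, the index can be derived from the values instead of being hard-coded. The function name `near_surface_index` and the pressure values below are made up for illustration.

```python
import numpy as np

def near_surface_index(pfull_values):
    """Return the index of the highest pressure, i.e. the level closest to the surface."""
    return int(np.argmax(pfull_values))

# Made-up pressure levels in both possible orderings
top_down = np.array([0.5, 10.0, 100.0, 500.0, 850.0, 1000.0])
bottom_up = top_down[::-1]

print(near_surface_index(top_down), near_surface_index(bottom_up))
```

With the real file, and assuming the levels are exposed as a variable named `pfull`, the values would be read from `dataset.variables['pfull'][:]` before the dataset is closed.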


## 3. Interpolating forecast data with current buoy positions

In [25]:
#WORK FOR THIS IS ONGOING