# Add wave height data to the location and time of each piracy incident

#### This notebook outlines efforts to augment our original piracy event dataframe with the meteorological data surrounding those events, to see whether any trends exist between weather factors and acts of piracy. The test dataset came from the Copernicus Marine Data Store. The method proved effective, but due to time constraints only wave heights were added, using a dataset that covers about 3 years of the 30-year period. Other datasets on the Copernicus site provide the remaining information; future work would tune this method to extract the pertinent variables from them.

Copernicus Marine Data Store: (https://data.marine.copernicus.eu/products)
Ocean Wave Data 2021-2024: https://data.marine.copernicus.eu/product/GLOBAL_ANALYSISFORECAST_WAV_001_027/description

NOTE: This notebook demonstrates the process used to arrive at the final product. The clean code that takes the dataset and writes out a CSV with the augmented wave heights is in the GitHub repository in Copernicus_Final.py

In [ ]:
# Install dependencies
# !pip install copernicusmarine
# !pip install netCDF4

In [None]:
# Import statements
%matplotlib inline
import pandas as pd
import datetime
import copernicusmarine as copernicus_marine
from pathlib import Path

In [None]:
# Dataset ID for the GLOBAL_ANALYSISFORECAST_WAV_001_027 wave height product
datasetID = 'cmems_mod_glo_wav_anfc_0.083deg_PT3H-i'

In [None]:
# Super nice: it is a 24 GB dataset, but it doesn't download to my computer. I can work with it here in the notebook
# and save the data I actually want to a different file later. Drawback: I could lose everything I'm working on if the connection to the server goes down.
# There are only three variables I care about.
# This data only covers 30 Sep 2021 to 25 Mar 2024 - will need to extend with other datasets or just show this as a use case
credential_path = Path('Data_Files/.copernicusmarine-credentials')
DS = copernicus_marine.open_dataset(dataset_id=datasetID, credentials_file=credential_path)
DS

The variable I care about:
1. VHM0 [m]
    Spectral significant wave height (Hm0)
    sea_surface_wave_significant_height

In [None]:
#get full list of variables available to dataset
DS.data_vars

In [None]:
#Get list of dimensions
DS.coords

In [None]:
#Get info on specific variable
DS.VHM0

In [None]:
#info on specific dimensions:
DS.time, DS.latitude

In [None]:
# Read in clean dataset
piracy_df = pd.read_csv(Path('Data_Files/[Clean] IMO Piracy - 2000 to 2022 (PDV 01-2023).csv'))

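# Drop rows with null lat/long: only events with coordinates are useful on a map
# .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
piracy_df_map = piracy_df.dropna(subset=['Latitude','Longitude']).copy()

# Show result
piracy_df_map.head(10)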

In [None]:
# Convert piracy_df_map incident dates to datetimes
# Assigning the whole column (rather than through .loc[:, ...]) lets the dtype change to datetime64
piracy_df_map['Incident Date'] = pd.to_datetime(piracy_df_map['Incident Date'])
piracy_df_map['Incident Date'].head(10)

# Testing process on one Piracy Event:

Row entry:

5/28/2022	Magnum Energy	Marshall Islands	Bulk carrier	In international waters	1.141666667	103.475	Not Reported	Store Rooms	Steaming	Knives	FALSE	FALSE	FALSE	FALSE	FALSE


In [None]:
piracy_df_names = piracy_df_map.set_index('Ship Name')
piracy_df_names.loc['Magnum Energy']

#### Step 1: Determine my buffer (can tune this once I see whether data comes back or not)


#### Step 2: Extract the lat, lon, and time from the piracy event


#### Step 3: Create a subset of data with the buffer around the Magnum Energy event

In [None]:
# Step 1
# First testing the buffers for this specific instance, then putting them into a loop
# Setting buffers so I have data that straddles the event: a 0.1 x 0.1 degree lat/lon box and a 1 hour window (30 mins before, 30 after)
# Will tune the buffers to get as small a dataset as possible
time_buffer = pd.Timedelta(0.5, unit="h") # units: "D" day, "h" hour, "min" minute
lat_buffer = 0.05 # degrees
lon_buffer = 0.05 # degrees

# Step 2
# Set the lat, lon, and time to those of the Magnum Energy event
lat = piracy_df_names.loc['Magnum Energy'].Latitude
lon = piracy_df_names.loc['Magnum Energy'].Longitude
time = piracy_df_names.loc['Magnum Energy']['Incident Date']

In [None]:
# Step 3
# Use the buffer to make a subset of the weather data for points around the event
lat_add = lat + lat_buffer
lat_subtract = lat - lat_buffer
lon_add = lon + lon_buffer
lon_subtract = lon - lon_buffer
time_add = time + time_buffer
time_subtract = time - time_buffer

# Create my data subset for the bubble around this specific piracy event
subset_Magnum_Energy = DS['VHM0'].sel(
    latitude = slice(lat_subtract,lat_add),
    longitude = slice(lon_subtract,lon_add),
    time = slice(time_subtract, time_add))
subset_Magnum_Energy

#### Inspecting the subset: the tuning was perfect (maybe by luck), and I got one reading very close to the event location at that time.

#### If the tuning is "imperfect" and I get several data points around the event, the wave height values (my principal variable of interest) are themselves means, so I can average them further to get a rough estimate of the wave height (an indicator of sea state) at that time. Ultimately I still output one value for that event.
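
The averaging idea above can be sketched in plain Python. This is only an illustration: `mean_wave_height` is a hypothetical helper, and the readings are made-up values rather than data from the Copernicus product. It averages whatever valid readings fall inside the buffer and skips NaN cells.

```python
import math

def mean_wave_height(readings):
    """Average the valid (non-NaN) readings; return NaN if none are valid."""
    valid = [r for r in readings if not math.isnan(r)]
    if not valid:
        return float("nan")
    return sum(valid) / len(valid)

# Hypothetical VHM0 readings (metres) from four grid cells around one event
readings = [0.5, 0.75, float("nan"), 1.0]
print(mean_wave_height(readings))  # 0.75
```

Either way, each event still ends up with a single wave height value.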

In [None]:
print(lat, lon)
# NOT HALF BAD MATEY - not sure if my buffers will always filter down to a single point, but let's keep sailing
# Also of note: the reports only give the day (no hour/minute), so the matched reading can be hours away from the actual event
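
Whether the buffer always leaves a single grid point can be checked with a little arithmetic. The sketch below assumes the grid nodes sit at exact multiples of 1/12 degree (the "0.083deg" in the dataset ID); the real grid origin may differ, and `grid_points_in_window` is a hypothetical helper. It counts how many grid coordinates fall inside a +/- 0.05 degree window.

```python
import math

GRID_SPACING = 1 / 12  # degrees; assumed meaning of "0.083deg" in the dataset ID

def grid_points_in_window(center, half_width, spacing=GRID_SPACING):
    """Return grid coordinates k*spacing inside [center - half_width, center + half_width]."""
    first = math.ceil((center - half_width) / spacing)
    last = math.floor((center + half_width) / spacing)
    return [k * spacing for k in range(first, last + 1)]

# Magnum Energy event coordinates from the row above
lat_points = grid_points_in_window(1.141666667, 0.05)
lon_points = grid_points_in_window(103.475, 0.05)
print(len(lat_points), len(lon_points))  # 1 1

# A different (made-up) latitude where the same window catches two grid rows
print(len(grid_points_in_window(1.125, 0.05)))  # 2
```

Because the 0.1 degree window is slightly wider than the grid spacing, under this assumption each axis catches one or two grid coordinates, so an event gets between one and four grid cells.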

In [None]:
df = subset_Magnum_Energy.to_dataframe()
df
# Notice there is a NaN value for the max wave height VCMX... I don't really need it, or the wave direction for that matter.
# But it raises the question: what do I do if I get a NaN value and have to expand the buffer, potentially letting in
# more than one value for a particular coordinate? That is when I'd use the nearest method or an argmin lookup (per Stack Overflow)
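
A minimal sketch of the argmin idea mentioned above, in plain Python. The latitude axis here is hypothetical (1/12-degree spacing), and `nearest_index` is an illustrative helper, not part of any library. In xarray itself, `.sel(..., method="nearest")` does a comparable lookup on the real coordinates.

```python
def nearest_index(coords, target):
    """Index of the coordinate closest to target (a plain-Python argmin)."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))

# Hypothetical latitude axis with 1/12-degree spacing
latitudes = [k / 12 for k in range(10, 20)]

i = nearest_index(latitudes, 1.141666667)
print(i, round(latitudes[i], 4))  # 4 1.1667
```

Picking the single nearest cell keeps one value per event even when a wider buffer lets several cells in.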

# Now build a function that builds these subsets and extracts the wave heights for each piracy event in our piracy dataframe. 

In [None]:
# For this case with the wave data from 30 Sep 2021 to 25 Mar 2024
# pd.Timestamp bounds compare cleanly with the datetime64 'Incident Date' column
DS_start_date = pd.Timestamp('2021-09-30')
DS_end_date = pd.Timestamp('2024-03-25')

def get_wave_height(row):
    """Return the significant wave height (VHM0, metres) near one piracy event, or NaN."""
    time = row['Incident Date']
    # Events outside the dataset's time coverage cannot be matched
    if not (DS_start_date <= time <= DS_end_date):
        return float('nan')

    lat = row['Latitude']
    lon = row['Longitude']

    # Create my data subset for the bubble around this specific piracy event for wave height
    # Hopefully this returns only one value for each event, but it may return more or none
    subset = DS['VHM0'].sel(
        latitude = slice(lat - lat_buffer, lat + lat_buffer),
        longitude = slice(lon - lon_buffer, lon + lon_buffer),
        time = slice(time - time_buffer, time + time_buffer))

    # No grid point fell inside the bubble
    if subset.size == 0:
        return float('nan')

    # Average whatever points came back so each event gets a single value
    return float(subset.mean(skipna=True))
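
One possible way to handle the "none" case is to widen the buffer step by step until a valid reading appears. The sketch below is an assumption about how that could look, not what the notebook actually does: `first_valid_reading` and `fake_lookup` are hypothetical, and `fake_lookup` stands in for the real dataset query.

```python
import math

def first_valid_reading(lookup, buffers):
    """Try each buffer in turn; return (value, buffer) for the first non-NaN result."""
    for buffer in buffers:
        value = lookup(buffer)
        if not math.isnan(value):
            return value, buffer
    return float("nan"), None

# Hypothetical stand-in for the dataset query:
# no valid data until the box half-width reaches 0.1 degrees
def fake_lookup(buffer):
    return 1.2 if buffer >= 0.1 else float("nan")

print(first_valid_reading(fake_lookup, [0.05, 0.1, 0.2]))  # (1.2, 0.1)
```

Returning the buffer alongside the value records how far from the event the reading had to come from.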

In [None]:
# Add the wave height for each event as a new column
piracy_df_map["Wave Height"] = piracy_df_map.apply(get_wave_height, axis=1)

In [None]:
piracy_df_map[piracy_df_map["Wave Height"].notna()]

In [None]:
# Write out to csv file for analysis (145 events updated)
piracy_df_map.to_csv(Path('./Results/piracy_df_waves.csv'), index=False) 

# Successful method. Would extend in future work to build out weather data for these piracy events. 