# **Label Creation Notebook**

### **1. Import Libraries**

In [1]:
import pandas as pd

In [None]:
# Import cleaned vessel data set
filePath = "../data/ecv_sep20.csv"
df = pd.read_csv(filePath)

In [None]:
# Display the first 5 rows of the data set
df.head(5)

In [None]:
# Display the shape of the data set
df.shape

### **Context for Determining Vessel Path**
- **Single Entry Vessels**: If a vessel has only one entry, label it stopped when SOG (speed over ground) is 0 and on course otherwise.
- **Vessel Stopped**: A vessel is stopped if SOG is 0.
- **Off Course**: A vessel is off course if the difference between its previous and current COG (course over ground) exceeds a set threshold (15° in the labeling code below) and the two readings are at most 1 day apart.
- **On Course**: A vessel is on course if the difference between previous and current COG is within the acceptable threshold.
- **Turned Around**: A vessel has turned around if the previous COG is about 180° opposite the current COG (within 10° in the labeling code below).
- **Vessel Count**: There are about 10,613 distinct vessels in the dataset.
- **Near Port**: Assume the vessel is maintaining course if near a port.
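
The rules above can be sketched as a small pure function. This is a minimal illustration rather than the notebook's own implementation: the names `angular_diff` and `classify_path_change` and their parameters are hypothetical, the 15° and 10° values are taken from the labeling cell further down, and the near-port rule is left out because no port column is shown in this notebook. The sketch also assumes heading differences should be measured around the compass, so 350° and 10° count as 20° apart.

```python
from datetime import timedelta

OFF_COURSE_DEG = 15       # threshold used in the labeling cell of this notebook
TURNAROUND_TOL_DEG = 10   # tolerance around 180 degrees, also from the labeling cell
MAX_GAP = timedelta(days=1)


def angular_diff(a, b):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def classify_path_change(sog, curr_cog, prev_cog=None, gap=None):
    """Label one reading; prev_cog and gap are None when there is no earlier reading."""
    if sog == 0:
        return "stopped"
    # No previous reading, or one that is too old to compare against
    if prev_cog is None or gap is None or gap > MAX_GAP:
        return "stayed on course"
    diff = angular_diff(curr_cog, prev_cog)
    if abs(diff - 180) <= TURNAROUND_TOL_DEG:
        return "turned around"
    if diff > OFF_COURSE_DEG:
        return "veered off course"
    return "stayed on course"
```

Keeping the rules in one function makes the order of the checks explicit: stopped first, then turned around, then off course.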

During normal operation, a ship's COG typically changes by 5-10° within a day. The size of the change varies with:
- **Weather**: Rough seas can cause larger changes.
- **Navigation**: More frequent course changes occur in congested waters or near ports.
- **Vessel Type**: Larger ships (tankers, cargo) tend to maintain more stable courses, while smaller vessels may change direction more often.
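
One way to check whether a candidate threshold fits this range is to look at the per-vessel COG changes directly. The sketch below uses a toy frame with two made-up vessels (not rows from the real data set) and assumes the `MMSI` and `COG` column names used elsewhere in this notebook. The variable names are illustrative.

```python
import pandas as pd

# Toy readings for two hypothetical vessels (not from the real data set)
toy = pd.DataFrame({
    "MMSI": [111, 111, 111, 222, 222, 222],
    "COG": [10.0, 14.0, 356.0, 90.0, 97.0, 180.0],
})

# Raw change in COG between consecutive readings of the same vessel
raw = toy.groupby("MMSI")["COG"].diff().abs()

# Fold differences above 180 degrees back, so 14 -> 356 counts as 18, not 342
wrapped = raw.where(raw <= 180, 360 - raw)

# Share of consecutive readings whose change is within a candidate threshold
threshold = 15
share_within = (wrapped.dropna() <= threshold).mean()
print(wrapped.tolist())
print(share_within)
```

On the real data, `wrapped.describe()` or `wrapped.quantile([0.5, 0.9, 0.99])` would give a quick view of how often changes exceed the chosen threshold.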

In [None]:
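# Parse timestamps, then sort by vessel and time so each row follows the same vessel's previous reading
df['BaseDateTime'] = pd.to_datetime(df['BaseDateTime'])
df = df.sort_values(by=['MMSI', 'BaseDateTime']).reset_index(drop=True)
labels = []     # one PathChange label per row
cog_diffs = []  # COG differences for rows that have a comparable previous reading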

for i in range(1, len(df)):
    curr_vessel = df.loc[i, 'MMSI']
    prev_vessel = df.loc[i - 1, 'MMSI']
    time_diff = df.loc[i, 'BaseDateTime'] - df.loc[i - 1, 'BaseDateTime']

    if (curr_vessel == prev_vessel) and (time_diff <= pd.Timedelta('1D')):
        prev_cog = df.loc[i - 1, 'COG']
        curr_cog = df.loc[i, 'COG']
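        # Smallest angle between the two headings, so 350° -> 10° counts as 20°, not 340°
        cog_diff = abs(curr_cog - prev_cog) % 360
        cog_diff = min(cog_diff, 360 - cog_diff)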
        curr_sog = df.loc[i, 'SOG']

        cog_diffs.append(cog_diff)
        if curr_sog == 0:
            labels.append('stopped')
        elif abs(cog_diff - 180) <= 10:
            labels.append('turned around')
        elif cog_diff > 15:
            labels.append('veered off course')
        else:
            labels.append('stayed on course')
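    else:
        # First reading of a vessel, or a gap of more than 1 day: no usable previous COG
        labels.append('stopped' if df.loc[i, 'SOG'] == 0 else 'stayed on course')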

# The loop starts at row 1, so label the very first row separately
if df.loc[0, 'SOG'] == 0:
    labels.insert(0, 'stopped')
else:
    labels.insert(0, 'stayed on course')

df['PathChange'] = labels
df.head(25)
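
For a data set of this size, a row-by-row loop over `df.loc` can be slow. The same labeling can be sketched in vectorized form with `groupby(...).shift()` and `numpy.select`. This is an illustrative alternative, not the notebook's method: the function name `label_path_changes` and its parameters are hypothetical, and the sketch makes two assumptions of its own, namely that every row with an SOG of 0 is labeled stopped and that COG differences are measured around the compass.

```python
import numpy as np
import pandas as pd


def label_path_changes(frame, off_course_deg=15, turn_tol_deg=10, max_gap="1D"):
    """Return PathChange labels for a frame with MMSI, BaseDateTime, COG and SOG columns."""
    frame = frame.sort_values(["MMSI", "BaseDateTime"])
    by_vessel = frame.groupby("MMSI")

    # Previous reading of the same vessel; NaN/NaT on each vessel's first row
    prev_cog = by_vessel["COG"].shift()
    gap = frame["BaseDateTime"] - by_vessel["BaseDateTime"].shift()
    comparable = prev_cog.notna() & (gap <= pd.Timedelta(max_gap))

    # Smallest angle between consecutive headings (0 to 180 degrees)
    diff = (frame["COG"] - prev_cog).abs() % 360
    diff = diff.where(diff <= 180, 360 - diff)

    # Conditions are checked in order; the first match wins
    conditions = [
        frame["SOG"] == 0,
        comparable & ((diff - 180).abs() <= turn_tol_deg),
        comparable & (diff > off_course_deg),
    ]
    choices = ["stopped", "turned around", "veered off course"]
    labels = np.select(conditions, choices, default="stayed on course")
    return pd.Series(labels, index=frame.index, name="PathChange")
```

Under these assumptions it could be applied as `df['PathChange'] = label_path_changes(df)`. Because the result keeps the frame's index, the assignment aligns row by row.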

In [None]:
# Display the value counts of the PathChange column
df['PathChange'].value_counts()

In [None]:
# Drop rows where 'PathChange' is 'stopped'
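# .copy() makes the filtered frame independent, so later column assignments do not raise SettingWithCopyWarning
df = df[df['PathChange'] != 'stopped'].copy()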

In [None]:
# Display the value counts of the PathChange column after dropping 'stopped' rows
df["PathChange"].value_counts()

In [None]:
# Combine 'veered off course' and 'turned around' into 'veered off course'
df['PathChange'] = df['PathChange'].replace('turned around', 'veered off course')

In [None]:
# Display the value counts of the PathChange column after merging the two off-course labels
df["PathChange"].value_counts()
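
With only two labels left, the class balance is easier to read as proportions than as raw counts. A minimal sketch, using a toy Series in place of `df["PathChange"]` (the values are illustrative, not results from the data set):

```python
import pandas as pd

# Toy labels standing in for df["PathChange"] (not real results)
path_change = pd.Series(
    ["stayed on course"] * 3 + ["veered off course"],
    name="PathChange",
)

# Relative frequency of each class instead of raw counts
shares = path_change.value_counts(normalize=True)
print(shares)
```

On the real frame the equivalent call would be `df["PathChange"].value_counts(normalize=True)`.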

In [None]:
# Export the data set to a new CSV file
df.to_csv('sept_vessels_2020_2023.csv', index=False)