Planned for this notebook:

- Possibly redefine longitude and latitude to remove the jumps from 0 to 160 that appear when a vessel moves across the edge-case longitudes and latitudes.

- Add lag features with 3 steps.

- Create features for “Under way” and “Not under way”:
    - Under way: navstat 0 and 8 
    - Not under way: navstat 1 and 5


- Add Cartesian coordinates instead of lat and lon.
  - Need to remember later on that the coordinates have been transformed.

- Add coordinates based on portId, and possibly detect whether the port lies in the direction the vessel is heading.
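
The "Under way" flags and the Cartesian coordinates from the plan could be sketched as follows. This is a minimal sketch, not the notebook's final implementation: the column names `under_way`, `not_under_way`, `x`, `y` and `z` are illustrative, the projection onto a unit sphere is one possible choice, and the conversion assumes latitude and longitude are still in their original signed degrees (before any 0-360 redefinition).

```python
import numpy as np
import pandas as pd

def add_underway_flags(df):
    """Add binary flags derived from navstat (hypothetical column names)."""
    out = df.copy()
    out['under_way'] = out['navstat'].isin([0, 8]).astype(int)
    out['not_under_way'] = out['navstat'].isin([1, 5]).astype(int)
    return out

def add_cartesian(df):
    """Project signed lat/lon in degrees onto the unit sphere (assumed convention)."""
    out = df.copy()
    lat = np.radians(out['latitude'])
    lon = np.radians(out['longitude'])
    out['x'] = np.cos(lat) * np.cos(lon)
    out['y'] = np.cos(lat) * np.sin(lon)
    out['z'] = np.sin(lat)
    return out

# Toy rows; the last two positions are taken from the first rows of ais_train.csv
sample = pd.DataFrame({
    'navstat': [0, 1, 5, 8, 15],
    'latitude': [0.0, 0.0, 90.0, -34.7437, 8.8944],
    'longitude': [0.0, 90.0, 0.0, -57.8513, -79.47939],
})
features = add_cartesian(add_underway_flags(sample))
```

Cartesian coordinates have no wrap-around discontinuity, but predictions made in `x`, `y`, `z` have to be converted back to latitude and longitude before they are compared with the raw data.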

In [12]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

In [2]:
train = pd.read_csv('data/datasets/ais_train.csv', sep='|')
train.head()

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId
0,2024-01-01 00:00:25,284.0,0.7,0,88,0,01-09 23:00,-34.7437,-57.8513,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f
1,2024-01-01 00:00:36,109.6,0.0,-6,347,1,12-29 20:00,8.8944,-79.47939,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689
2,2024-01-01 00:01:45,111.0,11.0,0,112,0,01-02 09:00,39.19065,-76.47567,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19
3,2024-01-01 00:03:11,96.4,0.0,0,142,1,12-31 20:00,-34.41189,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126
4,2024-01-01 00:03:51,214.0,19.7,0,215,0,01-25 12:00,35.88379,-5.91636,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3


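Redefine longitude and latitude to remove the jumps from 0 to 160 that appear when a vessel moves across the edge-case longitudes and latitudes.
   - longitude will now lie in the range 0-360 and latitude in the range 0-180
   - latitude does not wrap around, so the +180 shift places southern latitudes above northern ones and introduces a jump at the equator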

In [4]:
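def redefine_coordinates(df):
    # Shift negative values into the positive range (vectorised, modifies df in place):
    # longitude [-180, 0) -> [180, 360), latitude [-90, 0) -> [90, 180)
    df['longitude'] = df['longitude'].where(df['longitude'] >= 0, df['longitude'] + 360)
    df['latitude'] = df['latitude'].where(df['latitude'] >= 0, df['latitude'] + 180)
    return df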

train = redefine_coordinates(train)
train.head()

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId
0,2024-01-01 00:00:25,284.0,0.7,0,88,0,01-09 23:00,145.2563,302.1487,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f
1,2024-01-01 00:00:36,109.6,0.0,-6,347,1,12-29 20:00,8.8944,280.52061,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689
2,2024-01-01 00:01:45,111.0,11.0,0,112,0,01-02 09:00,39.19065,283.52433,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19
3,2024-01-01 00:03:11,96.4,0.0,0,142,1,12-31 20:00,145.58811,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126
4,2024-01-01 00:03:51,214.0,19.7,0,215,0,01-25 12:00,35.88379,354.08364,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3
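
Because the shifted values have to be mapped back before they are interpreted as real positions, an inverse of `redefine_coordinates` is useful. The helper below is a hypothetical sketch (the name `restore_coordinates` is not from the notebook). It assumes the shift was applied exactly as above. A latitude of exactly 90 is ambiguous after the shift and is treated here as the North Pole.

```python
import numpy as np
import pandas as pd

def restore_coordinates(df):
    """Undo the 0-360 / 0-180 shift and return signed degrees (hypothetical helper)."""
    out = df.copy()
    # Longitudes in [180, 360) originally came from [-180, 0)
    out['longitude'] = out['longitude'].where(out['longitude'] < 180, out['longitude'] - 360)
    # Latitudes in (90, 180) originally came from (-90, 0)
    out['latitude'] = out['latitude'].where(out['latitude'] <= 90, out['latitude'] - 180)
    return out

# Shifted values of rows 0 and 3 from the output above
shifted = pd.DataFrame({
    'latitude': [145.2563, 145.58811],
    'longitude': [302.1487, 151.02067],
})
restored = restore_coordinates(shifted)
```
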


Add lag features with 3 steps - change time feature
   - each row will now contain the vessel's three previous reported positions, intended to describe where the vessel has been during the last hour

In [14]:
def add_lag_features(df, lag_steps=3):
    # Sort by 'vesselId' and 'time' so that each shift steps back in time per vessel
    df = df.sort_values(by=['vesselId', 'time'])

    # Create lagged 'latitude' and 'longitude' features within each vesselId group
    grouped = df.groupby('vesselId')[['latitude', 'longitude']]
    for lag in range(1, lag_steps + 1):
        shifted = grouped.shift(lag)
        df[f'latitude_lag_{lag}'] = shifted['latitude']
        df[f'longitude_lag_{lag}'] = shifted['longitude']

    return df

# The function returns a sorted copy, so the result must be assigned back;
# sort_index() restores the original row order
train = add_lag_features(train).sort_index()
train.head()

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId,latitude_lag_1,longitude_lag_1,latitude_lag_2,longitude_lag_2,latitude_lag_3,longitude_lag_3
0,2024-01-01 00:00:25,284.0,0.7,0,88.0,0,01-09 23:00,145.2563,302.1487,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f,,,,,,
1,2024-01-01 00:00:36,109.6,0.0,-6,347.0,1,12-29 20:00,8.8944,280.52061,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689,,,,,,
2,2024-01-01 00:01:45,111.0,11.0,0,112.0,0,01-02 09:00,39.19065,283.52433,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19,,,,,,
3,2024-01-01 00:03:11,96.4,0.0,0,142.0,1,12-31 20:00,145.58811,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126,,,,,,
4,2024-01-01 00:03:51,214.0,19.7,0,215.0,0,01-25 12:00,35.88379,354.08364,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3,,,,,,
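
Whether three lags really cover the last hour depends on how often each vessel reports. One way to check is to store the elapsed time to each lagged row. This is a sketch under assumptions: the helper name and the `minutes_since_lag_*` columns are illustrative, and `time` is assumed to parse with `pd.to_datetime`.

```python
import pandas as pd

def add_lag_time_deltas(df, lag_steps=3):
    """Add minutes elapsed since each lagged observation of the same vessel."""
    out = df.copy()
    out['time'] = pd.to_datetime(out['time'])
    out = out.sort_values(['vesselId', 'time'])
    grouped_time = out.groupby('vesselId')['time']
    for lag in range(1, lag_steps + 1):
        # NaT where the vessel has fewer than `lag` earlier observations
        delta = out['time'] - grouped_time.shift(lag)
        out[f'minutes_since_lag_{lag}'] = delta.dt.total_seconds() / 60
    return out

# Toy data: vessel 'a' reports every 20 minutes, vessel 'b' reports once
sample = pd.DataFrame({
    'vesselId': ['a', 'a', 'a', 'a', 'b'],
    'time': ['2024-01-01 00:00:00', '2024-01-01 00:20:00',
             '2024-01-01 00:40:00', '2024-01-01 01:00:00',
             '2024-01-01 00:05:00'],
})
with_deltas = add_lag_time_deltas(sample)
```

Summarising `minutes_since_lag_3` over the real training data would show how far the "last hour" description holds for irregularly sampled vessels.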


In [7]:
ports = pd.read_csv('data/datasets/ports.csv', sep='|')
train.head()

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId,latitude_lag_1,longitude_lag_1,latitude_lag_2,longitude_lag_2,latitude_lag_3,longitude_lag_3
0,2024-01-01 00:00:25,284.0,0.7,0,88,0,01-09 23:00,145.2563,302.1487,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f,,,,,,
1,2024-01-01 00:00:36,109.6,0.0,-6,347,1,12-29 20:00,8.8944,280.52061,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689,,,,,,
2,2024-01-01 00:01:45,111.0,11.0,0,112,0,01-02 09:00,39.19065,283.52433,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19,,,,,,
3,2024-01-01 00:03:11,96.4,0.0,0,142,1,12-31 20:00,145.58811,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126,,,,,,
4,2024-01-01 00:03:51,214.0,19.7,0,215,0,01-25 12:00,35.88379,354.08364,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3,,,,,,
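
With the port table loaded, the planned port-based features could look roughly like the sketch below. The columns of `ports.csv` are not shown in this notebook, so `portId`, `latitude` and `longitude` on the port side are assumptions, as are the helper name and the `port_*` output columns. The bearing uses the standard initial great-circle bearing formula and expects all coordinates in their original signed degrees, not the shifted 0-360 values.

```python
import numpy as np
import pandas as pd

def add_port_features(df, ports):
    """Attach port coordinates and the bearing from vessel to port (hypothetical helper)."""
    port_coords = ports[['portId', 'latitude', 'longitude']].rename(
        columns={'latitude': 'port_latitude', 'longitude': 'port_longitude'})
    # Left join keeps every AIS row; unknown ports yield NaN (the index is reset)
    out = df.merge(port_coords, on='portId', how='left')

    lat1, lat2 = np.radians(out['latitude']), np.radians(out['port_latitude'])
    dlon = np.radians(out['port_longitude'] - out['longitude'])
    y = np.sin(dlon) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    out['port_bearing'] = (np.degrees(np.arctan2(y, x)) + 360) % 360

    # Absolute angle between course over ground and bearing to port, in [0, 180]
    diff = (out['port_bearing'] - out['cog'] + 180) % 360 - 180
    out['port_bearing_diff'] = diff.abs()
    return out

# Toy data with made-up port ids, for illustration only
toy_ais = pd.DataFrame({
    'latitude': [0.0, 0.0, 0.0],
    'longitude': [0.0, 0.0, 0.0],
    'cog': [90.0, 270.0, 90.0],
    'portId': ['p1', 'p1', 'missing'],
})
toy_ports = pd.DataFrame({'portId': ['p1'], 'latitude': [0.0], 'longitude': [10.0]})
toy_features = add_port_features(toy_ais, toy_ports)
```

A small `port_bearing_diff` suggests the vessel is heading towards its registered port, while values near 180 suggest it is sailing away from it.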
