# vessel-cable-anomaly-hunter
DTU Deep Learning project 29, group 80

## Required Libraries Installation
Install the dependencies before running the rest of the notebook, either from your terminal or with the cell below:

In [1]:
pip install -q -r requirements.txt

Note: you may need to restart the kernel to use updated packages.




**NOTE ON PYTHON VERSION**

This project requires **Python 3.10 or a later version.**

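A quick interpreter check can be run before anything else. The sketch below is only an illustration: the helper name `python_is_supported` is hypothetical and not part of this repository.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the note above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when (major, minor) is at least the required minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    found = sys.version.split()[0]
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {found}")
```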
**WARNING: Training Speed (CPU Usage)**

If your execution environment has no GPU backend (NVIDIA CUDA or Apple MPS), training the optimal LSTM Autoencoder configuration on the CPU will be extremely slow, potentially taking several hours per run, because processing long sequential data is computationally intensive on a CPU.

To quickly verify that the training loop works and to observe early convergence without excessive runtime, it is strongly recommended that you lower the `EPOCHS` value (for example `EPOCHS = 5`) in the `config.py` file before proceeding.



## Data Download
AIS Data scraper and filter

#### File imports

In [2]:
import config
import src.data.ais_downloader as ais_downloader
import src.data.ais_filtering as ais_filtering
import src.data.ais_reader as ais_reader
import src.data.ais_to_parquet as ais_to_parquet

#### Library imports

In [3]:
from pathlib import Path
from datetime import date
from tqdm import tqdm
import gc

#### Configuration

In [4]:
# Read configuration from config.py
VERBOSE_MODE = config.VERBOSE_MODE                          # Whether to print verbose output

START_DATE = config.START_DATE                              # Start date for data downloading
END_DATE   = config.END_DATE                                # End date for data downloading

DELETE_DOWNLOADED_CSV = config.DELETE_DOWNLOADED_CSV        # Whether to delete raw downloaded CSV files after parquet conversion

BBOX = config.BBOX                                          # Bounding Box to prefilter AIS data
POLYGON_COORDINATES = config.POLYGON_COORDINATES            # Polygon coordinates for precise AOI filtering

#### Paths

In [5]:
folder_path = Path(config.AIS_DATA_FOLDER)
folder_path.mkdir(parents=True, exist_ok=True)
    
csv_folder_path = folder_path / config.AIS_DATA_FOLDER_CSV_SUBFOLDER
csv_folder_path.mkdir(parents=True, exist_ok=True)
    
parquet_folder_path = folder_path / config.AIS_DATA_FOLDER_PARQUET_SUBFOLDER
parquet_folder_path.mkdir(parents=True, exist_ok=True)

file_port_locations = folder_path / config.FILE_PORT_LOCATIONS

#### Download, Filter and Save into Parquet
1. **Download:** Download one AIS data .csv file at a time from http://aisdata.ais.dk (data column descriptions: http://aisdata.ais.dk/!_README_information_CSV_files.txt);
2. **Filter:** For a given area of interest (AOI) in Denmark with known cable positions, filter AIS messages by removing unrealistic/unphysical messages and duplicates, and by removing error-prone messages within port areas;
3. **Segmentation:** Segment the cleaned data into tracks based on time gaps and track duration;
4. **Parquet Conversion:** Save the cleaned and filtered data into Parquet files for faster loading in the next steps.

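The filter step above can be sketched in plain Python. This is a minimal illustration, not the code in `ais_filtering`: the function names, the message field names (`mmsi`, `timestamp`, `lon`, `lat`, `cog`, `sog`) and the choice of `(mmsi, timestamp)` as the duplicate key are assumptions. The COG range `[0, 360]` and the SOG range `[0.5, 35.0]` knots are the values reported in the log output below. Port-area removal is left out.

```python
KNOTS_TO_MS = 1852.0 / 3600.0  # 1 knot = 1852 m per hour


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_messages(messages, polygon, sog_min_knots=0.5, sog_max_knots=35.0):
    """Drop duplicates, unphysical COG/SOG values and points outside the AOI."""
    seen = set()
    kept = []
    for msg in messages:
        key = (msg["mmsi"], msg["timestamp"])  # assumed duplicate key
        if key in seen:
            continue
        seen.add(key)
        if not 0.0 <= msg["cog"] <= 360.0:
            continue
        if not sog_min_knots <= msg["sog"] <= sog_max_knots:
            continue
        if not point_in_polygon(msg["lon"], msg["lat"], polygon):
            continue
        # Store SOG in m/s, as the final log line reports
        kept.append({**msg, "sog": msg["sog"] * KNOTS_TO_MS})
    return kept
```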
In [6]:
# --- Build the schedule of download string dates ---
dates = ais_downloader.get_work_dates(START_DATE, END_DATE, csv_folder_path, filter=False)

# --- Iterate with tqdm and download, unzip and delete ---
for day in tqdm(dates, desc="Processing data", unit="file"):
    tag = f"{day:%Y-%m}" if day < date.fromisoformat("2024-03-01") else f"{day:%Y-%m-%d}"
    print(f"\nProcessing date: {tag}")

    # --- Download one day ---
    csv_path = ais_downloader.download_one_ais_data(day, csv_folder_path)
    
    # --- Load CSV into DataFrame ---
    df_raw = ais_reader.read_single_ais_df(csv_path, BBOX, columns_to_drop=config.COLUMNS_TO_DROP, verbose=VERBOSE_MODE)
    # --- Optionally delete the downloaded CSV file ---
    if DELETE_DOWNLOADED_CSV:
        csv_path.unlink(missing_ok=True)
    
    # --- Filter and split ---
    # Filter AIS data; config.VESSEL_AIS_CLASS selects which AIS classes are kept (Class A and Class B in this run)
    df_filtered = ais_filtering.filter_ais_df(
        df_raw,                                               # raw AIS DataFrame
        polygon_coords=POLYGON_COORDINATES,                   # polygon coordinates for precise AOI filtering
        allowed_mobile_types=config.VESSEL_AIS_CLASS,                # vessel AIS class filter
        apply_polygon_filter=True,                            # keep polygon filtering enabled boolean
        remove_zero_sog_vessels=config.REMOVE_ZERO_SOG_VESSELS,      # use True/False to enable/disable 90% zero-SOG removal
        output_sog_in_ms=config.SOG_IN_MS,                           # convert SOG from knots in m/s (default) boolean
        sog_min_knots=config.SOG_MIN_KNOTS,                          # min SOG in knots to keep (None to disable)
        sog_max_knots=config.SOG_MAX_KNOTS,                          # max SOG in knots to keep (None to disable) 
        port_locodes_path=file_port_locations,                # path to port locodes CSV
        exclude_ports=True,                                   # exclude port areas boolean 
        verbose=VERBOSE_MODE,                                 # verbose mode boolean
    )
    
    # Free df_raw memory
    del df_raw
    gc.collect()

    # --- Parquet conversion ---
    # Save to Parquet by MMSI
    ais_to_parquet.save_by_mmsi(
        df_filtered,                                             # filtered AIS DataFrame 
        verbose=VERBOSE_MODE,                                    # verbose mode boolean
        output_folder=parquet_folder_path                        # output folder path
    )

    # Free df_filtered memory
    del df_filtered
    gc.collect()

Processing data:   0%|          | 0/8 [00:00<?, ?file/s]


Processing date: 2025-08-01
Skipping 2025-08-01 download: already present in ais-data/csv folder
Read AIS data: 1,128,873 rows within bbox, 511 unique vessels
 [filter_ais_df] Before filtering: 1,128,873 rows, 511 vessels
 [filter_ais_df] Type filtering: 1,093,101 rows (removed 35,772) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,093,079 rows, 508 vessels
 [filter_ais_df] Duplicate removal: 638,467 rows, 508 vessels
 [filter_ais_df] Polygon filtering: 337,847 rows (removed 300,620), 378 vessels
 [filter_ais_df] Port-area removal: removed 160,778 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 175,598 rows (removed 1,471) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 173,215 rows (removed 2,381) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 173,215 rows, 327 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 327 (MMSI, Date) combinations.


Processing data:  12%|█▎        | 1/8 [00:14<01:40, 14.36s/file]

 [save_by_mmsi] Parquet dataset written/appended at: /dtu/blackhole/16/213558/dark-vessel-hunter/ais-data/parquet

Processing date: 2025-08-02
Skipping 2025-08-02 download: already present in ais-data/csv folder
Read AIS data: 1,161,214 rows within bbox, 452 unique vessels
 [filter_ais_df] Before filtering: 1,161,214 rows, 452 vessels
 [filter_ais_df] Type filtering: 1,120,127 rows (removed 41,087) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,114,486 rows, 447 vessels
 [filter_ais_df] Duplicate removal: 616,865 rows, 447 vessels
 [filter_ais_df] Polygon filtering: 314,494 rows (removed 302,371), 342 vessels
 [filter_ais_df] Port-area removal: removed 168,854 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 145,303 rows (removed 337) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 144,010 rows (removed 1,291) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 144,010 rows, 290 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 2

Processing data:  25%|██▌       | 2/8 [00:27<01:22, 13.78s/file]


Processing date: 2025-08-03
Skipping 2025-08-03 download: already present in ais-data/csv folder
Read AIS data: 1,079,891 rows within bbox, 399 unique vessels
 [filter_ais_df] Before filtering: 1,079,891 rows, 399 vessels
 [filter_ais_df] Type filtering: 1,043,284 rows (removed 36,607) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,043,284 rows, 398 vessels
 [filter_ais_df] Duplicate removal: 599,515 rows, 398 vessels
 [filter_ais_df] Polygon filtering: 319,478 rows (removed 280,037), 288 vessels
 [filter_ais_df] Port-area removal: removed 165,183 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 153,909 rows (removed 386) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 152,641 rows (removed 1,266) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 152,641 rows, 248 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 248 (MMSI, Date) combinations.


Processing data:  38%|███▊      | 3/8 [00:40<01:05, 13.09s/file]

 [save_by_mmsi] Parquet dataset written/appended at: /dtu/blackhole/16/213558/dark-vessel-hunter/ais-data/parquet

Processing date: 2025-08-04
Skipping 2025-08-04 download: already present in ais-data/csv folder
Read AIS data: 1,161,332 rows within bbox, 380 unique vessels
 [filter_ais_df] Before filtering: 1,161,332 rows, 380 vessels
 [filter_ais_df] Type filtering: 1,125,031 rows (removed 36,301) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,125,031 rows, 379 vessels
 [filter_ais_df] Duplicate removal: 637,785 rows, 379 vessels
 [filter_ais_df] Polygon filtering: 321,848 rows (removed 315,937), 272 vessels
 [filter_ais_df] Port-area removal: removed 156,758 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 164,980 rows (removed 110) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 162,923 rows (removed 2,057) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 162,923 rows, 232 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 2

Processing data:  50%|█████     | 4/8 [00:53<00:53, 13.40s/file]

 [save_by_mmsi] Parquet dataset written/appended at: /dtu/blackhole/16/213558/dark-vessel-hunter/ais-data/parquet

Processing date: 2025-08-05
Skipping 2025-08-05 download: already present in ais-data/csv folder
Read AIS data: 1,157,746 rows within bbox, 268 unique vessels
 [filter_ais_df] Before filtering: 1,157,746 rows, 268 vessels
 [filter_ais_df] Type filtering: 1,119,953 rows (removed 37,793) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,116,828 rows, 264 vessels
 [filter_ais_df] Duplicate removal: 613,845 rows, 264 vessels
 [filter_ais_df] Polygon filtering: 301,015 rows (removed 312,830), 175 vessels
 [filter_ais_df] Port-area removal: removed 171,761 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 129,228 rows (removed 26) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 127,182 rows (removed 2,046) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 127,182 rows, 115 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 11

Processing data:  62%|██████▎   | 5/8 [01:05<00:38, 12.70s/file]

 [save_by_mmsi] Parquet dataset written/appended at: /dtu/blackhole/16/213558/dark-vessel-hunter/ais-data/parquet

Processing date: 2025-08-06
Skipping 2025-08-06 download: already present in ais-data/csv folder
Read AIS data: 1,145,520 rows within bbox, 261 unique vessels
 [filter_ais_df] Before filtering: 1,145,520 rows, 261 vessels
 [filter_ais_df] Type filtering: 1,108,897 rows (removed 36,623) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,108,897 rows, 260 vessels
 [filter_ais_df] Duplicate removal: 622,309 rows, 260 vessels
 [filter_ais_df] Polygon filtering: 300,756 rows (removed 321,553), 169 vessels
 [filter_ais_df] Port-area removal: removed 163,859 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 136,873 rows (removed 24) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 134,698 rows (removed 2,175) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 134,698 rows, 123 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 12

Processing data:  75%|███████▌  | 6/8 [01:16<00:24, 12.23s/file]


Processing date: 2025-08-07
Skipping 2025-08-07 download: already present in ais-data/csv folder
Read AIS data: 1,202,886 rows within bbox, 385 unique vessels
 [filter_ais_df] Before filtering: 1,202,886 rows, 385 vessels
 [filter_ais_df] Type filtering: 1,166,600 rows (removed 36,286) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,162,713 rows, 382 vessels
 [filter_ais_df] Duplicate removal: 642,517 rows, 382 vessels
 [filter_ais_df] Polygon filtering: 301,062 rows (removed 341,455), 277 vessels
 [filter_ais_df] Port-area removal: removed 133,073 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 167,842 rows (removed 147) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 166,354 rows (removed 1,488) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 166,354 rows, 247 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 247 (MMSI, Date) combinations.
 [save_by_mmsi] Parquet dataset written/appended at: /dtu/blackhole/16/213558/dark-v

Processing data:  88%|████████▊ | 7/8 [01:29<00:12, 12.32s/file]


Processing date: 2025-08-08
Skipping 2025-08-08 download: already present in ais-data/csv folder
Read AIS data: 1,160,724 rows within bbox, 399 unique vessels
 [filter_ais_df] Before filtering: 1,160,724 rows, 399 vessels
 [filter_ais_df] Type filtering: 1,124,268 rows (removed 36,456) using ['Class A', 'Class B']
 [filter_ais_df] MMSI filtering: 1,124,255 rows, 397 vessels
 [filter_ais_df] Duplicate removal: 618,010 rows, 397 vessels
 [filter_ais_df] Polygon filtering: 270,458 rows (removed 347,552), 285 vessels
 [filter_ais_df] Port-area removal: removed 115,708 rows in 3 overlapping ports
 [filter_ais_df] COG sanity: 154,292 rows (removed 458) with range [0, 360] deg
 [filter_ais_df] SOG sanity: 151,933 rows (removed 2,359) with range [0.5, 35.0] knots
 [filter_ais_df] Final: 151,933 rows, 262 unique vessels (SOG in m/s)
 [save_by_mmsi] Removed existing partitions for 262 (MMSI, Date) combinations.


Processing data: 100%|██████████| 8/8 [01:41<00:00, 12.65s/file]

 [save_by_mmsi] Parquet dataset written/appended at: /dtu/blackhole/16/213558/dark-vessel-hunter/ais-data/parquet





## Preprocess

#### File imports

In [7]:
import config
import src.pre_proc.pre_processing_utils as pre_processing_utils
import src.pre_proc.ais_query as ais_query
import src.pre_proc.ais_segment as ais_segment

#### Library imports

In [8]:
from pathlib import Path

#### Configuration

In [9]:
# Read configuration from config.py
VERBOSE_MODE = config.VERBOSE_MODE

FOLDER_NAME = config.AIS_DATA_FOLDER
folder_path = Path(FOLDER_NAME)
parquet_folder_path = folder_path / config.AIS_DATA_FOLDER_PARQUET_SUBFOLDER

TRAIN_START_DATE = config.TRAIN_START_DATE
TRAIN_END_DATE = config.TRAIN_END_DATE

TEST_START_DATE = config.TEST_START_DATE
TEST_END_DATE = config.TEST_END_DATE

MAX_TIME_GAP_SEC = config.MAX_TIME_GAP_SEC
MAX_TRACK_DURATION_SEC = config.MAX_TRACK_DURATION_SEC
MIN_TRACK_DURATION_SEC = config.MIN_TRACK_DURATION_SEC
MIN_SEGMENT_LENGTH = config.MIN_SEGMENT_LENGTH

MIN_FREQ_POINTS_PER_MIN = config.MIN_FREQ_POINTS_PER_MIN

RESAMPLING_RULE = config.RESAMPLING_RULE

#### Preprocess function
1. **Data Loading:** Queries DuckDB for AIS data within specified date ranges for 'train' or 'test'.
2. **Feature Engineering:** Converts COG (Course Over Ground) to sine/cosine components.
3. **Cleaning:** Drops unnecessary columns and rows with missing values.
4. **Ship Type Grouping:** Aggregates specific ship types into broader categories (Commercial, Passenger, Service, Other).
5. **Segmentation:** Splits AIS tracks into segments based on time gaps and duration constraints using `ais_segment`.
6. **Filtering:** Removes segments with low point density.
7. **Resampling:** Resamples tracks to a fixed time interval.
8. **Labeling:** Encodes ship types into numerical IDs.
9. **Saving:** Exports the processed DataFrame to a Parquet file.

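The sine/cosine encoding in step 2 matters because course is circular: 359° and 1° are only 2° apart, yet their raw difference is 358. A minimal scalar sketch follows; `encode_cog` and `encoded_distance` are hypothetical helpers for illustration, while the repository's `cog_to_sin_cos` operates on a DataFrame.

```python
import math


def encode_cog(cog_deg):
    """Map a course in degrees to a point (sin, cos) on the unit circle."""
    rad = math.radians(cog_deg)
    return math.sin(rad), math.cos(rad)


def encoded_distance(a_deg, b_deg):
    """Euclidean distance between two encoded courses."""
    sin_a, cos_a = encode_cog(a_deg)
    sin_b, cos_b = encode_cog(b_deg)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


print(abs(359.0 - 1.0))                        # raw difference: 358.0
print(round(encoded_distance(359.0, 1.0), 4))  # encoded distance: 0.0349
```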
In [10]:
def main_preprocess(dataframe_type: str = "all"):

    if dataframe_type == "all":
        main_preprocess("train")
        main_preprocess("test")
        return
        
    elif dataframe_type == "train":
        print(f"[preprocess] Querying AIS data for training period: {TRAIN_START_DATE} to {TRAIN_END_DATE}")
        # Loading filtered data from parquet files
        df = ais_query.query_ais_duckdb(parquet_folder_path, date_start=TRAIN_START_DATE, date_end=TRAIN_END_DATE, verbose=VERBOSE_MODE)
        
    elif dataframe_type == "test":
        print(f"[preprocess] Querying AIS data for testing period: {TEST_START_DATE} to {TEST_END_DATE}")
        # Loading filtered data from parquet files
        df = ais_query.query_ais_duckdb(parquet_folder_path, date_start=TEST_START_DATE, date_end=TEST_END_DATE, verbose=VERBOSE_MODE)
    else:
        raise ValueError(f"Invalid dataframe_type: {dataframe_type}. Must be 'all', 'train' or 'test'.")
     
    # Converting COG to sine and cosine components
    df = pre_processing_utils.cog_to_sin_cos(df)
    
    # Dropping unnecessary columns (ignored if a column is absent)
    df.drop(columns=[ 
        'Type of mobile', 
        'COG', 
        'Date'], inplace=True, errors='ignore')
    
    # Removing rows with NaN values in any remaining column
    df.dropna(inplace=True)
    
    # Grouping Ship types
    commercial_types = ["Cargo", "Tanker"]
    passenger_types = ["Passenger", "Pleasure", "Sailing"]
    service_types = ["Dredging", "Law enforcement", "Military", "Port tender", "SAR", "Towing", "Towing long/wide","Tug"]
    valid_types =  ["Fishing", "Service", "Commercial", "Passenger"]

    df.loc[df["Ship type"].isin(commercial_types), "Ship type"] = "Commercial"
    df.loc[df["Ship type"].isin(passenger_types), "Ship type"] = "Passenger"
    df.loc[df["Ship type"].isin(service_types), "Ship type"] = "Service"
    df.loc[~df["Ship type"].isin(valid_types), "Ship type"] = "Other"
    
    print("[preprocess] Ship type counts:")
    print(df["Ship type"].value_counts())

    if VERBOSE_MODE:
        print(f"[preprocess] DataFrame after dropping unnecessary columns and NaNs: {len(df):,} rows")

    # Segmenting AIS tracks based on time gaps and max duration, filtering short segments
    df = ais_segment.segment_ais_tracks(
        df,
        max_time_gap_sec=MAX_TIME_GAP_SEC,
        max_track_duration_sec=MAX_TRACK_DURATION_SEC,
        min_track_duration_sec=MIN_TRACK_DURATION_SEC,
        min_track_len=MIN_SEGMENT_LENGTH,
        verbose=VERBOSE_MODE
    )

    # Adding segment nr feature
    # df = pre_processing_utils.add_segment_nr(df)

    # Removing segments with low point density
    df = pre_processing_utils.remove_notdense_segments(df, min_freq_points_per_min=MIN_FREQ_POINTS_PER_MIN)
    
    # Resampling all tracks to fixed time intervals
    df = pre_processing_utils.resample_all_tracks(df, rule=RESAMPLING_RULE)

    print(f"[preprocess] Number of segments and rows after removing low-density segments and resampling: {df['Segment_nr'].nunique():,} segments, {len(df):,} rows")

    # Normalizing numeric columns
    #df, mean, std = pre_processing_utils.normalize_df(df, NUMERIC_COLS)

    # Ship type labeling (mapping to be used later)
    df, ship_type_to_id = pre_processing_utils.label_ship_types(df)
    
    # Saving pre-processed DataFrame
    if dataframe_type == "train":
        print(f"[preprocess] Saving pre-processed DataFrame to {config.PRE_PROCESSING_DF_TRAIN_PATH}...")
        output_path = config.PRE_PROCESSING_DF_TRAIN_PATH
        #metadata_path = config.PRE_PROCESSING_METADATA_TRAIN_PATH
    else:
        print(f"[preprocess] Saving pre-processed DataFrame to {config.PRE_PROCESSING_DF_TEST_PATH}...")
        output_path = config.PRE_PROCESSING_DF_TEST_PATH
        #metadata_path = config.PRE_PROCESSING_METADATA_TEST_PATH

    if VERBOSE_MODE: print(f"[preprocess] Columns of pre-processed DataFrame:\n{df.columns.tolist()}")
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(output_path, index=False)

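The segmentation call in `main_preprocess` takes a maximum time gap, a maximum and minimum track duration, and a minimum length. One way such rules can work for a single vessel is sketched below. This is an assumption about the general technique, not the implementation in `ais_segment`; the function names are hypothetical and timestamps are plain seconds.

```python
def segment_by_time_gap(timestamps, max_gap_sec, max_duration_sec):
    """Split sorted timestamps into segments at large gaps or when a segment gets too long."""
    segments = []
    current = []
    for t in timestamps:
        gap_too_large = bool(current) and t - current[-1] > max_gap_sec
        too_long = bool(current) and t - current[0] > max_duration_sec
        if gap_too_large or too_long:
            segments.append(current)
            current = []
        current.append(t)
    if current:
        segments.append(current)
    return segments


def keep_valid_segments(segments, min_duration_sec, min_len):
    """Discard segments that are too short in time or have too few points."""
    return [
        seg for seg in segments
        if len(seg) >= min_len and seg[-1] - seg[0] >= min_duration_sec
    ]
```

For example, `segment_by_time_gap([0, 10, 20, 500, 510, 520, 530], max_gap_sec=60, max_duration_sec=1000)` splits the track at the 480-second gap into two segments.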

#### Preprocess for train and test

In [11]:
main_preprocess("train")

[preprocess] Querying AIS data for training period: 2025-08-01 to 2025-08-07
[ais_query] Querying parquet files from: ais-data/parquet  from date 2025-08-01  to date  2025-08-07
[ais_query] 1,061,023 rows, 1,133 vessels, from date 2025-08-01 to date 2025-08-07
[preprocess] Ship type counts:
Ship type
Commercial    489776
Passenger     265809
Fishing       236166
Service        43170
Other          26102
Name: count, dtype: int64
[preprocess] DataFrame after dropping unnecessary columns and NaNs: 1,061,023 rows
[segment_ais_tracks] Starting with 1,061,023 rows, 1,133 unique vessels
[segment_ais_tracks] Final processed data: 1,056,993 rows, 1,691 segments.
[preprocess] Number of segments and rows after removing low-density segments and resampling: 1,572 segments, 199,476 rows
[preprocess] Saving pre-processed DataFrame to ais-data/df_preprocessed/pre_processed_df_train.parquet...
[preprocess] Columns of pre-processed DataFrame:
['Segment_nr', 'Timestamp', 'Latitude', 'Longitude', 'SOG', 

In [12]:
main_preprocess("test")

[preprocess] Querying AIS data for testing period: 2025-08-08 to 2025-08-08
[ais_query] Querying parquet files from: ais-data/parquet  from date 2025-08-08  to date  2025-08-08
[ais_query] 151,933 rows, 262 vessels, from date 2025-08-08 to date 2025-08-08
[preprocess] Ship type counts:
Ship type
Commercial    72556
Passenger     37184
Fishing       35480
Other          3914
Service        2799
Name: count, dtype: int64
[preprocess] DataFrame after dropping unnecessary columns and NaNs: 151,933 rows
[segment_ais_tracks] Starting with 151,933 rows, 262 unique vessels
[segment_ais_tracks] Final processed data: 151,180 rows, 278 segments.
[preprocess] Number of segments and rows after removing low-density segments and resampling: 255 segments, 29,510 rows
[preprocess] Saving pre-processed DataFrame to ais-data/df_preprocessed/pre_processed_df_test.parquet...
[preprocess] Columns of pre-processed DataFrame:
['Segment_nr', 'Timestamp', 'Latitude', 'Longitude', 'SOG', 'COG_sin', 'COG_cos', 'T

## Train

#### File imports

In [13]:
import config as config_file
from src.train.ais_dataset import AISDataset, ais_collate_fn
from src.train.model_anchoring import AIS_LSTM_Autoencoder
from src.train.training_loop import run_experiment

#### Library imports

In [14]:
import datetime
import torch
from torch.utils.data import DataLoader, random_split
import os
import json

#### Configuration

In [15]:
PARQUET_FILE = config_file.PRE_PROCESSING_DF_TRAIN_PATH
TRAIN_OUTPUT_DIR = config_file.TRAIN_OUTPUT_DIR
LOSS_TYPE = config_file.LOSS_TYPE

# ensure output directory exists
os.makedirs(TRAIN_OUTPUT_DIR, exist_ok=True)

SPLIT_TRAIN_VAL_RATIO = config_file.SPLIT_TRAIN_VAL_RATIO
EPOCHS = config_file.EPOCHS
PATIENCE = config_file.PATIENCE
FEATURES = config_file.FEATURE_COLS
NUM_SHIP_TYPES = config_file.NUM_SHIP_TYPES

#### Hyperparameters 

In [16]:
# ---------------------------------------------------------
# HYPERPARAMETERS
# ---------------------------------------------------------

top_params = {
    'hidden_dim': config_file.HIDDEN_DIM,    # Capacity of the LSTM
    'latent_dim': config_file.LATENT_DIM,    # Bottleneck
    'num_layers': config_file.NUM_LAYERS,    # Depth
    'lr': config_file.LEARNING_RATE,         # Learning Rate
    'batch_size': config_file.BATCH_SIZE,    # Batch Size
    'dropout': config_file.DROP_OUT          # Regularization
}

run_name_suffix = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

run_name = (f"H{top_params['hidden_dim']}_L{top_params['latent_dim']}_"
            f"Lay{top_params['num_layers']}_lr{top_params['lr']}_"
            f"BS{top_params['batch_size']}_Drop{top_params['dropout']}_{run_name_suffix}_{LOSS_TYPE}")

config = {
    "run_name": run_name,
    "epochs": EPOCHS,
    "patience": PATIENCE,
    "features": FEATURES,
    "num_ship_types": NUM_SHIP_TYPES,
    "shiptype_emb_dim": 8,
    "loss_type": LOSS_TYPE,
    
    # Dynamic params (now fixed)
    "hidden_dim": top_params['hidden_dim'],
    "latent_dim": top_params['latent_dim'],
    "num_layers": top_params['num_layers'],
    "lr": top_params['lr'],
    "batch_size": top_params['batch_size'],
    "dropout": top_params['dropout']
}

configs = [config]

print(f"Running a single configuration: {run_name}")


Running a single configuration: H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse


#### Device setup

In [17]:
if torch.cuda.is_available():
    device = torch.device("cuda")  # for PC with NVIDIA
    print(f"Using device: {device} (NVIDIA GPU)")
elif torch.backends.mps.is_available():
    device = torch.device("mps")   # for Mac Apple Silicon
    print(f"Using device: {device} (Apple GPU)")
else:
    device = torch.device("cpu")   # Fallback on CPU
    print(f"Using device: {device} (CPU)")

Using device: cpu (CPU)


#### Data load

In [19]:
if not os.path.exists(PARQUET_FILE):
    raise FileNotFoundError(f"{PARQUET_FILE} not found. Run the Preprocess section first.")

# Initialize Dataset
full_dataset = AISDataset(PARQUET_FILE)
input_dim = full_dataset.input_dim

# Split Train/Val according to SPLIT_TRAIN_VAL_RATIO (80/20 in the run shown here)
train_size = int(SPLIT_TRAIN_VAL_RATIO * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

print(f"Train samples: {len(train_dataset)}, Val samples: {len(val_dataset)}")


Loading data from ais-data/df_preprocessed/pre_processed_df_train.parquet...
Normalizing features...
Grouping segments...
Processed 1572 unique segments.
Train samples: 1257, Val samples: 315

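The sample counts in the output follow directly from the split arithmetic: `int(0.8 * 1572)` is 1257, and the remaining 315 segments go to validation. The sketch below reproduces that arithmetic with the standard library only. The helper names are hypothetical, and the seeded shuffle is an illustration of a reproducible split; the notebook's `random_split` call passes no `generator`, so its split can differ between runs unless a seed is set elsewhere.

```python
import random


def split_sizes(n_samples, train_ratio):
    """Same arithmetic as the cell above: truncate the train size, give the rest to val."""
    train_size = int(train_ratio * n_samples)
    return train_size, n_samples - train_size


def random_index_split(n_samples, train_ratio, seed=0):
    """Shuffle indices with a fixed seed and cut them into train and val parts."""
    train_size, _ = split_sizes(n_samples, train_ratio)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]


print(split_sizes(1572, 0.8))  # (1257, 315)
```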

#### Experiment Loop
1. **Create DataLoaders:** Creates PyTorch DataLoaders for training and validation datasets.
2. **Training Setup:** Initializes the model with a fixed `num_ship_types`, together with the optimizer and loss function.
3. **Training Loop:** Trains the model over a set number of epochs, implementing early stopping based on validation loss.
4. **Evaluation:** Assesses model performance on the validation set after each epoch.
5. **Model Saving:** Saves the best-performing model based on validation loss.

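Steps 3 to 5 describe early stopping with a patience counter. The internals of `run_experiment` are not shown here, so the following is only a generic sketch of the technique: it replays a list of per-epoch validation losses and reports which epoch would have been saved and when training would have stopped. The function name is hypothetical.

```python
def replay_early_stopping(val_losses, patience):
    """Return (best_epoch, best_loss, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    last_epoch = 0
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if val_loss < best_loss:
            # A real loop would save the model weights at this point
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, last_epoch
```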
In [20]:
results = []

for config in configs:
    # Create DataLoaders
    train_loader = DataLoader(  # Training DataLoader
        train_dataset, 
        batch_size=config['batch_size'], 
        shuffle=True, 
        collate_fn=ais_collate_fn
    )
    
    val_loader = DataLoader(    # Validation DataLoader
        val_dataset, 
        batch_size=config['batch_size'], 
        shuffle=False, 
        collate_fn=ais_collate_fn
    )
    
    # Initialize Model with FIXED num_ship_types
    model = AIS_LSTM_Autoencoder(
        input_dim=input_dim,
        hidden_dim=config['hidden_dim'],
        latent_dim=config['latent_dim'],
        num_layers=config['num_layers'],
        num_ship_types=NUM_SHIP_TYPES, # Always use the fixed constant
        shiptype_emb_dim=config['shiptype_emb_dim'],
        dropout=config['dropout']
    ).to(device)
    
    # Run Pipeline
    save_path = f"{TRAIN_OUTPUT_DIR}/weights_{config['run_name']}.pth"
    history, best_loss = run_experiment(config, model, train_loader, val_loader, device, save_path=save_path)
    
    # Save results
    results.append({
        "config": config['run_name'],
        "best_val_loss": best_loss,
        "history": history
    })

    # Save the run config next to the weights (the weights themselves are saved by run_experiment)
    os.makedirs(TRAIN_OUTPUT_DIR, exist_ok=True)
    with open(f"{TRAIN_OUTPUT_DIR}/config_{config['run_name']}.json", 'w') as f:
        json.dump(config, f, indent=4)


--- Starting Run: H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse (Loss: mse)---
Epoch [1/1] Train Loss: 0.137647 | Val Loss: 0.099916
Validation loss decreased. Model saved to models/weights_H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse.pth


#### Summary of the model

In [21]:
results_path = os.path.join(TRAIN_OUTPUT_DIR, "results_summary_single"+ datetime.datetime.now().strftime("%Y%m%d_%H%M%S")+".json")
with open(results_path, "w") as f:
    json.dump(results, f, indent=4)

# Print result
print("\n=== Single Configuration Result ===") 
print(f"Run: {config['run_name']} | Best Val Loss: {float(best_loss):.6f}")


=== Single Configuration Result ===
Run: H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse | Best Val Loss: 0.099916


## Test

#### File imports

In [22]:
import config as config_file
from src.test.ais_tester import AISTester

#### Library imports

In [23]:
import os
import json

#### Configuration

Copy the run name printed by the training cell above, from the line "Starting Run: **H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse** (Loss: mse)", into `MODEL_NAME` below.

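As an alternative to copying the run name by hand, the most recent weights file could be looked up automatically. The sketch below relies on the `weights_<run_name>.pth` naming used in the Train section; the helper `latest_run_name` is hypothetical and not part of this repository.

```python
from pathlib import Path


def latest_run_name(models_dir, prefix="weights_", suffix=".pth"):
    """Return the run name of the most recently modified weights file."""
    candidates = sorted(
        Path(models_dir).glob(f"{prefix}*{suffix}"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"No {prefix}*{suffix} files in {models_dir}")
    newest = candidates[-1]
    # Strip the prefix and suffix explicitly; run names contain dots (e.g. lr0.001)
    return newest.name[len(prefix):-len(suffix)]


# Possible usage in the configuration cell:
# MODEL_NAME = latest_run_name(config_file.TRAIN_OUTPUT_DIR)
```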

In [24]:
# Name of the model configuration to use
MODEL_NAME = "H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse"  # Change as needed

N_BEST_WORST = config_file.N_BEST_WORST
N_MAP_RANDOM = config_file.N_MAP_RANDOM

# Data to test on
PARQUET_FILE = config_file.PRE_PROCESSING_DF_TEST_PATH

# Output Directory
OUTPUT_DIR = config_file.TEST_OUTPUT_DIR + "/" + MODEL_NAME
os.makedirs(OUTPUT_DIR, exist_ok=True)

WEIGHTS_FILE = config_file.TRAIN_OUTPUT_DIR + "/weights_" + MODEL_NAME + ".pth"
MODEL_CONFIG_FILE = config_file.TRAIN_OUTPUT_DIR + "/config_" + MODEL_NAME + ".json"

#### Load Model and Init Tester

In [25]:
# Load Model Config
with open(MODEL_CONFIG_FILE, 'r') as f:
    model_config = json.load(f)

# Initialize Tester
tester = AISTester(model_config, WEIGHTS_FILE, output_dir=OUTPUT_DIR)

Using device: cpu
Loading weights from models/weights_H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse.pth...


#### Run Testing and Evaluation

In [26]:
# Run tester pipeline (assumes PARQUET_FILE and tester are defined elsewhere in the notebook)
if os.path.exists(PARQUET_FILE):
    # 1. Evaluate ALL data first
    tester.load_data(PARQUET_FILE)
    tester.evaluate()
        
    # 2. Plot General Stats
    tester.plot_error_distributions()
        
    # 3. Plot Filtered Stats (Example)
    # You can pass a list of IDs to filter just the plot without re-running evaluate
    # my_interesting_ids = ["segment_A", "segment_B"]
    # tester.plot_error_distributions(filter_ids=my_interesting_ids, filename_suffix="_special_group")
        
    # 4. Standard Best/Worst
    tester.plot_best_worst_segments(n=N_BEST_WORST)
        
    # 5. Maps
    tester.generate_maps(n_best_worst=N_BEST_WORST, n_random=N_MAP_RANDOM)

    # 6. Filtered Map Example
    # tester.generate_filtered_map(segment_ids=["segment_1", "segment_2"], map_name="map_special_segments")
        
else:
    print(f"File {PARQUET_FILE} not found.")

Loading data from ais-data/df_preprocessed/pre_processed_df_test.parquet...
Normalizing features...
Grouping segments...
Processed 255 unique segments.
Test data loaded: 255 segments.
Running predictions...
Evaluation complete. Processed 255 segments.
Error distribution plot saved to: test_results/H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse/error_distribution.png

--- Saving Top 3 Best Reconstructions (Line Plots) ---

--- Saving Top 3 Worst Reconstructions (Line Plots) ---
Map saved: test_results/H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse/map_BEST_3_segments.html
Map saved: test_results/H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse/map_WORST_3_segments.html
Map saved: test_results/H256_L64_Lay1_lr0.001_BS64_Drop0.0_20251205_220544_mse/map_RANDOM_5_segments.html

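The best and worst reconstructions reported above depend on a per-segment error score. How `AISTester` computes it is not shown in this notebook; the sketch below assumes a mean squared error per segment, in line with the `mse` loss used for this run, and ranks segments by that score. Both helper names are hypothetical.

```python
def segment_mse(original, reconstructed):
    """Mean squared error over all time steps and features of one segment."""
    total = 0.0
    count = 0
    for row_orig, row_rec in zip(original, reconstructed):
        for o, r in zip(row_orig, row_rec):
            total += (o - r) ** 2
            count += 1
    return total / count


def rank_segments(errors, n):
    """Return the n lowest-error and n highest-error segment ids (worst first)."""
    ordered = sorted(errors, key=errors.get)
    return ordered[:n], ordered[-n:][::-1]
```

A high reconstruction error marks a segment the autoencoder could not reproduce well, which is what makes it a candidate anomaly.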