# Code Implementation
This file contains the main pipeline for the project.

Additional helper functions and modules can be found under `src/`.

## Data preprocessing using MapReduce
The data used in this project is maritime data from automatic identification systems (AIS), obtained from the [Danish Maritime Authority](http://aisdata.ais.dk/). The data is available as one CSV file per day and contains a row for each AIS message, with columns such as **Timestamp**, **MMSI**, **Latitude**, **Longitude**, and more. MMSI stands for Maritime Mobile Service Identity and is a unique identifier for a vessel.

Uncompressed, the data for a single day takes up around 3 GB, and we wish to process three months' worth of data, which is far too much to hold in memory at once. However, because the data is a time series and vessel voyages often span multiple days, we cannot preprocess the daily files in isolation. In addition, we want to reduce the wall-clock time of preprocessing by using parallel processing across multiple CPUs on DTU's High Performance Computing (HPC) cluster. This is where MapReduce comes in.
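
To make the idea concrete before going through the real pipeline, here is a toy, in-memory sketch of the three steps. The records, the MMSI values, and the function names `split_by_key`, `shuffle`, and `reduce_track` are made up for illustration; the actual pipeline works on files on disk and does far more in the reduce step.

```python
from collections import defaultdict

# Two toy "daily files": each record is (mmsi, unix_timestamp).
# The values are made up for illustration only.
day_1 = [(111, 100), (222, 150), (111, 50)]
day_2 = [(111, 300), (222, 250)]

def split_by_key(records):
    """Split step: group one day's records by MMSI."""
    groups = defaultdict(list)
    for mmsi, timestamp in records:
        groups[mmsi].append(timestamp)
    return dict(groups)

def shuffle(daily_groups):
    """Shuffle step: collect every day's segment for the same MMSI."""
    by_mmsi = defaultdict(list)
    for groups in daily_groups:
        for mmsi, segment in groups.items():
            by_mmsi[mmsi].append(segment)
    return dict(by_mmsi)

def reduce_track(segments):
    """Reduce step: merge one vessel's segments into a time-ordered track."""
    return sorted(t for segment in segments for t in segment)

daily_groups = [split_by_key(day) for day in (day_1, day_2)]
tracks = {mmsi: reduce_track(segs) for mmsi, segs in shuffle(daily_groups).items()}
print(tracks)  # {111: [50, 100, 300], 222: [150, 250]}
```

Each day can be split independently, and each vessel can be reduced independently, which is what makes both steps easy to parallelize.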

### Split

The preprocessing script is adapted from [CIA-Oceanix/GeoTrackNet](https://github.com/CIA-Oceanix/GeoTrackNet). It first converts each CSV file individually into a dictionary of arrays grouped by MMSI. The grouped dictionaries are saved as pickle files in a temporary directory. This is the split step of MapReduce. A simplified version of the code is presented below.

In [None]:
import numpy as np
import os
import pickle
import csv
import time
from multiprocessing import Pool, cpu_count

from src.preprocessing.csv2pkl import (map_nav_status_to_int,
                                       convert_str_to_unix,
                                       map_ship_type_to_int,
                                       save_vessel_types,
                                       filter_messages)

def process_single_csv(csv_filename,
                       input_dir,
                       output_dir,
                       vessel_type_dir,
                       lat_min,
                       lat_max,
                       lon_min,
                       lon_max,
                       sog_max):
    
    # Define column indices
    LAT, LON, SOG, COG, HEADING, ROT, NAV_STT, TIMESTAMP, MMSI, SHIPTYPE  = list(range(10))
    
    t_date_str = '-'.join(csv_filename.split('.')[0].split('-')[1:4])
    t_min = time.mktime(time.strptime(t_date_str + ' 00:00:00', "%Y-%m-%d %H:%M:%S"))
    t_max = time.mktime(time.strptime(t_date_str + ' 23:59:59', "%Y-%m-%d %H:%M:%S"))
    
    l_l_msg = [] # list of AIS messages, each row is a message (list of AIS attributes)
    data_path = os.path.join(input_dir, csv_filename)
    
    with open(data_path,"r") as f:
        csvReader = csv.reader(f)
        next(csvReader) # skip the legend row
        count = 1
        for row in csvReader:
            count += 1
            try:
                l_l_msg.append([float(row[3]), # Latitude
                                float(row[4]), # Longitude
                                float(row[7]), # SOG
                                float(row[8]), # COG
                                int(row[9]), # Heading
                                float(row[6]), # ROT
                                int(map_nav_status_to_int(row[5])), # Navigation status
                                int(convert_str_to_unix(row[0])), # Timestamp
                                int(float(row[2])), # MMSI
                                int(map_ship_type_to_int(row[13]))]) # Ship type
            except Exception:  # malformed row; a bare except would also swallow KeyboardInterrupt
                print(f"Error parsing row {count} in file {csv_filename}. Skipping row.")
                continue
            
    m_msg = np.array(l_l_msg)
        
    if vessel_type_dir is not None:
        save_vessel_types(m_msg, vessel_type_dir, t_date_str) # Save vessel types mapping

    ## Filter messages based on min/max criteria
    m_msg = filter_messages(m_msg, lat_min, lat_max, lon_min, lon_max, sog_max, t_min, t_max)

    ## Build vessel tracks dictionary
    Vs = dict()
    for v_msg in m_msg:
        mmsi = int(v_msg[MMSI])
        if mmsi not in Vs:
            Vs[mmsi] = np.empty((0,9))
        Vs[mmsi] = np.concatenate((Vs[mmsi], np.expand_dims(v_msg[:9],0)), axis = 0)
    for key in Vs.keys(): # Sort each vessel's messages by timestamp
        Vs[key] = np.array(sorted(Vs[key], key=lambda m_entry: m_entry[TIMESTAMP]))
            

    ## Save to pickle file
    output_filename = os.path.splitext(csv_filename)[0] + '.pkl'  # swap only the extension
    with open(os.path.join(output_dir,output_filename),"wb") as f:
        pickle.dump(Vs,f)
        
LON_MIN, LON_MAX, LAT_MIN, LAT_MAX, SOG_MAX, DURATION_MAX = 5.0, 17.0, 54.0, 59.0, 30.0, 24

input_dir = 'data/files/'
output_dir = 'data/pickle_files'
vessel_type_dir = os.path.join(output_dir, 'vessel_types')

l_csv_filename = [filename for filename in os.listdir(input_dir) if filename.endswith('.csv')]
os.makedirs(output_dir, exist_ok=True)
os.makedirs(vessel_type_dir, exist_ok=True)  # in case save_vessel_types does not create it

# Process csvs in parallel
tasks = [(csv_file, input_dir, output_dir, vessel_type_dir, 
        LAT_MIN, LAT_MAX, LON_MIN, LON_MAX, SOG_MAX) 
        for csv_file in l_csv_filename]

n_workers = max(1, cpu_count() - 1)  # Leave 1 core free, but always keep at least 1 worker
with Pool(processes=n_workers) as pool:
    pool.starmap(process_single_csv, tasks)  # blocks until every file is processed
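
The track-building loop in `process_single_csv` grows each vessel's array one row at a time with `np.concatenate`, which copies the array on every message. A possible alternative, sketched here under the assumption that the message matrix uses the same column layout as above, is to sort once with `np.lexsort` and cut the sorted matrix at the rows where the MMSI changes. The helper name `group_by_mmsi` and the demo values are hypothetical.

```python
import numpy as np

# Column indices mirror the layout used in process_single_csv.
TIMESTAMP, MMSI = 7, 8

def group_by_mmsi(m_msg):
    """Group message rows by MMSI, with each group sorted by timestamp."""
    if m_msg.size == 0:
        return {}
    # np.lexsort treats the LAST key as the primary one: MMSI first, then time.
    order = np.lexsort((m_msg[:, TIMESTAMP], m_msg[:, MMSI]))
    m_sorted = m_msg[order]
    # Positions where the MMSI value changes mark the group boundaries.
    boundaries = np.flatnonzero(np.diff(m_sorted[:, MMSI])) + 1
    groups = np.split(m_sorted, boundaries)
    # Keep the first 9 columns, as the original loop does.
    return {int(g[0, MMSI]): g[:, :9] for g in groups}

# Made-up demo: four messages from two vessels, out of order.
demo = np.zeros((4, 10))
demo[:, MMSI] = [2, 1, 2, 1]
demo[:, TIMESTAMP] = [30, 20, 10, 40]
tracks = group_by_mmsi(demo)
print(sorted(tracks))  # [1, 2]
```

This is only a sketch of the grouping step; it has not been benchmarked against the original loop on real AIS files.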

### Mapping and shuffling
Now that the full dataset has been split into per-day chunks, we map each item (a trajectory segment) by MMSI into an MMSI-specific directory, ready for preprocessing (reduction).

The resulting temporary directory has the structure:
```
data/
└── temp_dir/
    ├── 123456789/                      # MMSI (unique vessel identifier)
    │   ├── chunk_0001.pkl              # Segment(s) from input_dir
    │   ├── chunk_0002.pkl
    │
    ├── 987654321/
    │   ├── chunk_0001.pkl
    │
    └── ...                             # One folder per MMSI
```

In [None]:
def map_and_shuffle(input_dir: str, temp_dir: str):
    """ Goes through all input files and re-sorts them by MMSI into a temporary directory. """
    
    # Input files from chunking step
    input_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith(".pkl")]

    for file_path in input_files:
        with open(file_path, "rb") as f:
            data_dict = pickle.load(f)
            
            for mmsi, track_segment in data_dict.items():
                
                # Create a directory for this specific MMSI
                mmsi_dir = os.path.join(temp_dir, str(mmsi))
                os.makedirs(mmsi_dir, exist_ok=True)
                
                # Save this segment into the MMSI's folder
                # We name it after the original file to avoid collisions
                segment_filename = os.path.basename(file_path)
                output_path = os.path.join(mmsi_dir, segment_filename)
                
                with open(output_path, "wb") as out_f:
                    pickle.dump(track_segment, out_f)
                    
temp_dir = 'data/temp_mapped'
map_and_shuffle(input_dir=output_dir, temp_dir=temp_dir)
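
To see the effect of the shuffle on a small scale, the sketch below runs a condensed restatement of the same logic on two made-up daily files inside a throwaway directory. The function name `shuffle_by_mmsi`, the file names, and the MMSI values are hypothetical and chosen only for the demonstration.

```python
import os
import pickle
import tempfile

def shuffle_by_mmsi(input_dir, temp_dir):
    """Condensed shuffle: one folder per MMSI, one file per input day."""
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith(".pkl"):
            continue
        with open(os.path.join(input_dir, name), "rb") as f:
            data_dict = pickle.load(f)
        for mmsi, segment in data_dict.items():
            mmsi_dir = os.path.join(temp_dir, str(mmsi))
            os.makedirs(mmsi_dir, exist_ok=True)
            # Reusing the input file name keeps segments from different days apart.
            with open(os.path.join(mmsi_dir, name), "wb") as out_f:
                pickle.dump(segment, out_f)

with tempfile.TemporaryDirectory() as root:
    split_dir = os.path.join(root, "split")
    mapped_dir = os.path.join(root, "mapped")
    os.makedirs(split_dir)

    # Two made-up daily files; vessel 111 appears on both days.
    days = {
        "day-01.pkl": {111: [[1.0, 2.0]], 222: [[3.0, 4.0]]},
        "day-02.pkl": {111: [[5.0, 6.0]]},
    }
    for name, data in days.items():
        with open(os.path.join(split_dir, name), "wb") as f:
            pickle.dump(data, f)

    shuffle_by_mmsi(split_dir, mapped_dir)
    layout = {d: sorted(os.listdir(os.path.join(mapped_dir, d)))
              for d in sorted(os.listdir(mapped_dir))}

print(layout)  # {'111': ['day-01.pkl', 'day-02.pkl'], '222': ['day-01.pkl']}
```

After the shuffle, everything needed to process vessel 111 sits in a single folder, regardless of how many days it was observed on.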

### Reduce
In the final step of the MapReduce algorithm, the reduction step, we preprocess the vessel trajectories. Because we treat the trajectories of different vessels as independent of each other, and the previous step has already split and shuffled them by MMSI, this step can run in parallel.

The preprocessing includes identifying a vessel's "voyages". We define a voyage as a contiguous sequence of AIS messages from the same vessel (possibly across days), where the time interval between any two consecutive messages does not exceed two hours, and the vessel is actively moving (i.e., not moored or at anchor). See [D. Nguyen, R. Fablet](https://arxiv.org/pdf/2109.03958) for the full preprocessing rules implemented.
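
As a minimal sketch of the time-gap part of this definition only, the snippet below cuts a time-ordered list of Unix timestamps wherever two consecutive messages are more than two hours apart. The names `MAX_GAP_SECONDS` and `split_on_time_gaps` are hypothetical; the moored/at-anchor filter and the remaining rules live in `preprocess_mmsi_track` and are not reproduced here.

```python
MAX_GAP_SECONDS = 2 * 60 * 60  # two-hour limit from the voyage definition

def split_on_time_gaps(timestamps, max_gap=MAX_GAP_SECONDS):
    """Split sorted Unix timestamps into runs whose consecutive gaps are <= max_gap."""
    voyages = []
    current = []
    for t in timestamps:
        # A gap strictly larger than max_gap closes the current run.
        if current and t - current[-1] > max_gap:
            voyages.append(current)
            current = []
        current.append(t)
    if current:
        voyages.append(current)
    return voyages

# Made-up timestamps: the jump from 7200 to 20000 is longer than two hours.
print(split_on_time_gaps([0, 3600, 7200, 20000, 21000]))
# [[0, 3600, 7200], [20000, 21000]]
```

A gap of exactly two hours does not exceed the limit, so it keeps the messages in the same run.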

The folder structure for the final preprocessed files will look like:
```
final_processed/
├── 123456789_0_processed.pkl            # Processed trajectory for MMSI 123456789 (segment 0)
├── 123456789_1_processed.pkl            # (if multiple processed trajectories exist for same MMSI)
├── 987654321_0_processed.pkl
├── 987654321_1_processed.pkl
└── ...
```
where each pickle file constitutes one sample.

In [None]:
from src.preprocessing.preprocessing import preprocess_mmsi_track

def process_single_mmsi(mmsi_info):
    """ Wrapper to unpack arguments for multiprocessing."""
    mmsi, mmsi_dir_path, final_dir = mmsi_info
    
    # Load all segments for this MMSI
    all_segments = []
    # Sorted so segments are concatenated in a deterministic (filename) order
    segment_files = sorted(f for f in os.listdir(mmsi_dir_path)
                           if f.endswith(".pkl") and not f.startswith("vessel_types_"))
    if not segment_files:
        return None
    for seg_file in segment_files:
        segment_path = os.path.join(mmsi_dir_path, seg_file)
        with open(segment_path, "rb") as f:
            track_segment = pickle.load(f)
        all_segments.append(track_segment)
    
    # Merge into one track
    try:
        full_track = np.concatenate(all_segments, axis=0)
    except ValueError:
        print(f"    MMSI {mmsi}: Error concatenating. Skipping.")
        return None

    # Run processing for single MMSI's track
    processed_data = preprocess_mmsi_track(mmsi, full_track)
    
    # Save final result
    if processed_data:
        for k, traj in processed_data.items():
            final_output_path = os.path.join(final_dir, f"{mmsi}_{k}_processed.pkl")
            data_item = {'mmsi': mmsi, 'traj': traj}
            with open(final_output_path, "wb") as f:
                pickle.dump(data_item, f)
        return True
    return None
    
def reduce(final_dir: str, temp_dir: str,  n_workers: int = None):
    """
    Preprocess vessel trajectories by MMSI in parallel.
    """
    os.makedirs(final_dir, exist_ok=True)
    
    mmsi_folders = os.listdir(temp_dir)
    
    # Prepare list of (mmsi, path, output_dir) tuples for parallel processing
    mmsi_tasks = []
    for mmsi in mmsi_folders:
        mmsi_dir_path = os.path.join(temp_dir, mmsi)
        if os.path.isdir(mmsi_dir_path):
            mmsi_tasks.append((mmsi, mmsi_dir_path, final_dir))
    
    # Process in parallel
    with Pool(processes=n_workers) as pool:
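        # Consume the lazy iterator inside the with-block; otherwise the pool
        # is terminated on exit before the tasks have run
        results = list(pool.imap_unordered(process_single_mmsi, mmsi_tasks))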
        
num_workers = max(1, cpu_count() - 1)  # Leave 1 core free, but keep at least 1 worker
final_dir = 'data/final_processed'
reduce(final_dir=final_dir, temp_dir=temp_dir, n_workers=num_workers)
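
Each output file holds a dictionary with the keys `'mmsi'` and `'traj'`, as written by `process_single_mmsi`. A downstream consumer could read the samples back along the lines of the sketch below. The helper name `iter_samples` and the demo values are hypothetical; note that the MMSI is stored as the folder name from the shuffle step, so it comes back as a string.

```python
import os
import pickle
import tempfile

def iter_samples(final_dir):
    """Yield (mmsi, key, traj) for every '<mmsi>_<key>_processed.pkl' file."""
    for name in sorted(os.listdir(final_dir)):
        if not name.endswith("_processed.pkl"):
            continue  # skips other files that may share the folder
        key = name.split("_")[1]  # trajectory key from the file name, kept as a string
        with open(os.path.join(final_dir, name), "rb") as f:
            item = pickle.load(f)
        yield item["mmsi"], key, item["traj"]

with tempfile.TemporaryDirectory() as demo_dir:
    # Write two made-up samples in the same format as process_single_mmsi.
    for k, traj in enumerate([[[55.0, 10.0]], [[56.0, 11.0]]]):
        path = os.path.join(demo_dir, f"123456789_{k}_processed.pkl")
        with open(path, "wb") as f:
            pickle.dump({"mmsi": "123456789", "traj": traj}, f)
    samples = list(iter_samples(demo_dir))

print(len(samples))  # 2
```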

### Combine vessel_types and cleanup temporary files

In [None]:
import shutil

vessel_types_combined = dict()
# Sorted so that "later" below refers to filename order; os.listdir order is arbitrary
vessel_type_files = sorted(f for f in os.listdir(vessel_type_dir) if f.startswith("vessel_types_") and f.endswith(".pkl"))
for vt_file in vessel_type_files:
    vt_path = os.path.join(vessel_type_dir, vt_file)
    with open(vt_path, "rb") as f:
        vt_mapping = pickle.load(f)
        vessel_types_combined.update(vt_mapping) # In case of conflicts, later files overwrite earlier ones
    os.remove(vt_path)
combined_vt_path = os.path.join(final_dir, "vessel_types.pkl")
with open(combined_vt_path, "wb") as f:
    pickle.dump(vessel_types_combined, f)
    
shutil.rmtree(temp_dir)
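
Because `dict.update` silently overwrites earlier entries, it can be useful to know which vessels were reported with different types on different days. The sketch below is one way to record such conflicts while merging. It assumes each per-day mapping is a plain `{mmsi: vessel_type}` dictionary, which is not confirmed by the code shown here; the helper name `merge_vessel_types` and the values are made up.

```python
def merge_vessel_types(mappings):
    """Merge per-day {mmsi: type} dicts; later mappings win, conflicts are recorded."""
    combined = {}
    conflicts = {}
    for mapping in mappings:
        for mmsi, vessel_type in mapping.items():
            if mmsi in combined and combined[mmsi] != vessel_type:
                # Remember every distinct type seen for this MMSI.
                conflicts.setdefault(mmsi, {combined[mmsi]}).add(vessel_type)
            combined[mmsi] = vessel_type
    return combined, conflicts

# Made-up per-day mappings: vessel 111 changes type between the two days.
day_1 = {111: 70, 222: 60}
day_2 = {111: 80, 333: 30}
combined, conflicts = merge_vessel_types([day_1, day_2])
print(combined)   # {111: 80, 222: 60, 333: 30}
print(conflicts)  # {111: {70, 80}}
```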

## Clustering
TODO

## Something "new"
TODO

## References
D. Nguyen, R. Fablet. "A Transformer Network With Sparse Augmented Data Representation and Cross Entropy Loss for AIS-Based Vessel Trajectory Prediction," in IEEE Access, vol. 12, pp. 21596–21609, 2024.