# Geographic Data Processing

This notebook prepares vessel and port geographic data related to the WSU TRG's [Port Performance Project](https://github.com/epistemetrica/Port-Performance-Project). See the README.md file in the main directory for more info. 

The information relevant to this project is categorized into:
1) geographic information on ports, their associated docks, and the areas where vessels anchor before calling at the port (prepared in this notebook)
2) infrastructure information for each dock/terminal, including cargo type, number and capacity of cranes, etc. 

NOTE since both the uniqueness of port statistical areas and the anchoring behavior of vessels are substantially different between coastal, lake, and inland ports, we restrict our analysis to principal coastal ports at this time. Inland and Lake ports will be analyzed separately at a later date.

## Load and Process Docks and Ports Data

### [Principal Ports from BTS](https://data-usdot.opendata.arcgis.com/datasets/usdot::principal-ports/about)

In [1]:
#prelims
import polars as pl
import pandas as pd
import geopandas as gpd
import time
import plotly.express as px
import matplotlib.pyplot as plt
import contextily as cx
import numpy as np
import glob
import folium
from folium.plugins import HeatMap

#enable string cache for polars categoricals
pl.enable_string_cache()
#display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pl.Config(tbl_rows=100);

In [2]:
#load principal ports from BTS
ports_gdf = (
    #read in shape file downloaded from BTS
    gpd.read_file('port data/Principal_Ports/Principal_Ports.shp')
    #coerse web mercator
    .to_crs(3857)
    #drop unneeded columns
    .drop([
        'FID', #randomly assigned table id
        'PORT', #unknown numeric ID - not CBP or UN code,
        'RANK',
        'TOTAL',
        'FOREIGN_','EXPORTS', 'IMPORTS', 'DOMESTIC' #breadown of total vol (tons)
    ], axis=1)
    #rename for clarity
    .rename({'TYPE':'port_type'}, axis=1)
)
#keep only coastal ports
ports_gdf = (
    ports_gdf[ports_gdf.port_type == 'C']
    .drop('port_type', axis=1)
    .reset_index(drop=True)
)
#set col names to pythonic lowercase
ports_gdf.columns = ports_gdf.columns.str.lower()

#### Defining port waters

Tracking the time vessels spend in port waters is of primary interested in determining port performance; to that end, we define port waters as the 10km bounding box centered on each principal port.  

In [3]:
def bounds_from_point(gdf, radius, crs=4326, drop=True):
    """
    Returns a bounding box for a point gdf with a given radius
    Inputs:
        gdf: geodataframe with point geometry
        radius: radius in meters
        crs: crs of gdf
        drop: drops all columns except bounding box cols; default True. If False, returns all columns
    Outputs:

    """
    #set crs to 3857 to calculate buffer in meters
    gdf = gdf.to_crs(3857)
    #create buffer based on radius
    gdf['buffer'] = gdf.buffer(radius)
    #set geometry to buffer
    gdf = gdf.set_geometry('buffer')
    #coerse to crs (default 4326 lat/lon)
    gdf = gdf.to_crs(crs)
    #get bounding box
    gdf = pd.concat([gdf, gdf.bounds], axis=1)
    if drop:
        #drop unneeded columns
        gdf = gdf.loc[:,['minx', 'miny', 'maxx', 'maxy']]
    return gdf
    

In [4]:
ports_gdf.head()

Unnamed: 0,port_name,geometry
0,"Albany Port District, NY",POINT (-8209607.618 5257745.826)
1,"Anacortes, WA",POINT (-13647726.157 6189762.265)
2,"Baltimore, MD",POINT (-8522802.779 4757664.339)
3,"Beaumont, TX",POINT (-10474605.816 3514464.019)
4,"Boston, MA",POINT (-7907249.298 5212418.340)


In [5]:
#get port waters - 50km radius
port_waters = (
    bounds_from_point(ports_gdf, 50000, drop=False)
    #keep only needed cols
    .loc[:,['port_name', 'minx', 'miny', 'maxx', 'maxy']]
)
#inspect
port_waters.head()

Unnamed: 0,port_name,minx,miny,maxx,maxy
0,"Albany Port District, NY",-74.197318,42.311436,-73.299002,42.972229
1,"Anacortes, WA",-123.048768,48.197424,-122.150452,48.792714
2,"Baltimore, MD",-77.010798,38.902145,-76.112482,39.597784
3,"Beaumont, TX",-94.544143,29.695462,-93.645827,30.472755
4,"Boston, MA",-71.481187,42.009605,-70.582871,42.673578


### [World Ports locations from World Port Index](https://msi.nga.mil/Publications/WPI)

[this addition is tabled for now]

We will define 'docking events' each time a vessel moors. Unfortunately, vessel AIS transceivers often sending moored status messaged when not in the vacinity of known ports. We do some basic cleaning of the AIS messages when we load them below, but in order to determine whether a mooring status is given in error, we need to determine whether the message was sent from within the proximity of a known port. We define these areas a 5km-wide bounding boxes around each port. 

In [6]:
%%script echo skip
wpi_areas = (
    #read shape file from WPI
    gpd.read_file('port data/wpi_data/WPI_output.shp')
    #drop unneeded columns
    .loc[:, ['main_port_', 'geometry']]
)
#get bounds for each port
wpi_areas = bounds_from_point(wpi_areas, 2500)

skip


### [Docks and Anchorages from Army Corp](https://geospatial-usace.opendata.arcgis.com/datasets/23d91bd988ac4fc9943128965bddfa37_0/about)

In [7]:
#load docks and anchorages from CoE
docks_gdf = (
    #read in shape file downloaded from USACE
    gpd.read_file('port data/Dock/Dock.shp')
    #drop unneeded columns
    .drop([
        'PORT_NAME',
        'UNLOCODE', #UN Location Code, rarely used
        'CITY_OR_TO', 'STATE_POST', 'WTWY_NAME', #unneeded 
        'FID', #randomly assigned table id
        'LONGITUDE', 'LATITUDE', #already coded in 'geometry' 
        'LOCATION_D', #text description of dock location
        'STREET_ADD','ZIPCODE', #street address details
        'PSA_NAME', #statistical area name, rarely used
        'COUNTY_NAM', 'COUNTY_FIP', 'CONGRESS', 'CONGRESS_F', #county and congress info
        'MILE', 'BANK', 'LATITUDE1', 'LONGITUDE1', #redundant locaation data
        'OPERATORS', 'OWNERS', #owner info
        'PURPOSE', #long-form text description of dock uses
        'DOCK', #unknown number (not unique to each row/dock)
        'HIGHWAY_NO', 'RAILWAY_NO', 'LOCATION', #redundant location info
        'COMMODITIE', 'CONSTRUCTI','MECHANICAL', 'REMARKS', 'VERTICAL_D', 
        'DEPTH_MIN', 'DEPTH_MAX','BERTHING_L', 'BERTHING_T', 'DECK_HEIGH', 
        'DECK_HEI_1', #these are rarely used stats on construction
        'SERVICE_IN','SERVICE_TE', #rarely used indicators of data entry date 
    ], axis=1)
    #drop duplicates with matching geometries, keeping most common data
    .groupby('geometry').agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None).reset_index()
    #rename cols for clarity
    .rename(columns={
        'NAV_UNIT_I':'dock_id',
        'NAV_UNIT_N':'dock_name',
        'FACILITY_T':'facility_type'
    })
)
#set col names to pythonic lowercase
docks_gdf.columns = docks_gdf.columns.str.lower()
#coerse back to gdf - groupby appears to have kicked it back to pandas core
docks_gdf = gpd.GeoDataFrame(docks_gdf, geometry='geometry', crs=3857)

#inspect
docks_gdf.head()

Unnamed: 0,geometry,dock_id,dock_name,facility_type
0,POINT (-19217933.954 -1519086.611),0552,ASAU SMALL BOAT HARBOR,
1,POINT (-18999867.643 -1605625.815),058N,PAGO PAGO AMERICAN SAMOA,Dock
2,POINT (-18986748.196 -1606739.459),058M,AUNU'U SMALL BOAT HARBOR,
3,POINT (-18864337.721 -1602880.791),0551,TA'U HARBOR,
4,POINT (-18884324.802 -1593797.256),0550,OFU HARBOR,


### [Port Statistical Areas from BTS](https://geospatial-usace.opendata.arcgis.com/datasets/b7fd6cec8d8c43e4a141d24170e6d82f_0/about)

In [8]:
#load port stat areas from BTS
port_areas_gdf = (
    gpd.read_file('port data/Port Statistical Areas/Ports_and_Port_Statistical_Areas.shp')
    #coerse web mercator
    .to_crs(3857)
    #drop unneeded cols
    .drop(['INSTALLATI', 'MEDIAID', 'METADATAID', 'SDSID', 'DATA_YEAR', 
           'OBJECTID', 'Shape__Are', 'Shape__Len'], axis=1)
    #rename cols
    .rename({
        'geometry':'geometry_area',
        'PORTIDPK':'port_area_id',
        'FEATUREDES':'port_area_desc',
        'FEATURENAM':'port_area_name'
        }, axis=1)
    .set_geometry('geometry_area')
)

#inspect
port_areas_gdf.head()

Unnamed: 0,port_area_desc,port_area_name,port_area_id,geometry_area
0,U.S. Census Bureau municipal limit,"Galveston, TX",2417,"POLYGON ((-10589014.480 3386402.959, -10588974..."
1,"Per legislation, all of Shelby County, TN exce...","Memphis and Shelby County, TN",2294,"MULTIPOLYGON (((-9978777.748 4174967.904, -997..."
2,"All those portions of the St. Louis Bay, St. L...","Duluth-Superior, MN and WI",3924,"POLYGON ((-10274934.649 5888396.250, -10274930..."
3,Area defined by Texas state legislation creati...,"Port Freeport, TX",2408,"POLYGON ((-10672631.006 3404927.407, -10672585..."
4,"Corporate limits of Henderson county, Kentucky.","Henderson County Riverport Authority, KY",2329,"POLYGON ((-9787954.639 4565536.864, -9787769.1..."


### Merge Ports and Docks

Former method of including all docks in the Port Statistical Area was problematic since some PSAs are much larger than reasonable-interpretations of ports (and often much larger than "port waters" as defined above). 

We join docks to their nearest port with a maximum range of 5km. Future edits may pull port info directly from the docks and anchorages data. 

In [9]:
#join port info to nearest docks that are within 5km of port
stops_gdf = (
    docks_gdf.sjoin_nearest(
        ports_gdf, how='inner', max_distance=5000
    )
    .drop('index_right', axis=1)
    #recover port geometry
    .merge(ports_gdf, on='port_name', how='left')
    #rename geometry cols
    .rename(columns={
        'geometry_x':'geometry_dock',
        'geometry_y':'geometry_port'
    })
)


#inspect
stops_gdf.head()

Unnamed: 0,geometry_dock,dock_id,dock_name,facility_type,port_name,geometry_port
0,POINT (-9190384.318 3204025.920),0JU9,PINEY POINT,,"Manatee County Port, FL",POINT (-9190998.357 3202867.355)
1,POINT (-9190910.637 3202310.097),0XDE,"MANATEE COUNTY PORT AUTHORITY, BERTH NOS. 12 A...",Dock,"Manatee County Port, FL",POINT (-9190998.357 3202867.355)
2,POINT (-9190939.135 3202649.098),0ZSY,"MANATEE COUNTY PORT AUTHORITY, BERTH NO. 11",Dock,"Manatee County Port, FL",POINT (-9190998.357 3202867.355)
3,POINT (-9190908.188 3203242.559),0ZU6,"MANATEE COUNTY PORT AUTHORITY, BERTH Nos. 5 and 4",Dock,"Manatee County Port, FL",POINT (-9190998.357 3202867.355)
4,POINT (-9190537.160 3203033.092),0ZTL,"MANATEE COUNTY PORT AUTHORITY, BERTH NO. 7",Dock,"Manatee County Port, FL",POINT (-9190998.357 3202867.355)


## Load and Process AIS Data

AIS messages are obtained from the Marine Cadastre database and processed in a separate notebook; see the README for full details. Note the AIS database contains more than 2B rows; we substantially reduce the data size by loading only the messages that constitute status changes. 

In [10]:
def which_port_waters(port_waters):
    for port_name, minx, miny, maxx, maxy in port_waters.values:
        yield (
            #if lat and lon are within bounds
            (pl.col('lat').is_between(miny, maxy) & 
            pl.col('lon').is_between(minx, maxx))
            #give port name and set False to None
            .replace_strict(old=True, new=port_name, default=None)
            #rename column
            .alias(port_name)
        )

In [11]:
#init list of lazyframes
lfs = []
#process each parquet file individually into lazyframes
for file in glob.glob('ais data/data/ais_clean/*.parquet'):
    #check file integrity 
    pl.scan_parquet(file).collect_schema()
    #read and process file
    lf = (
        pl.scan_parquet(file)
        #drop smaller vessels
        .filter(pl.col('length')>100)
        #sort by mmsi and imo
        .sort(['mmsi', 'imo'])
        #fill null imo over mmsi
        .with_columns(pl.col('imo').forward_fill().over('mmsi'))
        .with_columns(pl.col('imo').backward_fill().over('mmsi'))
        #drop missing vessel id
        .drop_nulls('imo')
        #set status to undefined when moored but non-zero velocity
        .with_columns(
            pl.when((pl.col('status')==5)&(pl.col('speed')!=0))
            .then(pl.col('status')==15)
            .otherwise(pl.col('status'))
        )
        #set status to unknown when at anchor but high velocity (> 1 knot)
        .with_columns(
            pl.when((pl.col('status')==1)&(pl.col('speed')>1))
            .then(pl.col('status')==15)
            .otherwise(pl.col('status'))
        )
        #drop messages from the same vessel with same timestamp
        .unique(subset=['imo', 'time'])
        #identify list of port waters from whence each message came
        .with_columns(
            which_port_waters = (
                pl.concat_str(
                    which_port_waters(port_waters), separator='|', 
                    ignore_nulls=True
                )
                .replace('', 'Not in port waters')
            )
        )
        #split port waters into separate columns
        .with_columns(
            port_waters1 = (
                pl.col('which_port_waters').str.split('|')
                .list.get(0, null_on_oob=True)
                .replace(None, 'Not in port waters')
                ),
            port_waters2 = (
                pl.col('which_port_waters').str.split('|')
                .list.get(1, null_on_oob=True)
                .replace(None, 'Not in port waters')
                ),
            port_waters3 = (
                pl.col('which_port_waters').str.split('|')
                .list.get(2, null_on_oob=True)
                .replace(None, 'Not in port waters')
                )
            #NOTE previous analysis showed that vessels never report more than 3 simultaneous port waters
        )
        #ensure sorting
        .sort(['imo', 'time'])
        #keep status change and entering port waters messages
        .filter(
            (pl.col('status').ne(pl.col('status').shift()).over('imo')) |
            (pl.col('which_port_waters').ne(pl.col('which_port_waters')
                                         .shift()).over('imo'))
        )
    )
    #append to list of lazyframes
    lfs.append(lf)
print('files loaded; beginning collection')
# Collect all lazyframes
ais_df = pl.concat(pl.collect_all(lfs), how='diagonal_relaxed')
#inspect
ais_df.head()

files loaded; beginning collection


mmsi,time,lat,lon,speed,course,heading,status,vessel_name,vessel_type,imo,length,width,draft,cargo,which_port_waters,port_waters1,port_waters2,port_waters3
str,datetime[μs],f64,f64,f64,f64,f64,f64,cat,f64,i64,f64,f64,f64,f64,str,str,str,str
"""366971340""",2018-05-14 12:46:34,45.41117,-83.78528,0.0,14.0,236.0,5.0,"""CASON J CALLAWAY""",70.0,5065392,228.0,21.0,8.0,,"""Not in port waters""","""Not in port waters""","""Not in port waters""","""Not in port waters"""
"""366971340""",2018-05-14 12:49:37,45.41112,-83.78532,0.1,155.0,236.0,0.0,"""CASON J CALLAWAY""",70.0,5065392,228.0,21.0,8.0,,"""Not in port waters""","""Not in port waters""","""Not in port waters""","""Not in port waters"""
"""366971340""",2018-05-14 12:52:32,45.41113,-83.78528,0.0,151.0,236.0,5.0,"""CASON J CALLAWAY""",70.0,5065392,228.0,21.0,8.0,,"""Not in port waters""","""Not in port waters""","""Not in port waters""","""Not in port waters"""
"""366971340""",2018-05-14 12:58:34,45.41108,-83.7853,0.1,198.0,236.0,0.0,"""CASON J CALLAWAY""",70.0,5065392,228.0,21.0,8.0,,"""Not in port waters""","""Not in port waters""","""Not in port waters""","""Not in port waters"""
"""366971340""",2018-05-14 13:01:37,45.4112,-83.78532,0.0,144.0,236.0,5.0,"""CASON J CALLAWAY""",70.0,5065392,228.0,21.0,8.0,,"""Not in port waters""","""Not in port waters""","""Not in port waters""","""Not in port waters"""


In [12]:
#set minimum meaningful status duration (minutes)
min_duration = 5
#set minimum meaningful status duration between equal status (minutes)
min_between_duration = 30
#set minimum status changes threshold
min_status_changes = 20

#process ais data
ais_gdf = (
    ais_df
    #ensure sorting
    .sort(['imo', 'time'])
    #keep status change and entering port waters messages (drops duplicate status across day changes)
    .filter(
        (pl.col('status').ne(pl.col('status').shift()).over('imo')) |
        (pl.col('which_port_waters').ne(pl.col('which_port_waters')
                                        .shift()).over('imo'))
    )
    #drop vessels with very few status changes
    .filter(pl.len().over('imo')>min_status_changes)
    #create duration column in minutes
    .with_columns(
        status_duration = (pl.col('time').shift(-1) - pl.col('time'))
        .over('imo').dt.total_seconds()/60
    )
    #drop messages with very short status duration
    .filter(pl.col('status_duration')>min_duration)
    #drop short changes in status between equal statuses
    .with_columns(
        short = (
            #if statuses of previous and next messages are equal and
            (pl.col('status').shift()==pl.col('status').shift(-1)) &
            #if port waters of previous and next messages are equal and
            (pl.col('which_port_waters').shift()==pl.col('which_port_waters')
            .shift(-1)) & 
            #if duration of status is less than minimum
            (pl.col('status_duration') < min_between_duration)
        ).over('imo')
    )
    #then drop short messages and short column
    .filter(pl.col('short')!=True)
    .drop('short')
    #re-filter to keep status changes and entering port waters messages
    .filter(
        (pl.col('status').ne(pl.col('status').shift()).over('imo')) |
        (pl.col('which_port_waters').ne(pl.col('which_port_waters')
                                        .shift()).over('imo'))
    )
    #recalculate duration column (also resets duration to pl.Duration type)
    .with_columns(
        status_duration = (pl.col('time').shift(-1) - pl.col('time'))
        .over('imo')
    )
    #ensure sort by vessel and time
    .sort(['imo', 'time'])
    #assign call id each time a vessel visits port waters
    #create row index for constructing call id
    .with_row_index('index')
    #create call id for each vessel entry to port waters
    .with_columns(
        call_id1 = (
            #when port waters 1 not equal to next or imo changes
            pl.when(pl.col('port_waters1').ne(pl.col('port_waters1').shift(-1)) |
                    pl.col('imo').ne(pl.col('imo').shift(-1)))
            #set call_id to vessel imo, port waters and index
            .then(pl.col('imo').cast(pl.Utf8)+'_'+pl.col('port_waters1')+'_'+
                  pl.col('index').cast(pl.Utf8)
                )
            #otherwise null
            .otherwise(pl.lit(None))
            #backward fill call_id1
            .backward_fill()
        ),
        call_id2 = (
            #when port waters 2 not equal to next for same vessel
            pl.when(pl.col('port_waters2').ne(pl.col('port_waters2').shift(-1))
                    .over('imo'))
            #set call_id to vessel imo, port waters and index
            .then(pl.col('imo').cast(pl.Utf8)+'_'+pl.col('port_waters2')+'_'+
                  pl.col('index').cast(pl.Utf8)
                )
            #otherwise null
            .otherwise(pl.lit(None))
            #backward fill call_id2
            .backward_fill()
        ),
        call_id3 = (
            #when port waters 3 not equal to next for same vessel
            pl.when(pl.col('port_waters3').ne(pl.col('port_waters3').shift(-1))
                    .over('imo'))
            #set call_id to vessel imo, port waters and index
            .then(pl.col('imo').cast(pl.Utf8)+'_'+pl.col('port_waters3')+'_'+
                  pl.col('index').cast(pl.Utf8)
                )
            #otherwise null
            .otherwise(pl.lit(None))
            #backward fill call_id3
            .backward_fill()
        )
    )
    #drop index
    .drop('index')
)

print('polars dataframe processed')

#inspect
display(ais_gdf.describe())
#convert to pandas dataframe
ais_gdf = ais_gdf.to_pandas()

polars dataframe processed


statistic,mmsi,time,lat,lon,speed,course,heading,status,vessel_name,vessel_type,imo,length,width,draft,cargo,which_port_waters,port_waters1,port_waters2,port_waters3,status_duration,call_id1,call_id2,call_id3
str,str,str,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str,str,str
"""count""","""5924473""","""5924473""",5924473.0,5924473.0,5924473.0,5917439.0,5868734.0,5923373.0,"""5924473""",5924473.0,5924473.0,5924473.0,5502082.0,5472040.0,4676054.0,"""5924473""","""5924473""","""5924473""","""5924473""","""5906328""","""5924472""","""5924463""","""5924463"""
"""null_count""","""0""","""0""",0.0,0.0,0.0,7034.0,55739.0,1100.0,"""0""",0.0,0.0,0.0,422391.0,452433.0,1248419.0,"""0""","""0""","""0""","""0""","""18145""","""1""","""10""","""10"""
"""mean""",,"""2021-06-10 20:37:54.546008""",33.490612,-94.791546,4.25202,183.212164,182.994772,1.445492,,74.106387,10256000.0,209.497685,32.299588,10.904182,74.763211,,,,,"""3 days, 17:33:10.366660""",,,
"""std""",,,7.191939,20.126105,6.050943,96.242515,105.244346,2.245107,,5.207397,27923000.0,55.966129,7.86377,3.01897,8.418523,,,,,,,,
"""min""","""205041000""","""2018-01-01 00:22:03""",3.51667,-176.87657,0.0,0.0,0.0,0.0,,70.0,0.0,101.0,0.0,-12.8,0.0,"""Albany Port District, NY""","""Albany Port District, NY""","""Guaynabo, PR""","""Not in port waters""","""0:05:01""","""0_Anacortes, WA_214""","""0_Houston Port Authority, TX_1…","""0_Not in port waters_101"""
"""25%""",,"""2019-09-06 15:55:39""",29.43917,-97.40733,0.0,115.9,94.0,0.0,,70.0,9318620.0,179.0,28.0,8.8,70.0,,,,,"""0:57:08""",,,
"""50%""",,"""2021-07-22 11:03:25""",30.07179,-93.28121,0.1,180.8,181.0,0.0,,70.0,9477270.0,190.0,32.0,10.8,71.0,,,,,"""2:03:50""",,,
"""75%""",,"""2023-02-19 18:26:24""",37.90678,-80.654,9.9,260.9,277.0,5.0,,80.0,9687215.0,230.0,33.0,13.0,80.0,,,,,"""5:45:00""",,,
"""max""","""775345000""","""2024-09-30 23:30:10""",82.36032,148.43574,102.3,359.9,462.0,15.0,,89.0,990184300.0,901.0,86.0,25.5,191.0,"""Yabucoa, PR""","""Yabucoa, PR""","""Yabucoa, PR""","""Yabucoa, PR""","""2431 days, 15:05:15""","""9992268_Virginia, VA, Port of_…","""9992256_South Louisiana, LA, P…","""9992256_Terrebonne Parish Port…"


In [13]:
#convert to geopandas dataframe
ais_gdf = (
    #convert to geodataframe
    gpd.GeoDataFrame(
        ais_gdf,
        geometry=gpd.points_from_xy(ais_gdf.lon, ais_gdf.lat, crs='EPSG:4326')
    )
    #convert to WGS84 pseudo-mercator; giving distances in meters
    .to_crs(3857)
    #drop old lat lon cols
    .drop(['lat', 'lon'], axis=1)
    #rename geometry col for clarity
    .rename({'geometry':'geometry_vessel'}, axis=1)
    .set_geometry('geometry_vessel')
)

#inspect
display(ais_gdf.head())
ais_gdf.info()

Unnamed: 0,mmsi,time,speed,course,heading,status,vessel_name,vessel_type,imo,length,width,draft,cargo,which_port_waters,port_waters1,port_waters2,port_waters3,status_duration,call_id1,call_id2,call_id3,geometry_vessel
0,310762000,2020-04-01 03:17:17,0.0,203.0,53.0,1.0,STENA PRIMORSK,80.0,0,183.0,40.0,,,"Port Freeport, TX","Port Freeport, TX",Not in port waters,Not in port waters,0 days 02:33:41,"0_Port Freeport, TX_0",0_Not in port waters_12,0_Not in port waters_13,POINT (-10592668.485 3353642.843)
1,257246000,2020-04-01 05:50:58,0.0,345.9,324.0,5.0,BOW PALLADIUM,80.0,0,165.0,27.0,,,"Corpus Christi, TX","Corpus Christi, TX",Not in port waters,Not in port waters,0 days 12:59:54,"0_Corpus Christi, TX_1",0_Not in port waters_12,0_Not in port waters_13,POINT (-10842063.107 3225504.603)
2,310762000,2020-04-01 18:50:52,5.0,296.0,313.0,0.0,STENA PRIMORSK,80.0,0,183.0,40.0,,,"Port Freeport, TX","Port Freeport, TX",Not in port waters,Not in port waters,0 days 04:52:12,"0_Port Freeport, TX_3",0_Not in port waters_12,0_Not in port waters_13,POINT (-10596096.012 3355806.945)
3,310762000,2020-04-01 23:43:04,0.0,169.0,124.0,5.0,STENA PRIMORSK,80.0,0,183.0,40.0,,,"Port Freeport, TX","Port Freeport, TX",Not in port waters,Not in port waters,0 days 14:31:20,"0_Port Freeport, TX_3",0_Not in port waters_12,0_Not in port waters_13,POINT (-10612508.958 3367588.012)
4,257246000,2020-04-02 14:14:24,0.0,225.4,104.0,5.0,BOW PALLADIUM,80.0,0,165.0,27.0,,,"Corpus Christi, TX","Corpus Christi, TX",Not in port waters,Not in port waters,0 days 01:07:41,"0_Corpus Christi, TX_4",0_Not in port waters_12,0_Not in port waters_13,POINT (-10856087.136 3229155.035)


<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 5924473 entries, 0 to 5924472
Data columns (total 22 columns):
 #   Column             Dtype          
---  ------             -----          
 0   mmsi               object         
 1   time               datetime64[us] 
 2   speed              float64        
 3   course             float64        
 4   heading            float64        
 5   status             float64        
 6   vessel_name        category       
 7   vessel_type        float64        
 8   imo                int64          
 9   length             float64        
 10  width              float64        
 11  draft              float64        
 12  cargo              float64        
 13  which_port_waters  object         
 14  port_waters1       object         
 15  port_waters2       object         
 16  port_waters3       object         
 17  status_duration    timedelta64[us]
 18  call_id1           object         
 19  call_id2           object         

## Match AIS messages with relevant port calls

In [14]:
#set max distance (m) from vessel to dock
#NOTE the longest observed vessels are just shy of 400m
max_distance = 400

#join stops to AIS based on mooring locations
calls_gdf = (
    #filter to only ais moorings
    ais_gdf[ais_gdf.status == 5]
    #join nearest dock (within max distance)
    .sjoin_nearest(
        stops_gdf.set_geometry('geometry_dock'), how='inner', 
        exclusive=True, max_distance=max_distance
    )
    #drop unneeded cols
    .drop(['index_right'], axis=1)
    #recover dock geometry
    .merge(stops_gdf[['dock_id', 'geometry_dock']], how='left')
)
#left merge calls_gdf to ais_gdf to create main df
main_gdf = ais_gdf.merge(calls_gdf, how='left')

#inspect
main_gdf.head()

Unnamed: 0,mmsi,time,speed,course,heading,status,vessel_name,vessel_type,imo,length,width,draft,cargo,which_port_waters,port_waters1,port_waters2,port_waters3,status_duration,call_id1,call_id2,call_id3,geometry_vessel,dock_id,dock_name,facility_type,port_name,geometry_port,geometry_dock
0,310762000,2020-04-01 03:17:17,0.0,203.0,53.0,1.0,STENA PRIMORSK,80.0,0,183.0,40.0,,,"Port Freeport, TX","Port Freeport, TX",Not in port waters,Not in port waters,0 days 02:33:41,"0_Port Freeport, TX_0",0_Not in port waters_12,0_Not in port waters_13,POINT (-10592668.485 3353642.843),,,,,,
1,257246000,2020-04-01 05:50:58,0.0,345.9,324.0,5.0,BOW PALLADIUM,80.0,0,165.0,27.0,,,"Corpus Christi, TX","Corpus Christi, TX",Not in port waters,Not in port waters,0 days 12:59:54,"0_Corpus Christi, TX_1",0_Not in port waters_12,0_Not in port waters_13,POINT (-10842063.107 3225504.603),0VLL,PORT OF CORPUS CHRISTI NORTHSIDE BREAKBULK DOC...,Dock,"Corpus Christi, TX",POINT (-10842283.519 3225388.812),POINT (-10842078.691 3225609.319)
2,310762000,2020-04-01 18:50:52,5.0,296.0,313.0,0.0,STENA PRIMORSK,80.0,0,183.0,40.0,,,"Port Freeport, TX","Port Freeport, TX",Not in port waters,Not in port waters,0 days 04:52:12,"0_Port Freeport, TX_3",0_Not in port waters_12,0_Not in port waters_13,POINT (-10596096.012 3355806.945),,,,,,
3,310762000,2020-04-01 23:43:04,0.0,169.0,124.0,5.0,STENA PRIMORSK,80.0,0,183.0,40.0,,,"Port Freeport, TX","Port Freeport, TX",Not in port waters,Not in port waters,0 days 14:31:20,"0_Port Freeport, TX_3",0_Not in port waters_12,0_Not in port waters_13,POINT (-10612508.958 3367588.012),0SRW,"Phillips 66 Co., Freeport Terminal Barge Dock.",Dock,"Port Freeport, TX",POINT (-10616617.761 3370725.130),POINT (-10612730.039 3367583.305)
4,257246000,2020-04-02 14:14:24,0.0,225.4,104.0,5.0,BOW PALLADIUM,80.0,0,165.0,27.0,,,"Corpus Christi, TX","Corpus Christi, TX",Not in port waters,Not in port waters,0 days 01:07:41,"0_Corpus Christi, TX_4",0_Not in port waters_12,0_Not in port waters_13,POINT (-10856087.136 3229155.035),,,,,,


In [15]:
#prepare for conversion to polars
#get lat and lon columns from geometry cols
for geo in ['geometry_vessel', 'geometry_dock', 'geometry_port']:
    #get lat and lon variable names
    lat_name = f'{geo.split('geometry_')[1]}_lat'
    lon_name = f'{geo.split('geometry_')[1]}_lon'
    #set geometry to geo
    main_gdf.set_geometry(geo, inplace=True)
    #set to lat long
    main_gdf = main_gdf.to_crs(4326)
    #extract lat and lon
    main_gdf[lat_name] = main_gdf[geo].y
    main_gdf[lon_name] = main_gdf[geo].x
    #drop geometry cols
    main_gdf.drop(geo, axis=1, inplace=True)

In [16]:
main_df = (
    #convert to polars
    pl.DataFrame(main_gdf)
    #find call_id associated with docking in port waters
    .with_columns(
        #find call_id associated with docking in port waters
        pl.when(pl.col('port_name')==pl.col('port_waters1'))
        .then(pl.col('call_id1'))
        .otherwise(
            pl.when(pl.col('port_name')==pl.col('port_waters2'))
            .then(pl.col('call_id2'))
            .otherwise(
                pl.when(pl.col('port_name')==pl.col('port_waters3'))
                .then(pl.col('call_id3'))
            )
        )
        #named call_id
        .alias('call_id')
    )
    #backfill call_id over the matching call_idN
    .with_columns(
        pl.when(pl.any_horizontal(
                pl.col('^call_id.*$') == 
                pl.col('call_id').backward_fill()))
        .then(pl.col('call_id').backward_fill())
    )
    #forward fill call_id over the matching call_idN
    .with_columns(
        pl.when(pl.any_horizontal(
                pl.col('^call_id.*$') == 
                pl.col('call_id').forward_fill()))
        .then(pl.col('call_id').forward_fill())
    )
    #drop unneeded call_id cols
    .drop('call_id1', 'call_id2', 'call_id3')
    #backfill fill port and dock info over call_id
    .with_columns(
        pl.col(['port_name', 'dock_id', 'dock_name', 'facility_type',
                'dock_lat', 'dock_lon', 'port_lat', 'port_lon'])
        .backward_fill().over('call_id')
    )
    #forward fill over call_id
    .with_columns(
        pl.col(['port_name', 'dock_id', 'dock_name', 'facility_type',
                'dock_lat', 'dock_lon', 'port_lat', 'port_lon'])
        .forward_fill().over('call_id')
    )
    #drop rows not associated with principal port calls
    .filter(pl.col('call_id').is_not_null())
    #set call_id to entry date rather than row index
    .with_columns(
        #remove the row index from call_id
        pl.col('call_id').str.replace(r'_[^_]*$', '')+'_'+
        #add date of entry to port waters
        pl.col('time').dt.strftime('%Y-%m-%d').first().over('call_id')
    )
)

In [17]:
#inspect
main_df.head()

mmsi,time,speed,course,heading,status,vessel_name,vessel_type,imo,length,width,draft,cargo,which_port_waters,port_waters1,port_waters2,port_waters3,status_duration,dock_id,dock_name,facility_type,port_name,vessel_lat,vessel_lon,dock_lat,dock_lon,port_lat,port_lon,call_id
str,datetime[μs],f64,f64,f64,f64,cat,f64,i64,f64,f64,f64,f64,str,str,str,str,duration[μs],str,str,str,str,f64,f64,f64,f64,f64,f64,str
"""257246000""",2020-04-01 05:50:58,0.0,345.9,324.0,5.0,"""BOW PALLADIUM""",80.0,0,165.0,27.0,,,"""Corpus Christi, TX""","""Corpus Christi, TX""","""Not in port waters""","""Not in port waters""",12h 59m 54s,"""0VLL""","""PORT OF CORPUS CHRISTI NORTHSI…","""Dock""","""Corpus Christi, TX""",27.81369,-97.39591,27.814522,-97.39605,27.81277,-97.39789,"""0_Corpus Christi, TX_2020-04-0…"
"""310762000""",2020-04-01 18:50:52,5.0,296.0,313.0,0.0,"""STENA PRIMORSK""",80.0,0,183.0,40.0,,,"""Port Freeport, TX""","""Port Freeport, TX""","""Not in port waters""","""Not in port waters""",4h 52m 12s,"""0SRW""","""Phillips 66 Co., Freeport Term…","""Dock""","""Port Freeport, TX""",28.84401,-95.18635,28.936633,-95.335776,28.96133,-95.3707,"""0_Port Freeport, TX_2020-04-01"""
"""310762000""",2020-04-01 23:43:04,0.0,169.0,124.0,5.0,"""STENA PRIMORSK""",80.0,0,183.0,40.0,,,"""Port Freeport, TX""","""Port Freeport, TX""","""Not in port waters""","""Not in port waters""",14h 31m 20s,"""0SRW""","""Phillips 66 Co., Freeport Term…","""Dock""","""Port Freeport, TX""",28.93667,-95.33379,28.936633,-95.335776,28.96133,-95.3707,"""0_Port Freeport, TX_2020-04-01"""
"""310762000""",2020-04-02 15:22:05,0.0,114.0,124.0,5.0,"""STENA PRIMORSK""",80.0,0,183.0,40.0,,,"""Port Freeport, TX""","""Port Freeport, TX""","""Not in port waters""","""Not in port waters""",7h 18m 8s,"""0SRW""","""Phillips 66 Co., Freeport Term…","""Dock""","""Port Freeport, TX""",28.93666,-95.33378,28.936633,-95.335776,28.96133,-95.3707,"""0_Port Freeport, TX_2020-04-02"""
"""563759000""",2020-05-03 13:09:57,9.5,333.5,333.0,0.0,"""EAGLE KANGAR""",80.0,0,244.0,42.0,,,"""Philadelphia Regional Port, PA…","""Philadelphia Regional Port, PA""","""South Jersey Port Corp, NJ""","""Wilmington, DE""",3h 15m 10s,"""0S1W""","""Mobil Oil Corp., Paulsboro Ref…","""Dock""","""South Jersey Port Corp, NJ""",39.58197,-75.55733,39.84525,-75.268643,39.846445,-75.261085,"""0_South Jersey Port Corp, NJ_2…"


## Save parquet file

In [18]:
#save to parquet file
main_df.write_parquet('port data/dashboard/main.parquet')
