# Port Performance Index Project - Main Notebook

This notebook presents the primary data analysis for the Port Performance Index (PPI) Project; see the README and other files in the [repo](https://github.com/epistemetrica/Port-Performance-Project) for full details.

## Data Processing

Automatic Identification System (AIS) data is ingested from the Marine Cadastre database via the scripts found in the vessel data folder. Here, we combine the AIS data with port-level data and prepare it for analysis.

In [5]:
#prelims
import polars as pl
import pandas as pd

#enable string cache for polars categoricals
pl.enable_string_cache()
#display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pl.Config(tbl_rows=200);

### AIS Data

The vessel location and status data (e.g., "under way", "anchored", "moored") include every AIS message. For the purposes of the PPI, we only need to know when a vessel *changes* status, so we keep each vessel's first observation and every observation whose status differs from the one before it; all other observations are dropped.

We also limit the data for the PPI to vessels over 100 m in length.

In [12]:
ais_lf = (
    #read into lazyframe
    pl.scan_parquet('AIS data ingestion/data/ais_clean/*.parquet')
    #drop smaller vessels
    .filter(pl.col('length')>100)
    #sort by vessel and time
    .sort(['mmsi', 'time'])
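    #flag rows whose status differs from the vessel's previous row
    #(fill_value=20 is a sentinel assumed to lie outside the valid status codes, so each vessel's
    #first ping compares as a change instead of null and is not dropped by the filter below)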
    .with_columns(
        status_change = (
            pl.col('status').ne(pl.col('status').shift(fill_value=20))
            .over('mmsi')
        )
    )
    #keep only new status pings
    .filter(pl.col('status_change'))
    #drop change col
    .drop('status_change')
)
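
To make the change-detection step concrete, here is a minimal plain-Python sketch of the same logic on a toy dataset. It is an illustration only, not the notebook's polars pipeline: the `pings` tuples and the `status_changes` helper are hypothetical, and the status codes assume the usual AIS convention (0 = under way using engine, 1 = at anchor, 5 = moored).

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical pings as (mmsi, time, status) tuples.
pings = [
    (111, 1, 0), (111, 2, 0), (111, 3, 5), (111, 4, 5), (111, 5, 0),
    (222, 1, 1), (222, 2, 1), (222, 3, 0),
]


def status_changes(rows):
    """Keep each vessel's first ping and every ping whose status
    differs from that vessel's previous ping."""
    kept = []
    # Sort by vessel, then time, mirroring .sort(['mmsi', 'time']).
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        previous = object()  # sentinel that never equals a real status
        for row in group:
            if row[2] != previous:
                kept.append(row)
            previous = row[2]
    return kept


changes = status_changes(pings)
for row in changes:
    print(row)
```

The sentinel plays the same role as `fill_value=20` in the pipeline above: it guarantees that the first ping of each vessel counts as a change, so every vessel keeps a starting status.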

### Port and Dock Data

Locations and descriptions for each dock and port come from the BTS and USACE online databases. 

In [8]:
#load port data
ports_df = (
    #read csv
    pl.read_csv('port data/Principal_Ports.csv')
    #select cols of interest
    .select('X','Y','FID','PORT','TYPE','PORT_NAME','RANK','TOTAL')
)
#load docks and anchorages
docks_df = (
    #read csv; infer_schema_length=0 reads every column as a string, and single-space fields become null
    pl.read_csv('port data/docks.csv', null_values=' ', infer_schema_length=0)
    #select cols of interest
    .select('FID','LONGITUDE','LATITUDE','NAV_UNIT_I','UNLOCODE','NAV_UNIT_N',
            'FACILITY_T','CITY_OR_TO','STATE_POST','WTWY_NAME','PORT_NAME',
            'LOCATION','DOCK')
)