# Ocean Carrier Alliances - Data Analysis for USDA Report

This notebook analyses data on maritime container shipping to inspect the impacts of strategic alliances between ocean freight carriers (Ocean Carrier Alliances, OCAs) on the containerized agricultural export market in the US. Data are pre-processed in the oca_data_prep notebook; see the [project repo](https://github.com/epistemetrica/Ocean-Carrier-Alliances-Project) for full details.

Section 1 presents data summaries and aggregations, Section 2 gives general context for the maritime freight economy and vessel sharing behavior, and Section 3 presents the analytical models and results.

In [22]:
#preliminaries
import numpy as np
import pandas as pd
import polars as pl
import plotly_express as px
import statsmodels.api as sm

#enable string cache for polars categoricals
pl.enable_string_cache()
#display settings
pd.set_option('display.max_columns', None)

In [23]:
#load data
main_lf = (
    #load parquet files to polars lazyframe 
    pl.scan_parquet('../data/main/*.parquet')
    #choose only exports
    .filter(pl.col('direction')=='export')
    #drop unneeded columns
    .drop('origin_territory', 'origin_region', 'direction', 'carrier_name', 
          'carrier_scac', 'vessel_name', 'voyage_number')
    #rename cols for consistency/sanity
    .rename({'unified_carrier_name':'carrier_name',
             'unified_carrier_scac':'scac',
             'arrival_port_name':'dest_port_name',
             'arrival_port_code':'dest_port_code',
             'departure_port_name':'origin_port_name',
             'departure_port_code':'origin_port_code',
             'pc_alliance':'owner_alliance'})
    #reorder columns for sanity
    .select('bol_id', 'date', 'month', 'year', 'teus', 'hs_code', 'lane_id', 
            'lane_name', 'origin_port_name', 'origin_port_code', 'coast_region', 
            'dest_port_name', 'dest_port_code', 'dest_territory', 'dest_region',  
            'drewery_lane', 'dist', 'rate_40', 'rate_20', 'scac', 'carrier_name', 
            'alliance', 'alliance_member', 'vessel_id', 'vessel_capacity', 
            'vessel_owner', 'owner_alliance', 'primary_cargo', 'shared_teus', 
            'cargo_source')
)

## 1. Data Summary and Aggregations

The main dataset holds a unique bill of lading (BOL) on each row and includes data from various sources in the columns described below.  

### Column Definitions

- bol_id - unique identifier of each BOL
- date, month, year - departure date of the export
- teus - volume of the shipment in twenty-foot-equivalent (TEU) units
- hs_code - list of commodity codes included in the shipment
- lane_id - unique identifier of the lane (a combination of origin and destination port codes)
- lane_name - name of the lane
- origin_port_name - name of the US port of origin
- origin_port_code - CBP code for the port of origin
- coast_region - US coastal region of origin
- dest_port_name - name of the foreign destination port
- dest_port_code - CBP code for the foreign destination port
- dest_territory - name of country or territory of destination
- dest_region - global region of destination
- drewery_lane - Drewery lane matched to the lane_id using summed haversine distances between the origin/destination port pairs
- dist - summed haversine distance between the origin/destination port pairs 
- rate_40 - Drewery CFRI rate index for 40-ft equivalent containers when the cargo was carried
- rate_20 - Drewery CFRI rate index for 20-ft equivalent containers (TEU) when the cargo was carried
- scac - Standard Carrier Alpha Code (SCAC) for the firm who carried the cargo
- carrier_name - name of the carrier
- alliance - the alliance to which the carrier belonged when the cargo was sold (NOTE this needs to be updated when we input regional alliance data)
- alliance_member - boolean for whether or not the carrier was a member of any alliance when that cargo was carried
- vessel_id - International Maritime Organization (IMO) code uniquely identifying the vessel
- vessel_capacity - metric of how many TEU can be carried by the vessel at any given time (computed from Net Register Tonnage)
- vessel_owner - the carrier representing the most TEUs carried by that vessel during the month on which the BOL was carried
- owner_alliance - the OCA to which the vessel_owner belonged at the time
- primary_cargo - boolean indicating whether the cargo came from the vessel_owner or not
- shared_teus - number of TEUs if the cargo did not come from the vessel_owner (identical to 'teus' when primary_cargo is 0). 
- cargo_source - indicator for whether the cargo came from a member of the vessel_owner's alliance or not. 


### Summary stats and head

In [24]:
desc = main_lf.describe()
desc.write_csv('../data/misc/usda_maindata_summarystat.csv')
display(desc)
main_lf.limit(5).collect()

statistic,bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
str,str,str,str,f64,f64,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,f64,f64,f64,str,str,f64,f64,str
"""count""","""63737316""","""63737242""","""63737317""",63737317.0,63737317.0,"""63735980""","""63737317""","""63737317""","""63737317""","""63737317""","""63737173""","""63737317""","""63737317""","""63734881""","""63734881""","""63734875""",63734875.0,35720688.0,35720688.0,"""63737317""","""63696714""","""63737317""",63737317.0,63737317.0,59237959.0,"""63737317""","""63737317""",63737317.0,63737317.0,"""63737317"""
"""null_count""","""1""","""75""","""0""",0.0,0.0,"""1337""","""0""","""0""","""0""","""0""","""144""","""0""","""0""","""2436""","""2436""","""2442""",2442.0,28016629.0,28016629.0,"""0""","""40603""","""0""",0.0,0.0,4499358.0,"""0""","""0""",0.0,0.0,"""0"""
"""mean""",,"""2015-12-20 00:00:26.571000""",,2015.465141,3.220729,,,,,,,,,,,,1436.760648,1671.20566,1279.975398,,,,0.301567,9231900.0,2160.482837,,,0.702031,1.040772,
"""std""",,,,4.741281,5.982663,,,,,,,,,,,,1194.145132,793.815448,541.86681,,,,,474503.072275,1499.188461,,,,3.692474,
"""min""","""079A_26004878070""","""2007-01-01""","""200701""",2007.0,0.01,"""-1""",,,,,,,,,,,5.93014,400.0,310.0,,,"""2M Alliance""",0.0,196.0,0.0,,"""2M Alliance""",0.0,0.0,"""ally"""
"""25%""",,"""2012-02-27""",,2012.0,2.0,,,,,,,,,,,,425.931559,1100.0,880.0,,,,,9218686.0,905.147059,,,,0.0,
"""50%""",,"""2016-02-20""",,2016.0,2.533158,,,,,,,,,,,,1261.373246,1590.0,1230.0,,,,,9315202.0,2036.911765,,,,0.0,
"""75%""",,"""2019-12-13""",,2019.0,2.533158,,,,,,,,,,,,2377.521062,2020.0,1530.0,,,,,9430868.0,3253.161765,,,,2.0,
"""max""","""zzzz_ZZZZ""","""2023-12-31""","""202312""",2023.0,3729.25,"""ddedo""",,,,,,,,,,,8769.910108,15900.0,12070.0,,,"""The Alliance""",1.0,9979125.0,17889.705882,,"""The Alliance""",1.0,1123.25,"""non-ally"""


bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
str,date,str,i32,f64,str,cat,cat,cat,cat,cat,cat,cat,cat,cat,cat,f64,f64,f64,cat,cat,str,bool,i32,f64,cat,str,bool,f64,str
"""ZIML_IMUORF177673""",2007-01-01,"""200701""",2007,2.533158,"""390120""","""5301_24128""","""Houston — Kingston""","""HOUSTON""","""5301""","""GULF""","""KINGSTON""","""24128""","""JAMAICA""","""CARIBBEAN""","""US Gulf Coast (Houston) to Col…",860.905655,,,"""ZIMU""","""ZIM CONTAINER""","""Non-alliance Carriers""",False,9008562,,"""ZIMU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_800184208""",2007-01-01,"""200701""",2007,2.533158,"""901890""","""5301_47563""","""Houston — Cagliari""","""HOUSTON""","""5301""","""GULF""","""CAGLIARI""","""47563""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",585.05394,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_511755218""",2007-01-01,"""200701""",2007,2.533158,"""080290""","""5301_47563""","""Houston — Cagliari""","""HOUSTON""","""5301""","""GULF""","""CAGLIARI""","""47563""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",585.05394,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_851964547""",2007-01-01,"""200701""",2007,2.533158,"""732690""","""5301_47536""","""Houston — Gioia Tauro""","""HOUSTON""","""5301""","""GULF""","""GIOIA TAURO""","""47536""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",890.01468,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_HOUM30858""",2007-01-01,"""200701""",2007,2.533158,"""401199""","""5301_47536""","""Houston — Gioia Tauro""","""HOUSTON""","""5301""","""GULF""","""GIOIA TAURO""","""47536""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",890.01468,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""


INSERT VOLUMES BY PORT SIZE VISUAL 

### Top 500 lanes

Containerized frieght volumes are highly concentrated to common shipping lanes (e.g., LA to Tokyo) and while many ports around the world have small container terminals and thus appear in the PIERS database, these ports are visted much less frequentyly and constitute tiny percentages of volumnes on each ship. For these reasons, we suspect the impacts of alliance behavior may be different for small container ports; we drop all but the top 500 lanes in order to focus on the majority of volumes. 

NOTE we may tweak the 500 number 

In [25]:
#get top 500 lanes
top_lanes = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .limit(500)
    .collect()
    .to_series()
)

#limit main lf to top lanes 
main_lf = main_lf.filter(pl.col('lane_id').is_in(top_lanes))

### Aggregations

Our analysis takes place at three different levels of group-by aggregation. 

#### Aggregation 1: lane and month

The highest aggregation level we utilize is by lane and month, enabling us to inspect broad lane-level characteristics of the market. 

In [44]:
#main aggregation
lane_month_lf = (
    main_lf
    .with_columns(
        #number of foreign ports accessible from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'origin_port_code')),
        #number of foreign countries accessible from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'origin_port_code')),
        #number of foreign regions accessible from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'origin_port_code'))
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        teus = pl.col('teus').sum(),
        num_vessels = pl.col('vessel_id').unique().count(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports and countries are unique to each lane and month
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    #final grouping
    .group_by('lane_id', 'month')
    .agg(
        hhi_lane = (pl.col('ms_carrier')**2).sum(),
        teus = pl.col('teus').sum(),
        num_vessels = pl.col('num_vessels').unique().count(),
        num_carriers = pl.col('scac').unique().count(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
    )

)

#aggregate vessel capacites
cap_lf = (
    main_lf
    .group_by('lane_id', 'month', 'vessel_id')
    .agg(pl.col('vessel_capacity').first())
    .group_by('lane_id', 'month')
    .agg(pl.col('vessel_capacity').sum().alias('lane_capacity'))
)

#merge to lane-month lf
lane_month_lf = (
    lane_month_lf
    .join(
        cap_lf,
        on=['lane_id', 'month'],
        how='left'
    )
)

display(lane_month_lf.limit(5).collect())
lane_month_lf.describe()

lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity
cat,str,f64,f64,u32,u32,u32,u32,u32,f64
"""2704_57018""","""201610""",0.971687,835.5,1,2,48,18,5,18804.485294
"""4601_56033""","""202102""",0.311415,1158.02,5,9,62,43,11,131170.441176
"""5301_24128""","""202205""",0.380257,258.626315,4,7,41,29,11,19864.411765
"""2709_55751""","""201607""",0.231632,1228.98,6,12,40,15,4,97187.941176
"""2704_58301""","""202212""",0.329868,885.729473,2,8,45,17,5,34567.573529


statistic,lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""97065""","""97065""",97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0
"""null_count""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,0.449774,1475.183443,3.662752,6.726462,36.134188,22.356545,7.074764,34278.345039
"""std""",,,0.280078,2409.527005,1.908144,4.168122,16.958156,11.587789,3.524847,30565.061862
"""min""",,"""200701""",0.063806,0.3,1.0,1.0,1.0,1.0,1.0,0.0
"""25%""",,,0.232586,367.076307,2.0,3.0,27.0,15.0,4.0,9841.544118
"""50%""",,,0.352413,816.26,4.0,6.0,39.0,20.0,7.0,26607.794118
"""75%""",,,0.598667,1694.682485,5.0,9.0,47.0,30.0,11.0,50782.867647
"""max""",,"""202312""",1.0,48794.368538,14.0,23.0,65.0,44.0,11.0,203187.573529


In [40]:
px.histogram(lane_month_lf.collect(), x='hhi_lane')


#### Aggregation 3: lane_id, month, scac, and hs_code

##### HS Codes

A key aspect of our study is to compare the impacts of OCAs on different types of cargo; specifically on agricultural and non-agricultural cargoes. In order to aggregate data across the individual 2-digit HS codes, we must address the fact that this data lies in strings of 4–8 character codes separated by a space. Furthermore, since the PIERS data simply list the HS codes contained within each BOL and do not indicate how much each code contributes to the total volume, we assign equal volumes to each hs_code associated with the BOL. These issues are addressed in two steps:
1. splitting the strings into an interable list and then exploding the table by hs_code to long form where each row corresponds to a unique pair of bol_id and hs_code. 
2. dropping all but the first two digits of the hs_code (note the first two digits of hs_codes describe the highest categorization level)

Note that this results in a somewhat untidy table; if we needed to analyze unique hs_code-bol_id rows, we would then need to re-aggregate over bol_id and hs_code. However, since we are aggregating to land_id, month, scac, and hs_code anyway, we can skip this step. 

In [29]:
def explode_hscodes(lf, hs_digits=2):
    '''
    Expands the dataset into long form where each row corresponds to only a single commodity (HS Code) within each BOL. 
        Assumes that the raw data are effectively grouped by BOL, resulting in hs_code column sometimes containing more than one code;
        this function undoes that group aggregation. 
    INPUTS
        lf - a polars lazyframe containing the relevant data
        hs_digits - int - the number of HS codes to keep
    OUTPUTS
        lf - a polars lazyframe containing the long form data
    '''
    #instantiate index
    index = pl.arange(0, lf.select(pl.len()).collect().item(), eager=True)
    #explode hs_codes 
    lf = (
        lf
        #set nulls to 0 to prevent data loss in later steps
        .with_columns(pl.col('hs_code').replace(None, 'missing'))
        .with_columns(
            #split hs codes into lists
            pl.col('hs_code').str.split(by=' '),
            #create pseudo-index col
            pl.Series(name='index', values=index)
            )
        #explode hs_codes
        .explode('hs_code')
        #drop rows with empty strings (these result from more than one space between hs_codes in the raw data)
        .with_columns(pl.col('hs_code').replace('', None))
        .drop_nulls('hs_code')
        #distribute total volumes evenly across HS Codes -- NOTE this step may change as better metadata is collected
        .with_columns(
            (pl.col('teus')/(pl.col('index').count().over('index'))),
            #replace originally-null values
            pl.col('hs_code').replace('missing', None)
        )
        .with_columns(
            #drop all but first two hs_code digits
            pl.col('hs_code').str.slice(0,hs_digits)
            #cast to list of integers, replacing non-integers with null
            .cast(pl.List(pl.Int32), strict=False),
            )
        #drop index
        .drop('index')
    )
    return lf 

In [5]:
#explode hs_codes and simplify to 2-digit
long_lf = explode_hscodes(main_lf)

Now that the hs_code column is clean, we perform the actual aggregation 

In [70]:
lane_month_scac_hs_lf = (
    long_lf
    #prep vessel cap for agg by hs_code (note vessel capacites were copied in the explode step)
    .with_columns(
        #sum over window to be used in group_by step + vessel_id
        pl.col('vessel_capacity').sum()
        .over('lane_id', 'month', 'scac', 'hs_code', 'vessel_id')
        .alias('cap')
    )
    .group_by('lane_id', 'month', 'scac', 'hs_code')
    .agg(
        #sum of teus
        pl.col('teus').sum(),
        #number of vessels in service
        pl.col('vessel_id').unique().count().alias('num_unique_vessels'),
        #total vessel capacity
        pl.col('vessel_capacity').sum().alias('total_capacity')
    )
)

In [71]:
agg1_lf.describe()

statistic,lane_id,month,scac,hs_code,teus,num_vessels,total_capacity
str,str,str,str,f64,f64,f64,f64
"""count""","""13000114""","""13000114""","""13000114""",12999859.0,13000114.0,13000114.0,12331310.0
"""null_count""","""0""","""0""","""0""",255.0,0.0,0.0,668804.0
"""mean""",,,,,15.790681,1.738639,4079.752038
"""std""",,,,,75.354151,1.16678,3921.141514
"""min""",,"""200701""",,,0.000968,1.0,0.0
"""25%""",,,,,2.0,1.0,1754.485294
"""50%""",,,,,4.0,1.0,2945.0
"""75%""",,,,,10.132631,2.0,5038.235294
"""max""",,"""202312""",,,19674.521504,24.0,75546.08795
