# Ocean Carrier Alliances - Data Analysis for USDA Report

This notebook analyses data on maritime container shipping to inspect the impacts of strategic alliances between ocean freight carriers (Ocean Carrier Alliances, OCAs) on the containerized agricultural export market in the US. Data are pre-processed in the oca_data_prep notebook; see the [project repo](https://github.com/epistemetrica/Ocean-Carrier-Alliances-Project) for full details.

Section 1 presents data summaries and aggregations, Section 2 gives general context for the maritime freight economy and vessel sharing behavior, and Section 3 presents the analytical models and results.

In [2]:
#preliminaries
import numpy as np
import pandas as pd
import polars as pl
import plotly.express as px
import statsmodels.api as sm

#enable string cache for polars categoricals
pl.enable_string_cache()
#display settings
pd.set_option('display.max_columns', None)

In [3]:
#load data
main_lf = (
    #load parquet files to polars lazyframe 
    pl.scan_parquet('../data/main/*.parquet')
    #choose only exports
    .filter(pl.col('direction')=='export')
    #drop unneeded columns
    .drop('origin_territory', 'origin_region', 'direction', 'carrier_name', 
          'carrier_scac', 'vessel_name', 'voyage_number')
    #rename cols for consistency/sanity
    .rename({'unified_carrier_name':'carrier_name',
             'unified_carrier_scac':'scac',
             'arrival_port_name':'dest_port_name',
             'arrival_port_code':'dest_port_code',
             'departure_port_name':'origin_port_name',
             'departure_port_code':'origin_port_code',
             'pc_alliance':'owner_alliance'})
    #reorder columns for sanity
    .select('bol_id', 'date', 'month', 'year', 'teus', 'hs_code', 'lane_id', 
            'lane_name', 'origin_port_name', 'origin_port_code', 'coast_region', 
            'dest_port_name', 'dest_port_code', 'dest_territory', 'dest_region',  
            'drewery_lane', 'dist', 'rate_40', 'rate_20', 'scac', 'carrier_name', 
            'alliance', 'alliance_member', 'vessel_id', 'vessel_capacity', 
            'vessel_owner', 'owner_alliance', 'primary_cargo', 'shared_teus', 
            'cargo_source')
    #set 0 capacities to missing
    .with_columns(
        pl.col('vessel_capacity').replace(0, None)
    )
)

## 1. Data Summary and Aggregations

The main dataset holds a unique bill of lading (BOL) on each row and includes data from various sources in the columns described below.  

### Column Definitions

- bol_id - unique identifier of each BOL
- date, month, year - departure date of the export
- teus - volume of the shipment in twenty-foot-equivalent (TEU) units
- hs_code - list of commodity codes included in the shipment
- lane_id - unique identifier of the lane (a combination of origin and destination port codes)
- lane_name - name of the lane
- origin_port_name - name of the US port of origin
- origin_port_code - CBP code for the port of origin
- coast_region - US coastal region of origin
- dest_port_name - name of the foreign destination port
- dest_port_code - CBP code for the foreign destination port
- dest_territory - name of country or territory of destination
- dest_region - global region of destination
- drewery_lane - Drewry lane matched to the lane_id using summed haversine distances between the origin/destination port pairs
- dist - summed haversine distance between the origin/destination port pairs
- rate_40 - Drewry CFRI rate index for 40-ft equivalent containers when the cargo was carried
- rate_20 - Drewry CFRI rate index for 20-ft equivalent containers (TEU) when the cargo was carried
- scac - Standard Carrier Alpha Code (SCAC) for the firm who carried the cargo
- carrier_name - name of the carrier
- alliance - the alliance to which the carrier belonged when the cargo was sold (NOTE this needs to be updated when we input regional alliance data)
- alliance_member - boolean for whether or not the carrier was a member of any alliance when that cargo was carried
- vessel_id - International Maritime Organization (IMO) code uniquely identifying the vessel
- vessel_capacity - metric of how many TEU can be carried by the vessel at any given time (computed from Net Register Tonnage)
- vessel_owner - the carrier representing the most TEUs carried by that vessel during the month on which the BOL was carried
- owner_alliance - the OCA to which the vessel_owner belonged at the time
- primary_cargo - boolean indicating whether the cargo came from the vessel_owner or not
- shared_teus - number of TEUs if the cargo did not come from the vessel_owner (identical to 'teus' when primary_cargo is 0). 
- cargo_source - indicator for whether the cargo came from a member of the vessel_owner's alliance or not. 


### Summary stats and head

In [3]:
desc = main_lf.describe()
desc.write_csv('../data/misc/usda_maindata_summarystat.csv')
display(desc)
main_lf.limit(5).collect()

statistic,bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
str,str,str,str,f64,f64,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,f64,f64,f64,str,str,f64,f64,str
"""count""","""63737316""","""63737242""","""63737317""",63737317.0,63737317.0,"""63735980""","""63737317""","""63737317""","""63737317""","""63737317""","""63737173""","""63737317""","""63737317""","""63734881""","""63734881""","""63734875""",63734875.0,35720688.0,35720688.0,"""63737317""","""63696714""","""63737317""",63737317.0,63737317.0,57041483.0,"""63737317""","""63737317""",63737317.0,63737317.0,"""63737317"""
"""null_count""","""1""","""75""","""0""",0.0,0.0,"""1337""","""0""","""0""","""0""","""0""","""144""","""0""","""0""","""2436""","""2436""","""2442""",2442.0,28016629.0,28016629.0,"""0""","""40603""","""0""",0.0,0.0,6695834.0,"""0""","""0""",0.0,0.0,"""0"""
"""mean""",,"""2015-12-20 00:00:26.571000""",,2015.465141,3.220729,,,,,,,,,,,,1436.760648,1671.20566,1279.975398,,,,0.301567,9231900.0,2243.675778,,,0.702031,1.040772,
"""std""",,,,4.741281,5.982663,,,,,,,,,,,,1194.145132,793.815448,541.86681,,,,,474503.072275,1465.419488,,,,3.692474,
"""min""","""079A_26004878070""","""2007-01-01""","""200701""",2007.0,0.01,"""-1""",,,,,,,,,,,5.93014,400.0,310.0,,,"""2M Alliance""",0.0,196.0,0.073529,,"""2M Alliance""",0.0,0.0,"""ally"""
"""25%""",,"""2012-02-27""",,2012.0,2.0,,,,,,,,,,,,425.931559,1100.0,880.0,,,,,9218686.0,1027.647059,,,,0.0,
"""50%""",,"""2016-02-20""",,2016.0,2.533158,,,,,,,,,,,,1261.373246,1590.0,1230.0,,,,,9315202.0,2156.544118,,,,0.0,
"""75%""",,"""2019-12-13""",,2019.0,2.533158,,,,,,,,,,,,2377.521062,2020.0,1530.0,,,,,9430868.0,3295.441176,,,,2.0,
"""max""","""zzzz_ZZZZ""","""2023-12-31""","""202312""",2023.0,3729.25,"""ddedo""",,,,,,,,,,,8769.910108,15900.0,12070.0,,,"""The Alliance""",1.0,9979125.0,17889.705882,,"""The Alliance""",1.0,1123.25,"""non-ally"""


bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
str,date,str,i32,f64,str,cat,cat,cat,cat,cat,cat,cat,cat,cat,cat,f64,f64,f64,cat,cat,str,bool,i32,f64,cat,str,bool,f64,str
"""ZIML_IMUORF177673""",2007-01-01,"""200701""",2007,2.533158,"""390120""","""5301_24128""","""Houston — Kingston""","""HOUSTON""","""5301""","""GULF""","""KINGSTON""","""24128""","""JAMAICA""","""CARIBBEAN""","""US Gulf Coast (Houston) to Col…",860.905655,,,"""ZIMU""","""ZIM CONTAINER""","""Non-alliance Carriers""",False,9008562,,"""ZIMU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_800184208""",2007-01-01,"""200701""",2007,2.533158,"""901890""","""5301_47563""","""Houston — Cagliari""","""HOUSTON""","""5301""","""GULF""","""CAGLIARI""","""47563""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",585.05394,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_511755218""",2007-01-01,"""200701""",2007,2.533158,"""080290""","""5301_47563""","""Houston — Cagliari""","""HOUSTON""","""5301""","""GULF""","""CAGLIARI""","""47563""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",585.05394,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_851964547""",2007-01-01,"""200701""",2007,2.533158,"""732690""","""5301_47536""","""Houston — Gioia Tauro""","""HOUSTON""","""5301""","""GULF""","""GIOIA TAURO""","""47536""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",890.01468,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_HOUM30858""",2007-01-01,"""200701""",2007,2.533158,"""401199""","""5301_47536""","""Houston — Gioia Tauro""","""HOUSTON""","""5301""","""GULF""","""GIOIA TAURO""","""47536""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",890.01468,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""


INSERT VOLUMES BY PORT SIZE VISUAL 

### Top 500 lanes

Containerized freight volumes are highly concentrated on common shipping lanes (e.g., LA to Tokyo). While many ports around the world have small container terminals and thus appear in the PIERS database, these ports are visited much less frequently and account for tiny percentages of the volume on each ship. For these reasons, we suspect the impacts of alliance behavior may be different for small container ports; we drop all but the top 500 lanes in order to focus on the majority of volumes.

NOTE we may tweak the 500 number 
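
One way to judge the cutoff is to check what share of total volume the top N lanes cover. The sketch below is a library-free illustration on made-up lane volumes. The lane ids, the volumes, and the helper names `coverage_by_top_n` and `smallest_n_for_coverage` are hypothetical and do not appear elsewhere in this notebook.

```python
def coverage_by_top_n(lane_teus, n):
    """Share of total TEUs carried on the n highest-volume lanes."""
    ranked = sorted(lane_teus.values(), reverse=True)
    return sum(ranked[:n]) / sum(ranked)

def smallest_n_for_coverage(lane_teus, target):
    """Smallest number of top lanes whose combined share reaches `target`."""
    ranked = sorted(lane_teus.values(), reverse=True)
    total = sum(ranked)
    running = 0.0
    for n, teus in enumerate(ranked, start=1):
        running += teus
        if running / total >= target:
            return n
    return len(ranked)

# hypothetical lane volumes in TEUs (not taken from the PIERS data)
toy_lane_teus = {"A_1": 500.0, "A_2": 300.0, "B_1": 150.0, "B_2": 40.0, "C_1": 10.0}

for n in (1, 2, 3):
    print(n, round(coverage_by_top_n(toy_lane_teus, n), 3))
print(smallest_n_for_coverage(toy_lane_teus, 0.9))
```

Applied to the real per-lane totals, the same calculation would show how much volume is retained at 500 lanes versus other candidate cutoffs.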

In [4]:
#get top 500 lanes
top_lanes = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .limit(500)
    .collect()
    .to_series()
)

#limit main lf to top lanes 
main_lf = main_lf.filter(pl.col('lane_id').is_in(top_lanes))

In [10]:
df = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .with_row_index()
    .collect()
)
px.bar(df, x='index', y='teus', title='TEUs per lane')

### Aggregations

Our analysis takes place at three different levels of group-by aggregation, each described below. Before aggregating, however, we must address two remaining data cleaning tasks: missing vessel capacity information and hs_code formats.

#### Vessel Capacities

Capacity data are rarely missing within the period for which they are published (2013–2022) but become progressively sparser the further we move outside that period, since not all vessels operating in 2008, for example, were still operating by the time capacity data became available. If left unaddressed, this would cause capacities to be underestimated outside the 2013–2022 window. Where possible, we fill missing capacities with the average of known capacities within a given lane, month, and carrier, on the tacit assumption that within such a short time frame missing capacities are unlikely to differ systematically from known capacities on the same lane and carrier. Where no such capacities are known, we fill nulls with the mean over lane and month, and as a last resort with the mean over the lane.
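
The fallback order described above can be illustrated on a few toy records. This is a plain-Python sketch of the logic only, not the polars code used in this notebook. The rows, carrier codes, and the helper name `impute_capacity` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_capacity(rows):
    """Fill missing 'vessel_capacity' using progressively coarser group means."""
    # grouping keys from finest to coarsest, mirroring the fallback order in the text
    levels = [("lane_id", "month", "scac"), ("lane_id", "month"), ("lane_id",)]
    filled = [dict(r) for r in rows]
    for keys in levels:
        # group means use every capacity known at the start of this pass,
        # including values filled in by an earlier (finer) pass
        groups = defaultdict(list)
        for r in filled:
            if r["vessel_capacity"] is not None:
                groups[tuple(r[k] for k in keys)].append(r["vessel_capacity"])
        for r in filled:
            if r["vessel_capacity"] is None:
                known = groups.get(tuple(r[k] for k in keys))
                if known:
                    r["vessel_capacity"] = mean(known)
    return filled

# hypothetical records: one lane, three months, three carriers
toy_rows = [
    {"lane_id": "L1", "month": "201001", "scac": "AAAA", "vessel_capacity": 1000.0},
    {"lane_id": "L1", "month": "201001", "scac": "AAAA", "vessel_capacity": 3000.0},
    {"lane_id": "L1", "month": "201001", "scac": "AAAA", "vessel_capacity": None},
    {"lane_id": "L1", "month": "201001", "scac": "BBBB", "vessel_capacity": None},
    {"lane_id": "L1", "month": "201001", "scac": "CCCC", "vessel_capacity": 4000.0},
    {"lane_id": "L1", "month": "201002", "scac": "BBBB", "vessel_capacity": None},
    {"lane_id": "L1", "month": "201003", "scac": "AAAA", "vessel_capacity": 5500.0},
]

filled = impute_capacity(toy_rows)
for r in filled:
    print(r["month"], r["scac"], r["vessel_capacity"])
```

In this toy case the third record is filled at the lane-month-carrier level (2000.0), the fourth at the lane-month level (2500.0), and the sixth only at the lane level (3000.0).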


In [5]:
#fill missing vessel capacities (carrier-level mean first, then coarser groups)
main_lf = (
    main_lf
    #impute missing vessel capacities
    .with_columns(
        #fill with mean over lane, month, and carrier where available
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id', 'month', 'scac'))
        .otherwise(pl.col('vessel_capacity'))
    )
    .with_columns(
        #fill remaining nulls with mean over lane and month where available
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id', 'month'))
        .otherwise(pl.col('vessel_capacity'))
    )
    .with_columns(
        #fill remaining nulls with mean over lane
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id'))
        .otherwise(pl.col('vessel_capacity'))
    )
)

#### HS Codes

A key aspect of our study is to compare the impacts of OCAs on different types of cargo, specifically agricultural and non-agricultural cargoes. In order to aggregate data by 2-digit HS code, we must address the fact that each BOL stores its codes as a single string of 4–8 character codes separated by spaces. Furthermore, since the PIERS data simply list the HS codes contained within each BOL and do not indicate how much each code contributes to the total volume, we assign equal volumes to each hs_code associated with the BOL. These issues are addressed in two steps:
1. splitting the strings into an iterable list and then exploding the table by hs_code to long form, where each row corresponds to a unique pair of bol_id and hs_code.
2. dropping all but the first two digits of the hs_code (the first two digits describe the highest categorization level).

Note that this results in a somewhat untidy table; if we needed to analyze unique hs_code-bol_id rows, we would then need to re-aggregate over bol_id and hs_code. However, since we are aggregating to lane_id, month, scac, and hs_code anyway, we can skip this step. 
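
As a small worked illustration of the two steps, the sketch below splits, explodes, and truncates HS codes for three toy BOLs in plain Python. It is a simplified stand-in for the polars function used in this notebook. The BOL ids and volumes are hypothetical, and unlike the notebook it keeps the 2-digit codes as zero-padded strings rather than casting them to integers.

```python
def explode_hs(bols, hs_digits=2):
    """Turn one record per BOL into one record per (BOL, HS code) with equal TEU shares."""
    long_rows = []
    for bol in bols:
        # str.split() with no argument also absorbs repeated spaces between codes
        codes = bol["hs_code"].split() if bol["hs_code"] else []
        if not codes:
            # keep BOLs without any code so their volume is not lost
            long_rows.append({**bol, "hs_code": None})
            continue
        share = bol["teus"] / len(codes)  # equal split across the listed codes
        for code in codes:
            long_rows.append({**bol, "hs_code": code[:hs_digits], "teus": share})
    return long_rows

# hypothetical BOLs; the first lists two codes separated by a space
toy_bols = [
    {"bol_id": "X1", "teus": 4.0, "hs_code": "420292 3901"},
    {"bol_id": "X2", "teus": 2.0, "hs_code": "080290"},
    {"bol_id": "X3", "teus": 1.0, "hs_code": None},
]

long_rows = explode_hs(toy_bols)
for r in long_rows:
    print(r["bol_id"], r["hs_code"], r["teus"])
```

The first BOL becomes two rows of 2.0 TEUs each (chapters 42 and 39), so total volume is unchanged by the reshaping.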

In [11]:
def explode_hscodes(lf, hs_digits=2):
    '''
    Expands the dataset into long form where each row corresponds to only a single commodity (HS Code) within each BOL. 
        Assumes that the raw data are effectively grouped by BOL, resulting in hs_code column sometimes containing more than one code;
        this function undoes that group aggregation. 
    INPUTS
        lf - a polars lazyframe containing the relevant data
        hs_digits - int - the number of leading HS code digits to keep
    OUTPUTS
        lf - a polars lazyframe containing the long form data
    '''
    #explode hs_codes 
    lf = (
        lf
        #create pseudo-index col identifying the original BOL row (stays lazy, no collect needed)
        .with_row_index('index')
        .with_columns(
            #set nulls to a placeholder to prevent data loss in later steps, then split hs codes into lists
            pl.col('hs_code').fill_null('missing').str.split(by=' ')
            )
        #explode hs_codes
        .explode('hs_code')
        #drop rows with empty strings (these result from more than one space between hs_codes in the raw data)
        .with_columns(pl.col('hs_code').replace('', None))
        .drop_nulls('hs_code')
        #distribute total volumes evenly across HS Codes -- NOTE this step may change as better metadata is collected
        .with_columns(
            (pl.col('teus')/(pl.col('index').count().over('index'))),
            #replace originally-null values
            pl.col('hs_code').replace('missing', None)
        )
        .with_columns(
            #drop all but first hs_digits digits of each hs_code
            pl.col('hs_code').str.slice(0,hs_digits)
            #cast to list of integers, replacing non-integers with null
            .cast(pl.List(pl.Int32), strict=False),
            )
        #drop index
        .drop('index')
    )
    return lf 

In [7]:
#explode hs_codes and simplify to 2-digit
long_lf = explode_hscodes(main_lf)


#### Aggregation 1: Lane-Month

The highest aggregation level we utilize is by lane and month, enabling us to inspect broad lane-level characteristics of the market. 
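
Lane-level concentration is summarized below by the Herfindahl-Hirschman Index (`hhi_lane`), computed as the sum of squared carrier market shares on a lane in a given month. A minimal plain-Python illustration with hypothetical carrier volumes (the carrier codes and the helper name `hhi` are made up for this example):

```python
def hhi(carrier_teus):
    """Sum of squared market shares, on a 0-1 scale."""
    total = sum(carrier_teus.values())
    return sum((teus / total) ** 2 for teus in carrier_teus.values())

# hypothetical TEUs carried by three carriers on one lane in one month
toy_lane = {"AAAA": 50.0, "BBBB": 30.0, "CCCC": 20.0}

print(round(hhi(toy_lane), 4))          # shares 0.5, 0.3, 0.2
print(hhi({"AAAA": 10.0}))              # a single carrier
print(round(hhi({"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}), 4))  # four equal carriers
```

On this scale the index equals 1 when a single carrier serves the lane and 1/n when n carriers split the volume equally, so lower values indicate less concentration.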

In [8]:
#main aggregation
lane_month_lf = (
    main_lf
    .with_columns(
        #number of foreign ports accessible from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'origin_port_code')),
        #number of foreign countries accessible from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'origin_port_code')),
        #number of foreign regions accessible from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'origin_port_code'))
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        teus = pl.col('teus').sum(),
        num_vessels = pl.col('vessel_id').unique().count(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports and countries are unique to each lane and month
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    #final grouping
    .group_by('lane_id', 'month')
    .agg(
        hhi_lane = (pl.col('ms_carrier')**2).sum(),
        teus = pl.col('teus').sum(),
        #NOTE counts distinct values of the per-carrier vessel counts, not distinct vessels on the lane
        num_vessels = pl.col('num_vessels').unique().count(),
        num_carriers = pl.col('scac').unique().count(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
    )

)

#aggregate vessel capacities
lane_cap_lf = (
    main_lf
    #aggregate
    .group_by('lane_id', 'month', 'vessel_id')
    .agg(pl.col('vessel_capacity').first())
    .group_by('lane_id', 'month')
    .agg(pl.col('vessel_capacity').sum().alias('lane_capacity'))
)

#merge vessel capacities into lane-month lf
lane_month_lf = (
    lane_month_lf
    .join(
        lane_cap_lf,
        on=['lane_id', 'month'],
        how='left'
    )
)

display(lane_month_lf.limit(5).collect())
lane_month_lf.describe()

lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity
cat,str,f64,f64,u32,u32,u32,u32,u32,f64
"""5301_55751""","""200712""",1.0,5.066315,1,1,40,30,11,2500.808824
"""1401_54201""","""202104""",0.362107,1541.553158,4,5,39,25,9,44131.544118
"""1703_57047""","""201205""",0.126272,630.756261,4,12,48,32,11,84403.088235
"""2811_55976""","""202302""",0.190533,777.626315,5,10,32,18,7,73908.704482
"""2709_58201""","""202212""",0.229774,3373.772104,4,10,37,15,4,59888.75


statistic,lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""97065""","""97065""",97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0
"""null_count""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,0.449774,1475.183443,3.662752,6.726462,36.134188,22.356545,7.074764,36681.25066
"""std""",,,0.280078,2409.527005,1.908144,4.168122,16.958156,11.587789,3.524847,32113.045777
"""min""",,"""200701""",0.063806,0.3,1.0,1.0,1.0,1.0,1.0,7.205882
"""25%""",,,0.232586,367.076307,2.0,3.0,27.0,15.0,4.0,11576.224893
"""50%""",,,0.352413,816.26,4.0,6.0,39.0,20.0,7.0,28666.54937
"""75%""",,,0.598667,1694.682485,5.0,9.0,47.0,30.0,11.0,53611.470588
"""max""",,"""202312""",1.0,48794.368538,14.0,23.0,65.0,44.0,11.0,219337.223615


#### Aggregation 2: Lane-Month-Carrier

The next level of aggregation is over carriers in addition to lane and month, allowing us to analyze the ports serviced, total volumes, capacities, etc. of specific carriers within each lane.
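
At this level, a carrier's capacity on a lane counts only vessels for which that carrier is the vessel_owner, and each vessel is counted once per lane and month regardless of how many BOLs it carried. The plain-Python sketch below illustrates that rule on hypothetical records. The rows and the helper name `carrier_lane_capacity` are made up, and the notebook itself does this with polars.

```python
def carrier_lane_capacity(rows):
    """Sum capacity over distinct own vessels per (lane, month, carrier)."""
    vessels = {}
    for r in rows:
        if r["vessel_owner"] != r["scac"]:
            continue  # cargo riding on another carrier's vessel adds no own capacity
        key = (r["lane_id"], r["month"], r["scac"])
        # keep the first capacity seen for each vessel so repeated BOLs do not double count
        vessels.setdefault(key, {}).setdefault(r["vessel_id"], r["vessel_capacity"])
    return {key: sum(caps.values()) for key, caps in vessels.items()}

# hypothetical BOL-level records on one lane and month
toy_rows = [
    {"lane_id": "L1", "month": "201001", "scac": "AAAA", "vessel_owner": "AAAA", "vessel_id": 1, "vessel_capacity": 1000.0},
    {"lane_id": "L1", "month": "201001", "scac": "AAAA", "vessel_owner": "AAAA", "vessel_id": 1, "vessel_capacity": 1000.0},
    {"lane_id": "L1", "month": "201001", "scac": "AAAA", "vessel_owner": "AAAA", "vessel_id": 2, "vessel_capacity": 2500.0},
    {"lane_id": "L1", "month": "201001", "scac": "BBBB", "vessel_owner": "AAAA", "vessel_id": 1, "vessel_capacity": 1000.0},
]

caps = carrier_lane_capacity(toy_rows)
print(caps)
# carriers that only ship on others' vessels have no entry; treat them as zero capacity
print(caps.get(("L1", "201001", "BBBB"), 0.0))
```

This is why a carrier can show positive TEUs on a lane with a carrier_lane_capacity of 0.0: all of its cargo in that month moved on vessels attributed to other carriers.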

In [10]:
#main aggregation
lane_month_carrier_lf = (
    main_lf
    .with_columns(
        #number of foreign ports serviced by carrier from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'scac', 'origin_port_code')),
        #number of foreign countries serviced by carrier from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'scac', 'origin_port_code')),
        #number of foreign regions serviced by carrier from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'scac', 'origin_port_code'))
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac')
    .agg(
        teus = pl.col('teus').sum(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports, countries, and regions are constant within each origin port, month, and carrier
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    .with_columns(
        #number of carriers active on the lane
        num_carriers = pl.col('scac').unique().count().over('lane_id', 'month'),
        #hhi
        hhi_lane = (pl.col('ms_carrier')**2).sum().over('lane_id', 'month')
    )
)

#aggregate vessels and capacities
lane_carrier_cap_lf = (
    main_lf
    #null out vessels and capacities not owned by the given carrier
    .with_columns(
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_capacity'))
        .otherwise(pl.lit(None))
        .alias('vessel_capacity'),
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_id'))
        .otherwise(pl.lit(None))
        .alias('vessels')
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac', 'vessels')
    .agg(
        pl.col('vessel_capacity').drop_nulls().first()
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        pl.col('vessel_capacity').sum().alias('carrier_lane_capacity'),
        pl.col('vessels').unique().count().alias('num_vessels')
    )
)

#merge vessel capacities into lane-month-carrier lf
lane_month_carrier_lf = (
    lane_month_carrier_lf
    .join(
        lane_carrier_cap_lf,
        on=['lane_id', 'month', 'scac'],
        how='left'
    )
)

display(lane_month_carrier_lf.limit(5).collect())
lane_month_carrier_lf.describe()

lane_id,month,scac,teus,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels
cat,str,cat,f64,u32,u32,u32,f64,u32,f64,f64,u32
"""1401_54930""","""202001""","""CMDU""",177.599473,33,22,7,0.121214,11,0.145162,17013.161765,5
"""2704_58309""","""202308""","""ONEY""",2337.175788,36,15,5,0.187199,9,0.218527,27409.373131,8
"""2709_57000""","""200710""","""HLCU""",12.665788,22,10,3,0.025907,11,0.211415,0.0,1
"""1703_51721""","""200703""","""HLCU""",357.175232,25,21,8,0.391667,9,0.281898,12323.897059,5
"""2709_56033""","""201007""","""YMLU""",12.665788,21,11,2,0.011682,11,0.245862,0.0,1


statistic,lane_id,month,scac,teus,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels
str,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""652904""","""652904""","""652904""",652904.0,652904.0,652904.0,652904.0,652904.0,652904.0,652904.0,652904.0,652904.0
"""null_count""","""0""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,,219.310467,20.84655,13.589437,4.79701,0.148667,9.309255,0.315444,4074.603547,2.193316
"""std""",,,,601.910542,12.779562,8.781094,2.947881,0.211577,4.19194,0.197261,6172.821456,1.699992
"""min""",,"""200701""",,0.03,1.0,1.0,1.0,3.7e-05,1.0,0.063806,0.0,1.0
"""25%""",,,,15.198946,11.0,7.0,2.0,0.018592,6.0,0.180203,0.0,1.0
"""50%""",,,,61.0,20.0,12.0,4.0,0.064114,9.0,0.258551,884.411765,1.0
"""75%""",,,,201.599473,30.0,19.0,7.0,0.180945,12.0,0.380618,6454.632353,3.0
"""max""",,"""202312""",,25915.51,59.0,42.0,11.0,1.0,23.0,1.0,85391.094217,17.0


#### Aggregation 3: Lane-Month-Carrier-Commodity

Aggregating by lane, month, carrier, and commodity (hs code) allows us to inspect the impacts of OCAs for different commodities. 
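
Once volumes are available by HS code, they can be rolled up into agricultural and non-agricultural totals. The sketch below shows one possible way to do this in plain Python. It ASSUMES that HS chapters 01 through 24 are treated as agricultural, which is a common simplification and not necessarily the definition used in this project. `AG_CHAPTERS`, `cargo_type`, `teus_by_cargo_type`, and the toy rows are all hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: chapters 1-24 count as agricultural; adjust to the project's own definition
AG_CHAPTERS = frozenset(range(1, 25))

def cargo_type(hs_code):
    """Classify an HS code string by its 2-digit chapter; None if it cannot be parsed."""
    if not hs_code or not hs_code[:2].isdigit():
        return None
    return "ag" if int(hs_code[:2]) in AG_CHAPTERS else "non-ag"

def teus_by_cargo_type(rows):
    """Total TEUs per cargo type, keeping unparseable codes under the key None."""
    totals = defaultdict(float)
    for r in rows:
        totals[cargo_type(r["hs_code"])] += r["teus"]
    return dict(totals)

# hypothetical rows in long form (one HS code per row)
toy_rows = [
    {"hs_code": "080290", "teus": 2.5},
    {"hs_code": "390120", "teus": 2.5},
    {"hs_code": "1201", "teus": 4.0},
    {"hs_code": "-1", "teus": 1.0},
]

print(teus_by_cargo_type(toy_rows))
```

Keeping unparseable codes in their own bucket, rather than dropping them, makes it easy to see how much volume falls outside the classification.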

In [12]:
#main aggregation
lane_month_carrier_hs_lf = (
    main_lf
    .with_columns(
        #number of foreign ports serviced by carrier from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'scac', 'origin_port_code')),
        #number of foreign countries serviced by carrier from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'scac', 'origin_port_code')),
        #number of foreign regions serviced by carrier from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'scac', 'origin_port_code'))
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac', 'hs_code')
    .agg(
        teus = pl.col('teus').sum(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports, countries, and regions are constant within each origin port, month, and carrier
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    .with_columns(
        #number of carriers active on the lane
        num_carriers = pl.col('scac').unique().count().over('lane_id', 'month'),
        #hhi
        hhi_lane = (pl.col('ms_carrier')**2).sum().over('lane_id', 'month'),
        #destination data
        num_foreign_ports = pl.col('num_foreign_ports').sum()
        .over('lane_id', 'month', 'scac'),
        num_foreign_countries = pl.col('num_foreign_countries').sum()
        .over('lane_id', 'month', 'scac'),
        num_foreign_regions = pl.col('num_foreign_regions').sum()
        .over('lane_id', 'month', 'scac'),
    )
    #drop missing hs codes
    .drop_nulls(subset='hs_code')
)

#aggregate vessels and capacities
lane_carrier_cap_lf = (
    main_lf
    #null out vessels and capacities not owned by the given carrier
    .with_columns(
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_capacity'))
        .otherwise(pl.lit(None))
        .alias('vessel_capacity'),
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_id'))
        .otherwise(pl.lit(None))
        .alias('vessels')
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac', 'vessels')
    .agg(
        pl.col('vessel_capacity').drop_nulls().first()
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        pl.col('vessel_capacity').sum().alias('carrier_lane_capacity'),
        pl.col('vessels').unique().count().alias('num_vessels')
    )
)

#merge vessel capacities into lane-month-carrier-commodity lf
lane_month_carrier_hs_lf = (
    lane_month_carrier_hs_lf
    .join(
        lane_carrier_cap_lf,
        on=['lane_id', 'month', 'scac'],
        how='left'
    )
)

display(lane_month_carrier_hs_lf.limit(5).collect())
lane_month_carrier_hs_lf.describe()

lane_id,month,scac,hs_code,teus,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels
cat,str,cat,str,f64,u32,u32,u32,f64,u32,f64,f64,u32
"""5201_24579""","""201612""","""AMLU""","""871190""",0.32,600,400,200,0.000368,2,0.140856,171.764706,2
"""1101_60267""","""201008""","""SUDU""","""940540""",2.533158,90,45,45,0.004292,3,0.028275,3257.867647,3
"""2709_58303""","""201705""","""OOLU""","""420292 3901""",4.0,100,36,12,0.007401,5,0.475007,16426.838235,4
"""5301_30107""","""201201""","""CNIU""","""290244""",2.533158,236,118,118,0.00085,13,0.00933,0.0,1
"""4601_51721""","""201304""","""NYKS""","""741999""",5.066315,576,456,168,0.002336,9,0.09541,1306.985294,2


statistic,lane_id,month,scac,hs_code,teus,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels
str,str,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""","""12563744""","""12563744""","""12563744""","""12563744""",12563744.0,12563744.0,12563744.0,12563744.0,12563744.0,12563744.0,12563744.0,12563744.0,12563744.0
"""null_count""","""0""","""0""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""mean""",,,,,11.396725,1424.889628,923.83063,289.68627,0.007726,8.69019,0.046556,5546.317133,3.093696
"""std""",,,,,62.039388,1954.62419,1226.50463,350.23617,0.028789,5.00509,0.062723,7258.622481,2.121315
"""min""",,"""200701""",,"""-1""",0.01,1.0,1.0,1.0,5e-06,1.0,0.003091,0.0,1.0
"""25%""",,,,,2.533158,264.0,185.0,72.0,0.000796,5.0,0.014845,0.0,1.0
"""50%""",,,,,2.533158,702.0,468.0,175.0,0.001971,9.0,0.026781,2173.161765,3.0
"""75%""",,,,,7.0,1804.0,1161.0,380.0,0.005155,12.0,0.052905,8837.867647,4.0
"""max""",,"""202312""",,"""ddedo""",14945.550355,20461.0,14350.0,6050.0,1.0,23.0,1.0,85391.094217,17.0


## 2. Visuals and Descriptive Analysis

In [13]:
df = (
    main_lf
    #group by alliance, scac, and year
    .group_by('alliance', 'scac', 'year')
    .agg(pl.col('teus').sum())
    #get proportion 
    .with_columns((pl.col('teus')/pl.col('teus').sum().over('year')).alias('prop_volume'))
    .collect()
)

#plot
fig = px.bar(
    df, x='year', y='prop_volume',
    color='alliance',
    text='scac', 
    barmode='stack',  
    title='Percent of Export Volumes Over Time',
    width=900,
    height=500,
    color_discrete_map={
        'Non-alliance Carriers':'#636EFA',
        'The Alliance':'#EF553B',
        'Ocean Alliance':'#00CC96',
        'CYKHE':'#ABC3FA',
        'New World All (NWA)':'#FFA15A',
        'Grand Alliance IV':'#19D3F3',
        'G6 Alliance':'#FF6692',
        'MSC/CMA CGM':'#B6E880',
        '2M Alliance':'#FECB52',
        'Ocean Three':'#FF97FF',
        'CYKH':'#17BECF',
    },
    labels={
        'alliance':'Alliance:',
        'prop_volume':'Volume (Percent of Total)',
        'year':'Time'
    },
    category_orders={
        'alliance':[
            'Non-alliance Carriers', 'The Alliance', 'CYKHE', 'Ocean Alliance',  
            'CYKH', 'Ocean Three', '2M Alliance', 'G6 Alliance', 'MSC/CMA CGM', 'New World All (NWA)', 'Grand Alliance IV'
        ]
    }
    )

fig.show()

NOTE we need to update masterAlliances.csv with new unified SCAC codes

In [16]:
df = (
    main_lf
    #get cargo source column
    .with_columns(
        pl.when(pl.col('primary_cargo')==1)
        .then(pl.lit('self'))
        .otherwise(
            pl.when(pl.col('alliance')==pl.col('owner_alliance'))
            .then(pl.lit('ally'))
            .otherwise(pl.lit('non-ally'))
        )
        .alias('cargo_source')
    )
    #group by source and month
    .group_by('cargo_source', 'month')
    .agg(
        pl.col('teus').sum()
    )
    .with_columns((pl.col('teus')/pl.col('teus').sum().over('month')).alias('prop_volume'))
    .cast({'cargo_source':pl.Utf8})
    .sort(by=['month'])
    .collect()
)

px.area(
    df.with_columns(pl.col('month').str.to_datetime('%Y%m')), x='month', y='prop_volume',
    color='cargo_source',
    color_discrete_map={
        'self':'#636EFA',
        'ally':'#EF553B',
        'non-ally':'#00CC96'
    },
    labels={
        'prop_volume':'Share of Total Volume',
        'cargo_source':'Cargo Source',
        'month':'Time',
        'self':'Self',
        'ally':'Allies',
        'non-ally':'Others'
    },
    category_orders={
        'cargo_source':['self', 'ally', 'non-ally']
    },
    width=900,
    height=500,
    title='Cargo Source Over Time'
)