# Ocean Carrier Alliances - Data Analysis for USDA Report

This notebook analyses data on maritime container shipping to inspect the impacts of strategic alliances between ocean freight carriers (Ocean Carrier Alliances, OCAs) on the containerized agricultural export market in the US. Data are pre-processed in the oca_data_prep notebook; see the [project repo](https://github.com/epistemetrica/Ocean-Carrier-Alliances-Project) for full details.

Section 1 presents data summaries and aggregations, Section 2 gives general context for the maritime freight economy and vessel sharing behavior, and Section 3 presents the analytical models and results.

In [61]:
#preliminaries
import numpy as np
import pandas as pd
import polars as pl
import plotly.express as px
import statsmodels.api as sm
from datetime import datetime
from datetime import date

#enable string cache for polars categoricals
pl.enable_string_cache()
#display settings
pd.set_option('display.max_columns', None)

In [62]:
#load data
main_lf = (
    #load parquet files to polars lazyframe 
    pl.scan_parquet('../data/main/*.parquet')
    #choose only exports
    .filter(pl.col('direction')=='export')
    #drop unneeded columns
    .drop('origin_territory', 'origin_region', 'direction', 'carrier_name', 
          'carrier_scac', 'vessel_name', 'voyage_number')
    #rename cols for consistency/sanity
    .rename({'unified_carrier_name':'carrier_name',
             'unified_carrier_scac':'scac',
             'arrival_port_name':'dest_port_name',
             'arrival_port_code':'dest_port_code',
             'departure_port_name':'origin_port_name',
             'departure_port_code':'origin_port_code',
             'pc_alliance':'owner_alliance'})
    #reorder columns for sanity
    .select('bol_id', 'date', 'month', 'year', 'teus', 'hs_code', 'lane_id', 
            'lane_name', 'origin_port_name', 'origin_port_code', 'coast_region', 
            'dest_port_name', 'dest_port_code', 'dest_territory', 'dest_region',  
            'drewery_lane', 'dist', 'rate_40', 'rate_20', 'scac', 'carrier_name', 
            'alliance', 'alliance_member', 'vessel_id', 'vessel_capacity', 
            'vessel_owner', 'owner_alliance', 'primary_cargo', 'shared_teus', 
            'cargo_source')
    #set 0 capacities to missing
    .with_columns(
        pl.col('vessel_capacity').replace(0, None)
    )
)

## 1. Data Summary and Aggregations

The main dataset holds a unique bill of lading (BOL) on each row and includes data from various sources in the columns described below.  

### Column Definitions

- bol_id - unique identifier of each BOL
- date, month, year - departure date of the export
- teus - volume of the shipment in twenty-foot-equivalent (TEU) units
- hs_code - list of commodity codes included in the shipment
- lane_id - unique identifier of the lane (a combination of origin and destination port codes)
- lane_name - name of the lane
- origin_port_name - name of the US port of origin
- origin_port_code - CBP code for the port of origin
- coast_region - US coastal region of origin
- dest_port_name - name of the foreign destination port
- dest_port_code - CBP code for the foreign destination port
- dest_territory - name of country or territory of destination
- dest_region - global region of destination
- drewery_lane - Drewry lane matched to the lane_id using summed haversine distances between the origin/destination port pairs
- dist - summed haversine distance between the origin/destination port pairs
- rate_40 - Drewry CFRI rate index for 40-ft equivalent containers when the cargo was carried
- rate_20 - Drewry CFRI rate index for 20-ft equivalent containers (TEU) when the cargo was carried
- scac - Standard Carrier Alpha Code (SCAC) for the firm that carried the cargo
- carrier_name - name of the carrier
- alliance - the alliance to which the carrier belonged when the cargo was sold (NOTE this needs to be updated when we input regional alliance data)
- alliance_member - boolean for whether or not the carrier was a member of any alliance when that cargo was carried
- vessel_id - International Maritime Organization (IMO) code uniquely identifying the vessel
- vessel_capacity - metric of how many TEU can be carried by the vessel at any given time (computed from Net Register Tonnage)
- vessel_owner - the carrier representing the most TEUs carried by that vessel during the month on which the BOL was carried
- owner_alliance - the OCA to which the vessel_owner belonged at the time
- primary_cargo - boolean indicating whether the cargo came from the vessel_owner or not
- shared_teus - number of TEUs if the cargo did not come from the vessel_owner (identical to 'teus' when primary_cargo is False, and 0 otherwise).
- cargo_source - indicator for whether the cargo came from a member of the vessel_owner's alliance or not. 
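
The definitions of vessel_owner, primary_cargo, and shared_teus above can be illustrated on toy records. This is a minimal sketch, not the code used in the oca_data_prep notebook: the function name `assign_vessel_owner`, the toy BOLs, and the tie-breaking rule (the first carrier encountered wins) are assumptions made for illustration.

```python
from collections import defaultdict

def assign_vessel_owner(bols):
    """Label each BOL with vessel_owner, primary_cargo, and shared_teus.

    bols: list of dicts with keys vessel_id, month, scac, teus.
    """
    # total TEUs per (vessel, month, carrier)
    totals = defaultdict(float)
    for b in bols:
        totals[(b["vessel_id"], b["month"], b["scac"])] += b["teus"]
    # carrier with the largest total on each (vessel, month);
    # ties go to the first carrier encountered (an assumption of this sketch)
    owner = {}
    for (vessel, month, scac), teus in totals.items():
        key = (vessel, month)
        if key not in owner or teus > owner[key][1]:
            owner[key] = (scac, teus)
    labelled = []
    for b in bols:
        vessel_owner = owner[(b["vessel_id"], b["month"])][0]
        primary = b["scac"] == vessel_owner
        labelled.append({
            **b,
            "vessel_owner": vessel_owner,
            "primary_cargo": primary,
            # shared_teus equals teus only when the cargo is not the owner's
            "shared_teus": 0.0 if primary else b["teus"],
        })
    return labelled

# hypothetical BOLs on one vessel in one month
toy_bols = [
    {"bol_id": "a", "vessel_id": 1, "month": "200701", "scac": "MAEU", "teus": 5.0},
    {"bol_id": "b", "vessel_id": 1, "month": "200701", "scac": "MAEU", "teus": 3.0},
    {"bol_id": "c", "vessel_id": 1, "month": "200701", "scac": "HLCU", "teus": 2.5},
]
labelled = assign_vessel_owner(toy_bols)
```

In the toy data MAEU carries the most TEUs on the vessel, so the HLCU shipment is labelled as non-primary cargo and its full volume appears in shared_teus.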


### Summary stats and head

In [63]:
desc = main_lf.describe()
desc.write_csv('../data/misc/usda_maindata_summarystat.csv')
display(desc)
main_lf.limit(5).collect()

statistic,bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
str,str,str,str,f64,f64,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,f64,f64,f64,str,str,f64,f64,str
"""count""","""63763013""","""63762939""","""63763014""",63763014.0,63763014.0,"""63761677""","""63763014""","""63763014""","""63763014""","""63763014""","""63762870""","""63763014""","""63763014""","""63760578""","""63760578""","""63760550""",63760550.0,35746203.0,35746203.0,"""63763014""","""63722411""","""63763014""",63763014.0,63763014.0,57064681.0,"""63763014""","""63763014""",63763014.0,63763014.0,"""63763014"""
"""null_count""","""1""","""75""","""0""",0.0,0.0,"""1337""","""0""","""0""","""0""","""0""","""144""","""0""","""0""","""2436""","""2436""","""2464""",2464.0,28016811.0,28016811.0,"""0""","""40603""","""0""",0.0,0.0,6698333.0,"""0""","""0""",0.0,0.0,"""0"""
"""mean""",,"""2015-12-19 23:11:36.166000""",,2015.465074,3.223902,,,,,,,,,,,,1436.624569,1670.840787,1279.72525,,,,0.449098,9232000.0,2244.391506,,,0.701818,1.043588,
"""std""",,,,4.740342,5.993241,,,,,,,,,,,,1194.021012,793.706898,541.807286,,,,,474436.381625,1465.719912,,,,3.705648,
"""min""","""079A_26004878070""","""2007-01-01""","""200701""",2007.0,0.01,"""-1""",,,,,,,,,,,5.93014,400.0,310.0,,,"""2M Alliance""",0.0,196.0,0.073529,,"""2M Alliance""",0.0,0.0,"""ally"""
"""25%""",,"""2012-02-28""",,2012.0,2.0,,,,,,,,,,,,425.931559,1100.0,880.0,,,,,9218686.0,1027.647059,,,,0.0,
"""50%""",,"""2016-02-19""",,2016.0,2.533158,,,,,,,,,,,,1261.373246,1590.0,1230.0,,,,,9315202.0,2157.941176,,,,0.0,
"""75%""",,"""2019-12-13""",,2019.0,2.533158,,,,,,,,,,,,2377.521062,2020.0,1530.0,,,,,9430868.0,3320.073529,,,,2.0,
"""max""","""zzzz_ZZZZ""","""2023-12-31""","""202312""",2023.0,3729.25,"""ddedo""",,,,,,,,,,,8769.910108,15900.0,12070.0,,,"""The Alliance""",1.0,9979125.0,17889.705882,,"""The Alliance""",1.0,1123.25,"""non-ally"""


bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
str,date,str,i32,f64,str,cat,cat,cat,cat,cat,cat,cat,cat,cat,cat,f64,f64,f64,cat,cat,str,bool,i32,f64,cat,str,bool,f64,str
"""MDSC_MSCUNE549034""",2007-01-01,"""200701""",2007,2.533158,"""4805""","""2002_24722""","""New Orleans — Caucedo""","""NEW ORLEANS""","""2002""","""GULF""","""CAUCEDO""","""24722""","""DOMINICAN REPUBLIC""","""CARIBBEAN""","""US Gulf Coast (Houston) to Col…",1627.659697,,,"""MSCU""","""MEDITERRANEAN SHIPPING COMPANY""","""Non-alliance Carriers""",False,8420892,,"""MSCU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_511654419""",2007-01-01,"""200701""",2007,2.533158,"""630900""","""5301_47563""","""Houston — Cagliari""","""HOUSTON""","""5301""","""GULF""","""CAGLIARI""","""47563""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",585.05394,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""HAPL_CHI070108570""",2007-01-01,"""200701""",2007,2.533158,"""120600""","""5301_47536""","""Houston — Gioia Tauro""","""HOUSTON""","""5301""","""GULF""","""GIOIA TAURO""","""47536""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",890.01468,,,"""HLCU""","""HAPAG LLOYD""","""Grand Alliance IV""",True,8212702,,"""MAEU""","""Non-alliance Carriers""",False,2.533158,"""non-ally"""
"""MLSL_511590672""",2007-01-01,"""200701""",2007,2.533158,"""210690""","""5301_47563""","""Houston — Cagliari""","""HOUSTON""","""5301""","""GULF""","""CAGLIARI""","""47563""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",585.05394,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""
"""MLSL_511655731""",2007-01-01,"""200701""",2007,2.533158,"""120740""","""5301_47536""","""Houston — Gioia Tauro""","""HOUSTON""","""5301""","""GULF""","""GIOIA TAURO""","""47536""","""ITALY""","""MEDITERRANEAN""","""US Gulf Coast (Houston) to Wes…",890.01468,,,"""MAEU""","""MAERSK LINE""","""Non-alliance Carriers""",False,8212702,,"""MAEU""","""Non-alliance Carriers""",True,0.0,"""non-ally"""


In [64]:
main_lf.select(pl.all().approx_n_unique()).collect()

bol_id,date,month,year,teus,hs_code,lane_id,lane_name,origin_port_name,origin_port_code,coast_region,dest_port_name,dest_port_code,dest_territory,dest_region,drewery_lane,dist,rate_40,rate_20,scac,carrier_name,alliance,alliance_member,vessel_id,vessel_capacity,vessel_owner,owner_alliance,primary_cargo,shared_teus,cargo_source
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
54996974,6253,204,17,5804,1234135,22400,22210,149,148,6,1378,1394,203,14,102,15440,562,395,2697,2698,11,2,17167,5231,2521,11,2,3354,2


INSERT VOLUMES BY PORT SIZE VISUAL 

### Top 500 lanes

Containerized freight volumes are highly concentrated on common shipping lanes (e.g., LA to Tokyo). While many ports around the world have small container terminals and thus appear in the PIERS database, these ports are visited much less frequently and constitute tiny percentages of the volumes on each ship. For these reasons, we suspect the impacts of alliance behavior may be different for small container ports; we drop all but the top 500 lanes in order to focus on the majority of volumes. 

NOTE we may tweak the 500 number 

In [15]:
#get top 500 lanes
top_lanes = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .limit(500)
    .collect()
    .to_series()
)

#limit main lf to top lanes 
main_lf = main_lf.filter(pl.col('lane_id').is_in(top_lanes))

In [16]:
df = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .with_row_index()
    .collect()
)
px.bar(df, x='index', y='teus', title='TEUs per lane')
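
Since the 500-lane cutoff may be tweaked, one way to judge a candidate cutoff is the share of total TEUs covered by the N largest lanes. The following is a minimal sketch on made-up volumes; `cumulative_share` and `toy_lanes` are hypothetical names, not part of the notebook.

```python
def cumulative_share(lane_teus, n):
    """Share of total TEUs carried on the n largest lanes.

    lane_teus: dict mapping lane_id -> total TEUs on that lane.
    """
    volumes = sorted(lane_teus.values(), reverse=True)
    total = sum(volumes)
    if total == 0:
        return 0.0
    return sum(volumes[:n]) / total

# hypothetical lane totals, for illustration only
toy_lanes = {"A": 600.0, "B": 300.0, "C": 90.0, "D": 10.0}
share_top2 = cumulative_share(toy_lanes, 2)  # (600 + 300) / 1000
```

Evaluating the same function over a range of cutoffs would show how quickly coverage saturates as more lanes are kept.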

### Aggregations

Our analysis takes place at three different levels of group-by aggregation, each described below. Before aggregating, however, we must address two remaining data cleaning tasks: missing vessel capacity info and hs_code formats. 

#### Vessel Capacities

Capacity data are rarely missing within the period for which capacity data are published (2013-2022) and become progressively more sparse the further we move outside that period, since not all vessels operating in 2008, for example, were still operating by the time capacity data became available. If left unaddressed, this would cause capacities to be underestimated outside the 2013-2022 window. When available, we fill missing capacities with the average of known capacities within a given lane, month, and carrier, making the tacit assumption that within such a short time frame, missing capacities are not likely to be systematically different from known capacities on the same lane and carrier. When these data are not available, we fill nulls with the mean over lane and month, and finally with the mean over the lane as a last resort. 


In [17]:
#fill missing vessel capacities
main_lf = (
    main_lf
    #impute missing vessel capacities
    .with_columns(
        #fill with mean over lane, month, and carrier where available
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id', 'month', 'scac'))
        .otherwise(pl.col('vessel_capacity'))
    )
    .with_columns(
        #fill remaining nulls with mean over lane and month where available
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id', 'month'))
        .otherwise(pl.col('vessel_capacity'))
    )
    .with_columns(
        #fill remaining nulls with mean over lane
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id'))
        .otherwise(pl.col('vessel_capacity'))
    )
)
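
The three-step fallback can be traced on a handful of toy records. This is a plain-Python sketch of the same idea, not a replacement for the polars code; `LEVELS`, `group_means`, `impute_capacity`, and the toy rows are hypothetical. As in the cell above, values filled at an earlier level are included when the mean for the next level is computed.

```python
from collections import defaultdict

# fallback levels, from most to least specific
LEVELS = [("lane_id", "month", "scac"), ("lane_id", "month"), ("lane_id",)]

def group_means(rows, keys):
    """Mean of the non-missing vessel_capacity values within each group."""
    acc = defaultdict(lambda: [0.0, 0])
    for r in rows:
        if r["vessel_capacity"] is not None:
            cell = acc[tuple(r[k] for k in keys)]
            cell[0] += r["vessel_capacity"]
            cell[1] += 1
    return {group: total / n for group, (total, n) in acc.items()}

def impute_capacity(rows):
    """Fill missing capacities level by level; unfillable rows stay None."""
    rows = [dict(r) for r in rows]  # work on copies
    for keys in LEVELS:
        means = group_means(rows, keys)  # earlier fills count toward these means
        for r in rows:
            if r["vessel_capacity"] is None:
                r["vessel_capacity"] = means.get(tuple(r[k] for k in keys))
    return rows

def row(lane, month, scac, cap):
    return {"lane_id": lane, "month": month, "scac": scac, "vessel_capacity": cap}

# hypothetical records, for illustration only
toy_rows = [
    row("L1", "200701", "AAAA", 1000.0),
    row("L1", "200701", "AAAA", None),  # filled at the lane-month-carrier level
    row("L1", "200701", "CCCC", 3000.0),
    row("L1", "200701", "BBBB", None),  # filled at the lane-month level
    row("L1", "200702", "BBBB", None),  # filled at the lane level
    row("L2", "200701", "AAAA", None),  # no known capacity on this lane
]
filled = [r["vessel_capacity"] for r in impute_capacity(toy_rows)]
```

The last toy row stays missing because its lane has no known capacity at any level, which is also what happens to such rows in the polars version.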

#### HS Codes

A key aspect of our study is to compare the impacts of OCAs on different types of cargo; specifically on agricultural and non-agricultural cargoes. In order to aggregate data across the individual 2-digit HS codes, we must address the fact that this data lies in strings of 4–8 character codes separated by a space. Furthermore, since the PIERS data simply list the HS codes contained within each BOL and do not indicate how much each code contributes to the total volume, we assign equal volumes to each hs_code associated with the BOL. These issues are addressed in two steps:
1. splitting the strings into an iterable list and then exploding the table by hs_code to long form where each row corresponds to a unique pair of bol_id and hs_code. 
2. dropping all but the first two digits of the hs_code (note the first two digits of hs_codes describe the highest categorization level)

Note that this results in a somewhat untidy table; if we needed to analyze unique hs_code-bol_id rows, we would then need to re-aggregate over bol_id and hs_code. However, since we are aggregating to lane_id, month, scac, and hs_code anyway, we can skip this step. 

In [18]:
def explode_hscodes(lf, hs_digits=2):
    '''
    Expands the dataset into long form where each row corresponds to only a single commodity (HS Code) within each BOL. 
        Assumes that the raw data are effectively grouped by BOL, resulting in hs_code column sometimes containing more than one code;
        this function undoes that group aggregation. 
    INPUTS
        lf - a polars lazyframe containing the relevant data
        hs_digits - int - the number of leading HS code digits to keep
    OUTPUTS
        lf - a polars lazyframe containing the long form data
    '''
    #instantiate index
    index = pl.arange(0, lf.select(pl.len()).collect().item(), eager=True)
    #explode hs_codes 
    lf = (
        lf
        #set nulls to a 'missing' placeholder to prevent data loss in later steps
        .with_columns(pl.col('hs_code').fill_null('missing'))
        .with_columns(
            #split hs codes into lists
            pl.col('hs_code').str.split(by=' '),
            #create pseudo-index col
            pl.Series(name='index', values=index)
            )
        #explode hs_codes
        .explode('hs_code')
        #drop rows with empty strings (these result from more than one space between hs_codes in the raw data)
        .with_columns(pl.col('hs_code').replace('', None))
        .drop_nulls('hs_code')
        #distribute total volumes evenly across HS Codes -- NOTE this step may change as better metadata is collected
        .with_columns(
            (pl.col('teus')/(pl.col('index').count().over('index'))),
            #replace originally-null values
            pl.col('hs_code').replace('missing', None)
        )
        .with_columns(
            #drop all but first two hs_code digits
            pl.col('hs_code').str.slice(0,hs_digits)
            #cast to integers, replacing non-integers with null
            .cast(pl.Int32, strict=False),
            )
        #drop index
        .drop('index')
    )
    return lf 

In [19]:
#explode hs_codes and simplify to 2-digit
main_lf = explode_hscodes(main_lf)
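
To make the equal-volume allocation concrete, the sketch below applies the same two steps to a single BOL in plain Python. The helper `explode_bol` and the sample values are hypothetical; treating a string with no usable codes the same as a missing hs_code is a choice made for this sketch only.

```python
def explode_bol(bol_id, teus, hs_code, hs_digits=2):
    """Return one (bol_id, hs_prefix, teus_share) tuple per HS code on a BOL."""
    if hs_code is None:
        return [(bol_id, None, teus)]
    # split on single spaces and drop empty strings left by repeated spaces
    codes = [c for c in hs_code.split(" ") if c]
    if not codes:
        return [(bol_id, None, teus)]
    share = teus / len(codes)  # equal volume for every listed code
    rows = []
    for code in codes:
        prefix = code[:hs_digits]
        try:
            prefix_int = int(prefix)
        except ValueError:
            prefix_int = None  # non-numeric prefixes become missing
        rows.append((bol_id, prefix_int, share))
    return rows

# a 3-TEU BOL listing three codes, with a doubled space in the raw string
rows = explode_bol("X", 3.0, "120600  4805 210690")
# rows == [("X", 12, 1.0), ("X", 48, 1.0), ("X", 21, 1.0)]
```

Each of the three codes receives one TEU, and the shares still sum to the original shipment volume.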

#### Missing months

Not all lanes are serviced every month, so the aggregated tables have fewer rows than a complete lane-by-month panel would. To record months with no service as 0 TEUs, for example, we coerce the aggregated tables to include all months in the study time frame.  

In [22]:
#construct dataframe of all lanes over all months

#get unique lanes
lanes_df = main_lf.select('lane_id').unique().collect()
#create df of all months
all_months_df = pl.DataFrame(
    pl.date_range(date(2007,1,1), date(2023,12,1), interval='1mo', eager=True)
    .dt.strftime('%Y%m'),
    schema={'month':pl.Utf8}
    )
#cross join to create full set of unique pairs
all_lanes_months_df = (
    lanes_df
    .join(all_months_df, how='cross')
)
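
The same grid-and-fill idea can be sketched with the standard library, which makes the expected row count easy to check: 17 years of monthly data give 204 months, so 500 lanes yield 102,000 lane-month rows. The names `month_range` and `fill_missing_months` and the toy lanes are hypothetical.

```python
from itertools import product

def month_range(start_year, start_month, end_year, end_month):
    """List of 'YYYYMM' strings from the start month to the end month, inclusive."""
    months = []
    y, m = start_year, start_month
    while (y, m) <= (end_year, end_month):
        months.append(f"{y}{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

def fill_missing_months(observed, lanes, months):
    """Complete lane-by-month grid, with 0.0 TEUs where nothing was observed."""
    return {
        (lane, month): observed.get((lane, month), 0.0)
        for lane, month in product(lanes, months)
    }

months = month_range(2007, 1, 2023, 12)
# hypothetical observed volumes for two lanes
observed = {("L1", "200701"): 12.5}
grid = fill_missing_months(observed, ["L1", "L2"], months)
```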


#### Aggregation 1: Lane-Month

The highest aggregation level we utilize is by lane and month, enabling us to inspect broad lane-level characteristics of the market. 

In [23]:
#main aggregation
lane_month_lf = (
    main_lf
    .with_columns(
        #number of foreign ports accessible from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'origin_port_code')),
        #number of foreign countries accessible from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'origin_port_code')),
        #number of foreign regions accessible from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'origin_port_code'))
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        teus = pl.col('teus').sum(),
        alliance = pl.col('alliance').first(),
        num_vessels = pl.col('vessel_id').unique().count(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports and countries are unique to each lane and month
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    #final grouping
    .group_by('lane_id', 'month')
    .agg(
        hhi_lane = (pl.col('ms_carrier')**2).sum(),
        teus = pl.col('teus').sum(),
        #NOTE this counts the distinct per-carrier vessel counts, not the distinct vessels on the lane
        num_vessels = pl.col('num_vessels').unique().count(),
        num_carriers = pl.col('scac').unique().count(),
        num_alliances = pl.col('alliance').unique().count(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
    )

)

#aggregate vessel capacities
lane_cap_lf = (
    main_lf
    #aggregate
    .group_by('lane_id', 'month', 'vessel_id')
    .agg(pl.col('vessel_capacity').first())
    .group_by('lane_id', 'month')
    .agg(pl.col('vessel_capacity').sum().alias('lane_capacity'))
)

#merge vessel capacities into lane-month lf
lane_month_lf = (
    lane_month_lf
    .join(
        lane_cap_lf,
        on=['lane_id', 'month'],
        how='left'
    )
)

#merge drewery lanes and rates into lane-month lf
rates_lf = (
    main_lf
    .select('lane_id', 'month', 'drewery_lane', 'dist', 'rate_40', 'rate_20')
    .unique(subset=['lane_id', 'month']) #NOTE this should be unique already?
)
#merge
lane_month_lf = (
    lane_month_lf
    .join(
        rates_lf, 
        on=['lane_id', 'month'],
        how='left'
    )
)

#coerce frame to include all possible months
lane_month_lf = (
    all_lanes_months_df.lazy()
    .join(
        lane_month_lf,
        on=['lane_id', 'month'],
        how='left'
    )
    #fill nulls
    .with_columns(
        pl.col('teus').fill_null(0),
        pl.col('num_vessels').fill_null(0),
        pl.col('num_carriers').fill_null(0),
        pl.col('num_alliances').fill_null(0),
        pl.col('num_foreign_ports').fill_null(0),
        pl.col('num_foreign_countries').fill_null(0),
        pl.col('num_foreign_regions').fill_null(0),
        pl.col('lane_capacity').fill_null(0)
    )
)

#save to csv
lane_month_lf.collect().write_csv('../data/usda_aggregations/lane_month.csv')

display(lane_month_lf.sort(by='month').limit(5).collect())
lane_month_lf.describe()

lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_alliances,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity,drewery_lane,dist,rate_40,rate_20
cat,str,f64,f64,u32,u32,u32,u32,u32,u32,f64,cat,f64,f64,f64
"""1703_35177""","""200701""",0.194482,463.567855,4,8,3,43,32,11,10986.029412,"""US East Coast (New York) to Br…",1149.382682,,
"""4601_41251""","""200701""",0.367263,1324.841465,3,8,2,60,42,11,12942.235051,"""US East Coast (New York) to UK…",339.064744,,
"""1303_58201""","""200701""",0.831675,392.63944,2,3,1,12,12,5,10845.790441,"""US East Coast (New York) to Ho…",287.87386,,
"""3001_58000""","""200701""",,0.0,0,0,0,0,0,0,0.0,,,,
"""2704_58857""","""200701""",0.169333,1185.517792,4,10,3,36,16,5,43172.192973,"""US West Coast (Los Angeles) to…",285.249544,,


statistic,lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_alliances,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity,drewery_lane,dist,rate_40,rate_20
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64
"""count""","""102000""","""102000""",97065.0,102000.0,102000.0,102000.0,102000.0,102000.0,102000.0,102000.0,102000.0,"""97065""",97065.0,49738.0,49738.0
"""null_count""","""0""","""0""",4935.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"""4935""",4935.0,52262.0,52262.0
"""mean""",,,0.449528,1406.335301,3.486118,6.407559,2.72748,34.259343,21.253951,6.726471,34968.061738,,1314.252251,1603.674253,1237.05718
"""std""",,,0.280117,2378.364493,2.020358,4.317535,1.331855,18.203147,12.295121,3.763864,32347.113303,,1003.786978,760.79759,528.156237
"""min""",,"""200701""",0.063806,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,5.93014,400.0,310.0
"""25%""",,,0.232376,303.78788,2.0,3.0,2.0,16.0,13.0,4.0,8853.257321,,552.798207,1050.0,840.0
"""50%""",,,0.351852,762.48046,3.0,6.0,3.0,38.0,20.0,6.0,26917.941176,,1231.437882,1490.0,1170.0
"""75%""",,,0.598247,1627.165788,5.0,9.0,4.0,46.0,30.0,11.0,51857.426471,,1780.936587,1980.0,1480.0
"""max""",,"""202312""",1.0,48794.368538,14.0,23.0,7.0,65.0,44.0,11.0,218729.107,,6620.504391,8250.0,5070.0


In [34]:
lane_month_lf.select(pl.all().approx_n_unique()).collect()

lane_id,month,hhi_lane,teus,num_vessels,num_carriers,num_alliances,num_foreign_ports,num_foreign_countries,num_foreign_regions,lane_capacity,drewery_lane,dist,rate_40,rate_20
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
500,204,84206,72654,15,24,8,46,35,11,86779,76,337,474,351
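
The hhi_lane column is the Herfindahl-Hirschman Index of carrier market shares on a lane in a month, i.e. the sum of squared TEU shares. A minimal worked example follows; the function name `hhi` and the toy volumes are hypothetical, and returning None for a lane-month with no volume is chosen here to mirror the nulls that appear in hhi_lane for months with no service.

```python
def hhi(teus_by_carrier):
    """Herfindahl-Hirschman Index from carrier TEU totals on one lane-month."""
    total = sum(teus_by_carrier.values())
    if total == 0:
        return None  # no service, so market shares are undefined
    return sum((t / total) ** 2 for t in teus_by_carrier.values())

# hypothetical lane-month with three carriers: shares 0.5, 0.3, 0.2
example = hhi({"AAAA": 50.0, "BBBB": 30.0, "CCCC": 20.0})  # 0.25 + 0.09 + 0.04
monopoly = hhi({"AAAA": 10.0})
```

The index ranges from 1/n for n equally sized carriers up to 1 for a single carrier, which matches the maximum of 1.0 in the summary table above.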


#### Aggregation 2: Lane-Month-Carrier

The next level of aggregation is over carriers in addition to lane and month, allowing us to analyze the ports serviced, total volumes and capacities, etc., of specific carriers within each lane. 

In [25]:
#main aggregation
lane_month_carrier_lf = (
    main_lf
    .with_columns(
        #number of foreign ports serviced by carrier from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'scac', 'origin_port_code')),
        #number of foreign countries serviced by carrier from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'scac', 'origin_port_code')),
        #number of foreign regions serviced by carrier from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'scac', 'origin_port_code'))
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac')
    .agg(
        teus = pl.col('teus').sum(),
        alliance = pl.col('alliance').first(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports, countries, and regions are constant within each lane, month, and carrier
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    .with_columns(
        #number of carriers active on the lane
        num_carriers = pl.col('scac').unique().count().over('lane_id', 'month'),
        #hhi
        hhi_lane = (pl.col('ms_carrier')**2).sum().over('lane_id', 'month')
    )
)

#aggregate vessels and capacities
lane_carrier_cap_lf = (
    main_lf
    #drop vessels not from the given carrier
    .with_columns(
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_capacity'))
        .otherwise(pl.lit(None))
        .alias('vessel_capacity'),
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_id'))
        .otherwise(pl.lit(None))
        .alias('vessels')
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac', 'vessels')
    .agg(
        pl.col('vessel_capacity').drop_nulls().first()
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        pl.col('vessel_capacity').sum().alias('carrier_lane_capacity'),
        pl.col('vessels').unique().count().alias('num_vessels')
    )
)

#merge vessel capacities into lane-month-carrier lf
lane_month_carrier_lf = (
    lane_month_carrier_lf
    .join(
        lane_carrier_cap_lf,
        on=['lane_id', 'month', 'scac'],
        how='left'
    )
)

#merge drewery lanes and rates into lane-month-carrier lf
lane_month_carrier_lf = (
    lane_month_carrier_lf
    .join(
        rates_lf, 
        on=['lane_id', 'month'],
        how='left'
    )
)

#coerce frame to include all possible lane-month-carrier combos
#get all carriers
all_carriers_df = main_lf.select('scac').unique().collect()
#cross join with all_lanes_months to get full set
all_lanes_months_carriers_df = (
    all_lanes_months_df
    .join(
        all_carriers_df,
        how='cross'
    )
    #drop repeats just in case
    .unique()
)

lane_month_carrier_lf = (
    all_lanes_months_carriers_df.lazy()
    .join(
        lane_month_carrier_lf,
        on=['lane_id', 'month', 'scac'],
        how='left'
    )
    #fill nulls
    .with_columns(
        pl.col('teus').fill_null(0),
        pl.col('num_vessels').fill_null(0),
        pl.col('num_carriers').fill_null(0),
        pl.col('num_foreign_ports').fill_null(0),
        pl.col('num_foreign_countries').fill_null(0),
        pl.col('num_foreign_regions').fill_null(0),
        pl.col('carrier_lane_capacity').fill_null(0)
    )
)

#save to csv
lane_month_carrier_lf.collect().write_csv('../data/usda_aggregations/lane_month_carrier.csv')

display(lane_month_carrier_lf.limit(5).collect())
lane_month_carrier_lf.describe()

lane_id,month,scac,teus,alliance,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels,drewery_lane,dist,rate_40,rate_20
cat,str,cat,f64,str,u32,u32,u32,f64,u32,f64,f64,u32,cat,f64,f64,f64
"""1703_24128""","""201111""","""ANLC""",0.0,,0,0,0,,0,,0.0,0,,,,
"""2811_57047""","""202312""","""GIML""",0.0,,0,0,0,,0,,0.0,0,,,,
"""4601_24128""","""201508""","""PRGH""",0.0,,0,0,0,,0,,0.0,0,,,,
"""2811_58895""","""201412""","""BNUH""",0.0,,0,0,0,,0,,0.0,0,,,,
"""5203_23600""","""201009""","""LBLU""",0.0,,0,0,0,,0,,0.0,0,,,,


statistic,lane_id,month,scac,teus,alliance,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels,drewery_lane,dist,rate_40,rate_20
str,str,str,str,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64
"""count""","""80988000""","""80988000""","""80988000""",80988000.0,"""653571""",80988000.0,80988000.0,80988000.0,653571.0,80988000.0,653571.0,80988000.0,80988000.0,"""653571""",653571.0,345480.0,345480.0
"""null_count""","""0""","""0""","""0""",0.0,"""80334429""",0.0,0.0,0.0,80334429.0,0.0,80334429.0,0.0,0.0,"""80334429""",80334429.0,80642520.0,80642520.0
"""mean""",,,,1.771203,,0.167648,0.10951,0.038664,0.148515,0.075186,0.315278,32.905865,0.017692,,1120.773101,1486.765283,1157.41574
"""std""",,,,57.596236,,2.181968,1.448101,0.50396,0.211436,0.914602,0.197136,663.879179,0.248497,,863.244958,735.959586,515.286982
"""min""",,"""200701""",,0.0,"""2M Alliance""",0.0,0.0,0.0,3.7e-05,0.0,0.063806,0.0,0.0,,5.93014,400.0,310.0
"""25%""",,,,0.0,,0.0,0.0,0.0,0.018582,0.0,0.180287,0.0,0.0,,445.895706,950.0,770.0
"""50%""",,,,0.0,,0.0,0.0,0.0,0.064037,0.0,0.258435,0.0,0.0,,1149.382682,1370.0,1080.0
"""75%""",,,,0.0,,0.0,0.0,0.0,0.180757,0.0,0.380187,0.0,0.0,,1544.956374,1810.0,1400.0
"""max""",,"""202312""",,25915.51,"""The Alliance""",59.0,42.0,11.0,1.0,23.0,1.0,84876.489282,17.0,,6620.504391,8250.0,5070.0


In [65]:
lane_month_carrier_lf.select(pl.all().approx_n_unique()).collect()

lane_id,month,scac,teus,alliance,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,carrier_lane_capacity,num_vessels,drewery_lane,dist,rate_40,rate_20
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
500,204,793,115389,12,60,43,12,546303,24,83009,101174,18,76,337,474,351
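
At this level, carrier_lane_capacity and num_vessels count only vessels attributed to the carrier itself (vessel_owner equal to scac), with each vessel counted once per lane and month. The sketch below reproduces that logic in plain Python on toy rows; `carrier_lane_capacity` as a function name and the sample data are hypothetical. Unlike the polars version, carriers with no vessels of their own are simply absent from the result rather than assigned zeros.

```python
def carrier_lane_capacity(rows):
    """Own-vessel count and capacity per (lane_id, month, scac)."""
    own = {}
    for r in rows:
        if r["vessel_owner"] != r["scac"]:
            continue  # cargo carried on another carrier's vessel
        vessels = own.setdefault((r["lane_id"], r["month"], r["scac"]), {})
        # keep the first non-missing capacity seen for each vessel
        if vessels.get(r["vessel_id"]) is None:
            vessels[r["vessel_id"]] = r["vessel_capacity"]
    return {
        key: {
            "num_vessels": len(vessels),
            "carrier_lane_capacity": sum(c for c in vessels.values() if c is not None),
        }
        for key, vessels in own.items()
    }

def bol(scac, owner, vessel_id, cap):
    return {"lane_id": "L1", "month": "200701", "scac": scac,
            "vessel_owner": owner, "vessel_id": vessel_id, "vessel_capacity": cap}

# hypothetical BOLs: MAEU owns vessels 1 and 2; HLCU ships on MAEU's vessel 1
toy = [
    bol("MAEU", "MAEU", 1, 1000.0),
    bol("MAEU", "MAEU", 1, 1000.0),  # second BOL on the same vessel
    bol("MAEU", "MAEU", 2, 2000.0),
    bol("HLCU", "MAEU", 1, 1000.0),
]
caps = carrier_lane_capacity(toy)
```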


#### Aggregation 3: Lane-Month-Carrier-Commodity

Aggregating by lane, month, carrier, and commodity (hs code) allows us to inspect the impacts of OCAs for different commodities. 
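
At this level, market shares and concentration are also computed within each commodity: a carrier's share of a given HS code on a lane-month, and the HHI across carriers for that HS code. A minimal sketch on toy volumes follows; `commodity_shares` and the sample numbers are hypothetical, and every HS code is assumed to have positive total volume.

```python
from collections import defaultdict

def commodity_shares(teus):
    """Per-commodity carrier shares and HHI for a single lane-month.

    teus: dict mapping (scac, hs_code) -> TEUs.
    Returns (shares, hhi_by_hs).
    """
    # total TEUs per commodity across carriers
    totals = defaultdict(float)
    for (scac, hs), t in teus.items():
        totals[hs] += t
    # carrier share within each commodity
    shares = {(scac, hs): t / totals[hs] for (scac, hs), t in teus.items()}
    # concentration within each commodity
    hhi_by_hs = defaultdict(float)
    for (scac, hs), s in shares.items():
        hhi_by_hs[hs] += s ** 2
    return shares, dict(hhi_by_hs)

# hypothetical volumes: HS 12 is concentrated, HS 84 is evenly split
toy = {("AAAA", 12): 30.0, ("BBBB", 12): 10.0,
       ("AAAA", 84): 5.0, ("BBBB", 84): 5.0}
shares, hhi_by_hs = commodity_shares(toy)
```

In the toy data the same two carriers serve both commodities, yet concentration differs by commodity (0.625 for HS 12 against 0.5 for HS 84), which is the kind of variation this aggregation level is meant to expose.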

In [27]:
#main aggregation
lane_month_carrier_hs_lf = (
    main_lf
    .with_columns(
        #number of foreign ports serviced by carrier from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'scac', 'origin_port_code')),
        #number of foreign countries serviced by carrier from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'scac', 'origin_port_code')),
        #number of foreign regions serviced by carrier from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'scac', 'origin_port_code'))
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac', 'hs_code')
    .agg(
        teus = pl.col('teus').sum(),
        alliance = pl.col('alliance').first(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
        #NOTE num ports, countries, and regions are constant within each lane, month, and carrier
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        #overall market share
        ms_carrier = ((pl.col('teus').sum().over('lane_id', 'month', 'scac')/
                      (pl.col('teus').sum().over('lane_id', 'month')))),
        #get market share per hs code
        ms_carrier_hs = ((pl.col('teus')/ #note teu is already summed over lane, month, scac, and hs code
                      (pl.col('teus').sum().over('lane_id', 'month', 'hs_code'))))
    )
    .with_columns(
        #number of carriers active on the lane
        num_carriers = pl.col('scac').unique().count().over('lane_id', 'month'),
        #hhi per hs code
        hhi_lane_hs = ((pl.col('ms_carrier_hs')**2).sum()
                       .over('lane_id', 'month', 'hs_code')),
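        #destination data
        #NOTE these sums run across the carrier's hs_code rows, so totals scale with the number of hs codes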
        num_foreign_ports = pl.col('num_foreign_ports').sum()
        .over('lane_id', 'month', 'scac'),
        num_foreign_countries = pl.col('num_foreign_countries').sum()
        .over('lane_id', 'month', 'scac'),
        num_foreign_regions = pl.col('num_foreign_regions').sum()
        .over('lane_id', 'month', 'scac'),
    )
    #drop missing hs codes
    .drop_nulls(subset='hs_code')
)

#aggregate vessels and capacities
lane_carrier_cap_lf = (
    main_lf
    #drop vessels not from the given carrier
    .with_columns(
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_capacity'))
        .otherwise(pl.lit(None))
        .alias('vessel_capacity'),
        pl.when(pl.col('vessel_owner')==pl.col('scac'))
        .then(pl.col('vessel_id'))
        .otherwise(pl.lit(None))
        .alias('vessels')
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac', 'vessels')
    .agg(
        pl.col('vessel_capacity').drop_nulls().first()
    )
    .group_by('lane_id', 'month', 'scac')
    .agg(
        pl.col('vessel_capacity').sum().alias('carrier_lane_capacity'),
        pl.col('vessels').unique().count().alias('num_vessels')
    )
)

#merge lane-month hhi into lane-month-carrier-hs lf
lane_month_carrier_hs_lf = (
    lane_month_carrier_hs_lf
    .join(
        lane_month_lf.select('lane_id', 'month', 'hhi_lane'),
        on=['lane_id', 'month'],
        how='left'
    )
)


#merge vessel capacities into lane-month-carrier-hs lf
lane_month_carrier_hs_lf = (
    lane_month_carrier_hs_lf
    .join(
        lane_carrier_cap_lf,
        on=['lane_id', 'month', 'scac'],
        how='left'
    )
)

#merge rates
lane_month_carrier_hs_lf = (
    lane_month_carrier_hs_lf
    .join(
        rates_lf, 
        on=['lane_id', 'month'],
        how='left'
    )
)

#coerce frame to include all possible months
lane_month_carrier_hs_lf = (
    all_lanes_months_df.lazy()
    .join(
        lane_month_carrier_hs_lf,
        on=['lane_id', 'month'],
        how='left'
    )
    #fill nulls
    .with_columns(
        pl.col('teus').fill_null(0),
        pl.col('num_vessels').fill_null(0),
        pl.col('num_carriers').fill_null(0),
        pl.col('num_foreign_ports').fill_null(0),
        pl.col('num_foreign_countries').fill_null(0),
        pl.col('num_foreign_regions').fill_null(0),
        pl.col('carrier_lane_capacity').fill_null(0)
    )
)

#save to csv
#lane_month_carrier_hs_lf.collect().write_csv('../data/usda_aggregations/lane_month_carrier_hs.csv')

display(lane_month_carrier_hs_lf.limit(5).collect())
lane_month_carrier_hs_lf.describe()

lane_id,month,scac,hs_code,teus,alliance,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,ms_carrier_hs,num_carriers,hhi_lane_hs,hhi_lane,carrier_lane_capacity,num_vessels,drewery_lane,dist,rate_40,rate_20
cat,str,cat,i32,f64,str,u32,u32,u32,f64,f64,u32,f64,f64,f64,u32,cat,f64,f64,f64
"""1902_21500""","""200701""",,,0.0,,0,0,0,,,0,,,0.0,0,,,,
"""1902_21500""","""200702""",,,0.0,,0,0,0,,,0,,,0.0,0,,,,
"""1902_21500""","""200703""",,,0.0,,0,0,0,,,0,,,0.0,0,,,,
"""1902_21500""","""200704""","""TFXL""",84.0,2.533158,"""Non-alliance Carriers""",9,6,3,1.0,1.0,1,1.0,1.0,363.676471,1,"""US Gulf Coast (Houston) to Mex…",1599.017345,,
"""1902_21500""","""200705""","""TFXL""",27.0,2.533158,"""Non-alliance Carriers""",9,6,3,1.0,1.0,1,1.0,1.0,363.676471,1,"""US Gulf Coast (Houston) to Mex…",1599.017345,,


statistic,lane_id,month,scac,hs_code,teus,alliance,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,ms_carrier_hs,num_carriers,hhi_lane_hs,hhi_lane,carrier_lane_capacity,num_vessels,drewery_lane,dist,rate_40,rate_20
str,str,str,str,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64
"""count""","""6979267""","""6979267""","""6974332""",6974332.0,6979267.0,"""6974332""",6979267.0,6979267.0,6979267.0,6974332.0,6974332.0,6979267.0,6974332.0,6974332.0,6979267.0,6979267.0,"""6974332""",6974332.0,3382054.0,3382054.0
"""null_count""","""0""","""0""","""4935""",4935.0,0.0,"""4935""",0.0,0.0,0.0,4935.0,4935.0,0.0,4935.0,4935.0,0.0,0.0,"""4935""",4935.0,3597213.0,3597213.0
"""mean""",,,,47.497776,20.552739,,464.340385,300.761016,101.643246,0.258668,0.457895,9.367733,0.578199,0.329703,5316.288772,2.82371,,1024.708998,1501.142534,1157.898215
"""std""",,,,28.909877,97.522343,,445.766386,290.170129,95.734389,0.277054,0.380004,4.705674,0.294473,0.23431,7067.637963,2.026667,,904.264206,791.18154,541.416631
"""min""",,"""200701""",,-1.0,0.0,"""2M Alliance""",0.0,0.0,0.0,3.7e-05,3.9e-05,0.0,0.078869,0.063806,0.0,0.0,,5.93014,400.0,310.0
"""25%""",,,,25.0,2.533158,,133.0,90.0,35.0,0.061634,0.104172,6.0,0.328189,0.171841,0.0,1.0,,219.743793,930.0,760.0
"""50%""",,,,41.0,5.066315,,330.0,215.0,70.0,0.150812,0.333333,9.0,0.505799,0.251365,2124.117647,2.0,,1035.355115,1350.0,1060.0
"""75%""",,,,76.0,12.665788,,666.0,425.0,140.0,0.345592,0.957447,13.0,0.951837,0.388242,8614.926471,4.0,,1526.608008,1870.0,1420.0
"""max""",,"""202312""",,99.0,19674.521504,"""The Alliance""",4256.0,2964.0,836.0,1.0,1.0,23.0,1.0,1.0,84876.489282,17.0,,6620.504391,8250.0,5070.0


## 2. Visuals and Descriptive Analysis

In [28]:
df = (
    main_lf
    #group by alliance, scac, and year
    .group_by('alliance', 'scac', 'year')
    .agg(pl.col('teus').sum())
    #get proportion 
    .with_columns((pl.col('teus')/pl.col('teus').sum().over('year')).alias('prop_volume'))
    .collect()
)

#plot
fig = px.bar(
    df, x='year', y='prop_volume',
    color='alliance',
    text='scac', 
    barmode='stack',  
    title='Percent of Export Volumes Over Time',
    width=900,
    height=500,
    color_discrete_map={
        'Non-alliance Carriers':'#636EFA',
        'The Alliance':'#EF553B',
        'Ocean Alliance':'#00CC96',
        'CYKHE':'#ABC3FA',
        'New World All (NWA)':'#FFA15A',
        'Grand Alliance IV':'#19D3F3',
        'G6 Alliance':'#FF6692',
        'MSC/CMA CGM':'#B6E880',
        '2M Alliance':'#FECB52',
        'Ocean Three':'#FF97FF',
        'CYKH':'#17BECF',
    },
    labels={
        'alliance':'Alliance:',
        'prop_volume':'Volume (Percent of Total)',
        'year':'Time'
    },
    category_orders={
        'alliance':[
            'Non-alliance Carriers', 'The Alliance', 'CYKHE', 'Ocean Alliance',  
            'CYKH', 'Ocean Three', '2M Alliance', 'G6 Alliance', 'MSC/CMA CGM', 'New World All (NWA)', 'Grand Alliance IV'
        ]
    }
    )

fig.show()

NOTE: we need to update masterAlliances.csv with the new unified SCAC codes.

In [29]:
df = (
    main_lf
    #get cargo source column
    .with_columns(
        pl.when(pl.col('primary_cargo')==1)
        .then(pl.lit('self'))
        .otherwise(
            pl.when(pl.col('alliance')==pl.col('owner_alliance'))
            .then(pl.lit('ally'))
            .otherwise(pl.lit('non-ally'))
        )
        .alias('cargo_source')
    )
    #group by source and month
    .group_by('cargo_source', 'month')
    .agg(
        pl.col('teus').sum()
    )
    .with_columns((pl.col('teus')/pl.col('teus').sum().over('month')).alias('prop_volume'))
    .cast({'cargo_source':pl.Utf8})
    .sort(by=['month'])
    .collect()
)

px.area(
    df.with_columns(pl.col('month').str.to_datetime('%Y%m')), x='month', y='prop_volume',
    color='cargo_source',
    color_discrete_map={
        'self':'#636EFA',
        'ally':'#EF553B',
        'non-ally':'#00CC96'
    },
    labels={
        'prop_volume':'Share of Total Capacity',
        'cargo_source':'Cargo Source',
        'month':'Time',
        'self':'Self',
        'ally':'Allies',
        'non-ally':'Others'
    },
    category_orders={
        'cargo_source':['self', 'ally', 'non-ally']
    },
    width=900,
    height=500,
    title='Cargo Source Over Time'
)

In [30]:
px.scatter(
    lane_month_lf.select('teus', 'hhi_lane', 'month')
    .with_columns(pl.col('month').str.to_date('%Y%m').dt.year().alias('year'))
    .collect(),
    y='teus', x='hhi_lane',
    animation_frame='year'
)

In [31]:
def sharing_by_carrier_med(lf, scac, carrier_name=None):
    '''
    ad hoc function to inspect sharing by scac over time for a given carrier
    inputs:
        lf - polars LazyFrame - the shipment-level data (e.g. main_lf)
        scac - str - the SCAC for the carrier of interest
        carrier_name - str - the string name of the carrier of interest for plot title
    '''
    #get top sharing partners
    top_partners = (
        lf
        #get shared cargo col
        .with_columns((pl.col('teus')*(1-pl.col('primary_cargo'))).alias('shared_teus'))
        #only the given carrier's vessels but not its own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #group by carrier
        .group_by('carrier_name')
        .agg(pl.col('shared_teus').sum())
        #filter to top 5
        .sort(by='shared_teus', descending=True)
        .limit(5)
        #retain only carrier names
        .drop('shared_teus')
        #collect
        .collect()
        #convert from dataframe to series
        .to_series()
    )

    #make df for graphing
    df = (
        lf
        #only the given carrier's vessels but not its own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #create partners column
        .with_columns(
            pl.when(pl.col('carrier_name').is_in(top_partners))
            .then(pl.col('carrier_name'))
            .otherwise(pl.lit('Other Carriers'))
            .alias('partner')
        )
        #get percentages of other carriers
        .group_by('carrier_name', 'partner','month')
        .agg(pl.col('teus').sum())
        .with_columns(pl.col('teus').sum().over('month').alias('total_shared'))
        .with_columns((pl.col('teus')/pl.col('total_shared')).alias('prop_shared'))
        .with_columns(pl.col('month').str.to_datetime('%Y%m'))
        #sort for plotting
        .sort(by='month')
        .cast({'partner':pl.Utf8})
        .collect()
    )
    #plot
    fig = px.bar(df, x='month', y='prop_shared', color='partner', 
                title=str('Proportion of shared cargo on '+carrier_name+' ships by Carrier') if carrier_name else str('Proportion of shared cargo on '+scac+' ships by Carrier'),
                labels={
                    'prop_shared':'Percentage of Shared Cargo',
                    'month':'Time',
                    'partner':'Carrier'
                },
                width=1100,
                height=500,
                category_orders={
                    'partner':['Other Carriers', 'SAFMARINE', 'ANL CONTAINER LINE',  'CMA-CGM','HYUNDAI', 'MEDITERRANEAN SHIPPING COMPANY']
                }
    )
    fig.add_vline(x=datetime.strptime("2015-01-01", "%Y-%m-%d").timestamp() * 1000, 
                  line_width=4, line_dash='dash',
                  annotation_text="2M Alliance begins"
                  )
    fig.show()

In [32]:
sharing_by_carrier_med(main_lf, 'MSCU', 'Mediterranean')

In [33]:
def sharing_by_carrier_evergreen(lf, scac, carrier_name=None):
    '''
    ad hoc function to inspect sharing by scac over time for a given carrier
    inputs:
        lf - polars LazyFrame - the shipment-level data (e.g. main_lf)
        scac - str - the SCAC for the carrier of interest
        carrier_name - str - the string name of the carrier of interest for plot title
    '''
    #get top sharing partners
    top_partners = (
        lf
        #get shared cargo col
        .with_columns((pl.col('teus')*(1-pl.col('primary_cargo'))).alias('shared_teus'))
        #only the given carrier's vessels but not its own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #group by carrier
        .group_by('carrier_name')
        .agg(pl.col('shared_teus').sum())
        #filter to top 5
        .sort(by='shared_teus', descending=True)
        .limit(5)
        #retain only carrier names
        .drop('shared_teus')
        #collect
        .collect()
        #convert from dataframe to series
        .to_series()
    )

    #make df for graphing
    df = (
        lf
        #only the given carrier's vessels but not its own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #create partners column
        .with_columns(
            pl.when(pl.col('carrier_name').is_in(top_partners))
            .then(pl.col('carrier_name'))
            .otherwise(pl.lit('Other Carriers'))
            .alias('partner')
        )
        #get percentages of other carriers
        .group_by('carrier_name', 'partner','month')
        .agg(pl.col('teus').sum())
        .with_columns(pl.col('teus').sum().over('month').alias('total_shared'))
        .with_columns((pl.col('teus')/pl.col('total_shared')).alias('prop_shared'))
        .with_columns(pl.col('month').str.to_datetime('%Y%m'))
        #sort for plotting
        .sort(by='month')
        .cast({'partner':pl.Utf8})
        .collect()
    )
    #plot
    fig = px.bar(df, x='month', y='prop_shared', color='partner', 
                title=str('Proportion of shared cargo on '+carrier_name+' ships by Carrier') if carrier_name else str('Proportion of shared cargo on '+scac+' ships by Carrier'),
                labels={
                    'prop_shared':'Percentage of Shared Cargo',
                    'month':'Time',
                    'partner':'Carrier'
                },
                width=1100,
                height=500,
                category_orders={
                    'partner':['Other Carriers', 'SAFMARINE', 'ANL CONTAINER LINE',  'CMA-CGM','HYUNDAI', 'MEDITERRANEAN SHIPPING COMPANY']
                }
    )
    fig.add_vline(x=datetime.strptime("2017-04-01", "%Y-%m-%d").timestamp() * 1000, 
                  line_width=4, line_dash='dash',
                  annotation_text="Joined Ocean Alliance"
                  )
    fig.add_vline(x=datetime.strptime("2014-04-01", "%Y-%m-%d").timestamp() * 1000, 
                  line_width=4, line_dash='dash',
                  annotation_text="Joined CYKHE Alliance",
                  annotation_position="top left"
                  )
    fig.show()

sharing_by_carrier_evergreen(main_lf, 'EGLV', 'Evergreen')

#### Visualizing Lane Concentration 

In [54]:
#inspect concentration of volumes by lane

#get avg vessel cap
mean_vessel_cap = (
    main_lf.select(pl.mean('vessel_capacity')).collect()[0,0]
)

#get top 200 lanes
top_200lanes = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .limit(200)
    .collect()
    .to_series()
)

#construct lane sizes per month
df = (
    main_lf
    .filter(pl.col('lane_id').is_in(top_200lanes))
    .group_by('lane_id', 'month')
    .agg(pl.col('teus').sum())
    .group_by('lane_id')
    .agg(pl.col('teus').mean().alias('avg_monthly_volume'))
    .sort(by='avg_monthly_volume', descending=True)
    .with_row_index()
    .collect()
)

In [60]:
fig = px.bar(
    df, x='index', y='avg_monthly_volume', 
    labels={
        'index':'Lane Rank by Average Monthly Volume',
        'avg_monthly_volume':'Average Monthly Volume (TEU)'
    }, 
    title='Average Monthly Volume Concentration Across Top 200 Lanes',
    width=1100, height=500
    )
fig.add_hline(y=mean_vessel_cap, annotation_text=('Average Single-vessel Capacity ({} TEUs)').format(round(mean_vessel_cap)))
fig.show()

## 3. Modeling

### Alliance-level Fixed Effects Model

$Y_{alt} = \beta_0 + L_{lt} + A_{at} + AA_{alt} + \beta_1 A_{at}*AA_{alt} + \epsilon_{alt}$

Where
$Y_{alt}$ is the outcome variable of interest (rate, capacity utilization, frequency of service, etc.)
