# Ocean Carrier Alliances - Data Analysis for USDA Report

This notebook analyses data on maritime container shipping to inspect the impacts of strategic alliances between ocean freight carriers (Ocean Carrier Alliances, OCAs) on the containerized agricultural export market in the US. Data are pre-processed in the oca_data_prep notebook; see the [project repo](https://github.com/epistemetrica/Ocean-Carrier-Alliances-Project) for full details.

Section 1 presents data summaries and aggregations, Section 2 gives general context for the maritime freight economy and vessel sharing behavior, and Section 3 presents the analytical models and results.

In [2]:
#preliminaries
import numpy as np
import pandas as pd
import polars as pl
import plotly.express as px
import statsmodels.api as sm
from datetime import datetime
from datetime import date

#enable string cache for polars categoricals
pl.enable_string_cache()
#display settings
pd.set_option('display.max_columns', None)

In [3]:
#load data
main_lf = (
    #load parquet files to polars lazyframe 
    pl.scan_parquet('../data/main/*.parquet')
    #choose only exports
    .filter(pl.col('direction')=='export')
    #drop unneeded columns
    .drop('origin_territory', 'origin_region', 'direction', 'carrier_name', 
          'carrier_scac', 'vessel_name', 'voyage_number')
    #rename cols for consistency/sanity
    .rename({'unified_carrier_name':'carrier_name',
             'unified_carrier_scac':'scac',
             'arrival_port_name':'dest_port_name',
             'arrival_port_code':'dest_port_code',
             'departure_port_name':'origin_port_name',
             'departure_port_code':'origin_port_code',
             'pc_alliance':'owner_alliance'})
    #reorder columns for sanity
    .select('bol_id', 'date', 'month', 'year', 'teus', 'hs_code', 'lane_id', 
            'lane_name', 'origin_port_name', 'origin_port_code', 'coast_region', 
            'dest_port_name', 'dest_port_code', 'dest_territory', 'dest_region',  
            'drewery_lane', 'dist', 'rate_40', 'rate_20', 'scac', 'carrier_name', 
            'alliance', 'alliance_member', 'vessel_id', 'vessel_capacity', 
            'vessel_owner', 'owner_alliance', 'primary_cargo', 'shared_teus', 
            'cargo_source')
    #set 0 capacities to missing
    .with_columns(
        pl.col('vessel_capacity').replace(0, None)
    )
)

## 1. Exports Data Summary

The main dataset holds a unique bill of lading (BOL) on each row and includes data from various sources in the columns described below.  

### Column Definitions

- bol_id - unique identifier of each BOL
- date, month, year - departure date of the export
- teus - volume of the shipment in twenty-foot-equivalent (TEU) units
- hs_code - list of commodity codes included in the shipment
- lane_id - unique identifier of the lane (a combination of origin and destination port codes)
- lane_name - name of the lane
- origin_port_name - name of the US port of origin
- origin_port_code - CBP code for the port of origin
- coast_region - US coastal region of origin
- dest_port_name - name of the foreign destination port
- dest_port_code - CBP code for the foreign destination port
- dest_territory - name of country or territory of destination
- dest_region - global region of destination
- drewery_lane - Drewry lane matched to the lane_id using summed haversine distances between the origin/destination port pairs
- dist - summed haversine distance between the origin/destination port pairs 
- rate_40 - Drewry CFRI rate index for 40-ft equivalent containers when the cargo was carried
- rate_20 - Drewry CFRI rate index for 20-ft equivalent containers (TEU) when the cargo was carried
- scac - Standard Carrier Alpha Code (SCAC) for the firm who carried the cargo
- carrier_name - name of the carrier
- alliance - the alliance to which the carrier belonged when the cargo was sold (NOTE this needs to be updated when we input regional alliance data)
- alliance_member - boolean for whether or not the carrier was a member of any alliance when that cargo was carried
- vessel_id - International Maritime Organization (IMO) code uniquely identifying the vessel
- vessel_capacity - metric of how many TEU can be carried by the vessel at any given time (computed from Net Register Tonnage)
- vessel_owner - the carrier representing the most TEUs carried by that vessel during the month on which the BOL was carried
- owner_alliance - the OCA to which the vessel_owner belonged at the time
- primary_cargo - boolean indicating whether the cargo came from the vessel_owner or not
- shared_teus - number of TEUs if the cargo did not come from the vessel_owner (identical to 'teus' when primary_cargo is 0). 
- cargo_source - indicator for whether the cargo came from a member of the vessel_owner's alliance or not. 


### Summary stats and head

In [4]:
desc = main_lf.describe()
desc.write_csv('../data/usda_aggregations/usda_maindata_summarystat.csv')
#display(desc)
#main_lf.limit(5).collect()

In [5]:
#main_lf.approx_n_unique().collect()

## 2. Additional Data Preparation

To carry out our analyses, we must address some data preparation steps unique to this USDA report. Limiting the data to the top 500 lanes and grouping ports together allows us to analyse the relevant ports at more intuitive levels of aggregation (e.g., ports located in the same general metro area) and compare substitute ports and lanes to each other. We also impute missing vessel capacity data and expand bills of lading that contain multiple HS codes into individual rows per commodity. Finally, within each aggregation, we coerce the tables to include all possible combinations of the group variables, since we want to observe when, for example, a given carrier chooses not to operate any vessels on a given lane in a given month. 

### Top 500 lanes

Containerized freight volumes are highly concentrated on common shipping lanes (e.g., LA to Tokyo). While many ports around the world have small container terminals and thus appear in the PIERS database, these ports are visited much less frequently and constitute tiny percentages of the volumes on each ship. For these reasons, we suspect the impacts of alliance behavior may be different for small container ports; we drop all but the top 500 lanes in order to focus on the majority of volumes. 

In [6]:
#get top 500 lanes
top_lanes = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .limit(500)
    .collect()
    .to_series()
)

#limit main lf to top lanes 
main_lf = main_lf.filter(pl.col('lane_id').is_in(top_lanes))

### Port Groups

An important aspect of our study is to compare activity on substitute lanes when an alliance is active on another. For example, if an alliance has a high share of the Seattle-Tokyo market, that may affect the Oakland-Tokyo market as shippers move to or from that lane. 

In order to analyse these effects, we must first identify which ports and lanes are substitutes for each other. The PIERS database has done much of this work for us, grouping domestic ports into US coastal groups and foreign ports into individual countries as well as regions (e.g., Northern Europe, SE Asia). We add further detail to the 34 US ports represented in the top 500 lanes by manually grouping ports that share a Combined Statistical Area (using [2023 data obtained from BLS](https://www.bls.gov/bls/omb-bulletin-23-01-revised-delineations-of-metropolitan-statistical-areas.pdf)) or an inland waterway (e.g., the Puget Sound, Chesapeake Bay).  

In [7]:
#create us port groups
us_port_groups_df = (
    #read excel file
    pl.read_excel('../data/misc/usports.xlsx')
    .with_columns(
        #use port name unless port_region already filled
        origin_port_group = (
            pl.when(pl.col('port_region').is_null())
            .then(pl.col('origin_port_name'))
            .otherwise(pl.col('port_region'))
        ).cast(pl.Categorical),
        #cast as categories
        origin_port_name = pl.col('origin_port_name').cast(pl.Categorical)
    )
    #drop port region 
    .drop('port_region')
)

main_lf = (
    main_lf
    #join with main lf
    .join(
        us_port_groups_df.select('origin_port_name', 'origin_port_group').lazy(),
        on='origin_port_name',
        how='left'
    )
    #make substitute lanes cols
    .with_columns(
        lane_group = (pl.col('origin_port_group').cast(pl.Utf8)+'_'+
                      pl.col('dest_territory').cast(pl.Utf8)).cast(pl.Categorical),
        lane_region = (pl.col('coast_region').cast(pl.Utf8)+'_'+
                       pl.col('dest_region').cast(pl.Utf8)).cast(pl.Categorical)
    )
)


### Vessel Capacities

Capacity data are rarely missing in the period for which data are published (2013-2022) and become progressively sparser the further we move outside that period, since not all vessels operating in 2008, for example, were still operating by the time capacity data became available. If left unaddressed, this would cause capacities to be underestimated outside the 2013-2022 window. Where possible, we fill missing capacities with the average of known capacities within a given lane, month, and carrier, making the tacit assumption that within such a short time frame, missing capacities are not likely to be systematically different from known capacities on the same lane and carrier. When no such capacities are known, we fill nulls with the mean over lane and month, and finally with the mean over the lane as a last resort. 


In [8]:
#fill in missing vessel capacities
main_lf = (
    main_lf
    #impute missing vessel capacities
    .with_columns(
        #fill with mean over lane, month, and carrier where available
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id', 'month', 'scac'))
        .otherwise(pl.col('vessel_capacity'))
    )
    .with_columns(
        #fill remaining nulls with mean over lane and month where available
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id', 'month'))
        .otherwise(pl.col('vessel_capacity'))
    )
    .with_columns(
        #fill remaining nulls with mean over lane
        pl.when(pl.col('vessel_capacity').is_null())
        .then(pl.col('vessel_capacity').mean().over('lane_id'))
        .otherwise(pl.col('vessel_capacity'))
    )
)

### HS Codes

A key aspect of our study is to compare the impacts of OCAs on different types of cargo, specifically agricultural and non-agricultural cargoes. In order to aggregate data across the individual 2-digit HS codes, we must address the fact that these data are stored as strings of 4-8 character codes separated by spaces. Furthermore, since the PIERS data simply list the HS codes contained within each BOL and do not indicate how much each code contributes to the total volume, we assign equal volumes to each hs_code associated with the BOL. These issues are addressed in two steps:
1. splitting the strings into an iterable list and then exploding the table by hs_code to long form, where each row corresponds to a unique pair of bol_id and hs_code. 
2. dropping all but the first two digits of the hs_code (the first two digits of an HS code describe the highest categorization level)

Note that this results in a somewhat untidy table; if we needed to analyze unique hs_code-bol_id rows, we would then need to re-aggregate over bol_id and hs_code. However, since we are aggregating to lane_id, month, scac, and hs_code anyway, we can skip this step. 

In [9]:
def explode_hscodes(lf, hs_digits=2):
    '''
    Expands the dataset into long form where each row corresponds to only a single commodity (HS Code) within each BOL. 
        Assumes that the raw data are effectively grouped by BOL, resulting in hs_code column sometimes containing more than one code;
        this function undoes that group aggregation. 
    INPUTS
        lf - a polars lazyframe containing the relevant data
        hs_digits - int - the number of leading HS code digits to keep
    OUTPUTS
        lf - a polars lazyframe containing the long form data
    '''
    #instantiate index
    index = pl.arange(0, lf.select(pl.len()).collect().item(), eager=True)
    #explode hs_codes 
    lf = (
        lf
        #set nulls to a placeholder string to prevent data loss in later steps
        .with_columns(pl.col('hs_code').fill_null('missing'))
        .with_columns(
            #split hs codes into lists
            pl.col('hs_code').str.split(by=' '),
            #create pseudo-index col
            pl.Series(name='index', values=index)
            )
        #explode hs_codes
        .explode('hs_code')
        #drop rows with empty strings (these result from more than one space between hs_codes in the raw data)
        .with_columns(pl.col('hs_code').replace('', None))
        .drop_nulls('hs_code')
        #distribute total volumes evenly across HS Codes -- NOTE this step may change as better metadata is collected
        .with_columns(
            (pl.col('teus')/(pl.col('index').count().over('index'))),
            #replace originally-null values
            pl.col('hs_code').replace('missing', None)
        )
        .with_columns(
            #drop all but first two hs_code digits
            pl.col('hs_code').str.slice(0,hs_digits)
            #cast to integers, replacing non-integers with null
            .cast(pl.Int32, strict=False),
            )
        #drop index
        .drop('index')
    )
    return lf 

In [10]:
#explode hs_codes and simplify to 2-digit
main_lf = explode_hscodes(main_lf)

### Missing Months and other group variables

Not all lanes are serviced every month, so aggregated tables have fewer rows than the full set of lane-month combinations. In order to record months with no service as 0 TEUs, for example, we coerce the aggregated tables to include all combinations of the group variables. 

In [11]:
#define expansion function to account for missing combinations

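def expand(lf, group_vars=None):
    '''
    Expands an aggregated lazyframe to include every combination of month (2007-01 through 2023-12) 
        and the unique values of each group variable. Combinations absent from lf appear as rows 
        with nulls in the remaining columns. 
    INPUTS
        lf - a polars lazyframe aggregated by month and group_vars
        group_vars - list of str - the group variables to expand over (defaults to ['lane_id'])
    OUTPUTS
        expanded_lf - a polars lazyframe containing all combinations

    NOTE automatically expands to all months, in addition to supplied group_vars
    '''
    #copy group_vars so that neither the default nor the caller's list is mutated
    group_vars = ['lane_id'] if group_vars is None else list(group_vars)
    #get lf of all months
    months_lf = pl.LazyFrame(
        pl.date_range(date(2007,1,1), date(2023,12,1), 
                      interval='1mo', eager=True)
        #coerce to match main data month format
        .dt.strftime('%Y%m'),
        schema={'month':pl.Utf8}
        )
    #init expanded lf
    allcombos_lf = months_lf
    #get other groups
    for var in group_vars:
        #create lf from group var
        var_lf = lf.select(var).unique()
        #cross join to create combos
        allcombos_lf = allcombos_lf.join(var_lf, how='cross')
    #join to lf on the group vars plus month
    expanded_lf = (
        allcombos_lf.join(
            lf, on=[*group_vars, 'month'], how='left'
            )
    )
    print('Note: A LazyFrame has been expanded to all combinations of months and the supplied group_vars; remember to fill nulls as appropriate.')
    return expanded_lf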

### Alliances and Individual carriers

In our analysis of market concentration, we sometimes treat alliances as if they were a single carrier (and, equivalently, individual carriers as if they were their own alliance). 

In [12]:
main_lf = (
    main_lf
    .with_columns(
        pl.when(pl.col('alliance')!='Non-alliance Carriers')
        .then(pl.col('alliance'))
        .otherwise(pl.col('scac'))
        .alias('alliance_alt')
    )
)

### Identifying Alliance-operated vessels

Alliance agreements enable carriers to cooperatively operate vessels on given services (loops of ports constituting a complete voyage). The available data do not enable us to directly identify such alliance-operated vessels; however, we expect alliance-operated vessels to carry predominantly cargoes from the alliance partners. 

We thus define a vessel as being alliance-operated when 1) the vessel carries cargo from more than one carrier and 2) more than 90% of the cargoes carried on that vessel during the month are from the same alliance. 

Unfortunately, we cannot distinguish between the case where a carrier in an alliance independently operates a vessel, carrying exclusively its own cargo, and the case where an alliance operates a vessel that happens to carry cargo from a single carrier. 

In [13]:
#identify alliance vessels
main_lf = (
    main_lf
    .with_columns(
        #get alliance volumes
        alliance_teus = (
            pl.when(pl.col('cargo_source')=='ally')
            .then(pl.col('teus')).otherwise(pl.lit(0))
        )
    )
    .with_columns(
        #create col for alliance vessel
        alliance_vessel = (
            pl.when(
                (((pl.col('alliance_teus').sum().over('vessel_id', 'month'))/
                (pl.col('teus').sum().over('vessel_id', 'month')))>0.9) &
                (pl.col('scac').n_unique().over('vessel_id', 'month') > 1)
            )
            .then(pl.col('vessel_id'))
            .otherwise(pl.lit(None))
        )
    )
)


## 3. Panel Construction

We analyze monthly panels at various levels of aggregation, inspecting lanes at the most aggregated level and the breakdown of cargoes carried by specific carriers on individual vessels on each lane at the most detailed level. Summary statistics are reported individually for each panel. 


In [14]:
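def get_hhi(agg_lf, market_var, panel_vars, volume_var='teus', firm_var='scac', 
            time_var='month', name='hhi', source_lf=main_lf, 
            sub_group_var=None):
    '''
    Computes a Herfindahl-Hirschman Index (sum of squared percentage market shares, maximum 10,000) 
        from source_lf and left-joins it onto agg_lf as the column given by name. 
    INPUTS
        agg_lf - a polars lazyframe aggregated by panel_vars and time_var
        market_var - str - the column defining the market over which firm volumes are summed
        panel_vars - list of str - panel variables used (with time_var) to compute shares and to join
        volume_var - str - the column of volumes
        firm_var - str - the column identifying firms (e.g., 'scac' or 'alliance_alt')
        sub_group_var - str - if given, firm volumes are summed over the sub-group, less the firm's volume on market_var
    OUTPUTS
        lf - agg_lf with the HHI column added
    NOTE source_lf defaults to main_lf as it exists when this function is defined
    '''
    if sub_group_var: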
        hhi_lf = (
            source_lf
            #compute volumes for each firm on the given market
            .with_columns(
                ((pl.col(volume_var).sum()
                .over([firm_var, time_var, sub_group_var])) - 
                (pl.col(volume_var).sum()
                .over([market_var, firm_var, time_var]))).alias('volume')
            )
            #compute market share for each firm
            .group_by(set([*panel_vars, firm_var, time_var]))
            .agg(pl.col('volume').first())
            .with_columns(
                (((pl.col('volume'))/
                (pl.col('volume').sum().over(*panel_vars, time_var)))
                    *100).alias('ms')
            )
            #group by panel variables and compute hhi
            .group_by(*panel_vars, time_var)
            .agg(
                (pl.col('ms')**2).sum().alias(name)
            )
        )
    else:
        hhi_lf = (
            source_lf
            .with_columns(
                pl.col(volume_var).sum().over([market_var, firm_var, time_var]).alias('volume')
            )
            #compute market share for each firm
            .group_by(set([*panel_vars, firm_var, time_var]))
            .agg(pl.col('volume').first())
            .with_columns(
                (((pl.col('volume'))/(pl.col('volume').sum().over(*panel_vars, time_var)))
                    *100).alias('ms')
            )
            #group by panel variables and compute hhi
            .group_by(*panel_vars, time_var)
            .agg(
                (pl.col('ms')**2).sum().alias(name)
            )
        )
    lf = agg_lf.join(hhi_lf.fill_nan(None), on=[*panel_vars, time_var], how='left')

    return lf
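
For reference, the index itself is the sum of squared market shares expressed in percent, so it ranges up to 10,000 for a single-firm market. The worked example below is a minimal plain-Python sketch with hypothetical carrier volumes; the helper `hhi` is illustrative and is not the function used in this notebook.

```python
def hhi(volumes):
    """Herfindahl-Hirschman Index (0 to 10,000) from per-firm volumes."""
    volumes = list(volumes)
    total = sum(volumes)
    if total == 0:
        return None  # undefined when the market has no volume
    return sum((100 * v / total) ** 2 for v in volumes)

# Hypothetical lane-month with three carriers
carrier_teus = {"X": 50.0, "Y": 30.0, "Z": 20.0}
carrier_hhi = hhi(carrier_teus.values())  # 50^2 + 30^2 + 20^2

# Same market if X and Y belong to one alliance and are treated as a single firm
alliance_teus = {"ALLIANCE_1": 80.0, "Z": 20.0}
alliance_hhi = hhi(alliance_teus.values())  # 80^2 + 20^2

print(carrier_hhi, alliance_hhi)
```

Treating an alliance as one firm can only raise the index, since merging shares increases the sum of squares.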

### 3.1 Lane Panel

The highest aggregation level we utilize is by lane and month, enabling us to inspect broad lane-level characteristics of the market. 

In [15]:
#main aggregation
lanepanel_lf = (
    main_lf
    .with_columns(
        #number of foreign ports accessible from us port
        num_foreign_ports = (pl.col('dest_port_code').n_unique()
                             .over('month', 'origin_port_code')),
        #number of foreign countries accessible from us port
        num_foreign_countries = (pl.col('dest_territory').n_unique()
                                 .over('month', 'origin_port_code')),
        #number of foreign regions accessible from us port
        num_foreign_regions = (pl.col('dest_region').n_unique()
                                 .over('month', 'origin_port_code'))
    )
    #group
    .group_by('lane_id', 'month')
    .agg(
        teus = pl.col('teus').sum(),
        num_vessels = pl.col('vessel_id').n_unique(),
        num_carriers = pl.col('scac').n_unique(),
        num_alliances = pl.col('alliance').n_unique(),
        num_alliance_vessels = pl.col('alliance_vessel').n_unique(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
    )
)

#get carrier hhi (traditional HHI) 
#on lane
lanepanel_lf = get_hhi(lanepanel_lf, market_var='lane_id', 
                       panel_vars=['lane_id'], name='hhi_lane')
#on direct substitute lanes
lanepanel_lf = get_hhi(lanepanel_lf, market_var='lane_id', 
                       panel_vars=['lane_id'], name='hhi_directsub_lanes',
                       sub_group_var='lane_group')
#on potential substitute lanes
lanepanel_lf = get_hhi(lanepanel_lf, market_var='lane_id', 
                       panel_vars=['lane_id'], name='hhi_potentialsub_lanes',
                       sub_group_var='lane_region')
#at origin port
lanepanel_lf = get_hhi(lanepanel_lf, market_var='origin_port_code', 
                       panel_vars=['lane_id'], name='hhi_origin_port')
#at destination port
lanepanel_lf = get_hhi(lanepanel_lf, market_var='dest_port_code', 
                       panel_vars=['lane_id'], name='hhi_dest_port')
#on set of direct substitute origin ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='origin_port_code', 
                       panel_vars=['lane_id'], name='hhi_directsub_origin_ports',
                       sub_group_var='origin_port_group')
#on set of potential substitute origin ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='origin_port_code', 
                       panel_vars=['lane_id'], name='hhi_potentialsub_origin_ports',
                       sub_group_var='coast_region')
#on set of direct substitute destination ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='dest_port_code', 
                       panel_vars=['lane_id'], name='hhi_directsub_dest_ports',
                       sub_group_var='dest_territory')
#on set of potential substitute destination ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='dest_port_code', 
                       panel_vars=['lane_id'], name='hhi_potentialsub_dest_ports',
                       sub_group_var='dest_region')

#get alliance hhi (counts alliances as individual carriers, and carriers not in alliances as their own alliance) 
#on lane
lanepanel_lf = get_hhi(lanepanel_lf, market_var='lane_id', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_lane')
#on direct substitute lanes
lanepanel_lf = get_hhi(lanepanel_lf, market_var='lane_id', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_directsub_lanes',
                       sub_group_var='lane_group')
#on potential substitute lanes
lanepanel_lf = get_hhi(lanepanel_lf, market_var='lane_id', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_potentialsub_lanes',
                       sub_group_var='lane_region')
#at origin port
lanepanel_lf = get_hhi(lanepanel_lf, market_var='origin_port_code', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_origin_port')
#at destination port
lanepanel_lf = get_hhi(lanepanel_lf, market_var='dest_port_code', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_dest_port')
#on set of direct substitute origin ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='origin_port_code', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_directsub_origin_ports',
                       sub_group_var='origin_port_group')
#on set of potential substitute origin ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='origin_port_code', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_potentialsub_origin_ports',
                       sub_group_var='coast_region')
#on set of direct substitute destination ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='dest_port_code', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_directsub_dest_ports',
                       sub_group_var='dest_territory')
#on set of potential substitute destination ports
lanepanel_lf = get_hhi(lanepanel_lf, market_var='dest_port_code', firm_var='alliance_alt',
                       panel_vars=['lane_id'], name='hhi_alliance_potentialsub_dest_ports',
                       sub_group_var='dest_region')

#aggregate vessel capacities
lane_cap_lf = (
    main_lf
    #aggregate
    .group_by('lane_id', 'month', 'vessel_id')
    .agg(
        #get vessel capacity
        pl.col('vessel_capacity').first(),
        #get number of turns
        pl.col('date').n_unique().alias('turns')
    )
    #compute capacity contributed by vessel
    .with_columns(
        cap_from_vessel = (pl.col('vessel_capacity'))*(pl.col('turns'))
    )
    .group_by('lane_id', 'month')
    .agg(
        pl.col('cap_from_vessel').sum()
        .alias('lane_capacity')
    )
)

#merge vessel capacities into lane-month lf
lanepanel_lf = (
    lanepanel_lf
    .join(
        lane_cap_lf,
        on=['lane_id', 'month'],
        how='left'
    )
)

#merge drewery lanes and rates into lane-month lf
rates_lf = (
    main_lf
    .select('lane_id', 'month', 'drewery_lane', 'dist', 'rate_40', 'rate_20')
    .unique(subset=['lane_id', 'month']) #NOTE this should be unique already?
)
#merge
lanepanel_lf = (
    lanepanel_lf
    .join(
        rates_lf, 
        on=['lane_id', 'month'],
        how='left'
    )
)

#coerce frame to include all possible months
lanepanel_expanded_lf = (
    expand(lanepanel_lf)
    #fill nulls
    .with_columns(
        pl.col('teus').fill_null(0),
        pl.col('num_vessels').fill_null(0),
        pl.col('num_carriers').fill_null(0),
        pl.col('num_alliances').fill_null(0),
        pl.col('num_foreign_ports').fill_null(0),
        pl.col('num_foreign_countries').fill_null(0),
        pl.col('num_foreign_regions').fill_null(0),
        pl.col('lane_capacity').fill_null(0)
    )
)


Note: A LazyFrame has been expanded to all combinations of months and the supplied group_vars; remember to fill nulls as appropriate.


In [15]:
lanepanel_lf.describe()

statistic,lane_id,month,teus,num_vessels,num_carriers,num_alliances,num_alliance_vessels,num_foreign_ports,num_foreign_countries,num_foreign_regions,hhi_lane,hhi_directsub_lanes,hhi_potentialsub_lanes,hhi_origin_port,hhi_dest_port,hhi_directsub_origin_ports,hhi_potentialsub_origin_ports,hhi_directsub_dest_ports,hhi_potentialsub_dest_ports,hhi_alliance_lane,hhi_alliance_directsub_lanes,hhi_alliance_potentialsub_lanes,hhi_alliance_origin_port,hhi_alliance_dest_port,hhi_alliance_directsub_origin_ports,hhi_alliance_potentialsub_origin_ports,hhi_alliance_directsub_dest_ports,hhi_alliance_potentialsub_dest_ports,lane_capacity,drewery_lane,dist,rate_40,rate_20
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64
"""count""","""97065""","""97065""",97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,97065.0,55269.0,94954.0,97065.0,97065.0,41368.0,91949.0,65997.0,94808.0,97065.0,55591.0,94991.0,97065.0,97065.0,42022.0,91973.0,66206.0,94864.0,97065.0,"""97065""",97065.0,49738.0,49738.0
"""null_count""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,41796.0,2111.0,0.0,0.0,55697.0,5116.0,31068.0,2257.0,0.0,41474.0,2074.0,0.0,0.0,55043.0,5092.0,30859.0,2201.0,0.0,"""0""",0.0,47327.0,47327.0
"""mean""",,,1477.836509,13.233122,6.733333,2.873971,4.140689,36.001164,22.334549,7.068459,4495.28239,4400.467419,3742.549042,3647.416118,4198.09248,5018.440642,3358.668995,4387.454157,3672.974994,5270.857912,5367.176201,4631.737862,4539.476244,5004.785168,5875.687239,4392.923499,5371.344215,4578.071516,38759.507421,,1314.252251,1603.674253,1237.05718
"""std""",,,2416.308639,8.968438,4.170768,1.218577,3.940149,16.896547,11.606937,3.53123,2801.173542,2908.983136,2869.49764,2841.503327,2866.370587,3092.120619,2710.833747,2974.142904,2821.730063,2640.08253,2596.313772,2603.590846,2585.174803,2647.12428,2706.363468,2531.451255,2637.769729,2554.608892,34284.436255,,1003.786978,760.79759,528.156237
"""min""",,"""200701""",0.3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,638.060924,696.393654,671.618778,683.402222,667.852364,832.515862,640.373898,693.324644,662.505129,934.034602,1176.698776,1042.844946,956.542718,986.279004,1483.19386,1045.607733,1130.744245,1029.23643,7.205882,,5.93014,400.0,310.0
"""25%""",,,367.0,6.0,3.0,2.0,1.0,27.0,15.0,4.0,2323.762649,2135.646687,1652.645277,1629.145867,2011.941845,2417.033909,1471.597653,2067.689918,1640.553877,3159.850062,3378.213633,2741.674629,2741.86181,2940.445712,3661.688876,2680.300999,3361.907069,2775.833662,12329.020764,,552.798207,1050.0,840.0
"""50%""",,,816.621577,12.0,6.0,3.0,3.0,39.0,20.0,7.0,3518.518519,3386.468708,2606.489526,2518.32829,3103.158845,3903.091648,2283.911322,3334.209844,2577.757375,4497.300131,4668.171139,3682.775888,3536.619875,4149.003781,4954.594159,3413.60176,4581.359281,3652.442625,29840.588235,,1231.437882,1490.0,1170.0
"""75%""",,,1697.215643,18.0,9.0,4.0,6.0,46.0,30.0,11.0,5982.468955,5884.725581,5000.993717,4512.110397,5588.492624,7858.014162,4083.612827,5925.485533,4818.655492,7155.563423,7024.793388,5716.882644,5352.907106,6557.383492,9092.970522,5108.905617,7166.446547,5494.924147,55931.372549,,1780.936587,1980.0,1480.0
"""max""",,"""202312""",48794.368538,72.0,23.0,7.0,37.0,65.0,44.0,11.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,282385.498673,,6620.504391,8250.0,5070.0


In [16]:
#%%script echo skipping
#save
lanepanel_lf.collect().write_parquet('../data/usda_aggregations/lane_month.parquet')
lanepanel_lf.describe().write_csv('../data/usda_aggregations/lane_month_summary.csv')

### 3.2 Lane-Carrier Panel

The next level of aggregation is over carriers in addition to lane and month, allowing us to analyze the ports serviced, total volumes and capacities, etc., of specific carriers within each lane. 

In [16]:
def get_leaveout_hhi(agg_lf, market_var, panel_vars, volume_var='teus', firm_var='scac', 
            time_var='month', name='leaveout_hhi', source_lf=main_lf,
            sub_group_var=None, dropmarket=False):
    if sub_group_var:
        leaveout_hhi_lf = (
            source_lf
            #compute volumes for each firm on the given group, less the market
            .with_columns(
                ((pl.col(volume_var).sum()
                .over([firm_var, time_var, sub_group_var])) - 
                (pl.col(volume_var).sum()
                .over([market_var, firm_var, time_var]))).alias('volume')
            )
            #collapse to one row per market, firm, and time period
            .group_by([market_var, firm_var, time_var])
            .agg(pl.col('volume').first())
            .with_columns(
                leaveout_sum = (pl.col('volume').sum()
                                .over([market_var, time_var]))-
                                (pl.col('volume')),
                leaveout_sumsq = ((pl.col('volume')**2).sum()
                                  .over([market_var, time_var])*10000)-
                                  ((pl.col('volume')**2)*10000)
            )
            .with_columns(
                leaveout_hhi = (pl.col('leaveout_sumsq')/
                                (pl.col('leaveout_sum')**2))
            )
        )
    else:
        leaveout_hhi_lf = (
            source_lf
            .group_by([market_var, firm_var, time_var])
            .agg(volume = pl.col(volume_var).sum())
            .with_columns(
                leaveout_sum = (pl.col('volume').sum()
                                .over([market_var, time_var]))-
                                (pl.col('volume')),
                leaveout_sumsq = ((pl.col('volume')**2).sum()
                                  .over([market_var, time_var])*10000)-
                                  ((pl.col('volume')**2)*10000)
            )
            .with_columns(
                leaveout_hhi = (pl.col('leaveout_sumsq')/
                                (pl.col('leaveout_sum')**2))
            )
        )
    
    lf = (
        agg_lf
        #get market var into agg_lf
        .join(
            source_lf.select(set([*panel_vars, market_var, time_var]))
            .unique(),
            on=set([*panel_vars, time_var]), how='left'
        )
        #merge in hhi, naming the output column per the `name` argument
        .join(
            leaveout_hhi_lf.fill_nan(None)
            .select([market_var, firm_var, time_var, 
                     pl.col('leaveout_hhi').alias(name)]), 
            on=[market_var, firm_var, time_var], how='left'
        )
    )
    if dropmarket:
        lf = lf.drop(market_var)
    return lf

In [17]:
def get_alliance_ms(agg_lf, market_var, panel_vars, volume_var='teus', 
            time_var='month', name='ms_alliance', source_lf=main_lf,
            sub_group_var=None):
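    '''
    computes the market share (in percent) of each carrier's alliance within 
    each panel unit and time period. if sub_group_var is given, volumes are 
    measured over the sub-group less the market itself.
    '''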
    if sub_group_var:
        ms_lf = (
            source_lf
            #compute volumes for each alliance
            .with_columns(
                ((pl.col(volume_var).sum()
                .over(['alliance_alt', time_var, sub_group_var])) - 
                (pl.col(volume_var).sum()
                .over([market_var, 'alliance_alt', time_var]))).alias('volume')
            )
            #compute market share for each alliance
            .group_by(set([*panel_vars, 'alliance_alt', time_var]))
            .agg(
                pl.col('volume').first()
            )
            .with_columns(
                (((pl.col('volume'))/
                (pl.col('volume').sum().over(*panel_vars, time_var)))
                    *100).alias(name)
            )
        )
    else:
        ms_lf = (
            source_lf
            #compute volumes for each alliance
            .group_by(set([*panel_vars, 'alliance_alt', time_var]))
            .agg(
                #get alliance volume
                pl.col(volume_var).sum().alias('volume')
            )
            #compute market share
            .with_columns(
                (((pl.col('volume'))/(pl.col('volume').sum().over(*panel_vars, 
                                                                  time_var)))
                    *100).alias(name)
            )
        )
    lf = agg_lf.join(
        ms_lf.drop('volume').fill_nan(None),
        on=[*panel_vars, 'alliance_alt', time_var], how='left')

    return lf

In [18]:
#main aggregation
lane_month_carrier_lf = (
    main_lf
    .with_columns(
        #number of foreign ports serviced by carrier from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'scac', 'origin_port_code')),
        #number of foreign countries serviced by carrier from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'scac', 'origin_port_code')),
        #number of foreign regions serviced by carrier from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'scac', 'origin_port_code'))
    )
    #aggregate
    .group_by('lane_id', 'month', 'scac')
    .agg(
        teus = pl.col('teus').sum(),
        alliance = pl.col('alliance').first(),
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
    )
    #get market share of each carrier for the lane and month
    .with_columns(
        ms_carrier = ((pl.col('teus')/
                      (pl.col('teus').sum().over('lane_id', 'month'))))
    )
    .with_columns(
        #number of carriers active on the lane
        num_carriers = pl.col('scac').unique().count().over('lane_id', 'month'),
        #hhi
        hhi_lane = (pl.col('ms_carrier')**2).sum().over('lane_id', 'month')
    )
)
#get alliance_alt col 
lane_month_carrier_lf = lane_month_carrier_lf.join(
    main_lf.select('lane_id', 'month', 'scac', 'alliance_alt').unique(),
    on=['lane_id', 'month', 'scac'], how='left'
)

#get market shares of the carrier's alliance
#on the lane
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='lane_id', panel_vars=['lane_id'],
    name='ms_alliance_lane')
#on direct substitute lanes
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='lane_id', panel_vars=['lane_id'],
    name='ms_alliance_directsub_lanes', sub_group_var='lane_group')
#on potential substitute lanes
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='lane_id', panel_vars=['lane_id'],
    name='ms_alliance_potentialsub_lanes', sub_group_var='lane_region')
#at the origin port
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='origin_port_code', panel_vars=['lane_id'],
    name='ms_alliance_origin_port')
#at the destination port
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='dest_port_code', panel_vars=['lane_id'],
    name='ms_alliance_dest_port')
#at direct substitute origin ports
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='origin_port_code', panel_vars=['lane_id'],
    name='ms_alliance_directsub_origin_ports', sub_group_var='origin_port_group')
#at potential substitute origin ports
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='origin_port_code', panel_vars=['lane_id'],
    name='ms_alliance_potentialsub_origin_ports', sub_group_var='coast_region')
#at direct substitute dest ports
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='dest_port_code', panel_vars=['lane_id'],
    name='ms_alliance_directsub_dest_ports', sub_group_var='dest_territory')
#at potential substitute dest ports
lane_month_carrier_lf = get_alliance_ms(
    lane_month_carrier_lf, market_var='dest_port_code', panel_vars=['lane_id'],
    name='ms_alliance_potentialsub_dest_ports', sub_group_var='dest_region')

#get HHI of alliances
lane_month_carrier_lf = (
    lane_month_carrier_lf
    .join(
        lanepanel_lf.select(
            'lane_id', 'month', 'hhi_alliance_lane', 
            'hhi_alliance_directsub_lanes', 'hhi_alliance_potentialsub_lanes', 
            'hhi_alliance_origin_port', 'hhi_alliance_dest_port', 
            'hhi_alliance_directsub_origin_ports', 
            'hhi_alliance_potentialsub_origin_ports', 
            'hhi_alliance_directsub_dest_ports', 
            'hhi_alliance_potentialsub_dest_ports')
        .unique(),
        on=['lane_id', 'month'], how='left'
    )
)

#get leaveout HHIs - Concentration on each market leaving out the carrier's alliance
#on the lane
leaveout_hhi_lane = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='lane_id',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    name='leaveout_hhi_lane'
)
#on direct substitute lanes
leaveout_hhi_directsub_lanes = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='lane_id',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    sub_group_var='lane_group', name='leaveout_hhi_directsub_lanes'
)
#on potential substitute lanes
leaveout_hhi_potentialsub_lanes = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='lane_id',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    sub_group_var='lane_region', name='leaveout_hhi_potentialsub_lanes'
)
#on the origin port
leaveout_hhi_origin_port = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='origin_port_code',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    name='leaveout_hhi_origin_ports', dropmarket=True
)
#on direct substitutes for the origin port
leaveout_hhi_directsub_origin_ports = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='origin_port_code',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    sub_group_var='origin_port_group', 
    name='leaveout_hhi_directsub_origin_ports', dropmarket=True
)
#on potential substitutes for the origin port
leaveout_hhi_potentialsub_origin_ports = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='origin_port_code',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    sub_group_var='coast_region', 
    name='leaveout_hhi_potentialsub_origin_ports', dropmarket=True
)
#on the destination port
leaveout_hhi_dest_port = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='dest_port_code',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    name='leaveout_hhi_dest_ports', dropmarket=True
)
#on direct substitutes for the destination port
leaveout_hhi_directsub_dest_ports = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='dest_port_code',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    sub_group_var='dest_territory', 
    name='leaveout_hhi_directsub_dest_ports', dropmarket=True
)
#on potential substitutes for the destination port
leaveout_hhi_potentialsub_dest_ports = get_leaveout_hhi(
    agg_lf=lane_month_carrier_lf, market_var='dest_port_code',
    panel_vars=['lane_id', 'alliance_alt'], firm_var='alliance_alt',
    sub_group_var='dest_region', 
    name='leaveout_hhi_potentialsub_dest_ports', dropmarket=True
)

#aggregate vessel capacities
lane_cap_lf = (
    main_lf
    #aggregate
    .group_by('lane_id', 'month', 'vessel_id')
    .agg(
        #get vessel capacity
        pl.col('vessel_capacity').first(),
        #get number of turns
        pl.col('date').n_unique().alias('turns')
    )
    #compute capacity contributed by vessel
    .with_columns(
        cap_from_vessel = (pl.col('vessel_capacity'))*(pl.col('turns'))
    )
    .group_by('lane_id', 'month')
    .agg(
        pl.col('cap_from_vessel').sum()
        .alias('lane_capacity')
    )
)

#merge vessel capacities into lane-month lf
lane_month_carrier_lf = (
    lane_month_carrier_lf
    .join(
        lane_cap_lf,
        on=['lane_id', 'month'],
        how='left'
    )
)


#merge freight rates into the lane-month-carrier lf
lane_month_carrier_lf = (
    lane_month_carrier_lf
    .join(
        rates_lf, 
        on=['lane_id', 'month'],
        how='left'
    )
)

#expand to include all combos?


In [20]:
#inspect
lane_month_carrier_lf.describe()

statistic,lane_id,month,scac,teus,alliance,num_foreign_ports,num_foreign_countries,num_foreign_regions,ms_carrier,num_carriers,hhi_lane,alliance_alt,ms_alliance_lane,ms_alliance_directsub_lanes,ms_alliance_potentialsub_lanes,ms_alliance_origin_port,ms_alliance_dest_port,ms_alliance_directsub_origin_ports,ms_alliance_potentialsub_origin_ports,ms_alliance_directsub_dest_ports,ms_alliance_potentialsub_dest_ports,hhi_alliance_lane,hhi_alliance_directsub_lanes,hhi_alliance_potentialsub_lanes,hhi_alliance_origin_port,hhi_alliance_dest_port,hhi_alliance_directsub_origin_ports,hhi_alliance_potentialsub_origin_ports,hhi_alliance_directsub_dest_ports,hhi_alliance_potentialsub_dest_ports,lane_capacity,drewery_lane,dist,rate_40,rate_20
str,str,str,str,f64,str,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64
"""count""","""656021""","""656021""","""656021""",656021.0,"""656021""",656021.0,656021.0,656021.0,656021.0,656021.0,656021.0,"""656021""",656021.0,406473.0,651562.0,656021.0,656021.0,287779.0,649886.0,468361.0,653418.0,656021.0,406473.0,651562.0,656021.0,656021.0,287779.0,649886.0,468361.0,653418.0,656021.0,"""656021""",656021.0,347581.0,347581.0
"""null_count""","""0""","""0""","""0""",0.0,"""0""",0.0,0.0,0.0,0.0,0.0,0.0,"""0""",0.0,249548.0,4459.0,0.0,0.0,368242.0,6135.0,187660.0,2603.0,0.0,249548.0,4459.0,0.0,0.0,368242.0,6135.0,187660.0,2603.0,0.0,"""0""",0.0,308440.0,308440.0
"""mean""",,,,219.987928,,20.763018,13.557693,4.787254,0.148427,9.325052,0.315115,,24.707075,24.019353,23.98328,24.707075,24.707075,24.505781,23.59969,24.369898,23.870514,4127.017666,4495.169432,3463.327203,3328.658221,3818.350882,4877.706675,3306.205231,4475.490331,3442.834285,53969.313929,,1120.578571,1485.144499,1156.263087
"""std""",,,,603.493696,,12.719689,8.785234,2.949428,0.211302,4.192727,0.197033,,26.430017,26.568773,22.631257,26.430017,26.430017,27.850209,21.188,26.55924,22.350546,2047.433045,2213.622522,1804.33392,1671.818931,1956.574276,2353.129862,1760.532231,2263.315487,1758.302368,36165.841271,,863.375037,734.992798,514.714981
"""min""",,"""200701""",,0.03,"""2M Alliance""",1.0,1.0,1.0,3.7e-05,1.0,0.063806,"""2M Alliance""",0.003659,0.0,0.0,0.003659,0.003659,0.0,0.0,0.0,0.0,934.034602,1176.698776,1042.844946,956.542718,986.279004,1483.19386,1045.607733,1130.744245,1029.23643,7.205882,,5.93014,400.0,310.0
"""25%""",,,,15.198946,,11.0,7.0,2.0,0.018561,6.0,0.180195,,4.014485,2.616055,6.896552,4.014485,4.014485,0.431034,6.640763,2.730055,7.10737,2660.958656,2924.613148,2285.325384,2344.214415,2466.368105,3288.354062,2277.455489,2929.284427,2323.03649,26515.530186,,445.895706,950.0,770.0
"""50%""",,,,61.0,,20.0,12.0,4.0,0.064003,9.0,0.258319,,14.826247,13.842482,17.559285,14.826247,14.826247,14.274713,19.359373,14.247312,17.687964,3576.430176,3941.535548,3024.79084,2938.011826,3310.67036,4121.257431,2892.611539,3876.329385,3030.305953,46616.764706,,1149.382682,1370.0,1080.0
"""75%""",,,,202.0,,29.0,19.0,7.0,0.18068,12.0,0.38004,,36.9805,37.720706,34.743895,36.9805,36.9805,40.100675,34.295437,38.366964,34.913505,5024.789444,5415.552861,3953.525774,3720.535576,4569.876767,5544.271422,3664.763988,5305.961699,3903.317684,75261.32152,,1544.956374,1810.0,1390.0
"""max""",,"""202312""",,25915.51,"""The Alliance""",59.0,42.0,11.0,1.0,23.0,1.0,"""ZZZZ""",100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,282385.498673,,6620.504391,8250.0,5070.0


In [19]:
#save 
lane_month_carrier_lf.collect().write_parquet('../data/usda_aggregations/lane_month_carrier.parquet')
lane_month_carrier_lf.describe().write_csv('../data/usda_aggregations/lane_month_carrier_summary.csv')

### 3.1 Lane-Carrier-Vessel Panel

The most detailed level of aggregation inspects cargoes carried by each carrier on individual vessels operating on the lane in a given month. 
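The capacity each vessel contributes to a lane-month can be sketched in plain Python as below. Following the logic of the aggregation cells, a vessel's contribution is its capacity multiplied by its number of turns, where turns are approximated by the number of distinct dates on which the vessel appears on the lane that month. The records, vessel IDs, and helper names are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import defaultdict

# Hypothetical shipment records:
# (lane_id, month, vessel_id, date, vessel_capacity in TEUs)
records = [
    ("L1", "202301", "V1", "2023-01-05", 1000.0),
    ("L1", "202301", "V1", "2023-01-05", 1000.0),  # same sailing, second shipment
    ("L1", "202301", "V1", "2023-01-19", 1000.0),  # second sailing
    ("L1", "202301", "V2", "2023-01-12", 500.0),
]

def capacity_from_vessels(records):
    """Capacity contributed by each vessel per (lane, month, vessel)."""
    dates = defaultdict(set)
    capacity = {}
    for lane, month, vessel, date, cap in records:
        key = (lane, month, vessel)
        # distinct dates stand in for the number of turns
        dates[key].add(date)
        # keep the first capacity seen for the vessel
        capacity.setdefault(key, cap)
    return {key: capacity[key] * len(d) for key, d in dates.items()}

def lane_capacity(cap_from_vessel):
    """Total capacity deployed per (lane, month)."""
    total = defaultdict(float)
    for (lane, month, _), cap in cap_from_vessel.items():
        total[(lane, month)] += cap
    return dict(total)

cap_from_vessel = capacity_from_vessels(records)
print(cap_from_vessel)
print(lane_capacity(cap_from_vessel))
```

Counting distinct dates avoids double-counting a sailing that carries many shipments, at the cost of treating two sailings on the same date as one.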

In [None]:
#main aggregation
lane_carrier_vessel_lf = (
    main_lf
    .with_columns(
        #number of foreign ports serviced by vessel from us port
        num_foreign_ports = (pl.col('dest_port_code').unique().count()
                             .over('month', 'vessel_id', 'scac', 
                                   'origin_port_code')),
        #number of foreign countries serviced by vessel from us port
        num_foreign_countries = (pl.col('dest_territory').unique().count()
                                 .over('month', 'vessel_id', 'scac', 
                                       'origin_port_code')),
        #number of foreign regions serviced by vessel from us port
        num_foreign_regions = (pl.col('dest_region').unique().count()
                                 .over('month', 'vessel_id', 'scac', 
                                       'origin_port_code'))
    )
    .group_by('month', 'lane_id', 'scac', 'vessel_id')
    .agg(
        #alliance to which the carrier belongs
        carrier_alliance = pl.col('alliance').first(),
        #owner of the vessel
        vessel_owner = pl.col('vessel_owner').first(),
        #whether or not the vessel is being operated by an alliance
        alliance_operated = pl.col('alliance_vessel').is_not_null().first(),
        #total cargo (TEUs) carried by the carrier on that vessel
        teus = pl.col('teus').sum(),
        #service counts computed in the with_columns step above
        num_foreign_ports = pl.col('num_foreign_ports').first(),
        num_foreign_countries = pl.col('num_foreign_countries').first(),
        num_foreign_regions = pl.col('num_foreign_regions').first()
    )
)

#get capacities contributed by each vessel
cap_lf = (
    main_lf
    #aggregate
    .group_by('lane_id', 'month', 'vessel_id')
    .agg(
        #get vessel capacity
        pl.col('vessel_capacity').first(),
        #get number of turns
        pl.col('date').n_unique().alias('turns')
    )
    #compute capacity contributed by vessel
    .with_columns(
        cap_from_vessel = (pl.col('vessel_capacity'))*(pl.col('turns'))
    )
    #drop unneeded cols
    .drop('vessel_capacity', 'turns')
)
#merge vessel cap into main agg lf
lane_carrier_vessel_lf = (
    lane_carrier_vessel_lf
    .join(cap_lf, on=['lane_id', 'month', 'vessel_id'], how='left')
)

#merge rates into lf
lane_carrier_vessel_lf = (
    lane_carrier_vessel_lf
    .join(
        rates_lf, 
        on=['lane_id', 'month'],
        how='left'
    )
)

In [23]:
lane_carrier_vessel_lf.describe()

statistic,month,lane_id,scac,vessel_id,carrier_alliance,vessel_owner,alliance_operated,teus,num_foreign_ports,num_foreign_countries,num_foreign_regions,cap_from_vessel,drewery_lane,dist,rate_40,rate_20
str,str,str,str,f64,str,str,f64,f64,f64,f64,f64,f64,str,f64,f64,f64
"""count""","""2375156""","""2375156""","""2375156""",2375156.0,"""2375156""","""2375156""",2375156.0,2375156.0,2375156.0,2375156.0,2375156.0,2375156.0,"""2375156""",2375156.0,1275313.0,1275313.0
"""null_count""","""0""","""0""","""0""",0.0,"""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,"""0""",0.0,1099843.0,1099843.0
"""mean""",,,,9326800.0,,,0.238668,60.394433,5.865227,4.128453,1.866946,3058.328677,,1051.67516,1443.210365,1126.259875
"""std""",,,,267217.708225,,,,198.328561,4.108259,2.549416,1.000231,1758.793321,,816.140853,709.980484,500.043906
"""min""","""200701""",,,651327.0,"""2M Alliance""",,0.0,0.03,1.0,1.0,1.0,0.147059,,5.93014,400.0,310.0
"""25%""",,,,9245756.0,,,,5.066315,3.0,2.0,1.0,1856.397059,,307.362434,930.0,760.0
"""50%""",,,,9321237.0,,,,17.732104,5.0,4.0,2.0,2613.970588,,1133.326061,1330.0,1040.0
"""75%""",,,,9448815.0,,,,53.196311,8.0,5.0,2.0,4059.117647,,1509.389377,1720.0,1360.0
"""max""","""202312""",,,9970947.0,"""The Alliance""",,1.0,18542.714188,33.0,23.0,10.0,30193.382353,,6620.504391,8250.0,5070.0


In [None]:
#save 
lane_carrier_vessel_lf.collect().write_parquet('../data/usda_aggregations/lane_month_carrier_vessel.parquet')
lane_carrier_vessel_lf.describe().write_csv('../data/usda_aggregations/lane_month_carrier_vessel_summary.csv')

## 4. Visuals and Descriptive Analysis

In [25]:
#visualize port volumes
df = (
    main_lf
    .group_by('origin_port_group', 'coast_region', 'month')
    .agg(pl.col('teus').sum())
    .group_by('origin_port_group', 'coast_region')
    .agg(pl.col('teus').mean())
    .sort('teus', descending=True)
    .collect()
)

px.bar(
    df.sort(by='teus'), x='origin_port_group', y='teus', 
    color='coast_region',
    title='US Export Volumes (monthly average TEUs, all time)',
    labels={
        'origin_port_group':'US Ports (CSA groups)',
        'teus':'Monthly Average Volume (TEUs)',
        'coast_region':'Coastal Region'
    },
    width=1100,
    height=500,
)

In [26]:
df = (
    main_lf
    #group by alliance, scac, and year
    .group_by('alliance', 'scac', 'year')
    .agg(pl.col('teus').sum())
    #get proportion 
    .with_columns((pl.col('teus')/pl.col('teus').sum().over('year')).alias('prop_volume'))
    .collect()
)

#plot
fig = px.bar(
    df, x='year', y='prop_volume',
    color='alliance',
    text='scac', 
    barmode='stack',  
    title='Share of Export Volumes Over Time',
    width=900,
    height=500,
    color_discrete_map={
        'Non-alliance Carriers':'#636EFA',
        'The Alliance':'#EF553B',
        'Ocean Alliance':'#00CC96',
        'CYKHE':'#ABC3FA',
        'New World All (NWA)':'#FFA15A',
        'Grand Alliance IV':'#19D3F3',
        'G6 Alliance':'#FF6692',
        'MSC/CMA CGM':'#B6E880',
        '2M Alliance':'#FECB52',
        'Ocean Three':'#FF97FF',
        'CYKH':'#17BECF',
    },
    labels={
        'alliance':'Alliance:',
        'prop_volume':'Volume (Share of Total)',
        'year':'Time'
    },
    category_orders={
        'alliance':[
            'Non-alliance Carriers', 'The Alliance', 'CYKHE', 'Ocean Alliance',  
            'CYKH', 'Ocean Three', '2M Alliance', 'G6 Alliance', 'MSC/CMA CGM', 'New World All (NWA)', 'Grand Alliance IV'
        ]
    }
    )

fig.show()

In [27]:
df = (
    main_lf
    #get cargo source column
    .with_columns(
        pl.when(pl.col('primary_cargo')==1)
        .then(pl.lit('self'))
        .otherwise(
            pl.when(pl.col('alliance')==pl.col('owner_alliance'))
            .then(pl.lit('ally'))
            .otherwise(pl.lit('non-ally'))
        )
        .alias('cargo_source')
    )
    #group by source and month
    .group_by('cargo_source', 'month')
    .agg(
        pl.col('teus').sum()
    )
    .with_columns((pl.col('teus')/pl.col('teus').sum().over('month')).alias('prop_volume'))
    .cast({'cargo_source':pl.Utf8})
    .sort(by=['month'])
    .collect()
)

px.area(
    df.with_columns(pl.col('month').str.to_datetime('%Y%m')), x='month', y='prop_volume',
    color='cargo_source',
    color_discrete_map={
        'self':'#636EFA',
        'ally':'#EF553B',
        'non-ally':'#00CC96'
    },
    labels={
        'prop_volume':'Share of Total Volume',
        'cargo_source':'Cargo Source',
        'month':'Time',
        'self':'Self',
        'ally':'Allies',
        'non-ally':'Others'
    },
    category_orders={
        'cargo_source':['self', 'ally', 'non-ally']
    },
    width=900,
    height=500,
    title='Cargo Source Over Time'
)

In [32]:
px.scatter(
    lanepanel_lf.select('teus', 'hhi_lane', 'month')
    .with_columns(pl.col('month').str.to_date('%Y%m').dt.year().alias('year'))
    .sort(by='year')
    .collect(),
    y='teus', x='hhi_lane',
    animation_frame='year'
)

In [29]:
def sharing_by_carrier_med(lf, scac, carrier_name=None):
    '''
    ad hoc function to plot the cargo shared on a given carrier's vessels over time,
    broken out by partner carrier
    inputs:
        lf - polars LazyFrame - shipment-level data (e.g., main_lf)
        scac - str - the SCAC for the carrier of interest
        carrier_name - str - the string name of the carrier of interest for plot title
    '''
    #get top sharing partners
    top_partners = (
        lf
        #get shared cargo col
        .with_columns((pl.col('teus')*(1-pl.col('primary_cargo'))).alias('shared_teus'))
        #only the carrier's own vessels, excluding the carrier's own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #group by carrier
        .group_by('carrier_name')
        .agg(pl.col('shared_teus').sum())
        #filter to top 5
        .sort(by='shared_teus', descending=True)
        .limit(5)
        #retain only carrier names
        .drop('shared_teus')
        #collect
        .collect()
        #convert from dataframe to series
        .to_series()
    )

    #make df for graphing
    df = (
        lf
        #only the carrier's own vessels, excluding the carrier's own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #create partners column
        .with_columns(
            pl.when(pl.col('carrier_name').is_in(top_partners))
            .then(pl.col('carrier_name'))
            .otherwise(pl.lit('Other Carriers'))
            .alias('partner')
        )
        #get percentages of other carriers
        .group_by('carrier_name', 'partner','month')
        .agg(pl.col('teus').sum())
        .with_columns(pl.col('teus').sum().over('month').alias('total_shared'))
        .with_columns((pl.col('teus')/pl.col('total_shared')).alias('prop_shared'))
        .with_columns(pl.col('month').str.to_datetime('%Y%m'))
        #sort for plotting
        .sort(by='month')
        .cast({'partner':pl.Utf8})
        .collect()
    )
    #plot
    fig = px.bar(df, x='month', y='prop_shared', color='partner', 
                title=str('Proportion of shared cargo on '+carrier_name+' ships by Carrier') if carrier_name else str('Proportion of shared cargo on '+scac+' ships by Carrier'),
                labels={
                    'prop_shared':'Proportion of Shared Cargo',
                    'month':'Time',
                    'partner':'Carrier'
                },
                width=1100,
                height=500,
                category_orders={
                    'partner':['Other Carriers', 'SAFMARINE', 'ANL CONTAINER LINE',  'CMA-CGM','HYUNDAI', 'MEDITERRANEAN SHIPPING COMPANY']
                }
    )
    fig.add_vline(x=datetime.strptime("2015-01-01", "%Y-%m-%d").timestamp() * 1000, 
                  line_width=4, line_dash='dash',
                  annotation_text="2M Alliance begins"
                  )
    fig.show()

In [30]:
sharing_by_carrier_med(main_lf, 'MSCU', 'Mediterranean')

In [31]:
def sharing_by_carrier_evergreen(lf, scac, carrier_name=None):
    '''
    ad hoc function to plot the cargo shared on a given carrier's vessels over time,
    broken out by partner carrier
    inputs:
        lf - polars LazyFrame - shipment-level data (e.g., main_lf)
        scac - str - the SCAC for the carrier of interest
        carrier_name - str - the string name of the carrier of interest for plot title
    '''
    #get top sharing partners
    top_partners = (
        lf
        #get shared cargo col
        .with_columns((pl.col('teus')*(1-pl.col('primary_cargo'))).alias('shared_teus'))
        #only the carrier's own vessels, excluding the carrier's own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #group by carrier
        .group_by('carrier_name')
        .agg(pl.col('shared_teus').sum())
        #filter to top 5
        .sort(by='shared_teus', descending=True)
        .limit(5)
        #retain only carrier names
        .drop('shared_teus')
        #collect
        .collect()
        #convert from dataframe to series
        .to_series()
    )

    #make df for graphing
    df = (
        lf
        #only the carrier's own vessels, excluding the carrier's own cargo
        .filter((pl.col('vessel_owner')==scac)&(pl.col('scac')!=scac))
        #create partners column
        .with_columns(
            pl.when(pl.col('carrier_name').is_in(top_partners))
            .then(pl.col('carrier_name'))
            .otherwise(pl.lit('Other Carriers'))
            .alias('partner')
        )
        #get percentages of other carriers
        .group_by('carrier_name', 'partner','month')
        .agg(pl.col('teus').sum())
        .with_columns(pl.col('teus').sum().over('month').alias('total_shared'))
        .with_columns((pl.col('teus')/pl.col('total_shared')).alias('prop_shared'))
        .with_columns(pl.col('month').str.to_datetime('%Y%m'))
        #sort for plotting
        .sort(by='month')
        .cast({'partner':pl.Utf8})
        .collect()
    )
    #plot
    fig = px.bar(df, x='month', y='prop_shared', color='partner', 
                title=str('Proportion of shared cargo on '+carrier_name+' ships by Carrier') if carrier_name else str('Proportion of shared cargo on '+scac+' ships by Carrier'),
                labels={
                    'prop_shared':'Proportion of Shared Cargo',
                    'month':'Time',
                    'partner':'Carrier'
                },
                width=1100,
                height=500,
                category_orders={
                    'partner':['Other Carriers', 'SAFMARINE', 'ANL CONTAINER LINE',  'CMA-CGM','HYUNDAI', 'MEDITERRANEAN SHIPPING COMPANY']
                }
    )
    fig.add_vline(x=datetime.strptime("2017-04-01", "%Y-%m-%d").timestamp() * 1000, 
                  line_width=4, line_dash='dash',
                  annotation_text="Joined Ocean Alliance"
                  )
    fig.add_vline(x=datetime.strptime("2014-04-01", "%Y-%m-%d").timestamp() * 1000, 
                  line_width=4, line_dash='dash',
                  annotation_text="Joined CYKHE Alliance",
                  annotation_position="top left"
                  )
    fig.show()

sharing_by_carrier_evergreen(main_lf, 'EGLV', 'Evergreen')

#### Visualizing Lane Concentration 

In [33]:
#inspect concentration of volumes by lane

#get avg vessel cap
mean_vessel_cap = (
    main_lf.select(pl.mean('vessel_capacity')).collect()[0,0]
)

#get top 200 lanes
top_200lanes = (
    main_lf
    .group_by('lane_id')
    .agg(pl.col('teus').sum())
    .sort(by='teus', descending=True)
    .limit(200)
    .collect()
    .to_series()
)

#construct lane sizes per month
df = (
    main_lf
    .filter(pl.col('lane_id').is_in(top_200lanes))
    .group_by('lane_id', 'month')
    .agg(pl.col('teus').sum())
    .group_by('lane_id')
    .agg(pl.col('teus').mean().alias('avg_monthly_volume'))
    .sort(by='avg_monthly_volume', descending=True)
    .with_row_index()
    .collect()
)

In [34]:
fig = px.bar(
    df, x='index', y='avg_monthly_volume', 
    labels={
        'index':'Lane Rank by Average Monthly Volume',
        'avg_monthly_volume':'Average Monthly Volume (TEU)'
    }, 
    title='Average Monthly Volume Concentration Across Top 200 Lanes',
    width=1100, height=500
    )
fig.add_hline(y=mean_vessel_cap, annotation_text=('Average Single-vessel Capacity ({} TEUs)').format(round(mean_vessel_cap)))
fig.show()