# PIERS Container BOL Data ETL 

This notebook builds an ETL pipeline for S&P Global's PIERS data. Data is extracted from CSV files downloaded from the Global Trade Analytics Suite, assigned appropriate datatypes, concatenated into a single dataframe, and loaded to an Apache Parquet file for storage.

In [1]:
#import libraries
import pandas as pd
import os
import time

#display settings
pd.set_option('display.max_columns', None)

In [3]:
#define path
path = 'data/raw/'
#get list of data files, ignoring any hidden files in directory 
datafiles = [file for file in os.listdir(path) if not file.startswith('.')]
#init filenumber
filenumber = 1

#convert each raw csv to an intermediate parquet file
print('Processing CSVs. \n There are {} files to process. \n'.format(len(datafiles)))
for filename in datafiles:
    start = time.time()
    print('Processing file number {}: {} ...'.format(filenumber, filename))
    #extract file
    file_df = pd.read_csv(path+filename, engine='pyarrow')
    #load to parquet, swapping the .csv extension for .parquet
    file_df.to_parquet('data/raw_parquet/'+os.path.splitext(filename)[0]+'.parquet')
    #drop file df
    del file_df
    end = time.time()
    print('File number {} complete.'.format(filenumber))
    print('Processing this file took {} seconds.\n'.format(end-start))
    filenumber += 1

Processing CSVs. 
 There are 25 files to process. 

Processing file number 1: PIERS import records 2022-09-01 to 2023-01- 31 CD439702FC014C1E964C1F770095E3E9.csv ...
File number 1 complete.
Processing this file took 16.47180485725403 seconds.

Processing file number 2: PIERS import records 2021-01-01 to 2021-06-30 885FF5029B5F4A7D9984B64AC1CAC155.csv ...
File number 2 complete.
Processing this file took 19.736983060836792 seconds.

Processing file number 3: PIERS import records 2023-09-06 to 2023-11- 31 63E23DF3FF4C45D396DDF74608618685.csv ...
File number 3 complete.
Processing this file took 9.332447052001953 seconds.

Processing file number 4: PIERS import records 2019 01-06 DDE7D8AFA5C540B9BEF53A673D284078.csv ...
File number 4 complete.
Processing this file took 15.226129293441772 seconds.

Processing file number 5: PIERS import records 2018 07-12 27309015DCA54EA5A6C584E83935A9A9.csv ...
File number 5 complete.
Processing this file took 16.653332948684692 seconds.

Processing file 

## Extract and Transform

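Read each intermediate Parquet file into a pandas dataframe and assign appropriate dtypes.

Note for future optimization: build a dictionary of column dtypes and pass it to `read_csv` through its `dtype` argument, so the types are assigned when the CSV is first read.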
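As a rough illustration of the optimization noted above, the sketch below passes a dtype dictionary to `pd.read_csv`. The two-column sample and the `piers_dtypes` name are assumptions made for the example; the real exports have 43 columns, and a column with missing values would need a nullable dtype such as `Int64` rather than `int64`.

```python
import io
import pandas as pd

# Hypothetical two-column sample standing in for a PIERS export.
sample_csv = io.StringIO(
    "Weight Unit,Quantity\n"
    "KG,3020\n"
    "KG,1\n"
)

# Assumed mapping of column name to dtype; extend it with the remaining columns.
piers_dtypes = {
    "Weight Unit": "category",
    "Quantity": "int64",
}

# The dtypes are applied while parsing, so no recast is needed afterwards.
sample_df = pd.read_csv(sample_csv, dtype=piers_dtypes)
print(sample_df.dtypes)
```

With this approach the recasts in the extractor function could shrink to the columns that need real parsing, such as the list-valued and date columns.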

In [4]:
def piers_imports_extractor(data):
    '''
    Extracts from the intermediate PIERS parquet files and performs initial cleaning 
    INPUT:
        data - str - the parquet file to be extracted, including the path from current directory
    OUTPUT:
        df - pandas dataframe with appropriate column names and dtypes
    '''
    #read parquet file 
    df = pd.read_parquet(data) # pandas uses the pyarrow engine by default when it is installed
    #unpack strings to list objects
    df['Container Number'] = df['Container Number'].str.split()
    df['Quantity of Commodity Short Description'] = df['Quantity of Commodity Short Description'].str.split(pat=';')
    df['Commodity Short Description'] = df['Commodity Short Description'].str.split(pat=',')
    #recast dates to datetime 
    df['Arrival Date'] = pd.to_datetime(df['Arrival Date'].astype(str), format='%Y%m%d') 
    #recast to int
    df['Quantity'] = pd.to_numeric(df['Quantity'], downcast='integer')
    #recast to categorical dtypes
    cat_cols = ['Weight Unit', 'Quantity Type', 'Territory of Origin', 'Region of Origin', 'Port of Arrival Code', 'Port of Arrival',
                'Port of Departure Code', 'Port of Departure', 'Final Destination', 'Coastal Region', 'Clearing District', 'Place of Receipt',
                'Shipper', 'Carrier', 'SCAC', 'Mode of Transport']
    df[cat_cols] = df[cat_cols].astype('category')
    return df    
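To make the transformations in the function above concrete, here is a small sketch that applies the same string splits and date parsing to a toy dataframe. The `toy` frame and its raw string formats are assumptions loosely modeled on the `head()` output of this notebook, not actual PIERS records.

```python
import pandas as pd

# Toy rows; the raw formats (space-separated containers, yyyymmdd dates) are assumed.
toy = pd.DataFrame({
    "Container Number": ["OOLU9053138 OOLU3689708", "KKFU7074852"],
    "Commodity Short Description": ["CURTIN,DRAPE", "CHEMICALS; NOS"],
    "Arrival Date": [20120630, 20120630],
})

# Split whitespace-separated container numbers into lists.
toy["Container Number"] = toy["Container Number"].str.split()
# Split comma-separated commodity descriptions into lists.
toy["Commodity Short Description"] = toy["Commodity Short Description"].str.split(pat=",")
# Parse yyyymmdd integers into datetimes.
toy["Arrival Date"] = pd.to_datetime(toy["Arrival Date"].astype(str), format="%Y%m%d")

print(toy)
```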

In [9]:
#redefine path
path = 'data/raw_parquet/'
#get list of data files, ignoring any hidden files in directory 
datafiles = [file for file in os.listdir(path) if not file.startswith('.')]
#initialize dataframe
imports_df = pd.DataFrame()
#init filenumber
filenumber = 1

datafiles

['PIERS import records 2012 01-06 6913D922629F4A60A8B58DFBB18718D0.parquet',
 'PIERS import records 2021-07-01 to 2021-12-31 2594E636259E4E6D9E98DB25CBD3D4C8.parquet',
 'PIERS import records 2013 07-12 AAD497A1356E480EA50A63677D75192D.parquet',
 'PIERS import records 2022-03-01 to 2022-08- 31 DC6C0A9417E1411796365E8CB68EA104.parquet',
 'PIERS import records 2020 07-12 F4590742040E42B4A8945821EE4526BC.parquet',
 'PIERS import records 2019 01-06 DDE7D8AFA5C540B9BEF53A673D284078.parquet',
 'PIERS import records 2016 07-12 0E1786F127724FC7840C3C221E3AB4EC.parquet',
 'PIERS import records 2018 07-12 27309015DCA54EA5A6C584E83935A9A9.parquet',
 'PIERS import records 2013 01-06 67D22B8619EA42078798D01517489910.parquet',
 'PIERS import records 2022 01-02 8C26A4D120BD40FD9682336FAD5E0FA6.parquet',
 'PIERS import records 2023-09-06 to 2023-11- 31 63E23DF3FF4C45D396DDF74608618685.parquet',
 'PIERS import records 2015 01-06 902B742F13194517B3A3848669B9892F.parquet',
 'PIERS import records 2020 01-0

In [10]:

#extract from parquet to clean dataframes and concat
print('Processing CSVs. \n There are {} files to process. \n'.format(len(datafiles)))
for filename in datafiles:
    start = time.time()
    print('Processing file number {}: {} ...'.format(filenumber, filename))
    file_df = piers_imports_extractor(path+filename)
    imports_df = pd.concat([imports_df, file_df])
    del file_df
    end = time.time()
    print('File number {} complete.'.format(filenumber))
    print('Processing this file took {} seconds.'.format(end-start))
    print('The dataframe is now {} GB.\n'.format(imports_df.memory_usage().sum()/1000000000))
    filenumber += 1
#recast to categorical dtypes again: concatenating categorical columns whose
#categories differ between files falls back to object dtype
cat_cols = ['Weight Unit', 'Quantity Type', 'Territory of Origin', 'Region of Origin', 'Port of Arrival Code', 'Port of Arrival',
            'Port of Departure Code', 'Port of Departure', 'Final Destination', 'Coastal Region', 'Clearing District', 'Place of Receipt',
            'Shipper', 'Carrier', 'SCAC', 'Mode of Transport']
imports_df[cat_cols] = imports_df[cat_cols].astype('category')

Processing CSVs. 
 There are 25 files to process. 

Processing file number 1: PIERS import records 2012 01-06 6913D922629F4A60A8B58DFBB18718D0.parquet ...
File number 1 complete.
Processing this file took 11.835404872894287 seconds.
The dataframe is now 1.070675829 GB.

Processing file number 2: PIERS import records 2021-07-01 to 2021-12-31 2594E636259E4E6D9E98DB25CBD3D4C8.parquet ...
File number 2 complete.
Processing this file took 22.896639823913574 seconds.
The dataframe is now 3.77726936 GB.

Processing file number 3: PIERS import records 2013 07-12 AAD497A1356E480EA50A63677D75192D.parquet ...
File number 3 complete.
Processing this file took 16.12999677658081 seconds.
The dataframe is now 5.39808344 GB.

Processing file number 4: PIERS import records 2022-03-01 to 2022-08- 31 DC6C0A9417E1411796365E8CB68EA104.parquet ...
File number 4 complete.
Processing this file took 26.352632999420166 seconds.
The dataframe is now 7.86135384 GB.

Processing file number 5: PIERS import records 

  new_result = trans(result).astype(dtype)


File number 9 complete.
Processing this file took 27.411241054534912 seconds.
The dataframe is now 17.203660544 GB.

Processing file number 10: PIERS import records 2022 01-02 8C26A4D120BD40FD9682336FAD5E0FA6.parquet ...
File number 10 complete.
Processing this file took 32.768775939941406 seconds.
The dataframe is now 18.02033462 GB.

Processing file number 11: PIERS import records 2023-09-06 to 2023-11- 31 63E23DF3FF4C45D396DDF74608618685.parquet ...
File number 11 complete.
Processing this file took 24.703068256378174 seconds.
The dataframe is now 19.015365836 GB.

Processing file number 12: PIERS import records 2015 01-06 902B742F13194517B3A3848669B9892F.parquet ...
File number 12 complete.
Processing this file took 37.7004930973053 seconds.
The dataframe is now 20.624424356 GB.

Processing file number 13: PIERS import records 2020 01-06 D80C005382EB47B69FF55D01E5952D83.parquet ...
File number 13 complete.
Processing this file took 63.46302103996277 seconds.
The dataframe is now 22

  imports_df = pd.concat([imports_df, file_df])


File number 18 complete.
Processing this file took 101.74683904647827 seconds.
The dataframe is now 31.508173064 GB.

Processing file number 19: PIERS import records 2014 01-06 3D15D62D5DCC4DC0B185D2A13DA5AF19.parquet ...
File number 19 complete.
Processing this file took 164.76898789405823 seconds.
The dataframe is now 33.091623596 GB.

Processing file number 20: PIERS import records 2015 07-12 43DA9BAAE693446D808507337697B86D.parquet ...
File number 20 complete.
Processing this file took 343.6006050109863 seconds.
The dataframe is now 34.808230676 GB.

Processing file number 21: PIERS import records 2018 01-06 FE04571CA7F547A3BF974AEC156D7628.parquet ...
File number 21 complete.
Processing this file took 322.1223130226135 seconds.
The dataframe is now 36.598352384 GB.

Processing file number 22: PIERS import records 2022-09-01 to 2023-01- 31 CD439702FC014C1E964C1F770095E3E9.parquet ...
File number 22 complete.
Processing this file took 373.3160581588745 seconds.
The dataframe is now 
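
The timings in the log above climb from about 12 seconds for the first file to more than 370 seconds for file 22. One plausible contributor (not profiled here) is that calling `pd.concat` inside the loop copies the whole accumulated dataframe on every iteration. A common alternative, sketched below under that assumption, is to collect the per-file frames in a list and concatenate once. The `combine_frames` helper and the toy frames are hypothetical and stand in for the outputs of `piers_imports_extractor`.

```python
import pandas as pd

def combine_frames(frames):
    """Concatenate a list of dataframes with a single pd.concat call."""
    # ignore_index=True builds a fresh 0..n-1 index instead of repeating per-file indexes.
    return pd.concat(frames, ignore_index=True)

# Toy frames standing in for per-file extractor outputs.
frames = [
    pd.DataFrame({"Carrier": ["MAERSK LINE"], "Quantity": [0]}),
    pd.DataFrame({"Carrier": ["HYUNDAI"], "Quantity": [1]}),
]

combined = combine_frames(frames)
# Recast once after combining, as the notebook does for its categorical columns.
combined["Carrier"] = combined["Carrier"].astype("category")
print(combined)
```

In the notebook this would mean appending each extractor result to a list inside the loop and calling `pd.concat` once afterwards. The list of frames and the combined frame coexist briefly, so memory headroom still matters at this scale.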

In [11]:
# inspect output 
display(imports_df.head())
imports_df.info()

Unnamed: 0,Shipper,Shipper Address,Consignee,Consignee Address,Notify Party,Notify Party Address,Also Notify Party,Also Notify Party Address,Weight,Weight Unit,Quantity,Quantity Type,TEUs,Carrier,SCAC,Vessel Name,Voyage Number,Bill of Lading Number,Pre Carrier,IMO Number,Inbond Code,Estimated Value,Territory of Origin,Region of Origin,Port of Arrival Code,Port of Arrival,Port of Departure Code,Port of Departure,Final Destination,Mode of Transport,Arrival Date,Container Number,Container Piece Count,Clearing District,Coastal Region,Raw Commodity Description,Commodity Short Description,HS Code,JOC Code,Marks Container Number,Marks Description,Place of Receipt,Quantity of Commodity Short Description
0,ORDER,ZZ,ORDER,XX,,,,,0.0,KG,3020,CM,0.0,ORIENT OVERSEAS CONTAINER LINE,OOLU,NYK ROMULUS,13,OOLU2016430430,,9416989.0,0.0,0.0,CHINA (MAINLAND),NORTH EAST ASIA,1401.0,NORFOLK,57020.0,NINGPO,,,2012-06-30,"[OOLU9053138, OOLU3689708]",2,"NORFOLK, VIRGINIA",EAST,"CHAIR, HARR, HIBACK, BLACK CHAIR, HARR, HIBA...","[FURNITURE, FIXTURES; NOS (* 7275)]",9401,7275000,,ZZ; ZZ,NINGPO,[610]
1,ORDER,ZZ,ORDER,XX,,,,,0.0,KG,1,CM,0.0,K LINE,KKLU,,46,EXDO6394667164,,,0.0,0.0,CHINA (MAINLAND),NORTH EAST ASIA,2709.0,LONG BEACH,57035.0,SHANGHAI,,,2012-06-30,[KKFU7074852],1,"LOS ANGELES, CALIFORNIA",WEST,MICRO FIBRE SHEETS  ,"[CURTIN, DRAPE, LINEN, SHEET, TOWEL]",630239,3665000,,LC,SHANGHAI,[9]
2,ORDER,ZZ,ORDER,XX,,,,,0.0,KG,0,CM,0.0,MAERSK LINE,MAEU,SEA LAND MERCURY,1211,MAEU864136320,,9106194.0,0.0,0.0,HUNGARY,NORTH EUROPE,4601.0,NEW YORK,42870.0,BREMERHAVEN,,,2012-06-30,[MRKU7311374],1,,EAST,72 DRUMS ON 18 PALLETS OF KOMA D 503 LOADD...,[CHEMICALS; NOS],382490,4999997,,20,BUDAPEST,[18]
3,BABOLNA BIOENVIROMENTAL CENTRE,SZALLAS U 6,PRECISION SCIENCE INC,1517 W KNUDSEN DR,PRECISION SCIENCE INC,1517 W KNUDSEN DR,,,0.0,KG,0,CM,0.0,MAERSK LINE,MAEU,SEA LAND MERCURY,1211,MAEU557437978,,9106194.0,0.0,0.0,GERMANY,NORTH EUROPE,4601.0,NEW YORK,42870.0,BREMERHAVEN,,,2012-06-30,[MSKU8202750],1,,EAST,24 STEEL DRUMS UN 1A1 GROSS WE IGHT: 1582.200...,[PESTICIDE; NOS (* 4051510-40)],380890,4051540,,LC,BREMERHAVEN,[24]
4,ORDER,ZZ,ORDER,XX,,,,,0.0,KG,0,,0.0,HYUNDAI,HDMU,HYUNDAI DYNASTY,23,HDMUYNWB8004548,,9347578.0,0.0,0.0,CHINA (MAINLAND),NORTH EAST ASIA,4601.0,NEW YORK,57078.0,YANTIAN,,,2012-06-30,[CRSU9286041],1,,EAST,SHIPPER'S LOAD & COUNT (180CTNS) CY / CY  CO...,"[HIBACHI, BARBEQUE, GRILL, ACCESS]",732111,6535205,,40,YANTIAN,[180]


<class 'pandas.core.frame.DataFrame'>
Index: 136178647 entries, 0 to 5290211
Data columns (total 43 columns):
 #   Column                                   Dtype         
---  ------                                   -----         
 0   Shipper                                  category      
 1   Shipper Address                          object        
 2   Consignee                                object        
 3   Consignee Address                        object        
 4   Notify Party                             object        
 5   Notify Party Address                     object        
 6   Also Notify Party                        object        
 7   Also Notify Party Address                object        
 8   Weight                                   float64       
 9   Weight Unit                              category      
 10  Quantity                                 int64         
 11  Quantity Type                            category      
 12  TEUs                           

## Load

In [12]:
#save to parquet file
imports_df.to_parquet('data/piers_imports.parquet', index=False)

#delete imports df
del imports_df

Restarted kernel after previous cell

In [3]:
df = pd.read_parquet('data/piers_imports.parquet', engine='fastparquet') #requires fastparquet dependency 

In [4]:
df.columns

Index(['Shipper', 'Shipper Address', 'Consignee', 'Consignee Address',
       'Notify Party', 'Notify Party Address', 'Also Notify Party',
       'Also Notify Party Address', 'Weight', 'Weight Unit', 'Quantity',
       'Quantity Type', 'TEUs', 'Carrier', 'SCAC', 'Vessel Name',
       'Voyage Number', 'Bill of Lading Number', 'Pre Carrier', 'IMO Number',
       'Inbond Code', 'Estimated Value', 'Territory of Origin',
       'Region of Origin', 'Port of Arrival Code', 'Port of Arrival',
       'Port of Departure Code', 'Port of Departure', 'Final Destination',
       'Mode of Transport', 'Arrival Date', 'Container Number',
       'Container Piece Count', 'Clearing District', 'Coastal Region',
       'Raw Commodity Description', 'Commodity Short Description', 'HS Code',
       'JOC Code', 'Marks Container Number', 'Marks Description',
       'Place of Receipt', 'Quantity of Commodity Short Description'],
      dtype='object')

In [6]:
df['Bill of Lading Number'].value_counts()

Bill of Lading Number
HLCUTOR220605761    7
SUDU231582438005    7
CHSL338704389HCM    6
EXDO62R0276535      6
EXDO62R0276537      6
                   ..
EXDO63Z8075099      1
EXDO617741618       1
EXDO6395357610      1
EXDO621115434       1
COSU6082255780      1
Name: count, Length: 135571302, dtype: int64