# International Trade Data - API

In [40]:
import numpy as np
import pandas as pd
import requests
import time
import copy

# no decimal places
pd.options.display.float_format = '{:,.0f}'.format
# show more rows
pd.set_option('display.max_rows', 200)

### Monthly International Trade Data API Structure

API for MITD data has two endpoints: exports and imports. This means that although similar, API calls for exports and imports will have to be made separately. There are several datasets available for each, but as mentioned previously, we will be using the HS data. 

The general structure of the API call includes the UNIFORM RESOURCE IDENTIFIERS (URIs) of the MITD endpoints
    ``https://api.census.gov/data/timeseries/intltrade/exports/hs?get=``, where hs stands for the HS dataset. This is an example for the exports endpoint. Getting imports is as simple as typing imports instead of exports.
    
To complete the rest of the API call, we use parameters. There are more than 50 available parameters for both exports and imports. Most of them are similar, but there are several that can be used with one group only. For this project, we will use the following:
            
            CTY_CODE        --> country code
            ALL_VAL_MO      --> total value, monthly
            CNT_VAL_MO      --> containerized vessel value
            CNT_WGT_MO      --> containerized vessel shipping weight
            AIR_VAL_MO      --> air value, monthly
            AIR_WGT_MO      --> air shipping weight, monthly
            VES_VAL_MO      --> vessel value, monthly
            VES_WGT_MO      --> vessel shipping weight, monthly
            SUMMARY_LVL=DET --> selecting individual countries, not predefined groups
            time            --> YYYY-MM format, or from+YYYY-MM+to+YYYY-MM if multiple months
            E(I)_COMMODITY  --> export/import commodity code
            DISTRICT        --> district code
 
Here is an example of an API call (click to view in browser): 
https://api.census.gov/data/timeseries/intltrade/exports/hs?get=DISTRICT,E_COMMODITY,ALL_VAL_MO,AIR_VAL_MO,AIR_WGT_MO,VES_VAL_MO,VES_WGT_MO&COMM_LVL=HS2&key=c64e82bd341ac24cc8223afd0458afb0f3436c66&time=from+2013-01+to+2019-12&CTY_CODE=1220&DISTRICT=59 

For more information on how to build MITD API queries, refer to the [International Trade Data - API User Guide](https://www.census.gov/foreign-trade/reference/guides/Guide%20to%20International%20Trade%20Datasets.pdf) published by the U.S. Census Bureau.

* * *

## Preparing Iterables

* * *


### Countries

Country codes are 4-digit unique identifiers and some examples include 1000, 1010, 1220, etc. For example, 1220 stands for Canada. This list consists of **241** countries and it is updated to reflect current geopolitical factors. The original list can be found [here](https://www.census.gov/foreign-trade/schedules/c/country2.txt).
Country codes will be used as the iterable for the outermost loop.

In [41]:
countries = pd.read_csv('countries.csv', header=None, dtype=object)

# remove the last column, the country abbreviations, since it is not necessary
countries.drop(labels=2, axis=1, inplace=True)
# set column names
country_cols = ['COUNTRY_CODE', 'COUNTRY']
countries.columns = country_cols

# abbreviate long country names
#   Falkland Islands (Islas Malvinas)                          --> Falkland Islands
#   Denmark, except Greenland                                  --> Denmark
#   Germany (Federal Repulic of Germany)                       --> Germany
#   Moldova (Repulic of Moldova)                               --> Moldova
#   Holy See (Vatican City)                                    --> Vatican City
#   Syria (Syrian Arab Republic)                               --> Syria
#   Gaza Strip administered by Israel                          --> Gaza Strip
#   West Bank administered by Israel                           --> West Bank
#   Yemen (Republic of Yemen)                                  --> Yemen
#   Burma (Myanmar)                                            --> Myanmar
#   Laos (Lao People's Democratic Republic)                    --> Laos
#   North Korea (democratic People's Republic of Korea)        --> North Korea
#   South Korea (Republic of Korea)                            --> South Korea
#   Samoa (Western Samoa)                                      --> Samoa
#   Micronesia, Federated States of                            --> Micronesia
#   Congo, Republic of the Congo                               --> Congo-Brazzaville
#   Congo, Democratic Republic of the Congo (formerly Za[ire)] --> Congo-Kinshasa
#   Tanzania (United Republic of Tanzania)                     --> Tanzania
#   Christmas Island (in the Indian Ocean)                     --> Christmas Island

# make a dictionary with shorter country name alternatives
updated_country_names = {'Falkland Islands (Islas Malvinas)': 'Falkland Islands',
                         'Denmark, except Greenland': 'Denmark',
                         'Germany (Federal Republic of Germany)': 'Germany',
                         'Moldova (Republic of Moldova)': 'Moldova',
                         'Holy See (Vatican City)': 'Vatican City',
                         'Syria (Syrian Arab Republic)': 'Syria',
                         'Gaza Strip administered by Israel': 'Gaza Strip',
                         'West Bank administered by Israel': 'West Bank',
                         'Yemen (Republic of Yemen)': 'Yemen',
                         'Burma (Myanmar)': 'Myanmar',
                         'Laos (Lao People\'s Democratic Republic)': 'Laos',
                         'North Korea (Democratic People\'s Republic of Korea)': 'North Korea',
                         'South Korea (Republic of Korea)': 'South Korea',
                         'Samoa (Western Samoa)': 'Samoa',
                         'Micronesia, Federated States of': 'Micronesia',
                         'Congo, Republic of the Congo': 'Congo-Brazzaville',
                         'Congo, Democratic Republic of the Congo (formerly Za': 'Congo-Kinshasa',
                         'Tanzania (United Republic of Tanzania)': 'Tanzania',
                         'Christmas Island (in the Indian Ocean)': 'Christmas Island'}

# clean any unwanted whitespace
countries.loc[:,'COUNTRY_CODE'] = countries.loc[:,'COUNTRY_CODE'].str.strip()
countries.loc[:,'COUNTRY'] = countries.loc[:,'COUNTRY'].str.strip()
# replace the long names
countries.loc[:,'COUNTRY'].replace(to_replace=updated_country_names, inplace=True)
# prepare a list of country codes needed for API calls
country_codes = countries.loc[:,'COUNTRY_CODE'].tolist()
# preview a portion of country_codes
print(country_codes[0:10])

['5310', '4810', '7210', '9510', '4271', '7620', '2481', '2484', '3570', '4631']


### Districts

District codes are 2-digit unique identifiers for a total of **45** U.S. districts. The codes range from 01 to 59. An example district code is 53, which stands for Houston-Galveston, TX. The list of districts is also sourced from the [U.S.Census Bureau](https://www.census.gov/foreign-trade/schedules/d/dist2.txt).
District codes will also be used for iteration. District code groups (based on the first digit) will be used when possible so more data is obtained in less iterations.

In [42]:
districts = pd.read_csv('districts.csv', header=None, dtype=object)
# rename the columns
districts.columns = ['DISTRICT_CODE', 'DISTRICT']

# add the district that's missing on the official district list, but shows up in the data
districts.loc[44] = ['59', 'NORFOLK/MOBILE/CHARLESTON, VA/AL/SC']

# prepare a list of district codes
district_codes = districts.loc[:,'DISTRICT_CODE'].tolist()
# for APIs, divide districts in groups based on the first digit
district_codes_single = districts.loc[:,'DISTRICT_CODE'].str.extract(r'([0-9])').iloc[:,0].unique().tolist()
# preview single-digit district codes
district_codes_single

['0', '1', '2', '3', '4', '5']

In [43]:
# save each district group in a separate list
district_groups = []
for group_code in district_codes_single:
    group = districts[districts['DISTRICT_CODE'].str.contains(group_code + r'[0-9]')].loc[:,'DISTRICT_CODE']
    district_groups.append(group.tolist())
    
# preview district groups
district_groups

[['01', '02', '04', '05', '07', '09'],
 ['10', '11', '13', '14', '15', '16', '17', '18', '19'],
 ['20', '21', '23', '24', '25', '26', '27', '28', '29'],
 ['30', '31', '32', '33', '34', '35', '36', '37', '38', '39'],
 ['41', '45', '46', '47', '49'],
 ['51', '52', '53', '54', '55', '59']]

***
## Making the API Calls

***

In [44]:
# test slice
alter_country = country_codes[0:5]

# choose to iterate over the slice or entire country list
# country_iterable = country_codes
country_iterable = alter_country

In [45]:
# insert your API key here
key = 'c64e82bd341ac24cc8223afd0458afb0f3436c66' 

# time_interval = '2013-01'
time_interval = 'from+2019-01+to+2019-12'

# Fixed URI parts 
exports_endpoint = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=DISTRICT,E_COMMODITY,ALL_VAL_MO,CNT_VAL_MO,CNT_WGT_MO,AIR_VAL_MO,AIR_WGT_MO,VES_VAL_MO,VES_WGT_MO&COMM_LVL=HS2&key="
imports_endpoint = "https://api.census.gov/data/timeseries/intltrade/imports/hs?get=DISTRICT,I_COMMODITY,GEN_VAL_MO,CNT_VAL_MO,CNT_WGT_MO,AIR_VAL_MO,AIR_WGT_MO,VES_VAL_MO,VES_WGT_MO&COMM_LVL=HS2&key="

# start measuring time elapsed
start = time.time()
# track outcome of each call
quality_control = list()
# save each country's result in a dictionary
countries_data = dict()

for country in country_iterable:
    # control variable used to identify the first successful API call for a given country
    status_success = 0
    
    for district in district_codes_single:
        # build the custom API call
        exp_api = exports_endpoint + key + '&time=' + time_interval + '&CTY_CODE=' + country + '&DISTRICT=' + district + '*'
        imp_api = imports_endpoint + key + '&time=' + time_interval + '&CTY_CODE=' + country + '&DISTRICT=' + district + '*'
        # requesting APIs
        for api in [exp_api, imp_api]:
            api_response = requests.get(api)
            
            qa = [country, district, 'E' if api == exp_api else 'I', api_response.status_code]
            quality_control.append(qa)
            
            # manage successful APIs, i.e. status code == 200
            if api_response.status_code == 200:
                status_success += 1
                # save the first successful query result as a dataframe
                if status_success == 1:
                    data = pd.DataFrame(api_response.json())
                    # denote the endpoint
                    data['type'] = 'Exports' if api == exp_api else 'Imports'
                # concatenate subsequent successful query results
                else:
                    data_a = pd.DataFrame(api_response.json())
                    data_a['type'] = 'Exports' if api == exp_api else 'Imports'
                    data = pd.concat([data, data_a.iloc[1:,:]])
            
            # manage too large requests by iterating through individual districts instead of district groups
            elif api_response.status_code == 500:
                for district_2 in district_groups[int(district)]:
                    if qa[2] == 'E':
                        api_2 = exports_endpoint + key + '&time=' + time_interval + '&CTY_CODE=' + country + '&DISTRICT=' + district_2
                    elif qa[2] == 'I':
                        api_2 = imports_endpoint + key + '&time=' + time_interval + '&CTY_CODE=' + country + '&DISTRICT=' + district_2
                    
                    api_response_2 = requests.get(api_2)
                    qa_2 = [country, district_2, qa[2], api_response_2.status_code]
                    quality_control.append(qa_2)
                    
                    if api_response_2.status_code == 200:
                        status_success += 1
                        if status_success == 1:
                            data = pd.DataFrame(api_response_2.json())
                            # denote the data type
                            data['type'] = 'Exports' if qa[2] == 'E' else 'Imports'
                        # concatenate subsequent successful query results
                        else:
                            data_a = pd.DataFrame(api_response_2.json())
                            data_a['type'] = 'Exports' if qa[2] == 'E' else 'Imports'
                            data = pd.concat([data, data_a.iloc[1:,:]])
    
    # check if any data was obtained and save data accordingly
    if status_success != 0:
        countries_data[country] = data
    
    # ensure slice is long enough for correct execution of the conditionals
    # do note, however, that these are rough estimates at best
    if len(country_iterable) > 4:
        if country_codes[int(len(country_iterable)*.25)] == country:
            first_quarter = (time.time()-start)/60
            print('25% completed')
            print('Estimated time remaining: ', round(first_quarter*3, 2), 'minutes')
        elif country_codes[int(len(country_iterable)*.5)] == country:
            second_quarter = (time.time()-start)/60
            print('50% completed')
            print('Estimated time remaining: ', round(second_quarter, 2), 'minutes')
        elif country_codes[int(len(country_iterable)*.75)] == country:
            third_quarter = (time.time()-start)/60
            print('75% completed')
            print('Estimated time remaining: ', round(third_quarter/3, 2), 'minutes')
        elif country_codes[len(country_iterable)] == country:
            print('Complete!\n')
    
end = time.time()
run_time = end - start

25% completed
Estimated time remaining:  1.94 minutes
50% completed
Estimated time remaining:  0.99 minutes
75% completed
Estimated time remaining:  0.42 minutes


***

## Cleaning the Data

***
To prepare the data for use, we need to do several steps including assigning meaningful column names, deleting unecessary columns and rows, adding year and month columns and converting intrinsically numerical variables from strings to integers.

In [46]:
# preview the current data format
countries_data[country_codes[2]]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,type
0,DISTRICT,E_COMMODITY,ALL_VAL_MO,CNT_VAL_MO,CNT_WGT_MO,AIR_VAL_MO,AIR_WGT_MO,VES_VAL_MO,VES_WGT_MO,COMM_LVL,time,CTY_CODE,DISTRICT,Exports
1,09,04,152880,0,0,0,0,0,0,HS2,2019-08,7210,09,Exports
2,09,04,0,0,0,0,0,0,0,HS2,2019-09,7210,09,Exports
3,09,04,0,0,0,0,0,0,0,HS2,2019-10,7210,09,Exports
4,09,04,0,0,0,0,0,0,0,HS2,2019-11,7210,09,Exports
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,54,33,0,0,0,0,0,0,0,HS2,2019-08,7210,54,Imports
85,54,33,0,0,0,0,0,0,0,HS2,2019-09,7210,54,Imports
86,54,33,0,0,0,0,0,0,0,HS2,2019-10,7210,54,Imports
87,54,33,0,0,0,0,0,0,0,HS2,2019-11,7210,54,Imports


In [47]:
# clean-up function
# dictionary with preferred replacements for default column names obtained through API
def clean_up(df):
    column_names = {'CTY_CODE': 'COUNTRY_CODE',
                'ALL_VAL_MO': 'VALUE',
                'GEN_VAL_MO': 'VALUE',
                'CNT_VAL_MO': 'CONTAINER_VALUE',
                'CNT_WGT_MO': 'CONTAINER_WEIGHT',
                'AIR_VAL_MO': 'AIR_VALUE',
                'AIR_WGT_MO': 'AIR_WEIGHT',
                'VES_VAL_MO': 'VESSEL_VALUE',
                'VES_WGT_MO': 'VESSEL_WEIGHT',
                'E_COMMODITY': 'COMMODITY_CODE',
                'I_COMMODITY': 'COMMODITY_CODE',
                'DISTRICT': 'DISTRICT_CODE',
                'Imports': 'TYPE',
                'Exports': 'TYPE'}

    # drop the extra district_code column
    df.drop(labels=0, axis=1, inplace=True)
    # replace the default column names (which are now in the first row) with the preferred ones
    df.iloc[0,:].replace(to_replace=column_names, inplace=True)
    # take the first row, convert to upper case, create a list and assign as a header
    df.columns = df.iloc[0].str.upper().tolist()
    # delete the first row
    df.drop(labels=0, inplace=True)
    # drop the rows with total value equal to 0
    df = df[df.loc[:,'VALUE'] != '0'].copy()

    # separate the TIME column into columns YEAR and MONTH
    df[['YEAR','MONTH']] = df['TIME'].str.split("-", expand=True)
    # delete the TIME column
    df.drop(labels='TIME', axis=1, inplace=True)
    # reset index
    df.reset_index(drop=True, inplace=True)
    
    return df

In [48]:
# call the function to clean the data
for cdf in countries_data:
    countries_data[cdf] = clean_up(countries_data[cdf])

In [49]:
# combine the dataframes
get_the_first_one = 0

concat_start = time.time()

for cdf in countries_data:
    if get_the_first_one == 0:
        get_the_first_one += 1
        final_data = pd.DataFrame(countries_data[cdf])
    elif get_the_first_one == 1:
        final_data = pd.concat([final_data, pd.DataFrame(countries_data[cdf].iloc[1:,:])])

concat_end = time.time()

concat_time = concat_end - concat_start
print('Time needed to combine all the dataframes:', round(concat_time/60, 2), 'minutes')

Time needed to combine all the dataframes: 0.0 minutes


In [50]:
# preview of the initial data format
final_data.head()

Unnamed: 0,COMMODITY_CODE,VALUE,CONTAINER_VALUE,CONTAINER_WEIGHT,AIR_VALUE,AIR_WEIGHT,VESSEL_VALUE,VESSEL_WEIGHT,COMM_LVL,COUNTRY_CODE,DISTRICT_CODE,TYPE,YEAR,MONTH
0,38,4531,0,0,4531,44,0,0,HS2,5310,4,Exports,2019,5
1,41,10800,10800,675,0,0,10800,675,HS2,5310,4,Exports,2019,4
2,84,61095,0,0,61095,157,0,0,HS2,5310,4,Exports,2019,1
3,84,173374,32400,18144,140974,216,32400,18144,HS2,5310,4,Exports,2019,2
4,84,12448,0,0,12448,55,0,0,HS2,5310,4,Exports,2019,4


In [51]:
# convert numerical parameters to int
num_params = ['VALUE', 'CONTAINER_VALUE', 'CONTAINER_WEIGHT', 'AIR_VALUE', 'AIR_WEIGHT', 'VESSEL_VALUE', 'VESSEL_WEIGHT']
for param in num_params:
    final_data[param] = pd.to_numeric(final_data[param], errors='coerce')

# check the data info
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6394 entries, 0 to 240
Data columns (total 14 columns):
COMMODITY_CODE      6394 non-null object
VALUE               6394 non-null int64
CONTAINER_VALUE     6394 non-null int64
CONTAINER_WEIGHT    6394 non-null int64
AIR_VALUE           6394 non-null int64
AIR_WEIGHT          6394 non-null int64
VESSEL_VALUE        6394 non-null int64
VESSEL_WEIGHT       6394 non-null int64
COMM_LVL            6394 non-null object
COUNTRY_CODE        6394 non-null object
DISTRICT_CODE       6394 non-null object
TYPE                6394 non-null object
YEAR                6394 non-null object
MONTH               6394 non-null object
dtypes: int64(7), object(7)
memory usage: 749.3+ KB


***

## Performance Summary 

***

In [52]:
# convert the performance logger to a df
performance = pd.DataFrame(quality_control)
# save call summary to a dictionary
perf_dict = performance.iloc[:,3].value_counts().to_dict()

# number of unresolved failed API calls
trouble = performance[performance.iloc[:,1].str.contains('[0-9][0-9]') & performance.iloc[:,1] == 500].shape[0]
if trouble != 0:
    print('Revision needed!')

In [53]:
# summary
print('API calls made: {:,}'.format(performance.shape[0]))
print('Data download time: ', round(run_time/3600, 2), 'hours')
print('API call breakdown:')
for key in perf_dict:
    print(' '*3, key, ' --> {:>4,}'.format(perf_dict[key]))
print('Unresolved API calls:', trouble)
# print('Total rows downloaded: {:,}'.format(data_raw))
print('Total rows after cleaning: {:,}'.format(final_data.shape[0]))

API calls made: 60
Data download time:  0.03 hours
API call breakdown:
    200  -->   48
    204  -->   12
Unresolved API calls: 0
Total rows after cleaning: 6,394


***
## Transforming and Exporting the Data
***
Depending on the end use, data can be exported in the current format or aggregated in many different ways. One way is to  aggregate by years.

In [54]:
# download the monthly format
final_data.to_csv('monthly_trade_data.csv')

In [56]:
# download the yearly format
# several columns are necessary to build a unique identifier in this case
yearly = final_data.groupby(['COUNTRY_CODE', 'COMMODITY_CODE', 'DISTRICT_CODE', 'TYPE', 'YEAR'], as_index=False) \
                      [['VALUE', 'CONTAINER_VALUE', 'CONTAINER_WEIGHT', 'AIR_VALUE', 'AIR_WEIGHT', 'VESSEL_VALUE', 'VESSEL_WEIGHT']].sum()

# preview of the aggregated data
yearly.head()

Unnamed: 0,COUNTRY_CODE,COMMODITY_CODE,DISTRICT_CODE,TYPE,YEAR,VALUE,CONTAINER_VALUE,CONTAINER_WEIGHT,AIR_VALUE,AIR_WEIGHT,VESSEL_VALUE,VESSEL_WEIGHT
0,4271,16,11,Exports,2019,17640,0,0,17640,4496,0,0
1,4271,16,30,Exports,2019,4800,0,0,4800,174,0,0
2,4271,16,55,Exports,2019,4800,0,0,4800,1765,0,0
3,4271,19,10,Exports,2019,13168,13168,6873,0,0,13168,6873
4,4271,21,10,Exports,2019,12673,0,0,12673,389,0,0


One way to use the data is to export it in `csv` format.

In [None]:
yearly.to_csv('yearly_trade_data.csv')