Review session: Trade, product space and economic complexity
============================================================



September 16 2021, Matte Hartog



## Notes



Google colab link:

[https://colab.research.google.com/github/matteha/product-space-eci-workshop/blob/main/product-space-eci-workshop.ipynb](https://colab.research.google.com/github/matteha/product-space-eci-workshop/blob/main/product-space-eci-workshop.ipynb)



## To do first



In Google Colab:

1.  Turn on the Table of Contents (in the browser, click 'View' at the top, then 'Table of Contents')

2.  Expand all sections ('View' > 'Expand Sections' if not greyed out)

(Equations render properly in Google Colab; they do not render on GitHub.)



## Outline of lab session



-   Introduction to trade data

-   Calculating RCAs, product co-occurrences and product proximity, density / density regressions

-   Product space visualization

-   Calculating Economic Complexity / Product Complexity



## Trade data



### Background



The product space, as well as derived measures such as economic complexity and the Growth Lab's annual rankings of countries by economic complexity (at [https://atlas.cid.harvard.edu](https://atlas.cid.harvard.edu)), is based on trade data between countries.

The Growth Lab maintains and periodically updates a cleaned version of trade data at Harvard Dataverse:

[https://dataverse.harvard.edu/dataverse/atlas](https://dataverse.harvard.edu/dataverse/atlas)

This dataset contains bilateral trade data among 235 countries and territories in thousands of different product categories (a description of the data can be found at: [http://atlas.cid.harvard.edu/downloads](http://atlas.cid.harvard.edu/downloads)).

What does the data look like? We will explore the data in Python using 'pandas' (the most popular Python package for data analysis).



#### Footnote on trade and services (ICT, tourism, etc.):



-   As of September 2018, services and tourism are also included in the Growth Lab's Atlas and trade data. See the announcement at:

[https://atlas.cid.harvard.edu/announcements/2018/services-press-release](https://atlas.cid.harvard.edu/announcements/2018/services-press-release)

Obtained from the IMF, trade in services covers four categories of economic activities between producers and consumers across borders:

-   services supplied from one country to another (e.g. call centers)
-   consumption in other countries (e.g. international tourism)
-   firms with branches in other countries (e.g. bank branches overseas)
-   individuals supplying services in another country (e.g. IT consultant abroad)



### Load necessary Python libraries



In [1]:
# -- Global settings
# - import python libraries necessary for this workshop
# suppress warnings on google colab for now
import warnings
warnings.filterwarnings("ignore")
# to interact with os, e.g. to execute shell commands such as 'ls', 'pwd' etc.
import os
# to do data processing
import pandas as pd
# backend of pandas, working with matrices
import numpy as np
# to visualize data (in pandas)
import matplotlib.pyplot as plt
import matplotlib.colors as colors
# to process a json file
import json
# work with regex in python
import re
# work with networks in python, to create product space
import networkx as nx
# python tools to work with combinations of arrays
from itertools import count
from itertools import combinations
from itertools import product
# to run regressions
import statsmodels.api as sm
# to download files
import urllib.request
# -- display floats with two decimals instead of scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # Show the result of every expression in a cell, not only the last one
import seaborn as sns
sns.set_style('whitegrid') # White background with grid lines
# Enlarged pandas display - more columns and rows with greater width
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth',300)
print('necessary libraries loaded')

### Download trade dataset and load into memory



In [1]:
# Load the necessary data into pandas

# In pandas terminology this is called a 'dataframe' (df)
product_classification = 'hs' # Harmonized System 1992; alternative is 'SITC - Standard International Trade Classification'

N_digits = '4' # alternative is 2 or 6, the higher the more detailed product info


# Trade data: we download from Amazon S3 here because the file can be read directly into pandas in Google Colab.
# Note that this copy is no longer maintained by the Growth Lab; for up-to-date data, download from Dataverse instead.

data_url = f"https://intl-atlas-downloads.s3.amazonaws.com/country_{product_classification}product{N_digits}digit_year.csv.zip"
print('Downloading data and loading into memory')
df_orig = pd.read_csv(data_url, compression="zip", low_memory=False)

# Fix product label strings ('hs_product_name_short_en') (some products with different product codes erroneously have the same strings - hence remove these duplicates)
# e.g. product codes 5209 and 5211 in Zimbabwe have same product string
# download original UN classification
with urllib.request.urlopen("https://comtrade.un.org/data/cache/classificationH0.json") as url:
    hs1992_json = json.loads(url.read())
dft = pd.DataFrame.from_dict(hs1992_json['results'])[['text']]
dft['hs_product_code'] = dft['text'].str.split('-').str[0].str.strip()
dft['hs_product_name_short_en'] = dft['text'].str.split('-', n=1).str[1].str.strip() # split on the first '-' only; 'n' must be passed by keyword in recent pandas
dft['N_dig'] = dft['hs_product_code'].str.len()
dft2 = dft[dft['N_dig']==int(N_digits)].copy()
df_orig = pd.merge(df_orig,dft2[['hs_product_code','hs_product_name_short_en']],how='left',on=f'hs_product_code') # unmerged are services (obtained from IMF)
# replace product name now with downloaded strings (if not missing in either)
df_orig['hs_product_name_short_en_new'] = df_orig['hs_product_name_short_en_x']
df_orig.loc[ df_orig['hs_product_name_short_en_y'].notnull(),'hs_product_name_short_en_new'] = df_orig['hs_product_name_short_en_y']
df_orig.drop(['hs_product_name_short_en_x'],axis=1,inplace=True,errors='ignore')
df_orig.drop(['hs_product_name_short_en_y'],axis=1,inplace=True,errors='ignore')
df_orig.rename(columns={f'hs_product_name_short_en_new':f'hs_product_name_short_en'}, inplace=True)

# Cross check that each row is a unique year-location-product entry
df_orig['count'] = 1
df_orig['sum'] = df_orig.groupby(['year','location_name_short_en','hs_product_name_short_en'])['count'].transform('sum')
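if df_orig['sum'].max() != 1:
    # raise an explicit error instead of relying on an undefined name to halt execution
    raise ValueError('duplicate year-location-product rows found, stopping')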

# rename variable names for convenience
df_orig.rename(columns={f'location_name_short_en':f'country_name'}, inplace=True)
df_orig.rename(columns={f'location_code':f'country_code'}, inplace=True)
df_orig.rename(columns={f'hs_product_code':f'product_code'}, inplace=True)
df_orig.rename(columns={f'hs_product_name_short_en':f'product_name'}, inplace=True)

# Keep only relevant columns
df_orig = df_orig[['year',
         'country_code',
         'country_name',
         'product_code',
         'product_name',
         'export_value']]

print('trade dataset ready')

### Exploring the trade data



#### Structure of dataset



In [1]:
# show 5 random rows
df_orig.sample(n=5)

#### What years are in the data?



In [1]:
df_orig['year'].unique()

#### How many products are in the data?



In [1]:
df_orig['product_name'].nunique()

#### Finding specific countries / products based on partial string matching



In [1]:
STRING = 'Netherland'
df_orig[df_orig['country_name'].str.contains(STRING)][['country_name']].drop_duplicates()

# Can also use regular expressions here, e.g. to ignore lower/uppercase ('wine' vs 'Wine')
STRING = 'wine'
df_orig[df_orig['product_name'].str.contains(STRING,flags=re.IGNORECASE, regex=True)][['product_name']].drop_duplicates()


#### Example: What were the major export products of the USA in 2012?



In [1]:
# create a 'dataframe' called 'df2' with only exports from USA in 2012
df2 = df_orig[ (df_orig['country_code']=='USA') & (df_orig['year'] == 2012) ].copy()
# create another dataframe 'df3' that contains the sum of exports per product
df3 = df2.groupby(['product_code','product_name'],as_index=False)['export_value'].sum()
# sort
df3.sort_values(by=['export_value'],ascending=False,inplace=True)
# show first 10 rows
df3[0:10]

#### Example: How did exports of Cars evolve over time in the USA?



From about 10 billion USD to almost 60 billion USD.



In [1]:
df2 = df_orig[ (df_orig['country_code']=='USA')].copy()
#df3 = df2[df2['product_name']=='Cars']
df3 = df2[df2['product_code']=='8703']
df3.plot(x='year', y='export_value')

## Revealed comparative advantage (RCA)



What products are countries specialized in? For that, following Hidalgo et al. (2007), we calculate the Revealed Comparative Advantage (RCA) of each country-product pair: how much a country 'over-exports' a product in comparison to all other countries.

Technically this is the Balassa index of comparative advantage, calculated as follows for product $p$ and country $c$ at time $t$:

\begin{equation} \label{e_RCA}
{RCA}_{cpt}=\frac{X_{cpt}/X_{ct}}{X_{pt}/X_{t}}
\tag{1}
\end{equation}

where $X_{cpt}$ represents the total value of country $c$’s exports of product $p$ at time $t$ across all importers. An omitted subscript indicates a summation over the omitted dimension, e.g.: $X_{t}=\sum \limits_{c,p} X_{cpt}$.

A product-country pair with $RCA>1$ means that the product is over-represented in the country's export basket.
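As a quick sanity check of equation (1), the sketch below computes the index on a made-up dataset with two countries and two products in a single year. The column names and export values are illustrative assumptions and do not come from the Atlas data.

```python
import pandas as pd

# Toy export data (hypothetical values, single year so the t subscript is dropped)
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'product': ['wine', 'cars', 'wine', 'cars'],
    'export_value': [80.0, 20.0, 20.0, 80.0],
})

# Totals per country (X_c), per product (X_p) and overall (X)
toy['Xc'] = toy.groupby('country')['export_value'].transform('sum')
toy['Xp'] = toy.groupby('product')['export_value'].transform('sum')
toy['X'] = toy['export_value'].sum()

# Balassa index: share of the product in the country's exports,
# divided by the share of the product in world exports
toy['RCA'] = (toy['export_value'] / toy['Xc']) / (toy['Xp'] / toy['X'])
print(toy[['country', 'product', 'RCA']])
```

In this toy example, wine makes up 80% of country A's exports but only 50% of world exports, so A has an RCA of 1.6 in wine and 0.4 in cars; the opposite holds for country B.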

We use the original trade dataset ('df_orig') that is loaded into memory:



In [1]:
def calc_rca(data,country_col,product_col,time_col,value_col):
    """
    Calculates Revealed Comparative Advantage (RCA) of country-product-time combinations

    Returns:
        pandas dataframe with RCAs
    """

    # Aggregate to country-product-time dataframe
    print('creating all country-product-time combinations')
    # - add all possible products for each country with export value 0
    # - else matrices later on will have missing values in them, complicating calculations
    df_all = pd.DataFrame(list(product(data[time_col].unique(), data[country_col].unique(),data[product_col].unique())))
    df_all.columns=[time_col,country_col,product_col]
    print('merging data in')
    df_all = pd.merge(df_all,data[[time_col,country_col,product_col,value_col]],how='left',on=[time_col,country_col,product_col])
    df_all.loc[df_all[value_col].isnull(),value_col] = 0

    # Calculate the properties
    print('calculating properties')
    df_all['Xcpt'] = df_all[value_col]
    df_all['Xct'] = df_all.groupby([country_col, time_col])[value_col].transform('sum')
    df_all['Xpt'] = df_all.groupby([product_col, time_col])[value_col].transform('sum')
    df_all['Xt'] = df_all.groupby([time_col])[value_col].transform('sum')
    df_all['RCAcpt'] = (df_all['Xcpt']/df_all['Xct'])/(df_all['Xpt']/df_all['Xt'])
    df_all.drop(['Xcpt','Xct','Xpt','Xt'],axis=1,inplace=True,errors='ignore')

    return df_all

df_rca = calc_rca(data=df_orig,country_col='country_name',product_col='product_name',time_col='year',value_col='export_value')

print('rca dataframe ready')

# show results
df_rca[0:10]

### Example: What products are The Netherlands and Saudi Arabia specialized in, in 2000?



(Note that different commands are chained together here; they can also be run separately)



In [1]:
# The Netherlands
print("\n The Netherlands: \n")

df_rca[ (df_rca['year']==2000) & (df_rca['country_name']=='Netherlands')].sort_values(by=['RCAcpt'],ascending=False)[['product_name','RCAcpt','year']][0:5]

print("\n Saudi Arabia:\n")

# Saudi Arabia
df_rca[ (df_rca['year']==2000) & (df_rca['country_name']=='Saudi Arabia')].sort_values(by=['RCAcpt'],ascending=False)[['product_name','RCAcpt','year']][0:5]

## Product proximity (based on co-occurrences)



### Calculating product co-occurrences



Knowing which countries are specialized in which products, the next step analyzes the extent to which two products are over-represented ($RCA>1$) in the same countries.

As noted in the lecture, the main insight supporting this inference is that countries will produce combinations of products that require similar capabilities.

We infer capabilities from trade patterns because the capabilities of a country are a priori hard to determine, and capabilities themselves are hard to observe.

Hence, **the degree to which two products co-occur in the export baskets of the same countries provides an indication of how similar the capability requirements of the two products are**.

We will calculate the co-occurrence matrix of products below.

First, a product is 'present' in a country if the country exports the product with $RCA>1$:

\begin{equation} \label{e_presence}
M_{cp}=\begin{cases}
    1 & \text{if ${RCA}_{cp}>1$}; \\
    0 & \text{elsewhere.}
    \end{cases}
\tag{2}
\end{equation}



In [1]:
df_rca['Mcp'] = 0
df_rca.loc[df_rca['RCAcpt']>1,'Mcp'] = 1

Next, we calculate how often two products are present in the same countries, using the presence indicator $M_{cp}$:

\begin{equation} \label{e_cooc}
C_{pp'}=\sum \limits_{c} M_{cp} M_{cp'}
\tag{3}
\end{equation}
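Summing $M_{cp} M_{cp'}$ over countries is the same as taking the matrix product $M^{T}M$. The sketch below illustrates this on a small made-up RCA matrix (the values are assumptions for illustration only). The function used further down in this notebook instead builds product pairs with 'itertools.combinations', which should give the same counts for pairs that co-occur at least once.

```python
import numpy as np

# Hypothetical RCA values: rows are countries, columns are products
rca = np.array([
    [1.6, 0.4, 2.0],
    [0.4, 1.6, 1.2],
    [1.1, 0.2, 3.0],
])

# Presence matrix M_cp: 1 where RCA > 1, else 0
M = (rca > 1).astype(int)

# Co-occurrence matrix: entry (p, p') counts countries where both products are present
C = M.T @ M

# The diagonal pairs each product with itself (its ubiquity);
# zero it out to keep only pairs of distinct products
np.fill_diagonal(C, 0)
print(C)
```

Here products 0 and 2 are both present in two countries, products 1 and 2 in one country, and products 0 and 1 never co-occur.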

We will use the first year of data in the dataset, **1995,** below.

Note that to reduce yearly volatility, Hidalgo et al. (2007) aggregate the trade data across multiple years (1998-2000) when calculating RCAs and product proximities for the product space. (However, when comparing the product space across years, they do use individual years.)



In [1]:
def calc_cpp(data,country_col,product_col):
    """
    Calculates product co-occurrences in countries

    Returns:
        pandas dataframe with co-occurrence value for each product pair
    """

    # Create combinations within country_col (i.e. countries) of entities (i.e. products)
    dft = (data.groupby(country_col)[product_col].apply(lambda x: pd.DataFrame(list(combinations(x,2))))
            .reset_index(level=1, drop=True)
            .reset_index())
    dft.rename(columns={0:f'product_1'}, inplace=True)
    dft.rename(columns={1:f'product_2'}, inplace=True)

    # Create second half of matrix (the matrix is symmetric):
    # product 1 X product 2 == product 2 X product 1
    dft2 = dft.copy()
    dft2.rename(columns={f'product_1':f'product_2t'}, inplace=True)
    dft2.rename(columns={f'product_2':f'product_1'}, inplace=True)
    dft2.rename(columns={f'product_2t':f'product_2'}, inplace=True)
    dft3 = pd.concat([dft,dft2],axis=0,sort=False)

    # Now calculate N of times that products occur together
    dft3['count'] = 1
    dft3 = dft3.groupby(['product_1','product_2'],as_index=False)['count'].sum()
    dft3.rename(columns={f'count':f'Cpp'}, inplace=True)

    return dft3

# Keep only year 1995
dft = df_rca[df_rca['year']==1995].copy()

# Keep only country-product combinations where Mcp == 1 (thus RCAcp > 1)
dft = dft[dft['Mcp']==1]

# Calculate cpp
df_cpp = calc_cpp(dft,country_col='country_name',product_col='product_name')

print('cpp product co-occurrences dataframe ready')

### Products that co-occur most often



In [1]:
# -- show products that co-occur most often
df_cpp.sort_values(by=['Cpp'],ascending=False,inplace=True)
df_cpp[0:10]

### Normalize product co-occurrences (cpp) as in Hidalgo et al. 2007



To get an accurate value of product proximity, we need to correct these numbers for the extent to which products are present in general in trade flows between countries.

To do so, Hidalgo et al. (2007) calculate product proximity as follows, defining it as the minimum of two conditional probabilities:

\begin{equation}
\phi_{pp'}  = \min \left( \frac{C_{pp'}}{C_{p}},\frac{C_{pp'}}{C_{p'}} \right)
\tag{4}
\end{equation}

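The minimum is used to eliminate 'false positives': a conditional probability is not symmetric, and if only one country exports product $p$, the probability of exporting any other product in that country's basket given $p$ equals 1, which would overstate proximity.

Dividing by $C_{p}$ and $C_{p'}$, the number of countries specialized in product $p$ and product $p'$ respectively (i.e. the 'ubiquity' of the products), corrects for how prevalent specialization in each product is across countries.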
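The normalization can be sketched in matrix form on a small made-up presence matrix (the values are assumptions for illustration only). Taking the minimum of the two conditional probabilities is the same as dividing the co-occurrence count by the larger of the two ubiquities. The variable name 'phi' follows the column name used in the code below, and the sketch assumes every product is present in at least one country, so no division by zero occurs.

```python
import numpy as np

# Hypothetical presence matrix M_cp: rows are countries, columns are products
M = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
])

# Co-occurrence counts; the diagonal holds each product's ubiquity
C = M.T @ M
ubiquity = np.diag(C).astype(float)

# min(C/ub_p, C/ub_p') equals C / max(ub_p, ub_p')
phi = C / np.maximum.outer(ubiquity, ubiquity)

# Proximity of a product with itself is not of interest here
np.fill_diagonal(phi, 0.0)
print(phi.round(2))
```

Products 0 and 2 co-occur in two countries, but product 2 is present in three, so their proximity is 2/3 rather than 1.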



In [1]:
# We calculate the ubiquity of each product and add it to the cpp matrix, then take the minimum of conditional probabilities

# again we use the year 1995 here
dft = df_rca[df_rca['year']==1995].copy()
df_ub = dft.groupby(['product_name'],as_index=False)['Mcp'].sum()
df_ub.rename(columns={f'product_name':f'product_1'}, inplace=True)
# merge ubiquity into cpp matrix
df_cppt = pd.merge(df_cpp,df_ub,how='left',on=f'product_1')
df_ub.rename(columns={f'product_1':f'product_2'}, inplace=True)
df_cppt = pd.merge(df_cppt,df_ub,how='left',on=f'product_2')

# take minimum of conditional probabilities
df_cppt['kpi'] = df_cppt['Cpp']/df_cppt['Mcp_x']
df_cppt['kpj'] = df_cppt['Cpp']/df_cppt['Mcp_y']
df_cppt['phi'] = df_cppt['kpi']
df_cppt.loc[df_cppt['kpj']<df_cppt['kpi'],'phi'] = df_cppt['kpj']

# show most proximate products
df_cppt.sort_values(by=['phi'],ascending=False,inplace=True)
df_cppt[0:10]