Lab session 2: Trade, product space and economic complexity
===========================================================



June 7 2021, Matte Hartog



## Notes



## Outline of lab session



-   Introduction to trade data
-   Calculating RCAs, product co-occurrences and product proximity, density / density regressions
-   Product space visualization
-   Calculating Economic Complexity / Product Complexity



## To do first



In Google Colab:

1.  Turn on the Table of Contents (in the browser, click 'View' in the top menu, then 'Table of Contents')

2.  Expand all sections ('View' > 'Expand Sections', if not greyed out)

(In Google Colab, equations will render properly.)



## Trade data



### Background



The product space, as well as derived measures such as economic complexity and the Growth Lab's annual rankings of countries by economic complexity (at [https://atlas.cid.harvard.edu](https://atlas.cid.harvard.edu)), is based on trade data between countries.

The Growth Lab maintains and periodically updates a cleaned version of trade data at [https://intl-atlas-downloads.s3.amazonaws.com/index.html](https://intl-atlas-downloads.s3.amazonaws.com/index.html).

This dataset contains bilateral trade data among 235 countries and territories in thousands of different product categories (a description of the data can be found at: [http://atlas.cid.harvard.edu/downloads](http://atlas.cid.harvard.edu/downloads)).

What does the data look like? We will explore the data in Python using 'pandas' (the most popular Python package for data analysis).



#### Footnote on trade and services (ICT, tourism, etc.):



-   As of September 2018, services and tourism are included in the Growth Lab's Atlas and trade data as well. See the announcement at:

[https://atlas.cid.harvard.edu/announcements/2018/services-press-release](https://atlas.cid.harvard.edu/announcements/2018/services-press-release)

Obtained from the IMF, trade in services covers four categories of economic activities between producers and consumers across borders:

-   services supplied from one country to another (e.g. call centers)
-   consumption in other countries (e.g. international tourism)
-   firms with branches in other countries (e.g. bank branches overseas)
-   individuals supplying services in another country (e.g. IT consultant abroad)



### Load necessary Python libraries



In [1]:
# -- Global settings
# - import the Python libraries necessary for this workshop
# - suppress warnings (e.g. on Google Colab) for now
import warnings
warnings.filterwarnings("ignore")
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import json
import networkx as nx
from itertools import count
from itertools import combinations
from itertools import product
import statsmodels.api as sm
# -- display floats with two decimals rather than in scientific (exponential) notation
pd.set_option('display.float_format', '{:.2f}'.format)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # Show the result of every expression in a cell, not only the last one
import seaborn as sns
sns.set_style('whitegrid') # Display grid lines on a white background
pd.set_option('display.max_columns', 500) # Broaden pandas display in jupyter console
pd.set_option('display.width', 100000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth',300)
print('necessary libraries loaded')

### Download trade dataset and load into memory



In [1]:
# Load the necessary data into a pandas 'dataframe' (df)
product_classification = 'hs' # Harmonized System 1992; alternative is 'SITC' (Standard International Trade Classification)
N_digits = '4' # alternative is 2 or 6, the higher the more detailed product info
data_url = f"https://intl-atlas-downloads.s3.amazonaws.com/country_{product_classification}product{N_digits}digit_year.csv.zip"
print('Downloading data and loading into memory')
df_orig = pd.read_csv(data_url, compression="zip", low_memory=False)

# Fix product label strings ('hs_product_name_short_en') (some products erroneously have duplicate strings: will contact Growth Lab's Atlas team)
# e.g. product codes 5209 and 5211 in Zimbabwe have same product string
import urllib.request, json
with urllib.request.urlopen("https://comtrade.un.org/data/cache/classificationH0.json") as url:
    hs1992_json = json.loads(url.read())
dft = pd.DataFrame.from_dict(hs1992_json['results'])[['text']]
dft['hs_product_code'] = dft['text'].str.split('-').str[0].str.strip()
dft['hs_product_name_short_en'] = dft['text'].str.split('-',n=1).str[1].str.strip() # 'n' must be passed by keyword in recent pandas versions
dft['N_dig'] = dft['hs_product_code'].str.len()
dft2 = dft[dft['N_dig']==int(N_digits)].copy()
df_orig = pd.merge(df_orig,dft2[['hs_product_code','hs_product_name_short_en']],how='left',on=f'hs_product_code') # unmerged are services (obtained from IMF)
# replace product name now with downloaded strings (if not missing in either)
df_orig['hs_product_name_short_en_new'] = df_orig['hs_product_name_short_en_x']
df_orig.loc[ df_orig['hs_product_name_short_en_y'].notnull(),'hs_product_name_short_en_new'] = df_orig['hs_product_name_short_en_y']
df_orig.drop(['hs_product_name_short_en_x'],axis=1,inplace=True,errors='ignore')
df_orig.drop(['hs_product_name_short_en_y'],axis=1,inplace=True,errors='ignore')
df_orig.rename(columns={f'hs_product_name_short_en_new':f'hs_product_name_short_en'}, inplace=True)

# Cross check that each row is a unique year-location-product entry
df_orig['count'] = 1
df_orig['sum'] = df_orig.groupby(['year','location_name_short_en','hs_product_name_short_en'])['count'].transform('sum')
if df_orig['sum'].max() != 1:
    raise ValueError('duplicate year-location-product rows found, stopping')

# Keep only relevant columns
df_orig = df_orig[['year',
         'location_code',
         'location_name_short_en',
         'hs_product_code',
         'hs_product_name_short_en',
         'export_value']]

print('trade dataset ready')

### Exploring the trade data



#### Structure of dataset



In [1]:
# show 5 random rows
df_orig.sample(n=5)

#### What years are in the data?



In [1]:
df_orig['year'].unique()

#### How many products are in the data?



In [1]:
df_orig['hs_product_name_short_en'].nunique()

#### Finding specific countries / products based on partial string matching



In [1]:
STRING = 'Netherland'
df_orig[df_orig['location_name_short_en'].str.contains(STRING)][['location_name_short_en']].drop_duplicates()

STRING = 'Wine'
df_orig[df_orig['hs_product_name_short_en'].str.contains(STRING)][['hs_product_name_short_en']].drop_duplicates()

#### Example: What were the major export products of the USA in 2012?



In [1]:
df2 = df_orig[ (df_orig['location_code']=='USA') & (df_orig['year'] == 2012) ].copy()
df3 = df2.groupby(['hs_product_code','hs_product_name_short_en'],as_index=False)['export_value'].sum()
df3.sort_values(by=['export_value'],ascending=False,inplace=True)
df3[0:10]

#### Example: How did exports of Cars evolve over time in the USA?



From about 10 billion USD to almost 60 billion USD.



In [1]:
df2 = df_orig[ (df_orig['location_code']=='USA')].copy()
#df3 = df2[df2['hs_product_name_short_en']=='Cars']
df3 = df2[df2['hs_product_code']=='8703']
df3.plot(x='year', y='export_value')

## Revealed comparative advantage (RCA)



What products are countries specialized in? For that, following Hidalgo et al. (2007), we calculate the Revealed Comparative Advantage (RCA) of each country-product pair: how much a country 'over-exports' a product in comparison to all other countries.

Technically this is the Balassa index of comparative advantage, calculated as follows for product $p$ and country $c$ at time $t$:

\begin{equation} \label{e_RCA}
{RCA}_{cpt}=\frac{X_{cpt}/X_{ct}}{X_{pt}/X_{t}}
\tag{1}
\end{equation}

where $X_{cpt}$ represents the total value of country $c$’s exports of product $p$ at time $t$ across all importers. An omitted subscript indicates a summation over the omitted dimension, e.g.: $X_{t}=\sum \limits_{c,p} X_{cpt}$.

A product-country pair with $RCA>1$ means that the product is over-represented in the country's export basket: its share in the country's exports exceeds its share in world trade.
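
To make equation (1) concrete, here is a minimal sketch on a made-up dataset with two countries and two products in a single year. The column names (`country`, `product`) and all values are illustrative assumptions, not taken from the Atlas data.

```python
import pandas as pd

# Hypothetical toy data: two countries, two products, one year
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'product': ['wine', 'cars', 'wine', 'cars'],
    'export_value': [80.0, 20.0, 20.0, 80.0],
})

# X_ct: total exports of each country
X_c = toy.groupby('country')['export_value'].transform('sum')
# X_pt: total world exports of each product
X_p = toy.groupby('product')['export_value'].transform('sum')
# X_t: total world trade
X = toy['export_value'].sum()

# Balassa index: share of the product in the country's exports,
# divided by the share of the product in world trade
toy['rca'] = (toy['export_value'] / X_c) / (X_p / X)
print(toy)
```

In this toy example wine makes up 80% of country A's exports but only 50% of world trade, so A's RCA in wine is 0.8 / 0.5 = 1.6, while its RCA in cars is 0.2 / 0.5 = 0.4.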

We use the original trade dataset ('df_orig') that is loaded into memory:



In [1]:
def calc_rca(data,country_col,product_col,time_col,value_col):
    """
    Calculates Revealed Comparative Advantage (RCA) of country-product-time combinations

    Returns:
        pandas dataframe with RCAs
    """

    # Aggregate to country-product-time dataframe
    print('creating all country-product-time combinations')
    # - add all possible products for each country with export value 0
    # - else matrices later on will have missing values in them, complicating calculations
    df_all = pd.DataFrame(list(product(data[time_col].unique(), data[country_col].unique(),data[product_col].unique())))
    df_all.columns=[time_col,country_col,product_col]
    print('merging data in')
    df_all = pd.merge(df_all,data[[time_col,country_col,product_col,value_col]],how='left',on=[time_col,country_col,product_col])
    df_all.loc[df_all[value_col].isnull(),value_col] = 0

    # Calculate the properties
    print('calculating properties')
    df_all['Xcpt'] = df_all[value_col]
    df_all['Xct'] = df_all.groupby([country_col, time_col])[value_col].transform('sum')
    df_all['Xpt'] = df_all.groupby([product_col, time_col])[value_col].transform('sum')
    df_all['Xt'] = df_all.groupby([time_col])[value_col].transform('sum')
    df_all['RCAcpt'] = (df_all['Xcpt']/df_all['Xct'])/(df_all['Xpt']/df_all['Xt'])
    df_all.drop(['Xcpt','Xct','Xpt','Xt'],axis=1,inplace=True,errors='ignore')

    return df_all

df_rca = calc_rca(data=df_orig,country_col='location_name_short_en',product_col='hs_product_name_short_en',time_col='year',value_col='export_value')

print('rca dataframe ready')

# show results
df_rca[0:10]

### Example: What products are The Netherlands and Saudi Arabia specialized in, in 2000?



In [1]:
# The Netherlands
print("\n The Netherlands: \n")

df_rca[ (df_rca['year']==2000) & (df_rca['location_name_short_en']=='Netherlands')].sort_values(by=['RCAcpt'],ascending=False)[['hs_product_name_short_en','RCAcpt','year']][0:5]

print("\n Saudi Arabia:\n")

# Saudi Arabia
df_rca[ (df_rca['year']==2000) & (df_rca['location_name_short_en']=='Saudi Arabia')].sort_values(by=['RCAcpt'],ascending=False)[['hs_product_name_short_en','RCAcpt','year']][0:5]

## Product proximity (based on co-occurrences)



### Calculating product co-occurrences



Knowing which countries are specialized in which products, the next step analyzes the extent to which two products are over-represented ($RCA \geq 1$) in the same countries.

As noted in the lecture, the main insight behind this step is that countries will produce combinations of products that require similar capabilities.

We infer capabilities from trade patterns because a country's capabilities are hard to determine a priori and capabilities themselves are hard to observe.

Hence, **the degree to which two products co-occur in the export baskets of the same countries provides an indication of how similar the capability requirements of the two products are**.

We will calculate the co-occurrence matrix of products below.

First, a product is 'present' in a country if the country exports the product with $RCA \geq 1$:

\begin{equation} \label{e_presence}
P_{cpt}=\begin{cases}
    1 & \text{if } {RCA}_{cpt} \geq 1; \\
    0 & \text{otherwise.}
    \end{cases}
\tag{2}
\end{equation}



In [1]:
df_rca['Pcpt'] = 0
df_rca.loc[df_rca['RCAcpt']>=1,'Pcpt'] = 1

Next, we calculate how often two products are present in the same countries, using the presence indicator $P_{cp}$ (the time subscript is dropped because we use a single year):

\begin{equation} \label{e_cooc}
C_{pp'}=\sum \limits_{c} P_{cp} P_{cp'}
\tag{3}
\end{equation}
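
As a small illustration of equation (3): if the presence indicators are arranged in a country-by-product matrix, the co-occurrence counts are the entries of the matrix product of its transpose with itself. The sketch below uses a made-up presence matrix; the country and product labels are hypothetical. It is an alternative way to see the formula, not the pair-enumeration approach used in the lab code.

```python
import numpy as np
import pandas as pd

# Hypothetical presence matrix P (rows: countries, columns: products)
P = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]],
    index=['A', 'B', 'C'],
    columns=['wine', 'cheese', 'cars'],
)

# C[p, p'] = sum over countries c of P[c, p] * P[c, p']
C = P.T.dot(P)

# The diagonal counts in how many countries each product is present
ubiquity = pd.Series(np.diag(C), index=C.index)

print(C)
print(ubiquity)
```

Wine and cheese are both present in countries A and B, so their co-occurrence count is 2; wine and cars co-occur only in country B. Note that the diagonal holds each product's own presence count rather than a co-occurrence between two distinct products.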

We will use the first year of data in the dataset, **1995,** below.

Note that to reduce yearly volatility, Hidalgo et al. (2007) aggregate the trade data across multiple years (1998-2000) when calculating RCAs and product proximities for the product space. (However, when comparing the product space across years, they do use individual years.)
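
A minimal sketch of such multi-year pooling is shown here, on made-up data. Summing export values over the window is an assumption made for illustration (the exact aggregation used in the paper is not specified here), and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical exports of one country over three years
toy = pd.DataFrame({
    'year': [1998, 1999, 2000, 1998, 1999, 2000],
    'country': ['A'] * 6,
    'product': ['wine'] * 3 + ['cars'] * 3,
    'export_value': [10.0, 30.0, 20.0, 5.0, 5.0, 5.0],
})

# Keep the years in the window and sum export values per country-product pair
window = toy[toy['year'].between(1998, 2000)]
pooled = window.groupby(['country', 'product'], as_index=False)['export_value'].sum()

# Label the pooled observation with a single (arbitrary) year so that it can be
# passed to a function that expects a time column
pooled['year'] = 2000
print(pooled)
```

The pooled dataframe has one row per country-product pair, so a single year with an unusually high or low export value has less influence on the resulting RCA.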



In [1]:
def calc_cpp(data,country_col,product_col):
    """
    Calculates product co-occurrences in countries

    Returns:
        pandas dataframe with co-occurrence value for each product pair
    """

    # Create combinations within country_col (i.e. countries) of entities (i.e. products)
    dft = (data.groupby(country_col)[product_col].apply(lambda x: pd.DataFrame(list(combinations(x,2))))
            .reset_index(level=1, drop=True)
            .reset_index())
    dft.rename(columns={0:f'product_1'}, inplace=True)
    dft.rename(columns={1:f'product_2'}, inplace=True)

    # Create second half of matrix (the matrix is symmetric):
    # product 1 X product 2 == product 2 X product 1
    dft2 = dft.copy()
    dft2.rename(columns={f'product_1':f'product_2t'}, inplace=True)
    dft2.rename(columns={f'product_2':f'product_1'}, inplace=True)
    dft2.rename(columns={f'product_2t':f'product_2'}, inplace=True)
    dft3 = pd.concat([dft,dft2],axis=0,sort=False)

    # Now calculate N of times that products occur together
    dft3['count'] = 1
    dft3 = dft3.groupby(['product_1','product_2'],as_index=False)['count'].sum()
    dft3.rename(columns={f'count':f'Cpp'}, inplace=True)

    return dft3

# Keep only year 1995
dft = df_rca[df_rca['year']==1995].copy()

# Keep only country-product combinations where Pcp == 1 (thus RCAcp >= 1)
dft = dft[dft['Pcpt']==1]

# Calculate cpp
df_cpp = calc_cpp(dft,country_col='location_name_short_en',product_col='hs_product_name_short_en')

print('cpp product co-occurrences dataframe ready')

### Products that co-occur most often



In [1]:
# -- show products that co-occur most often
df_cpp.sort_values(by=['Cpp'],ascending=False,inplace=True)
df_cpp[0:10]

### Normalize product co-occurrences (cpp) as in Hidalgo et al. 2007



To get an accurate value of product proximity, we need to correct these numbers for how often each product is present in countries' export baskets in general.

To do so, Hidalgo et al. (2007) define product proximity as the minimum of two conditional probabilities:

\begin{equation}
\phi_{pp'}  = \min \left( \frac{C_{pp'}}{C_{p}},\frac{C_{pp'}}{C_{p'}} \right)
\tag{4}
\end{equation}

where $C_{p}=\sum \limits_{c} P_{cp}$ is the number of countries in which product $p$ is present (the 'ubiquity' of the product).

Taking the minimum makes the measure symmetric and eliminates a 'false positive': if a product is exported by only one country, the probability of exporting any other product of that country, conditional on exporting that product, would equal 1.

Hence we correct for how prevalent specialization in product $p$ and product $p'$ is across countries.
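
The normalization in equation (4) can be sketched on a small made-up co-occurrence matrix. Because both fractions share the same numerator, the minimum of the two conditional probabilities equals the co-occurrence count divided by the larger of the two ubiquities. The matrix values are hypothetical, and storing the ubiquities on the diagonal is a convenience of this sketch.

```python
import numpy as np

# Hypothetical co-occurrence matrix for three products;
# the diagonal holds each product's ubiquity C_p
C = np.array([[2, 2, 1],
              [2, 3, 2],
              [1, 2, 2]], dtype=float)
ubiquity = np.diag(C).copy()

# min(C_pp'/C_p, C_pp'/C_p') == C_pp' / max(C_p, C_p')
phi = C / np.maximum.outer(ubiquity, ubiquity)

# Proximity of a product to itself is not meaningful; set it to 0
np.fill_diagonal(phi, 0.0)

print(phi.round(3))
```

For example, the first two products co-occur in 2 countries and have ubiquities 2 and 3, so their proximity is min(2/2, 2/3) = 2/3.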



In [1]:
# We calculate the ubiquity of each product and add it to the cpp matrix, then take the minimum of conditional probabilities

# again we use the year 1995 here
dft = df_rca[df_rca['year']==1995].copy()
df_ub = dft.groupby(['hs_product_name_short_en'],as_index=False)['Pcpt'].sum()
df_ub.rename(columns={f'hs_product_name_short_en':f'product_1'}, inplace=True)
# merge ubiquity into cpp matrix
df_cppt = pd.merge(df_cpp,df_ub,how='left',on=f'product_1')
df_ub.rename(columns={f'product_1':f'product_2'}, inplace=True)
df_cppt = pd.merge(df_cppt,df_ub,how='left',on=f'product_2')

# take minimum of conditional probabilities
df_cppt['kpi'] = df_cppt['Cpp']/df_cppt['Pcpt_x']
df_cppt['kpj'] = df_cppt['Cpp']/df_cppt['Pcpt_y']
df_cppt['phi'] = df_cppt['kpi']
df_cppt.loc[df_cppt['kpj']<df_cppt['kpi'],'phi'] = df_cppt['kpj']

# show most proximate products
df_cppt.sort_values(by=['phi'],ascending=False,inplace=True)
df_cppt[0:10]

## Product space



### Overview



We now have a measure of similarity between products, which is the core of the product space.

[https://atlas.cid.harvard.edu/explore/network?country=114&year=2018&productClass=HS&product=undefined&startYear=undefined&target=Product&partner=undefined](https://atlas.cid.harvard.edu/explore/network?country=114&year=2018&productClass=HS&product=undefined&startYear=undefined&target=Product&partner=undefined)


![img](https://www.dropbox.com/s/izag1xf28yldanf/product_space_atlas_website.png?dl=1)

Below we will explore the product space using Python. The GitHub repo for this is available at [https://github.com/matteha/py-productspace](https://github.com/matteha/py-productspace).

What we need is information on:

-   Edges (ties) between nodes
    
    Ties between nodes represent the product proximity calculated above. Each product pair has a proximity value, but visualizing all ties would result in a major "hairball".
    
    To determine which of the ties to visualize in the product space, a 'maximum spanning tree' algorithm is used (to make sure all nodes are connected directly or indirectly) in conjunction with a certain proximity threshold (0.55 minimum conditional probability). The details can be found in the Supplementary Material of Hidalgo et al. (2007) at [https://science.sciencemag.org/content/suppl/2007/07/26/317.5837.482.DC1](https://science.sciencemag.org/content/suppl/2007/07/26/317.5837.482.DC1).

-   Position of nodes
    -   Each node is a product
    
    -   To position them in the product space, Hidalgo et al. (2007) used a spring embedding algorithm (a force-directed algorithm that simulates physical forces to position the nodes so that there are as few crossing ties as possible), followed by hand-crafting the outcome to further visually separate distinct 'clusters' of products.
        
        We will use this fixed layout for now (Yang Li's workshop will deal with different ways to visualize multi-dimensional data in 2D/3D, e.g. with machine learning, UMAP).
        
        The data on the position of nodes (x, y coordinates) is available in the Atlas data repository at:
        [https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/FCDZBN/QSEETD&version=1.1](https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/FCDZBN/QSEETD&version=1.1)
        
        We can directly load it into Python using the link below (temporarily, for this workshop; when using Harvard's Dataverse you need to sign a short User Agreement form, so you cannot load the data directly from a URL):
        
        [https://www.dropbox.com/s/r601tjoulq1denf/network_hs92_4digit.json?dl=1](https://www.dropbox.com/s/r601tjoulq1denf/network_hs92_4digit.json?dl=1)

-   Size of nodes
    
    The size of a node in the product space represents the product's total value (in USD) in world trade, but one can also use other attributes of nodes (e.g. if nodes are industries, the size could be total employment).

-   Color of nodes
    
    In the product space the node color represents major product groups (e.g. Agriculture, Chemicals) following the Leamer classification. The node coloring data is available in the Atlas data repository at:
    [https://dataverse.harvard.edu/dataverse/atlas?q=&types=files&sort=dateSort&order=desc&page=1](https://dataverse.harvard.edu/dataverse/atlas?q=&types=files&sort=dateSort&order=desc&page=1)
    
    We can directly load it into Python using the link below (again, temporary for this workshop):
    
    [https://www.dropbox.com/s/rlm8hu4pq0nkg63/hs4_hex_colors_intl_atlas.csv?dl=1](https://www.dropbox.com/s/rlm8hu4pq0nkg63/hs4_hex_colors_intl_atlas.csv?dl=1)
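
The edge selection and node positioning described in the list above can be sketched with networkx on a made-up five-product network. The proximity values, product labels and variable names are hypothetical; the 0.55 threshold is the one mentioned above. This is a simplified sketch of the idea, not a reproduction of the exact procedure or the hand-tuned layout of Hidalgo et al. (2007).

```python
import networkx as nx

# Hypothetical proximities between five products
proximities = {
    ('a', 'b'): 0.8, ('a', 'c'): 0.6, ('b', 'c'): 0.7,
    ('c', 'd'): 0.3, ('b', 'd'): 0.2, ('d', 'e'): 0.4, ('a', 'e'): 0.1,
}
G_full = nx.Graph()
for (u, v), w in proximities.items():
    G_full.add_edge(u, v, weight=w)

# Step 1: the maximum spanning tree keeps every node connected
# using the strongest possible ties
backbone = nx.maximum_spanning_tree(G_full, weight='weight')

# Step 2: add all remaining ties with proximity at or above the threshold
threshold = 0.55
G_ps = backbone.copy()
G_ps.add_edges_from(
    (u, v, d) for u, v, d in G_full.edges(data=True) if d['weight'] >= threshold
)

# Step 3: position the nodes with a force-directed (spring) layout
pos = nx.spring_layout(G_ps, seed=42)

print(sorted(G_ps.edges()))
```

The weak ties c-d and d-e survive only because the spanning tree needs them to keep products d and e connected, while a-c is added back because its proximity exceeds the threshold.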



### Product space in Python



#### Function to create product space



The function below creates the product space. It uses the networkx package which Sultan elaborated on in the first Python session.



In [1]:
def create_product_space(df_plot_dataframe=None,
                         df_plot_node_col=['node'],
                         df_node_size_col=None,
                         show_legend = 0):

    # Copy dataframe so original won't be overwritten
    df_plot =  df_plot_dataframe.copy()

    NORMALIZE_NODE_SIZE = 1
    if NORMALIZE_NODE_SIZE == 1:

        """
        The distribution of export values is highly skewed, which makes it hard to visualize
        them properly (certain products will overshadow the rest of the network).


        We create a new size column below in which we normalize the export values.
        """

        ### Normalize node size (0.1 to 1)

        def normalize_col(dft,col,minsize=0.1,maxsize=1):
            """
            Normalizes column values with largest and smallest values capped at min at max
            For use in networkx

            returns pandas column
            """

            alpha = maxsize-minsize
            Xl = dft[dft[col]>0][col].quantile(0.10)
            Xh = dft[dft[col]>0][col].quantile(0.95)
            dft['node_size'] = 0.0
            dft.loc[ dft[col]>=Xh,'node_size'] = maxsize
            dft.loc[ (dft[col]<=Xl) & (dft[col]!=0),'node_size'] = minsize
            # linear interpolation between minsize and maxsize for values strictly between Xl and Xh
            dft.loc[ (dft[col]<Xh) & (dft[col]>Xl),'node_size'] = ((alpha*(dft[col]-Xl))/(Xh-Xl))+minsize

            return dft['node_size']

        df_plot['node_size'] = normalize_col(df_plot,df_node_size_col,minsize=0.1,maxsize=1)

    ADD_COLORS_ATLAS = 1
    if ADD_COLORS_ATLAS == 1:

        # First add product codes from original file (full strings were used for illustrative purposes above but we need the actual codes to merge data from other sources, e.g. node colors)
        df_plot = pd.merge(df_plot,df_orig[['hs_product_name_short_en','hs_product_code']].drop_duplicates(),how='left',on='hs_product_name_short_en')
        dft = pd.read_csv('https://www.dropbox.com/s/rlm8hu4pq0nkg63/hs4_hex_colors_intl_atlas.csv?dl=1')

        # Transform hs_product_code into int (accounts for missing in pandas, if necessary)
        # keep only numeric hs_product_codes (this drops 'unspecified' as well as services for now;
        # - as the latter needs a separate color classification)
        df_plot = df_plot[df_plot['hs_product_code'].astype(str).str.isnumeric()]
        # -- also drop 9999 product code; unknown
        df_plot = df_plot[df_plot['hs_product_code'].astype(str)!='9999']
        # -- to allow merge, rename and transform both variables into int
        dft['hs4'] = dft['hs4'].astype(int)
        df_plot['hs_product_code'] = df_plot['hs_product_code'].astype(int)
        if 'color' in df_plot.columns:
            df_plot.drop(['color'],axis=1,inplace=True,errors='ignore')
        df_plot = pd.merge(df_plot,dft[['hs4','color']],how='left',left_on='hs_product_code',right_on='hs4')
        # drop column merged from dft
        df_plot.drop(['hs4'],axis=1,inplace=True,errors='ignore')

        CREATE_LEGEND = 1
        if CREATE_LEGEND == 1:
            # Atlas classification products
            df_temp = pd.read_csv('https://raw.githubusercontent.com/cid-harvard/classifications/master/product/HS/IntlAtlas/out/hs92_atlas.csv')
            df_temp.rename(columns={'Unnamed: 0': 'internal_code'}, inplace=True)
            # -- keep sections
            df_temp1 = df_temp[df_temp['level']=='section'].copy()
            # -- keep 2 digit
            df_temp2 = df_temp[df_temp['level']=='2digit'].copy()
            # -- keep 4 digit
            df_temp3 = df_temp[df_temp['level']=='4digit'].copy()
            # -- merge parent id of parent id of 4 digits (= 2digit)
            # ---- cast both merge keys to object so their dtypes match
            df_temp3['parent_id'] = df_temp3['parent_id'].astype(object)
            df_temp2['internal_code'] = df_temp2['internal_code'].astype(object)
            df_temp3t = pd.merge(df_temp3,df_temp2[['internal_code','parent_id']],how='left',left_on='parent_id',right_on='internal_code',indicator=True)
            # -- now merge parent_id_y to internal code of df_temp1
            df_temp3t.drop(['_merge'],axis=1,inplace=True)
            df_temp3t2 = pd.merge(df_temp3t,df_temp1[['internal_code','name']],how='left',left_on='parent_id_y',right_on='internal_code',indicator=True)
            # keep only relevant columns
            df_temp4 = df_temp3t2[['code','name_x','name_y']]
            df_temp5 = df_temp4[['code','name_y']]
            df_temp5.rename(columns={'code': 'product'}, inplace=True)
            df_temp5.rename(columns={'name_y': 'name_sector_atlas'}, inplace=True)
            # drop XXXX / services (not in product space)
            drop_categories = ['XXXX','unspecified','travel','transport','ict','financial']
            df_temp5 = df_temp5[ ~(df_temp5['product'].isin(drop_categories))]

            # add to df_temp_plot
            df_temp_plott = df_plot.copy()
            df_temp_plott['hs_product_code'] = df_temp_plott['hs_product_code'].astype(float)
            df_temp5['product'] = df_temp5['product'].astype(float)
            df_temp_plot3 = pd.merge(df_temp_plott,df_temp5,how='left',left_on='hs_product_code',right_on='product')
            df_temp_plot3.drop_duplicates(subset='color',inplace=True)
            df_temp_plot3 = df_temp_plot3[['color','name_sector_atlas']]

            # create color dict for legend
            color_dict = {}
            df_temp_plot3.reset_index(inplace=True,drop=True)
            for ind, row in df_temp_plot3.iterrows():
                color_dict[row['name_sector_atlas']] = row['color']

            """
            def build_legend(data):
                # Build a legend for matplotlib plt from dict
                legend_elements = []
                for key in data:
                    legend_elements.append(Line2D([0], [0], marker='o', color='w', label=key,
                                                    markerfacecolor=data[key], markersize=10))
                return legend_elements
            fig,ax = plt.subplots(1)
            #ax.add_patch(rect) # Add the patch to the Axes
            legend_elements = build_legend(color_dict)
            ax.legend(handles=legend_elements, loc='upper left')
            plt.show()
            """

    ADD_NODE_POSITIONS_ATLAS = 1
    if ADD_NODE_POSITIONS_ATLAS == 1:

        # Load position of nodes (x, y coordinates of nodes from original Atlas file)
        import urllib.request, json
        with urllib.request.urlopen("https://www.dropbox.com/s/r601tjoulq1denf/network_hs92_4digit.json?dl=1") as url:
            networkjs = json.loads(url.read().decode())

    CREATE_NETWORKX_OBJECT_AND_PLOT = 1
    if CREATE_NETWORKX_OBJECT_AND_PLOT == 1:

        # Convert json into python list and dictionary
        nodes = []
        nodes_pos = {}
        for x in networkjs['nodes']:
            nodes.append(int(x['id']))
            nodes_pos[int(x['id'])] = (int(x['x']),-int(x['y']/1.5))

        # Define product space edge list (all edges from the json are kept: the condition below is always true)
        edges = []
        for x in networkjs['edges']:
            if x['strength'] > 1 or 1 == 1:
                edges.append((int(x['source']),int(x['target'])))
        dfe = pd.DataFrame(edges)
        dfe.rename(columns={0: 'src'}, inplace=True)
        dfe.rename(columns={1: 'trg'}, inplace=True)

        # Only select edges of nodes that are also present in product space
        dfe2 = pd.DataFrame(np.append(dfe['src'].values,dfe['trg'].values)) # (some products may not be in there)
        dfe2.drop_duplicates(inplace=True)
        dfe2.rename(columns={0: 'node'}, inplace=True)
        dfn2 = pd.merge(df_plot,dfe2,how='left',left_on=df_plot_node_col,right_on='node',indicator=True)

        # Drop products from this dataframe that are not in product space
        dfn2 = dfn2[dfn2['_merge']=='both']

        # Create networkx objects in Python

        # G object = products that will be plotted
        G=nx.from_pandas_edgelist(dfn2,'hs_product_code','hs_product_code')

        # G2 object = all nodes and edges from the original product space
        # - Those that are not plotted will be gray in the background,
        # - e.g. products for which there is no info
        G2=nx.from_pandas_edgelist(dfe,'src','trg')

        # Add node attributes to networkx objects
        # - Create a 'present' variable which indicates that these products are present in product space,
        # - as not all products in product space are present in the data to be plotted
        # - (e.g. because we could filter only to plot products with more than >$40 million in trade)
        df_plot['present'] = 1
        ATTRIBUTES = ['node_size'] + ['color'] + ['present']
        for ATTRIBUTE in ATTRIBUTES:
            dft = df_plot[[df_plot_node_col,ATTRIBUTE]]
            dft['count'] = 1
            dft = dft.groupby([df_plot_node_col,ATTRIBUTE],as_index=False)['count'].sum()
            #** drop if missing , and drop duplicates
            dft.dropna(inplace=True)
            dft.drop(['count'],axis=1,inplace=True)
            dft.drop_duplicates(subset=[df_plot_node_col,ATTRIBUTE],inplace=True)
            dft.set_index(df_plot_node_col,inplace=True)
            dft_dict = dft[ATTRIBUTE].to_dict()
            for i in sorted(G.nodes()):
                try:
                    #G.node[i][ATTRIBUTE] = dft_dict[i]
                    G.nodes[i][ATTRIBUTE] = dft_dict[i]
                except Exception:
                    #G.node[i][ATTRIBUTE] = 'Missing'
                    G.nodes[i][ATTRIBUTE] = 'Missing'
            for i in sorted(G2.nodes()):
                try:
                    #G2.node[i][ATTRIBUTE] = dft_dict[i]
                    G2.nodes[i][ATTRIBUTE] = dft_dict[i]
                except Exception:
                    #G2.node[i][ATTRIBUTE] = 'Missing'
                    G2.nodes[i][ATTRIBUTE] = 'Missing'

        # Cross-check that attributes have been added correctly
        # nx.get_node_attributes(G2,df_color)
        # nx.get_node_attributes(G,df_color)

        # Create color + size lists which networkx uses for plotting
        groups = set(nx.get_node_attributes(G2,'color').values())
        mapping = dict(zip(sorted(groups),count()))
        nodes = G.nodes()
        nodes2 = G2.nodes()
        #colorsl = [G.node[n]['color'] for n in nodes]
        colorsl = [G.nodes[n]['color'] for n in nodes]
        #colorsl2 = [G2.node[n]['color'] for n in nodes2]
        colorsl2 = [G2.nodes[n]['color'] for n in nodes2]
        SIZE_VARIABLE = 'node_size'
        #sizesl = [G.node[n][SIZE_VARIABLE] for n in nodes]
        sizesl = [G.nodes[n][SIZE_VARIABLE] for n in nodes]
        # Adjust value below to increase the PLOTTED size of nodes, depending on the resolution of the final plot
        # (e.g. if you want to zoom in into the product space and thus set a higher resolution, you may want to set this higher)
        #sizesl2 = [G.node[n]['node_size']*350 for n in nodes]
        sizesl2 = [G.nodes[n]['node_size']*350 for n in nodes]

        # Create matplotlib object to draw the product space
        f = plt.figure(1,figsize=(20,20))
        ax = f.add_subplot(1,1,1)

        # turn axes off
        plt.axis('off')

        # set white background
        f.set_facecolor('white')

        # draw full product space in background, transparent with small node_size
        nx.draw_networkx(G2,nodes_pos, node_color='gray',ax=ax,node_size=10,with_labels=False,alpha=0.1)

        # draw the data fed in to the product space
        nx.draw_networkx(G,nodes_pos, node_color=colorsl,ax=ax,node_size=sizesl2,with_labels=False,alpha=1)

        # save file
        # plt.savefig(output_dir_image)

        # show the plot
        plt.show()

        if show_legend == 1:
            # show legend as well (Line2D is needed to build the legend handles)
            from matplotlib.lines import Line2D
            def build_legend(data):
                # Build a legend for matplotlib plt from dict
                legend_elements = []
                for key in data:
                    legend_elements.append(Line2D([0], [0], marker='o', color='w', label=key,
                                                    markerfacecolor=data[key], markersize=10))
                return legend_elements
            fig,ax = plt.subplots(1)
            #ax.add_patch(rect) # Add the patch to the Axes
            legend_elements = build_legend(color_dict)
            ax.legend(handles=legend_elements, loc='upper left')
            plt.show()

print('defined product space function, ready to plot')

#### Visualizing data in the product space



First we select the country we wish to visualize. We&rsquo;ll search for Saudi Arabia below, using the string &rsquo;audi&rsquo; to find out how the country is spelled in the dataset, and then use that country name when defining the dataframe for the product space (&rsquo;df_ps&rsquo;).



In [1]:
# Find out which 'location_name_short_en' value corresponds to Saudi Arabia
STRING = 'audi'
df_rca[df_rca['location_name_short_en'].str.contains(STRING)][['location_name_short_en']].drop_duplicates()
# result: 'Saudi Arabia'

##### Country, RCA, year, export value selections



Next we define which trade properties of Saudi Arabia we want to visualize. The example below visualizes specialization in 2005 (year=2005, RCAcpt>=1), keeping only those products with more than 40 million in export value.

This is outside of the product space function so you can inspect the dataframe before plotting.



In [1]:
# Select country
COUNTRY_STRING = 'Saudi Arabia'
df_ps = df_rca[df_rca['location_name_short_en']==COUNTRY_STRING].copy()

# Cross-check
if df_ps.shape[0] == 0:
    raise ValueError('Country string set above does not exist in data, typed correctly?')

# Select year
df_ps = df_ps[df_ps['year']==2005].copy()

# Select RCA >= 1
df_ps = df_ps[df_ps['RCAcpt']>=1]

# Keep only relevant columns (including 'hs_product_code', which is passed to the plotting function below)
df_ps = df_ps[['hs_product_name_short_en','hs_product_code','export_value']]

# Keep only products with minimum value threshold
exports_min_threshold = 40000000
df_ps = df_ps[df_ps['export_value']>exports_min_threshold]

# Show resulting dataframe
df_ps.sample(n=5)

print('ready to plot in product space')

In [1]:
# Plot in the product space
create_product_space(df_plot_dataframe=df_ps,
                     df_plot_node_col='hs_product_code',
                     df_node_size_col='export_value',
                     show_legend = 0)

##### Product space with legend



Below is a legend of the product space. The &rsquo;create_product_space&rsquo; function also has a &rsquo;show_legend&rsquo; option, but that option still needs to be updated.

![img](https://www.dropbox.com/s/lf4lf8ktqahnovg/Selection_032.png?dl=1)

To see exactly which node represents which product, use the Atlas for now and hover with the mouse over a node:

[https://atlas.cid.harvard.edu/explore/network?country=186&year=2018&productClass=HS&product=undefined&startYear=undefined&target=Product&partner=undefined](https://atlas.cid.harvard.edu/explore/network?country=186&year=2018&productClass=HS&product=undefined&startYear=undefined&target=Product&partner=undefined)

(This can also be done in Python by exporting the plot to HTML instead of an image, but that is not implemented above yet.)



## ------------- Break: Assignment 1 -------------



### What product does Ukraine export most in 1995? (excluding services such as &rsquo;transport&rsquo;, &rsquo;ict&rsquo; etc)



### What products is Ukraine specialized in in 1995 and 2005 and how much do they export of these?



### Which product is most related to the product &rsquo;Stainless steel wire&rsquo;?



### Plot Ukraine in the product space in 1995.



How would you characterize Ukraine&rsquo;s position in the product space?



### Plot Ukraine in the product space in 2015.



Do you notice a difference with 1995?



### Plot your own country across different years in the product space. Do the results make sense? Do you notice any patterns?



## Predicting diversification of countries: densities / density regressions



Does a country&rsquo;s position in the product space predict which products it diversifies into in the future? It does, according to Hidalgo et al. (2007) and many studies that followed. If a product is in close proximity to a country&rsquo;s current export basket, the country is more likely to diversify into that product: monkeys can only jump to nearby branches of a tree.

(You can also see this for yourself using the code above, by plotting the same country in the product space across subsequent years. This is best done using the SITC trade classification, which goes back further in time (into the 1970s), rather than the HS classification used above, which starts only in 1995.)

To test this empirically, one can run so-called &rsquo;density regressions&rsquo;.

For every possible country-product combination, you calculate how proximate the country&rsquo;s existing product portfolio is to that product (using the product proximities calculated earlier). This measure is referred to as &rsquo;density&rsquo;.

You then test whether density predicts whether country-product combinations that were not present in $t$ are actually present in $t + 1$.

We will do so below.



### Prepare product-product-proximity matrix, all years



First we create a matrix of all possible product combinations across all years and add proximities to it. We take the product proximity matrix created earlier from 1995 data and make sure it also contains the products that were not present at all in 1995 (their proximities are set to 0), to avoid calculation problems with missing values later.



In [1]:
# Create product * product matrix and add proximity for each product
# -- in long format
products = df_rca['hs_product_name_short_en'].unique()
combs = list(combinations(products,2))
df_pp = pd.DataFrame(combs,columns=['product_1','product_2'])
# df_pp.shape # N * (N - 1) / 2 rows at this point (unordered pairs)

# make it symmetric: add every pair in reverse order as well (N * (N - 1) rows in total)
dft = df_pp.rename(columns={'product_1':'product_2','product_2':'product_1'})
df_pp = pd.concat([df_pp,dft],axis=0)

# add proximities
df_pp = pd.merge(df_pp,df_cppt[['product_1','product_2','phi']],how='left',on=[f'product_1','product_2'])

# set proximity to 0 if missing (preferably all products are in matrix)
df_pp.loc[df_pp['phi'].isnull(),'phi'] = 0

print('product-product proximity matrix for all years ready')
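
As a quick sanity check on the pair construction above, here is a minimal sketch on three made-up product names (placeholders, not taken from the dataset). It shows that the unordered pairs from `combinations` plus their swapped versions give every ordered pair exactly once, which is the same set that `itertools.permutations` produces directly.

```python
from itertools import combinations, permutations

# Hypothetical product names, for illustration only
toy_products = ['Wine', 'Cars', 'Shirts']

# Unordered pairs: N * (N - 1) / 2 of them
pairs = list(combinations(toy_products, 2))

# Add the swapped version of each pair so both (a, b) and (b, a) are present
ordered_pairs = pairs + [(b, a) for (a, b) in pairs]

print(len(pairs))          # N * (N - 1) / 2
print(len(ordered_pairs))  # N * (N - 1)

# Same set of ordered pairs as itertools.permutations gives directly
same_as_permutations = set(ordered_pairs) == set(permutations(toy_products, 2))
print(same_as_permutations)
```

Having both orderings in the long table means that filtering on &rsquo;product_1&rsquo; alone is enough to find all neighbours of a product.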

### Calculate density



Next, for each country, we take the portfolio of &rsquo;underdeveloped&rsquo; (or &rsquo;not present&rsquo;) products in $t$ (1996 in the example below).

Following Hidalgo et al (2007) we define:

-   &rsquo;underdeveloped&rsquo; as those country-product combinations with an RCA < 0.5.
-   &rsquo;developed&rsquo; as those country-product combinations with an RCA > 1.

(Hidalgo et al refer to those with an RCA between 0.5 and 1 as &rsquo;inconclusive&rsquo;.)

We then use this information in conjunction with the product proximity matrix above to calculate density following Hidalgo et al:

\begin{equation} \label{density}
\omega_{cj} = \frac{
\sum \limits_{i} \chi_{i} \phi_{ij}
}
{\sum \limits_{i} \phi_{ij} }
\tag{7}
\end{equation}

where $\omega_{cj}$ is the density around product $j$ for country $c$, $\chi_{i} = 1$ if the country has an RCA >= 1 in product $i$ and 0 otherwise, and $\phi_{ij}$ is the proximity between products $i$ and $j$, taken from the matrix of conditional proximities that we created earlier.

A density of 0.44 implies that 44% of the (proximity-weighted) neighbouring space of the product is developed in the country.

For each product-country combination that is &rsquo;underdeveloped&rsquo;, we will create a density value below.
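
To make equation (7) concrete, here is a minimal sketch on a made-up 3-product proximity matrix (all numbers are invented for illustration). It computes the density around every product with one matrix product; the function further below does the same in long format with pandas and keeps only the underdeveloped products.

```python
import numpy as np

# Made-up symmetric proximity matrix phi[i, j] for three products (diagonal set to 0)
phi = np.array([[0.0, 0.6, 0.2],
                [0.6, 0.0, 0.4],
                [0.2, 0.4, 0.0]])

# chi[i] = 1 if the country has RCA >= 1 in product i, else 0 (made-up values)
chi = np.array([1, 0, 0])

# Equation (7): density around j = sum_i chi_i * phi_ij / sum_i phi_ij
density = (chi @ phi) / phi.sum(axis=0)
print(density.round(3))
```

In this toy case the second product has a total proximity of 1.0 to its neighbours, of which 0.6 is to the one product the country has developed, so its density is 0.6.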



In [1]:
def calc_density_hidalgo_et_al(rca_dataframe=None,
                 region_col = None,
                 product_col = None,
                 rca_col = None,
                 year = None,
                 underdeveloped_maximum_rca_threshold = 0.5,
                 developed_minimum_rca_threshold = 1):

    """
    Calculates densities for product-country combinations.

    Returns a pandas dataframe.

    """

    # Keep only country-product information in year specified
    df_d = rca_dataframe[rca_dataframe['year']==year].copy()

    # drop if countries have 0 exports in whole year
    # - don't calculate densities for these: error in data
    df_d['exports_sum'] = df_d.groupby([region_col])['export_value'].transform('sum')
    df_d = df_d[df_d['exports_sum']!=0]

    # Keep only necessary columns
    df_d = df_d[[region_col,product_col,rca_col]]

    # This will be the country-product density dataframe to which densities are appended below
    df_cpd = pd.DataFrame()

    # We will loop over regions (countries) to save memory
    REGIONS = df_d[region_col].unique()
    indexL = 10
    for index,REGION in enumerate(REGIONS):
        if index == indexL:
            print(f'Done region {index} out of {len(REGIONS)}')
            indexL = indexL + 10
        df_dc = df_d[df_d[region_col]==REGION].copy()

        # For the sake of completion: we want to add all products to the matrix: products that
        # have not (yet) been present in a country are now not in the rca matrix
        # We thus use the pp-matrix for this
        products_exclude_from_not_developed = df_dc[df_dc[rca_col]>=underdeveloped_maximum_rca_threshold][product_col].unique()
        products_not_developed = [x for x in df_pp['product_1'].unique() if x not in products_exclude_from_not_developed]

        # Merge this into proximity matrix (in long form)
        # -- keep only proximities for those products in the country's 'underdeveloped' portfolio
        df_ppt = df_pp[df_pp['product_1'].isin(products_not_developed)].copy()

        # now add products the country does have developed (RCA >= developed threshold in t)
        df_developed = df_dc[df_dc[rca_col]>=developed_minimum_rca_threshold]
        dft = pd.merge(df_ppt,df_developed,how='left',left_on='product_2',right_on=product_col)
        dft.rename(columns={rca_col:'RCAcpt_product2'}, inplace=True)

        # Density numerator includes only proximities to 'developed/present' products
        dft['phi_include'] = 0.0
        dft.loc[ (dft['RCAcpt_product2']>=developed_minimum_rca_threshold),'phi_include'] = dft['phi']

        # Take the sum of these for product 1
        dft['density_sum'] = dft.groupby(['product_1'])['phi_include'].transform('sum')

        # Divide this density_sum by the sum of all proximities as in Hidalgo et al
        dft['density_sum_all'] = dft.groupby(['product_1'])['phi'].transform('sum')
        dft['density'] = dft['density_sum'] / dft['density_sum_all']
        dft.loc[dft['density'].isnull(),'density'] = 0 # 0 if missing, no sum

        # Now drop information on product 2: keep only one observation per product 1
        dft.drop_duplicates(subset='product_1',inplace=True)

        # add region information again
        dft[region_col] = REGION

        # Keep only relevant columns
        dft = dft[[region_col,'product_1','density']]

        # add to country-product density dataframe
        df_cpd = pd.concat([df_cpd,dft],axis=0)

    print(f'country-product densities finished for year {year}')

    return df_cpd

# -- Create country-product densities dataframe
# Loop over countries now, takes about 2 minutes
df_cpd = calc_density_hidalgo_et_al(rca_dataframe=df_rca,
                 region_col = 'location_name_short_en',
                 product_col = 'hs_product_name_short_en',
                 rca_col = 'RCAcpt',
                 year = 1996,
                 underdeveloped_maximum_rca_threshold = 0.5,
                 developed_minimum_rca_threshold = 1)

#+begin_example
Done region 10 out of 220
Done region 20 out of 220
Done region 30 out of 220
Done region 40 out of 220
Done region 50 out of 220
Done region 60 out of 220
Done region 70 out of 220
Done region 80 out of 220
Done region 90 out of 220
Done region 100 out of 220
Done region 110 out of 220
Done region 120 out of 220
Done region 130 out of 220
Done region 140 out of 220
Done region 150 out of 220
Done region 160 out of 220
Done region 170 out of 220
Done region 180 out of 220
Done region 190 out of 220
Done region 200 out of 220
Done region 210 out of 220
country-product densities finished for year 1996
#+end_example

### Add information on product diversification of countries in t + 1



For each country we now have a vector of underdeveloped products and their corresponding density. Next we add information on whether countries actually developed those products in $t + 1$ (i.e. if they have an RCA >= 1 in $t + 1$).

We will use 2005 as $t + 1$ below.



In [1]:
# Now add information on whether these products are present in 2005 (t + 1)
# -- again using the rca matrix
df_future = df_rca[df_rca['year']==2005].copy()

# keep only relevant columns
df_future = df_future[['location_name_short_en','hs_product_name_short_en','RCAcpt','export_value']]

# tag and drop countries with no exports at all: error in data
df_future['exports_sum'] = df_future.groupby('location_name_short_en')['export_value'].transform('sum')
df_cpdf = pd.merge(df_cpd,df_future,how='left',left_on=[f'location_name_short_en',f'product_1'],right_on=['location_name_short_en','hs_product_name_short_en'],indicator=True)
df_cpdf = df_cpdf[df_cpdf['exports_sum']!=0] # (none in 2005 when dropped in 1996 density calculations)

# Remove those with RCA between 0.5 and 1: 'inconclusive' in Hidalgo et al 2007
# (country-products without a 2005 observation have a missing RCA and are dropped here as well)
df_cpdf = df_cpdf[ (df_cpdf['RCAcpt']<0.5) | (df_cpdf['RCAcpt']>=1) ]

# Tag if product was developed or not
df_cpdf['present'] = 0
df_cpdf.loc[df_cpdf['RCAcpt']>=1,'present'] = 1
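
The three RCA categories used here can be illustrated with a minimal sketch on made-up RCA values (not from the trade data). It uses `np.select` to label each value; the thresholds 0.5 and 1 are the ones from the text above.

```python
import numpy as np

# Made-up RCA values for one country in t + 1
rca = np.array([0.0, 0.3, 0.7, 1.0, 2.5])

# Below 0.5: underdeveloped; 1 or higher: developed; everything in between: inconclusive
status = np.select([rca < 0.5, rca >= 1],
                   ['underdeveloped', 'developed'],
                   default='inconclusive')
print(status)
```

Only the &rsquo;underdeveloped&rsquo; and &rsquo;developed&rsquo; cases enter the regression sample; the &rsquo;inconclusive&rsquo; ones are dropped.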

### Plot density distribution



Below we plot the distribution of density (measured in $t$) separately for products that were developed by $t + 1$ and products that were not.

Density is generally higher for those products that were developed between $t$ and $t + 1$ than for those that were not.



In [1]:
dfa = pd.DataFrame()
for present in [0,1]:
    dft = df_cpdf[df_cpdf['present']==present].copy()
    dft.shape
    #dfo = pd.DataFrame([0])
    li = []
    for x in range(1,11,1):
        x_min = (x/10)-0.1
        x_max= (x/10)
        if x_max == 1:
            x_max = 1.01
        sh_in_density = dft[ (dft['density']>=x_min) & (dft['density']<x_max)].shape[0]/dft.shape[0]
        li.append([f'{round(x_min,2)}-{round(x_max,2)}', sh_in_density])
    dfo = pd.DataFrame(li)
    dfo.index=dfo[0]
    dfo.drop(0,axis=1,inplace=True)
    dfo.rename(columns={1:f'{present}'}, inplace=True)
    dfa = pd.concat([dfa,dfo],axis=1)

dfa.rename(columns={'0':f'not_developed'}, inplace=True)
dfa.rename(columns={'1':f'developed'}, inplace=True)
dfa.plot.bar()
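
The binning loop above can also be written with `pd.cut` and `pd.crosstab`. The sketch below uses a small made-up dataframe (the values are not from the trade data) and computes, within each group, the share of observations per 0.1-wide density bin. Note that `pd.cut` uses right-closed bins by default, so bin edges are treated slightly differently than in the loop.

```python
import pandas as pd

# Made-up example data: density in t and whether the product was present in t + 1
toy = pd.DataFrame({'density': [0.05, 0.12, 0.18, 0.31, 0.35, 0.62],
                    'present': [0, 0, 0, 1, 1, 1]})

# Ten bins of width 0.1; include_lowest=True makes the first bin include a density of exactly 0
bin_edges = [i / 10 for i in range(11)]
toy['density_bin'] = pd.cut(toy['density'], bins=bin_edges, include_lowest=True)

# Share of observations per bin, within each group (each column sums to 1)
shares = pd.crosstab(toy['density_bin'], toy['present'], normalize='columns')
shares.columns = ['not_developed', 'developed']
print(shares)
```

Calling `shares.plot.bar()` on the result would give a bar chart comparable to the one above.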

### Density regression



Finally we run a simple density regression. We will use the &rsquo;statsmodels&rsquo; package in Python for this (imported at the beginning of the notebook in the first code cell).

Several Python packages can be used for this, such as statsmodels and linearmodels (the PanelOLS estimator that used to be part of pandas has been removed from it).



In [1]:
# Define X and Y columns / arrays from dataframe
X_cols = ['density']
Y_col = ['present']
# sub-select the columns
X = df_cpdf[X_cols]
Y = df_cpdf[Y_col]
# Add constant
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
# Show results
print(results.summary())

(To get stronger results, one could for instance use only the proximity to the closest occupied product instead of density; see the Supplementary Material of Hidalgo et al 2007.)



### Model with fixed effects



#### Including dummies



Fixed effects (FE) can be included by adding dummy variables for the variable in question (here: country).



In [1]:
# Country fixed effects in statsmodels
# (dtype=float: recent pandas versions return boolean dummies, which can trip up statsmodels)
countries_d = pd.get_dummies(df_cpdf['location_name_short_en'],drop_first=True,dtype=float)
# add 'd_' in front of variables
countries_d.columns = ['d_'+col for col in countries_d.columns]
# add dummies to main dataframe
df_cpdf2 = pd.concat([df_cpdf,countries_d],axis=1)
##
X_cols = ['density'] + list(countries_d.columns)
Y_col = ['present']
##
X = df_cpdf2[X_cols]
Y = df_cpdf2[Y_col]
# Add constant
X = sm.add_constant(X)
model = sm.OLS(Y,X)
print(f'fitting')
results = model.fit()
print(f'ready')
# Show results
print(results.summary())

#### Demeaning



A much faster alternative is to demean the variables within each country (the &rsquo;within&rsquo; transformation) instead of including dummies, as done below.



In [1]:
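# (better to do this in STATA)
# Demean only the regression variables within each country, then add back the overall means
cols = ['density','present']
df_cpdf2 = df_cpdf[cols] - df_cpdf.groupby('location_name_short_en')[cols].transform('mean') + df_cpdf[cols].mean()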
##
X_cols = ['density']
Y_col = ['present']
##
X = df_cpdf2[X_cols]
Y = df_cpdf2[Y_col]
# Add constant
X = sm.add_constant(X)
model = sm.OLS(Y,X)
print(f'fitting')
results = model.fit()
print(f'ready')
# Show results
print(results.summary())

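(The standard errors need to be adjusted afterwards: OLS on the demeaned data does not account for the degrees of freedom used up by the country means.)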
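
That the dummy-variable and demeaning approaches give the same slope can be checked on synthetic data. The following is a minimal sketch using only numpy (all data are randomly generated, and the helper `demean` is defined here for illustration); it is not a replacement for a proper panel estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 3 groups ("countries") with 50 observations each
group = np.repeat(np.arange(3), 50)
x = rng.normal(size=group.size) + group          # regressor correlated with the group
y = 0.5 * x + 2.0 * group + rng.normal(size=group.size)

# (1) Dummy-variable approach: regress y on x plus one dummy per group
dummies = (group[:, None] == np.arange(3)).astype(float)
X_dummies = np.column_stack([x, dummies])
beta_dummies = np.linalg.lstsq(X_dummies, y, rcond=None)[0][0]

# (2) Within transformation: subtract group means from x and y, then regress
def demean(values, groups):
    # groups are coded 0, 1, 2, so they can index the array of group means directly
    group_means = np.array([values[groups == g].mean() for g in np.unique(groups)])
    return values - group_means[groups]

x_within = demean(x, group)
y_within = demean(y, group)
beta_within = (x_within @ y_within) / (x_within @ x_within)

print(beta_dummies, beta_within)  # the two slope estimates coincide
```

The slope is identical in both cases; only the reported standard errors differ, because a regression on demeaned data is not aware that one mean per group was estimated first.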



## Calculating Economic Complexity / Product Complexity using trade data



We know from the product space and density regressions how products are related to one another and how that matters for diversification of countries.

The next step is to look at which parts of the product space are most interesting to ultimately reach / diversify into. Generally complex products are located in the center of the product space, and countries with a higher economic complexity tend to have higher economic growth.

![img](https://www.dropbox.com/s/a231jw76yocjkkr/complex_products_in_product_space.png?dl=1)

Recall from the lecture that the economic complexity index (ECI) and product complexity index (PCI) are derived either from an iterative &rsquo;method of reflections&rsquo; algorithm on country diversity and product ubiquity (Hidalgo and Hausmann 2009), or from an eigenvector of a matrix built from the country-product matrix (Mealy et al. 2019).

![img](https://www.dropbox.com/s/dte4vwgk4tvj3rd/countries_products_eci.png?dl=1)

We can calculate these in Python on raw data using the &rsquo;py-ecomplexity&rsquo; package (by Shreyas Gadgin Matha at the Growth Lab, available at [https://github.com/cid-harvard/py-ecomplexity](https://github.com/cid-harvard/py-ecomplexity)). A STATA version of this package is available at:

[https://github.com/cid-harvard/ecomplexity](https://github.com/cid-harvard/ecomplexity)

One can also directly download the PCI value for every product from the Atlas data repository - the ECI of a country is the mean of the PCI values of the products it has a comparative advantage in.
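
The eigenvector formulation can be illustrated with a minimal numpy sketch on a made-up 4-country, 4-product matrix. This follows the textbook definition (eigenvector of the second-largest eigenvalue, standardized) and is only meant to show the idea; the &rsquo;py-ecomplexity&rsquo; package used below may differ in implementation details.

```python
import numpy as np

# Made-up binary country-product matrix M (1 = RCA >= 1): rows are countries A-D,
# columns are products. Country A exports everything, D only the most common product.
M = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

diversity = M.sum(axis=1)   # number of products per country
ubiquity = M.sum(axis=0)    # number of countries per product

# M_tilde[c, c'] = sum_p M[c, p] * M[c', p] / (diversity[c] * ubiquity[p])
M_tilde = (M / diversity[:, None]) @ (M / ubiquity[None, :]).T

# ECI is based on the eigenvector belonging to the second-largest eigenvalue
eigenvalues, eigenvectors = np.linalg.eig(M_tilde)
order = np.argsort(eigenvalues.real)[::-1]
second = eigenvectors[:, order[1]].real

# The sign of an eigenvector is arbitrary: orient it so it correlates positively with diversity
if np.corrcoef(second, diversity)[0, 1] < 0:
    second = -second

# Standardize to mean 0 and standard deviation 1
eci = (second - second.mean()) / second.std()
print(eci.round(3))
```

In this nested toy example the ranking by ECI matches the ranking by diversity (A highest, D lowest). In real trade data the two can differ, which is what makes ECI informative beyond a simple product count.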



### Using the &rsquo;py-ecomplexity&rsquo; package



#### Installation



One can install it with pip (the Python package installer), using the following command:



In [1]:
!pip install ecomplexity
print('installed py-ecomplexity')

#### Usage



We will again use the original trade dataset (&rsquo;df_orig&rsquo;) below.



In [1]:
from ecomplexity import ecomplexity
from ecomplexity import proximity

# To use py-ecomplexity, specify the following columns
trade_cols = {'time':'year',
              'loc':'location_name_short_en',
              'prod':'hs_product_name_short_en',
              'val':'export_value'}
              
# Then run the command
print('calculating ecomplexity')
df_ec = ecomplexity(df_orig, trade_cols)
print('finished calculating')

# Keep selected columns
df_ec = df_ec[['location_name_short_en',
               'hs_product_name_short_en',
               'export_value',
               'year',
               'pci',
               'eci']]

# Show results
df_ec.sample(n=10)

## ------------- Break: Assignment 2 -------------



### What are countries with high complexity in 2015?



### Vice versa, what are countries with low complexity in 2015?



### What are products (PCI) with high complexity in 2015?



### Vice versa, what are products (PCI) with low complexity in 2015?



### Ukraine



#### How did Ukraine&rsquo;s economic complexity evolve over time?



#### How does Ukraine&rsquo;s economic complexity in 2015 compare to other countries? Which countries have comparable economic complexity?



#### What are the most complex products that Ukraine exported in 2015?



## ---



## ---



## ---



## Assignments answers



### Assignment 1:



#### What product does Ukraine export most in 1995? (excluding services such as &rsquo;transport&rsquo;, &rsquo;ict&rsquo; etc)



In [1]:
df2 = df_orig[ (df_orig['location_name_short_en']=='Ukraine') & (df_orig['year'] == 1995) ].copy()
df3 = df2.groupby(['hs_product_code','hs_product_name_short_en'],as_index=False)['export_value'].sum()
df3.sort_values(by=['export_value'],ascending=False,inplace=True)
df3[['hs_product_name_short_en','export_value']][0:5]

#### What products is Ukraine specialized in in 1995 and 2005 and how much do they export of these?



In [1]:
# 1995

# Use the 'df_rca' dataframe for this

df2 = df_rca[ (df_rca['year']==1995) & (df_rca['location_name_short_en']=='Ukraine')].copy()
df2.sort_values(by=['RCAcpt'],ascending=False,inplace=True)
df2[['hs_product_name_short_en','RCAcpt','year','export_value']][0:5]

# 2005
df2 = df_rca[ (df_rca['year']==2005) & (df_rca['location_name_short_en']=='Ukraine')].copy()
df2.sort_values(by=['RCAcpt'],ascending=False,inplace=True)
df2[['hs_product_name_short_en','RCAcpt','year','export_value']][0:5]

#### Which product is most related to the product &rsquo;Stainless steel wire&rsquo;?



In [1]:
PRODUCT = 'Stainless steel wire'
# select only this product
dft = df_cppt[df_cppt['product_1']==PRODUCT].copy()
# Sort from high to low by proximity (phi)
dft.sort_values(by=['phi'],ascending=False,inplace=True)
# Show only first row
dft[0:1]

#### Plot Ukraine in the product space in 1995.



How would you characterize Ukraine&rsquo;s position in the product space?



In [1]:
# Select country
COUNTRY_STRING = 'Ukraine'
df_ps = df_rca[df_rca['location_name_short_en']==COUNTRY_STRING].copy()

# Cross-check
if df_ps.shape[0] == 0:
    raise ValueError('Country string set above does not exist in data, typed correctly?')

# Select year
df_ps = df_ps[df_ps['year']==1995].copy()
#df_ps = df_ps[df_ps['year']==2005].copy()

# Select RCA >= 1
df_ps = df_ps[df_ps['RCAcpt']>=1]

# Keep only relevant columns (including 'hs_product_code', which is passed to the plotting function below)
df_ps = df_ps[['hs_product_name_short_en','hs_product_code','export_value']]

# Keep only products with minimum value threshold
exports_min_threshold = 40000000
df_ps = df_ps[df_ps['export_value']>exports_min_threshold]

# Show resulting dataframe
df_ps.sample(n=5)

# And finally plot in the product space
create_product_space(df_plot_dataframe=df_ps,
                     df_plot_node_col='hs_product_code',
                     df_node_size_col='export_value')
print('plotted')

#### Plot Ukraine in the product space in 2015.



Do you notice a difference with 1995?



In [1]:
# Select country
COUNTRY_STRING = 'Ukraine'
df_ps = df_rca[df_rca['location_name_short_en']==COUNTRY_STRING].copy()

# Cross-check
if df_ps.shape[0] == 0:
    raise ValueError('Country string set above does not exist in data, typed correctly?')

# Select year
df_ps = df_ps[df_ps['year']==2015].copy()
#df_ps = df_ps[df_ps['year']==2005].copy()

# Select RCA >= 1
df_ps = df_ps[df_ps['RCAcpt']>=1]

# Keep only relevant columns (including 'hs_product_code', which is passed to the plotting function below)
df_ps = df_ps[['hs_product_name_short_en','hs_product_code','export_value']]

# Keep only products with minimum value threshold
exports_min_threshold = 40000000
df_ps = df_ps[df_ps['export_value']>exports_min_threshold]

# Show resulting dataframe
df_ps.sample(n=5)

# And finally plot in the product space
create_product_space(df_plot_dataframe=df_ps,
                     df_plot_node_col='hs_product_code',
                     df_node_size_col='export_value',
                     show_legend = 0)
print('plotted')

#### Plot your own country across different years in the product space. Do the results make sense? Do you notice any patterns?



### Assignment 2:



#### What are countries with high complexity in 2015?



In [1]:
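# Restrict to 2015 first, so that the selection below does not pick up other years
df_2015 = df_ec[df_ec['year']==2015]
qt_high = df_2015['eci'].quantile(0.95)
df_2015[df_2015['eci']>qt_high][['location_name_short_en']].drop_duplicates()[0:10]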

#### Vice versa, what are countries with low complexity in 2015?



In [1]:
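# Restrict to 2015 first, so that the selection below does not pick up other years
df_2015 = df_ec[df_ec['year']==2015]
qt_low = df_2015['eci'].quantile(0.05)
df_2015[df_2015['eci']<qt_low][['location_name_short_en']].drop_duplicates()[0:10]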

#### What are products (PCI) with high complexity in 2015?



In [1]:
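# Restrict to 2015 first, so that the selection below does not pick up other years
df_2015 = df_ec[df_ec['year']==2015]
qt_high = df_2015['pci'].quantile(0.95)
df_2015[df_2015['pci']>qt_high][['hs_product_name_short_en']].drop_duplicates()[0:10]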

#### Vice versa, what are products (PCI) with low complexity in 2015?



In [1]:
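# Restrict to 2015 first, so that the selection below does not pick up other years
df_2015 = df_ec[df_ec['year']==2015]
qt_low = df_2015['pci'].quantile(0.05)
df_2015[df_2015['pci']<qt_low][['hs_product_name_short_en','pci']].drop_duplicates()[0:10]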

#### Ukraine



##### How did Ukraine&rsquo;s economic complexity evolve over time?



In [1]:
df = df_ec[df_ec['location_name_short_en']=='Ukraine'].copy()
# keep one row per year (the ECI is repeated on every product row)
df.drop_duplicates(subset=['location_name_short_en','year'],inplace=True)
# keep relevant columns
df = df[['location_name_short_en','year','eci']]
# sort by year
df.sort_values(by='year',inplace=True)
df.reset_index(inplace=True,drop=True)
df.plot(x='year', y='eci')

##### How does Ukraine&rsquo;s economic complexity in 2015 compare to other countries? Which countries have comparable economic complexity?



In [1]:
df = df_ec[df_ec['year']==2015].copy()
# drop duplicates of countries
df = df[['location_name_short_en','eci']].drop_duplicates()
# sort by ECI
df.sort_values(by='eci',ascending=False,inplace=True)
df.reset_index(inplace=True,drop=True)
# create rank variable (1 = highest ECI)
df['rank'] = df.index + 1
# get rank of Ukraine
RANK_UKRAINE = df[df['location_name_short_en']=='Ukraine'].reset_index()['rank'][0]
# check countries ranked directly above and below Ukraine
df[ (df['rank']>RANK_UKRAINE-10) & (df['rank']<RANK_UKRAINE+10)]

##### What are the most complex products that Ukraine exported in 2015?



In [1]:
df = df_ec[df_ec['location_name_short_en']=='Ukraine'].copy()
df = df[df['year']==2015]
# keep only products that Ukraine actually exported
df = df[df['export_value']>0]
df.sort_values(by=['pci'],ascending=False,inplace=True)
df.reset_index(inplace=True,drop=True)
df[0:10][['hs_product_name_short_en','pci']]