### Combine Primary and Secondary Datasets

This notebook merges the primary data (USA exports and imports)
with the secondary data (GDP and MFN tariff data). It involves the following steps:

1. Rename columns so that they are consistent across the primary and secondary datasets.
2. Convert columns to appropriate data types.
3. Fill missing values.
4. Merge the primary and secondary data.
5. Write the final data to a CSV file.

In [None]:
# import required packages
import pandas as pd
import numpy as np

In [None]:
# define root folder
ROOT_FOLDER = "."

In [None]:
# read cleaned and aggregated primary (trade) data from csv
primary = pd.read_csv(ROOT_FOLDER + '/data/processed/cleaned_primary_trade_data.csv')
primary.columns = primary.columns.str.lower()
primary = primary.rename(columns={'standardized_country': 'country'})

# Remove commas and convert to integer
primary['import_value'] = primary['import_value'].str.replace(',', '').astype(int)
primary['export_value'] = primary['export_value'].str.replace(',', '').astype(int)

# print first 5 rows
primary.head(5)
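
The `.str.replace(',', '')` step above only works when pandas reads the value columns as strings. The sketch below shows one possible, more defensive conversion on toy data. The helper name `parse_int_with_commas` and the sample values are hypothetical, and it assumes the trade values are whole numbers.

```python
import pandas as pd


def parse_int_with_commas(series: pd.Series) -> pd.Series:
    """Convert values such as '1,234' (or already-numeric values) to nullable Int64.

    Hypothetical helper: assumes the values are whole numbers.
    Anything that cannot be parsed becomes <NA>.
    """
    cleaned = series.astype(str).str.replace(',', '', regex=False)
    return pd.to_numeric(cleaned, errors='coerce').astype('Int64')


# toy data: one column stored as text with commas, one already numeric
sample = pd.DataFrame({
    'import_value': ['1,234', '56', None],
    'export_value': [1000, 2500, 30],
})

for col in ['import_value', 'export_value']:
    sample[col] = parse_int_with_commas(sample[col])

print(sample.dtypes)
```

Using the nullable `Int64` type here keeps missing values as `<NA>` instead of raising, which is a design choice rather than a requirement of the notebook.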

In [None]:
# check for remaining null values
primary.isnull().sum()

In [None]:
# read cleaned and aggregated secondary (GDP and MFN tariff) data from csv
secondary = pd.read_csv(ROOT_FOLDER + '/data/processed/final_gdp_tariff.csv')
secondary.columns = secondary.columns.str.lower()
secondary = secondary.rename(columns={'standardized_country': 'country'})

# round MFN Tariff to 2 decimal places
secondary['mfn_by_us_simple_avg'] = secondary['mfn_by_us_simple_avg'].round(2)
secondary['mfn_by_us_weighted_avg'] = secondary['mfn_by_us_weighted_avg'].round(2)
secondary['mfn_on_us_simple_avg'] = secondary['mfn_on_us_simple_avg'].round(2)
secondary['mfn_on_us_weighted_avg'] = secondary['mfn_on_us_weighted_avg'].round(2)

# round GDP and GDP 2015 adj down to the nearest integer (floor) and convert to nullable Int64 type
secondary['gdp'] = np.floor(pd.to_numeric(secondary['gdp'], errors='coerce')).astype('Int64')
secondary['gdp_2015_adj'] = np.floor(pd.to_numeric(secondary['gdp_2015_adj'],
                                                   errors='coerce')).astype('Int64')

# print first 5 rows
secondary.head(5)

In [None]:
# sort by country and year; reset the index so row labels follow the sorted order
secondary = secondary.sort_values(by=['country', 'year']).reset_index(drop=True)

# interpolate only numeric columns per country
interpolate_cols = ['mfn_by_us_simple_avg', 'mfn_by_us_weighted_avg',
                    'mfn_on_us_simple_avg', 'mfn_on_us_weighted_avg',
                    'gdp', 'gdp_2015_adj']

# transform returns values aligned to the frame's index, so the
# assignment cannot attach interpolated values to the wrong rows
secondary[interpolate_cols] = (
    secondary
    .groupby('country')[interpolate_cols]
    .transform(lambda col: col.interpolate(method='linear', limit_direction='both'))
)

secondary.head(5)


In [None]:
# check nulls
secondary.isnull().sum()

For the major trading partners of the USA, there are no null values.
For other countries, some nulls remain in the GDP and MFN tariff columns.

Interpolation can only fill a gap when the country has existing values in that column to estimate from,
so countries with no observed values in a column are left as null. We will ignore those countries for now.
This approach is acceptable because those countries are not major trading partners of the USA.
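
To illustrate why some nulls survive, here is a minimal sketch on toy data. The countries `A` and `B` and their values are made up: `A` has one gap between two observations, while `B` has no GDP values at all. It uses `groupby(...).transform`, which is one way to keep the interpolated values aligned with the original rows.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): A has a gap, B has no GDP values at all
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2008, 2009, 2010, 2008, 2009, 2010],
    'gdp': [100.0, np.nan, 300.0, np.nan, np.nan, np.nan],
})

toy = toy.sort_values(['country', 'year']).reset_index(drop=True)

# interpolate within each country; an all-null series has nothing to estimate from
toy['gdp'] = toy.groupby('country')['gdp'].transform(
    lambda s: s.interpolate(method='linear', limit_direction='both')
)

# list the countries that still contain nulls after interpolation
still_null = toy.loc[toy['gdp'].isna(), 'country'].unique().tolist()
print(still_null)
```

A list like `still_null` could be compared against the major trading partners to confirm that none of them are affected.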

In [None]:
# the secondary data covers 2008 to 2022, so keep the same years in primary
primary = primary[primary['year'].isin(range(2008, 2023))]

# merge primary and secondary data on Year and Country so that
# each trade record has corresponding GDP and MFN Tariff values
combined = pd.merge(primary, secondary, on=['year', 'country'], how='left')
combined = combined.sort_values(by=['country', 'year', 'category']).reset_index(drop=True)
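
A left merge silently keeps trade rows that have no GDP or tariff match, and it duplicates rows if the secondary data has repeated `(year, country)` keys. The sketch below shows one possible sanity check using the `validate` and `indicator` parameters of `pd.merge`. The frames `primary_toy` and `secondary_toy` and their values are hypothetical.

```python
import pandas as pd

# toy stand-ins for the primary and secondary frames (hypothetical values)
primary_toy = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    'year': [2008, 2008, 2008],
    'category': ['x', 'y', 'x'],
    'import_value': [10, 20, 30],
})
secondary_toy = pd.DataFrame({
    'country': ['A'],
    'year': [2008],
    'gdp': [1000],
})

# validate raises MergeError if (year, country) is not unique on the right side;
# indicator adds a '_merge' column recording where each row came from
checked = pd.merge(
    primary_toy, secondary_toy,
    on=['year', 'country'], how='left',
    validate='many_to_one', indicator=True,
)

# trade rows with no matching GDP / tariff record
unmatched = (
    checked.loc[checked['_merge'] == 'left_only', ['country', 'year']]
    .drop_duplicates()
)
print(unmatched)
```

The row count of `checked` should equal that of the primary frame; any entries in `unmatched` are country-year pairs that will carry null GDP and tariff values in the combined data.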

In [None]:
combined.head(20)

In [None]:
# save the combined data to csv
combined.to_csv(ROOT_FOLDER + '/data/processed/combined_primary_secondary.csv', index=False)