### Secondary Data Cleaning

This notebook cleans and processes the secondary data (GDP and MFN tariff data)
in the following steps:

1. Standardizing country names
2. Cleaning the data
3. Converting wide format to long format
4. Merging MFN tariff and GDP data
5. Writing final data to a csv to be merged with primary data

In [None]:
# import all the modules needed
import re
import pandas as pd
import numpy as np
import country_converter as coco

#### Define helper functions to clean and process data

In [None]:
def standardized_country(countries):
    """
    Converts a list of country names or codes to their standardized short names.

    Parameters
    ----------
    countries : list or str
        A list of country names or codes, or a single country name/code, to be standardized.

    Returns
    -------
    list or str
        The standardized short names of the input countries. 
        If a single country is provided, returns a string.
        If a list is provided, returns a list of strings.

    Notes
    -----
    This function uses the `coco` library to perform the conversion. 
    The output format is the short name of each country or `not found` 
    if country name/code is not recognized.
    """
    return coco.convert(names=countries, to='name_short')

In [None]:
def clean_secondary_data(df, country_col, value_col, cols_to_keep,
                         start_year=2015, end_year=2022, not_found_label="not found"):
    """
    Clean and preprocess a dataframe by standardizing country names, 
    filtering columns and rows, and melting into long format.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing data with country and year columns.
    country_col : str
        The name of the column in `df` containing country names to be standardized.
    value_col : str
        The name of the value column in the output DataFrame after melting.
    cols_to_keep : list of str
        The list of columns to keep before melting (e.g., standardized country and year columns).
    start_year : int, optional
        The first year to consider when dropping rows with missing data.
    end_year : int, optional
        The last year to consider when dropping rows with missing data.
    not_found_label : str, optional (default="not found")
        The label used to identify unrecognized countries.

    Returns
    -------
    pd.DataFrame
        A cleaned DataFrame with standardized country names, 
        a 'Year' column, and a column for data values.

    Notes
    -----
    Steps performed:
    1. Drop rows with no data between `start_year` and `end_year`.
    2. Standardize country names using `standardized_country`.
    3. Drop rows where the standardized name equals `not_found_label`.
    4. Retain only the specified columns (`cols_to_keep`).
    5. Melt the DataFrame into long format with `Year`, `standardized_country`, and values.
    """
    # Drop rows with no data in the given year range
    year_cols = [str(y) for y in range(start_year, end_year + 1)]
    df = df.dropna(subset=year_cols, how="all").copy()

    # Standardize country names
    df = df.assign(standardized_country=standardized_country(df[country_col]))

    # Drop rows where country standardization failed
    df = df.loc[df["standardized_country"] != not_found_label, :]

    # Keep only relevant columns
    df = df.loc[:, cols_to_keep]

    # Melt into long format
    melted_df = pd.melt(df, id_vars=["standardized_country"], var_name="Year", value_name=value_col)

    return melted_df.copy()
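
The drop-and-melt steps in `clean_secondary_data` can be illustrated on a small table. This is a minimal sketch on made-up data (the countries, years, and values are hypothetical), and it leaves out the country standardization step so that it runs with pandas alone.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "standardized_country": ["Canada", "Mexico", "Atlantis"],
    "2015": [1.5, 2.0, np.nan],
    "2016": [1.6, np.nan, np.nan],
})

# Drop rows that have no data in any of the year columns
year_cols = ["2015", "2016"]
wide = wide.dropna(subset=year_cols, how="all")

# Reshape to long format: one row per country-year pair
long_df = pd.melt(wide, id_vars=["standardized_country"],
                  var_name="Year", value_name="value")
print(long_df)
```

Note that `how="all"` only removes a row when every year column is missing, so a country with partial data (Mexico here) is kept and its gaps remain as `NaN` in the long table.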

In [None]:
def get_cols_to_keep(start_year=2008, end_year=2022, string_col="standardized_country"):
    """
    Generate a list of columns to keep for data cleaning.

    Parameters
    ----------
    start_year : int, optional
        The first year to include in the list of columns.
    end_year : int, optional
        The last year to include in the list of columns.
    string_col : str, optional
        The name of the string column to include before the year columns.

    Returns
    -------
    list of str
        A list of column names containing `string_col` followed by the years
        from `start_year` to `end_year` as strings.
    """
    years = [str(year) for year in range(start_year, end_year + 1)]
    return [string_col] + years
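
As a quick check of what the helper returns, the sketch below repeats the function body from above and calls it with a short, arbitrarily chosen year range.

```python
def get_cols_to_keep(start_year=2008, end_year=2022, string_col="standardized_country"):
    # Same logic as the helper above, repeated so this snippet runs on its own
    years = [str(year) for year in range(start_year, end_year + 1)]
    return [string_col] + years

cols = get_cols_to_keep(2020, 2022)
print(cols)  # ['standardized_country', '2020', '2021', '2022']
```

The years are returned as strings, so the year columns in the input data need string labels as well for the column selection to match.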

#### Process GDP Data

We have two GDP datasets captured in separate files:

1. GDP in Current USD:
    This indicator is expressed in current prices, meaning no adjustment has been made
    to account for price changes over time. It is expressed in United States dollars.
2. GDP in 2015-Adjusted USD:
    This indicator is expressed in constant prices, meaning the series has been adjusted
    to remove the effects of price changes and inflation over time. The reference year
    for this adjustment is 2015. It is expressed in United States dollars.

Load both GDP datasets for initial EDA; decide later which measure to use based on analysis goals.
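
To make the difference between the two measures concrete, here is a minimal sketch of the general relationship between current and constant prices. The numbers and the helper name `to_constant_prices` are hypothetical; the published series are computed by the data provider, and this only illustrates the basic idea of deflating by a price index.

```python
def to_constant_prices(nominal_gdp, deflator, base=100.0):
    """Convert current-price GDP to constant prices.

    `deflator` is a price index that equals `base` in the reference year.
    """
    return nominal_gdp * base / deflator

# Hypothetical figures for illustration only
nominal_2020 = 1_100.0   # GDP in current USD
deflator_2020 = 110.0    # price index with 2015 = 100

real_2020 = to_constant_prices(nominal_2020, deflator_2020)
print(real_2020)
```

With prices 10% above the reference year, the constant-price figure comes out lower than the current-price figure, which is why the two series diverge away from 2015.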


In [None]:
# Read GDP data files and skip the last 5 footer rows
raw_gdp_df = pd.read_csv("./../../data/raw/secondary/GDP.csv",
                         skipfooter=5,
                         engine="python")
raw_gdp_2015_adj_df = pd.read_csv("./../../data/raw/secondary/GDP_2015_adjusted.csv",
                                  skipfooter=5,
                                  engine="python")

print(raw_gdp_df.head(2))
print(raw_gdp_2015_adj_df.head(2))

In [None]:
# Rows after label 216 are aggregates (regions, income groups), not countries.
# Only country-level data is needed for the analysis, so keep rows 0 through 216.
raw_gdp_df = raw_gdp_df.loc[0:216, :]
raw_gdp_2015_adj_df = raw_gdp_2015_adj_df.loc[0:216, :]
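
One detail worth keeping in mind for the slice above: `.loc` slices by label and includes the end label, whereas `.iloc` slices by position and excludes the end. A small sketch on a hypothetical four-row frame shows the difference.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3
toy = pd.DataFrame({"Country Name": ["A", "B", "C", "D"]})

by_label = toy.loc[0:2, :]      # label-based: includes the row labelled 2
by_position = toy.iloc[0:2, :]  # position-based: excludes position 2

print(len(by_label), len(by_position))  # 3 2
```

So `.loc[0:216, :]` keeps 217 rows, assuming the frame still has its default integer index.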

In [None]:
# Rename columns so they are consistent across all data sources:
# strip the bracketed suffix from the year column names,
# e.g. "2008 [YR2008]" becomes "2008"
raw_gdp_df = raw_gdp_df.rename(
    columns=lambda col: re.sub(r" \[YR\d{4}\]$", '', col)
)

raw_gdp_2015_adj_df = raw_gdp_2015_adj_df.rename(
    columns=lambda col: re.sub(r" \[YR\d{4}\]$", '', col)
)
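
The effect of the regular expression can be checked on a few column names without loading the data. The list below is a hypothetical sample of headers in the same style as the example in the comment above.

```python
import re

# Hypothetical sample of raw column names
columns = ["Country Name", "Country Code", "2008 [YR2008]", "2022 [YR2022]"]

# Same pattern as in the rename call: a space, then "[YR" + four digits + "]" at the end
pattern = re.compile(r" \[YR\d{4}\]$")
renamed = [pattern.sub("", col) for col in columns]

print(renamed)  # ['Country Name', 'Country Code', '2008', '2022']
```

Columns that do not end with the bracketed suffix pass through unchanged, so the rename is safe to apply to every column.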

In [None]:
# get the columns to keep using the function
required_columns = get_cols_to_keep(2008, 2022)

# clean the gdp dataframe by passing the required parameters to the function
cleaned_gdp_df = clean_secondary_data(
    df=raw_gdp_df,
    country_col="Country Name",
    value_col="GDP",
    cols_to_keep=required_columns)

# clean the gdp 2015 adjusted dataframe by passing the required parameters to the function
cleaned_gdp_2015_adj_df = clean_secondary_data(
    df=raw_gdp_2015_adj_df,
    country_col="Country Name",
    value_col="GDP_2015_adj",
    cols_to_keep=required_columns)

# print the dataframes after cleaning
print("\n\n")
print(cleaned_gdp_df.head(2))
print("\n\n")
print(cleaned_gdp_2015_adj_df.head(2))

In [None]:
# merge both GDP dataframes on standardized country and year
final_gdp_df = pd.merge(cleaned_gdp_df, cleaned_gdp_2015_adj_df,
                        on=["standardized_country", "Year"],
                        how="inner")
final_gdp_df.head(2)

In [None]:
# write the final gdp dataframe to a csv file
final_gdp_df.to_csv("./../../data/processed/final_gdp_df.csv", index=False)

#### Process MFN tariff data

Data sources (World Bank WITS):

- https://wits.worldbank.org/CountryProfile/en/Country/USA/StartYear/1991/EndYear/2022/TradeFlow/Import/Partner/BY-COUNTRY/Indicator/MFN-WGHTD-AVRG
- https://wits.worldbank.org/CountryProfile/en/Country/USA/StartYear/1991/EndYear/2022/TradeFlow/Import/Partner/BY-COUNTRY/Indicator/MFN-SMPL-AVRG
- https://wits.worldbank.org/CountryProfile/en/Country/BY-COUNTRY/StartYear/1988/EndYear/2022/TradeFlow/Import/Partner/USA/Indicator/MFN-SMPL-AVRG
- https://wits.worldbank.org/CountryProfile/en/Country/BY-COUNTRY/StartYear/1988/EndYear/2022/TradeFlow/Import/Partner/USA/Indicator/MFN-WGHTD-AVRG

We are using four datasets from the World Bank WITS portal on Most Favored Nation (MFN) tariffs.
- **By US (tariffs imposed by the US on imports):**
  - Weighted Average (United States MFN Weighted Average by country.xlsx)
  - Simple Average (United States MFN Simple Average by country.xlsx)

- **On US (tariffs faced by US exports in partner countries):**
  - Weighted Average (MFN Weighted Average from United States by country.xlsx)
  - Simple Average (MFN Simple Average from United States by country.xlsx)

*Weighted = trade-volume adjusted; Simple = unweighted mean.*

Load all four MFN datasets for initial EDA; decide later which measure to use based on the analysis.
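
The difference between the simple and weighted averages described above can be shown with a few lines of arithmetic. The tariff rates and import values below are made up, and WITS applies its own product-level definitions; this sketch only illustrates why the two measures can differ.

```python
# Hypothetical tariff rates (percent) on three product lines
tariffs = [2.0, 10.0, 30.0]
# Hypothetical import values for the same product lines
imports = [800.0, 150.0, 50.0]

# Simple average: every product line counts equally
simple_avg = sum(tariffs) / len(tariffs)

# Weighted average: each tariff is weighted by its import value
weighted_avg = sum(t * m for t, m in zip(tariffs, imports)) / sum(imports)

print(round(simple_avg, 2), round(weighted_avg, 2))
```

Here most trade falls on the low-tariff line, so the weighted average sits well below the simple average.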

In [None]:
# get the columns to keep using the function
# this is passed as an argument to the clean_secondary_data function while cleaning all four MFN files
required_columns = get_cols_to_keep(2008, 2022)

In [None]:
# read excel file to get MFN simple avg from United States by country
on_us_simple_avg = pd.read_excel(
    io = "./../../data/raw/secondary/MFN Simple Average from United States by country.xlsx",
    sheet_name="Partner-Timeseries")
print("Raw data")
print(on_us_simple_avg.head(3))

# clean the data
cleaned_on_us_simple_avg = clean_secondary_data(
    df=on_us_simple_avg,
    country_col="Reporter Name",
    value_col="mfn_on_us_simple_avg",
    cols_to_keep=required_columns)

# print the cleaned data
print("Cleaned data")
print(cleaned_on_us_simple_avg.head(3))

In [None]:
# read excel file to get MFN weighted avg from United States by country
on_us_weighted_avg = pd.read_excel(
    io="./../../data/raw/secondary/MFN Weighted Average from United States by country.xlsx",
    sheet_name="Partner-Timeseries")
print("Raw data")
print(on_us_weighted_avg.head(3))

# clean the data
cleaned_on_us_weighted_avg = clean_secondary_data(
    df=on_us_weighted_avg,
    country_col="Reporter Name",
    value_col="mfn_on_us_weighted_avg",
    cols_to_keep=required_columns)

# print the cleaned data
print("Cleaned data")
print(cleaned_on_us_weighted_avg.head(3))

In [None]:
# read excel file to get United States MFN Simple Average by country
by_us_simple_avg = pd.read_excel(
    io="./../../data/raw/secondary/United States MFN Simple Average by country.xlsx",
    sheet_name="Partner-Timeseries")
print("Raw data")
print(by_us_simple_avg.head(3))

# clean the data
cleaned_by_us_simple_avg = clean_secondary_data(
    df=by_us_simple_avg,
    country_col="Partner Name",
    value_col="mfn_by_us_simple_avg",
    cols_to_keep=required_columns)

# print the cleaned data
print("Cleaned data")
print(cleaned_by_us_simple_avg.head(3))

In [None]:
# read excel file to get United States MFN Weighted Average by country
by_us_weight_avg = pd.read_excel(
    io="./../../data/raw/secondary/United States MFN Weighted Average by country.xlsx",
    sheet_name="Partner-Timeseries")
print("Raw data")
print(by_us_weight_avg.head(3))

# clean the data
cleaned_by_us_weight_avg = clean_secondary_data(
    df=by_us_weight_avg,
    country_col="Partner Name",
    value_col="mfn_by_us_weight_avg",
    cols_to_keep=required_columns)

# print the cleaned data
print("Cleaned data")
print(cleaned_by_us_weight_avg.head(3))

##### Merge all MFN dataframes to get MFN tariffs
The merged dataframe holds both weighted and simple averages, on the US and by the US, across countries and years.


In [None]:
# Join all dataframes on standardized_country and Year columns
final_tariff_df = (
    cleaned_by_us_simple_avg
    .merge(cleaned_by_us_weight_avg, on=["standardized_country", "Year"], how="outer")
    .merge(cleaned_on_us_simple_avg, on=["standardized_country", "Year"], how="outer")
    .merge(cleaned_on_us_weighted_avg, on=["standardized_country", "Year"], how="outer")
)

final_tariff_df.head(5)
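
An outer merge keeps every country-year pair that appears in any input, filling the gaps with `NaN`. The sketch below shows this on three small hypothetical frames and uses `functools.reduce` as one possible way to chain the merges; the column names `x`, `y`, and `z` are placeholders, not columns from the real data.

```python
from functools import reduce

import pandas as pd

keys = ["standardized_country", "Year"]

# Three hypothetical long-format frames with different country coverage
a = pd.DataFrame({"standardized_country": ["Canada", "Mexico"],
                  "Year": ["2020", "2020"], "x": [1.0, 2.0]})
b = pd.DataFrame({"standardized_country": ["Canada"],
                  "Year": ["2020"], "y": [3.0]})
c = pd.DataFrame({"standardized_country": ["Canada", "Japan"],
                  "Year": ["2020", "2020"], "z": [4.0, 5.0]})

# Chain outer merges over the list of frames
merged = reduce(lambda left, right: left.merge(right, on=keys, how="outer"),
                [a, b, c])

# Count rows where at least one value column is missing
n_incomplete = int(merged[["x", "y", "z"]].isna().any(axis=1).sum())
print(merged)
print(n_incomplete)
```

Only Canada appears in all three frames, so the other two rows carry missing values. Counting such rows after the real merge gives a quick sense of how much coverage the four MFN files share.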

In [None]:
# write the final tariff dataframe to a csv file
final_tariff_df.to_csv("./../../data/processed/final_tariff_df.csv", index=False)

In [None]:
# final_tariff_df[final_tariff_df.isnull().any(axis=1)].to_csv("./final_tariff_df_with_nulls.csv", index=False)

#### Merge Tariff and GDP Data

We merge the GDP datasets with MFN tariff datasets to create a final dataframe.
Each row represents a country-year pair, with GDP (current and 2015-adjusted) and
MFN tariff (simple and weighted, by US and on US) columns.


In [None]:
# merge GDP and tariff dfs on standardized_country and Year to get final dataframe
final_gdp_tariff_df = final_tariff_df.merge(
    final_gdp_df, on=["standardized_country", "Year"],
    how="outer")

# write the final gdp and tariff dataframe to a csv file
final_gdp_tariff_df.to_csv("./../../data/processed/final_gdp_tariff_df.csv", index=False)

# show first 5 rows of the final dataframe
final_gdp_tariff_df.head(5)

In [None]:
# checking nulls in major partner countries

# list of major partner countries
major_countries = [
    "China", "Germany", "Japan", "United Kingdom", "India", "France", 
    "Italy", "Canada", "South Korea", "Russia", "Brazil", "Australia", 
    "Spain", "Mexico", "Indonesia", "Netherlands", "Saudi Arabia", 
    "Turkey", "Switzerland", "Argentina", "Sweden", "Poland", 
    "Belgium", "Thailand", "Austria", "Norway", "United Arab Emirates"
]

major_countries_standardized = standardized_country(major_countries)
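# treat the ".." placeholder as a missing value so isnull() detects it
df = final_gdp_tariff_df.replace("..", np.nan)
# keep only the major partner countries
df = df[df["standardized_country"].isin(major_countries_standardized)]
# show rows with at least one missing value
df[df.isnull().any(axis=1)]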
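
If the `..` placeholder is present, the affected columns are stored as text rather than numbers. One possible way to handle this before analysis is to coerce the value columns to numeric and the `Year` column to integers. This is a sketch on hypothetical values, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical long-format rows where ".." marks a missing GDP value
toy = pd.DataFrame({
    "standardized_country": ["Canada", "Mexico"],
    "Year": ["2020", "2020"],
    "GDP": ["1.6e12", ".."],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy["GDP"] = pd.to_numeric(toy["GDP"], errors="coerce")

# Year labels come from column names, so they are strings after melting
toy["Year"] = toy["Year"].astype(int)

print(toy.dtypes)
```

Coercion silently converts every unparseable value to `NaN`, so it is worth checking the count of missing values before and after to make sure nothing unexpected was dropped.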