## World Bank data load
### Agenda
* Data import and correction
* Dataset load
* Cleaning
* Country names normalization and standardization

### Data import and correction
The cleaned orders dataset is loaded and its date columns are converted to datetime, so that the order period can later be used to filter the World Bank data.

In [None]:
import wbdata
import pandas as pd
import pycountry
import os
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=SyntaxWarning)

In [2]:
#Load data
file_path = os.path.join('transformed', 'orders_cleaned.csv')
orders_df = pd.read_csv(file_path)

In [3]:
#Change datatypes
orders_df['Order_Date'] = pd.to_datetime(orders_df['Order_Date'], errors='coerce')
orders_df['Shipping_Date'] = pd.to_datetime(orders_df['Shipping_Date'], errors='coerce')

min_date = orders_df['Order_Date'].min()
max_date = orders_df['Order_Date'].max()

print(f'Start date is {min_date}\nEnd date is {max_date}')

Start date is 2015-01-01 00:00:00
End date is 2018-01-31 00:00:00
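
Because `errors='coerce'` silently turns unparseable dates into `NaT`, it can be worth counting them before trusting the minimum and maximum. A minimal sketch on made-up values (the sample strings are illustrative and not taken from the orders file):

```python
import pandas as pd

# Illustrative values only; the real column is orders_df['Order_Date'].
raw_dates = pd.Series(['2015-01-01', 'not a date', '2018-01-31'])

parsed = pd.to_datetime(raw_dates, errors='coerce')

# Unparseable entries become NaT instead of raising an error.
n_unparsed = int(parsed.isna().sum())
print(f'Unparsed dates: {n_unparsed}')

# min() and max() skip NaT, so the range reflects valid dates only.
print(parsed.min(), parsed.max())
```

A non-zero count would suggest inspecting the raw values before relying on the date range.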


### Dataset load
GDP (current US$, indicator `NY.GDP.MKTP.CD`) and total population (indicator `SP.POP.TOTL`) are retrieved from the **World Bank** through the `wbdata` package to enrich the orders dataset with country-level economic indicators. The data is collected for all available countries and years.

In [4]:
#Get the GDP and population data
indicators = {"NY.GDP.MKTP.CD": "GDP", "SP.POP.TOTL": "Population"}

gdp_df = wbdata.get_dataframe(indicators, country="all")

gdp_df.reset_index(inplace=True)
gdp_df.info()

<class 'wbdata.client.DataFrame'>
RangeIndex: 17290 entries, 0 to 17289
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     17290 non-null  object 
 1   date        17290 non-null  object 
 2   GDP         14561 non-null  float64
 3   Population  17195 non-null  float64
dtypes: float64(2), object(2)
memory usage: 540.4+ KB


In [5]:
#Date information
gdp_df['date'] = pd.to_datetime(gdp_df['date'], errors='coerce')

min_date = gdp_df['date'].min()
max_date = gdp_df['date'].max()

print(min_date, max_date)

1960-01-01 00:00:00 2024-01-01 00:00:00


### Cleaning
Missing GDP values are filled per country (linear interpolation, then forward and backward fill), and rows that still have no GDP are dropped. The data is then filtered to the order period (2015–2018).

In [6]:
#Filling missing values
def fill_gdp_gaps(group):
    group = group.sort_values('date')  

    group['GDP'] = group['GDP'].interpolate(method='linear')
    group['GDP'] = group['GDP'].ffill()
    group['GDP'] = group['GDP'].bfill()

    return group

gdp_filled = gdp_df.groupby('country', group_keys=False).apply(fill_gdp_gaps)

#Check for remaining missing values
missing_gdp = gdp_filled[gdp_filled['GDP'].isna()]
nulls = not missing_gdp.empty

print(f"Are there any missing values in GDP data? {'Yes' if nulls else 'No'}")

if nulls:
    print('Remaining missing values:')
    print(missing_gdp['country'])

Are there any missing values in GDP data? Yes
Remaining missing values:
5004    British Virgin Islands
5003    British Virgin Islands
5002    British Virgin Islands
5001    British Virgin Islands
5000    British Virgin Islands
                 ...          
2279            Not classified
2278            Not classified
2277            Not classified
2276            Not classified
2275            Not classified
Name: country, Length: 260, dtype: object
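
The three fill steps behave differently depending on where the gap sits. The sketch below uses toy data (countries `A` and `B` and their values are made up) and applies the same steps with `transform` on the GDP column rather than `apply` on whole groups, which is a substitution made here for brevity. It shows that an interior gap is interpolated, a leading gap is back-filled, and a country with no GDP observations at all stays empty, which is presumably why entries such as `Not classified` remain above.

```python
import numpy as np
import pandas as pd

# Toy data: country 'A' has a leading gap and an interior gap,
# country 'B' has no GDP observations at all.
toy = pd.DataFrame({
    'country': ['A'] * 4 + ['B'] * 2,
    'date': pd.to_datetime(['2015', '2016', '2017', '2018', '2015', '2016'],
                           format='%Y'),
    'GDP': [np.nan, 10.0, np.nan, 30.0, np.nan, np.nan],
})

toy = toy.sort_values(['country', 'date'])

# Same three steps as fill_gdp_gaps, applied per country to the GDP column.
toy['GDP_filled'] = toy.groupby('country')['GDP'].transform(
    lambda s: s.interpolate(method='linear').ffill().bfill()
)

print(toy)
```

Note that the edge fills copy the nearest observed value, so a long leading or trailing gap ends up as a flat line rather than an estimate.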


In [7]:
gdp_filled = gdp_filled.dropna(subset=['GDP'])

gdp_filled.reset_index(drop=True, inplace=True)

print(f'Final number of rows: {len(gdp_filled)}')

Final number of rows: 17030


The orders dataset runs from 2015-01-01 to 2018-01-31, so it touches four calendar years (2015 through 2018). The GDP data is annual and is filtered to the same years.

In [8]:
#Filter for needed period
gdp_filled['date'] = pd.to_datetime(gdp_filled['date'], errors='coerce')

gdp_filled['Year'] = gdp_filled['date'].dt.year
gdp_filled.drop(columns='date', inplace=True)

start_year = orders_df['Order_Date'].dt.year.min()
end_year = orders_df['Order_Date'].dt.year.max()

gdp_filtered = gdp_filled[(gdp_filled['Year'] >= start_year) & (gdp_filled['Year'] <= end_year)].copy()

gdp_filtered.head()

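|     | country | GDP | Population | Year |
| --- | --- | --- | --- | --- |
| 55 | Afghanistan | 19134220000.0 | 33831764.0 | 2015 |
| 56 | Afghanistan | 18116570000.0 | 34700612.0 | 2016 |
| 57 | Afghanistan | 18753460000.0 | 35688935.0 | 2017 |
| 58 | Afghanistan | 18053220000.0 | 36743039.0 | 2018 |
| 120 | Africa Eastern and Southern | 910002000000.0 | 607123269.0 | 2015 |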
120,Africa Eastern and Southern,910002000000.0,607123269.0,2015


In [9]:
#Check for missing values
gdp_filtered.isnull().sum()

country       0
GDP           0
Population    0
Year          0
dtype: int64
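
A filtered DataFrame that will receive new columns is usually safer as an explicit copy; otherwise pandas may emit `SettingWithCopyWarning` on later assignments. A minimal sketch on toy data (the frame `gdp_toy` and its values are made up), using `Series.between`, which is inclusive on both ends by default:

```python
import pandas as pd

# Toy stand-in for the filled GDP table; values are illustrative.
gdp_toy = pd.DataFrame({
    'country': ['A'] * 5,
    'Year': [2014, 2015, 2016, 2018, 2019],
    'GDP': [1.0, 2.0, 3.0, 4.0, 5.0],
})
start_year, end_year = 2015, 2018

# .copy() makes the result an independent DataFrame, so assigning
# new columns to it does not touch (or warn about) the original.
gdp_subset = gdp_toy[gdp_toy['Year'].between(start_year, end_year)].copy()

# Placeholder assignment to show that adding a column is now safe.
gdp_subset['ISO3'] = 'AAA'

print(gdp_subset)
```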

### Country names normalization and standardization
World Bank data covers countries as well as regional aggregates and special territories. To enable further analysis, country names are normalized and mapped to ISO 3166-1 alpha-3 codes. Entries with no valid country match are then removed.

In [10]:
#Convert country names to ISO3 codes
def country_to_iso(name):
    try:
        return pycountry.countries.lookup(name).alpha_3
    except LookupError:
        #pycountry raises LookupError when no country matches
        return None

gdp_filtered['ISO3'] = gdp_filtered['country'].apply(country_to_iso)

#Check for mismatches
mismatches = gdp_filtered[gdp_filtered['ISO3'].isna()]['country'].unique()

print(f'Mismatched values: {mismatches}\nLength: {len(mismatches)}')

Mismatched values: ['Africa Eastern and Southern' 'Africa Western and Central' 'Arab World'
 'Bahamas, The' 'Caribbean small states' 'Central Europe and the Baltics'
 'Channel Islands' 'Congo, Dem. Rep.' 'Congo, Rep.' "Cote d'Ivoire"
 'Curacao' 'Early-demographic dividend' 'East Asia & Pacific'
 'East Asia & Pacific (IDA & IBRD countries)'
 'East Asia & Pacific (excluding high income)' 'Egypt, Arab Rep.'
 'Euro area' 'Europe & Central Asia'
 'Europe & Central Asia (IDA & IBRD countries)'
 'Europe & Central Asia (excluding high income)' 'European Union'
 'Fragile and conflict affected situations' 'Gambia, The'
 'Heavily indebted poor countries (HIPC)' 'High income'
 'Hong Kong SAR, China' 'IBRD only' 'IDA & IBRD total' 'IDA blend'
 'IDA only' 'IDA total' 'Iran, Islamic Rep.' 'Korea, Rep.' 'Kosovo'
 'Lao PDR' 'Late-demographic dividend' 'Latin America & Caribbean'
 'Latin America & Caribbean (excluding high income)'
 'Latin America & the Caribbean (IDA & IBRD countries)'
 'Least develope



In [11]:
#Manual correction
corrected_countries = {
    'Bahamas, The': 'Bahamas',
    'Congo, Dem. Rep.': 'Democratic Republic of the Congo',
    'Congo, Rep.': 'Republic of the Congo',
    "Cote d'Ivoire": 'Ivory Coast',
    'Egypt, Arab Rep.': 'Egypt',
    'Gambia, The': 'Gambia',
    'Hong Kong SAR, China': 'Hong Kong',
    'Iran, Islamic Rep.': 'Iran',
    'Korea, Rep.': 'South Korea',
    'Kosovo': 'Kosovo',
    'Lao PDR': 'Laos',
    'Macao SAR, China': 'Macao',
    'Micronesia, Fed. Sts.': 'Micronesia',
    'Puerto Rico (US)': 'Puerto Rico',
    'Somalia, Fed. Rep.': 'Somalia',
    'Turkiye': 'Turkey',
    'Venezuela, RB': 'Venezuela',
    'Virgin Islands (U.S.)': 'U.S. Virgin Islands',
    'West Bank and Gaza': 'Palestine',
    'Yemen, Rep.': 'Yemen'
}

#Save to the new column
gdp_filtered['Country_Clean'] = gdp_filtered['country'].map(
    lambda x: corrected_countries.get(x, x))

#Convert to ISO3
gdp_filtered['ISO3_Final'] = gdp_filtered['Country_Clean'].apply(country_to_iso)

#Drop and rename columns
gdp_filtered.drop(columns=['country', 'ISO3'], inplace=True)
gdp_filtered.rename(columns={'Country_Clean': 'Country', 'ISO3_Final': 'ISO3'}, inplace=True)

#Drop the rest of the regions
gdp_filtered = gdp_filtered[gdp_filtered['ISO3'].notna()]

gdp_filtered.head()


|     | GDP | Population | Year | Country | ISO3 |
| --- | --- | --- | --- | --- | --- |
| 55 | 19134220000.0 | 33831764.0 | 2015 | Afghanistan | AFG |
| 56 | 18116570000.0 | 34700612.0 | 2016 | Afghanistan | AFG |
| 57 | 18753460000.0 | 35688935.0 | 2017 | Afghanistan | AFG |
| 58 | 18053220000.0 | 36743039.0 | 2018 | Afghanistan | AFG |
| 250 | 11470170000.0 | 2731293.0 | 2015 | Albania | ALB |
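
Whether a corrected name such as a common English short form resolves through `pycountry` can depend on the installed version and its name database. One possible alternative, sketched here under that assumption, is to map the known World Bank display names straight to alpha-3 codes and fall back to a lookup function only for everything else. The dictionary `WB_NAME_TO_ISO3` and the helper `resolve_iso3` are hypothetical names, and only a few entries are shown.

```python
import pandas as pd

# World Bank display names mapped directly to ISO 3166-1 alpha-3 codes.
# Partial list for illustration; extend as needed.
WB_NAME_TO_ISO3 = {
    'Bahamas, The': 'BHS',
    'Congo, Dem. Rep.': 'COD',
    'Congo, Rep.': 'COG',
    "Cote d'Ivoire": 'CIV',
    'Egypt, Arab Rep.': 'EGY',
    'Korea, Rep.': 'KOR',
    'Lao PDR': 'LAO',
    'Turkiye': 'TUR',
}


def resolve_iso3(name, fallback=None):
    """Return the override code if present, else the result of fallback(name)."""
    if name in WB_NAME_TO_ISO3:
        return WB_NAME_TO_ISO3[name]
    return fallback(name) if fallback is not None else None


names = pd.Series(['Korea, Rep.', 'Euro area', 'Lao PDR'])

# Without a fallback, names outside the override table resolve to None.
codes = names.map(resolve_iso3)
print(codes)
```

In the notebook, the `fallback` argument could be the existing `country_to_iso` function, so that regular country names still go through `pycountry` while aggregates such as `Euro area` remain unmatched and are dropped.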


In [12]:
#Save to csv
csv_file = os.path.join('transformed', 'gdp_cleaned.csv')
os.makedirs(os.path.dirname(csv_file), exist_ok=True)

gdp_filtered.to_csv(csv_file, index=False)

print(f'Data saved to {csv_file}')

Data saved to transformed\gdp_cleaned.csv
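
The saved table is keyed by `ISO3` and `Year`, so enriching the orders could be done with a left join on those two columns. The sketch below is only an illustration: the orders column `Country_ISO3` is an assumed name (the real orders schema is not shown here), the third order uses a deliberately invalid code, and the GDP rows reuse the Afghanistan and Albania values for 2015 from the output above.

```python
import pandas as pd

# Hypothetical orders frame; 'Country_ISO3' is an assumed column name.
orders = pd.DataFrame({
    'Order_Date': pd.to_datetime(['2015-03-01', '2015-07-15', '2018-01-20']),
    'Country_ISO3': ['AFG', 'ALB', 'XXX'],
})

# Two rows shaped like gdp_cleaned.csv.
gdp = pd.DataFrame({
    'ISO3': ['AFG', 'ALB'],
    'Year': [2015, 2015],
    'GDP': [19134220000.0, 11470170000.0],
    'Population': [33831764.0, 2731293.0],
})

# GDP is annual, so orders are matched on their calendar year.
orders['Year'] = orders['Order_Date'].dt.year

enriched = orders.merge(
    gdp.rename(columns={'ISO3': 'Country_ISO3'}),
    on=['Country_ISO3', 'Year'],
    how='left',        # keep every order, even without a GDP match
    indicator=True,    # adds a '_merge' column for diagnostics
)

unmatched = int((enriched['_merge'] == 'left_only').sum())
print(f'Orders without GDP data: {unmatched}')
```

A left join keeps all orders, and the `_merge` column makes it easy to see which ones found no country-year match.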
