# GDP Data Extraction and Processing


## Project Scenario

An international firm looking to expand its business into countries around the world has hired you as a junior Data Engineer. Your task is to create a script that extracts the top 10 largest economies of the world, in descending order of GDP in billion USD (rounded to 2 decimal places), as reported by the International Monetary Fund (IMF).


In [1]:
import numpy as np
import pandas as pd

In [2]:
url = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

In [3]:
tables = pd.read_html(url)

# Retain table 3
df = tables[3]

# Replace column headers with column numbers
df.columns = range(df.shape[1])

# Retain columns 0 (country) and 2 (GDP quoted by IMF)
df = df[[0, 2]]

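# Retain the top 10 economies of the world (rows 1 to 10; row 0 is skipped).
# Copy the slice so later column assignments do not act on a view of the original frame.
df = df.iloc[1:11, :].copy()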

# Assign column names
df.columns = ["Country", "GDP (Million USD)"]

df

| | Country | GDP (Million USD) |
|---|---|---|
| 1 | United States | 26854599 |
| 2 | China | 19373586 |
| 3 | Japan | 4409738 |
| 4 | Germany | 4308854 |
| 5 | India | 3736882 |
| 6 | United Kingdom | 3158938 |
| 7 | France | 2923489 |
| 8 | Italy | 2169745 |
| 9 | Canada | 2089672 |
| 10 | Brazil | 2081235 |
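The positional selection above can be tried offline on a small stand-in frame. This is only a sketch: the two-level header, the region column, and the leading aggregate row are assumptions about how the parsed table is laid out, and the aggregate value is a placeholder, not an IMF figure.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table (layout is an assumption).
raw = pd.DataFrame(
    [
        ["World", "-", "50637923"],  # placeholder: sum of the three rows below
        ["United States", "Americas", "26854599"],
        ["China", "Asia", "19373586"],
        ["Japan", "Asia", "4409738"],
    ],
    columns=pd.MultiIndex.from_tuples(
        [
            ("Country/Territory", "Country/Territory"),
            ("UN region", "UN region"),
            ("IMF", "Estimate"),
        ]
    ),
)

# Replace the multi-level headers with plain column numbers 0, 1, 2
flat = raw.copy()
flat.columns = range(flat.shape[1])

# Keep columns 0 and 2, skip the aggregate row, and take the next two rows
top = flat[[0, 2]].iloc[1:3].copy()
top.columns = ["Country", "GDP (Million USD)"]
print(top)
```

Renumbering the columns first avoids having to spell out multi-level header tuples, at the cost of depending on the column order of the source page.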


In [4]:
# Change the data type to int
df["GDP (Million USD)"] = df["GDP (Million USD)"].astype(int)

# Convert from million to billion USD
df["GDP (Million USD)"] = df["GDP (Million USD)"] / 1000

# Round to two decimal places
df["GDP (Million USD)"] = np.round(df["GDP (Million USD)"], 2)

# Rename column header
df = df.rename(columns={"GDP (Million USD)": "GDP (Billion USD)"})

df

| | Country | GDP (Billion USD) |
|---|---|---|
| 1 | United States | 26854.6 |
| 2 | China | 19373.59 |
| 3 | Japan | 4409.74 |
| 4 | Germany | 4308.85 |
| 5 | India | 3736.88 |
| 6 | United Kingdom | 3158.94 |
| 7 | France | 2923.49 |
| 8 | Italy | 2169.74 |
| 9 | Canada | 2089.67 |
| 10 | Brazil | 2081.24 |
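The conversion steps above could also be wrapped in a small helper. The sketch below is one possible variant, not the notebook's method: the function name `millions_to_billions` is hypothetical, and it uses `pd.to_numeric(..., errors="coerce")` in place of `astype(int)` so that a non-numeric cell becomes `NaN` rather than raising an error.

```python
import pandas as pd


def millions_to_billions(values: pd.Series) -> pd.Series:
    """Convert GDP figures from million USD to billion USD, rounded to 2 decimals."""
    # Unparseable cells (for example "n/a") become NaN instead of raising
    numeric = pd.to_numeric(values, errors="coerce")
    return (numeric / 1000).round(2)


# Two figures from the table above plus one made-up unparseable cell
sample = pd.Series(["26854599", "19373586", "n/a"], name="GDP (Million USD)")
converted = millions_to_billions(sample)
print(converted)
```

For the top 10 rows used here every cell is numeric, so both approaches should give the same values; the coercing variant only matters if the selection is widened to rows with missing estimates.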


In [5]:
# Load data to a csv file
df.to_csv("./Largest_economies.csv")
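Because `to_csv` writes the row index as an unnamed first column by default, reading the file back without further options yields an extra `Unnamed: 0` column. The sketch below illustrates this with an in-memory buffer instead of a file; the two-row frame is a shortened stand-in for the real result.

```python
import io

import pandas as pd

# Shortened stand-in for the final DataFrame
df = pd.DataFrame(
    {
        "Country": ["United States", "China"],
        "GDP (Billion USD)": [26854.6, 19373.59],
    },
    index=[1, 2],
)

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as a first column with an empty header

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a column named "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index

print(list(naive.columns))
print(list(restored.columns))
```

If the rank index is not needed in the file, passing `index=False` to `to_csv` leaves it out altogether.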