# 🌍 Countries GDP Web Scraping & Data Analysis
### IBM Data Engineering Specialization – Portfolio Project

This project demonstrates how to extract and transform GDP data from an archived Wikipedia page using web scraping and Python. The workflow includes:

- Extracting GDP values from an HTML table  
- Cleaning and structuring the dataset  
- Converting GDP units for clarity  
- Adding analytical features such as GDP Share, Rank, and Normalized Score  
- Exporting the final enriched dataset for further use  

### Data Source (Archived)
https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)


## 📦 Setup & Importing Required Libraries
In this step, we import all required Python libraries used throughout the project.  
We also suppress warnings to keep the notebook output clean.


In [None]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")


## 📌 Extracting Nominal GDP Data

The archived page is fetched with `urllib`, sending a browser-style `User-Agent` header, and parsed with BeautifulSoup.  
The IMF GDP table is then located and its rows are converted into a pandas DataFrame.


In [None]:
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

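# Load webpage (browser-style User-Agent; the connection is closed on exit)
req = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=30) as response:
    html = response.read()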

# Parse HTML
soup = BeautifulSoup(html, "html.parser")

# Extract all tables
tables = soup.find_all("table")

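# IMF nominal GDP table. Index 3 is specific to this archived snapshot
# and needs re-checking if the URL or the page layout changes.
imf_table = tables[3]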

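# Extract structured rows: first cell is the country name, third cell the GDP value
rows = []
for tr in imf_table.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) >= 3:
        country = tds[0].get_text(strip=True)
        # Drop thousands separators before the numeric check
        gdp_raw = tds[2].get_text(strip=True).replace(",", "")
        # Keep only purely numeric values; any other cell is skipped
        if gdp_raw.isdigit():
            rows.append([country, int(gdp_raw)])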

# Convert to DataFrame
df = pd.DataFrame(rows, columns=["Country", "GDP (Million USD)"])
df


| | Country | GDP (Million USD) |
|---|---|---|
| 0 | World | 105568776 |
| 1 | United States | 26854599 |
| 2 | China | 19373586 |
| 3 | Japan | 4409738 |
| 4 | Germany | 4308854 |
| ... | ... | ... |
| 187 | Marshall Islands | 291 |
| 188 | Palau | 262 |
| 189 | Kiribati | 248 |
| 190 | Nauru | 151 |
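
The cleaning rule used in the extraction cell (strip thousands separators, then keep only purely numeric cells) can be isolated into a small helper. The sketch below is illustrative only: `parse_gdp_cell` is a hypothetical name that does not appear in the notebook, and the sample strings are made up to show which kinds of cells the `isdigit()` check would skip.

```python
from typing import Optional


def parse_gdp_cell(text: str) -> Optional[int]:
    """Return the cell as an int, or None if it is not purely numeric.

    Hypothetical helper mirroring the notebook's rule: drop thousands
    separators, then accept the value only if every character is a digit.
    """
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned.isdigit() else None


# Illustrative inputs (not taken from the scraped page)
samples = ["26,854,599", "N/A", "1,234[n 1]", ""]
parsed = [parse_gdp_cell(s) for s in samples]
print(parsed)
```

Cells rejected this way disappear silently, so comparing the number of parsed rows with the number of table rows is a cheap sanity check after scraping.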


## 🚫 Removing Aggregate Rows

Only actual countries should be included in the ranking, so rows like **"World"** are removed.


In [None]:
df = df[df["Country"] != "World"].reset_index(drop=True)
df


| | Country | GDP (Million USD) |
|---|---|---|
| 0 | United States | 26854599 |
| 1 | China | 19373586 |
| 2 | Japan | 4409738 |
| 3 | Germany | 4308854 |
| 4 | India | 3736882 |
| ... | ... | ... |
| 186 | Marshall Islands | 291 |
| 187 | Palau | 262 |
| 188 | Kiribati | 248 |
| 189 | Nauru | 151 |


## 🔝 Selecting Top 10 Largest Economies


In [None]:
df = df.head(10).reset_index(drop=True)
df


| | Country | GDP (Million USD) |
|---|---|---|
| 0 | United States | 26854599 |
| 1 | China | 19373586 |
| 2 | Japan | 4409738 |
| 3 | Germany | 4308854 |
| 4 | India | 3736882 |
| 5 | United Kingdom | 3158938 |
| 6 | France | 2923489 |
| 7 | Italy | 2169745 |
| 8 | Canada | 2089672 |
| 9 | Brazil | 2081235 |
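
`head(10)` returns the ten largest economies only because the scraped table is already sorted by GDP in descending order. A minimal sketch of a selection that does not depend on row order, using pandas' `DataFrame.nlargest`; the small `demo` frame is sample data built from values shown above and deliberately shuffled, and it selects three rows rather than ten to keep the example short.

```python
import pandas as pd

# Sample rows taken from the outputs above, deliberately out of order
demo = pd.DataFrame(
    {
        "Country": ["Japan", "Nauru", "United States", "China", "Germany"],
        "GDP (Million USD)": [4409738, 151, 26854599, 19373586, 4308854],
    }
)

# Select by value instead of relying on row position
top3 = demo.nlargest(3, "GDP (Million USD)").reset_index(drop=True)
print(top3)
```

On the notebook's data, `df.nlargest(10, "GDP (Million USD)")` would be expected to give the same ten rows as `df.head(10)`.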


## 🔄 Converting GDP to Billion USD for Better Interpretability


In [None]:
df["GDP (Billion USD)"] = np.round(df["GDP (Million USD)"] / 1000, 2)
df


| | Country | GDP (Million USD) | GDP (Billion USD) |
|---|---|---|---|
| 0 | United States | 26854599 | 26854.6 |
| 1 | China | 19373586 | 19373.59 |
| 2 | Japan | 4409738 | 4409.74 |
| 3 | Germany | 4308854 | 4308.85 |
| 4 | India | 3736882 | 3736.88 |
| 5 | United Kingdom | 3158938 | 3158.94 |
| 6 | France | 2923489 | 2923.49 |
| 7 | Italy | 2169745 | 2169.74 |
| 8 | Canada | 2089672 | 2089.67 |
| 9 | Brazil | 2081235 | 2081.24 |


## 📊 Adding GDP Share (%), Ranking, and Normalized Score


In [None]:
total_gdp = df["GDP (Billion USD)"].sum()
df["GDP Share (%)"] = np.round((df["GDP (Billion USD)"] / total_gdp) * 100, 2)
df


| | Country | GDP (Million USD) | GDP (Billion USD) | GDP Share (%) |
|---|---|---|---|---|
| 0 | United States | 26854599 | 26854.6 | 37.77 |
| 1 | China | 19373586 | 19373.59 | 27.25 |
| 2 | Japan | 4409738 | 4409.74 | 6.2 |
| 3 | Germany | 4308854 | 4308.85 | 6.06 |
| 4 | India | 3736882 | 3736.88 | 5.26 |
| 5 | United Kingdom | 3158938 | 3158.94 | 4.44 |
| 6 | France | 2923489 | 2923.49 | 4.11 |
| 7 | Italy | 2169745 | 2169.74 | 3.05 |
| 8 | Canada | 2089672 | 2089.67 | 2.94 |
| 9 | Brazil | 2081235 | 2081.24 | 2.93 |
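
Because `total_gdp` is the sum over the ten rows that were kept, each share is relative to the top 10 combined, not to world GDP. The arithmetic can be checked by hand. The sketch below uses the figures from the outputs above (in millions of USD); as an addition that is not part of the notebook, it also computes the United States' share of the `World` row that was removed earlier, for comparison.

```python
# GDP in millions of USD, copied from the tables above
top10 = {
    "United States": 26854599,
    "China": 19373586,
    "Japan": 4409738,
    "Germany": 4308854,
    "India": 3736882,
    "United Kingdom": 3158938,
    "France": 2923489,
    "Italy": 2169745,
    "Canada": 2089672,
    "Brazil": 2081235,
}
world_total = 105568776  # the "World" row dropped earlier

top10_total = sum(top10.values())
us = top10["United States"]

share_of_top10 = round(us / top10_total * 100, 2)  # the notebook's definition
share_of_world = round(us / world_total * 100, 2)  # alternative denominator
print(share_of_top10, share_of_world)
```

The first value reproduces the 37.77 in the table; against the world total the same GDP works out to 25.44, which shows how much the choice of denominator matters when reading the column.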


In [None]:
df["Rank"] = df["GDP (Billion USD)"].rank(ascending=False).astype(int)
df = df.sort_values("Rank").reset_index(drop=True)
df


| | Country | GDP (Million USD) | GDP (Billion USD) | GDP Share (%) | Rank |
|---|---|---|---|---|---|
| 0 | United States | 26854599 | 26854.6 | 37.77 | 1 |
| 1 | China | 19373586 | 19373.59 | 27.25 | 2 |
| 2 | Japan | 4409738 | 4409.74 | 6.2 | 3 |
| 3 | Germany | 4308854 | 4308.85 | 6.06 | 4 |
| 4 | India | 3736882 | 3736.88 | 5.26 | 5 |
| 5 | United Kingdom | 3158938 | 3158.94 | 4.44 | 6 |
| 6 | France | 2923489 | 2923.49 | 4.11 | 7 |
| 7 | Italy | 2169745 | 2169.74 | 3.05 | 8 |
| 8 | Canada | 2089672 | 2089.67 | 2.94 | 9 |
| 9 | Brazil | 2081235 | 2081.24 | 2.93 | 10 |
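
One caveat with `rank(ascending=False).astype(int)`: pandas' default `method="average"` gives tied values a fractional rank (for example 2.5), which `astype(int)` then truncates. There are no ties in this dataset, so the output above is unaffected. A sketch of a tie-safe variant using `method="min"`, on made-up values:

```python
import pandas as pd

# Made-up values containing a tie, purely for illustration
gdp = pd.Series([300.0, 200.0, 200.0, 100.0])

# Default method="average": tied values share the mean of their positions
default_rank = gdp.rank(ascending=False)

# method="min": tied values share the best (lowest) rank, so casting to int is safe
min_rank = gdp.rank(ascending=False, method="min").astype(int)

print(default_rank.tolist())
print(min_rank.tolist())
```

With `method="min"` the two tied entries both get rank 2 and the next entry gets rank 4, which matches the usual "competition" ranking convention.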


In [None]:
gdp_min = df["GDP (Billion USD)"].min()
gdp_max = df["GDP (Billion USD)"].max()

df["GDP Normalized Score"] = np.round(
    (df["GDP (Billion USD)"] - gdp_min) / (gdp_max - gdp_min), 3
)

df


| | Country | GDP (Million USD) | GDP (Billion USD) | GDP Share (%) | Rank | GDP Normalized Score |
|---|---|---|---|---|---|---|
| 0 | United States | 26854599 | 26854.6 | 37.77 | 1 | 1.0 |
| 1 | China | 19373586 | 19373.59 | 27.25 | 2 | 0.698 |
| 2 | Japan | 4409738 | 4409.74 | 6.2 | 3 | 0.094 |
| 3 | Germany | 4308854 | 4308.85 | 6.06 | 4 | 0.09 |
| 4 | India | 3736882 | 3736.88 | 5.26 | 5 | 0.067 |
| 5 | United Kingdom | 3158938 | 3158.94 | 4.44 | 6 | 0.044 |
| 6 | France | 2923489 | 2923.49 | 4.11 | 7 | 0.034 |
| 7 | Italy | 2169745 | 2169.74 | 3.05 | 8 | 0.004 |
| 8 | Canada | 2089672 | 2089.67 | 2.94 | 9 | 0.0 |
| 9 | Brazil | 2081235 | 2081.24 | 2.93 | 10 | 0.0 |
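
The normalized score is a min-max scaling, `(x - min) / (max - min)`, which maps the smallest economy in the selection to 0 and the largest to 1. A worked check in plain Python using the billion-USD values from the table above; `min_max` is a hypothetical helper name, and the guard for a zero range is an addition that the notebook does not include.

```python
def min_max(values, ndigits=3):
    """Scale values to [0, 1] with min-max normalization, rounded to ndigits."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: avoid division by zero and return zeros
        return [0.0 for _ in values]
    return [round((v - lo) / (hi - lo), ndigits) for v in values]


# GDP in billions of USD, copied from the table above
gdp_billion = [
    26854.6, 19373.59, 4409.74, 4308.85, 3736.88,
    3158.94, 2923.49, 2169.74, 2089.67, 2081.24,
]

scores = min_max(gdp_billion)
print(scores)
```

Canada and Brazil both show 0.0 at three decimals even though only Brazil is the true minimum; Canada's unrounded score is roughly 0.0003, so more decimal places would be needed to tell the two apart.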


## 💾 Exporting the Processed GDP Dataset as CSV


In [None]:
df.to_csv("Largest_economies.csv", index=False)


## 📊 Results

The processed dataset presents the **top 10 largest economies** by nominal GDP. Values were parsed from the archived table, converted from millions to billions of USD, and extended with GDP share, rank, and a min-max normalized score.

The United States (37.77%) and China (27.25%) together account for about 65% of the combined GDP of the ten economies, followed by Japan, Germany, and India. GDP Share (%) is computed relative to the top 10 total, not world GDP, so the figures describe each country's weight within this group.
