# Practice Project: GDP Data extraction and processing
Estimated time needed: 30 minutes

Introduction
In this practice project, you will put the skills acquired through the course to use. You will extract data from a website using web scraping and request APIs, then process it using the Pandas and NumPy libraries.

# Web Scraping Tables using Pandas

Project Scenario:
An international firm that is looking to expand its business in different countries across the world has recruited you. You have been hired as a junior Data Engineer and are tasked with creating a script that can extract the list of the top 10 largest economies of the world in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data seems to be available on the URL mentioned below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29

Objectives
After completing this lab you will be able to:

Use web scraping to extract the required information from a website.
Use Pandas to load and process the tabular data as a DataFrame.
Use NumPy to manipulate the information contained in the DataFrame.
Load the updated DataFrame to a CSV file.

In [1]:
#Install required packages
!pip install pandas numpy 
!pip install lxml

# lxml is a widely used Python library for processing XML and HTML documents.
# It is one of the parsers that pandas can use to read HTML tables.



Importing Required Libraries
We recommend you import all required libraries in one place (here):

In [2]:
import numpy as np 
import pandas as pd

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# Exercise 1
Extract the required GDP data from the given URL using Web Scraping.

In [3]:
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

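You can use the Pandas library to extract the required table directly as a DataFrame. Note that the required table is the third one in the article, as shown in the image below. The archived page contributes one extra table of its own ahead of the article's tables (the Aug/SEP/Oct table in the output of `tables`), so the required table sits at index 3 of the list returned by pd.read_html().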

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/images/pandas_wbs_3.png">


In [4]:
# Extract tables from the webpage using Pandas. Retain the table at index 3 as the required DataFrame.

tables = pd.read_html(URL)
df = tables[3]  # index 0 is a table added by the archive page, so the GDP table is at index 3 rather than 2

In [5]:
tables

[      0     1     2
 0   Aug   SEP   Oct
 1   NaN    02   NaN
 2  2022  2023  2024,
                                                    0
 0  Largest economies in the world by GDP (nominal...,
                                                    0  \
 0  .mw-parser-output .legend{page-break-inside:av...   
 
                                                    1  \
 0  $750 billion – $1 trillion $500–750 billion $2...   
 
                                                    2  
 0  $50–100 billion $25–50 billion $5–25 billion <...  ,
     Country/Territory UN region IMF[1][13]            World Bank[14]  \
     Country/Territory UN region   Estimate       Year       Estimate   
 0               World         —  105568776       2023      100562011   
 1       United States  Americas   26854599       2023       25462700   
 2               China      Asia   19373586  [n 1]2023       17963171   
 3               Japan      Asia    4409738       2023        4231141   
 4             Germany 

How pandas.read_html() Works
Purpose: Reads HTML pages and extracts tables as pandas DataFrame objects.
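
As a minimal sketch of this behaviour, the snippet below parses a small, made-up HTML string instead of the real page. The markup and the names `html`, `all_tables`, and `gdp_tables` are assumptions for illustration only. It shows that `pd.read_html()` returns a list with one DataFrame per `<table>` element, and that the `match` argument can pick a table by its text rather than by position.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, used only for illustration.
html = """
<table>
  <tr><th>Month</th><th>Day</th></tr>
  <tr><td>Sep</td><td>2</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Estimate</th></tr>
  <tr><td>United States</td><td>26854599</td></tr>
  <tr><td>China</td><td>19373586</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element found.
all_tables = pd.read_html(StringIO(html))
print(len(all_tables))

# match keeps only the tables whose text matches the given string or regex.
gdp_tables = pd.read_html(StringIO(html), match="Country")
gdp = gdp_tables[0]
print(gdp)
```

Selecting by `match` may be more robust than a fixed index when a page adds or removes tables, although the index-based approach used in this lab works for the archived URL.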

In [6]:
df

Unnamed: 0_level_0,Country/Territory,UN region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
...,...,...,...,...,...,...,...,...
209,Anguilla,Americas,—,—,—,—,303,2021
210,Kiribati,Oceania,248,2023,223,2022,227,2021
211,Nauru,Oceania,151,2023,151,2022,155,2021
212,Montserrat,Americas,—,—,—,—,72,2021


In [7]:
# Replace the column headers with column numbers

df.columns = range(df.shape[1])

In [None]:
# How range(df.shape[1]) works:
#
# df.shape returns a tuple (number of rows, number of columns) for the DataFrame.
# df.shape[1] is the number of columns in the DataFrame.
# For example, if the DataFrame has 4 columns, range(df.shape[1]) is range(0, 4), i.e. 0, 1, 2, 3.
#
# Assigning this range to df.columns replaces the column names with these sequential integers.
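
The same step can be tried on a small stand-in table. The DataFrame below is a hand-built sample (the name `sample` and its shortened two-level header are assumptions, not the scraped data) that mimics the two-level column header seen in the output above.

```python
import pandas as pd

# Small stand-in for the scraped table, with a similar two-level header.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("IMF", "Estimate"),
    ("IMF", "Year"),
])
sample = pd.DataFrame(
    [["United States", 26854599, 2023], ["China", 19373586, 2023]],
    columns=columns,
)

print(sample.shape)  # (number of rows, number of columns)

# Replace the two-level header with plain column numbers.
sample.columns = range(sample.shape[1])
print(list(sample.columns))
```

After the assignment, a column can be selected with a single integer such as `sample[1]`, which is simpler than addressing it through both header levels.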

In [8]:
df

Unnamed: 0,0,1,2,3,4,5,6,7
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
...,...,...,...,...,...,...,...,...
209,Anguilla,Americas,—,—,—,—,303,2021
210,Kiribati,Oceania,248,2023,223,2022,227,2021
211,Nauru,Oceania,151,2023,151,2022,155,2021
212,Montserrat,Americas,—,—,—,—,72,2021


In [9]:
# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF)

df = df[[0, 2]]

In [10]:
df

Unnamed: 0,0,2
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
...,...,...
209,Anguilla,—
210,Kiribati,248
211,Nauru,151
212,Montserrat,—


In [11]:
# Retain the Rows with index 1 to 10, indicating the top 10 economies of the world.

df = df.iloc[1:11, :]

In [12]:
df

Unnamed: 0,0,2
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
10,Brazil,2081235


In [13]:
# Assign column names as "Country" and "GDP (Million USD)"

df.columns = ['Country', 'GDP (Million USD)']

In [14]:
df

Unnamed: 0,Country,GDP (Million USD)
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
10,Brazil,2081235


# Exercise 2
Modify the GDP column of the DataFrame, converting the values from Million USD to Billion USD. Use the round() function of the NumPy library to round the values to 2 decimal places. Change the column header to GDP (Billion USD).

In [15]:
# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.

df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)
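
The conversion above succeeds because the ten retained rows contain only digits. Rows further down the original table, such as Anguilla, hold a dash instead of a number, and `astype(int)` would raise an error on them. If such rows ever needed converting, one possible approach (a sketch, not part of this lab's required steps) is `pd.to_numeric` with `errors="coerce"`. The Series `raw` below is a made-up sample.

```python
import pandas as pd

# Made-up sample mixing numeric strings with the dash used for missing values.
raw = pd.Series(["26854599", "—", "248"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().tolist())

# Drop the missing entries before converting to integers.
clean = numeric.dropna().astype(int)
print(clean.tolist())
```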

In [16]:
df

Unnamed: 0,Country,GDP (Million USD)
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
10,Brazil,2081235


In [17]:
# Convert the GDP value in Million USD to Billion USD

df[['GDP (Million USD)']] = df[['GDP (Million USD)']] / 1000

In [None]:
# For a single column, df['col'] (single brackets) and df[['col']] (double brackets) both work here.
# Single brackets return a Series and are simpler and more commonly used,
# while double brackets return a one-column DataFrame, which keeps the code consistent
# when the same operation is later applied to several columns.
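
A short sketch of the difference, using a two-row sample (the name `sample` and its contents are illustrative only):

```python
import pandas as pd

sample = pd.DataFrame({
    "Country": ["Japan", "Germany"],
    "GDP (Million USD)": [4409738, 4308854],
})

as_series = sample["GDP (Million USD)"]    # single brackets: a Series
as_frame = sample[["GDP (Million USD)"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)

# Dividing by 1000 gives the same numbers either way.
billions_from_series = as_series / 1000
billions_from_frame = as_frame / 1000
```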

In [18]:
df

Unnamed: 0,Country,GDP (Million USD)
1,United States,26854.599
2,China,19373.586
3,Japan,4409.738
4,Germany,4308.854
5,India,3736.882
6,United Kingdom,3158.938
7,France,2923.489
8,Italy,2169.745
9,Canada,2089.672
10,Brazil,2081.235


In [19]:
# Use the numpy.round() function to round the values to 2 decimal places.

df[['GDP (Million USD)']] = np.round(df[['GDP (Million USD)']], 2)

In [20]:
df

Unnamed: 0,Country,GDP (Million USD)
1,United States,26854.6
2,China,19373.59
3,Japan,4409.74
4,Germany,4308.85
5,India,3736.88
6,United Kingdom,3158.94
7,France,2923.49
8,Italy,2169.74
9,Canada,2089.67
10,Brazil,2081.24
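
In the output above, Italy's 2169.745 appears as 2169.74 rather than 2169.75. A likely explanation, offered here as an assumption rather than something verified for this value, is that NumPy rounds exact halves to the nearest even digit and that decimal fractions are not stored exactly in binary floating point. The sketch below shows the half-to-even rule on simple values and confirms that `np.round` accepts a DataFrame. The `sample` frame reuses two values from the table above.

```python
import numpy as np
import pandas as pd

# NumPy rounds exact halves to the nearest even value.
halves = np.round(np.array([0.5, 1.5, 2.5]))
print(halves)  # [0. 2. 2.]

# np.round also accepts a DataFrame and returns a DataFrame.
sample = pd.DataFrame({"GDP": [4409.738, 2089.672]})
rounded = np.round(sample, 2)
print(rounded)
```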


In [21]:
# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'.
# rename() returns a new DataFrame, so assign the result back to df.

df = df.rename(columns = {'GDP (Million USD)' : 'GDP (Billion USD)'})
df

Unnamed: 0,Country,GDP (Billion USD)
1,United States,26854.6
2,China,19373.59
3,Japan,4409.74
4,Germany,4308.85
5,India,3736.88
6,United Kingdom,3158.94
7,France,2923.49
8,Italy,2169.74
9,Canada,2089.67
10,Brazil,2081.24


# Exercise 3
Load the DataFrame to the CSV file named "Largest_economies.csv"

In [22]:
# Load the DataFrame to the CSV file named "Largest_economies.csv"

df.to_csv('./Largest_economies.csv')

In [23]:
df

Unnamed: 0,Country,GDP (Billion USD)
1,United States,26854.6
2,China,19373.59
3,Japan,4409.74
4,Germany,4308.85
5,India,3736.88
6,United Kingdom,3158.94
7,France,2923.49
8,Italy,2169.74
9,Canada,2089.67
10,Brazil,2081.24
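
To check what `to_csv()` writes, the file can be read back with `pd.read_csv()`. The sketch below uses an in-memory buffer as a stand-in for `./Largest_economies.csv` and a two-row sample named `result`. Both are assumptions for illustration, so the snippet does not touch the real file.

```python
from io import StringIO

import pandas as pd

result = pd.DataFrame(
    {"Country": ["United States", "China"],
     "GDP (Billion USD)": [26854.6, 19373.59]},
    index=[1, 2],
)

buffer = StringIO()     # in-memory stand-in for './Largest_economies.csv'
result.to_csv(buffer)   # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores that first column as the index when reading back.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

If the row numbers are not wanted in the file, passing `index=False` to `to_csv()` leaves them out.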
