## Web Scraping: Practice exercise

Exercise: extract data from a website using web scraping and request APIs, then process it with the Pandas and NumPy libraries.

Objectives
- Use web scraping to extract the required information from a website.
- Use Pandas to load and process the tabular data as a DataFrame.
- Use NumPy to manipulate the information contained in the DataFrame.
- Save the updated DataFrame to a CSV file.

In [None]:
%pip install lxml   #Required by `pd.read_html(URL)`

Collecting lxml
  Downloading lxml-5.4.0-cp312-cp312-win_amd64.whl.metadata (3.6 kB)
Downloading lxml-5.4.0-cp312-cp312-win_amd64.whl (3.8 MB)
Installing collected packages: lxml
Successfully installed lxml-5.4.0
Note: you may need to restart the kernel 

In [2]:
import numpy as np
import pandas as pd


We use the *Pandas library* to extract the required table directly as a DataFrame. `pd.read_html` returns a list with one DataFrame per table on the page, and the required GDP table sits at index 3 of that list (list indices start at 0).

In [3]:
# URL to scrape data from 
URL="https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

# Extract all tables from the webpage using Pandas. The required table is at index 3 of the returned list.
tables = pd.read_html(URL)

In [7]:
type(tables)

list
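
Since the result is a plain Python list, individual tables are picked by position. The sketch below illustrates this on a small hypothetical HTML snippet (the markup and the "Region" table are made up for illustration, so no network access is needed). It also shows the `match` argument of `pd.read_html`, which may be a more robust way to find a table than a hard-coded position. It assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, inlined so the example needs no network access.
html = """
<table>
  <tr><th>Region</th><th>Countries</th></tr>
  <tr><td>Asia</td><td>2</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Japan</td><td>4409738</td></tr>
  <tr><td>Germany</td><td>4308854</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element, in page order.
demo_tables = pd.read_html(StringIO(html))
print(len(demo_tables))
print(demo_tables[1])

# `match` keeps only the tables whose text matches the given string or regex.
gdp_tables = pd.read_html(StringIO(html), match="GDP")
print(len(gdp_tables))
```

Selecting by `match` keeps working if the page gains or loses a table, whereas a fixed index silently points at a different table.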

In [None]:
tables[3]   # The GDP data is at index 3 of the list

Unnamed: 0_level_0,Country/Territory,UN region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
...,...,...,...,...,...,...,...,...
209,Anguilla,Americas,—,—,—,—,303,2021
210,Kiribati,Oceania,248,2023,223,2022,227,2021
211,Nauru,Oceania,151,2023,151,2022,155,2021
212,Montserrat,Americas,—,—,—,—,72,2021


In [17]:
#Creating dataframe
df = pd.DataFrame(tables[3])
df.head(10)

Unnamed: 0_level_0,Country/Territory,UN region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
5,India,Asia,3736882,2023,3385090,2022,3201471,2021
6,United Kingdom,Europe,3158938,2023,3070668,2022,3131378,2021
7,France,Europe,2923489,2023,2782905,2022,2957880,2021
8,Italy,Europe,2169745,2023,2010432,2022,2107703,2021
9,Canada,Americas,2089672,2023,2139840,2022,1988336,2021


In [18]:
# Replace the column headers with column numbers
print(f"df.shape = {df.shape}")
df.columns = range(df.shape[1])
df.head()

df.shape = (214, 8)


Unnamed: 0,0,1,2,3,4,5,6,7
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
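
The scraped table has a two-level header, which is why the columns are renumbered here. As a rough illustration, the sketch below rebuilds a small two-level header by hand (the labels are simplified from the scraped ones, and `demo`, `numbered` and `flat` are names chosen for this example) and compares positional numbering with an alternative that joins the two levels into one readable label.

```python
import pandas as pd

# Two-level header mirroring part of the scraped table (labels simplified).
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("IMF", "Estimate"),
    ("IMF", "Year"),
])
demo = pd.DataFrame(
    [["United States", "26854599", "2023"], ["Japan", "4409738", "2023"]],
    columns=columns,
)

# Option used in this exercise: positional integer labels.
numbered = demo.copy()
numbered.columns = range(numbered.shape[1])

# Alternative: join the two levels into a single readable label.
flat = demo.copy()
flat.columns = [
    top if top == sub else f"{top} {sub}" for top, sub in demo.columns
]

print(list(numbered.columns))
print(list(flat.columns))
```

Integer labels are quick to type, while joined labels keep the meaning of each column visible in later cells.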


In [30]:
# Retain the columns at index 0 and 2 (country name and the IMF GDP estimate)
df1 = df[[0, 2]].copy()
df1.head(10)

Unnamed: 0,0,2
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672


In [31]:
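# Retain the rows at positions 0 to 9: the "World" total followed by the nine largest economies.
# To keep exactly the ten largest economies without the "World" row, slice with df1.iloc[1:11] instead.

# Slicing with iloc[r1:r2, c1:c2] is positional and excludes r2
df1 = df1.iloc[0:10]
df1.head(15)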

Unnamed: 0,0,2
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
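
The choice of slice bounds decides whether the "World" row is kept. A minimal sketch on the first five rows shown above (the names `demo`, `with_world` and `without_world` are chosen for this example only):

```python
import pandas as pd

# First five rows of the IMF estimates shown above (million USD).
demo = pd.DataFrame({
    0: ["World", "United States", "China", "Japan", "Germany"],
    2: ["105568776", "26854599", "19373586", "4409738", "4308854"],
})

# iloc slices by position and excludes the stop value.
with_world = demo.iloc[0:3]     # positions 0, 1, 2
without_world = demo.iloc[1:4]  # positions 1, 2, 3

print(with_world[0].tolist())
print(without_world[0].tolist())
```

Starting the slice at 1 skips the aggregate row, so a slice of `1:11` on the full table would return ten countries.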


In [24]:
# Assign column names as "Country" and "GDP (Million USD)"
df1.columns = ["Country", "GDP (Million USD)"]
df1.head()

Unnamed: 0,Country,GDP (Million USD)
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
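
The shape of the list assigned to `.columns` matters even though the printed table looks the same. The sketch below, using a two-row stand-in frame named `demo` for illustration, shows that a flat list produces an ordinary column index while a nested list produces a `MultiIndex`.

```python
import pandas as pd

demo = pd.DataFrame([["World", "105568776"], ["United States", "26854599"]])

flat = demo.copy()
flat.columns = ["Country", "GDP (Million USD)"]      # flat list of labels

nested = demo.copy()
nested.columns = [["Country", "GDP (Million USD)"]]  # list wrapped in another list

print(isinstance(flat.columns, pd.MultiIndex))
print(isinstance(nested.columns, pd.MultiIndex))
print(type(flat["Country"]).__name__)
```

With a flat list, `flat["Country"]` is a plain Series, which keeps later steps such as type conversion and renaming simple.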


In [25]:
# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df1[["GDP (Million USD)"]] = df1[["GDP (Million USD)"]].astype(int)

# Convert the GDP value from million USD to billion USD (the column header is renamed in the next cell)
df1[["GDP (Million USD)"]] = df1[["GDP (Million USD)"]]/1000
df1.head()

Unnamed: 0,Country,GDP (Million USD)
0,World,105568.776
1,United States,26854.599
2,China,19373.586
3,Japan,4409.738
4,Germany,4308.854
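
`astype(int)` works here because the ten retained rows all hold digits. Further down the full table, some IMF estimates appear as "—", which `astype(int)` cannot convert. A possible alternative, sketched under the assumption that the column was read as text, is `pd.to_numeric` with `errors="coerce"` (the names `raw`, `millions` and `billions` are illustrative):

```python
import pandas as pd

# IMF estimates as text; "—" marks a missing estimate in the source table.
raw = pd.Series(["105568776", "26854599", "—", "248"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
millions = pd.to_numeric(raw, errors="coerce")
billions = millions / 1000

print(billions.tolist())
```

The result is a float column, with NaN wherever no estimate was available.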


In [None]:
# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
# General form: df.rename(columns={"OLD_COLUMN_NAME": "NEW_COLUMN_NAME"})

df1.rename(columns={"GDP (Million USD)": "GDP (Billion USD)"}, inplace=True)  # note the capital T: Python's boolean is True, not true
df1.head()

# inplace=False is the default: rename() then returns a new DataFrame and leaves the original unchanged.

Unnamed: 0,Country,GDP (Billion USD)
0,World,105568.776
1,United States,26854.599
2,China,19373.586
3,Japan,4409.738
4,Germany,4308.854
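
To make the default behaviour concrete, here is a small sketch of `rename` without `inplace=True`, using a one-row stand-in frame (`demo` and `renamed` are names chosen for this example):

```python
import pandas as pd

demo = pd.DataFrame({"Country": ["Japan"], "GDP (Million USD)": [4409.738]})

# Default (inplace=False): the original is untouched and a new DataFrame is returned.
renamed = demo.rename(columns={"GDP (Million USD)": "GDP (Billion USD)"})

print(list(demo.columns))
print(list(renamed.columns))
```

Assigning the result back, as in `demo = demo.rename(...)`, has the same effect as `inplace=True` and is often easier to follow in a chain of steps.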


In [None]:
# Use the numpy.round() function to round the values to 2 decimal places.
df1[["GDP (Billion USD)"]] = np.round(df1[["GDP (Billion USD)"]], 2)  # without the second argument, np.round() rounds to 0 decimals
df1.head()

Unnamed: 0,Country,GDP (Billion USD)
0,World,105568.78
1,United States,26854.6
2,China,19373.59
3,Japan,4409.74
4,Germany,4308.85
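
Pandas offers its own `round` method as well. The sketch below, on three of the values shown above, suggests that `np.round` and `Series.round` give the same result for this kind of data (`demo`, `via_numpy` and `via_pandas` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"GDP (Billion USD)": [105568.776, 26854.599, 19373.586]})

via_numpy = np.round(demo["GDP (Billion USD)"], 2)
via_pandas = demo["GDP (Billion USD)"].round(2)

print(via_numpy.tolist())
print(via_numpy.equals(via_pandas))
```

Either form fits the exercise; the NumPy call is used here because practising NumPy is one of the objectives.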


In [None]:
# Finally, save the DataFrame to the CSV file "Artifacts/Largest_economies.csv"
# (the Artifacts folder must already exist)
df1.to_csv("Artifacts/Largest_economies.csv")

# DataFrames also provide .to_excel() and .to_json()
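
Two details of `to_csv` are easy to miss: it does not create missing folders, and by default it writes the row index as an extra first column. The sketch below shows one way to handle both. It writes into a temporary directory that stands in for the notebook's working directory, and `demo`, `out_path` and `round_trip` are names chosen for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "Country": ["World", "United States"],
    "GDP (Billion USD)": [105568.78, 26854.6],
})

# A temporary directory stands in for the notebook's working directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "Artifacts" / "Largest_economies.csv"
    # to_csv does not create missing folders, so create "Artifacts" first.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves the row index out of the file.
    demo.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)

print(round_trip)
```

Reading the file back is a quick check that the columns and values were saved as intended.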

--end--