# Intro to web scraping and data formatting
Web scraping, also known as web harvesting or web data extraction, is the process of extracting information from websites or webpages.

It involves the automatic retrieval of data from web sources.

This is important for many applications, e.g., data analysis, data mining, and content aggregation.

After collecting data from the web, you need to format it for downstream analysis. The Python libraries pandas and NumPy can be used for that.

Finally, you can store the analysis-ready data in, for example, a CSV file. Remember that the choice of storage format depends on the project's requirements.


## Project Scenario: GDP Data extraction and processing

An international firm is looking to expand its business into different countries across the world. Create a script that extracts the list of the top 10 largest economies of the world in descending order of their GDPs in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data seems to be available on:
URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29

## Objectives

 - Use web scraping to extract the required information from a website.
 - Use pandas to load and process the tabular data as a dataframe.
 - Use NumPy to manipulate the information contained in the dataframe.
 - Store the updated dataframe as a CSV file.

---


# Setup
### Importing Required Libraries

In [4]:
import numpy as np
import pandas as pd

In [5]:
# Suppress warnings generated by the code
import warnings
warnings.filterwarnings('ignore')
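
Silencing every warning for the whole session can hide useful signals. As a hedged alternative, the sketch below limits suppression to a single block with the standard library's `warnings.catch_warnings()`. The helper `noisy_step` is hypothetical and only stands in for a library call that warns.

```python
import warnings


def noisy_step():
    """Hypothetical helper that emits a warning, standing in for a library call."""
    warnings.warn("this step is deprecated", DeprecationWarning)
    return 42


# Warnings are silenced only inside this `with` block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# In a separate block, warnings can be recorded for inspection instead.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

The filter state is restored when each `with` block exits, so later cells see warnings as usual.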

---


# Extracting the GDP data from the given URL using web scraping

Note that we need only the third table on the page, which contains the data we need for analysis. In the list returned by `read_html()` it is selected with index 3.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/images/pandas_wbs_3.png">


In [16]:
# Extract tables from the webpage using the pandas function read_html().
# read_html() is handy because it automatically extracts the tables from a webpage and returns them as a list of dataframes.
tables = pd.read_html("https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29")
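
To make the idea of "extracting tables from a page" concrete, here is a minimal standard-library sketch that collects table cells from an HTML string and loads them into a dataframe. It is not how `read_html()` is implemented. The sample markup, the country names, and the `TableParser` class are all made up for illustration.

```python
from html.parser import HTMLParser

import pandas as pd

# Made-up sample page, not the real Wikipedia markup.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Freedonia</td><td>100</td></tr>
  <tr><td>Sylvania</td><td>250</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by table and row."""

    def __init__(self):
        super().__init__()
        self.tables = []        # list of tables, each a list of rows
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._in_cell = True
            self.tables[-1][-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_HTML)

# First row is the header, the remaining rows are data.
header, *rows = parser.tables[0]
sample_df = pd.DataFrame(rows, columns=header)
print(sample_df)
```

All cell values come back as strings here, which is one reason the later steps convert the GDP column to a numeric type.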

In [17]:
# Select the table of interest (index 3 in the returned list).
df = tables[3]

In [18]:
# Inspect the first rows of the dataframe
df.head()

Unnamed: 0_level_0,Country/Territory,UN region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021


# Data transformation

In [20]:
# Replace the column headers with column numbers
df.columns = range(df.shape[1])
df

Unnamed: 0,0,1,2,3,4,5,6,7
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
...,...,...,...,...,...,...,...,...
209,Anguilla,Americas,—,—,—,—,303,2021
210,Kiribati,Oceania,248,2023,223,2022,227,2021
211,Nauru,Oceania,151,2023,151,2022,155,2021
212,Montserrat,Americas,—,—,—,—,72,2021
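
Replacing the headers with numbers works, but positions are easy to misread. As an alternative sketch, the two-level header can be addressed by label instead. The small frame below is a stand-in built from two rows of the output above, and the final column names are illustrative choices.

```python
import pandas as pd

# Small stand-in with the same two-level header shape as the scraped table.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("IMF[1][13]", "Estimate"),
    ("IMF[1][13]", "Year"),
])
sample = pd.DataFrame(
    [["World", "105568776", "2023"],
     ["United States", "26854599", "2023"]],
    columns=columns,
)

# Select by (top level, second level) label instead of by position.
by_label = sample[[("Country/Territory", "Country/Territory"),
                   ("IMF[1][13]", "Estimate")]].copy()

# Flatten to plain names once the right columns are chosen.
by_label.columns = ["Country", "IMF estimate"]
print(by_label)
```

Selecting by label keeps working if the page later gains or loses a column, as long as the header text stays the same.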


In [21]:
# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF)
df = df[[0,2]]
df

Unnamed: 0,0,2
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
...,...,...
209,Anguilla,—
210,Kiribati,248
211,Nauru,151
212,Montserrat,—


In [22]:
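# Retain the rows with index 1 to 10, i.e. the top 10 economies of the world (row 0 is the "World" aggregate).
# .copy() gives an independent dataframe, so later column assignments do not act on a slice.
df = df.iloc[1:11,:].copy()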
df

Unnamed: 0,0,2
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
10,Brazil,2081235
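
Slicing rows 1 to 10 relies on the table already being sorted by the IMF estimate. A more defensive sketch, assuming the same column labels 0 and 2, converts the values to numbers and ranks them explicitly. The rows below are taken from the outputs above and deliberately shuffled.

```python
import pandas as pd

# Stand-in rows taken from the outputs above, deliberately shuffled.
sample = pd.DataFrame({
    0: ["Anguilla", "Japan", "World", "United States", "China"],
    2: ["—", "4409738", "105568776", "26854599", "19373586"],
})

# Turn the placeholder "—" into NaN so the column becomes numeric.
sample[2] = pd.to_numeric(sample[2], errors="coerce")

# Drop the aggregate row and rows without a value, then rank by GDP.
countries = sample[sample[0] != "World"].dropna(subset=[2])
top = countries.nlargest(n=2, columns=2)  # top 2 here; the scenario would use 10
print(top)
```

With `errors="coerce"`, any value that cannot be parsed becomes NaN instead of raising, so it is worth checking how many rows were dropped.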


In [23]:
# Note that columns are labelled as 0 and 2
# Assign column names as "Country" and "GDP (Million USD)"
df.columns = ['Country','GDP (Million USD)']
df

Unnamed: 0,Country,GDP (Million USD)
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
10,Brazil,2081235


In [25]:
# Present the GDP values in billion USD with 2 decimal places

# Change the data type of the 'GDP (Million USD)' column to integer using astype().
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)
# Convert the GDP values from million USD to billion USD
df['GDP (Million USD)'] = df['GDP (Million USD)'] / 1000
# Round the values to 2 decimal places with numpy.round()
df['GDP (Million USD)'] = np.round(df['GDP (Million USD)'], 2)
# Rename the column header; rename() returns a new dataframe, so assign the result
df = df.rename(columns = {'GDP (Million USD)': 'GDP (Billion USD)'})
df


Unnamed: 0,Country,GDP (Billion USD)
1,United States,26854.60
2,China,19373.59
3,Japan,4409.74
4,Germany,4308.85
5,India,3736.88
6,United Kingdom,3158.94
7,France,2923.49
8,Italy,2169.74
9,Canada,2089.67
10,Brazil,2081.24
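
Two details in this step are easy to get wrong: `rename()` does not modify the dataframe in place by default, and the unit conversion should be applied exactly once. The sketch below illustrates both on a two-row stand-in built from the values above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP (Million USD)": [26854599, 19373586],
})

# rename() returns a new dataframe; without assignment the header is unchanged.
sample.rename(columns={"GDP (Million USD)": "GDP (Billion USD)"})
unchanged = list(sample.columns)

# Convert, round, and rename in one chain, assigning the result to a new name
# so that re-running the cell cannot divide the same column twice.
converted = sample.assign(
    **{"GDP (Million USD)": np.round(sample["GDP (Million USD)"] / 1000, 2)}
).rename(columns={"GDP (Million USD)": "GDP (Billion USD)"})
print(converted)
```

Writing the result to a new variable leaves `sample` untouched, which makes the cell safe to execute repeatedly in a notebook.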


# Data storage

In [27]:
# Save the dataframe as a CSV file named "Largest-economies.csv"
df.to_csv('./Largest-economies.csv')
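
By default `to_csv()` also writes the row labels as an unnamed first column. If those labels are not needed, a sketch like the following omits them with `index=False` and reads the text back as a quick check. It writes to an in-memory buffer rather than to disk, and the two-row frame is a stand-in.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {"Country": ["United States", "China"],
     "GDP (Billion USD)": [26854.60, 19373.59]},
    index=[1, 2],
)

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False omits the row labels

csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the columns survive the round trip.
restored = pd.read_csv(io.StringIO(csv_text))
```

Passing a file path instead of the buffer writes the same content to disk.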


---
