<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
    </a>
</p>


# Practice Project: GDP Data extraction and processing

Estimated time needed: **30** minutes

## Introduction

In this practice project, you will put the skills acquired through the course to use. You will extract data from a website using web scraping and request APIs, then process it using the Pandas and NumPy libraries.


## Project Scenario:

An international firm that is looking to expand its business in different countries across the world has recruited you. You have been hired as a junior Data Engineer and are tasked with creating a script that can extract the list of the top 10 largest economies of the world in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). 

The required data seems to be available on the URL mentioned below:


URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29


## Objectives

After completing this lab, you will be able to:

 - Use web scraping to extract required information from a website.
 - Use Pandas to load and process the tabular data as a DataFrame.
 - Use NumPy to manipulate the information contained in the DataFrame.
 - Save the updated DataFrame to a CSV file.


---


## Disclaimer

If you are using a downloaded version of this notebook on your local machine, you may encounter a warning message as shown in the screenshot below.

<p style="text-align:center">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/mod_5/practice_project_disclaimer.png" width="700" alt="warning message">
</p>


This does not affect the execution of your code in any way and can be safely ignored.


# Setup


For this lab, we will be using the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.
*   [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [1]:
import numpy as np
import pandas as pd

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

---


# Exercises

### Exercise 1
Extract the required GDP data from the given URL using Web Scraping.


In [8]:
URL="https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

You can use the Pandas library to extract the required table directly as a DataFrame. `pd.read_html()` returns a list of all tables it finds on the page; the required table is table number 3 in that list (`tables[3]`), as shown in the image below.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/images/pandas_wbs_3.png">
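
Tables scraped from a page like this one often arrive with two-level column headers and an aggregate first row. The sketch below shows one way to flatten the headers and slice out the wanted rows and columns. It uses a small hand-built frame instead of the live page, so the names `raw` and `subset`, the shortened header labels, and the `n/a` placeholder are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: two-level headers
# and a leading "World" aggregate row, as in the real page.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("UN region", "UN region"),
    ("IMF", "Estimate"),
    ("IMF", "Year"),
])
raw = pd.DataFrame(
    [
        ["World", "n/a", 105568776, 2023],
        ["United States", "Americas", 26854599, 2023],
        ["China", "Asia", 19373586, 2023],
        ["Japan", "Asia", 4409738, 2023],
    ],
    columns=columns,
)

# Replace the two-level headers with plain column numbers 0..n-1.
raw.columns = range(raw.shape[1])

# Keep the country name (0) and the IMF estimate (2),
# and skip the aggregate row at position 0.
subset = raw[[0, 2]].iloc[1:3, :].copy()
subset.columns = ["Country", "GDP (Million USD)"]
print(subset)
```

Because `iloc` slices by position and does not renumber rows, the result keeps the original row labels (here 1 and 2).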


In [15]:
# Extract tables from the webpage using Pandas. Retain table number 3 as the required DataFrame.
tables = pd.read_html(URL)
df = tables[3]
print("Existing columns:", df.columns)

# Replace the column headers with column numbers
df.columns = range(df.shape[1])
print(df.head())

# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF)
df = df[[0, 2]]

# Retain the rows with index 1 to 10, indicating the top 10 economies of the world.
df = df.iloc[1:11, :]

# Assign column names as "Country" and "GDP (Million USD)"
df.columns = ['Country', 'GDP (Million USD)']
df.head()

Existing columns: MultiIndex([( 'Country/Territory', 'Country/Territory'),
            (         'UN region',         'UN region'),
            (        'IMF[1][13]',          'Estimate'),
            (        'IMF[1][13]',              'Year'),
            (    'World Bank[14]',          'Estimate'),
            (    'World Bank[14]',              'Year'),
            ('United Nations[15]',          'Estimate'),
            ('United Nations[15]',              'Year')],
           )
               0         1          2          3          4          5  \
0          World         —  105568776       2023  100562011       2022   
1  United States  Americas   26854599       2023   25462700       2022   
2          China      Asia   19373586  [n 1]2023   17963171  [n 3]2022   
3          Japan      Asia    4409738       2023    4231141       2022   
4        Germany    Europe    4308854       2023    4072192       2022   

          6          7  
0  96698005       2021  
1  23315081      

| |Country|GDP (Million USD)|
|-|-|-|
|1|United States|26854599|
|2|China|19373586|
|3|Japan|4409738|
|4|Germany|4308854|
|5|India|3736882|


<details>
    <summary>Click here for Solution</summary>

```python
# Extract tables from webpage using Pandas. Retain table number 3 as the required dataframe.
tables = pd.read_html(URL)
df = tables[3]

# Replace the column headers with column numbers
df.columns = range(df.shape[1])

# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF)
df = df[[0,2]]

# Retain the Rows with index 1 to 10, indicating the top 10 economies of the world.
df = df.iloc[1:11,:]

# Assign column names as "Country" and "GDP (Million USD)"
df.columns = ['Country','GDP (Million USD)']
```

</details>


### Exercise 2
Modify the GDP column of the DataFrame, converting the values from Million USD to Billion USD. Use the `round()` function of the NumPy library to round the values to 2 decimal places. Change the column header to `GDP (Billion USD)`.


In [16]:
# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)

# Convert the GDP value in Million USD to Billion USD
df[['GDP (Million USD)']] = df[['GDP (Million USD)']]/1000

# Use numpy.round() method to round the value to 2 decimal places.
df[['GDP (Million USD)']] = np.round(df[['GDP (Million USD)']], 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'.
# rename() returns a new DataFrame, so assign the result back to df.
df = df.rename(columns = {'GDP (Million USD)' : 'GDP (Billion USD)'})
df


| |Country|GDP (Billion USD)|
|-|-|-|
|1|United States|26854.60|
|2|China|19373.59|
|3|Japan|4409.74|
|4|Germany|4308.85|
|5|India|3736.88|
|6|United Kingdom|3158.94|
|7|France|2923.49|
|8|Italy|2169.74|
|9|Canada|2089.67|
|10|Brazil|2081.24|


<details>
    <summary>Click here for solution</summary>
    
```python
# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)

# Convert the GDP value in Million USD to Billion USD
df[['GDP (Million USD)']] = df[['GDP (Million USD)']]/1000

# Use numpy.round() method to round the value to 2 decimal places.
df[['GDP (Million USD)']] = np.round(df[['GDP (Million USD)']], 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'.
# rename() returns a new DataFrame, so assign the result back to df.
df = df.rename(columns = {'GDP (Million USD)' : 'GDP (Billion USD)'})

```
</details>
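
One detail is easy to miss in this exercise: by default `DataFrame.rename()` returns a new DataFrame and leaves the original untouched. The sketch below illustrates this on two sample rows; the variable names `gdp`, `header_before`, and `header_after` are assumptions used only for this example.

```python
import numpy as np
import pandas as pd

# Two sample rows; values are in Million USD.
gdp = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP (Million USD)": [26854599, 19373586],
})

# Million -> Billion, rounded to 2 decimal places.
gdp["GDP (Million USD)"] = np.round(gdp["GDP (Million USD)"] / 1000, 2)

# rename() returns a new DataFrame; without assignment the header stays unchanged.
gdp.rename(columns={"GDP (Million USD)": "GDP (Billion USD)"})
header_before = list(gdp.columns)

# Assigning the result back makes the new header stick.
gdp = gdp.rename(columns={"GDP (Million USD)": "GDP (Billion USD)"})
header_after = list(gdp.columns)

print(header_before)
print(header_after)
```

In a notebook, the unassigned call still displays the renamed frame as the cell output, which can hide the fact that `df` itself was never changed.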


### Exercise 3


Save the DataFrame to a CSV file named "Largest_economies.csv".


In [17]:
# Save the DataFrame to the CSV file named "Largest_economies.csv".
# The default comma separator keeps the content consistent with the .csv extension.
df.to_csv("Largest_economies.csv")

<details>
    <summary>Click here for Solution</summary>

```python
# Load the DataFrame to the CSV file named "Largest_economies.csv"
df.to_csv('./Largest_economies.csv')
```

</details>
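
By default, `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False`. It writes to in-memory buffers rather than to `Largest_economies.csv`, and the names `economies`, `with_index`, and `without_index` are assumptions for illustration.

```python
import io

import pandas as pd

economies = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP (Billion USD)": [26854.6, 19373.59],
})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
economies.to_csv(with_index)
cols_with_index = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
economies.to_csv(without_index, index=False)
cols_without_index = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index depends on how the file will be used; for a simple ranked list such as this one, either choice is reasonable.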


---


# Congratulations! You have completed the lab.


## Authors


[Abhishek Gagneja](https://www.linkedin.com/in/abhishek-gagneja-23051987/)


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-11-10|0.1|Abhishek Gagneja|Created initial version|


Copyright © 2023 IBM Corporation. All rights reserved.
