# Project Scenario
An international firm that is looking to expand into new countries has hired you as a junior Data Engineer. Your task is to create an automated script that extracts the list of all countries in order of their GDP in billions of USD (rounded to 2 decimal places), as reported by the International Monetary Fund (IMF). Because the IMF releases this evaluation twice a year, the organization will rerun the script to pick up the information each time it is updated.


- Write a data extraction function to retrieve the relevant information from the required URL.
- Transform the available GDP information into 'Billion USD' from 'Million USD'.
- Load the transformed information to the required CSV file and as a database file.
- Run the required query on the database.
- Log the progress of the code with appropriate timestamps.


In [1]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
from datetime import datetime

## Task 0: Preliminary - Defining Values 

In [2]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
filename = 'Countries_by_GDP.csv'
tablename = 'Countries_by_GDP'
database_name = 'World_Economies.db'
log_file = 'etl_project_log.txt'
df = pd.DataFrame(columns=['Country', 'GDP_USD_millions'])
#Attributes of interest are Country and GDP_USD_millions (converted to GDP_USD_billions in Task 2)

## Task 1: Extracting information

#### Collect HTML 

In [3]:
response = requests.get(url).text

#Parse the HTML content
data = BeautifulSoup(response, 'html.parser')

- Using the inspect tool we can see that our table information is under the tbody tag. So, let's examine that further. 

In [4]:
#If we print the length of table we can find how many tables are present. 
table = data.find_all('tbody')
print(len(table))

7


In [5]:
#From inspecting the tables we can see our table of interest is the third one. 
#We can confirm by getting the number of rows and also printing out the table headers.
rows = table[2].find_all('tr')

In [6]:
#216 rows seems about right
print(len(rows))

216


In [7]:
#Headers seem right
print(rows[0])

#The header cells read "Country/Territory", "UN region", "IMF", "World Bank", and so on, which confirms that we are working with the right table.

<tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
<th rowspan="2">Country/Territory
</th>
<th rowspan="2"><a href="/web/20230902185326/https://en.wikipedia.org/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN region</a>
</th>
<th colspan="2"><a href="/web/20230902185326/https://en.wikipedia.org/wiki/International_Monetary_Fund" title="International Monetary Fund">IMF</a><sup class="reference" id="cite_ref-GDP_IMF_2-2"><a href="#cite_note-GDP_IMF-2">[1]</a></sup><sup class="reference" id="cite_ref-15"><a href="#cite_note-15">[13]</a></sup>
</th>
<th colspan="2"><a href="/web/20230902185326/https://en.wikipedia.org/wiki/World_Bank" title="World Bank">World Bank</a><sup class="reference" id="cite_ref-16"><a href="#cite_note-16">[14]</a></sup>
</th>
<th colspan="2"><a href="/web/20230902185326/https://en.wikipedia.org/wiki/United_Nations" title="United Nations">United Nations</a><sup class="reference" id="cite_ref-UN_17-0"><a href="#cite_note

In [8]:
#Let's now extract the data from the Country/Territory column and the IMF -> Estimate column

for row in rows:
    cols = row.find_all('td')
    if len(cols)!=0 and row.find('a') and '—' not in row.text:   
        country = cols[0].text.strip()
        gdp = cols[2].text.strip()        
        df1 = pd.DataFrame({'Country' : [country], 'GDP_USD_millions' : [gdp] })
        df = pd.concat([df, df1], ignore_index=True) 
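
The loop above keeps a row only if it has `td` cells, contains at least one link, and does not contain the dash placeholder. The sketch below reproduces those three checks with the standard library's `html.parser` in place of BeautifulSoup, on a made-up HTML snippet, so it can be run offline. The class name `GdpRowParser`, the `SAMPLE` markup, and the "Aggregate row" and "Exampleland" rows are illustrative assumptions rather than part of the project. It also differs slightly from the loop: it requires at least three cells instead of a non-empty row, and it looks for the dash only in `td` text.

```python
from html.parser import HTMLParser

EM_DASH = "\u2014"  # the dash character the loop above filters on


class GdpRowParser(HTMLParser):
    """Collect (country, gdp) pairs from <tr> rows that pass the three checks."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._cells = None      # <td> texts of the row being parsed
        self._text = None       # text fragments of the <td> being parsed
        self._has_link = False  # whether the current row contains an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._has_link = [], False
        elif tag == "td" and self._cells is not None:
            self._text = []
        elif tag == "a" and self._cells is not None:
            self._has_link = True

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            cells, self._cells = self._cells, None
            keep = (
                len(cells) > 2                                  # has data cells
                and self._has_link                              # contains a link
                and not any(EM_DASH in cell for cell in cells)  # no dash placeholder
            )
            if keep:
                self.records.append((cells[0], cells[2]))


# Hypothetical markup: a header row, a row without a link, one valid row,
# and a row whose estimate is a dash.
SAMPLE = """
<table><tbody>
<tr><th>Country/Territory</th><th>UN region</th><th>Estimate</th></tr>
<tr><td>Aggregate row</td><td>n/a</td><td>1,000</td></tr>
<tr><td><a href="#">United States</a></td><td>Americas</td><td>26,854,599</td></tr>
<tr><td><a href="#">Exampleland</a></td><td>Asia</td><td>\u2014</td></tr>
</tbody></table>
"""

parser = GdpRowParser()
parser.feed(SAMPLE)
print(parser.records)
```

Only the "United States" row survives: the header row has no `td` cells, the second row has no link, and the last row contains the dash.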

In [9]:
# We can see we extracted the correct information
df.head()

         Country  GDP_USD_millions
0  United States          26854599
1          China          19373586
2          Japan           4409738
3        Germany           4308854
4          India           3736882


## Task 2: Transform Information

Three main steps to complete here:

- Convert GDP from currency format (comma-separated strings) to floating-point numbers.
- Convert GDP from USD millions to USD billions by dividing by 1000, then round to 2 decimal places.
- Rename the column from 'GDP_USD_millions' to 'GDP_USD_billions'.
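
As a minimal sketch of what these steps do to a single value, the function below applies the same arithmetic to one string in plain Python. The name `millions_text_to_billions` is a hypothetical helper for illustration; the notebook itself performs the conversion on the whole pandas column at once.

```python
def millions_text_to_billions(value):
    """Convert a string such as '26,854,599' (USD millions) to USD billions."""
    millions = float(value.replace(",", ""))  # strip thousands separators
    return round(millions / 1000, 2)          # millions -> billions, 2 decimals


print(millions_text_to_billions("26,854,599"))
print(millions_text_to_billions("4,409,738"))
```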

In [10]:
#Step 1: remove the thousands separators and convert to float
df['GDP_USD_millions'] = df['GDP_USD_millions'].str.replace(',', '').astype(float)

In [11]:
#Step 2: convert millions to billions and round to 2 decimal places
df['GDP_USD_millions'] = (df['GDP_USD_millions'] / 1000).round(2)

In [12]:
#Step 3: rename the column to match the new unit
df = df.rename(columns= {'GDP_USD_millions' : 'GDP_USD_billions'})

## Task 3: Load Information

In [13]:
#We want to store our information as a CSV file
df.to_csv(filename)

#Now store the information in a database using sqlite3
sql_connection = sqlite3.connect(database_name)
df.to_sql(tablename, sql_connection, if_exists='replace', index=False)

180

## Task 4: Query the Database

In [14]:
query_statement = f"SELECT * from {tablename} WHERE GDP_USD_billions >= 100"
print(query_statement)
query_output = pd.read_sql(query_statement, sql_connection)
print(query_output)

sql_connection.close()

SELECT * from Countries_by_GDP WHERE GDP_USD_billions >= 100
          Country  GDP_USD_billions
0   United States          26854.60
1           China          19373.59
2           Japan           4409.74
3         Germany           4308.85
4           India           3736.88
..            ...               ...
63          Kenya            118.13
64         Angola            117.88
65           Oman            104.90
66      Guatemala            102.31
67       Bulgaria            100.64

[68 rows x 2 columns]
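
If the threshold needs to vary between runs, one option is to bind it as a query parameter rather than writing it into the SQL string. The sketch below shows this with the standard `sqlite3` module on an in-memory database. The function name `countries_at_or_above` and the "Exampleland" row are assumptions made for illustration; the other three rows are taken from the output above.

```python
import sqlite3
from contextlib import closing


def countries_at_or_above(connection, table, threshold):
    """Return (Country, GDP_USD_billions) rows with GDP >= threshold, largest first."""
    # SQLite cannot bind a table name as a parameter, so check it before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    query = (
        f"SELECT Country, GDP_USD_billions FROM {table} "
        "WHERE GDP_USD_billions >= ? ORDER BY GDP_USD_billions DESC"
    )
    return connection.execute(query, (threshold,)).fetchall()


# In-memory database: three rows from the output above plus one made-up row.
with closing(sqlite3.connect(":memory:")) as connection:
    connection.execute(
        "CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billions REAL)"
    )
    connection.executemany(
        "INSERT INTO Countries_by_GDP VALUES (?, ?)",
        [
            ("Bulgaria", 100.64),
            ("United States", 26854.60),
            ("Exampleland", 50.0),  # hypothetical row below the threshold
            ("China", 19373.59),
        ],
    )
    large_economies = countries_at_or_above(connection, "Countries_by_GDP", 100)

print(large_economies)
```

The explicit `ORDER BY` makes the "in order of their GDPs" requirement independent of the order in which rows were inserted.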


## Task 5: Logging Process

In [15]:

def log_progress(message):
    ''' This function logs the given message, prefixed with a timestamp,
    to the log file at a given stage of the code execution. Function returns nothing.'''
    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the documented directive; %h is a platform-dependent alias)
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("./etl_project_log.txt","a") as f:
        f.write(timestamp + ' : ' + message + '\n')
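
An alternative to writing the file by hand is Python's standard `logging` module, which can produce the same `timestamp : message` lines. This is a sketch of that swapped-in technique, not the notebook's method; the function name `make_etl_logger` and the logger name `etl_project` are assumptions.

```python
import logging


def make_etl_logger(path, name="etl_project"):
    """Return a logger that appends 'timestamp : message' lines to the file at path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, so repeated calls do not duplicate lines.
    # (A later call with a different path therefore reuses the first file.)
    if not logger.handlers:
        handler = logging.FileHandler(path, mode="a", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s : %(message)s", datefmt="%Y-%b-%d-%H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

With the values defined in Task 0, a call could look like `make_etl_logger(log_file).info('Data extraction complete')`.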

## Conclusion
We can organize the work into the functions below so that each task is handled separately and the process is more automated. They can also be moved to a separate .py file and run as a standalone script.

In [16]:

def extract(url):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. '''
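    response = requests.get(url).text
    data = BeautifulSoup(response, 'html.parser')
    #Start from an empty dataframe; without this, df is unbound inside the function
    df = pd.DataFrame(columns=['Country', 'GDP_USD_millions'])
    table = data.find_all('tbody')
    rows = table[2].find_all('tr')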
    for row in rows:
        cols = row.find_all('td')
        if len(cols)!=0 and row.find('a') and '—' not in row.text:   
            country = cols[0].text.strip()
            gdp = cols[2].text.strip()        
            df1 = pd.DataFrame({'Country' : [country], 'GDP_USD_millions' : [gdp] })
            df = pd.concat([df, df1], ignore_index=True)   

    return df

def transform(df):
    ''' This function converts the GDP information from Currency
    format to float value, transforms the information of GDP from
    USD (Millions) to USD (Billions) rounding to 2 decimal places.
    The function returns the transformed dataframe.'''
    df['GDP_USD_millions'] = df['GDP_USD_millions'].str.replace(',', '').astype(float)    
    df['GDP_USD_millions'] = (df['GDP_USD_millions'] / 1000).round(2)
    df = df.rename(columns= {'GDP_USD_millions' : 'GDP_USD_billions'})
    return df

def load_to_csv(df, path):
    ''' This function saves the final dataframe as a `CSV` file 
    in the provided path. Function returns nothing.'''
    df.to_csv(path)

def load_to_db(df, sql_connection, table):
    ''' This function saves the final dataframe as a database table
    with the provided name. Function returns nothing.'''
    df.to_sql(table, sql_connection, if_exists='replace', index=False)

def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table and
    prints the output on the terminal. Function returns nothing. '''
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)

def log_progress(message):
    ''' This function logs the given message, prefixed with a timestamp,
    to the log file at a given stage of the code execution. Function returns nothing.'''
    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the documented directive; %h is a platform-dependent alias)
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("./etl_project_log.txt","a") as f:
        f.write(timestamp + ' : ' + message + '\n')
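
To tie the functions together in a standalone script, a small driver can call each stage in turn and log a message at every stage boundary. The sketch below is one possible shape for such a driver. The name `run_pipeline`, its parameters, and the log messages are assumptions, and the stand-in stages exist only so the example runs without network access or pandas.

```python
def run_pipeline(source, extract, transform, loaders, log):
    """Run extract -> transform -> load, logging a message at each stage boundary."""
    log("Preliminaries complete. Initiating ETL process")
    raw = extract(source)
    log("Data extraction complete. Initiating transformation process")
    transformed = transform(raw)
    log("Data transformation complete. Initiating loading process")
    for load in loaders:
        load(transformed)
    log("Data loading complete")
    return transformed


# Stand-in stages (hypothetical): a fixed row instead of a web request,
# a list instead of a CSV file or database, and a list instead of a log file.
messages = []
saved = []
result = run_pipeline(
    source="placeholder-source",
    extract=lambda _source: [("United States", "26,854,599")],
    transform=lambda rows: [
        (name, round(float(value.replace(",", "")) / 1000, 2)) for name, value in rows
    ],
    loaders=[saved.append],
    log=messages.append,
)
print(result)
print(messages)
```

With the functions defined above, the real call might look like `run_pipeline(url, extract, transform, [lambda d: load_to_csv(d, filename), lambda d: load_to_db(d, sql_connection, tablename)], log_progress)`, followed by `run_query` and closing the connection.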