#### Extract, Transform and Load GDP Data

An international firm looking to expand into countries around the world has hired you as a junior data engineer. Your task is to write an automated script that extracts the list of all countries in order of their GDP in billion USD (rounded to 2 decimal places), as recorded by the International Monetary Fund (IMF). Because the IMF releases this evaluation twice a year, the organization will rerun the script to pull the data each time it is updated.

The required data is available at the following URL:
'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'

The output must be available both as a CSV file, `Countries_by_GDP.csv`, and as a table, `Countries_by_GDP`, in a database file, `World_Economies.db`, with attributes `Country` and `GDP_USD_billions`.

Your boss wants you to demonstrate that the code works by running a query on the database table that displays only the countries whose economy exceeds 100 billion USD. You should also log the entire execution process to a file named `etl_project_log.txt`.

You must write a Python script, `etl_project_gdp.py`, that performs all the required tasks.
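
The verification query described above can be tried in isolation before the full pipeline exists. The following is a minimal sketch, not the project script: it uses an in-memory SQLite database, a hypothetical helper named `countries_above`, and three rows copied from the final output shown later in this notebook. It assumes the GDP column is named `GDP_USD_billions`, as in the code below.

```python
import sqlite3

# Sample rows taken from the notebook output; values are in billion USD.
SAMPLE_ROWS = [
    ("United States", 26854.60),
    ("Guatemala", 102.31),
    ("Tuvalu", 0.06),
]

def countries_above(connection, threshold):
    """Return (Country, GDP_USD_billions) rows strictly above the threshold.

    Hypothetical helper for illustration; the threshold is passed as a bound
    parameter rather than formatted into the SQL string.
    """
    cursor = connection.execute(
        "SELECT Country, GDP_USD_billions FROM Countries_by_GDP "
        "WHERE GDP_USD_billions > ? ORDER BY GDP_USD_billions DESC",
        (threshold,),
    )
    return cursor.fetchall()

connection = sqlite3.connect(":memory:")  # nothing is written to disk
connection.execute(
    "CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billions REAL)"
)
connection.executemany("INSERT INTO Countries_by_GDP VALUES (?, ?)", SAMPLE_ROWS)

large_economies = countries_above(connection, 100)
print(large_economies)
connection.close()
```

The strict `>` comparison matches the "exceeds 100 billion" wording. The notebook's own query uses `>=`, which returns the same rows unless a country sits at exactly 100.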

##### ETL Process

In [1]:
# Import the libraries for web scraping and data manipulation
import requests
import sqlite3
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
print("Project libraries have been imported successfully")

# Store final output data and all logs
log_file = "etl_project_log.txt"
target_file = "Countries_by_GDP.csv"

# Initialize all known entities
data_url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
table_attribs = ["Country", "GDP_USD_millions"]
db_name = "World_Economies.db"
table_name = "Countries_by_GDP"
csv_path = '~/Documents/IBM-Data-Engineering-Professional/Course 3 - Python Project for Data Engineering/Extract, Transform and Load GDP data/Countries_by_GDP.csv'

# Useful functions for ETL operations on GDP data
# Extraction
def extract(url, table_attribs):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. 
    '''
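    page = requests.get(url).text
    data = BeautifulSoup(page, 'html.parser')
    # The GDP table is the third <tbody> element on the archived page
    tables = data.find_all('tbody')
    rows = tables[2].find_all('tr')
    # Collect the rows in a list and build the dataframe once at the end,
    # which avoids calling pd.concat inside the loop
    records = []
    for row in rows:
        col = row.find_all('td')
        # Skip rows without <td> cells, rows without a country link,
        # and rows whose IMF estimate is missing (shown as '—')
        if len(col) != 0 and col[0].find('a') is not None and '—' not in col[2]:
            records.append({"Country": col[0].a.contents[0],
                            "GDP_USD_millions": col[2].contents[0]})
    return pd.DataFrame(records, columns=table_attribs)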

# Transformation
def transform(df):
    ''' This function converts the GDP information from Currency
    format to float value, transforms the information of GDP from
    USD (Millions) to USD (Billions) rounding to 2 decimal places.
    The function returns the transformed dataframe.
    '''
    GDP_list = df["GDP_USD_millions"].tolist()
    GDP_list = [float("".join(x.split(','))) for x in GDP_list]
    GDP_list = [np.round(x/1000,2) for x in GDP_list]
    df["GDP_USD_millions"] = GDP_list
    df=df.rename(columns = {"GDP_USD_millions":"GDP_USD_billions"})
    return df

# Loading and Logging
def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a `CSV` file 
    in the provided path. Function returns nothing.'''
    # Omit the row index so the file holds only the two required attributes
    df.to_csv(csv_path, index=False)

def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final dataframe as a database table
    with the provided name. Function returns nothing.'''
    df.to_sql(table_name, sql_connection, if_exists= "replace", index = False)
    
def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table,
    prints the query statement on the terminal, and returns the
    result as a dataframe. '''
    print(query_statement)
    return pd.read_sql(query_statement, sql_connection)
    
def log_progress(message):
    ''' This function logs the mentioned message at a given stage of the code 
    execution to a log file. Function returns nothing'''
    # Year-Monthname-Day-Hour:Minute:Second; %b is the portable code for
    # the abbreviated month name (%h is not supported on every platform)
    timestamp_format = '%Y-%b-%d-%H:%M:%S'
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open(log_file, "a") as f:
        f.write(timestamp + ', ' + message + '\n')   


Project libraries have been imported successfully
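
The arithmetic inside `transform` can be checked on individual values without pandas. This is a minimal sketch with a hypothetical helper, `millions_to_billions`. The input strings are illustrative values in the same comma-separated format the function expects, not data read from the page, and the sketch uses Python's built-in `round` where the notebook uses `np.round`.

```python
def millions_to_billions(raw_value):
    """Convert a string like '1,234,567' (USD millions) to USD billions.

    Hypothetical helper mirroring the steps in transform(): strip the
    thousands separators, convert to float, divide by 1000, round to 2 places.
    """
    without_commas = raw_value.replace(",", "")
    return round(float(without_commas) / 1000, 2)

# Illustrative inputs only (assumed format: USD millions with commas)
examples = ["26,854,599", "4,409,738", "63"]
converted = [millions_to_billions(value) for value in examples]
print(converted)
```

Checking the conversion on a few strings like this makes it easier to spot a unit mistake (for example, dividing by 1,000,000 instead of 1,000) before the dataframe is written to the CSV file and the database.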


In [2]:
# Testing ETL operations and log progress
# Log the initialization of the ETL process 
log_progress("Preliminaries complete. Initiating ETL process")
extracted_data = extract(data_url, table_attribs)
 
# Log the completion of the Extraction process and begin transformation process
log_progress("Data extraction complete. Initiating Transformation process")
transformed_data = transform(extracted_data) 
print("Transformed Data") 
print(transformed_data) 
 
# Log the completion of the Transformation process and begin loading process
log_progress("Data transformation complete. Initiating loading process")
load_to_csv(transformed_data, csv_path)

log_progress("Data saved to CSV file") 

# Use SQLite3 to create and connect to a new database World_Economies.db
sql_connection = sqlite3.connect(db_name)

log_progress("SQL Connection initiated")

load_to_db(transformed_data, sql_connection, table_name)

log_progress("Data loaded to Database as table. Running the query")

query_statement = f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"
print(run_query(query_statement, sql_connection))
 
# Log the completion of the process 
log_progress("Process Complete") 

sql_connection.close()

Transformed Data
              Country  GDP_USD_billions
0       United States          26854.60
1               China          19373.59
2               Japan           4409.74
3             Germany           4308.85
4               India           3736.88
..                ...               ...
186  Marshall Islands              0.29
187             Palau              0.26
188          Kiribati              0.25
189             Nauru              0.15
190            Tuvalu              0.06

[191 rows x 2 columns]
SELECT * from Countries_by_GDP WHERE GDP_USD_billions >= 100
          Country  GDP_USD_billions
0   United States          26854.60
1           China          19373.59
2           Japan           4409.74
3         Germany           4308.85
4           India           3736.88
..            ...               ...
64          Kenya            118.13
65         Angola            117.88
66           Oman            104.90
67      Guatemala            102.31
68       Bulgaria     