### Introduction
In this practice project, you will apply the skills acquired in this course to build a complete ETL pipeline that extracts data from a website and processes it to meet the stated requirements.

### Project Scenario
An international firm looking to expand its business into countries around the world has hired you as a junior Data Engineer. Your task is to create an automated script that extracts the list of all countries in order of their GDP in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Because the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information whenever it is updated.

The required data is available in the archived Wikipedia snapshot at the URL below:

`'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'`

### Objectives
You have to complete the following tasks for this project:

- Write a data extraction function to retrieve the relevant information from the required URL.
- Transform the available GDP information into 'Billion USD' from 'Million USD'.
- Load the transformed information into the required CSV file and into a database table.
- Run the required query on the database.
- Log the progress of the code with appropriate timestamps.
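
To make the transformation objective concrete, here is a minimal sketch of the unit conversion on its own, separate from the pipeline. The helper name `millions_to_billions` and the sample strings are illustrative assumptions, not values read from the page, and the sketch uses Python's built-in `round` rather than `np.round`.

```python
def millions_to_billions(gdp_text):
    """Convert a comma-formatted 'Million USD' string to 'Billion USD'.

    Hypothetical helper for illustration; the result is rounded to
    2 decimal places as the scenario requires.
    """
    # Strip the thousands separators before converting to a number
    gdp_millions = float(gdp_text.replace(',', ''))
    # 1 billion = 1000 million
    return round(gdp_millions / 1000, 2)


# Illustrative sample values (assumed, not scraped)
for sample in ['26,854,599', '869,601', '63']:
    print(sample, '->', millions_to_billions(sample))
```

The same arithmetic is applied to every row of the extracted dataframe in the `transform` step.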

In [None]:
%pip install pandas
%pip install numpy
%pip install requests
%pip install beautifulsoup4

In [2]:
import sqlite3
import requests
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from datetime import datetime

In [3]:


db_name = 'World_Economies.db'
table_name = 'Countries_by_GDP'
csv_path = './Countries_by_GDP.csv'
table_attribs = ['Country', 'GDP_USD_millions']
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'

def extract(url, table_attribs):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. '''

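    # Fail fast on network problems or HTTP error responses
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_page = response.text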
    data = BeautifulSoup(html_page, 'html.parser')

    df = pd.DataFrame(columns=table_attribs)

    tables = data.find_all('tbody')
    rows = tables[2].find_all('tr')

    for row in rows:
        col = row.find_all('td')

        if len(col) != 0:
            if col[0].find('a') is not None and '\u2014' not in col[2]:
                data_dict = {
                    'Country': col[0].a.contents[0],
                    'GDP_USD_millions': col[2].contents[0]
                }

                temp_df = pd.DataFrame(data_dict, index=[0])
                df = pd.concat([df, temp_df], ignore_index=True)
    return df


def transform(df):
    ''' This function converts the GDP information from Currency
    format to float value, transforms the information of GDP from
    USD (Millions) to USD (Billions) rounding to 2 decimal places.
    The function returns the transformed dataframe.'''

    gdp_tolist = df['GDP_USD_millions'].tolist()
    gdp_billions = []

    for gdp in gdp_tolist:
        converted_gdp = float(''.join(gdp.split(',')))
        # convert to billion
        gdp_billions.append(np.round(converted_gdp/1000, 2))

    df['GDP_USD_millions'] = gdp_billions
    df.rename(columns={'GDP_USD_millions': 'GDP_USD_billions'}, inplace=True)

    return df


def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a `CSV` file 
    in the provided path. Function returns nothing.'''
    df.to_csv(csv_path)


def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final dataframe as a database table
    with the provided name. Function returns nothing.'''
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)


def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table and
    prints the output on the terminal. Function returns nothing. '''
    print(query_statement)
    query_result = pd.read_sql(query_statement, sql_connection)
    print(query_result)


def log_progress(message):
    ''' This function logs the mentioned message at a given stage of
    the code execution to a log file. Function returns nothing.'''
    timestamp_format = '%Y-%b-%d-%H:%M:%S'  # Year-Monthname-Day-Hour:Minute:Second
    now = datetime.now()  # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("./etl_project_log.txt", "a") as f:
        f.write(timestamp + ' : ' + message + '\n')
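
As a quick check of what a log entry looks like, the sketch below formats one line without writing to a file. The helper `format_log_line` and the fixed example timestamp are assumptions made for illustration. It uses `%b` for the abbreviated month name, since `%h` is not among the format codes listed in the Python documentation and may not be accepted on every platform.

```python
from datetime import datetime

# Year-Monthname-Day-Hour:Minute:Second
TIMESTAMP_FORMAT = '%Y-%b-%d-%H:%M:%S'


def format_log_line(message, now=None):
    """Return a single log line; hypothetical helper for illustration."""
    # Use the supplied time if given, otherwise the current time
    now = now or datetime.now()
    return f"{now.strftime(TIMESTAMP_FORMAT)} : {message}"


# A fixed timestamp keeps the example reproducible
example_time = datetime(2023, 9, 2, 18, 53, 26)
print(format_log_line('Data saved to CSV file', example_time))
```

With Python's default locale, the month is rendered as an English abbreviation such as `Sep`.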

In [4]:
log_progress('Preliminaries complete. Initiating ETL process')

extracted_data = extract(url, table_attribs)

log_progress('Data extraction complete. Initiating Transformation process')

transformed = transform(extracted_data)

log_progress('Data transformation complete. Initiating loading process')

load_to_csv(transformed, csv_path)

log_progress('Data saved to CSV file')

db_connection = sqlite3.connect(db_name)

log_progress('SQL Connection initiated.')

load_to_db(transformed, db_connection, table_name)

log_progress('Data loaded to Database as table. Running the query')

query_statement = f"select * from {table_name} where GDP_USD_billions < 1000"

run_query(query_statement, db_connection)

log_progress('Process Complete.')

db_connection.close()

select * from Countries_by_GDP where GDP_USD_billions < 1000
              Country  GDP_USD_billions
0         Switzerland            869.60
1              Taiwan            790.73
2              Poland            748.89
3           Argentina            641.10
4             Belgium            624.25
..                ...               ...
167  Marshall Islands              0.29
168             Palau              0.26
169          Kiribati              0.25
170             Nauru              0.15
171            Tuvalu              0.06

[172 rows x 2 columns]
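
As an optional variation on the query above, the threshold can be passed as a bound parameter instead of being written into the SQL string. This is a minimal sketch against an in-memory SQLite database; the helper `countries_below` and the row `Examplia` are made up for illustration, while the Switzerland and Tuvalu values are taken from the output above.

```python
import sqlite3


def countries_below(conn, table_name, threshold):
    """Return (Country, GDP) rows under the threshold, largest first.

    Hypothetical helper. SQLite cannot bind a table name as a
    parameter, so only interpolate table names you control.
    """
    query = (
        f"SELECT Country, GDP_USD_billions FROM {table_name} "
        "WHERE GDP_USD_billions < ? ORDER BY GDP_USD_billions DESC"
    )
    return conn.execute(query, (threshold,)).fetchall()


# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billions REAL)"
)
conn.executemany(
    "INSERT INTO Countries_by_GDP VALUES (?, ?)",
    [('Examplia', 1500.0), ('Switzerland', 869.60), ('Tuvalu', 0.06)],
)

rows = countries_below(conn, 'Countries_by_GDP', 1000)
print(rows)
conn.close()
```

Binding the value with `?` lets the threshold change between runs without rebuilding the SQL text.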
