# Project Scenario
An international firm looking to expand its business into countries across the world has recruited you as a junior Data Engineer. Your task is to create an automated script that extracts the list of all countries in order of their GDP in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Because the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information whenever it is updated.

The project consists of the following tasks:

- Write a data extraction function to retrieve the relevant information from the required URL.
- Transform the available GDP information into 'Billion USD' from 'Million USD'.
- Load the transformed information to the required CSV file and as a database file.
- Run the required query on the database.
- Log the progress of the code with appropriate timestamps.
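
The 'Million USD' to 'Billion USD' conversion in the task list amounts to stripping the thousands separators from the scraped text, dividing by 1,000, and rounding to 2 decimal places. A minimal sketch, assuming the scraped values are comma-grouped strings; the helper name `millions_to_billions` and the sample value are illustrative and not taken from the IMF table:

```python
def millions_to_billions(gdp_text):
    """Convert a comma-grouped 'Million USD' string to a float in 'Billion USD'."""
    # Assumption: the value looks like '1,234,567' (no currency symbol or footnote mark).
    millions = float(gdp_text.replace(',', ''))
    return round(millions / 1000, 2)


# Made-up sample value, for illustration only.
print(millions_to_billions('1,234,567'))  # prints 1234.57
```

On a dataframe, the same helper could be applied to a whole column with `Series.apply`, for example `df['GDP_USD_millions'].apply(millions_to_billions)`, where the column name is an assumption.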


In [25]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import datetime

## Task 0: Preliminary - Defining Values 

In [26]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
filename = 'Countries_by_GDP.csv'
tablename = 'Countries_by_GDP'
database_name = 'World_Economies.db'
log_file = 'etl_project_log.txt'

# Attributes of interest are Country and GDP_USD_billion

#### Collect HTML 

In [28]:

# To avoid repeated calls to the website, download the HTML once and work on it locally.
response = requests.get(url)

with open('countries_by_gdp.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

# Load the saved HTML file
with open('countries_by_gdp.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
data = BeautifulSoup(html_content, 'html.parser')


- Using the browser's inspect tool, we can see that the table information sits under the `tbody` tag, so let's examine that further.

In [None]:
# find_all returns a list of every tbody element on the page; take the first one here.
table = data.find_all('tbody')
# Now that we have the table of interest, collect its rows.
rows = table[0].find_all('tr')
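
Once the rows are collected, each `tr` can be reduced to a country name and a GDP string. The following is only a sketch under stated assumptions: the inline markup is a made-up stand-in for the saved page, the column positions (country in the first cell, GDP in the third) are guesses about the layout, and `parse_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the saved page; not real IMF figures.
sample_html = """
<table><tbody>
  <tr><th>Country</th><th>Region</th><th>Estimate</th></tr>
  <tr><td><a href="/wiki/A">Country A</a></td><td>Europe</td><td>1,234,567</td></tr>
  <tr><td>World</td><td>n/a</td><td>9,999,999</td></tr>
  <tr><td><a href="/wiki/B">Country B</a></td><td>Asia</td><td>n/a</td></tr>
</tbody></table>
"""


def parse_rows(html):
    """Return (country, gdp_text) pairs from the first tbody in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('tbody')[0].find_all('tr')
    records = []
    for row in rows:
        cells = row.find_all('td')
        # Header rows hold only th cells, so they yield no td cells.
        if len(cells) < 3:
            continue
        link = cells[0].find('a')
        gdp_text = cells[2].get_text(strip=True)
        # Assumption: country rows carry a hyperlink, and a usable GDP value
        # is made of digits and commas only.
        if link is None or not gdp_text.replace(',', '').isdigit():
            continue
        records.append((link.get_text(strip=True), gdp_text))
    return records


records = parse_rows(sample_html)
print(records)  # prints [('Country A', '1,234,567')]
```

The resulting list of pairs can be passed to `pd.DataFrame(records, columns=[...])` with whatever column names the project requires.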


## Task 1: Extracting information

In [None]:
# Code for ETL operations on Country-GDP data

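def extract(url, table_attribs):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. '''
    page = requests.get(url).text
    data = BeautifulSoup(page, 'html.parser')
    # Placeholder: an empty dataframe with the requested columns,
    # to be filled once the row-parsing logic is written.
    df = pd.DataFrame(columns=table_attribs)

    return df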

def transform(df):
    ''' This function converts the GDP information from Currency
    format to float value, transforms the information of GDP from
    USD (Millions) to USD (Billions) rounding to 2 decimal places.
    The function returns the transformed dataframe.'''

    return df

def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a `CSV` file 
    in the provided path. Function returns nothing.'''

def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final dataframe as a database table
    with the provided name. Function returns nothing.'''

def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table and
    prints the output on the terminal. Function returns nothing. '''

def log_progress(message):
    ''' This function logs the mentioned message at a given stage of
    the code execution to a log file. Function returns nothing.'''

''' Here, you define the required entities and call the relevant 
functions in the correct order to complete the project. Note that this
portion is not inside any function.'''
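
One possible body for the `log_progress` stub is sketched below. It assumes a 'Year-Month-Day-Hour:Minute:Second' timestamp layout and adds a `log_path` parameter (not part of the original signature) that defaults to the log file name defined in Task 0.

```python
from datetime import datetime


def log_progress(message, log_path='etl_project_log.txt'):
    """Append `message` to the log file, prefixed with the current timestamp."""
    # Assumption: this timestamp layout; any strftime format would work.
    timestamp = datetime.now().strftime('%Y-%m-%d-%H:%M:%S')
    # Mode 'a' appends, so earlier log entries are preserved between runs.
    with open(log_path, 'a', encoding='utf-8') as log:
        log.write(f'{timestamp} : {message}\n')
```

A call such as `log_progress('Data extraction complete')` then adds one timestamped line to the log file. The block imports `datetime` from the `datetime` module; with the notebook's `import datetime`, the equivalent call is `datetime.datetime.now()`.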

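
The database steps can be sketched with pandas and `sqlite3` as follows. This is a hedged example rather than the required solution: the rows are made up, the query is a hypothetical one (the source does not state which query is required), an in-memory database is used so nothing is written to disk, and `run_query` returns its result as a convenience that the stub's docstring does not ask for.

```python
import sqlite3

import pandas as pd


def load_to_db(df, sql_connection, table_name):
    """Write `df` to `table_name`, replacing the table if it already exists."""
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)


def run_query(query_statement, sql_connection):
    """Run the query, print the statement and its output, and return the output."""
    result = pd.read_sql(query_statement, sql_connection)
    print(query_statement)
    print(result)
    return result


# Made-up rows for illustration; not real IMF figures.
df = pd.DataFrame({'Country': ['Country A', 'Country B'],
                   'GDP_USD_billion': [1234.57, 45.6]})

conn = sqlite3.connect(':memory:')  # in-memory database for the demo
load_to_db(df, conn, 'Countries_by_GDP')
# Hypothetical query: countries with a GDP of at least 100 billion USD.
result = run_query('SELECT * FROM Countries_by_GDP WHERE GDP_USD_billion >= 100', conn)
conn.close()
```

For the real run, `sqlite3.connect(database_name)` would write to the database file named in Task 0, and the CSV step is a single `df.to_csv(csv_path)` call.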