# Project Scenario
An international firm looking to expand its business into countries around the world has hired you as a junior Data Engineer. Your task is to create an automated script that extracts the list of all countries ordered by GDP, in billion USD (rounded to 2 decimal places), as published by the International Monetary Fund (IMF). Because the IMF releases this evaluation twice a year, the organization will rerun the script to pull the information each time it is updated.

You can find the required data on this webpage: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29

The required information must be available as a CSV file **Countries_by_GDP.csv** and as a table **Countries_by_GDP** in a database file **World_Economies.db**, with attributes **Country** and **GDP_USD_billion**.

Your boss wants you to demonstrate the success of this code by running a query on the database table that displays only the entries with an economy of more than 100 billion USD. Also, log the entire execution process in a file named **etl_project_log.txt**.

You must write Python code that performs all the required tasks.

# Objectives
1. Write a data extraction function to retrieve the relevant information from the required URL.

2. Transform the available GDP information into *Billion USD* from *Million USD*.

3. Load the transformed information into the required CSV file and database file.

4. Run the required query on the database.

5. Log the progress of the code with appropriate timestamps.



## Preliminary

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import sqlite3 as sql
from datetime import datetime





In [2]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
table_attr = ["Country", "GDP_USD_millions"]
db_name = 'World_Economies.db'
table_name = 'Countries_by_GDP'
csv_path = 'Countries_by_GDP.csv'

## Code

In [3]:
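# 1. extract
def extract(url, table_attr):
    # Download the archived page; fail early on HTTP errors or a hung request
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = BeautifulSoup(response.text, 'html.parser')
    tables = data.find_all('tbody')
    rows = tables[2].find_all('tr')  # the third <tbody> on the page is used as the GDP table
    records = []
    for row in rows:
        col = row.find_all('td')
        # Keep rows whose first cell links to a country and whose third cell is not '—'
        if len(col) != 0 and col[0].find('a') is not None and '—' not in col[2]:
            records.append({
                "Country": col[0].a.contents[0],
                "GDP_USD_millions": col[2].contents[0]
            })
    # Build the DataFrame once instead of concatenating one row at a time
    return pd.DataFrame(records, columns=table_attr)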
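
The row filter in `extract` is easier to follow on a tiny input. The sketch below applies the same three checks (row has `td` cells, first cell contains a link, third cell is not `—`) to an inline HTML string. The HTML, the country names, and the helper `parse_rows` are made up for illustration and are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration only; these are not values from the IMF table.
SAMPLE_HTML = (
    "<table><tbody>"
    "<tr><th>Country</th><th>Region</th><th>Estimate</th></tr>"
    '<tr><td><a href="#">Examplestan</a></td><td>Asia</td><td>1,234</td></tr>'
    '<tr><td><a href="#">Placeholdia</a></td><td>Europe</td><td>—</td></tr>'
    "<tr><td>World</td><td>Total</td><td>9,999</td></tr>"
    "</tbody></table>"
)

def parse_rows(html):
    """Return (country, value) pairs for rows that pass the same checks as extract()."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find("tbody").find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # header row: only <th> cells
        if cells[0].find("a") is None:
            continue  # aggregate row without a country link
        if "—" in cells[2]:
            continue  # no estimate available
        records.append((str(cells[0].a.contents[0]), str(cells[2].contents[0])))
    return records

print(parse_rows(SAMPLE_HTML))
```

Only the `Examplestan` row survives: the header row has no `td` cells, `Placeholdia` has `—` as its estimate, and `World` has no link in its first cell.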

In [4]:
df = extract(url, table_attr)
df

              Country  GDP_USD_millions
0       United States          26854599
1               China          19373586
2               Japan           4409738
3             Germany           4308854
4               India           3736882
..                ...               ...
186  Marshall Islands               291
187             Palau               262
188          Kiribati               248
189             Nauru               151


In [5]:
# 2. transform
def transform(df):
    # Strip thousands separators and convert each GDP value to a float (million USD)
    gdp_millions = [float(item.replace(",", "")) for item in df['GDP_USD_millions']]
    # Convert million USD to billion USD, rounded to 2 decimal places
    gdp_billions = [round(value / 1000, 2) for value in gdp_millions]

    df["GDP_USD_millions"] = gdp_billions
    df = df.rename(columns={"GDP_USD_millions": "GDP_USD_billions"})

    return df

In [6]:
df = transform(df)
df

              Country  GDP_USD_billions
0       United States          26854.60
1               China          19373.59
2               Japan           4409.74
3             Germany           4308.85
4               India           3736.88
..                ...               ...
186  Marshall Islands              0.29
187             Palau              0.26
188          Kiribati              0.25
189             Nauru              0.15
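
As a quick check of the arithmetic, the conversion can be worked through for single values. This is a minimal sketch: the helper name `millions_to_billions` is hypothetical, and the comma-separated input form is an assumption based on the `replace(",", "")` call in `transform`. The three inputs correspond to the United States, Japan, and Marshall Islands rows shown in the tables.

```python
def millions_to_billions(text):
    """Convert a string such as '26,854,599' (million USD) to billion USD, rounded to 2 decimals."""
    return round(float(text.replace(",", "")) / 1000, 2)

for raw in ["26,854,599", "4,409,738", "291"]:
    print(raw, "->", millions_to_billions(raw))
# 26,854,599 -> 26854.6
# 4,409,738 -> 4409.74
# 291 -> 0.29
```

Note that the float `26854.6` is the same value that the DataFrame displays as `26854.60`; only the display formatting differs.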


In [7]:
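# 3. load
def load_to_csv(df, csv_path):
    # index=False keeps the row index out of the file, matching load_to_db
    df.to_csv(csv_path, index=False)

def load_to_db(df, sql_conn, table_name):
    # Replace the table on each run so that reruns do not duplicate rows
    df.to_sql(table_name, sql_conn, if_exists='replace', index=False)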
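
`load_to_db` expects the caller to open and close the connection. One possible way to make sure the connection is closed even when loading fails is `contextlib.closing`. The sketch below uses plain `sqlite3` rather than pandas so that it stays self-contained; the helper `load_rows` is hypothetical, and the drop-and-recreate step only approximates what `if_exists='replace'` does.

```python
import sqlite3
from contextlib import closing

def load_rows(db_path, table_name, rows):
    """Replace table_name with rows and return the row count; the connection is always closed."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
            conn.execute(
                f'CREATE TABLE "{table_name}" (Country TEXT, GDP_USD_billions REAL)'
            )
            conn.executemany(f'INSERT INTO "{table_name}" VALUES (?, ?)', rows)
        return conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]

# ":memory:" keeps the example from touching World_Economies.db
count = load_rows(
    ":memory:",
    "Countries_by_GDP",
    [("United States", 26854.60), ("China", 19373.59)],
)
print(count)
```

The same `with closing(...)` pattern could wrap the `sql.connect(db_name)` call in the notebook, with `load_to_db` and `run_query` called inside the block.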

In [8]:
# 4. run query
def run_query(query_stmt, sql_conn):
    print(query_stmt)
    output = pd.read_sql(query_stmt, sql_conn)
    print(output)

In [9]:
# 5. log
def log_process(message):
    # Year-Monthname-Day-Hour:Minute:Second; %b is the portable spelling of %h
    timestamp_format = '%Y-%b-%d-%H:%M:%S'
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    # Append so that earlier runs stay in the log
    with open("etl_project_log.txt", "a") as file:
        file.write(timestamp + ": " + message + "\n")
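
An alternative to the hand-written `log_process` is the standard-library `logging` module, which can produce the same `timestamp: message` line format. This is a sketch of that swap, not the approach used in this notebook; the helper `make_logger`, the logger name, and the temporary log location are all assumptions made for the example.

```python
import logging
import os
import tempfile

def make_logger(log_path):
    """Return a logger that appends 'timestamp: message' lines to log_path."""
    logger = logging.getLogger("etl_project_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines when a notebook cell is rerun
    logger.propagate = False  # keep messages out of the root logger
    handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s: %(message)s", datefmt="%Y-%b-%d-%H:%M:%S")
    )
    logger.addHandler(handler)
    return logger

# Write to a temporary directory so the real etl_project_log.txt is left alone
log_path = os.path.join(tempfile.mkdtemp(), "etl_project_log.txt")
logger = make_logger(log_path)
logger.info("Preliminaries complete. Initiating ETL process.")

with open(log_path, encoding="utf-8") as file:
    lines = file.read().splitlines()
print(lines)
```

With this setup, each `log_process(...)` call in the run cell would become `logger.info(...)`.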

## Running the project

In [10]:
log_process("Preliminaries complete. Initiating ETL process.")

df = extract(url, table_attr)

log_process("Data extraction complete. Initiating transformation process.")

df = transform(df)

log_process("Data transformation complete. Initiating loading process.")

load_to_csv(df, csv_path)

log_process("Data saved to CSV file.")

sql_conn = sql.connect(db_name)

log_process("SQL Connection initiated.")

load_to_db(df, sql_conn, table_name)

log_process("Data loaded to database as a table. Running the query.")

run_query(f"SELECT * FROM {table_name} WHERE GDP_USD_billions >= 100", sql_conn)

sql_conn.close()

log_process('Process Complete.')

SELECT * FROM Countries_by_GDP WHERE GDP_USD_billions >= 100
          Country  GDP_USD_billions
0   United States          26854.60
1           China          19373.59
2           Japan           4409.74
3         Germany           4308.85
4           India           3736.88
..            ...               ...
64          Kenya            118.13
65         Angola            117.88
66           Oman            104.90
67      Guatemala            102.31
68       Bulgaria            100.64

[69 rows x 2 columns]
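
The query above uses `>=`, which would also return a country at exactly 100 billion USD, while the scenario asks for economies of *more than* 100 billion USD. None of the rows displayed here sits exactly on the threshold, but the difference can be shown on a small in-memory table. In this sketch the `Thresholdia` row is hypothetical and the helper `countries_over` is an assumption for illustration; the other three rows reuse values shown in this notebook. The threshold is passed as a bound parameter instead of being formatted into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billions REAL)")
conn.executemany(
    "INSERT INTO Countries_by_GDP VALUES (?, ?)",
    [
        ("Guatemala", 102.31),
        ("Bulgaria", 100.64),
        ("Marshall Islands", 0.29),
        ("Thresholdia", 100.0),  # hypothetical row, exactly on the threshold
    ],
)

def countries_over(conn, threshold, inclusive):
    """Return country names above the threshold, optionally including equality."""
    op = ">=" if inclusive else ">"
    query = (
        "SELECT Country FROM Countries_by_GDP "
        f"WHERE GDP_USD_billions {op} ? ORDER BY Country"
    )
    return [row[0] for row in conn.execute(query, (threshold,))]

inclusive = countries_over(conn, 100, inclusive=True)
exclusive = countries_over(conn, 100, inclusive=False)
conn.close()

print(inclusive)  # ['Bulgaria', 'Guatemala', 'Thresholdia']
print(exclusive)  # ['Bulgaria', 'Guatemala']
```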
