## Countries by GDP
**As a data engineer, I was recruited by a company that is looking to expand its business across the world. It wants to decide which countries' markets to enter.**
### What I should do for the company:
1. Extract the information from the following web page: [Website](https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29).
2. Clean and transform the data set so it can be presented clearly to the decision maker.
3. Load the data set into a JSON file `Countries_by_GDP.json` and into a database table `Countries_by_GDP` in a database file called `World_Economics.db`, with the attributes `Country` and `GDP_USD_billion`.
4. Create a log file to monitor the whole extract, transform, and load process.

### 1. Import needed libraries

In [130]:
import pandas as pd
import numpy as np
import requests
import sqlite3
import re
from datetime import datetime

### 2. Send a request to get the page

In [118]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
response = requests.get(url)
response.status_code

200

In [119]:
json_file = 'Countries_by_GDP.json'
database_name = 'World_Economics.db'
table_name = 'Countries_by_GDP'

#### Because the needed data is stored in a table in the HTML page, there are two ways to extract it:
1. Using `pandas.read_html()` (the easier one, and the one I intend to use).
2. Using BeautifulSoup.
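
To give a feel for what the manual route involves, here is a minimal sketch that walks the rows of an HTML table by hand. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup (which would do the same job with less code), and the `TableRows` class and the `sample_html` snippet with its values are made up for illustration; they are not taken from the Wikipedia page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample table, not real GDP figures
sample_html = """
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Country A</td><td>1,500</td></tr>
  <tr><td>Country B</td><td>n/a</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

Every row comes back as a plain list of strings, so the header handling, type conversion, and missing values would all still have to be done by hand. That is the work `pandas.read_html()` takes care of.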

In [120]:
def extract(url):
    # pd.read_html returns a list with one DataFrame per <table> found in the page
    df = pd.read_html(url)
    return df

In [121]:
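def Transform(df):
    # `df` is the list of tables returned by extract(); the GDP table is at index 3
    data = df[3]
    # Skip the first row, keep the first four columns, and work on a copy
    data = data.iloc[1:, :4].copy()
    data.columns = ['Country', 'Region', 'GDP_USD_billion', 'Year']
    # Keep only GDP values that contain digits; anything else becomes a missing value
    data['GDP_USD_billion'] = data['GDP_USD_billion'].apply(
        lambda x: x if re.search(r'[0-9]+', str(x)) else None)
    data.dropna(inplace=True)
    # Divide by 1000 to go from millions to billions of USD, rounded to two decimals
    data['GDP_USD_billion'] = np.round(data['GDP_USD_billion'].astype(float) / 1000, 2)
    # Region and Year are not needed in the final data set
    data.drop(columns=['Region', 'Year'], inplace=True)
    return data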
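
The cleaning rule inside `Transform` can be tried on single values without pandas. The sketch below is only an illustration: `to_billions` is a hypothetical helper, the inputs are made-up values, and stripping thousands separators is my own assumption for raw strings.

```python
import re


def to_billions(raw):
    """Return a value given in millions of USD as billions, or None if it has no digits."""
    text = str(raw)
    if not re.search(r'[0-9]', text):
        return None  # placeholder cells such as 'n/a' are treated as missing
    # Assumption: commas are thousands separators and can simply be removed
    return round(float(text.replace(',', '')) / 1000, 2)


print(to_billions('1500'))       # 1.5
print(to_billions('2,500,000'))  # 2500.0
print(to_billions('n/a'))        # None
```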

In [165]:
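def Load(data):
    # Write the data set to the JSON file
    data.to_json(f'./files/{json_file}')
    # Write the data set to the SQLite table, replacing the table if it already exists
    conn = sqlite3.connect(f'./files/{database_name}')
    data.to_sql(table_name, conn, if_exists='replace', index=False)
    conn.close()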
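
Once the table is loaded, it can be queried back to check the result. The sketch below uses an in-memory SQLite database and two made-up rows so that it runs on its own; against the real file, the connection would point at `./files/World_Economics.db` instead. The 100 billion threshold is an arbitrary example.

```python
import sqlite3

# In-memory database with hypothetical rows, standing in for World_Economics.db
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billion REAL)')
conn.executemany(
    'INSERT INTO Countries_by_GDP VALUES (?, ?)',
    [('Country A', 1.5), ('Country B', 250.0)],
)
conn.commit()

# Countries with a GDP of at least 100 billion USD, largest first
top = conn.execute(
    'SELECT Country, GDP_USD_billion FROM Countries_by_GDP '
    'WHERE GDP_USD_billion >= ? ORDER BY GDP_USD_billion DESC',
    (100,),
).fetchall()
conn.close()
print(top)
```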
    

In [171]:
def log(message):
    # Timestamp format: year-month name-day-hour:minute:second
    now = datetime.now().strftime('%Y-%B-%d-%H:%M:%S')

    # Append one line per message to the log file
    with open('log-file', 'a') as file:
        file.write(f'{now} ---> {message}\n')
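
As an alternative to writing the file by hand, Python's standard `logging` module can produce the same line format. This is only a sketch of that option, not what the notebook uses: `make_logger`, the logger name `etl_sketch`, and the temporary log path are all made up for illustration.

```python
import logging
import os
import tempfile


def make_logger(path):
    """Return a logger that appends '<timestamp> ---> <message>' lines to `path`."""
    logger = logging.getLogger('etl_sketch')
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler when the cell is re-run
        handler = logging.FileHandler(path, mode='a', encoding='utf-8')
        handler.setFormatter(
            logging.Formatter('%(asctime)s ---> %(message)s',
                              datefmt='%Y-%B-%d-%H:%M:%S'))
        logger.addHandler(handler)
    return logger


# Write to a temporary file so the sketch does not touch the real log file
log_path = os.path.join(tempfile.mkdtemp(), 'log-file')
logger = make_logger(log_path)
logger.info('Extracting process is started')

with open(log_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```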
    
    
    
    
    

In [172]:
log('Extracting process is started')
df = extract(url)
log('Extracting process is finished')
log('Transforming process is started')
df = Transform(df)
log('Transforming process is finished')
log('Loading process is started')
Load(df)
log('Loading process is finished')
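
Writing the started and finished messages by hand for every stage makes copy-paste slips easy. One possible way around that is a small context manager that logs both messages itself. This is a sketch: `stage` is a hypothetical helper, and a plain list stands in for the `log` function so the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def stage(name, log):
    """Log the start of a stage, then either its failure or its completion."""
    log(f'{name} process is started')
    try:
        yield
    except Exception as exc:
        log(f'{name} process failed: {exc}')
        raise
    else:
        log(f'{name} process is finished')


# A list's append method stands in for log() here
messages = []
with stage('Extracting', messages.append):
    pass  # the real call, e.g. extract(url), would go here
print(messages)
```

In the notebook, `with stage('Loading', log): Load(df)` would then write both log lines around the call.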




In [124]:
# df already holds the transformed data from the pipeline run above
df

| | Country | GDP_USD_billion |
|---|---|---|
| 1 | United States | 26854.60 |
| 2 | China | 19373.59 |
| 3 | Japan | 4409.74 |
| 4 | Germany | 4308.85 |
| 5 | India | 3736.88 |
| ... | ... | ... |
| 206 | Marshall Islands | 0.29 |
| 208 | Palau | 0.26 |
| 210 | Kiribati | 0.25 |
| 211 | Nauru | 0.15 |
