# <font color='red'>**ETL (Extract-Transform-Load) Process**</font>

---

## **Objective**
To design and implement an ETL process.<br><br>

## **Scenario**
We work as data engineers at a financial company in Great Britain, and the company is interested in the market cap of international banks. At the request of the Analytics Dept, the deliverable must be a .csv file for further analysis.<br><br>

## **The Mission** (the who, what, when, where and why)
The `Data Engineering Dept` has to `deliver` a .csv file containing the `'market cap of international banks'` within a given `time span` to the `Analytics Dept` for `further analysis`, as part of the company's decision-making process.<br><br>

## **The Execution** (how to)
* ### **Concept**
The Data Engineering Dept will look for appropriate open sources to acquire the 'market cap of international banks', download the data, implement an ETL process and deliver the required data as a .csv file.
* ### **Tasks**
  1. **Find the desired resources.**<br>
  We found two openly available resources on the web, `bank_market_cap_1.json` and `bank_market_cap_2.json`. Reviewing them shows that the *'market cap'* of the banks is given in USD. Since we reside in Great Britain, we need the exchange rate from USD to GBP, which we also found on the web in the file `exchange_rates.csv`.
  2. **Acquire the relevant data.**<br>
  To acquire the data we've found, we'll use the `wget` command preceded by an exclamation mark (`!`), which lets us run Linux terminal commands from the notebook. Optionally, we can add the `-q` flag to suppress wget's verbose output and the `--show-progress` flag to still display the progress bar.
  3. **Design an ETL process:**<br>
  Knowing the file formats we're going to use and the location where the files reside (in our case the current folder), we need to import the appropriate libraries and define what the ETL process will do and how it is going to do it. The following steps will take place:
    * **Extract**<br>
     First of all, we need to extract the data we downloaded. For that, we'll create the `extract()` function, which runs the overall extraction process. Since a pandas dataframe is an easy way to handle such files, we'll import the `pandas` library along with the `glob` library for finding the files in the directory. The function will read each .json file as a dataframe and put it in a list. Lastly, it will concatenate the 2 dataframes from the list into a single dataframe with specific column names. Apart from that, we need to extract the exchange rate to GBP from the .csv file we downloaded.
    * **Transform**<br>
      For the transform process, we'll create a function `transform(...)`, which takes the extracted data and the exchange rate as parameters, converts the values in the 'market cap' column from USD to GBP and rounds them to 3 decimal places. We'll also rename this column from USD to GBP for consistency.
    * **Load**<br>
      In this step we're going to create a function `load(...)` which will take the transformed data from the previous step and load it as a .csv file.
  4. **Monitor the process**<br>
      To monitor our ETL process, we're going to create a logging function `log(...)`, which takes a message as a parameter (the message will be the status of each task), creates a timestamp and appends both to a .txt file. This file was not requested, but it's convenient for keeping track of the process. To help us with the timestamp, we need to import another library, `datetime`.
  


<a name="li"></a>
# Libraries  


In [1]:
import glob  #file handling
import pandas as pd  #dataframe usage
from datetime import datetime  #datetime formatting for the log file

<a name="dldt"></a>
# Data Acquisition


The data we're going to download is not current; it is used for demonstration purposes only.

In [2]:
!wget -q https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json --show-progress
!wget -q https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json --show-progress
!wget -q https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Final%20Assignment/exchange_rates.csv --show-progress
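
Where `wget` is not available (for example on Windows), the same downloads can be sketched in Python with the standard library. This is a minimal sketch and not part of the original notebook: the helper name `download` and the constants `BASE` and `URLS` are hypothetical, while the URLs themselves are the ones used above.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse
from urllib.request import urlretrieve

# common prefix of the three URLs used in the wget commands above
BASE = ("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
        "IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/")

URLS = [
    BASE + "Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json",
    BASE + "Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json",
    BASE + "Final%20Assignment/exchange_rates.csv",
]


def download(url, folder="."):
    """Download `url` into `folder` and return the local path (hypothetical helper)."""
    # the local file name is the last component of the (percent-decoded) URL path
    filename = Path(unquote(urlparse(url).path)).name
    target = Path(folder) / filename
    urlretrieve(url, target)
    return target
```

Calling `download(url)` for each entry of `URLS` would place the three files in the current folder, which is where the rest of the notebook expects them.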



<a name="expr"></a>
# Extraction Process

This function will `extract a .json file` and return it as a dataframe.

In [3]:
def extract_from_json(file_to_process):
  dataframe = pd.read_json(file_to_process, orient='columns')
  return dataframe
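
To see what `pd.read_json(..., orient='columns')` produces, here is a small sketch on an in-memory sample. The bank names and values are made up, and the column-oriented layout of the sample is an assumption for illustration; the layout of the downloaded files is not shown in this notebook.

```python
import io

import pandas as pd

# made-up sample in the column-oriented layout: {column -> {row label -> value}}
sample_json = """
{
  "Name": {"0": "Bank A", "1": "Bank B"},
  "Market Cap (US$ Billion)": {"0": 100.0, "1": 50.5}
}
"""

# io.StringIO lets read_json treat the string like a file
sample_df = pd.read_json(io.StringIO(sample_json), orient="columns")
print(sample_df)
```

Each top-level key becomes a column, and the inner keys become the row labels.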

We define the `extract function`, which finds the .json files bank_market_cap_1.json and bank_market_cap_2.json and calls the function created above to extract the data from each of them. Finally, we combine the data in a single dataframe with specific column names.

In [4]:
def extract():
  # columns we expect in the combined dataframe
  columns = ['Name', 'Market Cap (US$ Billion)']
  dt = []

  # process all json files (in sorted order) and append them to a list
  for jsonfile in sorted(glob.glob("bank*.json")):
    json_data = extract_from_json(jsonfile)
    dt.append(json_data)

  # stack the dataframes, renumber the rows and keep only the expected columns
  extracted_data = pd.concat(dt, axis=0, ignore_index=True)[columns]
  return extracted_data
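
When several dataframes are stacked with `pd.concat`, each one keeps its own row labels unless `ignore_index=True` is passed. The sketch below illustrates the difference on two made-up dataframes (the bank names and values are not taken from the downloaded files).

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Bank A", "Bank B"],
                      "Market Cap (US$ Billion)": [100.0, 50.5]})
second = pd.DataFrame({"Name": ["Bank C"],
                       "Market Cap (US$ Billion)": [75.25]})

# row labels of each input are kept, so label 0 appears twice
kept = pd.concat([first, second], axis=0)

# row labels are rebuilt from 0 to n-1
fresh = pd.concat([first, second], axis=0, ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0]
print(fresh.index.tolist())  # [0, 1, 2]
```

Duplicate row labels do not affect the final .csv file here, because the index is not written out, but a fresh index makes label-based lookups on the combined dataframe unambiguous.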

We retrieve the exchange rates from the file `exchange_rates.csv` as a dataframe, then find the exchange rate for British pounds (symbol GBP) and store it in the variable `exchange_rate_to_GBP`.

In [5]:
path = 'exchange_rates.csv'

# index_col=0 uses the first column of the .csv file (the currency codes) as the index instead of a default numeric index
exchange_rate = pd.read_csv(path, index_col=0)

In [6]:
# taking a quick look how the exchange rates dataframe is formed
exchange_rate.head(5)

| | Rates |
|---|---|
| AUD | 1.297088 |
| BGN | 1.608653 |
| BRL | 5.409196 |
| CAD | 1.271426 |
| CHF | 0.886083 |


In [7]:
# retrieving the exchange rate for GBP (selecting both the row 'GBP' and the column 'Rates' returns a single value instead of a subset of the dataframe)
exchange_rate_to_GBP = exchange_rate.loc['GBP', 'Rates']
print(f'The exchange rate from US Dollars to GBP (Great Britain Pounds) is: {exchange_rate_to_GBP:.3f}')

The exchange rate from US Dollars to GBP (Great Britain Pounds) is: 0.732
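
The lookup can be tried on a few in-memory rows shaped like the preview above. This is a sketch only: the AUD and CAD values are copied from the preview, while the GBP value is the rounded figure printed above rather than the full value stored in the file.

```python
import io

import pandas as pd

# first column holds the currency codes, second column the rates
sample_csv = """,Rates
AUD,1.297088
CAD,1.271426
GBP,0.732
"""

rates = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# label-based lookup: row 'GBP', column 'Rates' gives a single number
gbp_rate = rates.loc["GBP", "Rates"]
print(f"{gbp_rate:.3f}")  # 0.732
```

Passing both the row label and the column label to `.loc` avoids the intermediate one-row selection and any positional indexing on it.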


<a name="trpr"></a>
# Transformation Process

Defining a `transform function` that:
1. Converts the `'Market Cap (US$ Billion)'` column from USD to GBP
2. Rounds the `Market Cap` column to 3 decimal places
3. Renames `Market Cap (US$ Billion)` to `Market Cap (GBP Billion)`

In [8]:
def transform(data, exch_rate):
  # working on a copy so that the extracted data are left unchanged
  data = data.copy()

  # converting the market cap from USD to GBP and rounding to 3 decimal places
  data['Market Cap (US$ Billion)'] = (data['Market Cap (US$ Billion)'] * exch_rate).round(3)

  # renaming the column from US$ to GBP
  data = data.rename(columns={'Market Cap (US$ Billion)': 'Market Cap (GBP Billion)'})
  return data
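
The arithmetic behind the conversion is simply value in GBP = value in USD × exchange rate, followed by rounding. A quick check on made-up numbers (the rate of 0.75 and the three values are assumptions chosen to keep the arithmetic easy to follow):

```python
import pandas as pd

usd = pd.Series([100.0, 250.5, 33.3333], name="Market Cap (US$ Billion)")
rate = 0.75  # made-up exchange rate, for illustration only

# multiply the whole column at once, round to 3 decimals, then rename
gbp = (usd * rate).round(3).rename("Market Cap (GBP Billion)")

print(gbp.tolist())  # [75.0, 187.875, 25.0]
```

The last value shows the effect of rounding: 33.3333 × 0.75 = 24.999975, which becomes 25.0 at 3 decimal places.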

<a name="ldpr"></a>
# Load Process

Defining a `load function` that takes a dataframe and writes it to a .csv file named `bank_market_cap_gbp.csv`, setting `index` to `False` so that the dataframe index is not written as an extra column in the .csv file.

In [9]:
def load(data_in):

  # where the transformed data go to
  path_to = 'bank_market_cap_gbp.csv'

  # write the transformed data to the destination file excluding the index
  data_in.to_csv(path_to, index=False)
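
To see what `index=False` does to the written file, the sketch below writes a made-up dataframe to a temporary folder and reads it back. The file name `sample_market_cap_gbp.csv` and the data are hypothetical, so the real output file is not touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({"Name": ["Bank A", "Bank B"],
                       "Market Cap (GBP Billion)": [75.0, 187.875]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample_market_cap_gbp.csv"  # hypothetical file name
    sample.to_csv(target, index=False)

    # first line of the file: only the two data columns, no index column
    header = target.read_text().splitlines()[0]
    restored = pd.read_csv(target)

print(header)  # Name,Market Cap (GBP Billion)
print(restored)
```

Without `index=False`, the header would start with an empty field and every row would carry its row label as a leading column.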

<a name="ldpr"></a>
# Logging Process

Creating a `log function` to log our ETL sequence.

In [10]:
def log(msg):

  # getting the current timestamp
  now = datetime.now()

  # the format we're going to use for our timestamp : Year-Monthname-Day-Hour:Minute:Second
  # (%b is the portable directive for the abbreviated month name; %h is not supported on every platform)
  timestamp_format = '%Y-%b-%d-%H:%M:%S'

  # formatting the current timestamp
  timestamp = now.strftime(timestamp_format)

  # append the timestamp of the running process to a log file
  with open("logfile.txt", "a") as f:
      f.write(timestamp + ',' + msg + '\n')
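
Each log line has the form `timestamp,message`, so it can be read back by splitting at the first comma. The following is a sketch under a few assumptions: `log_to` is a hypothetical variant of the log function that takes the file path as a parameter, it writes to a temporary file rather than `logfile.txt`, and it uses `%b`, the portable directive for the abbreviated month name.

```python
import tempfile
from datetime import datetime
from pathlib import Path

TIMESTAMP_FORMAT = "%Y-%b-%d-%H:%M:%S"


def log_to(path, msg):
    """Append one 'timestamp,message' line to `path` (hypothetical helper)."""
    timestamp = datetime.now().strftime(TIMESTAMP_FORMAT)
    with open(path, "a") as f:
        f.write(timestamp + "," + msg + "\n")


with tempfile.TemporaryDirectory() as tmp:
    logfile = Path(tmp) / "sample_log.txt"
    log_to(logfile, "ETL Job Started")
    log_to(logfile, "ETL Job Ended")
    lines = logfile.read_text().splitlines()

# split each line once at the first comma: timestamp on the left, message on the right
for line in lines:
    stamp, message = line.split(",", 1)
    parsed = datetime.strptime(stamp, TIMESTAMP_FORMAT)
    print(parsed.isoformat(), message)
```

Opening the file in append mode (`"a"`) means every run adds to the existing log instead of overwriting it.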

<a name="etlpr"></a>
# ETL Process (Get the pieces together)

Finally, we can define the overall ETL process by putting all the previous steps together into a single process.

In [11]:
def etl():
  log("ETL Job Started")

  #>>>>>>>>>> Extract <<<<<<<<<<
  log("Extract Phase Started")
  extracted_data = extract()
  log('Extract Phase Ended')

  #>>>>>>>>>> Transform <<<<<<<<<<
  log('Transform Phase Started')
  transformed_data = transform(extracted_data, exchange_rate_to_GBP)
  log('Transform Phase Ended')

  #>>>>>>>>>> Load <<<<<<<<<<
  log('Load Phase Started')
  load(transformed_data)
  log('Load Phase Ended')

  log("ETL Job Ended")

  print('ETL complete!')
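
Every phase above follows the same pattern: log the start, do the work, log the end. One possible way to capture that pattern, and to record a failure in the log as well, is a small wrapper. This is only a sketch: `run_phase` is a hypothetical helper that is not part of the notebook, and its `log` parameter defaults to `print` so that the example runs on its own.

```python
def run_phase(name, func, *args, log=print):
    """Run one phase and log its start, end or failure (hypothetical helper)."""
    log(f"{name} Phase Started")
    try:
        result = func(*args)
    except Exception as exc:
        # record the failure, then let the error propagate to the caller
        log(f"{name} Phase Failed: {exc}")
        raise
    log(f"{name} Phase Ended")
    return result


# demonstration with a stand-in phase and a list collecting the log messages
messages = []
doubled = run_phase("Transform", lambda x: x * 2, 21, log=messages.append)

print(doubled)   # 42
print(messages)  # ['Transform Phase Started', 'Transform Phase Ended']
```

With the notebook's own functions, a call such as `run_phase("Extract", extract, log=log)` would produce the same start and end messages as the explicit calls in `etl()`.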

<a name="runetl"></a>
# Run the ETL Process

In [12]:
etl()

ETL complete!
