ETL pipeline that extracts bank market cap and exchange rate data, transforms it, and loads it into a structured format

danielmschaves/ibm-data-engineering-python-project

IBM Python Project for Data Engineering

Features

  • Data Extraction: Extracts bank market capitalization data from JSON files and exchange rates from CSV files.
  • Data Transformation: Converts bank market caps to different currencies using fetched exchange rates.
  • Data Loading: Outputs the transformed data into CSV format.
  • Logging: Provides detailed logs for auditing and debugging.
  • Testing: Unit tests to validate each component of the pipeline.

Scenario

As a data engineer working for an international financial analysis company, my job was to collect financial data from various sources such as websites, APIs, and files provided by financial analysis firms. The work breaks down into the following tasks:

  • Extract API Data: Collect exchange rate data using the ExchangeRate-API and store it as a CSV file. The data is fetched with the requests library and loaded into a pandas DataFrame.

  • Web Scraping: Scrape the largest banks' information by market capitalization from a Wikipedia page (https://en.wikipedia.org/wiki/List_of_largest_banks) using BeautifulSoup. The scraped data is stored in a pandas DataFrame and saved as a JSON file named bank_market_cap.json.

  • Extract: The first phase reads the bank data back from the bank_market_cap.json file into a pandas DataFrame with the columns 'Name' and 'Market Cap (US$ Billion)'.

  • Transform: The second phase transforms the extracted data. The 'Market Cap (US$ Billion)' column is converted from USD to GBP using exchange rate data from the exchange_rates.csv file. The transformed data is rounded to 3 decimal places, and the column is renamed to 'Market Cap (GBP$ Billion)'.

  • Load: The final phase writes the transformed data to a new CSV file named bank_market_cap_gbp.csv. The DataFrame is saved with index=False so that the row index is not written to the file.

  • Logging: A logging function keeps track of the ETL process by appending timestamped messages to a file named logfile.txt.
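
The extract, transform, load, and logging phases described above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names (log, extract, transform, load) are hypothetical, the JSON file is assumed to hold a list of records, and the USD-to-GBP rate is passed in as a plain number because the layout of exchange_rates.csv is not specified here.

```python
from datetime import datetime

import pandas as pd

USD_COL = "Market Cap (US$ Billion)"
GBP_COL = "Market Cap (GBP$ Billion)"


def log(message, logfile="logfile.txt"):
    """Append a timestamped message to the log file (hypothetical helper)."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(f"{timestamp} - {message}\n")


def extract(json_path):
    """Read the bank data from JSON and keep the two documented columns.

    Assumes the file is a JSON list of records.
    """
    df = pd.read_json(json_path)
    return df[["Name", USD_COL]]


def transform(df, usd_to_gbp):
    """Convert market caps from USD to GBP, round to 3 places, rename the column."""
    out = df.copy()
    out[USD_COL] = (out[USD_COL] * usd_to_gbp).round(3)
    return out.rename(columns={USD_COL: GBP_COL})


def load(df, csv_path):
    """Write the DataFrame to CSV without the row index."""
    df.to_csv(csv_path, index=False)
```

With these helpers, a run might call extract on bank_market_cap.json, pass the result to transform together with the GBP rate read from exchange_rates.csv, write the output with load to bank_market_cap_gbp.csv, and call log before and after each phase.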

API Layer Signup and Configuration

  1. Sign Up for API Layer: To fetch the latest exchange rates, sign up for a free or premium account on API Layer.

  2. API Key: After signing up, navigate to your dashboard to find your API key.

  3. Configure API Key: Create a config.json file in the root directory of the project. Add your API key like this:

    {
      "EXCHANGE_API_KEY": "your_api_key_here"
    }

    Replace "your_api_key_here" with your actual API key.
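
For illustration, the key could be read from config.json along these lines. This is a sketch under assumptions: the helper name load_api_key is hypothetical, and the project's own code may load the configuration differently.

```python
import json
from pathlib import Path


def load_api_key(config_path="config.json"):
    """Return the EXCHANGE_API_KEY value from the JSON config file.

    Hypothetical helper; raises KeyError with a clear message if the key is absent.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    try:
        return config["EXCHANGE_API_KEY"]
    except KeyError:
        raise KeyError(f"EXCHANGE_API_KEY is missing from {config_path}") from None
```

Reading the key from a file at run time keeps it out of the source code; it also makes sense to leave config.json out of version control so the key is not committed by accident.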

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/IBM_Python_Project_For_Data_Engineering.git
    cd IBM_Python_Project_For_Data_Engineering

  2. Configure the Python version with pyenv:

    pyenv install 3.11.3
    pyenv local 3.11.3

  3. Install dependencies using Poetry:

    poetry install

  4. Activate the Poetry environment:

    poetry shell

  5. Run the ETL pipeline:

    python src/etl/etl.py

  6. To run the tests, execute the following command inside the Poetry environment:

    pytest
