- Data Extraction: Extracts bank market capitalization data from JSON files and exchange rates from CSV files.
- Data Transformation: Converts bank market caps to different currencies using fetched exchange rates.
- Data Loading: Outputs the transformed data into CSV format.
- Logging: Provides detailed logs for auditing and debugging.
- Testing: Unit tests to validate each component of the pipeline.
In this project, I take the role of a data engineer at an international financial analysis company, collecting financial data from various sources such as websites, APIs, and files provided by financial analysis firms.
- Extract API Data: Collect exchange rate data using the ExchangeRate-API and store it as a CSV file. The data is fetched with the requests library and loaded into a pandas DataFrame.
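A minimal sketch of this step might look like the following; the endpoint URL, the `apikey` header, and the response's `rates` field are assumptions based on API Layer's Exchange Rates Data API, and the file names simply mirror the description above.

```python
# Minimal sketch of the exchange-rate extraction step. The endpoint,
# header name, and response layout are assumptions based on API Layer's
# Exchange Rates Data API; adjust them to match the real project code.
import json

import pandas as pd
import requests

with open("config.json") as f:
    api_key = json.load(f)["EXCHANGE_API_KEY"]

url = "https://api.apilayer.com/exchangerates_data/latest?base=USD"
response = requests.get(url, headers={"apikey": api_key})
response.raise_for_status()

# "rates" maps currency codes (e.g. "GBP") to rates against the USD base.
rates = response.json()["rates"]
pd.DataFrame.from_dict(rates, orient="index", columns=["Rates"]).to_csv(
    "exchange_rates.csv"
)
```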
- Web Scraping: Scrape information on the largest banks by market capitalization from the Wikipedia page https://en.wikipedia.org/wiki/List_of_largest_banks using BeautifulSoup. The scraped data is stored in a pandas DataFrame and saved as a JSON file named bank_market_cap.json.
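A scraping sketch along these lines is plausible; the assumption that the figures sit in the first wikitable on the page, and the exact column positions, would need checking against the live article.

```python
# Minimal scraping sketch. Assumes the first "wikitable" on the page
# holds the ranking, with the bank name and market cap in columns 2 and 3;
# Wikipedia tables change, so verify the layout before relying on this.
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_largest_banks").text
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = tr.find_all("td")
    if len(cells) >= 3:
        rows.append(
            {
                "Name": cells[1].get_text(strip=True),
                "Market Cap (US$ Billion)": float(
                    cells[2].get_text(strip=True).replace(",", "")
                ),
            }
        )

pd.DataFrame(rows).to_json("bank_market_cap.json")
```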
- Extract: The first phase extracts data from the JSON file into a pandas DataFrame with the columns 'Name' and 'Market Cap (US$ Billion)'.
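As a sketch, the extract phase can be a thin wrapper around pandas.read_json; the function name and default path here are illustrative assumptions.

```python
# Hypothetical extract() helper mirroring the description above.
import pandas as pd

COLUMNS = ["Name", "Market Cap (US$ Billion)"]

def extract(path: str = "bank_market_cap.json") -> pd.DataFrame:
    """Read the scraped bank data, keeping only the expected columns."""
    return pd.read_json(path)[COLUMNS]
```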
- Transform: The second phase transforms the extracted data. The 'Market Cap (US$ Billion)' column is converted from USD to GBP using exchange rate data from the exchange_rates.csv file. The transformed values are rounded to 3 decimal places, and the column is renamed to 'Market Cap (GBP$ Billion)'.
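A minimal transform sketch, assuming exchange_rates.csv carries currency codes in its index and a single Rates column (the layout written by the extraction sketch above):

```python
# Hypothetical transform() helper: converts USD market caps to GBP.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    rates = pd.read_csv("exchange_rates.csv", index_col=0)
    gbp_rate = rates.loc["GBP", "Rates"]  # GBP per 1 USD
    out = df.copy()
    out["Market Cap (US$ Billion)"] = (
        out["Market Cap (US$ Billion)"] * gbp_rate
    ).round(3)
    return out.rename(
        columns={"Market Cap (US$ Billion)": "Market Cap (GBP$ Billion)"}
    )
```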
- Load: The final phase loads the transformed data into a new CSV file named bank_market_cap_gbp.csv. The DataFrame is saved with `index=False` so the row index is not written to the file.
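The load phase then reduces to a single to_csv call; the helper name is again hypothetical:

```python
# Hypothetical load() helper matching the description above.
import pandas as pd

def load(df: pd.DataFrame, path: str = "bank_market_cap_gbp.csv") -> None:
    df.to_csv(path, index=False)  # index=False keeps the row index out of the CSV
```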
- Logging: A logging function keeps track of the ETL process, appending timestamped messages to a file named logfile.txt.
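One plausible shape for such a logger (the timestamp format and message wording are assumptions):

```python
# Hypothetical log() helper: appends a timestamped line to logfile.txt.
from datetime import datetime

def log(message: str) -> None:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("logfile.txt", "a") as f:
        f.write(f"{timestamp}, {message}\n")

# Typical usage around each phase:
# log("ETL Job Started")
# log("Extract phase Started")
# log("Extract phase Ended")
```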
- Sign Up for API Layer: To fetch the latest exchange rates, sign up for a free or premium account on API Layer.
- API Key: After signing up, navigate to your dashboard to find your API key.
- Configure API Key: Create a `config.json` file in the root directory of the project and add your API key like this:

```json
{ "EXCHANGE_API_KEY": "your_api_key_here" }
```

Replace `"your_api_key_here"` with your actual API key.
- Clone the repository:

```bash
git clone https://github.com/yourusername/IBM_Python_Project_For_Data_Engineering.git
cd IBM_Python_Project_For_Data_Engineering
```

- Configure the Python version with `pyenv`:

```bash
pyenv install 3.11.3
pyenv local 3.11.3
```

- Install dependencies using Poetry:

```bash
poetry install
```

- Activate the Poetry environment:

```bash
poetry shell
```

- Run the ETL pipeline:

```bash
python src/etl/etl.py
```

- To run the tests, execute the following command inside the Poetry environment:

```bash
pytest
```