- Data Extraction: Extracts bank market capitalization data from JSON files and exchange rates from CSV files.
- Data Transformation: Converts bank market caps to different currencies using fetched exchange rates.
- Data Loading: Outputs the transformed data into CSV format.
- Logging: Provides detailed logs for auditing and debugging.
- Testing: Unit tests to validate each component of the pipeline.
In this project, I take the role of a data engineer at an international financial analysis company, collecting financial data from various sources such as websites, APIs, and files provided by financial analysis firms.
- Extract API Data: Collect exchange rate data using the ExchangeRate-API and store it as a CSV file. The data is fetched with the requests library and loaded into a pandas DataFrame.
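A minimal sketch of this step might look like the following; the endpoint URL, the `apikey` header, and the response's `rates` field are assumptions based on API Layer's Exchange Rates Data API, and the file names simply mirror the description above.

```python
# Minimal sketch of the exchange-rate extraction step. The endpoint,
# header name, and response layout are assumptions based on API Layer's
# Exchange Rates Data API; adjust them to match the real project code.
import json

import pandas as pd
import requests

with open("config.json") as f:
    api_key = json.load(f)["EXCHANGE_API_KEY"]

url = "https://api.apilayer.com/exchangerates_data/latest?base=USD"
response = requests.get(url, headers={"apikey": api_key})
response.raise_for_status()

# "rates" maps currency codes (e.g. "GBP") to rates against the USD base.
rates = response.json()["rates"]
pd.DataFrame.from_dict(rates, orient="index", columns=["Rates"]).to_csv(
    "exchange_rates.csv"
)
```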
- Web Scraping: Scrape information on the largest banks by market capitalization from the Wikipedia page https://en.wikipedia.org/wiki/List_of_largest_banks using BeautifulSoup. The scraped data is stored in a pandas DataFrame and saved as a JSON file named bank_market_cap.json.
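A scraping sketch along these lines is plausible; the assumption that the figures sit in the first wikitable on the page, and the exact column positions, would need checking against the live article.

```python
# Minimal scraping sketch. Assumes the first "wikitable" on the page
# holds the ranking, with the bank name and market cap in columns 2 and 3;
# Wikipedia tables change, so verify the layout before relying on this.
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_largest_banks").text
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = tr.find_all("td")
    if len(cells) >= 3:
        rows.append(
            {
                "Name": cells[1].get_text(strip=True),
                "Market Cap (US$ Billion)": float(
                    cells[2].get_text(strip=True).replace(",", "")
                ),
            }
        )

pd.DataFrame(rows).to_json("bank_market_cap.json")
```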
- Extract: The first phase extracts data from the JSON file into a pandas DataFrame with the columns 'Name' and 'Market Cap (US$ Billion)'.
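As a sketch, the extract phase can be a thin wrapper around pandas.read_json; the function name and default path here are illustrative assumptions.

```python
# Hypothetical extract() helper mirroring the description above.
import pandas as pd

COLUMNS = ["Name", "Market Cap (US$ Billion)"]

def extract(path: str = "bank_market_cap.json") -> pd.DataFrame:
    """Read the scraped bank data, keeping only the expected columns."""
    return pd.read_json(path)[COLUMNS]
```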
- Transform: The second phase transforms the extracted data. The 'Market Cap (US$ Billion)' column is converted from USD to GBP using exchange rate data from the exchange_rates.csv file. The transformed values are rounded to 3 decimal places, and the column is renamed to 'Market Cap (GBP$ Billion)'.
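A minimal transform sketch, assuming exchange_rates.csv carries currency codes in its index and a single Rates column (the layout written by the extraction sketch above):

```python
# Hypothetical transform() helper: converts USD market caps to GBP.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    rates = pd.read_csv("exchange_rates.csv", index_col=0)
    gbp_rate = rates.loc["GBP", "Rates"]  # GBP per 1 USD
    out = df.copy()
    out["Market Cap (US$ Billion)"] = (
        out["Market Cap (US$ Billion)"] * gbp_rate
    ).round(3)
    return out.rename(
        columns={"Market Cap (US$ Billion)": "Market Cap (GBP$ Billion)"}
    )
```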
- Load: The final phase loads the transformed data into a new CSV file named bank_market_cap_gbp.csv. The DataFrame is saved with `index=False` so the row index is not written to the file.
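The load phase then reduces to a single to_csv call; the helper name is again hypothetical:

```python
# Hypothetical load() helper matching the description above.
import pandas as pd

def load(df: pd.DataFrame, path: str = "bank_market_cap_gbp.csv") -> None:
    df.to_csv(path, index=False)  # index=False keeps the row index out of the CSV
```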
- Logging: A logging function keeps track of the ETL process, appending timestamped messages to a file named logfile.txt.
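One plausible shape for such a logger (the timestamp format and message wording are assumptions):

```python
# Hypothetical log() helper: appends a timestamped line to logfile.txt.
from datetime import datetime

def log(message: str) -> None:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("logfile.txt", "a") as f:
        f.write(f"{timestamp}, {message}\n")

# Typical usage around each phase:
# log("ETL Job Started")
# log("Extract phase Started")
# log("Extract phase Ended")
```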
- Sign Up for API Layer: To fetch the latest exchange rates, sign up for a free or premium account on API Layer.
- API Key: After signing up, navigate to your dashboard to find your API key.
- Configure API Key: Create a `config.json` file in the root directory of the project and add your API key like this:

```json
{ "EXCHANGE_API_KEY": "your_api_key_here" }
```

Replace `"your_api_key_here"` with your actual API key.
- Clone the repository:

```bash
git clone https://github.com/yourusername/IBM_Python_Project_For_Data_Engineering.git
cd IBM_Python_Project_For_Data_Engineering
```

- Configure the Python version with `pyenv`:

```bash
pyenv install 3.11.3
pyenv local 3.11.3
```

- Install dependencies using Poetry:

```bash
poetry install
```

- Activate the Poetry environment:

```bash
poetry shell
```

- Run the ETL pipeline:

```bash
python src/etl/etl.py
```

- To run the tests, execute the following command inside the Poetry environment:

```bash
pytest
```