This portfolio is a compilation of code I have created for data engineering work, projects, and examples.
-
This project focuses on designing and implementing a data warehouse for a solid waste management company operating in major cities across Brazil. The goal is to enable comprehensive reporting on waste collection metrics, including total waste collected by dimensions such as year, month, truck type, and city.
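A minimal sketch of what such a star schema could look like, created from Python with psycopg2; the table and column names below are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical star schema for waste-collection reporting (names are assumptions).
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS dim_date (
    date_key   INT PRIMARY KEY,   -- e.g. 20240131
    full_date  DATE NOT NULL,
    year       INT NOT NULL,
    month      INT NOT NULL
);

CREATE TABLE IF NOT EXISTS dim_truck (
    truck_key  SERIAL PRIMARY KEY,
    truck_type VARCHAR(50) NOT NULL
);

CREATE TABLE IF NOT EXISTS dim_station (
    station_key SERIAL PRIMARY KEY,
    city        VARCHAR(100) NOT NULL
);

CREATE TABLE IF NOT EXISTS fact_trips (
    trip_key        SERIAL PRIMARY KEY,
    date_key        INT REFERENCES dim_date(date_key),
    truck_key       INT REFERENCES dim_truck(truck_key),
    station_key     INT REFERENCES dim_station(station_key),
    waste_collected NUMERIC(10, 2)   -- tons collected on the trip
);
"""

if __name__ == "__main__":
    # Connection parameters are placeholders; adjust for your environment.
    with psycopg2.connect(dbname="warehouse", user="postgres", host="localhost") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
```

With dimensions like these, "total waste collected by year, month, truck type, and city" becomes a simple join-and-group-by over `fact_trips`.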
-
This project demonstrates how to denormalize tables in a data warehouse by creating materialized views to improve query performance for customer analytics. We use a star schema model with fact and dimension tables and build materialized views for optimized data retrieval. This approach showcases core data engineering skills like schema design, denormalization, and the creation of materialized views for high-performance reporting.
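A hedged sketch of one such materialized view, created from Python with psycopg2; the fact, dimension, and column names are assumptions rather than the project's real model:

```python
# Illustrative materialized view that precomputes customer analytics aggregates.
import psycopg2

MATERIALIZED_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_sales_by_customer AS
SELECT
    c.customer_id,
    c.country,
    d.year,
    SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_date d     ON f.date_key = d.date_key
GROUP BY c.customer_id, c.country, d.year;
"""

with psycopg2.connect(dbname="warehouse", user="postgres", host="localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(MATERIALIZED_VIEW)
        # Refresh on a schedule so reports read precomputed aggregates instead of re-joining.
        cur.execute("REFRESH MATERIALIZED VIEW mv_sales_by_customer;")
```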
-
This project focuses on conducting comprehensive data quality checks within a data warehousing environment. It demonstrates the use of a Python-based framework integrated with PostgreSQL to validate data integrity, ensuring accuracy and consistency.
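A minimal sketch of the idea, assuming psycopg2 against a PostgreSQL warehouse; the check queries and table names are illustrative, not the framework's actual rules:

```python
# Each check counts offending rows; a check passes when the count is zero.
import psycopg2

CHECKS = {
    "no_null_customer_keys": "SELECT COUNT(*) FROM fact_sales WHERE customer_key IS NULL",
    "no_negative_amounts":   "SELECT COUNT(*) FROM fact_sales WHERE amount < 0",
    "no_duplicate_orders": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM fact_sales GROUP BY order_id HAVING COUNT(*) > 1
        ) AS dupes
    """,
}

def run_checks(conn) -> bool:
    all_passed = True
    with conn.cursor() as cur:
        for name, query in CHECKS.items():
            cur.execute(query)
            bad_rows = cur.fetchone()[0]
            status = "PASS" if bad_rows == 0 else f"FAIL ({bad_rows} rows)"
            print(f"{name}: {status}")
            all_passed = all_passed and bad_rows == 0
    return all_passed

if __name__ == "__main__":
    with psycopg2.connect(dbname="warehouse", user="postgres", host="localhost") as conn:
        raise SystemExit(0 if run_checks(conn) else 1)
```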
-
This project focuses on designing a data warehouse for a cloud service provider using their historical billing data. The goal is to organize this data into a star schema that can efficiently support queries related to billing trends, customer insights, and more.
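As an illustration, a typical reporting query against such a billing star schema might look like this (database, table, and column names are assumptions):

```python
# Total amount billed per year and country, read straight off the star schema.
import psycopg2

QUERY = """
SELECT d.year, c.country, SUM(f.billed_amount) AS total_billed
FROM fact_billing f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_date d     ON f.date_key = d.date_key
GROUP BY d.year, c.country
ORDER BY d.year, total_billed DESC;
"""

with psycopg2.connect(dbname="billing_dw", user="postgres", host="localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for year, country, total in cur.fetchall():
            print(year, country, total)
```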
-
This repository contains a simple data engineering project that demonstrates how to build an ETL pipeline using Apache Airflow. The project processes road traffic data from various toll plazas, consolidating data from different formats into a single, transformed CSV file.
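A simplified sketch of such a DAG, assuming Airflow 2.4+ and pandas; the file paths, column names, and task logic are illustrative, not the repository's actual pipeline:

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Read the raw toll data; the path stands in for the staged input files.
    df = pd.read_csv("/tmp/vehicle-data.csv")
    df.to_csv("/tmp/extracted_data.csv", index=False)

def transform():
    # Example transformation: normalise the vehicle type column before loading.
    df = pd.read_csv("/tmp/extracted_data.csv")
    df["vehicle_type"] = df["vehicle_type"].str.upper()
    df.to_csv("/tmp/transformed_data.csv", index=False)

with DAG(
    dag_id="toll_data_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```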
-
Example of deploying a Flask web application with continuous deployment to Google App Engine (PaaS).
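A minimal sketch of the kind of Flask app involved; on App Engine it would typically be served by gunicorn as configured in `app.yaml` (the route and message are illustrative):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from App Engine!"

if __name__ == "__main__":
    # App Engine serves the app through its own entrypoint; this block is only for local testing.
    app.run(host="127.0.0.1", port=8080, debug=True)
```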
-
Simple demo of continuous integration using GitHub Actions to perform tests using AWS, Azure, or GCP.
-
This repo contains a simple example of testing Python code.
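A tiny illustration of the idea with pytest; the function under test is hypothetical:

```python
# test_math_utils.py
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(2, 3) == 5

def test_add_negative():
    assert add(-1, 1) == 0
```

Running `pytest` from the repo root discovers and executes both tests.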
-
AWS Services used: (Amazon S3, AWS IAM, QuickSight, AWS Glue, AWS Lambda, AWS Athena)
Project Goals:
- Data Ingestion — Build a mechanism to ingest data from different sources (a minimal sketch of the ingestion and reporting steps follows this list)
- ETL System — We are getting data in raw format; we transform it into the proper format for analysis
- Data Lake — We will be getting data from multiple sources, so we need a centralized repository to store it
- Scalability — As the size of our data increases, we need to make sure our system scales with it
- Cloud — We can’t process vast amounts of data on our local computer, so we use the cloud; in this case, AWS
- Reporting — Build a dashboard to answer the project's business questions
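A minimal sketch of the ingestion and reporting steps with boto3, assuming AWS credentials are already configured; the bucket, database, and table names are placeholders:

```python
import time

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1. Data ingestion: land a raw file in the S3 data lake.
s3.upload_file("local_data.csv", "my-datalake-raw", "ingest/local_data.csv")

# 2. Reporting: query the curated (Glue-catalogued) table with Athena.
response = athena.start_query_execution(
    QueryString="SELECT category, COUNT(*) AS cnt FROM curated_table GROUP BY category",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until Athena finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```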
-
This is an example of a Serverless AI Data Engineering Pipeline.
AWS Services used: (DynamoDB, Lambda, SQS, CloudWatch, Comprehend, S3)
A CloudWatch timer invokes an AWS Lambda function (the producer) that reads a DynamoDB table and sends each item as a message to an SQS queue. When a message arrives in the queue, a trigger invokes another AWS Lambda function (the consumer) that queries the Wikipedia API for the first result and the first two lines of its content, performs sentiment analysis with AWS Comprehend, and stores the results in an AWS S3 bucket.
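A minimal sketch of the consumer Lambda, assuming the producer puts the Wikipedia subject in the SQS message body and that the third-party `wikipedia` package is bundled with the function; the bucket name is a placeholder:

```python
import json

import boto3
import wikipedia  # third-party package, assumed to be packaged with the Lambda

comprehend = boto3.client("comprehend")
s3 = boto3.resource("s3")

BUCKET = "sentiment-results-bucket"  # placeholder

def lambda_handler(event, context):
    # The SQS trigger delivers a batch of records per invocation.
    for record in event["Records"]:
        subject = record["body"]
        # First two sentences of the top Wikipedia result for the subject.
        summary = wikipedia.summary(subject, sentences=2)

        # Sentiment analysis with AWS Comprehend.
        sentiment = comprehend.detect_sentiment(Text=summary, LanguageCode="en")

        result = {
            "subject": subject,
            "summary": summary,
            "sentiment": sentiment["Sentiment"],
            "scores": sentiment["SentimentScore"],
        }
        # Persist each result as a JSON object in S3.
        s3.Object(BUCKET, f"{subject}.json").put(Body=json.dumps(result))
    return {"statusCode": 200}
```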