Author: Dhyey Pandya
Email: dhyeyhp2@illinois.edu
- This data pipeline is designed to run in an isolated container, where it polls messages from a standard AWS SQS queue in batches of 10 messages per iteration.
- In each iteration we extract the message body, parse the JSON data into a dataframe, mask the IP and Device ID fields with a SHA256 hash, and store the updated dataframe in the user_logins table in the PostgreSQL database.
- After a batch of messages has been processed and stored in the database, we delete those messages from the queue and poll for the next batch.
- We stop the process after all the messages in the queue have been processed successfully.
- Below is the basic workflow diagram of the entire pipeline.
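- To make the workflow concrete, here is a minimal Python sketch of the polling loop; the queue URL, dummy credentials, connection settings, JSON field names, and user_logins column names are illustrative assumptions, not the project's exact code:

import json
import hashlib

import boto3
import pandas as pd
import psycopg2

QUEUE_URL = "http://localhost:4566/000000000000/login-queue"  # assumed LocalStack queue URL

sqs = boto3.client("sqs", endpoint_url="http://localhost:4566", region_name="us-east-1",
                   aws_access_key_id="test", aws_secret_access_key="test")  # dummy creds for LocalStack
conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="postgres")

def mask(value: str) -> str:
    # One-way SHA256 hash used to mask a PII field.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

while True:
    # Poll up to 10 messages per iteration.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    messages = resp.get("Messages", [])
    if not messages:
        break  # queue is drained; stop the pipeline

    # Extract each body, parse the JSON into a dataframe, and mask the PII fields.
    df = pd.DataFrame([json.loads(m["Body"]) for m in messages])
    df["masked_ip"] = df.pop("ip").apply(mask)
    df["masked_device_id"] = df.pop("device_id").apply(mask)

    # Store the batch in the user_logins table (assumed column names).
    with conn.cursor() as cur:
        for row in df.itertuples(index=False):
            cur.execute(
                "INSERT INTO user_logins (user_id, device_type, masked_ip,"
                " masked_device_id, locale, app_version)"
                " VALUES (%s, %s, %s, %s, %s, %s)",
                (row.user_id, row.device_type, row.masked_ip,
                 row.masked_device_id, row.locale, row.app_version))
    conn.commit()

    # Delete the batch from the queue only after it is safely stored.
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])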
- Check if you have docker-compose installed
$ docker-compose --version
- Run the LocalStack and PostgreSQL containers
$ docker-compose up --detach
- Build and run the Docker image for the data pipeline
$ docker build -t dhyeypandya/fetch_data_pipeline .
$ docker run dhyeypandya/fetch_data_pipeline
- Open psql in a new shell and check that the data has been populated in the table
$ psql -d postgres -U postgres -p 5432 -h localhost -W
postgres=# SELECT * FROM user_logins;
● How would you deploy this application in production?
- Package the application using Docker and deploy it over a cluster of machines or a cloud environment in production. Monitor it continuously to ensure it runs smoothly, and automate and schedule the pipeline to run at regular intervals using a cron job.
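- For example, a crontab entry could re-run the containerized pipeline every hour (the schedule here is an illustrative choice):
$ crontab -e
0 * * * * docker run --rm dhyeypandya/fetch_data_pipeline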
● What other components would you want to add to make this production ready?
- Adding more descriptive error handling and logging to help with debugging (see the sketch after this list).
- Setting up monitoring and alerting components to make sure the pipeline runs smoothly at all times with minimal downtime.
- Data backup and recovery to ensure that valuable data is not lost.
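- As a hypothetical sketch of the error handling and logging described above, a bounded retry with backoff around each batch (process_batch is an assumed per-batch function, not existing code):

import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("data_pipeline")

def run_batch_with_retry(process_batch, batch, max_attempts=3):
    # Retry a failing batch with exponential backoff before giving up.
    for attempt in range(1, max_attempts + 1):
        try:
            process_batch(batch)
            log.info("stored batch of %d messages", len(batch))
            return True
        except Exception:
            log.exception("batch failed (attempt %d/%d)", attempt, max_attempts)
            time.sleep(2 ** attempt)
    log.error("giving up on batch after %d attempts", max_attempts)
    return False  # the caller could route the batch to a dead-letter queue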
● How can this application scale with a growing dataset?
- Parallel processing or asynchronous handling of each message from the queue to scale with increasing data volume (see the sketch after this list).
- Distributed processing using big data tools such as Spark and HDFS.
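- A minimal sketch of the parallel option, assuming a hypothetical handle_message function that parses, masks, and stores one message:

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch_parallel(messages, handle_message, workers=8):
    # Handle each SQS message on its own worker thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(handle_message, m): m for m in messages}
        done, failed = [], []
        for fut in as_completed(futures):
            (done if fut.exception() is None else failed).append(futures[fut])
    # Delete only the successfully handled messages from the queue.
    return done, failed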
● How can PII be recovered later on?
- By replacing the one-way SHA256 hash with reversible encryption or a token/key-based masking technique, as sketched below.
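- A sketch of the reversible option using symmetric encryption (the cryptography package's Fernet); this is an assumed replacement for the current SHA256 masking, and the key must live in a secure secrets store, since anyone holding it can recover the PII:

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in production, load this from a secrets manager
f = Fernet(key)

masked_ip = f.encrypt(b"203.0.113.7").decode()        # store this value in user_logins
original_ip = f.decrypt(masked_ip.encode()).decode()  # recoverable later with the key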
● What are the assumptions you made?
- Device ID and IP fields are never null.
- Duplicates are allowed, assuming there is another downstream process responsible for data cleaning.
- This application can be further organized into a structured data pipeline using Airflow so that dependencies between different services can be managed properly.
- Changing the current SHA256 hashing to a reversible encryption technique to recover the masked PII (see the sketch under the PII question above).
- Data can also be stored as documents in a NoSQL database like MongoDB, so that the source JSON can be stored directly in its true form (see the sketch after this list).
- Data visualization tools such as Power BI, Tableau, etc. can be used on top of the stored data for analysis.
- Splunk can also be used to index the data stored in the document/NoSQL database for interactive and rich data visualization dashboards.
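- A sketch of the MongoDB option mentioned above, using pymongo; the database and collection names are illustrative:

import json

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["fetch"]["user_logins"]

def store_raw(message_body: str) -> None:
    # Insert the unflattened source JSON exactly as received from the queue.
    collection.insert_one(json.loads(message_body))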