
ETL Pipeline with Airflow, MongoDB and PostgreSQL

About • Pipeline • Scenario • Prerequisites • Set-up • Installation • Airflow Interface • Pipeline Task by Task • Learning Resources • Encountered difficulties • Next Steps


About

Educational project on how to build an ETL (Extract, Transform, Load) data pipeline, orchestrated with Airflow.

The data is extracted from Twitter, then transformed/processed with Python and loaded/stored in MongoDB and PostgreSQL.

MongoDB is used as a database in which the extracted JSON documents are stored. Before each load, the database is cleared.

PostgreSQL is used as a database with two tables: users and tweets. Data is not cleared between loads.

Pipeline

Scenario

As Twitter users, we face the problem of having to manually search for tweets related to a specific domain or user. A dashboard showing one or more tweets related to one or more domains would be a significant improvement.

  • The pipeline runs every hour, because of the Twitter API rate limit, to extract the latest tweets (see the scheduling sketch below).
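A minimal sketch of an hourly Airflow DAG definition. The DAG id, start date, and callable below are illustrative, not taken from the repository.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_latest_tweets():
    # Placeholder for the extract/transform logic described in the task-by-task section.
    pass


with DAG(
    dag_id="twitter_etl_hourly",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",      # one run per hour to stay within the API limit
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_transform",
        python_callable=extract_latest_tweets,
    )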

Prerequisites


Set-up

Twitter

  • Create a Twitter Developer account.

MongoDB

  • Register to MongoDB Atlas.
  • Create a Database etl and a Collection News.
  • Ensure that network access is enabled.
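As a quick sanity check for the set-up above, you can connect with pymongo; the connection string is a placeholder for your own Atlas URI.

from pymongo import MongoClient

# Placeholder Atlas URI; replace user, password and cluster with your own values.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
collection = client["etl"]["News"]    # database and collection created above
print(collection.estimated_document_count())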

PostgreSQL

  • Create a database named twitterETL on your local PostgreSQL instance (a sketch follows below).
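If you prefer to create the database from code rather than pgAdmin or psql, a sketch using psycopg2 (host and credentials are placeholders for your local instance):

import psycopg2

conn = psycopg2.connect(host="localhost", user="postgres", password="<password>", dbname="postgres")
conn.autocommit = True                 # CREATE DATABASE cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute('CREATE DATABASE "twitterETL";')   # quoted to preserve the mixed-case name
conn.close()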

Airflow

  • Set up the required Variables and Connections.

  • Schema: this connection field holds the PostgreSQL database name (see the sketch below).
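For reference, this is roughly how the DAG code resolves such a connection through a PostgresHook. The connection id postgres_localhost is an assumption, not necessarily the one used in this repository; the Schema field of the connection is what ends up as the database name.

from airflow.providers.postgres.hooks.postgres import PostgresHook

# Hypothetical connection id; the connection's Schema field should be twitterETL.
hook = PostgresHook(postgres_conn_id="postgres_localhost")
print(hook.get_records("SELECT 1;"))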

Installation

Build the extended Airflow Docker image with:

docker build . --tag extending_airflow:2.5.1

Run Airflow:

docker-compose up -d

After everything has started, you can check that the containers are up and healthy with:

docker ps

Note: it might take up to 30 seconds for the containers to report a healthy status after starting.

Airflow Interface

You can now access the Airflow web interface at http://localhost:8080/. If you have not changed them in the docker-compose.yml file, the default username and password are both airflow.

After signing in, the Airflow home page is the DAGs list page. Here you will see all DAGs.

Note: if you update the code in the Python DAG script, the Airflow DAGs page has to be refreshed.

You can access each DAG:

  • PostgreSQL

  • MongoDB

Pipeline Task by Task

Task create_postgres_table

Creates the PostgreSQL tables users and tweets if they do not exist (a sketch follows below).
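A hedged sketch of what this task might execute; the column set follows the field lists further below, and the connection id and exact column types are assumptions.

from airflow.providers.postgres.hooks.postgres import PostgresHook


def create_postgres_table():
    hook = PostgresHook(postgres_conn_id="postgres_localhost")   # hypothetical conn id
    hook.run("""
        CREATE TABLE IF NOT EXISTS users (
            username TEXT,
            followers_count INTEGER,
            following_count INTEGER,
            created_at TIMESTAMP,
            description TEXT
        );
        CREATE TABLE IF NOT EXISTS tweets (
            text TEXT,
            created_at TIMESTAMP
        );
    """)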

Task clear

Clears the MongoDB collection (see the sketch below).
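A minimal sketch of the clearing step with pymongo, assuming the etl database and News collection from the set-up above (the connection string is a placeholder).

from pymongo import MongoClient


def clear_collection():
    client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
    client["etl"]["News"].delete_many({})   # drop every document before the next load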

Task extract_transform

Extracts tweets from the Twitter API and transforms them into a JSON file, keeping only the needed fields; a sketch of this step follows the field lists below.

TWEET_INFO

  • text
  • created_at

USER_INFO

  • username
  • followers_count
  • following_count
  • created_at
  • description
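A hedged sketch of this step against the Twitter API v2 recent-search endpoint; the repository may use a different client or endpoint, and the query, output path, and bearer token below are illustrative.

import json

import requests

BEARER_TOKEN = "<your bearer token>"


def extract_transform(query="data engineering", out_path="/tmp/tweets.json"):
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "query": query,
            "tweet.fields": "created_at",
            "expansions": "author_id",
            "user.fields": "username,public_metrics,created_at,description",
        },
    )
    resp.raise_for_status()
    payload = resp.json()

    # Keep only the TWEET_INFO and USER_INFO fields listed above.
    tweets = [
        {"text": t["text"], "created_at": t["created_at"]}
        for t in payload.get("data", [])
    ]
    users = [
        {
            "username": u["username"],
            "followers_count": u["public_metrics"]["followers_count"],
            "following_count": u["public_metrics"]["following_count"],
            "created_at": u["created_at"],
            "description": u["description"],
        }
        for u in payload.get("includes", {}).get("users", [])
    ]

    with open(out_path, "w") as f:
        json.dump({"tweets": tweets, "users": users}, f)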

Task transform

Used in the PostgreSQL DAG. Removes the ' character from the text and description fields (a sketch follows below).

Note: ' delimits string literals in PostgreSQL, so unescaped quotes would break the generated SQL.
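A minimal sketch of the quote-stripping step, assuming the field names listed above; parameterized INSERT statements would be an alternative way to handle quotes safely.

def strip_quotes(record):
    """Remove single quotes from the text/description fields of one record."""
    for field in ("text", "description"):
        if isinstance(record.get(field), str):
            record[field] = record[field].replace("'", "")
    return record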

Task load

Loads the processed data into the following targets (see the sketch after this list):

  • MongoDB

  • Postgres:

    • Users

    • Tweets
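A hedged sketch of the load step for both targets; the connection id, Atlas URI, and file path are illustrative, and the INSERT statements assume the column order from the create_postgres_table sketch above.

import json

from airflow.providers.postgres.hooks.postgres import PostgresHook
from pymongo import MongoClient


def load(path="/tmp/tweets.json"):
    with open(path) as f:
        data = json.load(f)

    # MongoDB: insert the tweet documents into the News collection.
    client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
    if data["tweets"]:
        client["etl"]["News"].insert_many(data["tweets"])

    # PostgreSQL: append rows to the users and tweets tables.
    hook = PostgresHook(postgres_conn_id="postgres_localhost")   # hypothetical conn id
    for u in data["users"]:
        hook.run(
            "INSERT INTO users VALUES (%s, %s, %s, %s, %s);",
            parameters=(u["username"], u["followers_count"],
                        u["following_count"], u["created_at"], u["description"]),
        )
    for t in data["tweets"]:
        hook.run("INSERT INTO tweets VALUES (%s, %s);",
                 parameters=(t["text"], t["created_at"]))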

Learning Resources

Encountered difficulties

  • Airflow installation and setup.
  • Scraping data from the Amazon website, as the first idea was to track the price evolution of a product.
