This repository contains the DAG code used in the "Orchestrate pgvector operations with Apache Airflow" tutorial.

The DAG in this repository uses the Airflow pgvector provider package, `apache-airflow-providers-pgvector`.
This section explains how to run this repository with Airflow. Note that you will need to copy the contents of the `.env_example` file to a newly created `.env` file.
The Postgres connection defined in the `.env_example` file will connect to the secondary Postgres database that is created when you run `astro dev start`. This secondary Postgres database already has `pgvector` installed.
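For example, once the project is running, a task can query that database through the connection from your `.env` file. The snippet below is a minimal sketch: the connection ID `postgres_default` is an assumption, so use whichever ID your `.env` file actually defines.

```python
# Minimal sketch: query the pgvector-enabled database from inside an
# Airflow task. The connection ID "postgres_default" is an assumption --
# replace it with the ID defined in your .env file.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="postgres_default")
# Returns [('vector',)] if the pgvector extension is installed.
records = hook.get_records(
    "SELECT extname FROM pg_extension WHERE extname = 'vector';"
)
print(records)
```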
You will need to provide an OpenAI API key of at least tier 1 in the `.env` file to use OpenAI for embeddings. If you do not have an OpenAI API key, you can change the code in the `create_embeddings` function in the `query_book_vectors.py` file to use a different embedding method (note that you will likely also need to adjust the `MODEL_VECTOR_LENGTH` value if you do this).
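As a rough illustration, a local SentenceTransformers model could stand in for the OpenAI call. The function name mirrors the tutorial, but the signature and surrounding code are assumptions; adapt it to the actual code in `query_book_vectors.py`.

```python
# Sketch of an alternative create_embeddings using a local
# SentenceTransformers model instead of the OpenAI API. The function
# signature is an assumption -- match it to query_book_vectors.py.
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional vectors, so set
# MODEL_VECTOR_LENGTH = 384 when using this model.
MODEL_VECTOR_LENGTH = 384
_model = SentenceTransformer("all-MiniLM-L6-v2")

def create_embeddings(text: str) -> list[float]:
    """Embed a piece of text as a list of floats for pgvector."""
    return _model.encode(text).tolist()
```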
Download the Astro CLI to run Airflow locally in Docker. `astro` is the only package you will need to install locally.
- Run `git clone https://github.com/astronomer/airflow-pgvector-tutorial.git` on your computer to create a local clone of this repository.
- Install the Astro CLI by following the steps in the Astro CLI documentation. Docker Desktop/Docker Engine is a prerequisite, but you don't need in-depth Docker knowledge to run Airflow with the Astro CLI.
- Run `astro dev start` in your cloned repository.
- After your Astro project has started, view the Airflow UI at `localhost:8080`.
In this project, `astro dev start` spins up 5 Docker containers:

- The Airflow webserver, which runs the Airflow UI and can be accessed at `http://localhost:8080/`.
- The Airflow scheduler, which is responsible for monitoring and triggering tasks.
- The Airflow triggerer, which is an Airflow component used to run deferrable operators.
- The Airflow metadata database, which is a Postgres database that runs on port 5432.
- A second local Postgres database with pgvector installed, which runs on port 5433.
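To confirm the second database is reachable from your machine, you can connect to it on port 5433 and cast a literal to the `vector` type, which only succeeds if pgvector is installed. This is a minimal sketch: the `postgres`/`postgres` credentials and database name are assumptions based on common local defaults, so adjust them to your setup.

```python
# Sketch: verify the pgvector database on port 5433 from the host machine.
# User, password, and database name are assumptions -- check your project
# configuration if the connection fails.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5433,
    user="postgres",
    password="postgres",
    dbname="postgres",
)
with conn, conn.cursor() as cur:
    # Casting to the vector type only works if pgvector is installed.
    cur.execute("SELECT '[1,2,3]'::vector;")
    print(cur.fetchone())  # -> ('[1,2,3]',)
conn.close()
```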