## Lesson 5: Scheduling and Dag Parameters

In the previous lesson, you triggered the dags manually. In this lesson, you will learn how to schedule the fetch_data dag so that it automatically runs each hour (time-based scheduling), and how to schedule the query_data dag so that it runs after the embeddings are loaded to the database (data-aware scheduling).

### 5.1. Link to Airflow UI

Run the following cell to get the link to the Airflow UI. If asked for a username and password, type `airflow` for both.

In [None]:
import os
airflow_ui = os.environ.get('DLAI_LOCAL_URL').format(port=8080)
airflow_ui #username:airflow password:airflow

<div style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ⏳ <b>Note <code>Airflow UI</code>:</b> 

<p>Changes to dags may take up to 30 seconds to show up in the Airflow UI in this environment! </p>
<p>In the Airflow UI, you may see the error "504 Gateway Timeout". This can happen after 2 hours, or after 25 minutes of inactivity (if there's no activity for 20 minutes, the Jupyter kernel stops, and if there's no kernel for 5 minutes, the Jupyter notebook stops and the resources are released). In this case, refresh the notebook, rerun the cell that outputs the link to the Airflow UI, and then use the link to open the Airflow UI. </p>
</div>

### 5.2. Ensure Lesson 4 dags are in the Airflow UI

Since the lab environment resets after 120 minutes, depending on when you start this lesson, you may not see the dags `fetch_data` and `query_data` from the previous lesson. To ensure the Airflow UI has the dags of the previous lesson, run the following two cells. After 30 seconds, check the Airflow UI. You should see the two dags, which you can then trigger manually.

In [None]:
%%writefile ../../dags/fetch_data.py 
from airflow.sdk import chain, dag, task 


COLLECTION_NAME = "Books" 
BOOK_DESCRIPTION_FOLDER = "/home/jovyan/include/data"
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"


@dag
def fetch_data():

    @task
    def create_collection_if_not_exists() -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        existing_collections = client.collections.list_all()
        existing_collection_names = existing_collections.keys()

        if COLLECTION_NAME not in existing_collection_names:
            print(f"Collection {COLLECTION_NAME} does not exist yet. Creating it...")
            collection = client.collections.create(name=COLLECTION_NAME)
            print(f"Collection {COLLECTION_NAME} created successfully.")
            print(f"Collection details: {collection}")

    _create_collection_if_not_exists = create_collection_if_not_exists()

    @task
    def list_book_description_files() -> list:
        import os
        
        book_description_files = [
            f for f in os.listdir(BOOK_DESCRIPTION_FOLDER)
            if f.endswith('.txt')
        ]
        return book_description_files

    _list_book_description_files = list_book_description_files()

    @task
    def transform_book_description_files(book_description_files: list) -> list:
        import os

        list_of_book_data = []
        
        for book_description_file in book_description_files:
            with open(
                os.path.join(BOOK_DESCRIPTION_FOLDER, book_description_file), "r"
            ) as f:
                book_descriptions = f.readlines()
            
            titles = [
                book_description.split(":::")[1].strip()
                for book_description in book_descriptions
            ]
            authors = [
                book_description.split(":::")[2].strip()
                for book_description in book_descriptions
            ]
            book_description_text = [
                book_description.split(":::")[3].strip()
                for book_description in book_descriptions
            ]
            
            book_descriptions = [
                {
                    "title": title,
                    "author": author,
                    "description": description,
                }
                for title, author, description in zip(
                    titles, authors, book_description_text
                )
            ]
        
            list_of_book_data.append(book_descriptions)

        

        return list_of_book_data

    _transform_book_description_files = transform_book_description_files(
        book_description_files=_list_book_description_files
    )

    @task
    def create_vector_embeddings(list_of_book_data: list) -> list:
        from fastembed import TextEmbedding

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
        
        list_of_description_embeddings = []
        
        for book_data in list_of_book_data:
            book_descriptions = [book["description"] for book in book_data]
            description_embeddings = [
                list(map(float, next(embedding_model.embed([desc])))) for desc in book_descriptions
            ]
        
            list_of_description_embeddings.append(description_embeddings)

        return list_of_description_embeddings

    _create_vector_embeddings = create_vector_embeddings(
        list_of_book_data=_transform_book_description_files
    )

    @task
    def load_embeddings_to_vector_db(
        list_of_book_data: list, list_of_description_embeddings: list
    ) -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from weaviate.classes.data import DataObject

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()
        collection = client.collections.get(COLLECTION_NAME)

        for book_data_list, emb_list in zip(list_of_book_data, list_of_description_embeddings):
            items = []
            
            for book_data, emb in zip(book_data_list, emb_list):
                item = DataObject(
                    properties={
                        "title": book_data["title"],
                        "author": book_data["author"],
                        "description": book_data["description"],
                    },
                    vector=emb
                )
                items.append(item)
            
            collection.data.insert_many(items)

    _load_embeddings_to_vector_db = load_embeddings_to_vector_db(
        list_of_book_data=_transform_book_description_files,
        list_of_description_embeddings=_create_vector_embeddings,
    )

    chain(
        _create_collection_if_not_exists,
        _load_embeddings_to_vector_db
    )
    


fetch_data()

In [None]:
%%writefile ../../dags/query_data.py 
from airflow.sdk import dag, task  

COLLECTION_NAME = "Books"  
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"

@dag
def query_data():

    @task
    def search_vector_db_for_a_book(query_str: str) -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from fastembed import TextEmbedding

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
        collection = client.collections.get(COLLECTION_NAME)
        
        query_emb = list(embedding_model.embed([query_str]))[0]
        
        results = collection.query.near_vector(
            near_vector=query_emb,
            limit=1,
        )
        for result in results.objects:
            print(f"You should read: {result.properties['title']} by {result.properties['author']}")
            print("Description:")
            print(result.properties["description"])

    search_vector_db_for_a_book(query_str="A philosophical book")


query_data()

### 5.3. Time-based scheduling: update fetch_data 

To make the dag run on a schedule, you can update the `@dag` decorator by specifying the `start_date` and `schedule` parameters as shown in the following cell. The rest of the dag is the same for now.
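To build intuition for what `schedule="@hourly"` implies, here is a minimal standard-library sketch that lists the top-of-the-hour timestamps following a given moment. `@hourly` is the preset for the cron expression `0 * * * *`. The helper name `next_hourly_runs` is hypothetical, and the sketch only illustrates the timing; it is not how Airflow's scheduler is implemented.

```python
from datetime import datetime, timedelta, timezone


def next_hourly_runs(after: datetime, count: int) -> list[datetime]:
    """Return the next `count` top-of-the-hour timestamps strictly after `after`."""
    # Truncate to the start of the current hour, then step forward hour by hour.
    hour_start = after.replace(minute=0, second=0, microsecond=0)
    return [hour_start + timedelta(hours=i) for i in range(1, count + 1)]


# Illustrative moment only; any datetime works.
now = datetime(2025, 4, 1, 9, 30, tzinfo=timezone.utc)
for run_time in next_hourly_runs(now, 3):
    print(run_time.isoformat())
```

With the example moment of 09:30 UTC, the listed times are 10:00, 11:00, and 12:00 UTC on the same day.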

Run the following cell to update the dag, and then check the Airflow UI. You don't need to trigger anything in the Airflow UI. 

In [None]:
%%writefile ../../dags/fetch_data.py 
from airflow.sdk import chain, dag, task, Asset
from pendulum import datetime


COLLECTION_NAME = "Books"
BOOK_DESCRIPTION_FOLDER = "/home/jovyan/include/data"
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"


@dag(
    start_date=datetime(2025, 4, 1),
    schedule="@hourly"    
) #new!
def fetch_data():

    @task
    def create_collection_if_not_exists() -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        existing_collections = client.collections.list_all()
        existing_collection_names = existing_collections.keys()

        if COLLECTION_NAME not in existing_collection_names:
            print(f"Collection {COLLECTION_NAME} does not exist yet. Creating it...")
            collection = client.collections.create(name=COLLECTION_NAME)
            print(f"Collection {COLLECTION_NAME} created successfully.")
            print(f"Collection details: {collection}")

    _create_collection_if_not_exists = create_collection_if_not_exists()

    @task
    def list_book_description_files() -> list:
        import os

        book_description_files = [
            f for f in os.listdir(BOOK_DESCRIPTION_FOLDER) if f.endswith(".txt")
        ]
        return book_description_files

    _list_book_description_files = list_book_description_files()

    @task
    def transform_book_description_files(book_description_files: list) -> list:
        import os

        list_of_book_data = []

        for book_description_file in book_description_files:
            with open(
                os.path.join(BOOK_DESCRIPTION_FOLDER, book_description_file), "r"
            ) as f:
                book_descriptions = f.readlines()

            titles = [
                book_description.split(":::")[1].strip()
                for book_description in book_descriptions
            ]
            authors = [
                book_description.split(":::")[2].strip()
                for book_description in book_descriptions
            ]
            book_description_text = [
                book_description.split(":::")[3].strip()
                for book_description in book_descriptions
            ]

            book_descriptions = [
                {
                    "title": title,
                    "author": author,
                    "description": description,
                }
                for title, author, description in zip(
                    titles, authors, book_description_text
                )
            ]

            list_of_book_data.append(book_descriptions)

        return list_of_book_data

    _transform_book_description_files = transform_book_description_files(
        book_description_files=_list_book_description_files
    )

    @task
    def create_vector_embeddings(list_of_book_data: list) -> list:
        from fastembed import TextEmbedding

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)

        list_of_description_embeddings = []

        for book_data in list_of_book_data:
            book_descriptions = [book["description"] for book in book_data]
            description_embeddings = [
                list(map(float, next(embedding_model.embed([desc]))))
                for desc in book_descriptions
            ]

            list_of_description_embeddings.append(description_embeddings)

        return list_of_description_embeddings

    _create_vector_embeddings = create_vector_embeddings(
        list_of_book_data=_transform_book_description_files
    )

    @task
    def load_embeddings_to_vector_db(
        list_of_book_data: list, list_of_description_embeddings: list
    ) -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from weaviate.classes.data import DataObject

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()
        collection = client.collections.get(COLLECTION_NAME)

        for book_data_list, emb_list in zip(
            list_of_book_data, list_of_description_embeddings
        ):
            items = []

            for book_data, emb in zip(book_data_list, emb_list):
                item = DataObject(
                    properties={
                        "title": book_data["title"],
                        "author": book_data["author"],
                        "description": book_data["description"],
                    },
                    vector=emb,
                )
                items.append(item)

            collection.data.insert_many(items)

    _load_embeddings_to_vector_db = load_embeddings_to_vector_db(
        list_of_book_data=_transform_book_description_files,
        list_of_description_embeddings=_create_vector_embeddings,
    )

    chain(_create_collection_if_not_exists, _load_embeddings_to_vector_db)


fetch_data()

### 5.4. Data-aware scheduling for query_data 

Here are the updates you will make to both dags to make `query_data` data-aware.

<img src="updates2.png" width="400">

1. You will first update the decorator of the last task, `load_embeddings_to_vector_db`, of the `fetch_data` dag:
   ```
   @task(
       outlets=[Asset("my_book_vector_data")]
   )
   ```
   This means that when the task completes successfully, it emits an AssetEvent informing the Asset object that the `Books` collection has been updated.

2. You will then update the `schedule` parameter of the `query_data` dag:
   ```
   @dag(
       schedule=[Asset("my_book_vector_data")]
   )
   ```
   This means that the second dag will be triggered whenever there is an update to the `Books` collection in the Weaviate database.
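The two changes above form a producer/consumer relationship that is keyed only by the asset name. The following toy model, written in plain Python, sketches that relationship. The class `ToyScheduler` and its methods are hypothetical and greatly simplified; they are not part of Airflow. Note that the event is tied to the producing task succeeding: Airflow does not inspect the Weaviate collection itself.

```python
from collections import defaultdict


class ToyScheduler:
    """Toy model: maps asset names to the dags scheduled on them."""

    def __init__(self) -> None:
        self.consumers: dict[str, list[str]] = defaultdict(list)
        self.triggered: list[str] = []

    def schedule_on(self, dag_id: str, asset_name: str) -> None:
        # Analogous to @dag(schedule=[Asset(asset_name)]).
        self.consumers[asset_name].append(dag_id)

    def task_succeeded(self, outlets: list[str]) -> None:
        # Analogous to a task with outlets=[Asset(...)] finishing successfully:
        # each outlet emits one event, and each event triggers its consumers.
        for asset_name in outlets:
            for dag_id in self.consumers[asset_name]:
                self.triggered.append(dag_id)


scheduler = ToyScheduler()
scheduler.schedule_on("query_data", "my_book_vector_data")
scheduler.task_succeeded(outlets=["my_book_vector_data"])
print(scheduler.triggered)  # ['query_data']
```

Because the match is on the name alone, the string passed to `Asset(...)` must be identical in both dags.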

**5.4.1. Update the last task of the dag: fetch_data**

Run the following cell and then check the Airflow UI. 

In [None]:
%%writefile ../../dags/fetch_data.py 
from airflow.sdk import chain, dag, task, Asset
from pendulum import datetime


COLLECTION_NAME = "Books"
BOOK_DESCRIPTION_FOLDER = "/home/jovyan/include/data"
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"


@dag(
    start_date=datetime(2025, 4, 1),
    schedule="@hourly"    
)
def fetch_data():

    @task
    def create_collection_if_not_exists() -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        existing_collections = client.collections.list_all()
        existing_collection_names = existing_collections.keys()

        if COLLECTION_NAME not in existing_collection_names:
            print(f"Collection {COLLECTION_NAME} does not exist yet. Creating it...")
            collection = client.collections.create(name=COLLECTION_NAME)
            print(f"Collection {COLLECTION_NAME} created successfully.")
            print(f"Collection details: {collection}")

    _create_collection_if_not_exists = create_collection_if_not_exists()

    @task
    def list_book_description_files() -> list:
        import os

        book_description_files = [
            f for f in os.listdir(BOOK_DESCRIPTION_FOLDER) if f.endswith(".txt")
        ]
        return book_description_files

    _list_book_description_files = list_book_description_files()

    @task
    def transform_book_description_files(book_description_files: list) -> list:
        import os

        list_of_book_data = []

        for book_description_file in book_description_files:
            with open(
                os.path.join(BOOK_DESCRIPTION_FOLDER, book_description_file), "r"
            ) as f:
                book_descriptions = f.readlines()

            titles = [
                book_description.split(":::")[1].strip()
                for book_description in book_descriptions
            ]
            authors = [
                book_description.split(":::")[2].strip()
                for book_description in book_descriptions
            ]
            book_description_text = [
                book_description.split(":::")[3].strip()
                for book_description in book_descriptions
            ]

            book_descriptions = [
                {
                    "title": title,
                    "author": author,
                    "description": description,
                }
                for title, author, description in zip(
                    titles, authors, book_description_text
                )
            ]

            list_of_book_data.append(book_descriptions)

        return list_of_book_data

    _transform_book_description_files = transform_book_description_files(
        book_description_files=_list_book_description_files
    )

    @task
    def create_vector_embeddings(list_of_book_data: list) -> list:
        from fastembed import TextEmbedding

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)

        list_of_description_embeddings = []

        for book_data in list_of_book_data:
            book_descriptions = [book["description"] for book in book_data]
            description_embeddings = [
                list(map(float, next(embedding_model.embed([desc]))))
                for desc in book_descriptions
            ]

            list_of_description_embeddings.append(description_embeddings)

        return list_of_description_embeddings

    _create_vector_embeddings = create_vector_embeddings(
        list_of_book_data=_transform_book_description_files
    )

    @task(
        outlets=[Asset("my_book_vector_data")]
    ) #new!
    def load_embeddings_to_vector_db(
        list_of_book_data: list, list_of_description_embeddings: list
    ) -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from weaviate.classes.data import DataObject

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()
        collection = client.collections.get(COLLECTION_NAME)

        for book_data_list, emb_list in zip(
            list_of_book_data, list_of_description_embeddings
        ):
            items = []

            for book_data, emb in zip(book_data_list, emb_list):
                item = DataObject(
                    properties={
                        "title": book_data["title"],
                        "author": book_data["author"],
                        "description": book_data["description"],
                    },
                    vector=emb,
                )
                items.append(item)

            collection.data.insert_many(items)

    _load_embeddings_to_vector_db = load_embeddings_to_vector_db(
        list_of_book_data=_transform_book_description_files,
        list_of_description_embeddings=_create_vector_embeddings,
    )

    chain(_create_collection_if_not_exists, _load_embeddings_to_vector_db)


fetch_data()

**5.4.2. Update query_data**

Run the following cell and check the Airflow UI.

**Note**: If you try the same query as in the video, you might get a different book recommendation (in the video, some additional book descriptions were given to the embedding model when creating the vector database). 

In [None]:
%%writefile ../../dags/query_data.py

from airflow.sdk import dag, task, Asset

COLLECTION_NAME = "Books"  
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"


@dag(
    schedule=[Asset("my_book_vector_data")]
) #new!
def query_data():

    @task
    def search_vector_db_for_a_book(query_str: str):
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from fastembed import TextEmbedding

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
        collection = client.collections.get(COLLECTION_NAME)
        
        query_emb = list(embedding_model.embed([query_str]))[0]
        
        results = collection.query.near_vector(
            near_vector=query_emb,
            limit=1,
        )
        for result in results.objects:
            print(f"You should read: {result.properties['title']} by {result.properties['author']}")
            print("Description:")
            print(result.properties["description"])

    search_vector_db_for_a_book(query_str="A philosophical book")


query_data()

### 5.5. Adding `params` parameter to the query_data dag

Instead of hardcoding the query in the dag, you can let the user specify it by passing `params` to the `@dag` decorator. The task then reads the value from the Airflow context. Run the following cell to update `query_data.py`; it might take around 30 seconds for the dag to be updated in the Airflow UI.
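As a simplified mental model of this mechanism, the sketch below treats `params` as a dictionary of defaults that values supplied for a specific run can override. The helper `build_context` is hypothetical and only stands in for what Airflow does when it assembles the context for a task.

```python
from typing import Optional

# Defaults as they would be declared in the @dag decorator.
DEFAULT_PARAMS = {"query_str": "A philosophical book"}


def build_context(run_conf: Optional[dict] = None) -> dict:
    """Hypothetical helper: overlay run-specific values on the dag-level defaults."""
    return {"params": {**DEFAULT_PARAMS, **(run_conf or {})}}


def search_vector_db_for_a_book(**context) -> str:
    # Same lookup as in the task: read the query from the context.
    return context["params"]["query_str"]


# No override: the default declared at the dag level is used.
print(search_vector_db_for_a_book(**build_context()))
# Override supplied for one run (the query text here is only an example).
print(search_vector_db_for_a_book(**build_context({"query_str": "A book about dreams"})))
```

The first call prints the default query, the second prints the overriding one.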

In [None]:
%%writefile ../../dags/query_data.py

from airflow.sdk import dag, task, Asset

COLLECTION_NAME = "Books"  
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"


@dag(
    schedule=[Asset("my_book_vector_data")],
    params={
        "query_str":"A philosophical book"
    }
) #new!
def query_data():

    @task
    def search_vector_db_for_a_book(**context):
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from fastembed import TextEmbedding

        query_str=context["params"]["query_str"] #new!

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
        collection = client.collections.get(COLLECTION_NAME)
        
        query_emb = list(embedding_model.embed([query_str]))[0]
        
        results = collection.query.near_vector(
            near_vector=query_emb,
            limit=1,
        )
        for result in results.objects:
            print(f"You should read: {result.properties['title']} by {result.properties['author']}")
            print("Description:")
            print(result.properties["description"])

    search_vector_db_for_a_book()


query_data()

**Optional Part**: Add your own book files

Feel free to add your own book description and then trigger the fetch_data dag.
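Each line of a description file follows the layout `[Integer Index] ::: [Book Title] ([Release year]) ::: [Author] ::: [Description]`, which the `transform_book_description_files` task splits on `:::`. Before adding your own file, you may want to check a line with a small parser such as the sketch below. The function `parse_book_line` is a hypothetical helper and not part of the dag; unlike the task, it raises an error for malformed lines.

```python
def parse_book_line(line: str) -> dict:
    """Split one '[Index] ::: [Title] ::: [Author] ::: [Description]' line into fields."""
    parts = [part.strip() for part in line.split(":::")]
    if len(parts) != 4:
        raise ValueError(f"Expected 4 fields separated by ':::', got {len(parts)}")
    # The integer index is discarded, as in the dag's transform task.
    _, title, author, description = parts
    return {"title": title, "author": author, "description": description}


sample = (
    "0 ::: The Idea of the World (2019) ::: Bernardo Kastrup ::: "
    "An ontological thesis arguing for the primacy of mind over matter."
)
print(parse_book_line(sample))
```

A line with a missing separator, or a description that itself contains `:::`, fails the field-count check here.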

In [None]:
# Add your own book description file
# Format 
# [Integer Index] ::: [Book Title] ([Release year]) ::: [Author] ::: [Description]

my_book_description = """0 ::: The Idea of the World (2019) ::: Bernardo Kastrup ::: An ontological thesis arguing for the primacy of mind over matter.
1 ::: Exploring the World of Lucid Dreaming (1990) ::: Stephen LaBerge ::: A practical guide to learning and enjoying lucid dreams.
"""

my_book_description_file_name = "my_descs_1.txt"

# Write to file
with open(f"../../include/data/{my_book_description_file_name}", 'w') as f:
    f.write(my_book_description)

In [None]:
# ## Remove a book description file
 
# import os

# my_book_description_file_name = "my_descs_1.txt"

# file_path = f"../../include/data/{my_book_description_file_name}"

# # Remove the file
# if os.path.exists(file_path):
#     os.remove(file_path)
# else:
#     print(f"File not found: {file_path}")

### 5.6. Resources

- [Schedule DAGs in Apache Airflow®](https://www.astronomer.io/docs/learn/scheduling-in-airflow/): Learn all the different ways of scheduling Airflow dags.
- [DAG-level parameters in Airflow](https://www.astronomer.io/docs/learn/airflow-dag-parameters/): A comprehensive list of dag parameters in Airflow.
- [Assets and data-aware scheduling in Airflow](https://www.astronomer.io/docs/learn/airflow-datasets/): Learn how to create advanced data-aware schedules using `Asset`s in Airflow.
- [Access the Apache Airflow context](https://www.astronomer.io/docs/learn/airflow-context/): Learn how to interact with the Airflow context dictionary retrieved with `**context`.
- [Create and use params in Airflow](https://www.astronomer.io/docs/learn/airflow-params/): Learn how to create advanced `params` dictionaries for your Airflow dags.
- [Airflow REST API - Create Asset Event](https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/create_asset_event): You can update Assets from outside of Airflow using the Airflow REST API.

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

</div>