## Lesson 4: Turning Your Notebook Into a Pipeline

In this lesson, you will turn the RAG prototype from Lesson 2 into two dags that you trigger manually. The first dag fetches the book descriptions, computes vector embeddings of the description text, and loads the embeddings into a vector database. The second dag queries the vector database.

### 4.1. Link to Airflow UI

Run the following cell to get the link to the Airflow UI. If asked for a username and password, enter `airflow` for both.

In [None]:
import os
airflow_ui = os.environ['DLAI_LOCAL_URL'].format(port=8080)
airflow_ui  # username: airflow, password: airflow

You won't use the two dags from the previous lesson (`my_first_dag` and `my_second_dag`). Instead, you will create two new dags, `fetch_data` and `query_data`. Because the lab environment resets after 120 minutes, you may not see the dags from the previous lesson, depending on when you start this one. If that is the case, don't worry: you will not need them in this lesson.

<div style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ⏳ <b>Note <code>Airflow UI</code>:</b> 

<p>Changes to dags may take up to 30 seconds to show up in the Airflow UI in this environment! </p>
<p>If the Airflow UI shows the error "504 Gateway Timeout", this can happen after 2 hours, or after 25 minutes of inactivity (after 20 minutes without activity the Jupyter kernel stops, and after 5 more minutes without a kernel the Jupyter notebook stops and its resources are released). In this case, refresh the notebook, rerun the cell that outputs the link to the Airflow UI, and then use that link to open the Airflow UI. </p>
</div>

### 4.2. Creating two dags: fetch_data and query_data with empty tasks

Run the following two cells, and then check the UI after 30 seconds.

In [None]:
%%writefile ../../dags/fetch_data.py 
from airflow.sdk import chain, dag, task 

@dag
def fetch_data():

    @task
    def create_collection_if_not_exists() -> None:
        pass

    _create_collection_if_not_exists = create_collection_if_not_exists()

    @task
    def list_book_description_files() -> list:
        return []

    _list_book_description_files = list_book_description_files()

    @task
    def transform_book_description_files(book_description_files: list) -> list:
        return []

    _transform_book_description_files = transform_book_description_files(
        book_description_files=_list_book_description_files
    )

    @task
    def create_vector_embeddings(list_of_book_data: list) -> list:
        return []

    _create_vector_embeddings = create_vector_embeddings(
        list_of_book_data=_transform_book_description_files
    )

    @task
    def load_embeddings_to_vector_db(
        list_of_book_data: list, list_of_description_embeddings: list
    ) -> None:
        pass

    _load_embeddings_to_vector_db = load_embeddings_to_vector_db(
        list_of_book_data=_transform_book_description_files,
        list_of_description_embeddings=_create_vector_embeddings,
    )

    chain(
        _create_collection_if_not_exists,
        _load_embeddings_to_vector_db
    )
    

fetch_data()

In [None]:
%%writefile ../../dags/query_data.py 
from airflow.sdk import dag, task  


@dag
def query_data():

    @task
    def search_vector_db_for_a_book(query_str: str) -> None:
        pass

    search_vector_db_for_a_book(query_str="A philosophical book")


query_data()
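
The dependencies that `fetch_data` sets up can be pictured as a small graph: passing one task's output to another task creates an implicit dependency, and `chain` adds the explicit one between `create_collection_if_not_exists` and `load_embeddings_to_vector_db`. The following is a minimal sketch in plain Python. It does not use Airflow, and the `dependencies` dictionary is written by hand for illustration. It prints one valid execution order using the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hand-written copy of the fetch_data dependencies: task -> tasks it waits for.
dependencies = {
    "create_collection_if_not_exists": set(),
    "list_book_description_files": set(),
    "transform_book_description_files": {"list_book_description_files"},
    "create_vector_embeddings": {"transform_book_description_files"},
    "load_embeddings_to_vector_db": {
        "transform_book_description_files",  # from passing its output
        "create_vector_embeddings",          # from passing its output
        "create_collection_if_not_exists",   # from chain(...)
    },
}

# static_order() yields the tasks so that each one comes after its upstream tasks.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The printed list is only one valid order; tasks that do not depend on each other, such as `create_collection_if_not_exists` and `list_book_description_files`, have no fixed order between them.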

### 4.3. Filling out the tasks of the two dags

**Optional reading notes related to the code in the next cell**: 

- `include/data`: in this and subsequent lessons, `include/data` is a directory provided to you (under `/home/jovyan/include`) that contains the text files of book descriptions. This directory is linked to the Airflow environment, so when the Airflow worker runs the tasks that read the text files, it knows where to find the data. See the Resources section below if you'd like to learn how to set up Airflow locally with this directory.
- How Weaviate is set up: it runs as a local standalone server (if you're familiar with Docker, this [Docker image](https://github.com/astronomer/academy-genai/blob/main/docker-compose.override.yml) was used).
- What is `my_weaviate_conn`? 
  - Airflow has a Connection concept for storing credentials that are used to talk to external systems. A Connection is essentially a set of parameters, such as username, password and hostname, along with the type of system that it connects to.
  - `my_weaviate_conn` represents the connection details that are needed to connect to Weaviate.
  - You can enter these details manually in the Airflow UI before running the dag, or you can define them as an environment variable when you set up Airflow. In this course, the connection is defined as this environment variable:
    ``` python
    AIRFLOW_CONN_MY_WEAVIATE_CONN='{
        "conn_type":"weaviate",
        "host":"localhost",
        "port":"8081",
        "extra":{
            "token":"adminkey",
            "additional_headers":{"X-Openai-Api-Key":"<YOUR OPENAI API KEY>"},
            "grpc_port":"50051",
            "grpc_host":"localhost",
            "grpc_secure":"False",
            "http_secure":"False"
        }
    }'
    ```
    The `additional_headers` entry is not used in this course. Make sure to check the Resources section below if you'd like to learn how to set up Airflow with Weaviate locally.
  - The connection details are passed to a `WeaviateHook`; a Hook is a high-level interface to an external platform that lets you talk to it quickly and easily, without having to write low-level code that calls its API or uses special libraries. 
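
To make the naming convention concrete, here is a minimal sketch in plain Python that does not use Airflow. It relies on two things: Airflow reads connections from environment variables named `AIRFLOW_CONN_` followed by the upper-cased connection ID, and the value can be a JSON document. The helper `conn_env_var_name` is hypothetical and written only for illustration; the JSON is a shortened copy of the value shown above.

```python
import json


def conn_env_var_name(conn_id: str) -> str:
    """Hypothetical helper: environment variable name for a connection ID."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


# Shortened copy of the JSON value used in this course.
conn_json = """{
    "conn_type": "weaviate",
    "host": "localhost",
    "port": "8081",
    "extra": {"token": "adminkey", "grpc_port": "50051", "grpc_host": "localhost"}
}"""

conn_details = json.loads(conn_json)
env_var_name = conn_env_var_name("my_weaviate_conn")

print(env_var_name)                                 # AIRFLOW_CONN_MY_WEAVIATE_CONN
print(conn_details["host"], conn_details["port"])   # localhost 8081
```

This is why the tasks can refer to the connection simply as `"my_weaviate_conn"`: the ID in the code and the suffix of the environment variable name match, ignoring case.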

**Update the dags**:

Run the following two cells, and then check the UI after 30 seconds.

In [None]:
%%writefile ../../dags/fetch_data.py 
from airflow.sdk import chain, dag, task 


COLLECTION_NAME = "Books" 
BOOK_DESCRIPTION_FOLDER = "/home/jovyan/include/data"
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"


@dag
def fetch_data():

    @task
    def create_collection_if_not_exists() -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        existing_collections = client.collections.list_all()
        existing_collection_names = existing_collections.keys()

        if COLLECTION_NAME not in existing_collection_names:
            print(f"Collection {COLLECTION_NAME} does not exist yet. Creating it...")
            collection = client.collections.create(name=COLLECTION_NAME)
            print(f"Collection {COLLECTION_NAME} created successfully.")
            print(f"Collection details: {collection}")

    _create_collection_if_not_exists = create_collection_if_not_exists()

    @task
    def list_book_description_files() -> list:
        import os
        
        book_description_files = [
            f for f in os.listdir(BOOK_DESCRIPTION_FOLDER)
            if f.endswith('.txt')
        ]
        return book_description_files

    _list_book_description_files = list_book_description_files()

    @task
    def transform_book_description_files(book_description_files: list) -> list:
        import json
        import os

        list_of_book_data = []
        
        for book_description_file in book_description_files:
            with open(
                os.path.join(BOOK_DESCRIPTION_FOLDER, book_description_file), "r"
            ) as f:
                book_descriptions = f.readlines()
            
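            # Each line holds fields separated by ":::"; fields 1, 2 and 3 are
            # the title, author and description (field 0 is not used).
            titles = [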
                book_description.split(":::")[1].strip()
                for book_description in book_descriptions
            ]
            authors = [
                book_description.split(":::")[2].strip()
                for book_description in book_descriptions
            ]
            book_description_text = [
                book_description.split(":::")[3].strip()
                for book_description in book_descriptions
            ]
            
            book_descriptions = [
                {
                    "title": title,
                    "author": author,
                    "description": description,
                }
                for title, author, description in zip(
                    titles, authors, book_description_text
                )
            ]
        
            list_of_book_data.append(book_descriptions)

        

        return list_of_book_data

    _transform_book_description_files = transform_book_description_files(
        book_description_files=_list_book_description_files
    )

    @task
    def create_vector_embeddings(list_of_book_data: list) -> list:
        from fastembed import TextEmbedding

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
        
        list_of_description_embeddings = []
        
        for book_data in list_of_book_data:
            book_descriptions = [book["description"] for book in book_data]
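            # Embed all descriptions of this file in one call and convert the
            # NumPy values to plain Python floats.
            description_embeddings = [
                list(map(float, embedding))
                for embedding in embedding_model.embed(book_descriptions)
            ]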
        
            list_of_description_embeddings.append(description_embeddings)

        return list_of_description_embeddings

    _create_vector_embeddings = create_vector_embeddings(
        list_of_book_data=_transform_book_description_files
    )

    @task
    def load_embeddings_to_vector_db(
        list_of_book_data: list, list_of_description_embeddings: list
    ) -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from weaviate.classes.data import DataObject

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()
        collection = client.collections.get(COLLECTION_NAME)

        for book_data_list, emb_list in zip(list_of_book_data, list_of_description_embeddings):
            items = []
            
            for book_data, emb in zip(book_data_list, emb_list):
                item = DataObject(
                    properties={
                        "title": book_data["title"],
                        "author": book_data["author"],
                        "description": book_data["description"],
                    },
                    vector=emb
                )
                items.append(item)
            
            collection.data.insert_many(items)

    _load_embeddings_to_vector_db = load_embeddings_to_vector_db(
        list_of_book_data=_transform_book_description_files,
        list_of_description_embeddings=_create_vector_embeddings,
    )

    chain(
        _create_collection_if_not_exists,
        _load_embeddings_to_vector_db
    )
    


fetch_data()
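
The parsing step in `transform_book_description_files` can be tried on its own. The sketch below assumes each line of a description file has four fields separated by `:::`, with the title, author, and description at positions 1, 2, and 3, as the indexing in the task implies. The sample line and the helper `parse_book_line` are made up for illustration and are not taken from the course data.

```python
def parse_book_line(line: str) -> dict:
    """Hypothetical helper: split one ':::'-separated line into a book record."""
    fields = [field.strip() for field in line.split(":::")]
    # Field 0 is not used by the task; fields 1-3 are title, author, description.
    return {"title": fields[1], "author": fields[2], "description": fields[3]}


# Made-up sample line in the assumed format.
sample_line = "0 ::: An Example Title ::: Jane Doe ::: A short description of the book.\n"

book = parse_book_line(sample_line)
print(book)
```

Splitting each line once and reusing the fields gives the same record as the three separate list comprehensions in the task.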

In [None]:
%%writefile ../../dags/query_data.py 
from airflow.sdk import dag, task  

COLLECTION_NAME = "Books"  
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"

@dag
def query_data():

    @task
    def search_vector_db_for_a_book(query_str: str) -> None:
        from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
        from fastembed import TextEmbedding

        hook = WeaviateHook("my_weaviate_conn")
        client = hook.get_conn()

        embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
        collection = client.collections.get(COLLECTION_NAME)
        
        query_emb = list(embedding_model.embed([query_str]))[0]
        
        results = collection.query.near_vector(
            near_vector=query_emb,
            limit=1,
        )
        for result in results.objects:
            print(f"You should read: {result.properties['title']} by {result.properties['author']}")
            print("Description:")
            print(result.properties["description"])

    search_vector_db_for_a_book(query_str="A philosophical book")


query_data()
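
Conceptually, `near_vector` with `limit=1` returns the stored object whose vector is closest to the query vector. The sketch below illustrates that idea in plain Python with made-up three-dimensional vectors and cosine distance. It is not how Weaviate implements the search (Weaviate uses an approximate index rather than comparing against every vector), and the names `stored` and `cosine_distance` are illustrative only.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up titles and toy vectors standing in for real description embeddings.
stored = {
    "Book A": [1.0, 0.0, 0.0],
    "Book B": [0.0, 1.0, 0.0],
    "Book C": [0.6, 0.8, 0.0],
}
query_vector = [0.1, 0.9, 0.0]

# Equivalent of limit=1: keep only the closest stored vector.
best_title = min(stored, key=lambda title: cosine_distance(stored[title], query_vector))
print(best_title)
```

In the dag, the query vector comes from embedding `query_str` with the same model that produced the stored embeddings, which is what makes the distances meaningful.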

### 4.4. Resources

- How to set up Airflow with Weaviate locally using the Astro CLI:
  - The last optional video of this course, "How to Set up a Local Airflow Environment", shows you how to replicate the same lab environment locally. It has this companion [GitHub repo](https://github.com/astronomer/orchestrating-workflows-for-genai-deeplearning-ai).
- [Connections & Hooks](https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/connections.html#)
- [Airflow Weaviate Provider Package](https://airflow.apache.org/docs/apache-airflow-providers-weaviate/stable/index.html): Documentation of the Airflow Weaviate Provider Package which includes the `WeaviateHook`.
- [Airflow Hooks](https://www.astronomer.io/docs/learn/what-is-a-hook/): Learn about Airflow hooks like the `WeaviateHook`.
- [Manage connections in Apache Airflow](https://www.astronomer.io/docs/learn/connections): Learn about the different ways to connect Airflow to other tools.
- [Strategies for custom XCom backends in Airflow](https://www.astronomer.io/docs/learn/custom-xcom-backend-strategies/): Learn how to save data that is passed between tasks in different storage systems.

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

</div>