<a href="https://colab.research.google.com/github/adammuhtar/semantic-information-retrieval/blob/main/notebooks/sierra-openai-crewai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SIERRA ⛰️ with crewAI 🚣‍♀️🚣‍♂️: Semantic Information Encoding, Retrieval, and Reasoning Agent**

---

*- by Adam Muhtar*

This notebook details the steps to run a simple crew of AI agents that improves the report generation process for a technical subject such as Near-Earth Objects (NEOs).

1. **Semantic Search Initiation**:
   We first run a semantic search on a predefined corpus to retrieve information pertinent to a user's specific request. This process involves understanding the query's context, recognising key concepts, and pulling relevant documents or excerpts that align closely with the user's needs.

2. **Document Drafting**:
   Once the relevant data is gathered, an AI agent uses this information to compose a well-structured briefing document. The document is organised with an overview section that summarises the key findings, followed by detailed sections that delve deeper into specific research findings. This agent ensures that all critical information is covered effectively and is presented in a logical, coherent manner.

3. **Critical Review**:
   The next AI agent reviews the drafted document critically. It assesses both the content and the structure of the report, providing detailed feedback on various aspects such as clarity, completeness, relevance, and coherence. This critique includes specific suggestions for improvement, such as areas where additional information is needed, where simplification might be beneficial, or where the argument needs stronger supporting evidence.

4. **Report Refinement**:
   Following the critique, the initial drafting AI agent or another specialised agent rewrites the report, incorporating the feedback received. This involves adjusting the content where necessary, enhancing the clarity and flow of the information, and ensuring that all sections of the document now align more closely with the best practices of report writing and the specific demands of the briefing's audience.

This AI-driven workflow leverages the capabilities of multiple pre-defined agents with various specialisms, each focusing on different aspects of the process to ensure the final output is of high quality and meets the user's specific needs. The collaboration between these AI agents mimics an iterative approach typically seen in human teams, providing a way forward to incorporate these systems as knowledge worker co-pilots.
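
The draft, critique, and rewrite steps above can be pictured as a simple sequential pipeline. The sketch below is purely illustrative and is not how crewAI is implemented: `draft`, `critique`, `rewrite`, and `run_pipeline` are hypothetical names, and the strings they return are placeholders standing in for LLM output.

```python
from typing import Callable, List

# Hypothetical stand-ins for the LLM-backed agents: each takes the retrieved
# source text plus the previous step's output and returns a new string.
def draft(sources: str, previous: str) -> str:
    return f"DRAFT based on [{sources}]"

def critique(sources: str, previous: str) -> str:
    return f"CRITIQUE of ({previous})"

def rewrite(sources: str, previous: str) -> str:
    return f"REWRITE using ({previous})"

def run_pipeline(sources: str, steps: List[Callable[[str, str], str]]) -> List[str]:
    """Run each step in order, feeding it the previous step's output."""
    outputs: List[str] = []
    previous = ""
    for step in steps:
        previous = step(sources, previous)
        outputs.append(previous)  # keep an audit trail of every stage
    return outputs

trail = run_pipeline("retrieved NEO excerpts", [draft, critique, rewrite])
for stage in trail:
    print(stage)
```

Keeping every intermediate output, rather than only the last one, mirrors the audit trail that the crew prints when it runs in verbose mode.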

## **Table of Contents**

* [1. Notebook setup](#section-1)
* [2. Download corpus and define text pre-processing functions](#section-2)
* [3. Setup semantic search with contextual compressor](#section-3)
* [4. Define agents](#section-4)
* [5. Define tasks](#section-5)
* [6. Define crew](#section-6)

## 1. Notebook Setup <a id="section-1"></a>

This notebook is run using [Google Colab](https://colab.research.google.com/) - Google's implementation of [Jupyter Notebooks](https://jupyter.org/). This notebook will require the following package(s) to be installed:
* `crewai`
* `faiss-cpu`
* `langchain`
* `pymupdf`
* `sentence-transformers`

Running this Colab notebook will require hardware accelerators to access higher RAM runtimes; this instance runs on the Tesla T4 GPU (16 GB GDDR6 @ 320 GB/s) provided for free by Google.

Additionally, we will be using OpenAI's GPT models as the AI agents. We store our [OpenAI API key](https://openai.com/blog/openai-api) in Colab's Secrets tab, accessible under the name "OpenAI".

In [None]:
# Check IP address details if there are restrictions running non-local servers
!curl ipinfo.io

{
  "ip": "34.173.91.161",
  "hostname": "161.91.173.34.bc.googleusercontent.com",
  "city": "Council Bluffs",
  "region": "Iowa",
  "country": "US",
  "loc": "41.2619,-95.8608",
  "org": "AS396982 Google LLC",
  "postal": "51502",
  "timezone": "America/Chicago",
  "readme": "https://ipinfo.io/missingauth"
}

In [None]:
# Query GPU device status/details
!nvidia-smi

Tue Apr 16 07:30:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
# Install dependencies
!pip install --quiet --upgrade crewai faiss-cpu langchain pymupdf sentence_transformers


In [2]:
# Standard library imports
from pathlib import Path
import re
from typing import List
import textwrap
from urllib.parse import urlparse

# Third party imports
from crewai import Agent, Crew, Process, Task
from google.colab import userdata
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter, LLMChainExtractor
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import FAISS
from langchain_community.document_loaders import PyMuPDFLoader
import requests
import torch
from tqdm import tqdm

## 2. Download corpus and define text pre-processing functions <a id="section-2"></a>

This notebook makes use of several publicly available books and reports from NASA:
* [NASA'S Discovery Program](https://www.nasa.gov/history/nasas-discovery-program-book/)
* [NASA Planetary Defense Strategy and Action Plan](https://www.nasa.gov/directorates/smd/planetary-science-division/planetary-defense-coordination-office/nasa-releases-agency-strategy-for-planetary-defense-to-safeguard-earth/)
* [NACA to NASA to Now](https://www.nasa.gov/history/history-publications-and-resources/nasa-history-series/naca-to-nasa-to-now/)
* [A History of Near-Earth Objects Research](https://www.nasa.gov/history/history-publications-and-resources/nasa-history-series/a-history-of-near-earth-objects-research/)
* [International Space Station Benefits for Humanity](https://www.nasa.gov/ebooks/counting-the-many-ways-the-space-station-benefits-humankind/)
* [Economic Development of Low Earth Orbit](https://www.nasa.gov/ebooks/economic-development-of-low-earth-orbit/)

We download the PDF files and store them in a folder named "nasa" within the current working directory.
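
One detail worth noting is how the local file name is derived: `download_files` takes the last segment of the URL path, so any query string is dropped. The snippet below isolates that step as a minimal sketch; `file_name_from_url` is a hypothetical helper introduced only for this illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

def file_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return Path(urlparse(url).path).name

# One of the corpus URLs carries a query string, which should not end up
# in the saved file name.
url = (
    "https://www.nasa.gov/wp-content/uploads/2019/04/"
    "iss_benefits_for_humanity_3rded-508.pdf?emrc=0f36d3"
)
file_name = file_name_from_url(url)
print(file_name)
```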

In [3]:
# Function to download files
def download_files(urls: List[str], folder_path: Path) -> None:
    """
    Downloads a list of files from the specified URLs into a given folder.

    Args:
        * urls (`List[str]`): URLs of the files to download.
        * folder_path (`Path`): The Path object for the folder where the files
            will be saved.

    Creates a folder if it doesn't exist and downloads each file from the list into it.
    """
    # Create the folder if it doesn't exist
    folder_path.mkdir(parents=True, exist_ok=True)

    # Loop through the list of files and download each one
    for url in urls:
        try:
            # Time out if the server stops responding, rather than hanging indefinitely
            response = requests.get(url, timeout=60)
            response.raise_for_status()  # Raises HTTPError if the HTTP request returned an unsuccessful status code

            # Extract the file name from the URL
            file_name = Path(urlparse(url).path).name

            # Write the file to the specified folder
            with open(folder_path / file_name, "wb") as file:
                file.write(response.content)

        except requests.HTTPError as e:
            print(f"HTTP Error occurred: {e}")
        except Exception as e:
            print(f"An error occurred: {e}")

# Function to pre-process multi-line string into a single line string
def clean_text(input_string: str) -> str:
    """
    Process a multi-line string to remove leading and trailing whitespace,
    replace newline characters with spaces, and collapse multiple spaces into a
    single space.

    Args:
        input_string (`str`): The multi-line string to be processed.

    Returns:
        `str`: The processed string.
    """
    return re.sub(" +", " ", input_string.strip().replace("\n", " "))

# Function to pre-process text into clean plain text
def preprocess_text(
    text: str,
    encoding: bool = True,
    lowercase: bool = False,
    remove_newlines: bool = True
) -> str:
    """
    Takes in a string and removes newline characters, tab characters, excess
    whitespaces, as well as regularizing common unicode characters.

    Args:
        * text (`str`): Text to pre-process
        * encoding (`bool`): Replace HTML-escaped apostrophes (`&#x27;`) with
        plain apostrophes. Default is `True`.
        * lowercase (`bool`): Returns the processed string in lowercase if set
        to `True`. Default is `False`.
        * remove_newlines (`bool`): Removes all newline characters in string.
        Default is `True`.

    Returns:
        * `str`: Pre-processed text
    """
    # Fix apostrophes/quotation marks
    _text = re.sub("[‘’]", "'", text)
    _text = re.sub("[“”]", '"', _text)

    if encoding:
        _text = re.sub("(&\\\\#x27;|&#x27;)", "'", _text)

    # Remove newlines, tabs, non-breaking spaces, excess backslashes/whitespaces
    if remove_newlines:
        _text = re.sub("[\n\r]+", " ", _text)
    _text = re.sub("[\t\xa0]+", " ", _text)
    _text = re.sub(r"\\+", "", _text)
    _text = re.sub(r"\s+", " ", _text).strip()

    if lowercase:
        _text = _text.lower()

    return _text

# Function to wrap text while preserving newlines
def wrap_with_newline(text: str, width: int = 80) -> str:
    """
    Wrap text to a specified width while preserving newlines.

    Args:
        * text (`str`): The text to wrap.
        * width (`int`): The maximum width of each line. Default is 80.

    Returns:
        * `str`: The wrapped text.
    """
    lines = text.split("\n")
    wrapped_lines = [textwrap.fill(line, width) for line in lines]
    return "\n".join(wrapped_lines)
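
To see why `wrap_with_newline` wraps line by line, the sketch below compares it with calling `textwrap.fill` on the whole string, which by default treats newlines as ordinary whitespace and merges the lines. The helper is re-declared here so the snippet runs on its own, and the sample text is a made-up placeholder.

```python
import textwrap

def wrap_with_newline(text: str, width: int = 80) -> str:
    # Same logic as the notebook's helper: wrap each line separately
    return "\n".join(textwrap.fill(line, width) for line in text.split("\n"))

# Made-up sample in the same "Source / Page / Content" layout used later
sample = "Source 1: Example title\nPage: 3\nContent: " + "word " * 12

naive = textwrap.fill(sample, 40)          # newlines collapse into spaces
preserved = wrap_with_newline(sample, 40)  # existing line breaks survive

print(naive)
print("---")
print(preserved)
```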

In [4]:
# Download files from pre-defined URLs
urls = [
    "https://www.nasa.gov/wp-content/uploads/2024/01/discovery-program-ebook.pdf",
    "https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf",
    "https://www.nasa.gov/wp-content/uploads/2023/02/NACA-to-NASA-to-NOW_TAGGED.pdf",
    "https://www3.nasa.gov/sites/default/files/atoms/files/a_history_of_near-earth_object_research_tagged.pdf",
    "https://www.nasa.gov/wp-content/uploads/2019/04/iss_benefits_for_humanity_3rded-508.pdf?emrc=0f36d3",
    "https://www.nasa.gov/wp-content/uploads/2016/01/economic-development-of-low-earth-orbit_tagged_v2.pdf"
]
nasa_dir = Path.cwd() / "nasa"
download_files(urls=urls, folder_path=nasa_dir)

# Extract text from PDFs
docs = []
for pdf in tqdm(nasa_dir.rglob("*.pdf"), desc="Processing PDFs", unit="PDF"):
    docs.extend(PyMuPDFLoader(str(pdf)).load())

Processing PDFs: 6PDF [00:11,  1.94s/PDF]


## 3. Setup semantic search with contextual compressor <a id="section-3"></a>

We then create a vector database of our corpus by creating embeddings from the text extracted from each page. This allows us to:
* encode extracted texts from documents as vector embeddings.
* store these embeddings and their associated metadata.
* perform semantic similarity searches on these embeddings.

For this step, we use the BAAI General Embedding (BGE) model, based on this model checkpoint: https://huggingface.co/BAAI/bge-large-en-v1.5. At the time of writing, the BGE models are among the top performing models in the Hugging Face [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leaderboard.

We then pass the retrieved information and the user's query to an LLM ([OpenAI's GPT-4](https://openai.com/research/gpt-4)), which acts as a filter and removes any retrieved documents that are not relevant to the query. This 'compresses' the context provided to the downstream agents and keeps potentially redundant information out of the final output.
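
The retrieve-then-filter idea can be sketched in plain Python. This is a toy illustration under stated assumptions, not the FAISS or LangChain implementation: the 2-D vectors are made up rather than produced by the BGE model, `top_k` and `compress` are hypothetical helpers, and a keyword check stands in for the LLM relevance filter. For unit-length vectors, ranking by dot product gives the same order as ranking by cosine similarity, which is why the embeddings are normalised.

```python
import math
from typing import Callable, List, Tuple

def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(
    query_vec: List[float], doc_vecs: List[List[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k most similar documents."""
    q = normalize(query_vec)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(d))))
        for i, d in enumerate(doc_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

def compress(
    hits: List[Tuple[int, float]],
    texts: List[str],
    is_relevant: Callable[[str], bool],
) -> List[str]:
    """Keep only the retrieved texts that the filter judges relevant."""
    return [texts[i] for i, _ in hits if is_relevant(texts[i])]

texts = [
    "Objects larger than 140 meters can cause regional devastation.",
    "The ISS hosts microgravity experiments.",
    "NEO survey telescopes scan the sky nightly.",
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # made-up 2-D "embeddings"
query_vec = [1.0, 0.0]

hits = top_k(query_vec, doc_vecs, k=2)
# Keyword check used here as a stand-in for the LLM-based filter
kept = compress(hits, texts, is_relevant=lambda t: "meters" in t)
print([i for i, _ in hits])
print(kept)
```

In this toy run the first and third texts are retrieved as the nearest neighbours, and only the first survives the filter, which is the "compression" step in miniature.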

In [5]:
# Setup encoder for semantic search
model_name = "BAAI/bge-large-en-v1.5"
device = (
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={"device": device},
    encode_kwargs={"normalize_embeddings": True}
)


In [6]:
# Instantiate the LLM model
llm = ChatOpenAI(
    openai_api_key=userdata.get("OpenAI"),
    model="gpt-4",
    temperature=0
)

# User's query input
query = "How large do meteors need to be to start becoming a threat to life on Earth?"

# Setting up a FAISS retriever, followed by context compressor
retriever = FAISS.from_documents(docs, bge_embeddings).as_retriever()
_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)
retrieved_info = compression_retriever.get_relevant_documents(query)

# Reformat retrieved information into a long string
compressed_context = ""
for i, doc in enumerate(retrieved_info, start=1):
    compressed_context += f"Source {i}: " + doc.metadata["title"] + "\n"
    compressed_context += "Page: " + str(doc.metadata["page"]) + "\n"
    compressed_context += "Content: " + preprocess_text(doc.page_content) + "\n\n"

print(wrap_with_newline(compressed_context))

Source 1: NASA Planetary Defense Strategy and Action Plan
Page: 3
Content: NASA Planetary Defense Strategy And Action Plan | 2 The threat exists
because our planet orbits the Sun amidst millions of objects that cross our
orbit – asteroids and comets. Even a rare interstellar asteroid or comet from
outside our solar system can enter Earth's neighborhood. Characteristics of the
estimated NEO population: • Around 1,000 NEOs greater than one kilometer in size
that are potentially capable of causing global impact effects. Approximately 95
percent of these bodies have been found and none are a current threat. • Around
25,000 objects larger than 140 meters in size, capable of causing regional
devastation, are believed to exist. Less than 50 percent have been detected and
tracked to date. • An estimated 230,000 or more objects exist that are equal to
or larger than 50 meters in size and could destroy a concentrated urban area. It
is estimated that fewer than eight percent of these have been de

## 4. Define agents <a id="section-4"></a>

We use the crewAI framework to define personas of various AI agents, each with their own respective backstories and goals to achieve.

In [7]:
class NasaAgents:
    def __init__(self):
        self.llm = ChatOpenAI(
            openai_api_key=userdata.get("OpenAI"),
            model="gpt-4",
            temperature=0
        )

    def report_agent(self):
        goal_text = """
        Based on all information provided, conduct a thorough review of all Near
        Earth Object (NEO)-related findings, providing an in-depth report of
        their key findings and outputs detailing their risks and impacts. You
        should always cite your sources.
        """
        backstory_text = """
        You are NASA's chief space risk officer, ensuring the NASA Director is
        briefed with all the most important information relating to risks and
        impacts of threats posed by Near Earth Objects.
        """
        return Agent(
            role="NASA Near Earth Objects Report Writer",
            goal=clean_text(goal_text),
            backstory=clean_text(backstory_text),
            llm=self.llm,
            verbose=True,
            allow_delegation=False
        )

    def reviewer_agent(self):
        goal_text = """
        Based on the report and source information retrieved, conduct a thorough
        review of the report about the risks and impacts of Near-Earth Object
        (NEO)-related findings. The detailed feedback could include aspects such
        as clarity, completeness, relevance, and coherence. The output should be
        a critique and points on ways to improve the report. Do not make up new
        facts, only use facts from the source information retrieved.
        """
        backstory_text = """
        You are NASA's chief space risk officer, ensuring that all reports containing
        information relating to risks and impacts of threats posed by Near Earth
        Objects are accurate from the information provided.
        """
        return Agent(
            role="NASA Near Earth Objects Report Reviewer",
            goal=clean_text(goal_text),
            backstory=clean_text(backstory_text),
            llm=self.llm,
            verbose=True,
            allow_delegation=False
        )

    def rewriter_agent(self):
        goal_text = """
        Rewrite the report to incorporate the critiques of the original report.
        Do not make up new facts, only use facts from the source information
        retrieved.
        """
        backstory_text = """
        You are NASA's chief space risk officer, ensuring the NASA Director is
        briefed with all the most important information relating to risks and
        impacts of threats posed by Near Earth Objects.
        """
        return Agent(
            role="NASA Near Earth Objects Report Writer",
            goal=clean_text(goal_text),
            backstory=clean_text(backstory_text),
            llm=self.llm,
            verbose=True,
            allow_delegation=False
        )

## 5. Define tasks <a id="section-5"></a>

We then use the crewAI framework to define the tasks that need to be performed by the various agents, through pre-defined task descriptions and desired output descriptions.

In [8]:
class ResearchTasks:
    def write_report(self, agent, query, retrieved_info):
        task_desc = clean_text(
            f"""
            For the given question '{query}', Compile all the research findings
            into a comprehensive briefing document. Ensure this document contains
            all the relevant entities and technical information provided from
            research, delimited by triple backticks:
            """
        )
        task_desc += "\n\n" + f"```{retrieved_info}```"
        output_desc = clean_text(
            """
            A well-structured briefing document that includes sections for
            the overview, detailed information for the various research findings.
            """
        )
        return Task(
            description=task_desc,
            agent=agent,
            expected_output=output_desc
        )

    def critique_report(self, agent, retrieved_info):
        task_desc = clean_text(
            """
            Write a critique of the original report based on the source
            information provided. The detailed feedback could include aspects
            such as clarity, completeness, relevance, and coherence. Do not make
            up new facts, only use facts from the source information retrieved,
            delimited by triple backticks:
            """
        )
        task_desc += "\n\n" + f"```{retrieved_info}```"
        output_desc = clean_text(
            """
            A series of points that lists detailed pointers of how to improve
            the report based on the facts.
            """
        )
        return Task(
            description=task_desc,
            agent=agent,
            expected_output=output_desc,
            # Relies on the module-level `write_report` Task created in section 6
            context=[write_report]
        )

    def rewrite_report(self, agent, retrieved_info):
        task_desc = clean_text(
            f"""
            Rewrite the report, utilising all the critique points and research
            findings into an updated comprehensive briefing document. Ensure this
            document sticks to source information retrieved, delimited by triple
            backticks:
            """
        )
        task_desc += "\n\n" + f"```{retrieved_info}```"
        output_desc = clean_text(
            """
            A well-structured briefing document that includes sections for
            the overview, detailed information for the various research findings
            with citations showing where the information came from, where necessary.
            Do not repeat the original report or the source information.
            """
        )
        return Task(
            description=task_desc,
            agent=agent,
            expected_output=output_desc,
            # Relies on the module-level `critique_report` Task created in section 6
            context=[critique_report]
        )
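
To make the prompt assembly concrete, the sketch below builds one task description the same way the methods above do: the instruction is flattened to a single line, and the retrieved text is appended between triple backticks. The helper is re-declared so the snippet runs on its own, the instruction wording is shortened, and `retrieved_info` is a made-up placeholder rather than real retriever output.

```python
import re

def clean_text(input_string: str) -> str:
    # Copy of the notebook's helper: flatten a multi-line string to one line
    return re.sub(" +", " ", input_string.strip().replace("\n", " "))

query = "How large do meteors need to be to start becoming a threat to life on Earth?"
# Placeholder standing in for the compressed context from the retriever
retrieved_info = "Source 1: Example title\nPage: 3\nContent: Example excerpt."

fence = "`" * 3  # triple-backtick delimiter, built this way for readability

task_desc = clean_text(
    f"""
    For the given question '{query}', compile all the research findings
    into a comprehensive briefing document, delimited by triple backticks:
    """
)
task_desc += "\n\n" + fence + retrieved_info + fence
print(task_desc)
```

Only the instruction passes through `clean_text`; the source text is appended afterwards, so its `Source`, `Page`, and `Content` line breaks stay intact inside the delimiters.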

## 6. Define crew <a id="section-6"></a>

Finally, we assemble the crew of agents and map them to their respective tasks. We then kick off the process and print the outputs generated by the crew. An audit trail of each agent's output is printed, and the final output is shown in the last cell.

In [9]:
# Create Agents
agents = NasaAgents()
report_agent = agents.report_agent()
reviewer_agent = agents.reviewer_agent()
rewriter_agent = agents.rewriter_agent()

# Create Tasks
tasks = ResearchTasks()
write_report = tasks.write_report(
    agent=report_agent, query=query, retrieved_info=compressed_context
)
critique_report = tasks.critique_report(
    agent=reviewer_agent, retrieved_info=compressed_context
)
rewrite_report = tasks.rewrite_report(
    agent=rewriter_agent, retrieved_info=compressed_context
)

# Create Crew
neo_crew = Crew(
    agents=[report_agent, reviewer_agent, rewriter_agent],
    tasks=[write_report, critique_report, rewrite_report],
    process=Process.sequential,
    verbose=True
)

# Run the Crew
result = neo_crew.kickoff()

[1m[95m [DEBUG]: == Working Agent: NASA Near Earth Objects Report Writer[00m
[1m[95m [INFO]: == Starting Task: For the given question 'How large do meteors need to be to start becoming a threat to life on Earth?', Compile all the research findings into a comprehensive briefing document. Ensure this document contains all the relevant entities and technical information provided from research, delimited by triple backticks:

```Source 1: NASA Planetary Defense Strategy and Action Plan
Page: 3
Content: NASA Planetary Defense Strategy And Action Plan | 2 The threat exists because our planet orbits the Sun amidst millions of objects that cross our orbit – asteroids and comets. Even a rare interstellar asteroid or comet from outside our solar system can enter Earth's neighborhood. Characteristics of the estimated NEO population: • Around 1,000 NEOs greater than one kilometer in size that are potentially capable of causing global impact effects. Approximately 95 percent of these bodies ha

In [10]:
# Final output of the crew
print(wrap_with_newline(result))

Title: Near-Earth Objects (NEOs): Threats, Impacts, and NASA's Mitigation
Strategies

I. Introduction

Near-Earth Objects (NEOs) are celestial bodies, including asteroids and comets,
that orbit the Sun and cross Earth's orbit. These objects pose a potential
threat due to their proximity and potential for collision with Earth. The
origins of these objects are diverse, ranging from within our solar system to
rare interstellar bodies from outside our solar system. The potential damage
caused by these objects varies significantly, depending on their size and
velocity at the point of impact.

II. NEO Population and Potential Impacts

The estimated NEO population is vast and diverse. It includes around 1,000 NEOs
greater than one kilometer in size, capable of causing global impact effects.
Approximately 95 percent of these bodies have been found and none are currently
considered a threat. There are also around 25,000 objects larger than 140 meters
in size, capable of causing regional devasta