<a href="https://colab.research.google.com/github/adammuhtar/semantic-information-retrieval/blob/main/notebooks/sierra.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SIERRA ⛰️: Semantic Information Encoding, Retrieval, and Reasoning Agent**

This notebook builds a pipeline for semantic information retrieval from user-provided documents. SIERRA first extracts text from the documents and maps it onto embeddings, a vector representation of its meaning. These embeddings are indexed in a vector database so that the passages most relevant to a user query can be located quickly. SIERRA then passes the retrieved passages to a large language model (LLM), which generates a response based on them. The system also reports the source document and page of each retrieved passage, so that answers can be checked against the original text. The aim is a tool that gives users sourced answers to a wide range of questions about their documents.

## **Table of Contents**

* [1. Notebook setup](#section-1)
* [2. Load and embed corpus](#section-2)
* [3. Load LLM; set up Q&A retrieval chain](#section-3)
* [4. Testing Q&A retrieval chain](#section-4)

## 1. Notebook Setup <a name="section-1"></a>

This notebook is run using [Google Colaboratory](https://colab.research.google.com/) (Colab), Google's implementation of [Jupyter Notebooks](https://jupyter.org/). It requires the following Python version and packages:
* `python==3.9.16`
* `accelerate==0.19.0`
* `bitsandbytes==0.38.1`
* `chromadb==0.3.23`
* `langchain==0.0.176`
* `InstructorEmbedding==1.0.0`
* `pypdf==3.9.0`
* `sentencepiece==0.1.99`
* `torch==2.0.1`
* `tiktoken==0.4.0`
* `transformers==4.29.2`
* `xformers==0.0.19`

This notebook needs a GPU runtime; this instance runs on the Tesla T4 GPU (16 GB GDDR6 @ 320 GB/s) that Colab provides for free.

In [1]:
# Query GPU device status/details
!nvidia-smi

Sun May 21 21:10:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Check IP address details, in case there are restrictions on running non-local servers
!curl ipinfo.io

{
  "ip": "34.91.19.43",
  "hostname": "43.19.91.34.bc.googleusercontent.com",
  "city": "Groningen",
  "region": "Groningen",
  "country": "NL",
  "loc": "53.2115,6.5779",
  "org": "AS396982 Google LLC",
  "postal": "9724",
  "timezone": "Europe/Amsterdam",
  "readme": "https://ipinfo.io/missingauth"
}

In [3]:
!pip install --quiet accelerate bitsandbytes chromadb langchain InstructorEmbedding
!pip install --quiet pypdf sentencepiece tiktoken transformers xformers

In [4]:
# Standard library imports
import textwrap

# Third-party imports
from InstructorEmbedding import INSTRUCTOR
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


In [5]:
# Check available GPUs for computation
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    # Print details of all available GPUs
    for i in range(num_gpus):
        gpu_props = torch.cuda.get_device_properties(i)
        print(f"Device details for GPU {i+1}:")
        print(f"* Name: {gpu_props.name}")
        print(f"* Memory size: {round(gpu_props.total_memory / 1024**3, 2)} GB")
        # Separate consecutive GPU entries with a divider
        if i < num_gpus - 1:
            print("-"*79)
    # Get the currently active GPU device and print its name and memory size
    active_gpu = torch.cuda.current_device()
    active_gpu_props = torch.cuda.get_device_properties(active_gpu)
    print("="*79)
    print(f"Currently active GPU device: {active_gpu_props.name}")
    print(f"Memory size: {round(active_gpu_props.total_memory / 1024**3, 2)} GB")
    print("="*79)
else:
    print("No GPU devices found.")

Device details for GPU 1:
* Name: Tesla T4
* Memory size: 14.75 GB
Currently active GPU device: Tesla T4
Memory size: 14.75 GB


## 2. Load corpus and store information in embedding database <a name="section-2"></a>

This notebook makes use of several publicly available books and reports from NASA:
* [NACA to NASA to Now](https://www.nasa.gov/connect/ebooks/naca-to-nasa-to-now.html)
* [NASA Planetary Defense Strategy and Action Plan](https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf)
* [Advancing NASA's Climate Strategy 2023](https://www.nasa.gov/sites/default/files/atoms/files/advancing_nasas_climate_strategy_2023.pdf)
* [International Space Station Benefits for Humanity](https://www.nasa.gov/mission_pages/station/research/news/b4h-3rd-ed-book)

In [6]:
!mkdir nasa
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/naca_to_nasa_to_now_tagged.pdf
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/advancing_nasas_climate_strategy_2023.pdf
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/iss_benefits_for_humanity_3rded-508.pdf

--2023-05-21 21:28:41--  https://www.nasa.gov/sites/default/files/atoms/files/naca_to_nasa_to_now_tagged.pdf
Resolving www.nasa.gov (www.nasa.gov)... 13.227.219.44, 13.227.219.84, 13.227.219.110, ...
Connecting to www.nasa.gov (www.nasa.gov)|13.227.219.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12449719 (12M) [application/pdf]
Saving to: ‘nasa/naca_to_nasa_to_now_tagged.pdf’


2023-05-21 21:28:43 (14.5 MB/s) - ‘nasa/naca_to_nasa_to_now_tagged.pdf’ saved [12449719/12449719]

--2023-05-21 21:28:43--  https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf
Resolving www.nasa.gov (www.nasa.gov)... 13.227.219.44, 13.227.219.84, 13.227.219.110, ...
Connecting to www.nasa.gov (www.nasa.gov)|13.227.219.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4499697 (4.3M) [application/pdf]
Saving to: ‘nasa/nasa_-_planetary_defense_strategy_-_final-508.pdf’


2023-05-21 21:28:44 (6.93 MB/s) - ‘

In [7]:
# Extract text from each page of the PDF documents
loader = DirectoryLoader("./nasa/", glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
len(documents)

606

In [8]:
# Split extracted text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
texts = text_splitter.split_documents(documents)
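
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks rather than at fixed offsets, and the helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration only; it ignores separators,
    unlike LangChain's RecursiveCharacterTextSplitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one. With `chunk_size=1000` and `chunk_overlap=200`, as used above, neighbouring chunks share up to 200 characters, so a short sentence that straddles a chunk boundary can still appear whole in one of the two chunks.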

We then create a Chroma vector database, which allows us to:
* store embeddings and their metadata
* embed documents and queries
* search embeddings
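
The three operations listed above can be sketched without a vector database. The example below is an illustration under stated assumptions, not how Chroma is implemented: `toy_embed` is a hypothetical bag-of-words stand-in for instructor-xl, the file names and page numbers are made up, and the search is a brute-force cosine similarity scan.

```python
import math
from collections import Counter

# Tiny fixed vocabulary; a real embedding model learns a dense representation
VOCAB = ["asteroid", "impact", "climate", "emissions", "robotics", "station"]


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: count vocabulary words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# "Store": keep each chunk's embedding next to its (made-up) metadata
store = [
    {"text": "asteroid impact risk", "source": "defense.pdf", "page": 3},
    {"text": "climate emissions targets", "source": "climate.pdf", "page": 7},
    {"text": "station robotics arm", "source": "iss.pdf", "page": 122},
]
for record in store:
    record["embedding"] = toy_embed(record["text"])


def search(query: str, k: int = 1) -> list[dict]:
    """Embed the query and return the k most similar stored chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(
        store, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True
    )
    return ranked[:k]


for hit in search("robotics on the station", k=1):
    print(hit["source"], hit["page"])
```

Keeping the metadata next to each embedding is what later lets the pipeline report the source file and page for every retrieved chunk.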

In [9]:
# Load the instructor-xl to embed corpus into vector space
instructor_embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-xl",
    model_kwargs={"device": "cuda"}
)

# Create a Chroma vector database from corpus embeddings
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=instructor_embeddings,
    persist_directory="db"
)

load INSTRUCTOR_Transformer
max_seq_length  512




## 3. Load LLM; set up Q&A retrieval chain <a name="section-3"></a>

We load the [Dolly 2.0 3B](https://huggingface.co/databricks/dolly-v2-3b) model and tokeniser, and construct a text generation pipeline:

* `tokeniser` is created using `AutoTokenizer` from the `transformers` library and loaded with the pre-trained Dolly 2.0 tokeniser from the "databricks/dolly-v2-3b" model checkpoint. The `padding_side` argument is set to `"left"` so that input sequences are padded on the left.
* `model` is created using `AutoModelForCausalLM` from the `transformers` library and loaded with the pre-trained Dolly 2.0 model from the "databricks/dolly-v2-3b" checkpoint. `load_in_8bit=True` loads the model weights in 8-bit precision (via `bitsandbytes`) to reduce GPU memory use. The `torch_dtype` argument is set to `torch.float16`, which uses the reduced precision 16-bit floating point format to speed up the model's computations. `device_map` is set to "auto" to automatically place the model on the available devices (GPU or CPU), and `low_cpu_mem_usage=True` lowers peak CPU memory use while the weights are loaded.
* `pipe` is the Hugging Face Transformers pipeline with the following parameters:
    * `max_length=1024`: This sets the maximum total length of the sequence, counting the prompt tokens plus the generated tokens. Generation stops once the sequence reaches 1024 tokens.
    * `temperature=0`: The temperature controls the randomness of the model's outputs. As the temperature approaches 0, the output becomes deterministic, picking only the most likely next token at each step.
    * `top_p=0.95`: This is the threshold for nucleus (top-p) sampling. When sampling, the model only considers the smallest set of most likely next tokens whose cumulative probability reaches 0.95.
    * `repetition_penalty=1.15`: This is a factor by which the model penalises tokens it has already generated. This can help prevent the model from getting stuck in a loop, endlessly repeating the same phrase.

Note that `temperature` and `top_p` only take effect when sampling is enabled (`do_sample=True`). Without it, generation uses greedy decoding by default, which is already deterministic.
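
The decoding parameters above can be illustrated on a toy next-token distribution. The sketch below follows the common definitions of temperature scaling, nucleus (top-p) filtering, and a repetition penalty that divides positive logits and multiplies negative ones; the function names and the logit values are made up for illustration and this is not the `transformers` implementation.

```python
import math


def softmax(logits: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def penalise_repeats(
    logits: dict[str, float], generated: list[str], penalty: float
) -> dict[str, float]:
    """Make already generated tokens less likely (for penalty > 1)."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise


# Made-up logits for four candidate next tokens
logits = {"Moon": 1.6, "Mars": 1.5, "Earth": 0.5, "banana": -2.0}

# Greedy decoding (the temperature -> 0 limit) picks the largest logit
print("greedy:", max(logits, key=logits.get))

# Nucleus filtering drops the unlikely tail before sampling
print("top-p candidates:", sorted(top_p_filter(softmax(logits), 0.95)))

# If "Moon" was already generated, the penalty lowers its logit
penalised = penalise_repeats(logits, generated=["Moon"], penalty=1.15)
print("greedy after penalty:", max(penalised, key=penalised.get))
```

In this toy case the penalty is enough to change the greedy choice, which is the mechanism that discourages repeated phrases.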

In [10]:
# Load model, tokeniser, and text generation pipeline
model_name = "databricks/dolly-v2-3b"
tokeniser = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokeniser, 
    max_length=1024,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15
)
local_llm = HuggingFacePipeline(pipeline=pipe)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.


With the LLM loaded, we set up a question and answer (Q&A) retrieval chain using LangChain and the previously created Chroma vector database.

In [11]:
# Setup Q&A retrieval chain
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(
    llm=local_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
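
The `stuff` chain type places all retrieved chunks into a single prompt together with the question. A rough sketch of that assembly step follows; the template wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Join every retrieved chunk into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder text standing in for the k=3 chunks returned by the retriever
retrieved = [
    "Chunk one of retrieved text.",
    "Chunk two of retrieved text.",
    "Chunk three of retrieved text.",
]
prompt = build_stuff_prompt("What does the text say?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks and the question must fit in the model's context together. With `k=3` chunks of up to 1000 characters each, the context alone can reach roughly 3,000 characters, which likely takes up a sizeable share of the 1024-token `max_length` set on the pipeline.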

In [12]:
# Function to convert LLM outputs into readable text
def sierra_speak(width: int = 100) -> None:
    """
    Prompt the user for a question, pass it through the Q&A retrieval chain
    and the local LLM, and print the answer followed by its sources.

    Args:
        * width (`int`, optional): Maximum line width for the formatted response.
        Defaults to 100.

    Returns:
        * `None`: The formatted response and its sources are printed, not returned.

    The function works as follows:
    1. Takes a query from the user.
    2. Calls `qa_chain` to retrieve relevant chunks and generate a response.
    3. Wraps each line of the response so that it does not exceed the given width.
    4. Prints the formatted response.
    5. Lists the source file and page of each retrieved chunk.
    """
    query = input("Question: ")
    print("="*width)
    llm_response = qa_chain(query)
    # Wrap each line separately so the response's own line breaks are preserved
    response = llm_response["result"].split("\n")
    response = [textwrap.fill(line, width=width) for line in response]
    wrapped_response = "\n".join(response)
    print(wrapped_response)
    print("-"*width)
    for source in llm_response["source_documents"]:
        print(f"{source.metadata['source']} - page {source.metadata['page']}")
    print("-"*width)
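
Because `sierra_speak` reads from `input()` and prints directly, it is awkward to reuse or test. One possible refactoring, sketched here with a hypothetical `format_response` helper, separates formatting from I/O. The hand-made dictionary stands in for a real `qa_chain` result, the file name is made up, and the `SimpleNamespace` objects only mimic the `metadata` attribute of the returned source documents.

```python
import textwrap
from types import SimpleNamespace


def format_response(llm_response: dict, width: int = 100) -> str:
    """Wrap the answer text and append one 'source - page N' line per source."""
    lines = llm_response["result"].split("\n")
    wrapped = "\n".join(textwrap.fill(line, width=width) for line in lines)
    sources = "\n".join(
        f"{doc.metadata['source']} - page {doc.metadata['page']}"
        for doc in llm_response["source_documents"]
    )
    return f"{wrapped}\n{'-' * width}\n{sources}"


# Hand-made stand-in for the dictionary returned by qa_chain(query)
fake_response = {
    "result": "NACA became NASA in 1958.",
    "source_documents": [
        SimpleNamespace(metadata={"source": "nasa/example.pdf", "page": 2}),
    ],
}
print(format_response(fake_response, width=40))
```

A helper like this returns a string, so it could be called with the output of `qa_chain(query)` for a fixed list of questions instead of typing each one interactively.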

## 4. Testing Q&A retrieval chain <a name="section-4"></a>

This section tests the Q&A pipeline built by asking the following questions:
* How large should meteors be to start threaten humans on Earth?
* What are NASA's key priorities in NASA's Climate Strategy?
* How has the International Space Station advanced the field of robotics?
* When did NACA change its name to NASA?
* What are NASA's plans to mitigate risks from Near Earth Objects?

In [13]:
sierra_speak()

Question: How large should meteors be to start threaten humans on Earth?


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 The size of meteors that could threaten human populations depends on several factors including
their speed, trajectory, and composition. A meteor about 15 centimeters across would be unlikely to
cause any injuries or deaths if it hit near the ground. However, a meteor traveling 30 kilometers
per second and entering the atmosphere at a 45 degree angle could produce a global explosion
equivalent to about 100 megatons of TNT. Meteor fragments even as small as 5 centimeters could cause
significant property damage.

Meteor size for potential threat to humanity based on speed, trajectory, and composition. Source:
NASA

Note: This answer was adapted from the Planetary Defense Strategy document provided by NASA.


----------------------------------------------------------------------------------------------------
nasa/nasa_-_planetary_defense_strategy_-_final-508.pdf - page 3
nasa/nasa_-_planetary_defense_strategy_-_final-508.pdf - page 3
nasa/nasa_-_planetary_defense_strategy_-_final-508.pdf

In [14]:
sierra_speak()

Question: What are NASA's key priorities in NASA's Climate Strategy?


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 Priorities include reducing greenhouse gas emissions from NASA operations; developing technologies
for carbon sequestration; increasing awareness of climate change impacts on society and the
environment; and working with other federal agencies and international partners to develop a
comprehensive plan to reduce greenhouse gases and address climate change.

The following is a list of related documents that provide more detail about these priorities:

        - NASA's Greenhouse Gas Inventory and Reduction Plan
(https://www.nasa.gov/topics/earthsystem/features/greenhousegasinventoryplan.html) provides
information about how NASA will measure, monitor, report, and respond to greenhouse gas emissions.
This inventory includes detailed descriptions of all activities associated with measuring,
monitoring, reporting, and responding to greenhouse gases. It also describes strategies for reducing
emissions through energy conservation measures and using renewable fuels.

        - The Intergovernm

In [15]:
sierra_speak()

Question: How has the International Space Station advanced the field of robotics?


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 The International Space Station (ISS), launched in 2001, provided an ideal testbed for developing
multiple space robotics operations including precision and reliability required for longer duration
missions beyond earth orbit.

Through its partnership with NASA and Canadian Space Agency, several key advances were made in the
areas of robotics, imaging, automation and servicing satellites in space.


----------------------------------------------------------------------------------------------------
nasa/iss_benefits_for_humanity_3rded-508.pdf - page 122
nasa/iss_benefits_for_humanity_3rded-508.pdf - page 136
nasa/iss_benefits_for_humanity_3rded-508.pdf - page 122
----------------------------------------------------------------------------------------------------


In [16]:
sierra_speak()

Question: When did NACA change its name to NASA?


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 The National Advisory Committee for Aeronautics (NACA) changed its name to NASA in 1958.


----------------------------------------------------------------------------------------------------
nasa/naca_to_nasa_to_now_tagged.pdf - page 2
nasa/naca_to_nasa_to_now_tagged.pdf - page 185
nasa/naca_to_nasa_to_now_tagged.pdf - page 19
----------------------------------------------------------------------------------------------------


In [17]:
sierra_speak()

Question: What are NASA's plans to mitigate risks from Near Earth Objects?


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



    The NASA Planetary Defense Strategy and Action Plan describes NASA's current and planned
activities related to planetary defense.

    The strategy identifies three key areas where NASA is focusing its planetary defense efforts:
    1. Surveying the near-Earth object (NEO) population: NASA continues to conduct surveys of the
NEO population using ground-based telescopes as well as space-based telescopes. These surveys
provide information about the orbits of known NEOs, which enables us to predict the orbits of
unknown objects. Additionally, these surveys enable us to identify new classes of NEOs that may
represent a higher risk of impact with Earth. We have identified over 1,000 NEOs larger than 10
meters across, but only a few dozen of those are larger than 100 meters. We estimate there could be
hundreds of thousands of smaller NEOs that pose a greater risk of impact with Earth. To date, no
NEOs larger than 150 meters have been discovered.
    2. Assessing the risk posed by indivi