# Document Question-Answering

This [OnPrem.LLM](https://github.com/amaiya/onprem) example demonstrates retrieval-augmented generation (RAG).

## Basic RAG

### STEP 1: Set up the `LLM` instance

In this first example, we will use a model called **[Zephyr-7B-beta](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF)**.  When selecting a model, it is important to inspect the model's home page and identify the correct prompt format.  The prompt format for this model is [located here](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF#prompt-template-zephyr), and we will supply it directly to the `LLM` constructor along with the URL to the specific model file we want (i.e., *zephyr-7b-beta.Q4_K_M.gguf*).  We will offload layers to our GPU(s) to speed up inference using the `n_gpu_layers` parameter. (For more information on GPU acceleration, see [here](https://amaiya.github.io/onprem/#speeding-up-inference-using-a-gpu).) For the purposes of this notebook, we also supply `temperature=0` so that there is no variability in outputs.  You can increase this value for more creativity in the outputs. Finally, we will choose a non-default location for our vector database.

In [None]:
# | notest

from onprem import LLM, utils as U
import tempfile
from textwrap import wrap

In [None]:
# | notest

vectordb_path = tempfile.mkdtemp()

llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf', 
          prompt_template= "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
          n_gpu_layers=-1,
          temperature=0,
          store_type='dense',
          vectordb_path=vectordb_path,
         verbose=False)

llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


Since OnPrem.LLM includes built-in support for Zephyr, an easier way to instantiate the LLM with Zephyr is as follows:

```python
llm = LLM(default_model='zephyr', 
          n_gpu_layers=-1,
          temperature=0,
          store_type='dense',
          vectordb_path=vectordb_path)
```
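
The `prompt_template` passed to the first constructor call contains a `{prompt}` placeholder that stands for the user's text. As a minimal sketch of what that substitution looks like, the snippet below fills the Zephyr template with plain `str.format`. The helper name `fill_template` is hypothetical, and how OnPrem.LLM performs the substitution internally may differ.

```python
# The Zephyr prompt template used in the LLM constructor above.
prompt_template = "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>"


def fill_template(template: str, user_prompt: str) -> str:
    """Hypothetical helper: substitute the user's text for the {prompt} placeholder."""
    return template.format(prompt=user_prompt)


full_prompt = fill_template(prompt_template, "What is RAG?")
print(full_prompt)
```

Printing the filled template is a quick way to check that the role markers match the format documented on the model's home page.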



### STEP 2: Ingest Documents

When ingesting documents, they can be stored in one of two ways:
1. a **dense** vector store:  a conventional vector database like Chroma
2. a **sparse** vector store: a keyword-search engine

Sparse vector stores compute embeddings on the fly at inference time. As a result, they sacrifice a small amount of inference speed for significantly faster ingestion, which makes them better suited to larger document sets.  Note that sparse vector stores impose the constraint that any passage considered as a source for an answer must have at least one word in common with the question being asked. You can specify the kind of vector store by supplying either `store_type="dense"` or `store_type="sparse"` when creating the `LLM` above.  We use a dense vector store in this example, as shown above.
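
To make the keyword-overlap constraint concrete, here is a toy sketch in plain Python. It is not how OnPrem.LLM or its search backend is implemented (a real engine also handles stop words, stemming, and ranking). The functions `tokenize` and `shares_keyword` and the sample passages are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shares_keyword(question: str, passage: str) -> bool:
    """True if the passage has at least one token in common with the question."""
    return bool(tokenize(question) & tokenize(passage))


question = "What is said about artificial intelligence training?"
passages = [
    "The committee supports artificial intelligence education.",  # overlaps
    "Shipbuilding budgets were increased.",                       # no overlap
]

# Only passages sharing a word with the question remain candidates.
candidates = [p for p in passages if shares_keyword(question, p)]
print(candidates)
```

A passage that answers the question using entirely different vocabulary would be filtered out here, which is the trade-off the constraint describes.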

For this example, we will download the 2024 National Defense Authorization Act (NDAA) report and ingest it.

In [None]:
# | notest

U.download('https://www.congress.gov/118/crpt/hrpt125/CRPT-118hrpt125.pdf', '/tmp/ndaa/ndaa.pdf', verify=True)

[██████████████████████████████████████████████████]

In [None]:
# | notest
llm.ingest("/tmp/ndaa/")

Creating new vectorstore at /tmp/tmpmnt6g6l8/dense
Loading documents from /tmp/ndaa/


Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00,  1.62it/s]
Processing and chunking 672 new documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.22it/s]


Split into 5202 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:17<00:00,  2.95s/it]

Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods





### STEP 3: Asking Questions to Your Documents

In [None]:
# | notest

result = llm.ask("What is said about artificial intelligence training and education?")


The context provided discusses the implementation of an AI education strategy required by Section 256 of the National Defense Authorization Act for Fiscal Year 2020. The strategy aims to educate servicemembers in relevant occupational fields, with a focus on data literacy across a broader population within the Department of Defense. The committee encourages the Air Force and Space Force to leverage government-owned training platforms informed by private sector expertise to accelerate learning and career path development. Additionally, the committee suggests expanding existing mobile enabled platforms to train and develop the cyber workforce of the Air Force and Space Force. Overall, there is a recognition that AI continues to be central to warfighting and that proper implementation of these new technologies requires a focus on education and training.

The answer is stored in `result['answer']`. The documents retrieved from the vector store and used to generate the answer are stored in `result['source_documents']`.

In [None]:
# | notest

print('ANSWER:')
print("\n".join(wrap(result['answer'])))
print()
print()
print('REFERENCES')
print()
for d in result['source_documents']:
    print(f"On Page {d.metadata['page']} in {d.metadata['source']}:")
    print(d.page_content)
    print('----------------------------------------')
    print()

ANSWER:
 The context provided discusses the implementation of an AI education
strategy required by Section 256 of the National Defense Authorization
Act for Fiscal Year 2020. The strategy aims to educate servicemembers
in relevant occupational fields, with a focus on data literacy across
a broader population within the Department of Defense. The committee
encourages the Air Force and Space Force to leverage government-owned
training platforms informed by private sector expertise to accelerate
learning and career path development. Additionally, the committee
suggests expanding existing mobile enabled platforms to train and
develop the cyber workforce of the Air Force and Space Force. Overall,
there is a recognition that AI continues to be central to warfighting
and that proper implementation of these new technologies requires a
focus on education and training.


REFERENCES

On Page 359 in /tmp/ndaa/ndaa.pdf:
‘‘servicemembers in relevant occupational fields on matters relating 
to artifi
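
When many sources are returned, it can help to condense them into a list of cited pages per file. The sketch below assumes only that each source document exposes a `metadata` dictionary with `source` and `page` keys, as used in the loop above. The `SimpleNamespace` objects are hypothetical stand-ins for retrieved documents, and the page numbers are illustrative rather than actual results.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for result['source_documents']; values are illustrative.
source_documents = [
    SimpleNamespace(metadata={"source": "/tmp/ndaa/ndaa.pdf", "page": 359}),
    SimpleNamespace(metadata={"source": "/tmp/ndaa/ndaa.pdf", "page": 120}),
    SimpleNamespace(metadata={"source": "/tmp/ndaa/ndaa.pdf", "page": 359}),
]


def cited_pages(docs) -> dict:
    """Map each source file to the sorted, de-duplicated pages it was cited on."""
    pages = {}
    for d in docs:
        pages.setdefault(d.metadata["source"], set()).add(d.metadata["page"])
    return {src: sorted(found) for src, found in pages.items()}


citations = cited_pages(source_documents)
print(citations)
```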

## Advanced Example: NSF Awards

The example above used the default dense vector store, Chroma.  By supplying `store_type="sparse"` to `LLM`, a sparse vector store (i.e., keyword search engine) is used instead.  Sparse vector stores index documents faster but require keyword matches between the question or query and the sources containing answers.  Semantic search is still supported through on-demand dense vectorization in OnPrem.LLM.

In this example, we will instantiate a sparse store directly and customize the ingestion process to include custom fields using a dataset of 2024 NSF Awards.

### STEP 1: Download and Pre-Process the NSF Data

NSF awards data are stored as thousands of JSON files. The code below downloads and parses each JSON file.

In [None]:
# | notest

import os
import zipfile
import requests
import json
from pathlib import Path
from tqdm.notebook import tqdm

# Step 1: Download the ZIP file
url = "https://www.nsf.gov/awardsearch/download?DownloadFileName=2024&All=true&isJson=true"
zip_path = "/tmp/nsf_awards_2024.zip"

if not os.path.exists(zip_path):
    print("Downloading NSF data...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    print("Download complete.")
else:
    print("ZIP file already exists.")

# Step 2: Unzip the file
extract_dir = "nsf_awards_2024"

if not os.path.exists(extract_dir):
    print("Extracting ZIP file...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    print("Extraction complete.")
else:
    print("Already extracted.")

# Step 3: Function to extract fields from JSON
def extract_fields(data):
    title = data.get("awd_titl_txt", "N/A")
    abstract = data.get("awd_abstract_narration", "N/A")
    
    pgm_ele = data.get("pgm_ele")
    if isinstance(pgm_ele, list) and pgm_ele:
        category = pgm_ele[0].get("pgm_ele_name", "N/A")
    else:
        category = "N/A"

    # Authors
    authors = []
    for pi in data.get("pi", []):
        full_name = pi.get("pi_full_name", "")
        if full_name:
            authors.append(full_name)
    authors_str = ", ".join(authors) if authors else "N/A"

    # Affiliation
    affiliation = data.get("inst", {}).get("inst_name", "N/A")

    # Amount
    raw_amount = data.get("awd_amount", data.get("tot_intn_awd_amt", None))
    try:
        amount = float(raw_amount)
    except (TypeError, ValueError):
        amount = None

    return {
        "title": title or '',
        "abstract": f'{title or ""}' + '\n\n' + f'{abstract or ""}',
        "category": category,
        "authors": authors_str,
        "affiliation": affiliation,
        "amount": amount
    }

# Step 4: Process all JSON files and collect the results in a list of dictionaries
output_dir = "/tmp/nsf_text_output"
os.makedirs(output_dir, exist_ok=True)

json_files = list(Path(extract_dir).glob("*.json"))

print(f"Processing {len(json_files)} JSON files...")

nsf_data = []
for json_file in tqdm(json_files):
    with open(json_file, 'r', encoding='utf-8') as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError:
            continue  # skip bad files

    fields = extract_fields(data)
    fields['source'] = str(json_file)
    nsf_data.append(fields)

print("All JSON files processed and saved to list of dictionaries.")


ZIP file already exists.
Already extracted.
Processing 11687 JSON files...


  0%|          | 0/11687 [00:00<?, ?it/s]

All JSON files processed and saved to list of dictionaries.
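
The amount handling inside `extract_fields` is worth isolating, since it determines which records end up with `amount = None`. The sketch below repeats that logic in a standalone helper, hypothetically named `parse_amount`, and runs it on small made-up records. It mirrors the code above and makes no further assumptions about the NSF schema.

```python
def parse_amount(record: dict):
    """Same logic as in extract_fields: prefer awd_amount, fall back to tot_intn_awd_amt."""
    raw = record.get("awd_amount", record.get("tot_intn_awd_amt"))
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Missing (None) or non-numeric values yield None.
        return None


# Made-up records for illustration only.
print(parse_amount({"awd_amount": "150000"}))      # numeric string is converted
print(parse_amount({"tot_intn_awd_amt": 5000}))    # fallback field is used
print(parse_amount({"awd_amount": "n/a"}))         # non-numeric value
print(parse_amount({"awd_amount": None, "tot_intn_awd_amt": 10}))  # key present but None
```

Note the last case: because `dict.get` only falls back when the key is absent, a record whose `awd_amount` is present but `None` does not use `tot_intn_awd_amt` and ends up with no amount.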


### STEP 2: Ingest Documents

Let's now store these NSF awards data in a Whoosh-backed sparse vector store.  This is equivalent to supplying `store_type="sparse"` to `LLM`. However, we will explicitly create the SparseStore instance to customize the ingestion process for NSF data.

Since award abstracts are not lengthy, we will forgo chunking the document (e.g., using either `onprem.ingest.chunk_documents` or another chunking tool like [chonkie](https://github.com/chonkie-inc/chonkie)) and instead store each award as a single record in the index.
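
For longer documents where chunking is needed, the general idea is to split text into fixed-size, slightly overlapping windows. The following is a minimal character-based sketch of that idea, not the implementation of `onprem.ingest.chunk_documents` or chonkie. The function name `chunk_text` is hypothetical, the 500-character size echoes the ingestion log in the first example, and the 50-character overlap is an assumption.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is never pure overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_text("a" * 1200)
print([len(c) for c in chunks])
```

Production chunkers usually split on sentence or paragraph boundaries rather than raw character offsets, so treat this only as an illustration of the windowing.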

In [None]:
# | notest

from onprem.ingest import VectorStoreFactory, helpers, chunk_documents
store = VectorStoreFactory.create(
    kind='whoosh',
    persist_location='/tmp/nsf_store'
)

In [None]:
# | notest

docs = []
for d in nsf_data:
    doc = helpers.doc_from_dict(d, content_field='abstract')
    docs.append(doc)
store.add_documents(docs)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 11687/11687 [00:10<00:00, 1119.63it/s]


Let's examine the total number of awards stored.

In [None]:
# | notest
store.get_size()

11687

### STEP 3: Explore NSF Awards

We can explore NSF awards by either using an LLM or querying the vector store directly.

The NSF buckets awards into different categories.  Let's examine all the material-related categories.

In [None]:
# | notest

set([d['category'] for d in store.search('category:*material*', limit=100)['hits']])

{'BIOMATERIALS PROGRAM',
 'ELECTRONIC/PHOTONIC MATERIALS',
 'Mechanics of Materials and Str',
 'SOLID STATE & MATERIALS CHEMIS'}

Let's see how many of the material-related awards mention AI.

One of the advantages of sparse vector stores is the ability to easily use complex boolean queries to target specific documents.

In [None]:
# | notest

store.search('("machine learning" OR "artificial intelligence") AND category:*material*', limit=100)['total_hits']

15
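
Boolean queries like the one above can also be assembled programmatically, which is convenient when the list of phrases changes. This sketch only builds the query string. The helper name `build_query` is hypothetical, and the syntax it emits simply reproduces the query shown above.

```python
def build_query(phrases, field=None, pattern=None) -> str:
    """OR together quoted phrases and optionally AND a field:pattern restriction."""
    quoted = " OR ".join(f'"{p}"' for p in phrases)
    query = f"({quoted})"
    if field and pattern:
        query += f" AND {field}:{pattern}"
    return query


query = build_query(
    ["machine learning", "artificial intelligence"],
    field="category",
    pattern="*material*",
)
print(query)
```

The resulting string can be passed to `store.search` in the same way as the hand-written query.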

We will now use an LLM to summarize how AI is utilized in this research.

Since NSF awards data are publicly available, we will use OpenAI's GPT-4o-mini, a cloud LLM.

In [None]:
# | notest

from onprem import LLM
llm = LLM('openai/gpt-4o-mini')
llm.load_vectorstore(custom_vectorstore=store)
result = llm.ask('How are artificial intelligence and machine learning used in these research projects?', 
                 limit=16,
                 where_document='("machine learning" OR "artificial intelligence") AND category:*material*')

Artificial intelligence (AI) and machine learning (ML) are utilized in various research projects described in the provided context in several ways:

1. **Data-Driven Approaches**: Many projects leverage AI techniques to analyze complex datasets and identify patterns that are not easily discernible through traditional methods. For example, in the project on engineered photonic materials, AI is used to develop new materials with tailored properties by consolidating information on material compositions and geometries.

2. **Model Development and Prediction**: AI and ML are employed to create predictive models that can simulate the behavior of materials under different conditions. The project on recycled polymers utilizes AI to predict deformation and failure mechanisms in recyclates, enhancing their mechanical performance.

3. **Optimization**: Machine learning algorithms are used for optimizing the design and synthesis processes of materials. In the project focused on luminescent biomate

Awards used to answer the question are shown below.

In [None]:
# | notest

for d in result['source_documents']:
    print(d.metadata['title'])

Conference: Uncertainty Quantification for Machine Learning Integrated Physics Modeling (UQ-MLIP 2024); Arlington, Virginia; 12-14 August 2024
Collaborative Research: DMREF: Accelerating the Design and Development of Engineered Photonic Materials based on a Data-Driven Deep Learning Approach
Collaborative Research: DMREF: Accelerating the Design and Development of Engineered Photonic Materials based on a Data-Driven Deep Learning Approach
EAGER: Generative AI for Learning Emergent Complexity in  Mechanics-driven Coupled Physics Problems
CAREER: Investigating the Role of Microstructure in the High Strain Rate Behavior of Stable Nanocrystalline Alloys
Conference: 10th International Conference on Spectroscopic Ellipsometry
CAREER: Informed Testing — From Full-Field Characterization of Mechanically Graded Soft Materials to Student Equity in the Classroom
2024 Solid State Chemistry Gordon Research Conference and Gordon Research Seminar
Designing Luminescent Biomaterials from First Principle


## Additional Tips

The `LLM.ask` and `LLM.ingest` methods include many options for more complex scenarios.

#### LLM.ingest options

- If supplying `infer_table_structure=True` to `LLM.ingest`, the `LLM.ask` method will automatically consider tables within PDFs when answering questions. This behavior can be controlled with the `table_k` and `table_score_threshold` parameters in `LLM.ask`.
- If supplying `extract_document_titles=True` to `LLM.ingest`, the title of each document will be inferred and added to each document chunk for potentially better retrieval.
- If supplying `caption_tables=True`, an LLM-generated caption will be added to every extracted table for potentially better table retrieval.
- Increasing chunk size of sources for more answer context

#### LLM.ask options

- If supplying `selfask=True` as an argument, a [Self-Ask prompting strategy](https://learnprompting.org/docs/advanced/few_shot/self_ask) will be used to decompose the question into subquestions.
- Adjusting prompts for QA with `prompt_template` argument to `LLM.ask`
- Increasing number of sources to consider (`k` parameter to `LLM.ask`)
- Filtering sources with `filters` and `where_document`
- Adding a score threshold for sources