# RAG
In this section we get our first glimpse of retrieval-augmented generation (RAG). We start by figuring out how to handle external documents.

As with data storage, we have many, _**many**_ options for processing external documents. In this section we will make use of LlamaIndex.

In [1]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SentenceSplitter

In [29]:
text_parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=128,
)

loader = PyMuPDFReader()
documents = loader.load(file_path="data/paper.pdf")

In [30]:
documents

[Document(id_='fc0959d0-f777-4a98-875a-5e91d55de278', embedding=None, metadata={'total_pages': 7, 'file_path': 'data/paper.pdf', 'source': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='A P P L I ED S C I E N C E S A N D EN G I N E E R I N G\nCopyright © 2019\nThe Authors, some\nrights reserved;\nexclusive licensee\nAmerican Association\nfor the Advancement\nof Science. No claim to\noriginalU.S.Government\nWorks. Distributed\nunder a Creative\nCommons Attribution\nNonCommercial\nLicense 4.0 (CC BY-NC).\nAvalanches and criticality in self-organized\nnanoscale networks\nJ. B. Mallinson, S. Shirai, S. K. Acharya, S. K. Bose, E. Galli, S. A. Brown*\nCurrent efforts to achieve neuromorphic computation are focused on highly organized architectures, such as\nintegrated circuits and regular arrays of memristors, which lack the complex interconnectivity of the brain and\nso are unable to exhibit brain-like dynamics. New architectures are required, 

We have 7 items in `documents`, one for each page of the PDF.

In [31]:
print(documents[0].text[:1000] + "...")

A P P L I ED S C I E N C E S A N D EN G I N E E R I N G
Copyright © 2019
The Authors, some
rights reserved;
exclusive licensee
American Association
for the Advancement
of Science. No claim to
originalU.S.Government
Works. Distributed
under a Creative
Commons Attribution
NonCommercial
License 4.0 (CC BY-NC).
Avalanches and criticality in self-organized
nanoscale networks
J. B. Mallinson, S. Shirai, S. K. Acharya, S. K. Bose, E. Galli, S. A. Brown*
Current efforts to achieve neuromorphic computation are focused on highly organized architectures, such as
integrated circuits and regular arrays of memristors, which lack the complex interconnectivity of the brain and
so are unable to exhibit brain-like dynamics. New architectures are required, both to emulate the complexity of
the brain and to achieve critical dynamics and consequent maximal computational performance. We show here
that electrical signals from self-organized networks of nanoparticles exhibit brain-like spatiotemporal correla-

## Extracting images
It is probably handy to have the images extracted from the PDF. This is not always easy to do, but for this paper we can use PyMuPDF to extract them. Objects in a PDF are identified by an `xref` (cross-reference) number.

If you know this number, you can extract the image. But how do you find the `xref` number? One method is to use PyMuPDF's image extraction functions. We could loop through all `xref`s and try to extract an image from each one; if that fails, the object is not an image. PyMuPDF will do most of this for us.
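
Taken literally, the "loop through all `xref`s" idea could look like the sketch below. It is only an illustration, not what the next cell does: the function name `find_image_xrefs` is made up here, and it assumes `doc` is an open PyMuPDF document exposing `xref_length()` and `xref_get_key()`. Instead of trying an extraction and catching the failure, it checks each object's `/Subtype` entry.

```python
def find_image_xrefs(doc) -> list[int]:
    """Return the xref numbers of all objects whose /Subtype is /Image.

    `doc` is assumed to be an open PyMuPDF document (e.g. fitz.open("data/paper.pdf")).
    """
    image_xrefs = []
    # xref 0 is reserved in a PDF, so real objects start at 1
    for xref in range(1, doc.xref_length()):
        # xref_get_key returns a (type, value) pair, e.g. ("name", "/Image")
        _, value = doc.xref_get_key(xref, "Subtype")
        if value == "/Image":
            image_xrefs.append(xref)
    return image_xrefs
```

Note that a scan like this can also pick up objects that are not standalone figures, such as transparency masks, which is one reason to prefer the per-page listing used in the cell below.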

In [19]:
import os
import fitz
from tqdm import tqdm

workdir = "data"

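def extract_images(workdir):
    """Save every image embedded in the PDFs in `workdir` as a PNG."""
    for each_path in os.listdir(workdir):
        # Match on the file extension only, not anywhere in the file name
        if each_path.lower().endswith(".pdf"):
            doc = fitz.Document(os.path.join(workdir, each_path))

            for i in tqdm(range(len(doc)), desc="pages"):
                for img in tqdm(doc.get_page_images(i), desc="page_images"):
                    xref = img[0]  # the first item of each entry is the image's xref
                    pix = fitz.Pixmap(doc, xref)
                    # PNG cannot store CMYK data, so convert such images to RGB first
                    if pix.n - pix.alpha > 3:
                        pix = fitz.Pixmap(fitz.csRGB, pix)
                    pix.save(os.path.join(workdir, f"{each_path[:-4]}_p{i}-{xref}.png"))
    print("Done!")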

extract_images(workdir)

page_images: 100%|██████████| 1/1 [00:00<00:00,  5.81it/s]
page_images: 100%|██████████| 1/1 [00:00<00:00,  9.87it/s]
page_images: 100%|██████████| 1/1 [00:00<00:00, 11.09it/s]
page_images: 100%|██████████| 1/1 [00:00<00:00, 12.00it/s]
page_images: 0it [00:00, ?it/s]
page_images: 0it [00:00, ?it/s]
page_images: 0it [00:00, ?it/s]
pages: 100%|██████████| 7/7 [00:00<00:00, 15.05it/s]

Done!





If we inspect one of these files, we can see that the image was extracted correctly.

In some cases, direct extraction might not be possible. Another method is to convert the PDF pages to images. We can then pass the page images to a vision LLM and ask it to extract the figures.
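
A minimal sketch of the page-to-image step is shown below, assuming PyMuPDF is installed. The helper names `page_image_path` and `render_pages`, the output folder `data/pages`, and the 150 dpi setting are all choices made for this example, not part of the notebook. Passing the resulting files to a vision LLM is not shown.

```python
from pathlib import Path


def page_image_path(pdf_path: str, page_number: int, out_dir: str) -> Path:
    """Build the output file name for one rendered page (hypothetical naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_page{page_number}.png"


def render_pages(pdf_path: str, out_dir: str = "data/pages", dpi: int = 150) -> list[Path]:
    """Render every page of a PDF to a PNG file and return the saved paths."""
    import fitz  # PyMuPDF; imported here so the naming helper works without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Rasterize the whole page, including text and vector graphics
            pix = page.get_pixmap(dpi=dpi)
            target = page_image_path(pdf_path, page.number, out_dir)
            pix.save(str(target))
            saved.append(target)
    return saved
```

Calling `render_pages("data/paper.pdf")` would write one PNG per page. Unlike the `xref` approach, this captures figures drawn with vector graphics too, at the cost of getting whole pages rather than individual images.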

## Creating a vector database

Now that we have our documents, we can create a vector database. We will use Chroma as before.

First, we use the `text_parser` we created earlier to split each page into chunks. For every chunk we also record the index of the page it came from.

In [32]:

text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

In [37]:
len(text_chunks[-1])

2706
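
Note that `len` here counts characters, whereas `SentenceSplitter` measures `chunk_size` and `chunk_overlap` in tokens, so a 2706-character chunk does not contradict the 1024 limit. To make the two parameters concrete, here is a toy sliding-window splitter. It is a simplified illustration only: the function `sliding_window_chunks` is made up for this example, it works on a plain list of words, and it ignores the sentence boundaries that `SentenceSplitter` tries to respect.

```python
def sliding_window_chunks(words: list, chunk_size: int, chunk_overlap: int) -> list[list]:
    """Split `words` into windows of `chunk_size` items sharing `chunk_overlap` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = sliding_window_chunks("a b c d e".split(), chunk_size=3, chunk_overlap=1)
print(chunks)
```

Each window repeats the last `chunk_overlap` items of the previous one, so text near a chunk boundary appears in both neighbouring chunks.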

In [41]:
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

import dotenv
import os
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [42]:
class DocumentDB:
    def __init__(self, name: str, model_name: str = "text-embedding-3-small"):
        self.model_name = model_name
        self.client = chromadb.PersistentClient(path="./")
        self.embedding_function = OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name=model_name)
        self.chat_db = self.client.create_collection(name=name, embedding_function=self.embedding_function, metadata={"hnsw:space": "cosine"})
        self.id_counter = 0


    def add_chunks_to_db(self, chunks: list[str], doc_idxs: list[int]):
        """Add text chunks to the database.

        Args:
            chunks (list[str]): List of text chunks.
            doc_idxs (list[int]): List of corresponding document indices.
        """
        self.chat_db.add(
            documents=chunks,
            metadatas=[{"doc_idx": idx} for idx in doc_idxs],
            ids=[f"chunk_{self.id_counter + i}" for i in range(len(chunks))]
        )
        self.id_counter += len(chunks)


    def get_all_entries(self) -> dict:
        """Grab all of the entries in the database.

        Returns:
            dict: All entries in the database.
        """
        return self.chat_db.get()
    

    def clear_db(self, reinitialize: bool = True):
        """Delete the collection and, optionally, recreate it empty.

        Args:
            reinitialize (bool, optional): Whether to recreate an empty collection
                with the same name after deleting it. Defaults to True.
        """
        self.client.delete_collection(self.chat_db.name)
        # re-initialize the database
        if reinitialize:
            self.__init__(self.chat_db.name, self.model_name)


    def query_db(self, query_text: str, n_results: int = 2) -> dict:
        """Given some query text, return the n_results most similar entries in the database.

        Args:
            query_text (str): The text to query the database with.
            n_results (int): The number of results to return.

        Returns:
            dict: The most similar entries in the database.
        """
        return self.chat_db.query(query_texts=[query_text], n_results=n_results)

In [43]:
doc_db = DocumentDB("paper_chunks")
doc_db.add_chunks_to_db(chunks=text_chunks, doc_idxs=doc_idxs)

In [50]:
sample_query = "Abstract"
results = doc_db.query_db(sample_query, n_results=3)
print(f"Sample query results for '{sample_query}':")
results

Sample query results for 'Abstract':


{'ids': [['chunk_1', 'chunk_0', 'chunk_14']],
 'distances': [[0.7424637365628772, 0.757231765145284, 0.7631574948636832]],
 'metadatas': [[{'doc_idx': 0}, {'doc_idx': 0}, {'doc_idx': 6}]],
 'embeddings': None,
 'documents': [['Bottom: The same network schematic presented so as to show the\nconducting pathways (black) that result from atomic filament formation within the gaps between groups. (B) Schematic showing the conductance of a device during\ndeposition of conducting nanoparticles follows (28) a power law ~(p −pc)1.3 (30) (the critical surface coverage is pc ~68%). arb, arbitrary units. (C) A scanning electron\nmicroscope image of a percolating device. Scale bar, 200 nm. (D) Schematic diagrams showing the following: left—in the subcritical, insulating phase at low coverage,\ngroups of particles are small and well separated, so that if an atomic switch connects two groups, then there are few possibilities that this will trigger another switching\nevent; right—in the supercritical, 
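
Because the collection was created with `"hnsw:space": "cosine"`, the values under `distances` are cosine distances, that is, one minus the cosine similarity of the query and chunk embeddings. A distance of about 0.74 therefore corresponds to a similarity of only about 0.26. The pure-Python sketch below shows the calculation on toy vectors; the function name `cosine_distance` is ours, and Chroma computes this internally.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```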

This is all pretty messy. Let's first try and stuff this into an LLM and see what happens.

In [51]:
from openai import OpenAI
client = OpenAI()

Our prompt will be simple for now.

In [52]:
system_prompt = (
    # Adjacent string literals are joined with no separator, so each sentence ends with a space
    "You are a helpful academic assistant that is an expert at extracting information from academic papers. "
    "You will be given a query, and some text that corresponds to a document. "
    "You must answer the query using the information in the text. "
    "Your answer must be concise and to the point. "
    "If you are unsure of something, you say so. "
    "You will be given a score at the end, so don't miss anything important!"
)



In [53]:
results["documents"][0]

['Bottom: The same network schematic presented so as to show the\nconducting pathways (black) that result from atomic filament formation within the gaps between groups. (B) Schematic showing the conductance of a device during\ndeposition of conducting nanoparticles follows (28) a power law ~(p −pc)1.3 (30) (the critical surface coverage is pc ~68%). arb, arbitrary units. (C) A scanning electron\nmicroscope image of a percolating device. Scale bar, 200 nm. (D) Schematic diagrams showing the following: left—in the subcritical, insulating phase at low coverage,\ngroups of particles are small and well separated, so that if an atomic switch connects two groups, then there are few possibilities that this will trigger another switching\nevent; right—in the supercritical, conducting phase at higher coverages, highly connected pathways across the network mean that when an atomic filament bridges a\ntunnel gap, an avalanche can propagate only to a few nearby tunnel gaps; center—in the critical p

In [54]:
def combined_context(documents: list[str], scores: list[float]) -> str:
    string = ""
    for document, score in zip(documents, scores):
        string += f"{document}\nCosine distance: {score:.2f}\n{'-'*10}\n"
    return string

def get_context(user_input: str, n_results: int = 2, doc_db: DocumentDB = doc_db) -> str:
    # Pass n_results through instead of hard-coding it
    results = doc_db.query_db(user_input, n_results=n_results)
    context = combined_context(results["documents"][0], results["distances"][0])
    if not context:
        context = "No relevant document chunks found."
    return context

query = "What are the main findings of this paper?"
context = get_context(query)
print(context)

Measurements over long time
periods are also necessary to avoid significant cutoffs in the power law
distributions (17). Here, we presented data from DC stimulus of four
devices, but the data presented are consistent with that obtained from
DC, pulsed, and ramped voltage stimulus of a further 10 devices.
Our electrical measurements were performed using two distinct
sets of measurement electronics to allow measurement of the device
conductance on two distinct time scales. The first method relies on a
picoammeter and is limited to a relatively slow sampling rate (0.1 s
sampling interval). The second method uses a fast digital oscilloscope
to allow a much higher sampling rate (200 ms sampling interval for the
data presented here). As shown in Figs. 2 to 4, both methods resulted
in qualitatively and quantitatively similar data, with similar power law
exponents for each of the main quantities of interest. Hence, our
results and conclusions are not influenced by the sampling rate.
Data analy

In [58]:
user_prompt = (
    f"Query: {query}\n\n"
    f"Context: {context}"
)

user_prompt

'Query: What are the main findings of this paper?\n\nContext: Measurements over long time\nperiods are also necessary to avoid significant cutoffs in the power law\ndistributions (17). Here, we presented data from DC stimulus of four\ndevices, but the data presented are consistent with that obtained from\nDC, pulsed, and ramped voltage stimulus of a further 10 devices.\nOur electrical measurements were performed using two distinct\nsets of measurement electronics to allow measurement of the device\nconductance on two distinct time scales. The first method relies on a\npicoammeter and is limited to a relatively slow sampling rate (0.1 s\nsampling interval). The second method uses a fast digital oscilloscope\nto allow a much higher sampling rate (200 ms sampling interval for the\ndata presented here). As shown in Figs. 2 to 4, both methods resulted\nin qualitatively and quantitatively similar data, with similar power law\nexponents for each of the main quantities of interest. Hence, our\

In [59]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    stream=True,
    temperature=0.0
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

The main findings of the paper include:

1. Consistency of electrical measurements across different devices and stimulus methods (DC, pulsed, and ramped voltage).
2. Validation of power law distributions in avalanche dynamics through statistical analysis, indicating that the data follows power law behavior rather than exponential distributions.
3. The choice of threshold for defining events in the conductance signal does not significantly affect the avalanche analysis.
4. The study demonstrates that correlations in experimental avalanche data are crucial, as shuffled data did not exhibit power law distributions. 

Overall, the research supports the presence of criticality in self-organized nanoscale networks.

Meh.

The answer is OK, but look at the quality of the retrieval process: the text chunks we are getting describe measurement details rather than the paper's headline results, so they are not very useful for this query!
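
One way to look at retrieval quality more closely is to print a compact summary of each query result rather than the raw dictionary. The sketch below assumes the result has the same shape as the Chroma output shown earlier (`ids`, `distances`, `metadatas` and `documents`, each a list holding one list per query). The helper `summarize_results` and the sample values are made up for illustration.

```python
def summarize_results(results: dict, preview_chars: int = 60) -> list[str]:
    """Return one line per retrieved chunk: id, source index, distance, text preview."""
    rows = []
    # Index 0 selects the results for the first (and here only) query text
    for chunk_id, distance, metadata, document in zip(
        results["ids"][0],
        results["distances"][0],
        results["metadatas"][0],
        results["documents"][0],
    ):
        # Collapse newlines and repeated spaces so the preview fits on one line
        preview = " ".join(document.split())[:preview_chars]
        rows.append(f"{chunk_id} | doc_idx {metadata['doc_idx']} | distance {distance:.2f} | {preview}")
    return rows


# Made-up sample in the same shape as a Chroma query result
sample = {
    "ids": [["chunk_1", "chunk_0"]],
    "distances": [[0.74, 0.76]],
    "metadatas": [[{"doc_idx": 0}, {"doc_idx": 0}]],
    "documents": [["Bottom: The same\nnetwork", "A P P L I E D"]],
}
for row in summarize_results(sample):
    print(row)
```

With the real database this would be called as `summarize_results(doc_db.query_db(query, n_results=5))`, which makes it easier to see at a glance which pages the chunks come from and how weak the matches are.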