# Running a basic RAG-powered LLM application using Mistral

**parse.py**: defines find_documents() and load_documents(), which locates and stores each file in a way that can be used by a RAG system.

This notebook loads all the information from files in the chosen directory into a Chroma DB collection. The files are divided according to the default parameters used by the loaders in load_documents. A query is used to extract information from the collection, and the query, context, and some extra information are combined into an LLM prompt that Mistral uses to respond.

Areas of improvement: 
1. change parameters by which documents are split upon loading (done in parse.py load_documents())
2. integrate an embedding model when documents are added to the Chroma collection (this file)
3. use a pipeline to make the retrieval multi-step or metadata-aware (this file-- but will likely require a lot of code that may end up in other files as well)


In [1]:
from parse import find_documents, load_documents
import chromadb
from chromadb.config import Settings
from langchain_community.llms import Ollama
from embedding_util import CustomEmbeddingFunction

  from .autonotebook import tqdm as notebook_tqdm


In [50]:
NUM_DOCUMENTS = 1
THRESHOLD = 1
llm = Ollama(model="llama2")
TARGET_DIR = 'SOURCE_DIRECTORY'

NOTE cell below: produces "ignoring wrote pointing object at x y (offset z)" message -- I believe this is coming from the pyPDFLoader functioncall in load_documents. Cause? Is it an issue or can we just leave it there?

In [90]:
paths, filenames = find_documents(TARGET_DIR)
documents = load_documents(paths, filenames)

len(documents)

Found: .DS_Store
Found: Love and Friendship July2022-IAPmagazine.pdf
Found: text.docx
Found: text.pdf
Found: proofs textbook.pdf
Found: Concepts in Thermal Physics Blundell.pdf
Found: programming in c (4th edition)  - stephen g. kochan(1).pdf
Found: mechanics textbook.pdf
Found: Algorithm Design and Applications[CPSC3210].pdf
Found: space-facts.docx


Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)


3944

In [92]:
client = chromadb.Client(Settings(allow_reset=True))
db = client.get_or_create_collection(
    name = 'test', embedding_function=CustomEmbeddingFunction()
)

db.add(
    ids = [str(i) for i in range(0, len(documents))],
    documents = [doc.page_content for doc in documents], 
    metadatas = [doc.metadata for doc in documents]
)

In [74]:
def create_prompt(context, question):
    str = f"""
    
    You are a helpful assistant that will use some provided context to answer the following question. Before you answer, read the context and think
    about how it relates to and answers the question. If you can't answer a question based on the context, simply state that you could not find any useful 
    information to help answer. Do not use any other information besides the provided context.

    {context}
    User:{question}

    Use this format:
    [Filepath] : 
    [information learned from source]

    Thank you!
    """

    return str

In [73]:
def print_recieved_documents(document_list):
    #print(f"LENGTH IS : {len(document_list['metadatas'][0])}")
    #print(f"LENGTH IS : {len(document_list['documents'][0])}")
    for idx, _ in enumerate(document_list['ids'][0]):
        print('************')
        print(f"Filepath: {document_list['metadatas'][0][idx]}")
        #print(f"distance: {document_list['distances'][0][idx]}")
        print(f"Content: {document_list['documents'][0][idx]}")
        print('************')

In [None]:
# code to view the prompt
#temp = db.query(query_texts='What is the mass of a proton?', n_results=3)
#print_recieved_documents(temp)

#print(create_prompt(temp, 'What is the mass of a proton?'))

In [87]:
def get_llm_response(question, show_context = True):
    results = db.query(query_texts = question, n_results = NUM_DOCUMENTS)
    if show_context:
        print_recieved_documents(results)

    relevant_docs = []

    for idx, dist in enumerate(results['distances'][0]):
        if dist < THRESHOLD:
            this_doc = [ results['metadatas'][0][idx], results['documents'][0][idx]]
            relevant_docs.append(this_doc)

    response = llm.invoke(create_prompt(relevant_docs, question))
    print(response)

In [88]:
QUERY = "Can you explain the laws of thermodynamics and how they are used?"

In [89]:
print(QUERY)
print()
print("MISTRAL WITH RAG:")
print()
get_llm_response(QUERY, show_context = False)
print()
print("MISTRAL:")
print(llm.invoke(QUERY))

Can you explain the laws of thermodynamics and how they are used?

MISTRAL WITH RAG:


[SOURCE_DIRECTORY/Concepts in Thermal Physics Blundell.pdf] :

In Chapter 11 of the provided context, the notion of a function of state, specifically internal energy, is introduced as one of the most useful concepts in thermodynamics. The chapter discusses in detail the first law of thermodynamics, which states that energy is conserved and heat is a form of energy. Expressions for the heat capacity measured at constant volume or pressure for an ideal gas are derived.

In Chapter 12, the key concept of reversibility and isothermal and adiabatic processes are introduced. The chapter provides a detailed explanation of how these concepts relate to the laws of thermodynamics and their practical applications.

Based on the provided context, it appears that the first law of thermodynamics and its related concepts are the primary focus of the text. Therefore, the information learned from the source can be us