## Notebook on learning about RAG

- Good resource: https://learnbybuilding.ai/tutorials/rag-from-scratch

### Benefits of RAG listed in the tutorial

- You can include facts in the prompt to help the LLM avoid hallucinations.
- You can (manually) refer to sources of truth when responding to a user query, which helps you double-check any potential issues.
- You can leverage data that the LLM might not have been trained on.
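
These points come down to placing retrieved text in the prompt. The following is a minimal sketch of that step; the `build_prompt` helper and the prompt wording are illustrative assumptions, not something taken from the tutorial.

```python
def build_prompt(query, retrieved_docs):
    """Assemble a prompt that grounds the answer in retrieved text.

    Hypothetical helper: the name and the wording are illustrative only.
    """
    # Number each document so the sources can be checked by hand afterwards.
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is a relaxing outdoor activity?",
    ["Take a leisurely walk in the park and enjoy the fresh air."],
)
print(prompt)
```

Numbering the context entries keeps the manual source check from the second bullet practical: an answer can be traced back to entry `[1]`, `[2]`, and so on.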

### The High-Level Components of Our RAG System
- A collection of documents (formally called a corpus)
- An input from the user
- A similarity measure between the collection of documents and the user input

In [1]:
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

In [2]:
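def jaccard_similarity(query, document):
    # Lowercase and split on single spaces; punctuation stays attached to words.
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    # Jaccard similarity = size of the shared word set / size of the combined word set.
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)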

In [3]:
def return_response(query, corpus):
    # Score every document against the query and return the best match.
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    return corpus[similarities.index(max(similarities))]

## Understanding the code line by line

In [4]:
query = "How should I take a walk ?"
document = corpus_of_documents[0]
query = query.lower().split(" ")
document = document.lower().split(" ")

In [5]:
intersection = set(query).intersection(set(document))
union = set(query).union(set(document))
print(intersection, "\n", union)

{'take', 'a', 'walk'} 
 {'i', 'park', 'walk', 'in', 'enjoy', 'how', 'fresh', 'take', 'should', 'the', 'air.', 'leisurely', 'and', 'a', '?'}
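
Plugging the two printed sets into the Jaccard formula gives the score for this document. With $Q$ the query word set and $D$ the document word set (symbols introduced here for illustration), the intersection has 3 elements and the union has 15:

$$
J(Q, D) = \frac{|Q \cap D|}{|Q \cup D|} = \frac{3}{15} = 0.2
$$

Note that `'air.'` and `'?'` count as words because the text is split on spaces only, which enlarges the union and lowers the score.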


In [6]:
query = "How should I take a walk ?"
corpus = corpus_of_documents
similarities = []
for doc in corpus:
    similarity = jaccard_similarity(query, doc)
    similarities.append(similarity)

In [7]:
max(similarities)

0.2

In [8]:
similarities.index(max(similarities))

0
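
Taking `max` returns only the single best document. A small sketch of ranking the whole corpus and keeping the best few is shown here; `top_k` is a hypothetical helper, and the three-document corpus is a subset chosen only to keep the example short.

```python
def jaccard_similarity(query, document):
    # Same word-set overlap as the notebook's function, split on single spaces.
    query_words = set(query.lower().split(" "))
    document_words = set(document.lower().split(" "))
    return len(query_words & document_words) / len(query_words | document_words)


def top_k(query, corpus, k=3):
    """Return the k highest-scoring (score, document) pairs, best first."""
    scored = [(jaccard_similarity(query, doc), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


corpus = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Go for a hike and admire the natural scenery.",
]
results = top_k("How should I take a walk ?", corpus, k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Returning the scores alongside the documents makes it easy to see how far the runner-up is from the best match.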

## Implementation of an Open-Source Semantic Search Model

- Chromadb

In [10]:
import chromadb

In [11]:
import fitz # imports the pymupdf library
doc = fitz.open("2022ltr.pdf") # open a document
pdf_texts =  [page.get_text() for page in doc] # iterate the document pages
pdf_texts = [text for text in pdf_texts if len(text)>10]

In [5]:
pdf_texts[1]

'BERKSHIRE HATHAWAY INC.\nTo the Shareholders of Berkshire Hathaway Inc.:\nCharlie Munger, my long-time partner, and I have the job of managing the savings of a\ngreat number of individuals. We are grateful for their enduring trust, a relationship that often spans\nmuch of their adult lifetime. It is those dedicated savers that are forefront in my mind as I write\nthis letter.\nA common belief is that people choose to save when young, expecting thereby to maintain\ntheir living standards after retirement. Any assets that remain at death, this theory says, will usually\nbe left to their families or, possibly, to friends and philanthropy.\nOur experience has differed. We believe Berkshire’s individual holders largely to be of the\nonce-a-saver, always-a-saver variety. Though these people live well, they eventually dispense\nmost of their funds to philanthropic organizations. These, in turn, redistribute the funds by\nexpenditures intended to improve the lives of a great many people who a

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [13]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(character_split_texts[35])
print(f"\nTotal chunks: {len(character_split_texts)}")

Do the math: See’s rang up about 10 sales per minute during its prime operating time
(racking up $400,309 of volume during the two days), with all the goods purchased at a single
location selling products that haven’t been materially altered in 101 years. What worked for See’s
in the days of Henry Ford’s model T works now.
* * * * * * * * * * * *
Charlie, I, and the entire Berkshire bunch look forward to seeing you in Omaha on
May 5-6. We will have a good time and so will you.
February 25, 2023
Warren E. Buffett
Chairman of the Board
11

Total chunks: 36


In [19]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(token_split_texts[45])
print(f"\nTotal chunks: {len(token_split_texts)}")

do the math : see ’ s rang up about 10 sales per minute during its prime operating time ( racking up $ 400, 309 of volume during the two days ), with all the goods purchased at a single location selling products that haven ’ t been materially altered in 101 years. what worked for see ’ s in the days of henry ford ’ s model t works now. * * * * * * * * * * * * charlie, i, and the entire berkshire bunch look forward to seeing you in omaha on may 5 - 6. we will have a good time and so will you. february 25, 2023 warren e. buffett chairman of the board 11

Total chunks: 46


In [21]:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[45]]))

[[-0.012467734515666962, -0.05428552255034447, 0.037649743258953094, -0.02378346025943756, 0.011054451577365398, 0.02136710099875927, -0.06601178646087646, -0.002584449015557766, 0.01125031616538763, -0.026465097442269325, 0.021672572940587997, 0.08212035149335861, -0.02570822462439537, -0.05478880926966667, -0.019596856087446213, -0.026295682415366173, 0.08333292603492737, -0.06668368726968765, -0.003572107758373022, -0.06448515504598618, -0.02125905081629753, -0.0254600141197443, -0.05526052787899971, 0.026272263377904892, -0.01911891996860504, 0.009858915582299232, -0.015476008877158165, -0.0454326868057251, -0.027211172506213188, -0.06809180229902267, -0.10900609940290451, -0.0048833186738193035, -0.041496992111206055, 0.01615954376757145, 0.013104509562253952, -0.04445979371666908, 0.06840939819812775, -0.060902856290340424, 0.04529665783047676, -0.05040785297751427, 0.07744777202606201, -0.04709912836551666, -0.04672233387827873, -0.017751457169651985, 0.07489844411611557, 0.0162
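
Each chunk is now a vector of floats, so similarity can be measured geometrically instead of by word overlap. Below is a minimal sketch of cosine similarity, one common measure for embeddings. The three-dimensional vectors are made up for illustration (real embeddings are much longer), and Chroma computes its own distance internally, which may be a different metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not real model output.
query_vec = [0.1, 0.3, 0.5]
close_vec = [0.1, 0.29, 0.52]
far_vec = [-0.4, 0.2, -0.1]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```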

In [23]:
chroma_client = chromadb.PersistentClient(path="./chromadb")
chroma_collection = chroma_client.get_or_create_collection("Berkshire_Annual_Report", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

46

In [15]:
chroma_client = chromadb.PersistentClient(path="./chromadb")
chroma_client.list_collections()

[Collection(name=Berkshire_Annual_Report)]

In [13]:
chroma_client = chromadb.PersistentClient(path="./chromadb")
chroma_collection = chroma_client.get_collection("Berkshire_Annual_Report")

In [17]:
query = "What is your focus area?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(document)
    print('\n')

.................................................................. 15. 8 28. 7 2004......................................................................... 4. 3 10. 9 2005......................................................................... 0. 8 4. 9


• patience can be learned. having a long attention span and the ability to concentrate on one thing for a long time is a huge advantage. • you can learn a lot from dead people. read of the deceased you admire and detest. • don ’ t bail away in a sinking boat if you can swim to one that is seaworthy. • a great company keeps working after you are not ; a mediocre company won ’ t do that. • warren and i don ’ t focus on the froth of the market. we seek out good long - term investments and stubbornly hold them for a long time. • ben graham said, “ day to day, the stock market is a voting machine ; in the long term it ’ s a weighing machine. ” if you keep making something more valuable, then some wise person is going to notice it and sta
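
The first hit above is a run of dots and digits that appears to be table residue rather than prose. One possible way to screen such chunks before indexing or after retrieval is sketched below. The `looks_like_prose` helper and its 0.6 threshold are assumptions for illustration, not part of Chroma or the tutorial, and the sample strings are shortened versions of the printed results.

```python
def looks_like_prose(text, min_alpha_ratio=0.6):
    """Heuristic: keep chunks where most non-space characters are letters.

    Hypothetical helper; the threshold is an arbitrary starting point.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_ratio


chunks = [
    "........ 15. 8 28. 7 2004........ 4. 3 10. 9 2005",
    "we seek out good long - term investments and stubbornly hold them for a long time.",
]
kept = [chunk for chunk in chunks if looks_like_prose(chunk)]
print(kept)
```

A filter like this is crude: it would also drop legitimate numeric passages, so the threshold needs checking against the actual document.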

## Using a local quantized LLM to fit into GPU