## Notebook on learning about RAG

- Good resource: https://learnbybuilding.ai/tutorials/rag-from-scratch

### Benefits of RAG listed in the tutorial

- You can include facts in the prompt to help the LLM avoid hallucinations.
- You can (manually) refer to sources of truth when responding to a user query, which helps you double-check potential issues.
- You can leverage data that the LLM might not have been trained on.

### The high-level components of our RAG system

- A collection of documents (formally called a corpus)
- An input from the user
- A similarity measure between the collection of documents and the user input

In [1]:
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

In [2]:
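def jaccard_similarity(query, document):
    # Tokenize both strings: lowercase, then split on single spaces.
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    # Jaccard similarity = shared tokens / all distinct tokens.
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)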
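
For reference, the function above computes the standard Jaccard index over two token sets. Writing $A$ for the set of query tokens and $B$ for the set of document tokens (the symbols $A$, $B$, and $J$ are introduced here for illustration and do not appear in the tutorial code), the score is:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

The score is 0 when the two sets share no tokens and 1 when they are identical.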

In [3]:
def return_response(query, corpus):
    # Score every document against the query, then return the best match.
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    return corpus[similarities.index(max(similarities))]
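
A minimal usage sketch of the retrieval step, assuming the function is meant to use its own `query` and `corpus` parameters. The block redefines both functions so it runs on its own; `sample_corpus` and the query string are made up for illustration, with the documents taken from the corpus above.

```python
def jaccard_similarity(query, document):
    # Same logic as the notebook cell: lowercase, split on spaces, compare sets.
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)


def return_response(query, corpus):
    # Score each document, then return the one with the highest score.
    similarities = [jaccard_similarity(query, doc) for doc in corpus]
    return corpus[similarities.index(max(similarities))]


# Hypothetical three-document subset of the corpus, used only for this example.
sample_corpus = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Go for a hike and admire the natural scenery.",
]

best_match = return_response("I like to hike", sample_corpus)
print(best_match)  # Go for a hike and admire the natural scenery.
```

Only the third document shares a token ("hike") with the query, so it is the only one with a non-zero score. When several documents tie, `list.index` returns the first of them.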

## Understanding the code line by line

In [4]:
query = "How should I take a walk ?"
document = corpus_of_documents[0]
query = query.lower().split(" ")
document = document.lower().split(" ")

In [5]:
intersection = set(query).intersection(set(document))
union = set(query).union(set(document))
print(intersection, "\n", union)

{'walk', 'a', 'take'} 
 {'park', '?', 'the', 'should', 'leisurely', 'air.', 'a', 'and', 'walk', 'how', 'take', 'in', 'enjoy', 'fresh', 'i'}
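
The two sets printed above determine the score directly: the size of the intersection divided by the size of the union. A small sketch of that arithmetic, using variable names (`query_tokens`, `document_tokens`, `shared`, `combined`) that are chosen here for readability and are not part of the tutorial:

```python
# Rebuild the two token sets exactly as the notebook does.
query_tokens = set("How should I take a walk ?".lower().split(" "))
document_tokens = set(
    "Take a leisurely walk in the park and enjoy the fresh air.".lower().split(" ")
)

shared = query_tokens & document_tokens    # tokens present in both sets
combined = query_tokens | document_tokens  # all distinct tokens

score = len(shared) / len(combined)
print(len(shared), len(combined), score)  # 3 15 0.2
```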


In [9]:
query = "How should I take a walk ?"
corpus = corpus_of_documents
similarities = []
for doc in corpus:
    similarity = jaccard_similarity(query, doc)
    similarities.append(similarity)

In [12]:
max(similarities)

0.2

In [13]:
similarities.index(max(similarities))

0
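
One thing the printed union shows is that splitting on spaces keeps punctuation attached, which is why `'?'` and `'air.'` appear as tokens. The sketch below is one possible refinement, not something the tutorial does: it extracts word characters with a regular expression before comparing. The helper names `tokenize` and `jaccard_similarity_normalized` are hypothetical.

```python
import re


def tokenize(text):
    # Keep runs of letters, digits, and apostrophes; drop other punctuation.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity_normalized(query, document):
    query_tokens = tokenize(query)
    document_tokens = tokenize(document)
    union = query_tokens | document_tokens
    if not union:
        # Both inputs had no word tokens; treat them as having no similarity.
        return 0.0
    return len(query_tokens & document_tokens) / len(union)


score = jaccard_similarity_normalized(
    "How should I take a walk ?",
    "Take a leisurely walk in the park and enjoy the fresh air.",
)
print(score)
```

With punctuation removed, the lone `?` no longer counts as a token, so the union shrinks from 15 to 14 tokens and the score for the first document rises slightly above 0.2.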