# Retrieval Augmented Generation (RAG) - Reworking the Hackers Guide

Let's rework the Retrieval Augmented Generation (RAG) section from the [hackers guide](https://www.youtube.com/watch?v=jkrNMKz9pWU&t=4232s).

This notebook has two main focus points:

 - Re-creating Jeremy's example on a Mac
 - Creating my own RAG application with a llama2 instance on my local machine.

Back in 2017/2018 my wife and I did a world trip, and we documented it on our blog [Wittmann-Tours.de](https://wittmann-tours.de). These 14 months were among the most exciting times of my life, but nonetheless, I am starting to forget details. Therefore, I thought it would be great to have a large language model (LLM) which could help me remember them.

This notebook is also available in a [blog post version](https://chrwittm.github.io/posts/2024-03-22-rag1-remembering-world-trip/) which contains more about the story and background information.

## Preparation

### Loading the Chat 

As shown in the [chat consumer notebook](https://github.com/chrwittm/lm-hackers/blob/main/20-local-llama-on-mac/35-notebook-chat-consumer.ipynb), let's reuse the chat developed earlier in [this blog post](https://chrwittm.github.io/posts/2024-02-23-chat-from-scratch/) / [this notebook](https://github.com/chrwittm/lm-hackers/blob/main/20-local-llama-on-mac/30-notebook-chat.ipynb).

In [1]:
import sys
sys.path.append('../notebook_chat')

from notebook_chat import ChatMessages, Llama2ChatVersion2

### Loading the Model

Loading the model only requires 2 lines of code (see below). Before we execute the cell, let's talk about the parameters:

- `n_ctx=4096`: This sets the context window to 4096 tokens, the maximum this model supports (see `llama.context_length` in the loader output below). We will need all of it when we pass blog posts as context later.
- `verbose=False`: This makes the model less talkative, so that it only prints the actual results when prompted. Try setting it to `True` to see the difference.

In [2]:
from llama_cpp import Llama
llm = Llama(model_path="../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=4096, verbose=False)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head

### Loading example dataset

We are going to work with the [Wittmann-Tours.de](https://wittmann-tours.de) blog.

The blog is available for download as a dataset in [the Wittmann-Tours repo](https://github.com/chrwittm/wittmann-tours).

In [3]:
#!wget -P ../wt-blogposts https://github.com/chrwittm/wittmann-tours/raw/main/zip/blogposts-md.zip


In [4]:
#!unzip -o ../wt-blogposts/blogposts-md.zip -d ../wt-blogposts/

As a result we have all the blog posts in a folder called `wt-blogposts`.
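
The two shell commands above can also be expressed in Python, which may be handy where `wget` or `unzip` are not installed. This is a minimal sketch using only the standard library; the function names `extract_zip` and `download_and_extract` are made up for illustration, while the URL and target folder are the ones from the cells above.

```python
import urllib.request
import zipfile
from pathlib import Path

def extract_zip(zip_path, target_dir):
    """Unpack zip_path into target_dir, overwriting existing files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target

def download_and_extract(url, target_dir):
    """Download a zip archive into target_dir and unpack it there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    zip_path = target / Path(url).name  # e.g. blogposts-md.zip
    urllib.request.urlretrieve(url, zip_path)
    return extract_zip(zip_path, target)

# Usage (requires network access, therefore commented out):
# download_and_extract(
#     "https://github.com/chrwittm/wittmann-tours/raw/main/zip/blogposts-md.zip",
#     "../wt-blogposts",
# )
```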

## Revisiting the example from the hacker's guide

Let's ask the model about Jeremy Howard:

In [5]:
chat = Llama2ChatVersion2(llm, "Answer in a very concise and accurate way")
question = "Who is Jeremy Howard?"
chat.prompt_llama2_stream(f"{question}")

<span style='font-size: 16px;'>  Jeremy Howard is a well-known American actor, musician, and YouTube personality. He was born on July 26, 1984, in San Diego, California, and gained popularity through his YouTube channel, "Jeremy Howard," where he posts vlogs, comedy skits, and other content. He has also appeared in various TV shows and movies, including "The Goldbergs" and "Shameless."</span>

The result is not what we expect in this context. While the answer differs when you run it multiple times, it tends to be something like this: "Jeremy Howard is a well-known American actor, musician, and YouTube personality...". Presumably it refers to [this Jeremy Howard](https://de.wikipedia.org/wiki/Jeremy_Howard).

To steer the model in the right direction, let's give it more context to answer the question and pass the [Wikipedia page about "our" Jeremy](https://en.wikipedia.org/wiki/Jeremy_Howard_(entrepreneur)) as well:

In [6]:
#!pip install wikipedia-api
from wikipediaapi import Wikipedia

In [7]:
wiki = Wikipedia('JeremyHowardBot/0.0', 'en')
jh_page = wiki.page('Jeremy_Howard_(entrepreneur)').text
jh_page = jh_page.split('\nReferences\n')[0]
print(jh_page[:500])

Jeremy Howard (born 13 November 1973) is an Australian data scientist, entrepreneur, and educator.He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning.
Previously he founded and led Fastmail, Optimal Decisions Group, and Enlitic. He was President and Chief Scientist of Kaggle.
Early in the COVID-19 epidemic he was a leading advocate for masking.

Early life
Howard was born in London, United Kingdom, and move


In [8]:
len(jh_page.split())

613

Let's re-phrase the prompt:

In [9]:
question_with_context = f"""Answer the question with the help of the provided context.

## Question

{question}

## Context

{jh_page}"""

print(question_with_context[:500])

Answer the question with the help of the provided context.

## Question

Who is Jeremy Howard?

## Context

Jeremy Howard (born 13 November 1973) is an Australian data scientist, entrepreneur, and educator.He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning.
Previously he founded and led Fastmail, Optimal Decisions Group, and Enlitic. He was President and Chief Scientist of Kaggle.
Early in the COVID-19 epi


In [10]:
chat = Llama2ChatVersion2(llm, "Answer in a very concise and accurate way")
chat.prompt_llama2_stream(f"{question_with_context}")


<span style='font-size: 16px;'>  Jeremy Howard is an Australian data scientist, entrepreneur, and educator who has made significant contributions to the field of deep learning and machine learning. He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning. Howard has also founded several successful startups, including FastMail and Optimal Decisions Group, and has advised various organizations on data strategy and investment opportunities.
Early Life and Education:
Jeremy Howard was born in London, UK, and moved to Melbourne, Australia, in 1976. He attended Melbourne Grammar and studied philosophy at the University of Melbourne.
Career:
Howard started his career in management consulting, working at McKinsey & Co and AT Kearney. He then transitioned into entrepreneurship, founding two successful startups: FastMail and Optimal Decisions Group. He later became involved with Kaggle, where he served as President and Chief Scientist, and helped develop the Perl programming language.
Enlitic:
In 2014, Howard founded Enlitic to use machine learning to make medical diagnostics and clinical decision support tools faster, more accurate, and more accessible. He believes that machine learning algorithms are as good as or better than humans at many things and has taught data science at Singularity University.
fast.ai:
Together with Rachel Thomas, Howard co-founded fast.ai, a research institute dedicated to making deep learning more accessible. He teaches introductory courses, both online and in-person, and has developed the ULMFiT algorithm, which pioneered transfer learning and fine-tuning techniques in natural language processing.
Personal Life and Interests:
Howard is an angel investor and mentors and advises many startups. He has also contributed to a range of open-source projects as a developer and was a regular guest expert on Australia's most popular TV morning news program Sunrise. Howard used Spaced Repetitive Learning to develop usable Chinese language skills in just one year.</span>

Much better!

### How to pick a context

With so much information out there, it is challenging to provide the right context for a model. Let's explore how we can use embeddings for this task. For more explanations, please check out my [blog post version](https://chrwittm.github.io/posts/2024-03-22-rag1-remembering-world-trip/) of this notebook.

In [11]:
#!pip install sentence_transformers

In [12]:
from sentence_transformers import SentenceTransformer

In [13]:
#emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5", device=0)
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="mps")

In [14]:
jh = jh_page.split('\n\n')[0]
print(jh)

Jeremy Howard (born 13 November 1973) is an Australian data scientist, entrepreneur, and educator.He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning.
Previously he founded and led Fastmail, Optimal Decisions Group, and Enlitic. He was President and Chief Scientist of Kaggle.
Early in the COVID-19 epidemic he was a leading advocate for masking.


In [15]:
tb_page = wiki.page('Tony_Blair').text.split('\nReferences\n')[0]

In [16]:
tb = tb_page.split('\n\n')[0]
print(tb[:380])

Sir Anthony Charles Lynton Blair  (born 6 May 1953) is a British politician who served as Prime Minister of the United Kingdom from 1997 to 2007 and Leader of the Labour Party from 1994 to 2007. He served as Leader of the Opposition from 1994 to 1997 and held various shadow cabinet posts from 1987 to 1994. Blair was Member of Parliament (MP) for Sedgefield from 1983 to 2007. He


Calling `encode` returns a tensor of activations (an embedding) for each document. If the embeddings of two documents are close to each other, the documents are similar; if not, they contain different content.

In [17]:
q_emb,jh_emb,tb_emb = emb_model.encode([question,jh,tb], convert_to_tensor=True)

In [18]:
tb_emb.shape

torch.Size([384])

In [19]:
import torch.nn.functional as F

In [20]:
F.cosine_similarity(q_emb, jh_emb, dim=0)

tensor(0.7991, device='mps:0')

In [21]:
F.cosine_similarity(q_emb, tb_emb, dim=0)

tensor(0.5381, device='mps:0')
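
To make the two numbers above less of a black box, here is a minimal sketch of what cosine similarity computes, written in plain Python rather than with `F.cosine_similarity`. The three-dimensional vectors are made-up toy values, not real embeddings (the real ones have 384 dimensions, as shown above).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, for illustration only)
question_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]    # same direction, different length
unrelated_vec = [0.0, 3.0, 0.0]  # orthogonal to the question

sim_similar = cosine_similarity(question_vec, similar_vec)
sim_unrelated = cosine_similarity(question_vec, unrelated_vec)
print(sim_similar, sim_unrelated)
```

The score depends only on the direction of the vectors, not on their length, so vectors pointing the same way score close to 1 and orthogonal ones score 0.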

## Building the Wittmann-Tours RAG-based LLM

Let's do the same thing for Wittmann Tours.

Here is a question our model cannot answer:

In [22]:
#question = "Wie hieß der Guide, der uns durch den Masoala Regenwald geführt hat?"
question = "What was the name of the guide who led us on our tour in the Masoala rain forest on Madagascar?"
chat = Llama2ChatVersion2(llm, "Answer in a very concise and accurate way")
chat.prompt_llama2_stream(f"{question}")

<span style='font-size: 16px;'>  The name of the guide who led your tour in the Masoala rainforest on Madagascar is... (drumroll) ...Rahel!</span>

Depending on the run, the model either hallucinates a name (as above) or admits that it cannot figure it out. Let's provide more context.

Above, we have downloaded the whole blog [Wittmann-Tours.de](https://wittmann-tours.de/) so that every blog post is a markdown document.

Here is the blog post about Masoala:

In [23]:
path_to_blogpost = "../wt-blogposts/drei-tage-im-masoalaregenwald/index.md"

In [24]:
with open(path_to_blogpost, 'r') as file:
    content = file.read()

print(f"The blogpost has {len(content)} characters")
print(content[:1000])

The blogpost has 18435 characters
---
title: 'Drei Tage im Masoala-Regenwald'
description: ""
published: 2019-07-14
redirect_from: 
            - https://wittmann-tours.de/drei-tage-im-masoala-regenwald/
categories: "Brookesia, Chamäleon, Lemur, Madagaskar, Madagaskar, Maki, Masoala, Regenwald, roter Vari, Taggecko, Umweltschutz, Vari, Wald, Wanderung"
hero: ./img/wp-content-uploads-2019-06-CW-20180820-105656-0464-1024x683.jpg
---
# Drei Tage im Masoala-Regenwald

Nach einer knapp 2-stündigen Bootsfahrt von [Nosy Mangabe](http://wittmann-tours.de/nosy-mangabe) aus erreichten wir unser Ziel, die Masoala Forest Lodge. Wir landeten an einem Strand und gingen kaum 200 Meter landeinwärts, wo ein paar hübsche kleine Bungalows auf uns warteten. So viel Luxus hatten wir nach der vorherigen Campingnacht kaum erwartet. Um das gute Wetter - sprich kein Regen - auszunutzen, starteten wir umgehend auf die erste Wanderung durch den Urwald.

![Auf der Masoala-Halbinsel befindet sich das größte noch z

In [25]:
def get_question_with_context(question, context):
    return  f"""Answer the question with the help of the provided context.

    ## Question

    {question}

    ## Context

    {context}"""

#get_question_with_context(question, content)

Before we can ask the question in context, it is important to realize that [the model we use](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF) has a maximum context window of 4096 tokens. Since the blog post is longer than that, I only pass its first 6000 characters. I will not solve this limitation here, because the main goal is to understand how we can provide context at all.
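
The `content[:6000]` slice used in the next cell cuts the text at an arbitrary character. As a minimal sketch of a slightly gentler variant, the hypothetical helper `truncate_at_paragraph` below cuts at the last paragraph break within a character budget. The default of 6000 characters is taken from the next cell; that this stays below 4096 tokens is an assumption based on the run shown here, not a guarantee.

```python
def truncate_at_paragraph(text, max_chars=6000):
    """Cut text at the last blank line within max_chars (hypothetical helper)."""
    if len(text) <= max_chars:
        return text
    head = text[:max_chars]
    cut = head.rfind("\n\n")
    # Fall back to a hard cut if there is no paragraph break within the budget
    return head if cut <= 0 else head[:cut]

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(truncate_at_paragraph(sample, max_chars=40))
```

With the notebook's variables, this could be used as `truncate_at_paragraph(content)` in place of `content[:6000]`.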

In [26]:
#question = "Wie hieß der Guide, der uns durch den Masoala Regenwald geführt hat?"
question = "What was the name of the guide who led us on our tour in the Masoala rain forest on Madagascar?"
chat = Llama2ChatVersion2(llm, "Answer in a very concise and accurate way")
chat.prompt_llama2_stream(f"{get_question_with_context(question, content[:6000])}")

<span style='font-size: 16px;'>  Based on the provided context, the name of the guide who led the tour in the Masoala rainforest is Armand.</span>

The answer is correct!

The next challenge is to pick the right blog post automatically. Let's start by gathering all blog posts.

In [27]:
import os
import glob

def get_blog_post_files(path):
    # Create a pattern to match all .md files in the directories under the base path
    pattern = os.path.join(path, "**/*.md")

    # Use glob to find all files matching the pattern
    # The '**' pattern means "this directory and all subdirectories, recursively"
    # The '*.md' pattern means "all files ending with .md"
    file_list = glob.glob(pattern, recursive=True)

    # file_list now contains the full paths of all .md files
    return file_list

path_to_blog = "../wt-blogposts"
files = get_blog_post_files(path_to_blog)
files[0:3]


['../wt-blogposts/ritt-auf-paso-peruanos-im-colca-tal/index.md',
 '../wt-blogposts/tropical-treeclimbing-regenwald-auf-allen-etagen/index.md',
 '../wt-blogposts/hochland-kulinarisch-coca/index.md']

The first blog post in the list deals with horseback riding in Peru:

In [28]:
def get_blog_post(path):
    # The posts contain German umlauts, so read them explicitly as UTF-8
    with open(path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

print(get_blog_post(files[0])[:1000])    

---
title: 'Ritt auf Paso Peruanos im Colca Tal'
description: ""
published: 2018-11-05
redirect_from: 
            - https://wittmann-tours.de/ritt-auf-paso-peruanos-im-colca-tal/
categories: "Colca, Colca Canyon, Colca Tal, Inka, Paso, Paso Peruano, Peru, Peru, Pferde, Reiten, Viscacha"
hero: ./img/wp-content-uploads-2018-10-CW-20180514-083627-2590-1024x683.jpg
---
# Ritt auf Paso Peruanos im Colca Tal

In Peru ist man zu Recht sehr stolz auf die nationale Pferderasse des Landes, die [Paso Peruanos](https://de.wikipedia.org/wiki/Paso_Peruano). Ihre Besonderheit ist, dass sie eine spezielle, überaus bequeme Gangart haben, den [Paso Llano](https://de.wikipedia.org/wiki/Paso_Peruano#Gangmechanik), ähnlich dem Tölt der Islandpferde. Das Zuchtziel ("[Brio](https://de.wikipedia.org/wiki/Paso_Peruano#Interieur)") wird folgendermaßen definiert: "Eifrige Bereitwilligkeit kombiniert mit energischem Einsatz und ausdrucksvoller Präsentation". Auf diesen Prachtpferden wollten wir gerne reiten und 

Before we move on to generating the embeddings for the blog posts, we need to talk about language. The embedding model ([BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)) from the [hacker's guide](https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb) was trained on English, but the blog is written in German. Surprisingly, it nonetheless performs well on German text, although it would be "more correct" to use the multilingual model [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3). I have put both models in the next cell for testing. The small English model is a lot faster and gets the job done, which is why it is the active version.

In [29]:
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="mps")
#emb_model = SentenceTransformer("BAAI/bge-m3", device="mps")

Let's move on to generating the embeddings:

In [30]:
def get_text_embedding(text):
    return emb_model.encode(text, convert_to_tensor=True)

def get_blog_post_embedding(path):
    blog_post_text = get_blog_post(path)
    return get_text_embedding(blog_post_text)

In [31]:
get_blog_post_embedding(files[0])[0:5]

tensor([0.0235, 0.0092, 0.0103, 0.0846, 0.0199], device='mps:0')

In [32]:
question_embedding = get_text_embedding(question)

In [33]:
question_embedding[0:5]

tensor([-0.0154,  0.0802,  0.0510, -0.0560,  0.0006], device='mps:0')

In [34]:
def get_similarity(embedding1, embedding2):
    return F.cosine_similarity(embedding1, embedding2, dim=0)

For a first test, let's compare the similarity of the question to the two blog posts we saw above, the one about Peru and the one about Masoala:

In [35]:
print(question)
print(files[0])
get_similarity(question_embedding, get_blog_post_embedding(files[0]))

What was the name of the guide who led us on our tour in the Masoala rain forest on Madagascar?
../wt-blogposts/ritt-auf-paso-peruanos-im-colca-tal/index.md


tensor(0.4958, device='mps:0')

In [36]:
print(question)
print(path_to_blogpost)
get_similarity(question_embedding, get_blog_post_embedding(path_to_blogpost))

What was the name of the guide who led us on our tour in the Masoala rain forest on Madagascar?
../wt-blogposts/drei-tage-im-masoalaregenwald/index.md


tensor(0.6082, device='mps:0')

Not surprisingly, the similarity is higher for the Masoala blog post.

Just for fun, here is another blog post (about food in Hong Kong), which gets an even lower similarity score:

In [38]:
get_similarity(question_embedding, get_blog_post_embedding("../wt-blogposts/essen-mit-stern-hongkong-kulinarisch/index.md"))

tensor(0.3877, device='mps:0')

Let's combine what we have done so far and determine the blog post which fits best for a question:

In [39]:
def get_blog_post_as_context(question):

    best_match = ""
    best_match_embedding = get_text_embedding(best_match)
    question_embedding = get_text_embedding(question)
    best_similarity = get_similarity(question_embedding, best_match_embedding)
    #print(best_similarity)

    for file in files:
        #print(file)
        blog_post_embedding = get_blog_post_embedding(file)
        blog_post_similarity = get_similarity(question_embedding, blog_post_embedding)
        #print(blog_post_similarity)
        if blog_post_similarity > best_similarity:
            best_similarity = blog_post_similarity
            best_match = file
    
    return best_match

In [40]:
get_blog_post_as_context("What was the name of the guide who led us on our tour in the Masoala rain forest on Madagascar?")

'../wt-blogposts/drei-tage-im-masoalaregenwald/index.md'
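
Note that `get_blog_post_as_context` re-computes the embedding of every blog post for each question. A possible refinement, sketched here under assumptions, is to embed all posts once and then only rank them per question. To keep the example runnable without the model, `toy_embed` is a made-up bag-of-words stand-in for `get_text_embedding`, and the two documents are invented; `build_index` and `rank` are hypothetical names as well.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length lists (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

VOCAB = ["guide", "rainforest", "madagascar", "food", "tea", "hongkong"]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def build_index(docs, embed):
    """Embed every document once; docs maps a name to its text."""
    return {name: embed(text) for name, text in docs.items()}

def rank(question, index, embed, top_k=1):
    """Return the top_k (name, similarity) pairs for the question."""
    q = embed(question)
    scored = [(name, cosine(q, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Invented mini-corpus, for illustration only
docs = {
    "masoala": "our guide in the rainforest of madagascar",
    "hongkong": "food and tea in hongkong",
}
index = build_index(docs, toy_embed)
best = rank("who was the guide in madagascar", index, toy_embed)
print(best[0][0])
```

Returning the scores together with the names also makes it easy to inspect the runners-up by raising `top_k`.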

Let's try a different question:

In [42]:
get_blog_post_as_context("Which culinary specialties did we eat in Tansania?")

'../wt-blogposts/tansania-kulinarisch-ugali/index.md'

And the final step: Let's ask a question and prompt the model with the context we found:

In [43]:
question = "Which culinary specialties did we eat in Hong Kong?"
blog_post_path = get_blog_post_as_context(question)
blog_post_context = get_blog_post(blog_post_path)
chat = Llama2ChatVersion2(llm, "Answer in a very concise and accurate way")
chat.prompt_llama2_stream(f"{get_question_with_context(question, blog_post_context[:6000])}")
print(f"Context used: {blog_post_path}")

<span style='font-size: 16px;'>  Based on the provided text, the following are the culinary specialties that were eaten in Hong Kong:
1. Milktea (Pantyhose Tea) - a traditional Chinese tea served with dim sum at a small, unassuming stall in a bustling market area.
2. Afternoon Tea - a classic British tradition served at the Peninsula Hotel, featuring sandwiches, scones, and pastries.
3. Gänsebraten (Roast Goose) - a typical Cantonese dish served at a small, unpretentious restaurant in a busy street.</span>

Context used: ../wt-blogposts/essen-mit-stern-hongkong-kulinarisch/index.md


The answer is correct!