# Q and A with RAG

This notebook is a first dive into RAG (retrieval-augmented generation). We implement a general RAG pipeline for a Q and A system and use it on two types of data: the Hugging Face Wikipedia dataset and a joke dataset. Next we evaluate how well it works and when it could be useful. This notebook is based on the LangChain handbook on the Pinecone website: https://www.pinecone.io/learn/series/langchain/langchain-intro/


General RAG system:
1. Load the LLM (Ollama's phi3)
2. Load the datasets
3. Tokenization: large text files (like the Wikipedia articles) need to be split into chunks before we tokenize and embed them.
4. Embed the text chunks (I use the Hugging Face all-MiniLM-l6-v2 model)
5. Create the vector database. I use Pinecone.
6. Q and A retrieval
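
The steps above can be sketched end to end in plain Python. This is only a toy illustration under loud assumptions: the bag-of-words `embed` function stands in for the all-MiniLM-l6-v2 model, the in-memory `store` list stands in for Pinecone, and the prompt wording in `build_prompt` is made up for this example. No LLM is called; the sketch stops at the prompt that would be sent to one.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    """Return the k stored texts that are most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(query, contexts):
    """Put the retrieved texts in front of the question (illustrative wording)."""
    context_block = "\n\n".join(contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


documents = [
    "Alan Turing was an English mathematician and computer scientist.",
    "Albert Einstein developed the theory of relativity.",
    "Why was the musician arrested? He got in treble.",
]
# steps 4 and 5: embed every text and keep (vector, text) pairs as the "index"
store = [(embed(doc), doc) for doc in documents]

# step 6: retrieve the best match and build the prompt for the LLM
question = "Who developed the theory of relativity?"
top = retrieve(question, store, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The rest of the notebook replaces each toy piece with a real component: phi3 as the LLM, a sentence-transformers model for the embeddings, and Pinecone as the index.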



## Load LLM: Ollama: phi3

In [3]:
!pip install ollama
!pip install langchain
!pip install langchain_community
!pip install datasets
!pip install transformers
!pip install langchain_huggingface



In [4]:
import ollama

In [5]:
response = ollama.generate(model='llama3',
prompt='what is a qub?')
print(response['response'])

A fundamental question in the realm of quantum computing!

A qubit (quantum bit) is the basic unit of quantum information, similar to a classical bit (0 or 1). However, unlike classical bits, which can only exist in one of two states (0 or 1), qubits can exist in multiple states simultaneously, known as superposition. This means that a qubit can represent both 0 and 1 at the same time!

In other words, a qubit is a quantum-mechanical version of a classical bit that uses the principles of quantum mechanics to process information. Qubits are designed to take advantage of the strange behavior of particles at the atomic level, such as superposition, entanglement, and interference.

Here's a simple analogy to help illustrate the concept:

Imagine you have two envelopes, each containing either 0 or 1. A classical bit would be like opening one envelope and finding either 0 or 1 inside. A qubit is like having an envelope that contains both 0 and 1 at the same time! When you open it, the conten

In [6]:
from langchain_community.llms import Ollama

llm = Ollama(model="phi3")

llm.invoke("what is a qubit")

"A qubit, or quantum bit, represents the fundamental unit of information in quantum computing. Unlike classical bits that can be either 0 or 1 (but not both at once), a qubit can exist simultaneously in multiple states due to superposition—a key principle of quantum mechanics known as 'quantum parallelism'. This property allows for more complex and powerful computations than traditional binary systems, especially when coupled with another phenomenon called entanglement. When two or more qubits become entangled, the state of one (no matter how far apart they are) can instantly affect the others; this interconnectedness is critical in quantum error correction protocols to maintain coherence and accuracy during computations."

## Dataset

In [116]:
from datasets import load_dataset

data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')
data

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 10000
})

In [115]:
from datasets import load_dataset

datajokes = load_dataset("ysharma/short_jokes", split='train[:100000]')
datajokes

Dataset({
    features: ['ID', 'Joke'],
    num_rows: 100000
})

In [11]:
data[6]

{'id': '13',
 'url': 'https://simple.wikipedia.org/wiki/Alan%20Turing',
 'title': 'Alan Turing',
 'text': 'Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.\n\nEarly life and family \nAlan Turing was born in Maida Vale, London on 23 June 1912. His father was part of a family of merchants from Scotland. His mother, Ethel Sara, was the daughter of an engineer.\n\nEducation \nTuring went to St. Michael\'s, a school at 20 Charles Road, St Leonards-on-sea, when he was five years old.\n"This is only a foretaste of what is to come, and only the shadow of what is going to be.” – Alan Turing.\n\nThe Stoney family were once prominent landlords, here in North Tipperary. His mother Ethel Sara Stoney (1881–1976) was daughter of Edward Waller Stoney (Borrisokane, North Tipperary) and Sarah Crawford (Cartron Abbey, Co. Longford); Protestant Anglo-Irish gentry.\n\nEducated in Dub

In [125]:
datajokes[8]

{'ID': 9,
 'Joke': "What do you do if a bird shits on your car? Don't ask her out again."}

## Tokenizing 

Entire Wikipedia articles are too long to tokenize and embed in one piece, so we first split them into chunks. The jokes are mostly a single sentence and can be embedded as they are.
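
To make the roles of chunk size and overlap concrete, here is a minimal sliding-window sketch. It is an assumption-laden simplification: the hypothetical `chunk_tokens` helper works on a list of whitespace-separated words rather than real model tokens, and it uses a fixed window, which is not how `RecursiveCharacterTextSplitter` works (that class splits on separators first).

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a chunk boundary is not lost entirely to either side.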

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# the text splitter needs a length function that counts tokens
# the actual tokenization is done by tokenizer.tokenize(text)
def token_len(text):
    tokens = tokenizer.tokenize(text)
    return len(tokens)

token_len("This is a test sentence to see if token_len gives something that could be true ")


18

In [13]:
# Long articles exceed the token limit, so we divide the text into chunks. The overlap and the separators are important choices.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=token_len,
    separators=["\n\n", "\n", " ", ""]
)

In [14]:
chunks = text_splitter.split_text(data[6]['text'])[:3]
chunks


['Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.\n\nEarly life and family \nAlan Turing was born in Maida Vale, London on 23 June 1912. His father was part of a family of merchants from Scotland. His mother, Ethel Sara, was the daughter of an engineer.\n\nEducation \nTuring went to St. Michael\'s, a school at 20 Charles Road, St Leonards-on-sea, when he was five years old.\n"This is only a foretaste of what is to come, and only the shadow of what is going to be.” – Alan Turing.\n\nThe Stoney family were once prominent landlords, here in North Tipperary. His mother Ethel Sara Stoney (1881–1976) was daughter of Edward Waller Stoney (Borrisokane, North Tipperary) and Sarah Crawford (Cartron Abbey, Co. Longford); Protestant Anglo-Irish gentry.\n\nEducated in Dublin at Alexandra School and College; on October 1st 1907 she married Julius Mathison Turing, latter son o

In [15]:
token_len(chunks[0]), token_len(chunks[1]), token_len(chunks[2])

(370, 372, 384)

## Embeddings: huggingface
We use the *sentence-transformers/all-MiniLM-l6-v2* embedding model here. It is a small model that performs well for its size. For alternatives, keep an eye on the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard


In [16]:
import langchain_huggingface
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

In [17]:
import getpass

inference_api_key = getpass.getpass("Enter your HF Inference API Key:\n\n")

Enter your HF Inference API Key:

 ········


In [84]:
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=inference_api_key, model_name="sentence-transformers/all-MiniLM-l6-v2"
)

text = 'I have to poo my pants'
# embed_query takes a single string and returns one vector;
# embed_documents expects a list of strings and returns a list of vectors
query_result = embeddings.embed_query(text)
embed_document = embeddings.embed_documents([text])
len(query_result)

384

In [20]:
texts = [
    'this is the first chunk of text', 
    'and i also still have to poo'
]

res = embeddings.embed_documents(texts)
len(res), len(res[0])

(2, 384)
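
The Pinecone index created in this notebook uses `metric='cosine'` to compare these 384-dimensional vectors. As a minimal sketch of what that metric computes, the hypothetical helper below works on plain Python lists; it is not Pinecone's implementation, and the short example vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction and not on length, two texts score high when their embeddings point the same way. The length check also shows why the index dimension has to match the embedding model.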

## Vectordatabase: Pinecone

The dimension of the index must match the output dimension of the embedding model (384 for all-MiniLM-l6-v2).

We first create an index and fill it with vectors; only later do we wrap it in a vector store.

In [21]:
!pip install pinecone-client

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [48]:
from pinecone import Pinecone, ServerlessSpec

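import getpass

# read the API key at runtime instead of hard-coding it in the notebook
pc = Pinecone(api_key=getpass.getpass("Enter your Pinecone API key:\n\n"))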

In [68]:
import time

index_name = 'jokes'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is the first run)
if index_name not in existing_indexes:
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=384,  # output dimension of the all-MiniLM-l6-v2 embedding model
        metric='cosine',
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # wait for the index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats: a fresh index reports a total_vector_count of 0;
# the count is higher when this cell is rerun after vectors were upserted
index.describe_index_stats()



{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [87]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100
counter = 0

texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):
    counter += 1
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
    
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embeddings.embed_documents(texts)
        if counter%10 == 0:
            print(len(ids), len(embeds), len(texts))
        index.upsert(vectors=zip(ids, embeds, metadatas))
        
        metadatas = []
        if counter%10 == 0:
            print(len(ids), len(embeds), len(texts))
            

        texts = []         


if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embeddings.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

 28%|███████████                             | 277/1000 [00:15<00:35, 20.31it/s]

102 102 102


 29%|███████████▍                            | 287/1000 [00:16<00:48, 14.60it/s]

102 102 102


 39%|███████████████▋                        | 393/1000 [00:21<00:25, 23.83it/s]

100 100 100


 40%|████████████████▏                       | 404/1000 [00:22<00:35, 16.74it/s]

100 100 100
104 104 104


 45%|██████████████████                      | 452/1000 [00:24<00:22, 24.85it/s]

104 104 104


 55%|██████████████████████▏                 | 554/1000 [00:30<00:24, 17.98it/s]

101 101 101


 58%|███████████████████████▎                | 584/1000 [00:32<00:21, 19.03it/s]

101 101 101


100%|███████████████████████████████████████| 1000/1000 [00:55<00:00, 18.15it/s]


In [71]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 3479}},
 'total_vector_count': 3479}

In [82]:
len(ids), len(embeds)

(66, 66)
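
The batching logic of the upsert loop above can also be factored into a small generator. This is a sketch under assumptions: `batched` is a hypothetical helper written for this example, and unlike the loop above it cuts batches at exactly `batch_size` items, whereas the loop only flushes after a whole article has been chunked (which is why the printed batch sizes range from 100 to 104).

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most batch_size items; the last list may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


records = [f"chunk {i}" for i in range(7)]
batches = list(batched(records, 3))
print([len(batch) for batch in batches])
```

With such a helper, each batch can be embedded and upserted in one call, and the leftover items no longer need a separate block after the loop.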

## Vector store

In [72]:
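# alias the LangChain class so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embeddings.embed_query, text_field
)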

In [73]:
query = "who was Einstein?"


vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)




[Document(metadata={'chunk': 0.0, 'source': 'https://simple.wikipedia.org/wiki/Albert%20Einstein', 'title': 'Albert Einstein', 'wiki-id': '2138'}, page_content="Albert Einstein (14 March 1879 – 18 April 1955) was a German-born American scientist. He worked on theoretical physics. He developed the theory of relativity. He received the Nobel Prize in Physics in 1921 for theoretical physics.\n\nHis famous equation is  (E = energy, m = mass, c = speed of light (energy = mass X speed of light²).\n\nAt the start of his career, Einstein didn't think that Newtonian mechanics was enough to bring together the laws of classical mechanics and the laws of the electromagnetic field. Between 1902–1909 he made the theory of special relativity to fix it. Einstein also thought that Isaac Newton's idea of gravity was not completely correct. So, he extended his ideas on special relativity to include gravity. In 1916, he published a paper on general relativity with his theory of gravitation."),
 Document(m

In [74]:
query = "who was einstein?"

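# embed_query takes a single string and returns one 384-dimensional vector
res = embeddings.embed_query(query)
res[0]  # first component of the query embedding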

-0.01577218621969223

In [75]:

query = "who was Albert Einstein?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(metadata={'chunk': 0.0, 'source': 'https://simple.wikipedia.org/wiki/Albert%20Einstein', 'title': 'Albert Einstein', 'wiki-id': '2138'}, page_content="Albert Einstein (14 March 1879 – 18 April 1955) was a German-born American scientist. He worked on theoretical physics. He developed the theory of relativity. He received the Nobel Prize in Physics in 1921 for theoretical physics.\n\nHis famous equation is  (E = energy, m = mass, c = speed of light (energy = mass X speed of light²).\n\nAt the start of his career, Einstein didn't think that Newtonian mechanics was enough to bring together the laws of classical mechanics and the laws of the electromagnetic field. Between 1902–1909 he made the theory of special relativity to fix it. Einstein also thought that Isaac Newton's idea of gravity was not completely correct. So, he extended his ideas on special relativity to include gravity. In 1916, he published a paper on general relativity with his theory of gravitation."),
 Document(m

## QandA Retrieval




In [76]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)


In [77]:
qa.run(query)

  warn_deprecated(


'Albert Einstein was a German-born American scientist recognized for his groundbreaking work in theoretical physics, particularly the development of the theory of relativity and contributions to quantum mechanics with his explanation of the photoelectric effect. Born on March 14, 1879, in Ulm, Germany, Einstein received international fame after he developed a famous equation E=mc², which reveals the equivalence of mass (m) and energy (E), where c is the speed of light squared. He later extended his ideas to include gravity as part of this unified theory but faced challenges in reconciling general relativity with quantum mechanics, two pillars of modern physics that are still not fully integrated today. Einstein was a vocal advocate for socialism and Zionism during the Weimar Republic before migrating to America due to increasing antisemitic sentiments under Nazi rule; he became an American citizen in 1940 and spent his later life at Princeton, New Jersey.'
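
With `chain_type="stuff"`, all retrieved documents are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea only: `build_stuff_prompt` is a hypothetical helper, and its instruction text is made up for this example rather than copied from LangChain's actual prompt template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt for the LLM."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you do not know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


retrieved = [
    "Albert Einstein (14 March 1879 - 18 April 1955) was a German-born American scientist.",
    "He received the Nobel Prize in Physics in 1921 for theoretical physics.",
]
stuff_prompt = build_stuff_prompt("who was Albert Einstein?", retrieved)
print(stuff_prompt)
```

Because everything goes into one prompt, the number of retrieved chunks times the chunk size has to stay within the context window of the LLM.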

In [78]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=vectorstore.as_retriever()
)

In [79]:
qa_with_sources(query)

  warn_deprecated(


{'question': 'who was Albert Einstein?',
 'answer': 'In the later years of his life, Albert Einstein resigned from his position in the Prussian Academy after returning to Germany during spring of 1914. As he completed The General Theory of Relativity, Nazi Party came into power which caused antisemitism towards Jews and called for a rejection against relativity theory as "Jewish physics." Due to these threats, Einstein decided not to go back to Berlin with Elsa, moving instead to Belgium. In 1940, he became an American citizen after being offered this option when returning from the United States during World War II. Albert Einstein\'s life journey began in Ulm, Württemberg, Germany on March 14, 1879 and ended with his death in Princeton, New Jersey in 1955 due to aortic aneurysm caused by cardiovascular disease after spending the first four years of life learning about science from everyday items. He was married twice - Mileva Marić initially at age 24 and then Elsa Löwenthal later in 

In [79]:
query = "Can you tell me about the history of Belgium?"


## Jokes

We use a similar approach to apply RAG to joke generation. We need a new Pinecone index for this (most Wikipedia pages are not that funny). We also use only the joke ID as metadata. In addition, we do not split the jokes: they are short enough to be embedded directly, and splitting could lose the meaning or the punchline.

In [117]:
import time

index_name = 'jokes'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is the first run)
if index_name not in existing_indexes:
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=384,  # output dimension of the all-MiniLM-l6-v2 embedding model
        metric='cosine',
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # wait for the index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats: a fresh index reports a total_vector_count of 0;
# the count is higher when this cell is rerun after vectors were upserted
index.describe_index_stats()


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [141]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 1

texts = []
metadatas = []

for i, record in enumerate(tqdm(datajokes)):
    # first get metadata fields for this record
    # (the key name 'wiki-id' is kept from the Wikipedia loop; here it holds the joke ID)
    metadata = {
        'wiki-id': str(record['ID']),
    }
    # no splitting here: each joke is embedded as a single chunk
    record_texts = [record['Joke']]
    # create the metadata dict for the joke; the text is stored under "Joke"
    record_metadatas = [{
        "chunk": j, "Joke": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    
    if len(texts) >= batch_limit:
    
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embeddings.embed_documents(texts)
        
        index.upsert(vectors=zip(ids, embeds, metadatas))
        
        metadatas = []
        texts = []         


if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embeddings.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

  0%|                                    | 23/100000 [00:12<14:42:48,  1.89it/s]


KeyboardInterrupt: 

In [142]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 348}},
 'total_vector_count': 348}
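
The index holds 348 vectors although the interrupted run above only got through a few dozen jokes. One plausible explanation (an assumption, not something the outputs confirm) is that earlier runs had already written to the same index: with random `uuid4` IDs, every rerun adds new vectors instead of overwriting the old ones. A minimal sketch of deterministic IDs follows; `stable_id` and the namespace URL are hypothetical names chosen for this example.

```python
import uuid

# hypothetical namespace for this notebook; any fixed UUID works
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/rag-notebook")


def stable_id(record_id, chunk_index):
    """Derive the same ID every time from the record ID and the chunk number."""
    return str(uuid.uuid5(NAMESPACE, f"{record_id}:{chunk_index}"))


first_run = stable_id("9", 0)
second_run = stable_id("9", 0)
other_chunk = stable_id("9", 1)
print(first_run == second_run, first_run == other_chunk)
```

Since an upsert overwrites an existing vector that has the same ID, rerunning the loop with IDs like these would leave the vector count unchanged.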

In [143]:
print(embeds)

[[-0.0014075927902013063, 0.0009633786976337433, 0.004709961824119091, 0.0790449008345604, 0.06980512291193008, -0.05523938685655594, 0.0894373431801796, -0.005364423152059317, 0.02147030085325241, -0.05204664543271065, -0.03943726792931557, -0.002272471319884062, -0.023620769381523132, -0.0065369498915970325, -0.0007013562135398388, 0.03782138228416443, -0.0045754569582641125, -0.02380620874464512, 0.06925386190414429, 0.07501229643821716, -0.00011765622912207618, 0.022024665027856827, 0.021772967651486397, -0.06946240365505219, 0.000519908033311367, 0.0012798439711332321, -0.06192940101027489, 0.058303140103816986, -0.08353668451309204, 0.12838317453861237, -0.008267860859632492, 0.11591611802577972, -0.02645707130432129, 0.006451504770666361, -0.05189138650894165, 0.008368570357561111, -0.004341499414294958, 0.08576838672161102, 0.049489524215459824, 0.00856676697731018, 0.010505023412406445, -0.010557482950389385, 0.028406541794538498, 0.04531765356659889, -0.05926913768053055, -0.

### Vector store

In [130]:
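# alias the LangChain class so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "Joke"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embeddings.embed_query, text_field
)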

In [138]:
query = "joke about musicians"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)


[Document(metadata={'chunk': 0.0, 'wiki-id': '183'}, page_content='Why did the composer go to the chiropractor? Because he had Bach problems'),
 Document(metadata={'chunk': 0.0, 'wiki-id': '7'}, page_content='Why was the musician arrested? He got in treble.'),
 Document(metadata={'chunk': 0.0, 'wiki-id': '72'}, page_content="I'm terrible at telling jokes... I always punch up the fuck lines")]

In [132]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

qa.run(query)

'Why did Beethoven refuse to play music for his doggo? Because he wanted him to appreciate "pawsitively" classical compositions!'

In [137]:
llm.invoke(query)

'Why did the musician go to therapy? Because he needed a good "stress" reliever after hearing bad news!'