# WS24 - Intelligente Informationssysteme

## Block 3: Retrieval Augmented Generation

Build your first simple RAG with LangChain. We follow the LangChain Tutorial "Build a Retrieval Augmented Generation (RAG) App" found at <https://python.langchain.com/docs/tutorials/rag/>.

**Part 1: Prepare, Split and Index Knowledge for Storing in Vector Databases**

1. Start with data: download and prepare the data you want to add as knowledge. We will extract data from some blog posts found at Lil's Blog (<https://lilianweng.github.io>) into LangChain Documents.
2. Split the Documents into chunks.
3. Compute embedding vectors and store them in a vector database.

## 1. Download and prepare the data

In [None]:
# Use Beautiful Soup for Web-Crawling: https://www.crummy.com/software/BeautifulSoup/
# Load blog posts from "https://lilianweng.github.io/posts/"
import bs4
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = "https://lilianweng.github.io/posts/"

# opening connection, grabbing the HTML from the page
client = urlopen(url)
page_html = client.read()
client.close()

page_soup = soup(page_html, 'html.parser')

In [None]:
# page_soup.findChildren()

In [None]:
#<a aria-label=".." class="entry-link" href="https://lilianweng.github.io/posts/2024-07-07-hallucination/"></a>
blog_posts = []
cells = page_soup.find_all("a", attrs={"class": "entry-link"})
for cell in cells:
    if isinstance(cell, bs4.element.Tag):
        blog_posts.append({'label': cell.get('aria-label'), 'link': cell.get('href')})
print(f"{len(blog_posts)} posts found.")

In [None]:
# Use Beautiful Soup for Web-Crawling: https://www.crummy.com/software/BeautifulSoup/
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Note: WebBaseLoader logs a warning if the USER_AGENT environment variable is not set.
# Iterate through all blog posts found above.
# Use SoupStrainer to keep post title, headers, and content from the full HTML. SoupStrainer is explained at
# https://medium.com/codex/using-beautiful-soups-soupstrainer-to-save-time-and-memory-when-web-scraping-ea1dbd2e886f
# Use WebBaseLoader to get the requested documents https://python.langchain.com/docs/integrations/document_loaders/web_base/
docs = []

bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
for blog_post in blog_posts:
    loader = WebBaseLoader(
        web_paths=(blog_post['link'],),
        bs_kwargs={"parse_only": bs4_strainer},
    )
    docs.extend(loader.load())

len(docs)

In [None]:
# Now we have a list of LangChain Documents. A Document is an object with some page_content (str) and metadata (dict).
print(docs[0].metadata)
print("Page content:")
print(docs[0].page_content[:500])

## 2. Split the Documents

Now we split each Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant parts of the blog post at run time.

We split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. 

The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size.
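
To make the effect of the overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that class also respects separators); the function name `split_with_overlap` and the tiny sizes are assumptions chosen for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Hypothetical toy input: 25 characters, chunk size 10, overlap 4.
text = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, repr(chunk))
```

Each chunk begins with the last four characters of the previous one, so a statement cut at a chunk boundary still appears together with some of its context. The start index printed here plays the same role as the `start_index` metadata requested with `add_start_index=True`.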

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                               chunk_overlap=200,
                                               length_function=len,
                                               #is_separator_regex=False, #not working
                                               add_start_index=True,
                                               separators=["\n\n\n", "\n"]
                                              )


all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
#print(docs[0].page_content)
print("===============")
print(all_splits[0].page_content)
print("---------------")
print(all_splits[1].page_content)
print("---------------")
print(all_splits[2].page_content)

## 3. Compute Embedding Vectors and store them in Vector Database

Now we need to index our text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity — we measure the cosine of the angle between each pair of embeddings (which are high dimensional vectors).
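
The search step described above can be sketched without any vector database. The example below uses made-up three-dimensional vectors instead of real embeddings, and the names `toy_store` and `top_k` are hypothetical; a real vector store would typically use an approximate index rather than scoring every entry.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Made-up "embeddings" for three chunks; real models produce hundreds of dimensions.
toy_store = {
    "chunk about hallucination": [0.9, 0.1, 0.0],
    "chunk about agents": [0.1, 0.9, 0.1],
    "chunk about diffusion models": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedded search query


def top_k(query, store, k):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store.items()]
    return sorted(scored, reverse=True)[:k]


results = top_k(query_vector, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The chunk whose vector points in nearly the same direction as the query vector is ranked first, regardless of vector length.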

We embed and store all of our document splits using the Chroma vector store and the OllamaEmbeddings model (nomic-embed-text).

In [None]:
# We use the nomic embedding model provided by Ollama
# https://ollama.com/library/nomic-embed-text
# ollama pull nomic-embed-text
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

## try the embeddings
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

## compute cosine similarity
import numpy as np
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / ( np.linalg.norm(v1) * np.linalg.norm(v2))

similarity = cosine_similarity(np.array(vector_1), np.array(vector_2))
print("Cosine Similarity:", similarity)

In [None]:
# Use Chroma DB as vector database to store all embeddings
from langchain_chroma import Chroma
vectorstore = Chroma(persist_directory="vector_store", collection_name="lils_blogs", embedding_function=embeddings)

In [None]:
type(vectorstore)

In [None]:
help(vectorstore.add_documents)

In [None]:
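# Add the chunks in batches instead of one call per chunk; add_documents returns the new ids.
batch_size = 100
for start in range(0, len(all_splits), batch_size):
    ids = vectorstore.add_documents(documents=all_splits[start:start + batch_size])
    #print(f"{len(ids)} chunks added")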


In [None]:
####### Test the vectorstore
help(vectorstore.similarity_search)

In [None]:
returned_docs = vectorstore.similarity_search("What kind of hallucination do LLMs have?", k=4)
for doc in returned_docs:
    print(doc.metadata)    
    print(doc.page_content)
    print("---------------")

## LlamaIndex - an alternative Text Splitter

In [None]:
#! pip install llama_index

In [None]:
from llama_index.core import Document
documents = [] # list of llama_index documents
for doc in docs:
    documents.append(Document(text=doc.page_content, metadata=doc.metadata))
print(len(documents))

In [None]:
# SentenceSplitter: parse text with a preference for complete sentences.
#
# In general, this class tries to keep sentences and paragraphs together.
# Compared to the original TokenTextSplitter, hanging sentences or sentence
# fragments are therefore less likely at the end of a node chunk.

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=200,     # tokens, not characters
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))
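
The idea of keeping sentences together can be sketched as a greedy packer: whole sentences are appended to the current chunk until a size budget would be exceeded. This is only an illustration under simplifying assumptions, not the SentenceSplitter implementation: it counts whitespace-separated words instead of tokens, uses a naive regex for sentence boundaries, and ignores overlap. The name `pack_sentences` is hypothetical.

```python
import re


def pack_sentences(text: str, max_words: int) -> list[str]:
    """Greedily group whole sentences into chunks of at most max_words words."""
    # Naive sentence boundary: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if this sentence would push the current one over budget.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine. Ten eleven twelve thirteen fourteen."
packed = pack_sentences(sample, max_words=6)
for chunk in packed:
    print(repr(chunk))
```

No chunk ends in the middle of a sentence. In this sketch a single sentence longer than the budget becomes an oversized chunk of its own.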

In [None]:
print(nodes[0].text)
print("---------------")
print(nodes[1].text)
print("---------------")
print(nodes[2].text)

In [None]:

# TokenTextSplitter: implementation of splitting text that looks at word tokens.
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=200,     # tokens, not characters
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))

In [None]:
print(nodes[0].text)
print("---------------")
print(nodes[1].text)
print("---------------")
print(nodes[2].text)

In [None]:
# https://medium.com/@bavalpreetsinghh/llama-index-a-comprehensive-guide-for-building-and-querying-document-indexes-27a13bb482a5
# https://medium.com/@bavalpreetsinghh/llamaindex-chunking-strategies-for-large-language-models-part-1-ded1218cfd30


In [None]:
# SentenceWindowNodeParser
# This component is responsible for parsing documents into individual sentences. 
# It creates nodes for each sentence, and each node includes a “window” containing the sentences surrounding it. 
# This means that instead of just having one isolated sentence, you have a context window of sentences around it.
from llama_index.core.node_parser import SentenceWindowNodeParser

help(SentenceWindowNodeParser)

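# SentenceWindowNodeParser takes a window size (in sentences), not chunk_size/chunk_overlap.
splitter = SentenceWindowNodeParser.from_defaults(
    window_size=3,                              # sentences kept on each side
    window_metadata_key="window",               # metadata key holding the surrounding sentences
    original_text_metadata_key="original_text", # metadata key holding the sentence itself
)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))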
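
The sentence-window idea described in the comments above can be sketched in plain Python: every sentence becomes one node, and each node also stores the neighbouring sentences as its window. This is a simplified illustration, not the LlamaIndex implementation; the function name `sentence_windows`, the dictionary keys, and the regex-based sentence split are assumptions made for the sketch.

```python
import re


def sentence_windows(text: str, window_size: int) -> list[dict]:
    """Create one node per sentence, with window_size neighbours on each side as context."""
    # Naive sentence boundary: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        first = max(0, i - window_size)
        last = min(len(sentences), i + window_size + 1)
        nodes.append({"text": sentence, "window": " ".join(sentences[first:last])})
    return nodes


sample = (
    "RAG retrieves context. The context is added to the prompt. "
    "The model then answers. Sources can be cited."
)
window_nodes = sentence_windows(sample, window_size=1)
for node in window_nodes:
    print(node["text"], "->", node["window"])
```

A common pattern with this kind of parser is to embed and search the single sentence, then pass the wider window to the language model as context.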
