## RAG

* Load large datasets and ask the LLM questions about them.
* When you load a document, you end up with strings. Sometimes the strings are too large to fit into the context window. In those cases we use the RAG (Retrieval-Augmented Generation) technique:
    * Split the document into small chunks.
    * Transform the text chunks into numeric vectors (embeddings).
    * Load the embeddings into a vector database (aka vector store).
    * Embed the question and retrieve the chunks whose embeddings are most relevant to it.
    * Send the retrieved chunks, together with the question, to the LLM so it can formulate the answer.
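
The steps above can be sketched end to end without any external service. This is a minimal illustration, not how LangChain or OpenAI implement it: `embed` is a toy bag-of-words stand-in for a real embedding model, a plain Python list plays the role of the vector store, and every function name here is hypothetical. The final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Step 1: split the document into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Step 2 (toy stand-in): word counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, store, k=1):
    """Step 4: embed the question and return the k most similar chunks."""
    question_vector = embed(question)
    ranked = sorted(store, key=lambda entry: cosine(question_vector, entry[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, context_chunks):
    """Step 5: put the retrieved chunk text into the prompt for the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


document = (
    "Cats sleep on the warm sofa. "
    "Rockets launch from the coastal pad. "
    "Bakers knead dough before the dawn."
)
chunks = split_into_chunks(document, chunk_size=6)
store = [(embed(chunk), chunk) for chunk in chunks]  # Step 3: the "vector store"

question = "Where do rockets launch?"
top_chunks = retrieve(question, store, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Only the chunk about rockets shares words with the question, so it is the one placed in the prompt. A real embedding model would also match paraphrases that share no words at all.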

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [2]:
from langchain_community.document_loaders import TextLoader

# Load the text file "steve_jobs.txt"
loader = TextLoader("data/steve_jobs.txt")
loaded_data = loader.load()

The file `steve_jobs.txt` contains 18,746 tokens, which exceeds the 16,385-token context window of the GPT-3.5 Turbo model.

To count the number of tokens, see: https://platform.openai.com/tokenizer
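
When no tokenizer is at hand, a rough size check can be sketched from character counts. The ratio of about 4 characters per token is only a rule of thumb for English text and an assumption here; `estimate_tokens` and `fits_in_context` are hypothetical helper names. An exact count needs the model's tokenizer (for example the tool linked above, or the `tiktoken` library).

```python
import math

CONTEXT_WINDOW = 16_385  # GPT-3.5 Turbo limit quoted above
CHARS_PER_TOKEN = 4      # assumption: rough average for English text


def estimate_tokens(text):
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, window=CONTEXT_WINDOW):
    """True if the estimated token count does not exceed the window."""
    return estimate_tokens(text) <= window


sample = "x" * 100
print(estimate_tokens(sample), fits_in_context(sample))
```

The estimate only indicates whether a document is far below or far above the limit. Near the boundary, count with the real tokenizer.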

In [3]:
loaded_data

[Document(metadata={'source': 'data/steve_jobs.txt'}, page_content='Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\n\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak\'s Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers.\n\nJobs saw the comme

In [4]:
loaded_data[0].page_content

'Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\n\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak\'s Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers.\n\nJobs saw the commercial potential of the Xerox Alto in 1979, which was mouse-driven a

## Chunking / Splitting

* Document splitting is often a crucial preprocessing step for many applications. It involves breaking down large texts into smaller, manageable chunks. This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems. There are several strategies for splitting documents, each with its own advantages.
* Text splitters split documents into smaller chunks for use in downstream applications.

#### Approaches
1. **Length-based**

* The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit. Key benefits of length-based splitting:
    * Straightforward implementation
    * Consistent chunk sizes
    * Easily adaptable to different model requirements

* Types of length-based splitting:
    * **Token-based**: Splits text based on the number of tokens, which is useful when working with language models.
    * **Character-based**: Splits text based on the number of characters, which can be more consistent across different types of text.

    [LangChain's `CharacterTextSplitter` with token-based splitting](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)

    [token-based splitting](https://python.langchain.com/docs/how_to/split_by_token/)

    [character-based splitting](https://python.langchain.com/docs/how_to/character_text_splitter/)
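
Character-based splitting with overlap can be sketched in a few lines. This is a simplified illustration and not LangChain's implementation: `split_by_length` is a hypothetical helper, and it cuts purely by position, ignoring separators. The parameter names `chunk_size` and `chunk_overlap` are borrowed from the splitters used in this notebook.

```python
def split_by_length(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(split_by_length("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The same loop works for token-based splitting if `text` is replaced by a list of tokens, since slicing behaves the same way on lists.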

2. **Text-structured based**
* Text is naturally organized into hierarchical units such as paragraphs, sentences, and words. We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, preserve semantic coherence within each split, and adapt to varying levels of text granularity. LangChain's `RecursiveCharacterTextSplitter` implements this concept:
    * The RecursiveCharacterTextSplitter attempts to keep larger units (e.g., paragraphs) intact.
    * If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
    * This process continues down to the word level if necessary.

    [Recursive Text Splitting](https://python.langchain.com/docs/how_to/recursive_text_splitter/)
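
The fall-back behavior described above can be sketched as a short recursive function. This is a simplified illustration, not the code of `RecursiveCharacterTextSplitter`: it has no chunk overlap, `recursive_split` is a hypothetical name, and the separator order (paragraph, line, word, character) is an assumption modeled on the hierarchy described above.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing only for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: no separator left, cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for part in text.split(separator):
        if not part:
            continue
        if len(part) > chunk_size:
            # Too large for this level: flush what we have, then go one level finer.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbors
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee fff", chunk_size=7))
```

The first paragraph fits and is kept whole. The second one is too long, so it is split at word boundaries and the words are merged back up to the size limit.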

3. **Document-structured based**
* Some documents have an inherent structure, such as HTML, Markdown, or JSON files. In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text. Key benefits of structure-based splitting:
    * Preserves the logical organization of the document
    * Maintains context within each chunk
    * Can be more effective for downstream tasks like retrieval or summarization

* Examples of structure-based splitting:
    * **Markdown**: Split based on headers (e.g., #, ##, ###)
    * **HTML**: Split using tags
    * **JSON**: Split by object or array elements
    * **Code**: Split by functions, classes, or logical blocks

    [Markdown splitting](https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/)

    [Recursive JSON splitting](https://python.langchain.com/docs/how_to/recursive_json_splitter/)

    [Code splitting](https://python.langchain.com/docs/how_to/code_splitter/)

    [HTML splitting](https://python.langchain.com/docs/how_to/split_html/)
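
As an illustration of structure-based splitting, the sketch below splits a Markdown string at its headers and records the header path of each section as metadata. It is a simplified stand-in, not LangChain's Markdown splitter: `split_markdown_by_headers` is a hypothetical function, and it does not handle `#` characters inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown):
    """Return one section per header, each with the headers that enclose it."""
    sections = []
    path = {}   # header level -> title of the currently open header
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(path), "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            new_level = len(match.group(1))
            # A new header closes every open header at the same or a deeper level.
            for level in [level for level in path if level >= new_level]:
                del path[level]
            path[new_level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


markdown = "\n".join([
    "# Guide",
    "Intro text.",
    "## Setup",
    "Install it.",
    "## Usage",
    "Run it.",
])
sections = split_markdown_by_headers(markdown)
for section in sections:
    print(section)
```

Keeping the header path with each chunk preserves context that would otherwise be lost, for example that "Install it." belongs to "Guide > Setup".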

4. **Semantic meaning based**
* Unlike the previous methods, semantic-based splitting actually considers the content of the text. While other approaches use document or text structure as proxies for semantic meaning, this method directly analyzes the text's semantics. There are several ways to implement this, but conceptually the approach is to split the text wherever its meaning changes significantly. As an example, we can use a sliding window approach to generate embeddings, and compare the embeddings to find significant differences:
    * Start with the first few sentences and generate an embedding.
    * Move to the next group of sentences and generate another embedding (e.g., using a sliding window approach).
    * Compare the embeddings to find significant differences, which indicate potential "break points" between semantic sections.

* This technique helps create chunks that are more semantically coherent, potentially improving the quality of downstream tasks like retrieval or summarization.

    [Splitting text based on semantic meaning](https://python.langchain.com/docs/how_to/semantic-chunker/)
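
The break-point idea can be sketched with a toy embedding. Everything here is an assumption for illustration: `embed` uses word counts instead of a real embedding model, `semantic_split` is a hypothetical function, and the `threshold` of 0.1 was chosen by hand for this tiny example. The `window` parameter sets how many sentences on each side of a candidate boundary are compared.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(sentences, threshold=0.1, window=1):
    """Start a new chunk wherever neighboring windows are too dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        before = " ".join(sentences[max(0, i - window):i])
        after = " ".join(sentences[i:i + window])
        if cosine(embed(before), embed(after)) < threshold:
            chunks.append([sentences[i]])      # break point found
        else:
            chunks[-1].append(sentences[i])    # same topic, keep extending
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The cat chased the mouse.",
    "The mouse hid from the cat.",
    "Stock prices rose sharply today.",
    "Investors expect prices to rise.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

The two animal sentences share several words and stay together, while the jump to the stock-market sentences has zero overlap and triggers a break.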

For more details: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb

Source: https://python.langchain.com/docs/concepts/text_splitters/

In [10]:
# character text splitter
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n", # paragraph separator
    chunk_size=2000, # chunk size
    chunk_overlap=200, # chunk overlap
    length_function=len, # function to calculate length
    is_separator_regex=False, # whether the separator is a regex
)

In [11]:
texts = text_splitter.create_documents([loaded_data[0].page_content])
texts

[Document(page_content="Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\n\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak's Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers.\n\nJobs saw the commercial potential of the Xerox Alto in 1979, wh

In [12]:
# number of chunks
len(texts)

55

In [14]:
# first chunk
texts[0].page_content

"Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\n\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak's Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers.\n\nJobs saw the commercial potential of the Xerox Alto in 1979, which was mouse-driven an

In [15]:
# Recursive Character Text Splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=26,
    chunk_overlap=4
)

texts = recursive_splitter.split_text(loaded_data[0].page_content)
texts

['Steven Paul Jobs (February',
 '24, 1955 – October 5,',
 '5, 2011) was an American',
 'businessman, inventor,',
 'and investor best known',
 'for co-founding the',
 'the technology company',
 'Apple Inc. Jobs was also',
 'the founder of NeXT and',
 'and chairman and majority',
 'shareholder of Pixar. He',
 'He was a pioneer of the',
 'the personal computer',
 'revolution of the 1970s',
 'and 1980s, along with his',
 'his early business',
 'partner and fellow Apple',
 'co-founder Steve Wozniak.',
 'Jobs was born in San',
 'San Francisco in 1955 and',
 'and adopted shortly',
 'afterwards. He attended',
 'Reed College in 1972',
 'before withdrawing that',
 'same year. In 1974, he',
 'he traveled through',
 'India, seeking',
 'enlightenment before',
 'later studying Zen',
 'Zen Buddhism. He and',
 'and Wozniak co-founded',
 'Apple in 1976 to further',
 'develop and sell',
 "Wozniak's Apple I",
 'I personal computer.',
 'Together, the duo gained',
 'fame and wealth a year',
 'later with pr

## Embeddings

* Embeddings are numerical representations of text (or other media formats) that capture relationships between inputs. Text embeddings work by converting text into arrays of floating point numbers, called vectors. These vectors are designed to capture the meaning of the text. The length of the embedding array is called the vector's dimensionality. A passage of text might be represented by a vector containing hundreds of dimensions.
* Imagine being able to capture the essence of any text - a tweet, document, or book - in a single, compact representation. This is the power of embedding models, which lie at the heart of many retrieval systems. Embedding models transform human language into a format that machines can understand and compare with speed and accuracy. These models take text as input and produce a fixed-length array of numbers, a numerical fingerprint of the text's semantic meaning. Embeddings allow search systems to find relevant documents based not just on keyword matches, but on semantic understanding.
* Embeddings capture semantic meaning and context, so texts with similar meanings have "closer" embeddings. For example, the sentences "I took my dog to the vet" and "I took my cat to the vet" would have embeddings that are close to each other in the vector space.
* You can use embeddings to compare different texts and understand how they relate. For example, if the embeddings of the texts "cat" and "dog" are close together, you can infer that these words are similar in meaning, context, or both. This enables a variety of common AI use cases.
* The base `Embeddings` class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former, `.embed_documents`, takes multiple texts as input, while the latter, `.embed_query`, takes a single text. They are separate methods because some embedding providers use different embedding methods for documents (to be searched over) and queries (the search query itself). `.embed_query` returns a list of floats, whereas `.embed_documents` returns a list of lists of floats.
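
To make the two return shapes and the notion of "closeness" concrete, here is a toy sketch. `ToyEmbeddings` is a hypothetical class and not part of LangChain. It only mirrors the two method names described above, and it counts words from a fixed vocabulary instead of running a learned model. Cosine similarity is used as the closeness measure, which is a common choice but an assumption here.

```python
import math
import re


class ToyEmbeddings:
    """Hypothetical embedder: one dimension per vocabulary word."""

    def __init__(self, vocabulary):
        self.index = {word: i for i, word in enumerate(vocabulary)}

    def embed_query(self, text):
        """Single text in, one vector (list of floats) out."""
        vector = [0.0] * len(self.index)
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in self.index:
                vector[self.index[word]] += 1.0
        return vector

    def embed_documents(self, texts):
        """Several texts in, one vector per text (list of lists of floats) out."""
        return [self.embed_query(text) for text in texts]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["i", "took", "my", "dog", "cat", "to", "the", "vet", "stock", "prices", "rose"]
model = ToyEmbeddings(vocabulary)

dog, cat, stock = model.embed_documents([
    "I took my dog to the vet",
    "I took my cat to the vet",
    "Stock prices rose",
])
print("dimensions:", len(dog))
print("dog vs cat:  ", round(cosine(dog, cat), 3))
print("dog vs stock:", round(cosine(dog, stock), 3))
```

The two vet sentences differ in a single word, so their vectors point in nearly the same direction, while the unrelated sentence shares no dimensions with them. A real model produces dense vectors in which even texts with no words in common can end up close together.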

For more information: 

https://python.langchain.com/docs/how_to/embed_text/

https://huggingface.co/blog/getting-started-with-embeddings

https://python.langchain.com/docs/concepts/embedding_models/

OpenAI Embeddings Models: https://platform.openai.com/docs/guides/embeddings

Gemini Embeddings Models: https://ai.google.dev/gemini-api/docs/embeddings

MistralAI Embeddings Model: https://docs.mistral.ai/capabilities/embeddings/

List of available Embeddings Models: https://python.langchain.com/docs/integrations/text_embedding/



In [16]:
# with OpenAI Embeddings model
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [17]:
chunks_of_text = [
    "Hi there!",
    "Hello!",
    "What's your name?",
    "Bond, James Bond",
    "Hello Bond!"
]

In [18]:
embeddings = embeddings_model.embed_documents(chunks_of_text)

In [19]:
embeddings

[[-0.020285392180085182,
  -0.007132277823984623,
  -0.022800425067543983,
  -0.02626812271773815,
  -0.037446048110723495,
  0.02160641923546791,
  -0.0061891404911875725,
  -0.008993147872388363,
  0.008446954190731049,
  -0.016589056700468063,
  0.026852423325181007,
  -0.00741172581911087,
  -0.01357863750308752,
  -0.024108750745654106,
  0.006490817293524742,
  -0.02025998756289482,
  0.024261176586151123,
  -0.014645621180534363,
  0.016449332237243652,
  -0.016525544226169586,
  -0.007272001821547747,
  -0.008053186349570751,
  0.004715686663985252,
  -0.002000594511628151,
  -0.014861558564007282,
  -0.0060589429922401905,
  -0.0020831585861742496,
  -0.022990958765149117,
  0.019853517413139343,
  -0.03160304203629494,
  0.012943528592586517,
  0.01167966052889824,
  -0.008529518730938435,
  -0.009526640176773071,
  -0.0017989472253248096,
  -0.02736051008105278,
  -0.008167506195604801,
  0.002154608489945531,
  0.02398172952234745,
  -0.008726402185857296,
  0.0235498547554

In [20]:
# number of dimensions of the first embedding
len(embeddings[0])

1536

In [22]:
# first 5 dimensions of the first embedding
print(embeddings[0][:5])

[-0.020285392180085182, -0.007132277823984623, -0.022800425067543983, -0.02626812271773815, -0.037446048110723495]


## Vector stores (Vector Databases)

* A vector store stores embedded data and performs similarity search.


For more information: https://python.langchain.com/docs/integrations/vectorstores/
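
Conceptually, a vector store keeps each text next to its embedding and ranks the stored texts by similarity to an embedded query. The sketch below is a toy, brute-force version of that idea. `ToyVectorStore` is a hypothetical class, its word-count `embed` function replaces a real embedding model, and it bears no relation to how Chroma or FAISS index vectors internally.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (vector, text) pairs."""

    def __init__(self, embed_fn):
        self._embed = embed_fn
        self._entries = []

    def add_texts(self, texts):
        for text in texts:
            self._entries.append((self._embed(text), text))

    def similarity_search(self, query, k=4):
        """Return the k stored texts most similar to the query."""
        query_vector = self._embed(query)
        ranked = sorted(self._entries, key=lambda entry: cosine(query_vector, entry[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(embed)
store.add_texts([
    "Paris is the capital of France",
    "Bread is baked in an oven",
    "The capital of Japan is Tokyo",
    "Rivers flow to the sea",
    "Violins have four strings",
])
print(store.similarity_search("capital of France", k=2))
```

Scanning every entry is fine for a handful of texts. Dedicated vector databases exist largely to avoid this full scan on large collections.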

In [23]:
# Chroma vector store
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# load the text file "steve_jobs.txt", split it into chunks, create embeddings and store them in a Chroma vector store
loaded_document = TextLoader("data/steve_jobs.txt").load()

text_splitter = CharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=0
)
chunks_of_text = text_splitter.split_documents(loaded_document)
print("Number of chunks:", len(chunks_of_text))
print("\n\n")

vector_db = Chroma.from_documents(chunks_of_text, OpenAIEmbeddings())

Number of chunks: 55





In [24]:
vector_db

<langchain_chroma.vectorstores.Chroma at 0x1236a4350>

In [27]:
question = "Where was Steve Jobs staying when he died?"
# similarity search
response = vector_db.similarity_search(question)
# top results
print(response[0].page_content)

Both Apple and Pixar issued announcements of his death. Apple announced on the same day that they had no plans for a public service, but were encouraging "well-wishers" to send their remembrance messages to an email address created to receive such messages. Apple and Microsoft both flew their flags at half-staff throughout their respective headquarters and campuses.

Bob Iger ordered all Disney properties, including Walt Disney World and Disneyland, to fly their flags at half-staff from October 6 to 12, 2011. For two weeks following his death, Apple displayed on its corporate website a simple page that showed Jobs's name and lifespan next to his portrait in grayscale. On October 19, 2011, Apple employees held a private memorial service for Jobs on the Apple campus in Cupertino. It was attended by Jobs's widow, Laurene, and by Tim Cook, Bill Campbell, Norah Jones, Al Gore, and Coldplay. Some of Apple's retail stores closed briefly so employees could attend the memorial. A video of the s

## Vector Store as Retriever

* A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, such as similarity search and maximal marginal relevance (MMR), to query the texts in the vector store.

For more information: 

https://python.langchain.com/docs/how_to/vectorstore_retriever/

https://python.langchain.com/docs/concepts/retrievers/
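
The difference between plain similarity search and MMR can be sketched on three hand-made 2-D vectors. This is an illustrative implementation of the general MMR idea (trade relevance to the query against redundancy with results already picked), not LangChain's code. `top_k_similar`, `mmr_select`, and the `lambda_mult` weight are names chosen for this sketch. In LangChain, the search method is typically chosen through the `search_type` argument of `as_retriever`, which this sketch does not use.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query, docs, k):
    """Plain similarity search: indices of the k most similar vectors."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]


def mmr_select(query, docs, k, lambda_mult=0.5):
    """Greedy MMR: reward relevance, penalize similarity to already selected docs."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.2]
docs = [
    [1.0, 0.0],  # 0: very relevant
    [1.0, 0.1],  # 1: very relevant, nearly a duplicate of 0
    [0.6, 0.8],  # 2: less relevant, but different content
]
print("similarity:", top_k_similar(query, docs, k=2))
print("mmr:       ", mmr_select(query, docs, k=2))
```

Similarity search returns the two near-duplicates, whereas MMR keeps the best match and then prefers the vector that adds something new. With `lambda_mult=1.0` the redundancy penalty vanishes and MMR reduces to the plain similarity ranking.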

In [28]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/steve_jobs.txt")

In [31]:
# FAISS vector store
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# load the text file "steve_jobs.txt", split it into chunks
loaded_document = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=0
)

chunks_of_text = text_splitter.split_documents(loaded_document)

# create embeddings and store them in a FAISS vector store
embeddings = OpenAIEmbeddings()

vector_db = FAISS.from_documents(chunks_of_text, embeddings)

In [32]:
vector_db

<langchain_community.vectorstores.faiss.FAISS at 0x122344ed0>

In [None]:
# retriever for the vector store
retriever = vector_db.as_retriever()

In [34]:
response = retriever.invoke("Where was Steve Jobs staying when he died?")
response

[Document(metadata={'source': 'data/steve_jobs.txt'}, page_content='Both Apple and Pixar issued announcements of his death. Apple announced on the same day that they had no plans for a public service, but were encouraging "well-wishers" to send their remembrance messages to an email address created to receive such messages. Apple and Microsoft both flew their flags at half-staff throughout their respective headquarters and campuses.\n\nBob Iger ordered all Disney properties, including Walt Disney World and Disneyland, to fly their flags at half-staff from October 6 to 12, 2011. For two weeks following his death, Apple displayed on its corporate website a simple page that showed Jobs\'s name and lifespan next to his portrait in grayscale. On October 19, 2011, Apple employees held a private memorial service for Jobs on the Apple campus in Cupertino. It was attended by Jobs\'s widow, Laurene, and by Tim Cook, Bill Campbell, Norah Jones, Al Gore, and Coldplay. Some of Apple\'s retail store

In [35]:
len(response)

4

By default, the retriever returns the top 4 search results.

In [36]:
# setting the number of results to return: top_k (in this case top 2 results)
retriever = vector_db.as_retriever(search_kwargs={"k": 2})
response = retriever.invoke("Where was Steve Jobs staying when he died?")
response

[Document(metadata={'source': 'data/steve_jobs.txt'}, page_content='Both Apple and Pixar issued announcements of his death. Apple announced on the same day that they had no plans for a public service, but were encouraging "well-wishers" to send their remembrance messages to an email address created to receive such messages. Apple and Microsoft both flew their flags at half-staff throughout their respective headquarters and campuses.\n\nBob Iger ordered all Disney properties, including Walt Disney World and Disneyland, to fly their flags at half-staff from October 6 to 12, 2011. For two weeks following his death, Apple displayed on its corporate website a simple page that showed Jobs\'s name and lifespan next to his portrait in grayscale. On October 19, 2011, Apple employees held a private memorial service for Jobs on the Apple campus in Cupertino. It was attended by Jobs\'s widow, Laurene, and by Tim Cook, Bill Campbell, Norah Jones, Al Gore, and Coldplay. Some of Apple\'s retail store