In [1]:
import os
import tqdm
import numpy as np
import pandas as pd

# API Setup

In [2]:
from dotenv import load_dotenv
load_dotenv(dotenv_path="../.env")

True

# Dataset

For this example we use the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset. Specifically, we use "Simple Wikipedia", due to its smaller size. The link also describes other variants of Wikipedia the dataset contains (including alternate languages and the full English Wikipedia).

Our dataset contains 205,328 documents but we sample 1000 for the purposes of this demo. Each datapoint contains the url of the page, the title of the page, and the text available on that page. For our example, we store the url and title as metadata and explicitly do not send the url to either the embedding model or the LLM.

In [3]:
from datasets import load_dataset
data = load_dataset("wikipedia", "20220301.simple", trust_remote_code=True)

In [4]:
sample_size = 1000
np.random.seed(42)
data = data["train"][np.random.choice(data["train"].shape[0], size=sample_size)]

In [5]:
data_index = 58
print(data["url"][data_index] + "\nTitle: " + data["title"][data_index] + "\nText: " + data["text"][data_index][:500] + "...")

https://simple.wikipedia.org/wiki/Robert%20Wagner
Title: Robert Wagner
Text: Robert John Wagner, Jr. (born February 10, 1930 in Detroit, Michigan, U.S.), is an American actor. His paternal grandparents were from Germany.

Early career 
His career began in 1950 as an extra in the movie The Happy Years. Then he got a role in the war films Halls of Montezuma (1951), and The Frogmen (1951), with Richard Widmark, What Price Glory? (1952) with James Cagney and directed by John Ford.

In 1953 he was nominated for a Golden Globe for Stars and Stripes Forever (1952).

Then would ...


In [6]:
# We load our dataset into a list of llamaindex documents
from llama_index.core import Document

documents = []
for i in tqdm.tqdm(range(len(data["text"]))):
    documents.append(
        Document(
            text=data["text"][i],
            metadata={"title": data["title"][i], "url": data["url"][i]},
            excluded_embed_metadata_keys=["url"], # We don't embed the url
            excluded_llm_metadata_keys=["url"], # We don't send the url to LLM
        )
    )

100%|██████████| 1000/1000 [00:00<00:00, 24050.60it/s]


In [7]:
documents[0]

Document(id_='2d1934ad-ff3b-477e-bf75-ae53ffbc8609', embedding=None, metadata={'title': 'Halal snack pack', 'url': 'https://simple.wikipedia.org/wiki/Halal%20snack%20pack'}, excluded_embed_metadata_keys=['url'], excluded_llm_metadata_keys=['url'], relationships={}, text='A halal snack pack, or HSP, is a dish that comes from Australia. It is made up of halal-certified doner kebab meat (mainly lamb, chicken or beef), chips, sauces (mainly chili, garlic and barbecue sauces) and often cheese. The exact origin of the halal snack pack is unknown. Halal snack packs have become more popular since 2015, when the Facebook group "Halal Snack Pack Appreciation Society" was created.\n\nReferences\n\nAustralia\nFast food', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

# Preprocessing

**Data Cleaning:** In the case of this dataset, data is extracted at a particular point in time, so there's no chances of old or conflicting information. 

**Data Extraction:** In addition, as this is a preprocessed dataset used by the general community, it is also clean with no real need to optimize the extraction.

**PII:** Wikipedia does not contain PII.

**Data Enrichment:** While we don't use this information in this demo, we can store the url and page titles as part of the metadata. Given this information already exists, we just store it alongside our embeddings further on in this notebook. 

# Chunking

We first demonstrate basic fixed size chunking with overlaps via llamaindex's [TokenTextSplitter](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/).

However, for the rest of the demo, we use a mix of two chunking methods - fixed-size chunking and content-aware chunking. Specifically, we use llamaindex's [SentenceSplitter](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/) which attempts to keep sentences and paragraphs together while maintaining chunks of roughly equal size.

We use a chunk size of 512 which is the maximum size that our embedding model accepts. Within these chunks, we set an overlap of 20 tokens which corresponds to roughly 1-2 sentences.

You can examine other chunking methods implemented in llamaindex [here](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).

### Fixed Size Chunking

In [8]:
from llama_index.core.node_parser import TokenTextSplitter
chunker = TokenTextSplitter(chunk_size=512, chunk_overlap=20)

In [9]:
sample = data["text"][data_index]
chunked = chunker.split_text(sample)
print(f"Document has ~{len(sample.split(' '))} words and was split into {len(chunked)} chunks of maximum 512 tokens.")

Document has ~581 words and was split into 2 chunks of maximum 512 tokens.


In [10]:
# Can result in sentences being cut off in the middle
chunked[0]

"Robert John Wagner, Jr. (born February 10, 1930 in Detroit, Michigan, U.S.), is an American actor. His paternal grandparents were from Germany.\n\nEarly career \nHis career began in 1950 as an extra in the movie The Happy Years. Then he got a role in the war films Halls of Montezuma (1951), and The Frogmen (1951), with Richard Widmark, What Price Glory? (1952) with James Cagney and directed by John Ford.\n\nIn 1953 he was nominated for a Golden Globe for Stars and Stripes Forever (1952).\n\nThen would have outstanding performances in movies as Prince Valiant (1954) directed by Henry Hathaway and the Western, Broken Lance (1954) with Spencer Tracy and directed by Edward Dmytryk. \n\nIn 1955 he obtained his first starred in the Western White Feather (1955), followed by Film-Noir A Kiss Before Dying (1956) and war film Between Heaven and Hell (1956) directed by Richard Fleischer.\n\nWagner played the legendary gunslinger Jesse James in the movie The True Story of Jesse James (1957) direc

In [11]:
chunked[1]

'Austin Powers in Goldmember (2002).\n\nIn the last decade, Wagner has worked in movies El padrino (2004), Hoot (2006), Man in the Chair (2007) with Christopher Plummer, The Wild Stallion (2009).\n\nTelevision \nRobert Wagner starred in three successful series for American television. One of them was It Takes a Thief, Alexander Mundy a spy performing dangerous missions for the government of the United States. I also work in the series Fred Astaire as Alistair Mundy.\n\nWagner starred in the series for 66 episodes between 1968 and 1970. and was nominated for an Emmy and Golden Globe Awards in 1970\n\nIn 1975 Wagner stars with Eddie Albert, the series of detectives created by Glen A. Larson, Switch (1975 - 1978). he plays detective Pete T. Ryan.\n\nHis television series Hart to Hart (1979-1984), was most successful of all, Wagner is Jonathan Hart who with his wife Jennifer Hart, Stefanie Powers, two detectives who solved the most difficult criminal cases in high society.  \n\nThe series 

### Content Aware Chunking

In [12]:
from llama_index.core.node_parser import SentenceSplitter
chunker = SentenceSplitter(chunk_size=512, chunk_overlap=20)

In [13]:
sample = data["text"][data_index]
chunked = chunker.split_text(sample)
print(f"Document has {len(sample.split(' '))} words and was split into {len(chunked)} chunks of maximum 512 tokens.")

Document has 581 words and was split into 2 chunks of maximum 512 tokens.


In [14]:
# Does not cut sentences off in the middle
chunked[0]

"Robert John Wagner, Jr. (born February 10, 1930 in Detroit, Michigan, U.S.), is an American actor. His paternal grandparents were from Germany.\n\nEarly career \nHis career began in 1950 as an extra in the movie The Happy Years. Then he got a role in the war films Halls of Montezuma (1951), and The Frogmen (1951), with Richard Widmark, What Price Glory? (1952) with James Cagney and directed by John Ford.\n\nIn 1953 he was nominated for a Golden Globe for Stars and Stripes Forever (1952).\n\nThen would have outstanding performances in movies as Prince Valiant (1954) directed by Henry Hathaway and the Western, Broken Lance (1954) with Spencer Tracy and directed by Edward Dmytryk. \n\nIn 1955 he obtained his first starred in the Western White Feather (1955), followed by Film-Noir A Kiss Before Dying (1956) and war film Between Heaven and Hell (1956) directed by Richard Fleischer.\n\nWagner played the legendary gunslinger Jesse James in the movie The True Story of Jesse James (1957) direc

In [15]:
# Note that there's no overlap here as the content starts in a new paragraph
# If you change the chunk size above and try this, you'll see that overlaps 
# occur when content is chunked mid-paragraph
chunked[1]

'In the last decade, Wagner has worked in movies El padrino (2004), Hoot (2006), Man in the Chair (2007) with Christopher Plummer, The Wild Stallion (2009).\n\nTelevision \nRobert Wagner starred in three successful series for American television. One of them was It Takes a Thief, Alexander Mundy a spy performing dangerous missions for the government of the United States. I also work in the series Fred Astaire as Alistair Mundy.\n\nWagner starred in the series for 66 episodes between 1968 and 1970. and was nominated for an Emmy and Golden Globe Awards in 1970\n\nIn 1975 Wagner stars with Eddie Albert, the series of detectives created by Glen A. Larson, Switch (1975 - 1978). he plays detective Pete T. Ryan.\n\nHis television series Hart to Hart (1979-1984), was most successful of all, Wagner is Jonathan Hart who with his wife Jennifer Hart, Stefanie Powers, two detectives who solved the most difficult criminal cases in high society.  \n\nThe series created by Sidney Sheldon, was an immed

### Perform Chunking on Dataset

Now we perform chunking on our entire dataset. We use the chunker's in-built function to do this on our list of documents. We get a list of [TextNodes](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_nodes/) as output. 

In [16]:
nodes = chunker.get_nodes_from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/1000 [00:00<?, ?it/s]

In [17]:
nodes[0]

TextNode(id_='fbe900c5-7ae6-4a65-a1a3-02e99913b46a', embedding=None, metadata={'title': 'Halal snack pack', 'url': 'https://simple.wikipedia.org/wiki/Halal%20snack%20pack'}, excluded_embed_metadata_keys=['url'], excluded_llm_metadata_keys=['url'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='2d1934ad-ff3b-477e-bf75-ae53ffbc8609', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'title': 'Halal snack pack', 'url': 'https://simple.wikipedia.org/wiki/Halal%20snack%20pack'}, hash='18afa25a1df2aeb36bc851ab9b89d218845aabd0f35d1072b044b4c2f71d611b')}, text='A halal snack pack, or HSP, is a dish that comes from Australia. It is made up of halal-certified doner kebab meat (mainly lamb, chicken or beef), chips, sauces (mainly chili, garlic and barbecue sauces) and often cheese. The exact origin of the halal snack pack is unknown. Halal snack packs have become more popular since 2015, when the Facebook group "Halal Snack Pack Appreciation Society" was created.\n\nReferen

# Embedding

For our embedding model we use the [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model. This is an MIT license model which is, at the time of writing, 44th on the [MTEB Retrieval leaderboard](https://huggingface.co/spaces/mteb/leaderboard). While there are better models available (as seen on the leaderboard), we choose this model for the demo as it's very small (33M parameters / ~120MB) and hence, very fast. The generated embeddings are 384-dimensional.

In [18]:
# Load embedding model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=32)



In [19]:
# Example of how to get an embedding and what it looks like
print(np.array(embedding_model.get_text_embedding(nodes[0].text)).shape)
np.array(embedding_model.get_text_embedding(nodes[0].text))

(384,)


array([-6.65795133e-02,  8.12166780e-02, -4.37078811e-03, -3.80715914e-02,
        3.39519978e-02,  9.77202877e-03,  5.77810593e-02, -1.30435536e-02,
       -3.30868997e-02, -2.43122838e-02, -1.58980209e-02, -8.46297741e-02,
       -7.33705889e-03,  4.63541150e-02,  3.02996412e-02, -3.61367203e-02,
        3.10154110e-02, -6.56468794e-02, -2.36510281e-02, -2.89180316e-02,
        2.44261045e-02,  7.21716508e-03, -6.03428483e-02, -6.56444952e-02,
        6.78342879e-02,  1.44856526e-02,  2.76592355e-02,  2.35805195e-02,
       -9.50826034e-02, -1.39449999e-01,  5.08472845e-02,  2.00165957e-02,
        6.72311615e-03, -5.49212694e-02, -4.95373532e-02,  8.13274551e-03,
        3.17828469e-02, -1.74102969e-02, -3.71145159e-02,  3.27568837e-02,
        4.01726142e-02,  5.12109138e-02,  3.02632526e-02, -1.65336076e-02,
        9.14637279e-03, -3.67910750e-02, -3.71675827e-02,  2.77871490e-02,
        6.12801351e-02,  2.07614843e-02, -7.96210486e-03,  4.86738347e-02,
       -2.19224058e-02, -

### Embedding & Indexing

Llamaindex stores embeddings in a [VectorStoreIndex](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/) object. This can be any used with any vector store [supported](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/) by Llamaindex. By default, this is a [SimpleIndex](https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemo/) which is a flat index. 

We can load all our chunks and embed them when creating a VectorStoreIndex:

In [20]:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes, embedding_model=embedding_model, show_progress=True)

Generating embeddings:   0%|          | 0/1201 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


# Retrieval

Now that we have a vector index, we can query this index. Internally, the query is converted into an embedding and a flat similarity search occurs.

In [21]:
query = "Who is Robert Wagner?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

print(f"Query: {query}")
print("---" * 30)
for i, result in enumerate(results):
    print(f"Rank {i+1}: {result.metadata['title']} ({result.score})")
    print(result.text[:100] + "...")
    print("---" * 30)

Query: Who is Robert Wagner?
------------------------------------------------------------------------------------------
Rank 1: Robert Wagner (0.9024011659794268)
Robert John Wagner, Jr. (born February 10, 1930 in Detroit, Michigan, U.S.), is an American actor. H...
------------------------------------------------------------------------------------------
Rank 2: Robert Wagner (0.8440168825342981)
In the last decade, Wagner has worked in movies El padrino (2004), Hoot (2006), Man in the Chair (20...
------------------------------------------------------------------------------------------
Rank 3: Hans Richter (0.8048237868714503)
Hans Richter (born as Raab (now Györ) 4 April 1843; died Bayreuth 5 December 1916) was an Austro-Hun...
------------------------------------------------------------------------------------------


In [22]:
%%timeit
results = index.as_retriever(similarity_top_k=3).retrieve(query)

347 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Basic RAG

So now we have an index and a way to query the index. To close the RAG loop here, we need to get an input from a user, pass it to the retriever, send the retrieved context to the LLM, and return the output. 

In [40]:
user_input = "What movies did Robert Wagner make?"

In [41]:
# Build our prompt
system_prompt = "You are a helpful AI assistant that can answer questions about a wide range of topics. Use the given context to answer the user's question. If the information is not present in the context, you can say 'I don't know'. Otherwise, include the URL of the source you used."

prompt = """Context: 
{context} 
-----
Question: {question} 
Answer: """

### LLM Choice

This demo supports two different APIs for models - OpenAI and AnyScale. Specifically, we use `gpt-4o` and `meta-llama/Meta-Llama-3-70B-Instruct` but any supported model should work. Note that AnyScale uses the same structure for API calls as OpenAI, just with a different url and API key.

In [42]:
def retrieve(index, user_input):
    results = index.as_retriever(similarity_top_k=3).retrieve(user_input)
    return "\n----------------\n".join([result.metadata["title"] + "\n" + result.text + f"\nURL: {result.metadata['url']}" for result in results])

In [43]:
# OpenAI
from openai import OpenAI
# Set up OpenAI client
client = OpenAI(
    api_key = os.environ.get('OPENAI_API_KEY'),
    base_url = os.environ.get('OPENAI_BASE_URL')
)

# Run retrieval
context = retrieve(index, user_input)

# Get output
chat_completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": prompt.format(context=context, question=user_input)}],
    temperature=0.1
)
print(chat_completion.choices[0].message.content)

Robert Wagner made numerous movies throughout his career. Some of the notable ones include:

- The Happy Years (1950) (as an extra)
- Halls of Montezuma (1951)
- The Frogmen (1951)
- What Price Glory? (1952)
- Stars and Stripes Forever (1952)
- Prince Valiant (1954)
- Broken Lance (1954)
- White Feather (1955)
- A Kiss Before Dying (1956)
- Between Heaven and Hell (1956)
- The True Story of Jesse James (1957)
- The Longest Day (1962)
- The War Lover (1962)
- The Pink Panther (1963)
- Banning (1967)
- The Towering Inferno (1974)
- Midway (1976)
- The Concorde... Airport '79 (1979)
- Curse of the Pink Panther (1983)
- I Am the Cheese (1983)
- Austin Powers: International Man of Mystery (1997)
- Austin Powers: The Spy Who Shagged Me (1999)
- Austin Powers in Goldmember (2002)
- El padrino (2004)
- Hoot (2006)
- Man in the Chair (2007)
- The Wild Stallion (2009)

For more details, you can visit the source: [Robert Wagner on Simple Wikipedia](https://simple.wikipedia.org/wiki/Robert%20Wagne

In [44]:
# AnyScale
from openai import OpenAI
# Set up client
client = OpenAI(
    api_key = os.environ.get('ANYSCALE_API_KEY'),
    base_url = os.environ.get('ANYSCALE_BASE_URL')
)

# Run retrieval
context = retrieve(index, user_input)

# Get output
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": prompt.format(context=context, question=user_input)}],
    temperature=0.1
)
print(chat_completion.choices[0].message.content)

According to the provided context, Robert Wagner made the following movies:

1. The Happy Years (1950)
2. Halls of Montezuma (1951)
3. The Frogmen (1951)
4. What Price Glory? (1952)
5. Stars and Stripes Forever (1952)
6. Prince Valiant (1954)
7. Broken Lance (1954)
8. White Feather (1955)
9. A Kiss Before Dying (1956)
10. Between Heaven and Hell (1956)
11. The True Story of Jesse James (1957)
12. The Longest Day (1962)
13. The War Lover (1962)
14. The Pink Panther (1963)
15. Banning (1967)
16. The Towering Inferno (1974)
17. Midway (1976)
18. The Concorde... Airport '79 (1979)
19. Curse of the Pink Panther (1983)
20. I Am the Cheese (1983)
21. Austin Powers (1997)
22. Austin Powers: The Spy Who Shagged Me (1999)
23. Austin Powers in Goldmember (2002)
24. El padrino (2004)
25. Hoot (2006)
26. Man in the Chair (2007)
27. The Wild Stallion (2009)

Source: https://simple.wikipedia.org/wiki/Robert%20Wagner


# Next Steps: Using a Vectorstore (Vector DB)

We use LanceDB instead of the default index used above. As described [here](https://lancedb.github.io/lancedb/ann_indexes/), LanceDB uses a disk-based IVF-PQ index. As they note in the same page, this is usually only necessary when you have 100k+ samples. 

In [28]:
# https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo/
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core import StorageContext

# Create your DB locally
vector_store = LanceDBVectorStore(
    uri="./lancedb", table_name="test"
)
# Link to the collection on llamaindex
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [29]:
# Embed and index
index = VectorStoreIndex(nodes, embed_model=embedding_model, storage_context=storage_context, show_progress=True)

Generating embeddings:   0%|          | 0/1201 [00:00<?, ?it/s]

[2024-06-03T20:33:48Z WARN  lance::dataset] No existing dataset at /Users/akashsaravanan/Downloads/GenAI Bootcamp/genai-bootcamp/notebooks/lancedb/test.lance, it will be created


In [30]:
# Load the index from disk
vector_store = LanceDBVectorStore(
    uri="./lancedb", table_name="test"
)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embedding_model,
)

In [31]:
query = "Who is Robert Wagner?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

print(f"Query: {query}")
print("---" * 30)
for i, result in enumerate(results):
    print(f"Rank {i+1}: {result.metadata['title']} ({result.score})")
    print(result.text[:100] + "...")
    print("---" * 30)

Query: Who is Robert Wagner?
------------------------------------------------------------------------------------------
Rank 1: Robert Wagner (0.6897997260093689)
Robert John Wagner, Jr. (born February 10, 1930 in Detroit, Michigan, U.S.), is an American actor. H...
------------------------------------------------------------------------------------------
Rank 2: Robert Wagner (0.6023688316345215)
In the last decade, Wagner has worked in movies El padrino (2004), Hoot (2006), Man in the Chair (20...
------------------------------------------------------------------------------------------
Rank 3: Hans Richter (0.46540871262550354)
Hans Richter (born as Raab (now Györ) 4 April 1843; died Bayreuth 5 December 1916) was an Austro-Hun...
------------------------------------------------------------------------------------------


In [32]:
%%timeit
results = index.as_retriever(similarity_top_k=3).retrieve(query)

24.7 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Next Steps: Ingestion Pipeline

While we use a low level API above to execute each step individuall, we can also build this in the form of a pipeline. For the sake of completeness, we repeat all steps from the start including the data loading.

In [33]:
# Load our data again
from datasets import load_dataset
data = load_dataset("wikipedia", "20220301.simple", trust_remote_code=True)

sample_size = 1000
np.random.seed(42)
data = data["train"][np.random.choice(data["train"].shape[0], size=sample_size)]

In [34]:
# We load our dataset into a list of llamaindex documents
from llama_index.core import Document

documents = []
for i in tqdm.tqdm(range(len(data["text"]))):
    documents.append(
        Document(
            text=data["text"][i],
            metadata={"title": data["title"][i], "url": data["url"][i]},
            excluded_embed_metadata_keys=["url"], # We don't embed the url
            excluded_llm_metadata_keys=["url"], # We don't send the url to LLM
        )
    )

100%|██████████| 1000/1000 [00:00<00:00, 22805.05it/s]


In [35]:
# https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo/
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core import StorageContext

# Create your DB locally
vector_store = LanceDBVectorStore(
    uri="./lancedb", table_name="pipeline_test"
)

In [36]:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=32),
    ],
    vector_store=vector_store
)



In [37]:
nodes = pipeline.run(documents=documents, show_progress=True)

Parsing nodes:   0%|          | 0/1000 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1201 [00:00<?, ?it/s]

[2024-06-03T20:41:39Z WARN  lance::dataset] No existing dataset at /Users/akashsaravanan/Downloads/GenAI Bootcamp/genai-bootcamp/notebooks/lancedb/pipeline_test.lance, it will be created


In [38]:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes, embed_model=embedding_model)

In [39]:
query = "Who is Robert Wagner?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

print(f"Query: {query}")
print("---" * 30)
for i, result in enumerate(results):
    print(f"Rank {i+1}: {result.metadata['title']} ({result.score})")
    print(result.text[:100] + "...")
    print("---" * 30)

Query: Who is Robert Wagner?
------------------------------------------------------------------------------------------
Rank 1: Robert Wagner (0.814323011795492)
Robert John Wagner, Jr. (born February 10, 1930 in Detroit, Michigan, U.S.), is an American actor. H...
------------------------------------------------------------------------------------------
Rank 2: Robert Wagner (0.7465573569177086)
In the last decade, Wagner has worked in movies El padrino (2004), Hoot (2006), Man in the Chair (20...
------------------------------------------------------------------------------------------
Rank 3: Hans Richter (0.6175803556028695)
Hans Richter (born as Raab (now Györ) 4 April 1843; died Bayreuth 5 December 1916) was an Austro-Hun...
------------------------------------------------------------------------------------------
