In [1]:
import tqdm
import numpy as np
import utils

# API Setup

In [2]:
from dotenv import load_dotenv
load_dotenv(dotenv_path="../.env")

True

# Dataset

We've abstracted away the code from the previous notebooks to focus on the concepts from this notebook.

In [3]:
data = utils.load_data(sample_size=100)

Repo card metadata block was not found. Setting CardData to empty.


# Preprocessing
We've abstracted away the code from the previous notebooks to focus on the concepts from this notebook.

In [4]:
documents = utils.preprocess_data(data)

100%|██████████| 100/100 [00:00<00:00, 2900.73it/s]


# Chunking
We've abstracted away the code from the previous notebooks to focus on the concepts from this notebook.

In [5]:
nodes = utils.chunk_documents(documents)

Parsing nodes:   0%|          | 0/100 [00:00<?, ?it/s]

Documents before chunking: 100
Documents after chunking: 837


# Embedding
We've abstracted away the code from the previous notebooks to focus on the concepts from this notebook.

In [6]:
embedding_model = utils.get_embedding_model(model_name="BAAI/bge-small-en-v1.5")



# Indexing

LlamaIndex stores embeddings in a [VectorStoreIndex](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/) object. This can be used with any vector store [supported](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/) by LlamaIndex. By default, this is a [SimpleIndex](https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/simple/) which is a flat index. 

We load all our chunks and embed them when creating a VectorStoreIndex:

In [7]:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes, embed_model=embedding_model, show_progress=True)

Generating embeddings:   0%|          | 0/837 [00:00<?, ?it/s]

### Query the Index

In [8]:
query = "How many points did Michael Jordan actually score in his final NBA game?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

print(f"Query: {query}")
print("---" * 30)
for i, result in enumerate(results):
    print(f"Rank {i+1}: {result.metadata['title']} ({result.score})")
    print(result.text[:100] + "...")
    print("---" * 30)

Query: How many points did Michael Jordan actually score in his final NBA game?
------------------------------------------------------------------------------------------
Rank 1: Michael Jordan (0.797218871120604)
With the recognition that 2002 – 03 would be Jordan 's final season , tributes were paid to him thro...
------------------------------------------------------------------------------------------
Rank 2: Michael Jordan (0.7971195354652177)
In an injury-plagued 2001 – 02 season , he led the team in scoring ( 22.9 ppg ) , assists ( 5.2 apg ...
------------------------------------------------------------------------------------------
Rank 3: Michael Jordan (0.7867602798002393)
Jordan led the league with 28.7 points per game , securing his fifth regular-season MVP award , plus...
------------------------------------------------------------------------------------------


In [9]:
%%timeit
query = "How many points did Michael Jordan actually score in his final NBA game?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

67.5 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Vectorstore / Vector DB

We use LanceDB, an embedded vector DB, instead of the default index used above. We use an embedded database as it's easier to setup and runs in-process. You can learn more about LanceDB [here](https://lancedb.github.io/lancedb/). As described [here](https://lancedb.github.io/lancedb/ann_indexes/), LanceDB uses a disk-based IVF-PQ index. It is noted on the site that this is usually only necessary when you have 100k+ samples. 

In [10]:
# https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo/
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core import StorageContext

# Create your DB locally
vector_store = LanceDBVectorStore(
    uri="./lancedb", table_name="test"
)
# Link to the collection on llamaindex
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [11]:
# Embed and index
vdb_index = VectorStoreIndex(nodes, embed_model=embedding_model, storage_context=storage_context, show_progress=True)

Generating embeddings:   0%|          | 0/837 [00:00<?, ?it/s]

[2024-06-13T23:35:30Z WARN  lance::dataset] No existing dataset at /Users/akashsaravanan/Downloads/GenAI Bootcamp/genai-bootcamp/notebooks/lancedb/test.lance, it will be created


### Query the Index

In [12]:
query = "How many points did Michael Jordan actually score in his final NBA game?"
results = vdb_index.as_retriever(similarity_top_k=3).retrieve(query)

print(f"Query: {query}")
print("---" * 30)
for i, result in enumerate(results):
    print(f"Rank {i+1}: {result.metadata['title']} ({result.score})")
    print(result.text[:100] + "...")
    print("---" * 30)

Query: How many points did Michael Jordan actually score in his final NBA game?
------------------------------------------------------------------------------------------
Rank 1: Michael Jordan (0.6666018962860107)
With the recognition that 2002 – 03 would be Jordan 's final season , tributes were paid to him thro...
------------------------------------------------------------------------------------------
Rank 2: Michael Jordan (0.6664695143699646)
In an injury-plagued 2001 – 02 season , he led the team in scoring ( 22.9 ppg ) , assists ( 5.2 apg ...
------------------------------------------------------------------------------------------
Rank 3: Michael Jordan (0.6528033018112183)
Jordan led the league with 28.7 points per game , securing his fifth regular-season MVP award , plus...
------------------------------------------------------------------------------------------


In [13]:
%%timeit
query = "How many points did Michael Jordan actually score in his final NBA game?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

47 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Check Index Disk Size

In [14]:
from pathlib import Path

def get_size(folder: str) -> int:
    return sum(p.stat().st_size for p in Path(folder).rglob('*'))

In [15]:
print(f"Index is {get_size('lancedb/test.lance') / (1024 ** 3):.4f} GB")

Index is 2.2392 GB
