# Ingestion Pipeline

Previously, we used a low-level API to execute each step individually. We can simplify this by running the steps as a single pipeline. Different frameworks expose this in different ways, but the overall sequence of steps stays the same.

In [1]:
import tqdm
import numpy as np
import utils

# API Setup

In [2]:
from dotenv import load_dotenv
load_dotenv(dotenv_path="../.env")

True

# Dataset

We've abstracted away the code from the previous notebooks to focus on the concepts from this notebook.

In [3]:
data = utils.load_data(sample_size=100)

Repo card metadata block was not found. Setting CardData to empty.


# Preprocessing

As with the dataset loading, the preprocessing code from the previous notebooks is abstracted away so we can focus on the concepts in this notebook.

Note: These preprocessing steps could also be implemented as custom transformations inside the ingestion pipeline itself. [[Reference](https://docs.llamaindex.ai/en/stable/examples/ingestion/advanced_ingestion_pipeline/#custom-transformation)]

In [4]:
documents = utils.preprocess_data(data)

100%|██████████| 100/100 [00:00<00:00, 3226.54it/s]
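
To make the note about custom transformations more concrete, here is a minimal sketch of the general shape: a callable that takes a list of nodes and returns the (modified) list. The `Node` dataclass and `WhitespaceCleaner` class are hypothetical stand-ins written without the library so the snippet runs on its own; in LlamaIndex, the linked reference describes the actual base class to subclass.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a framework node: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


class WhitespaceCleaner:
    """Illustrative transformation: collapse runs of whitespace in each node."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # Replace any run of spaces/newlines with a single space.
            node.text = re.sub(r"\s+", " ", node.text).strip()
        return nodes


cleaned = WhitespaceCleaner()([Node("Michael   Jordan \n played  basketball. ")])
print(cleaned[0].text)
```

A step like this could sit in front of the chunking step, so that every later transformation sees already-cleaned text.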


# Embedding Model

In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Create embedding model
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=32)
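
The `embed_batch_size=32` argument controls how many texts are grouped per embedding call. As a rough sketch of what that means (the `batched` helper is hypothetical, and how the library schedules batches internally is an assumption), the 852 chunks reported by the pipeline's progress bar later in this notebook would be split like this:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 852 matches the chunk count shown in the "Generating embeddings" output.
chunks = [f"chunk {i}" for i in range(852)]
batches = list(batched(chunks, 32))

print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partial batch
```

Larger batches usually mean fewer calls into the model at the cost of more memory per call, so the right value depends on the hardware.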



# Vector DB

In [6]:
# https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo/
from llama_index.vector_stores.lancedb import LanceDBVectorStore

# Create your DB locally
vector_store = LanceDBVectorStore(
    uri="./lancedb", table_name="pipeline_test"
)

# Ingestion Pipeline

We add each transformation to the pipeline in the order it should be applied to the input. In this case, we chunk the documents with the `SentenceSplitter` and then embed the chunks with our embedding model. The transformed nodes are then stored in the vector store we created above.

In [7]:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        embedding_model,
    ],
    vector_store=vector_store
)
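
To build intuition for `chunk_size=512` and `chunk_overlap=20`, here is a simplified sliding-window sketch. It is not how `SentenceSplitter` is implemented: the real splitter also tries to respect sentence boundaries, which this sketch ignores. The `split_with_overlap` function and the dummy token list are hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(1200)]  # dummy tokens for illustration
chunks = split_with_overlap(tokens, chunk_size=512, chunk_overlap=20)

print([len(c) for c in chunks])
print(chunks[0][-20:] == chunks[1][:20])  # neighbouring chunks share 20 tokens
```

The overlap means that a sentence cut at a chunk boundary still appears with some surrounding context in the next chunk.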

In [8]:
# Run the pipeline
nodes = pipeline.run(documents=documents, show_progress=True)

Parsing nodes:   0%|          | 0/100 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/852 [00:00<?, ?it/s]

[2024-06-11T15:43:01Z WARN  lance::dataset] No existing dataset at /Users/akashsaravanan/Downloads/GenAI Bootcamp/genai-bootcamp/notebooks/lancedb/pipeline_test.lance, it will be created
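
Conceptually, running the pipeline amounts to applying each transformation in order and then writing the result to the store. The sketch below illustrates that idea with plain Python; `run_pipeline`, `split_sentences`, and `fake_embed` are hypothetical, the "embedding" is a toy two-number vector, and a list stands in for the vector store.

```python
def run_pipeline(documents, transformations, store):
    """Apply each transformation in order, then persist the final nodes."""
    nodes = documents
    for transform in transformations:
        nodes = transform(nodes)
    store.extend(nodes)
    return nodes


def split_sentences(docs):
    """Toy chunker: one chunk per sentence, split on periods."""
    return [s.strip() for doc in docs for s in doc.split(".") if s.strip()]


def fake_embed(chunks):
    """Toy embedding: [character count, vowel count] for each chunk."""
    return [
        (chunk, [len(chunk), sum(ch in "aeiou" for ch in chunk.lower())])
        for chunk in chunks
    ]


store = []  # stand-in for a vector store
nodes = run_pipeline(
    ["First sentence. Second one.", "Third."],
    [split_sentences, fake_embed],
    store,
)
print(nodes)
```

Because each step only consumes the previous step's output, steps can be added, removed, or reordered without touching the others.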


In [9]:
# Load the index from disk
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embedding_model)

# Retrieval

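The retrieval process remains the same as before. The cells below re-create the embedding model and vector store, as we would in a fresh session, before loading the index from disk and querying it.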

In [10]:
# Load embedding model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=32)



In [11]:
# Load the index from disk
vector_store = LanceDBVectorStore(
    uri="./lancedb", table_name="pipeline_test"
)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embedding_model,
)

In [12]:
query = "How many points did Michael Jordan actually score in his final NBA game?"
results = index.as_retriever(similarity_top_k=3).retrieve(query)

print(f"Query: {query}")
print("---" * 30)
for i, result in enumerate(results):
    print(f"Rank {i+1}: {result.metadata['title']} ({result.score})")
    print(result.text[:100] + "...")
    print("---" * 30)

Query: How many points did Michael Jordan actually score in his final NBA game?
------------------------------------------------------------------------------------------
Rank 1: Michael Jordan (0.6712355613708496)
At several points he openly criticized his teammates to the media , citing their lack of focus and i...
------------------------------------------------------------------------------------------
Rank 2: Michael Jordan (0.6520943641662598)
Jordan led the NBA in scoring in 10 seasons ( NBA record ) and tied Wilt Chamberlain 's record of se...
------------------------------------------------------------------------------------------
Rank 3: Michael Jordan (0.6459461450576782)
After winning , they moved on for a rematch with the Jazz in the Finals . 
 The Bulls returned to th...
------------------------------------------------------------------------------------------
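
The scores in parentheses above are similarity scores between the query embedding and each chunk embedding. As a minimal sketch of top-k retrieval, assuming cosine similarity and non-zero vectors (the actual metric depends on how the vector store is configured), with hypothetical helper names and made-up 2-D vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up 2-D "embeddings" purely for illustration.
indexed = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [0.7, 0.7]),
    ("chunk C", [0.0, 1.0]),
    ("chunk D", [-1.0, 0.0]),
]
results = top_k([1.0, 0.1], indexed, k=3)

for rank, (score, text) in enumerate(results, start=1):
    print(f"Rank {rank}: {text} ({score:.3f})")
```

A real vector store avoids scoring every chunk by using an approximate nearest-neighbour index, but the ranking idea is the same.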
