In [5]:
import add_packages

import logging, sys

from my_configs import constants
from toolkit.llamaindex import (
	node_parsers, readers, cores, stores, ingestions, schemas, embeddings,
	query_engines, postprocessors,
	llms,  # used below as llms.openai_llms; assumed to live in this package
)

cores.Settings.llm = llms.openai_llms["GPT-3.5-TURBO"]
cores.Settings.embed_model = embeddings.openai_embeddings["TEXT_EMBED_ADA_002"]
cores.Settings.chunk_size = 512

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Using LLMs

Choose the LLM for your application; you can use more than one.

LLMs are used at several stages of the pipeline:

- During Indexing: an LLM can determine the relevance of data, or summarize the raw data so that the summaries are indexed instead.

- During Querying

  - During Retrieval (fetching data from the index), an LLM can be given a set of options and decide where best to find the information. An agentic LLM can also use tools to query different data sources.

  - During Response Synthesis, an LLM can combine answers to multiple sub-queries into a single coherent answer or transform data from unstructured text to JSON or another programmatic output format.

LlamaIndex provides a single interface to many LLMs, so you can choose any of them for any pipeline stage.

Instantiate an LLM and pass it to Settings so that the other pipeline stages use it by default.

In [None]:
# Use the preconfigured gpt-3.5-turbo LLM as the global default.
# VectorStoreIndex uses this LLM when answering queries.
from llama_index.core import Settings

Settings.llm = llms.openai_llms["GPT-3.5-TURBO"]

documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()
index = stores.VectorStoreIndex.from_documents(documents)

## Available LLMs

LlamaIndex integrates with OpenAI, Hugging Face, PaLM, and more.


### Local LLM

LlamaIndex supports hosted LLM APIs and also lets you run a local model such as Llama2.

If Ollama is installed and running, it can serve the local model in place of a hosted API.

## Prompts

LlamaIndex has built-in prompts that handle data formatting, a key benefit of using the platform. Customization is also an option.


# Loading Data (Ingestion)

Before an LLM can act on your data, you must process and load it. This is similar to data cleaning/feature engineering pipelines in ML, or ETL pipelines in traditional data settings.

The ingestion pipeline has three stages: load the data, transform it, then index and store it.



## Loaders
Before an LLM can act on your data, load it with a data connector (also called a Reader). Data connectors ingest data from various sources and format it into Document objects, which contain text and metadata.



### SimpleDirectoryReader

It creates documents from every file in a directory. It can read various formats like Markdown, PDFs, Word documents, PowerPoint decks, images, audio, and video.



### Readers from [LlamaHub](https://docs.llamaindex.ai/en/stable/understanding/loading/llamahub/)

Numerous data connectors are available through LlamaHub; not all of them are built in.

For example, LlamaIndex can download and install the DatabaseReader connector, which runs a query against a SQL database and returns the results as Document objects.



### Create Documents directly

Instead of using a loader, you can create a Document directly.


In [None]:
doc = cores.Document(text="text")


## Transformations

After loading the data, process and transform it before storing. Chunk, extract metadata, and embed each chunk for optimal retrieval and use by the LLM.

Transformation input/outputs are Node objects. Transformations can be stacked and reordered.

LlamaIndex offers both a high-level and a lower-level API for document transformation.



### High-Level API

Indexes have a .from_documents() method that accepts a list of Document objects and parses and chunks them. For more control over how documents are split, use the lower-level API described below.

Under the hood, each Document is split into Node objects, which are similar to Documents (they contain text and metadata) but keep a relationship to their parent Document.

To customize core components such as the text splitter, pass a custom transformations list or set them on the global Settings.


In [None]:
documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

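# Option 1: set the splitter globally so every index uses it by default.
text_splitter = node_parsers.SentenceSplitter(chunk_size=512, chunk_overlap=10)
cores.Settings.text_splitter = text_splitter

# Option 2: pass the splitter for this index only.
index = stores.VectorStoreIndex.from_documents(
	documents, transformations=[text_splitter],
)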


### Lower-Level API

Define the steps explicitly. Use transformation modules such as text splitters and metadata extractors on their own, or combine them in the Transformation Pipeline interface.



#### Split Documents into Nodes

Split documents into "chunks"/Node objects to process data into bite-sized pieces for retrieval/feeding to the LLM.

LlamaIndex supports various text splitters: paragraph, sentence, and token-based splitters, as well as file-based splitters for formats like HTML and JSON.

Splitters can be used on their own or as part of an ingestion pipeline.
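
To make the idea of chunking concrete, here is a minimal, dependency-free sketch of a token-based splitter with overlap. It is an illustration only, not how LlamaIndex implements its splitters: the function name `split_into_chunks` is hypothetical, and "tokens" are approximated by whitespace-separated words rather than a real tokenizer.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share chunk_overlap tokens so that context spanning
    a chunk boundary is not lost.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: one word == one token
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = split_into_chunks(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is the effect the chunk_overlap setting is meant to achieve.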


In [None]:
documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

pipeline = ingestions.IngestionPipeline(
	transformations=[
		node_parsers.TokenTextSplitter(),
	]
)

# Pass documents by keyword: the first positional parameter of run() is show_progress.
nodes = pipeline.run(documents=documents)


## Adding Metadata

Add metadata to documents and nodes manually or with automatic extractors.



In [None]:
document = cores.Document(
	text="text",
	metadata={"filename": "<doc_file_name>", "category": "<category>"},
)


## Adding Embeddings

Inserting a node into a vector index requires an embedding for that node.


## Create and pass Nodes directly

Create nodes directly and pass a list of nodes to an indexer.



In [None]:
node1 = schemas.TextNode(text="<text_chunk>", id_="<node_id>")
node2 = schemas.TextNode(text="<text_chunk>", id_="<node_id>")

index = stores.VectorStoreIndex(nodes=[node1, node2])

# Indexing & Embedding

With data loaded, create an Index over Document objects or Nodes to begin querying.



## What is an Index?

In LlamaIndex, an Index is a data structure composed of Document objects, designed to enable querying by an LLM. It is meant to complement your querying strategy.

LlamaIndex offers various index types.


## Vector Store Index

A VectorStoreIndex splits Documents into Nodes and creates a vector embedding of the text of each node, ready to be queried by an LLM.



### Embedding Definition

A vector embedding is a numerical representation of the semantics (meaning) of a text. Two texts with similar meanings have mathematically similar embeddings, even if their wording differs.

The mathematical relationship enables semantic search, allowing LlamaIndex to locate text related to query terms' meaning rather than simple keyword matching. 

There are various types of embeddings with different efficiency, effectiveness, and computational costs. LlamaIndex defaults to text-embedding-ada-002, the default embedding from OpenAI. Different LLMs may require different embeddings.
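
As an illustration of what "mathematically similar" can mean, the sketch below computes cosine similarity, one common similarity measure for embeddings. The three-dimensional vectors are made-up toy values (real embedding models produce vectors with hundreds or thousands of dimensions), and the helper name `cosine_similarity` is an assumption for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings": made-up values for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated meaning
```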



### Vector Store Index embeds documents

Vector Store Index turns all of your text into embeddings using an API from your LLM; this is what "embedding your text" means. Generating embeddings for large amounts of text can be time-consuming because it involves many round-trip API calls.

To search the embeddings, the query itself is turned into a vector embedding, and VectorStoreIndex then ranks all stored embeddings by how semantically similar they are to the query.



### Top K Retrieval

Once ranking is complete, VectorStoreIndex returns the most-similar embeddings as corresponding text chunks. The number of embeddings returned is known as k, with the parameter controlling this known as top_k. This search type is often called "top-k semantic retrieval."
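
A minimal sketch of the top-k selection step, assuming the similarity scores between the query and each stored chunk have already been computed. The chunk texts, the scores, and the helper name `top_k` are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (chunk, score) pairs with the highest similarity score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical similarity scores between one query and four stored chunks.
scores = {
    "chunk about Lisp syntax": 0.91,
    "chunk about Lisp macros": 0.84,
    "chunk about painting": 0.32,
    "chunk about the weather": 0.05,
}

results = top_k(scores, k=2)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

If k is larger than the number of stored chunks, this sketch simply returns all of them, ordered by score.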



### Using Vector Store Index

Pass the Vector Store Index the list of Documents created during the loading stage.

from_documents also takes an optional argument, show_progress. Set it to True to display a progress bar during index construction.

You can also build an index over a list of Node objects directly.

Once the text is indexed, it is ready for querying. However, embedding all of your text can be time-consuming and, with a hosted LLM, costly. To save time and money, store the embeddings first.


In [None]:
documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

index = stores.VectorStoreIndex.from_documents(documents, show_progress=True)


## Summary Index

A Summary Index is a simpler form of Index, best suited to queries that generate a summary of the text in your Documents. It stores all of the Documents and returns all of them to the query engine.



# Storing

Once data is loaded and indexed, store it to prevent re-indexing. By default, indexed data is stored in memory only.



## Persisting to disk

Every Index can be persisted with the built-in .persist() method of its storage context, which writes all data to disk at the specified location.

A Composable Graph can be persisted the same way, through the storage context of its root index.

To avoid re-loading and re-indexing data, load the persisted index instead.

Important: Ensure that the same options used for initializing the index are passed during load_index_from_storage.


In [None]:
PERSIST_DIR = f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/index"

documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

index = stores.VectorStoreIndex.from_documents(documents, show_progress=True)

index.storage_context.persist(persist_dir=PERSIST_DIR)

# Rebuild storage context
storage_context = stores.StorageContext.from_defaults(persist_dir=PERSIST_DIR)

# Load index
index = stores.load_index_from_storage(storage_context)


## Using Vector Stores

Creating the embeddings in a VectorStoreIndex can be costly, so store them to avoid re-indexing the same data repeatedly.

LlamaIndex supports various vector stores with different architectures, complexities, and costs. 

To store embeddings with Chroma:
- Initialize Chroma client
- Create Collection in Chroma to store data
- Assign Chroma as vector_store in StorageContext
- Initialize VectorStoreIndex with StorageContext



In [5]:
documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

# initialize client, setting path to save data
db = stores.chromadb.PersistentClient(
	path=f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/chroma_db"
)

# create collection
chroma_collection = db.get_or_create_collection("quickstart")

# assign chroma as the vector_store to the context
vector_store = stores.ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = stores.StorageContext.from_defaults(vector_store=vector_store)

# create index
index = stores.VectorStoreIndex.from_documents(
	documents=documents, storage_context=storage_context, show_progress=True
)

# Now that data is loaded, indexed, and stored, it's time to query.
# create a query engine and query
query_engine = index.as_query_engine(llm=llms.openai_llms["GPT-3.5-TURBO"])
query = "What is Lisp?"
response = query_engine.query(query)

In [7]:
# Load embeddings directly if already created and stored, without loading
# documents or creating a new VectorStoreIndex.
# initialize client, pointing at the path where the data was saved
db = stores.chromadb.PersistentClient(
	path=f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/chroma_db"
)

# get the existing collection (created in the previous cell)
chroma_collection = db.get_or_create_collection("quickstart")

# assign chroma as the vector_store to the context
vector_store = stores.ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = stores.StorageContext.from_defaults(vector_store=vector_store)

# load the index from the existing vector store (no re-embedding)
index = stores.VectorStoreIndex.from_vector_store(
	vector_store, storage_context=storage_context, show_progress=True
)

# Now that data is loaded, indexed, and stored, it's time to query.
# create a query engine and query
query_engine = index.as_query_engine(llm=llms.openai_llms["GPT-3.5-TURBO"])
query = "What is Lisp?"
response = query_engine.query(query)


## Inserting Documents or Nodes

If you have already created an index, add new documents to it with the insert method.



In [None]:
documents = readers.SimpleDirectoryReader(
	f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

index = stores.VectorStoreIndex([])

for doc in documents:
	index.insert(doc)

# Querying

Now that the data is loaded, the index is built, and the index is stored for later, it is time for querying.

At its simplest, querying is a prompt call to an LLM: a question to get an answer, a request for summarization, or a more complex instruction.

Complex querying may require repeated/chained prompt + LLM calls or a reasoning loop across multiple components.



## Getting started

Use an index to create a QueryEngine.


In [3]:
PERSIST_DIR = f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/index"

documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

index = stores.VectorStoreIndex.from_documents(documents, show_progress=True)

query_engine = index.as_query_engine(llm=llms.openai_llms["GPT-3.5-TURBO"])
query = "What is Lisp?"
response = query_engine.query(query)
print(response)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 72.39it/s]


Generating embeddings: 100%|██████████| 15/15 [00:01<00:00, 11.64it/s]


Lisp is a programming language that was used in the development of the software for Viaweb. It is described as having a unique syntax with a heavy use of parentheses. Despite being less commonly used in business settings at the time, it provided Viaweb with a technical advantage that was not easily understood by competitors.



## Querying Stages

Querying consists of three distinct stages:
- Retrieval: Find and return relevant documents from the Index.
- Postprocessing: Rerank, transform, or filter retrieved Nodes.
- Response synthesis: Combine query, most-relevant data, and prompt to send to LLM for response.
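
The three stages listed above can be sketched as a chain of plain functions. This is only a conceptual outline under stated assumptions: the stage functions (`fake_retrieve`, `rerank`, `fake_synthesize`) are hypothetical stand-ins, so the sketch runs without an index or an LLM and is not LlamaIndex's implementation.

```python
from typing import Callable, List, Tuple

ScoredNode = Tuple[str, float]  # (text, similarity score)


def run_query(
    query: str,
    retrieve: Callable[[str], List[ScoredNode]],
    postprocess: Callable[[List[ScoredNode]], List[ScoredNode]],
    synthesize: Callable[[str, List[ScoredNode]], str],
) -> str:
    """Run the three stages in order: retrieval, postprocessing, synthesis."""
    nodes = retrieve(query)
    nodes = postprocess(nodes)
    return synthesize(query, nodes)


# Stand-in stages (hypothetical) so the sketch is self-contained.
def fake_retrieve(query: str) -> List[ScoredNode]:
    return [("Unrelated text.", 0.2), ("Lisp is a programming language.", 0.9)]


def rerank(nodes: List[ScoredNode]) -> List[ScoredNode]:
    # Postprocessing here means reordering by score, highest first.
    return sorted(nodes, key=lambda node: node[1], reverse=True)


def fake_synthesize(query: str, nodes: List[ScoredNode]) -> str:
    # A real synthesizer would send the query and context to an LLM.
    context = " ".join(text for text, _ in nodes)
    return f"Q: {query} | context: {context}"


answer = run_query("What is Lisp?", fake_retrieve, rerank, fake_synthesize)
print(answer)
```

Because each stage is just a callable, any one of them can be swapped without touching the others, which is the property the customization section below relies on.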



## Customizing Querying Stages

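LlamaIndex features a low-level composition API that gives granular control over querying.

Example: Customize the retriever to use a different top_k value, and add a post-processing step that requires retrieved nodes to reach a minimum similarity score to be included. With the cutoff in place, a query with many relevant results returns plenty of data, while a query with no relevant results may return none.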

Customize by implementing interfaces for retrieval, response synthesis, and query logic.


In [2]:
documents = readers.SimpleDirectoryReader(
  f"{add_packages.APP_PATH}/data/llamaindex_tmp/1/"
).load_data()

index = stores.VectorStoreIndex.from_documents(documents, show_progress=True)

retriever = stores.VectorIndexRetriever(
	index=index,
	similarity_top_k=10,
)

response_synthesizer = cores.get_response_synthesizer()

# Configure desired node postprocessors.
node_postprocessors = [
	postprocessors.SimilarityPostprocessor(similarity_cutoff=0.7),
]

# assemble query engine
query_engine = query_engines.RetrieverQueryEngine(
	retriever=retriever,
	response_synthesizer=response_synthesizer,
	node_postprocessors=node_postprocessors,
)

In [3]:
query = "What is Lisp?"
response = query_engine.query(query)


### Node postprocessors configuration

Advanced Node filtering and augmentation improve relevancy of retrieved Node objects, reducing LLM calls/cost and enhancing response quality.

- KeywordNodePostprocessor: filters nodes by required_keywords and exclude_keywords.
- SimilarityPostprocessor: filters nodes by setting a threshold on the similarity score (thus only supported by embedding-based retrievers).
- PrevNextNodePostprocessor: augments retrieved Node objects with additional relevant context based on Node relationships.
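
To show how filters of this kind compose, here is a small standalone sketch of a keyword filter and a similarity cutoff applied in sequence. Everything in it is hypothetical (`ScoredChunk`, `keyword_filter`, `similarity_cutoff`, `run_postprocessors`, and the sample data); it uses simple case-insensitive substring matching and keeps chunks whose score is at or above the cutoff, which are choices made for this example rather than a description of the library's classes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node with its similarity score."""
    text: str
    score: float


Postprocessor = Callable[[List[ScoredChunk]], List[ScoredChunk]]


def keyword_filter(required: List[str], excluded: List[str]) -> Postprocessor:
    """Keep chunks containing every required keyword and no excluded keyword."""
    def apply(chunks: List[ScoredChunk]) -> List[ScoredChunk]:
        kept = []
        for chunk in chunks:
            lowered = chunk.text.lower()
            has_required = all(word.lower() in lowered for word in required)
            has_excluded = any(word.lower() in lowered for word in excluded)
            if has_required and not has_excluded:
                kept.append(chunk)
        return kept
    return apply


def similarity_cutoff(cutoff: float) -> Postprocessor:
    """Keep chunks whose score is at least cutoff."""
    def apply(chunks: List[ScoredChunk]) -> List[ScoredChunk]:
        return [chunk for chunk in chunks if chunk.score >= cutoff]
    return apply


def run_postprocessors(chunks: List[ScoredChunk], steps: List[Postprocessor]) -> List[ScoredChunk]:
    """Apply each postprocessor in order, feeding the output of one into the next."""
    for step in steps:
        chunks = step(chunks)
    return chunks


retrieved = [
    ScoredChunk("Lisp uses many parentheses.", 0.88),
    ScoredChunk("Lisp was considered slow.", 0.75),
    ScoredChunk("Lisp appears in this footnote.", 0.40),
    ScoredChunk("Painting requires patience.", 0.72),
]
kept = run_postprocessors(
    retrieved,
    [keyword_filter(required=["lisp"], excluded=["slow"]), similarity_cutoff(0.7)],
)
print([chunk.text for chunk in kept])
```

Fewer surviving chunks means fewer tokens sent to the LLM, which is where the cost saving mentioned above comes from.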



### Configuring response synthesis

After a retriever fetches the relevant nodes, a BaseSynthesizer combines their information to synthesize the final response.

Options:
- default: "Create and refine" an answer by going through the retrieved Nodes sequentially, making a separate LLM call for each Node. Good for more detailed answers.

- compact: Stuff as many Node text chunks as fit within the maximum prompt size into each LLM call. If the chunks do not fit in one prompt, "create and refine" an answer over multiple prompts.

- tree_summarize: Recursively construct a tree from the Node objects and the query, and return the root node as the response.

- no_text: Run only the retriever to fetch the nodes that would have been sent to the LLM, without actually sending them. Inspect them via response.source_nodes.

- accumulate: Apply the query to each Node text chunk, accumulating the responses into an array, and return a concatenated string of all responses. Good for running the same query against each text chunk separately.
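
The difference between these modes is mostly a difference in how many LLM calls are made and what each call sees. The sketch below imitates three of them with a stand-in LLM that only counts calls. It is a simplified illustration under stated assumptions: the function names, the prompt wording, the separator, and the use of a character budget (`max_chars`) instead of a token budget are all choices made for this example.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # takes a prompt, returns a completion


def refine(query: str, chunks: List[str], llm: LLM) -> str:
    """One LLM call per chunk; each call refines the previous answer."""
    answer = ""
    for chunk in chunks:
        prompt = f"Question: {query}\nExisting answer: {answer}\nNew context: {chunk}"
        answer = llm(prompt)
    return answer


def compact(query: str, chunks: List[str], llm: LLM, max_chars: int) -> str:
    """Pack chunks into as few prompts as fit in max_chars, then refine over the packs."""
    packs: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packs.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packs.append(current)
    return refine(query, packs, llm)


def accumulate(query: str, chunks: List[str], llm: LLM, separator: str = "\n---\n") -> str:
    """Ask the same question of each chunk separately and join the answers."""
    responses = [llm(f"Question: {query}\nContext: {chunk}") for chunk in chunks]
    return separator.join(responses)


calls: List[str] = []


def counting_llm(prompt: str) -> str:
    """Stand-in LLM that records each prompt instead of calling a model."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


chunks = ["aaaa", "bbbb", "cccc"]

refine("q", chunks, counting_llm)
refine_calls = len(calls)  # one call per chunk

calls.clear()
compact("q", chunks, counting_llm, max_chars=9)
compact_calls = len(calls)  # the first two chunks share one prompt

calls.clear()
accumulated = accumulate("q", chunks, counting_llm)

print(refine_calls, compact_calls)
print(accumulated)
```

With three chunks, the refine-style loop makes three calls while the compact variant makes two, which is the trade-off between detail and cost that the list above describes.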




# Tracing and Debugging

Debugging and tracing are key to understanding and optimizing your application.



## Basic logging

To understand your application, enable debug logging.
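
A minimal way to do this with Python's standard logging module is sketched below. The logger name `my_app` is a placeholder, and `force=True` (available since Python 3.8) is used here so the configuration takes effect even if logging was already configured earlier in the session.

```python
import logging
import sys

# Send all messages at DEBUG level and above to stdout.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

logger = logging.getLogger("my_app")  # placeholder logger name
logger.debug("debug logging is enabled")
```

Switching `level` to `logging.INFO` gives less verbose output once the application behaves as expected.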



## Callback handler

LlamaIndex offers callbacks for debugging, tracking, and tracing the inner workings of the library. The callback manager lets you add as many callbacks as needed.

In addition to logging event data, callbacks can track the duration and number of occurrences of each event.

A trace map of events is also recorded, which callbacks can use as needed. For example, the LlamaDebugHandler prints the trace of events after most operations by default.
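
To illustrate the kind of bookkeeping such a callback performs, here is a small standalone sketch that records the duration and number of occurrences of each event type. It does not use the LlamaIndex callback API: the class `TimingCallback`, its methods, and the event names are hypothetical, and the timed work is a placeholder computation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class TimingCallback:
    """Hypothetical handler that records how long each event type takes."""

    def __init__(self) -> None:
        self.durations: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def event(self, event_type: str) -> Iterator[None]:
        """Time the enclosed block and store the result under event_type."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[event_type].append(time.perf_counter() - start)

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Return the number of occurrences and total time per event type."""
        return {
            event_type: {"count": len(times), "total_seconds": sum(times)}
            for event_type, times in self.durations.items()
        }


callback = TimingCallback()
with callback.event("embedding"):
    sum(range(1000))  # placeholder for an embedding call
with callback.event("embedding"):
    sum(range(1000))
with callback.event("llm"):
    sum(range(1000))  # placeholder for an LLM call

stats = callback.summary()
print(stats)
```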



## Observability

LlamaIndex provides one-click observability to help you build principled LLM applications in a production setting.

This feature integrates the LlamaIndex library with observability tools. Configure a variable once to:
- View LLM/prompt inputs/outputs
- Ensure that component outputs are performing as expected
- View call traces for both indexing and querying




# Evaluating


# Putting it all Together