# LlamaIndex Deep Dive

In the last session we saw the basic setup for building an retrieval application. We build that one using the LlamaIndex. In this session we are taking a deep dive in the LlamaIndex framework and exploring the basic components of it.

The basic components for LLM retrieval are
1. LLM - Large Language Models like OpenAI's GPT models or open-source ones like Llama, Alpaca, and Bloom. They generate the answers for the queries and help create the indexes.
2. Embeddings - Dense vector representations of words or sentences. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. This can be later used to find relevant sections given a query by doing a similarity search.
3. Vector Database - These can store vectors like the embedding vectors and perform fast similarity searches.
4. Indexes - An index is a database structure that you can use to improve the performance of database activity.

and we are going to see how LlamaIndex works with each of these. The notebooks are available [here](https://github.com/explodinggradients/notes/tree/main/retrieval)

first lets setup everything up ie
- get the OPENAI_API_KEY
- turn the logging level to INFO

In [1]:
# make sure you have a valid OpenAI API key and it set as an 
# environment variable "OPENAI_API_KEY".
import os
import logging
import sys

from IPython.display import Markdown, display

if os.environ.get("OPENAI_API_KEY") is None:
    logging.error("OPENAI_API_KEY not Found! Please add it as an env var")
    
# ennable DEBUG mode to more logs and understand what is happening.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

load the data from the directory `./what_i_worked_on_pg/` which containers the text for the blog [What I Worked On](http://paulgraham.com/worked.html).

In [3]:
# load the data
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader('./what_i_worked_on_pg/').load_data()

## Nodes

`Node`s are the fundamental units of information in LlamaIndex. The represent the information units, relationships between other informational units and any metadata available. Currently Text and Image nodes are supported. There is a `NodeParser` whose job is to take the documents you give and convert it into Nodes.

Let take an example. We have already loaded the text from the blog in `documents`. The blog is very long, so in order to effectively index and retrieve the required information we have to split it. `NodeParser` does exactly that, in this case with the help of `TextSplitter`.

In [4]:
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)

len(nodes)

24

In [5]:
nodes

[Node(text='\t\t\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punc

In [6]:
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

spliter = TokenTextSplitter(
    separator='.',    # can be also ' ', '\n' or specific to data
    chunk_size=400,   # number of tokens in each chunk
    chunk_overlap=20, # overlap between different chunks
)

parser = SimpleNodeParser(spliter)
tmp_nodes  = parser.get_nodes_from_documents(documents)

len(tmp_nodes)

44

Here we configured a `TokenTextSpliter` which is a type of `TextSpliter` that looks splits at token level. With the config we provied, the `NodeParser` breaks it down into smaller nodes, with smaller text in each node.

In [7]:
tmp_nodes[1]

Node(text=" My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.\n\nWith microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]\n\nThe first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.\n\nComputers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programmin

## Indexes

This is where the magic lies. An index defines how you connect the different nodes together to represent the data so that when you're querying, the most relevent piece of information is retrieved as fast as possible. 

There are a buntch of indexes defined and and we will do indepth into this topic in the next session but lets see 2 examples to get a feel of what this is about.

1. Vector Index: A vector index has the embeddings stored with the data so that at query time we can run a similarity search and retrieve the nodes that are most relevent to the query.

In [8]:
from llama_index import GPTVectorStoreIndex

simple_index = GPTVectorStoreIndex.from_documents(documents)

query_engine = simple_index.as_query_engine()
resp = query_engine.query("What did the author do growing up?")
display(Markdown(f"<i>{resp}</i>"))

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20729 tokens
> [build_index_from_nodes] Total embedding token usage: 20729 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1920 tokens
> [get_response] Total LLM token usage: 1920 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>
Growing up, the author wrote short stories, programmed on an IBM 1401, and nagged his father to buy him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, but switched to AI after becoming bored with it. He then took art classes at Harvard and applied to art schools, eventually attending RISD.</i>

2. List Index: The list index simply stores Nodes as a sequential chain. This means at query time you can you can use a retrieve and refine methods through each nodes in the list to get a more complete answer.

In [9]:
from llama_index import GPTListIndex

list_index = GPTListIndex.from_documents(documents)

qe = list_index.as_query_engine()
resp = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{resp}</i>"))

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 29018 tokens
> [get_response] Total LLM token usage: 29018 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>

Growing up, the author wrote short stories, programmed on an IBM 1401, and wrote programs for a TRS-80 microcomputer[1]. He also studied philosophy in college, but switched to AI when he found it boring. He reverse-engineered SHRDLU for his undergraduate thesis, and wrote a book about Lisp hacking while in grad school. He also took art classes at Harvard and applied to art schools. He then attended the Accademia di Belli Arti in Florence, where he took an entrance exam and studied painting. He also took classes at RISD and taught himself to paint. He then moved to New York City, where he got a job at Interleaf and wrote On Lisp. He also painted still lives in his bedroom and learned about the low end eating the high end. He then returned to RISD, where he took a color class and learned about signature styles. He then dropped out and moved to New York City, where he painted and learned more about the art world, taking advantage of a rent-controlled apartment in a building her mother owned. He also zoomed in on a tiny corner of New York City that wasn't rich, and taught himself to paint for free. He also took advantage of his time in</i>

### Inserting, Updating, Deleting and Saving an Index

Honestly this is trick and I couldn't get it to work properly. The index has methods like `.update()`, `.insert()` and `.delete()` but its confusing to use. What I could recomend instead is using `DocStore`.

(I've raised this issue in the community, if I get a better answer I'll update this section)

In [10]:
list_index.docstore.docs

{'79dbb492-7487-4dff-b708-e57a0075cba1': Node(text='\t\t\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of F

## Customizing LLM's

The Large Language Models are used to query the indexes and perform actions like summarisation, extraction etc when building different indexes. By default LlamaIndex uses OpenAI's models but you can add any that are supported through Langchain (see [complete list](https://python.langchain.com/en/latest/reference/modules/llms.html)).

Let try out [Cohere's Platform](https://docs.cohere.com/docs) for a change. You have to signup and create an API key under `COHERE_API_KEY` and have the `cohere` python package installed. For more information on how to use this check out [Langchains docs](https://python.langchain.com/en/latest/reference/modules/llms.html#langchain.llms.Cohere).

### `ServiceContext`
Any configurations to the index is defined through the `ServiceContext`. You can configure the LLM, prompt_helper, embeddings, noder_parser etc. To use Cohere API we should be changing 2 things
1. LLMPredictor: This is build on top of LangChain's llm modules. Makes available all the LLMs langchain has an interface to.
2. PromptHelper: This is utility class that helps define things like the max input size of the LLM, number of output tokens, chunk size, embeddings limits and much more.

but you can also use it to configure

3. LangchainEmbedding: This is a wrapper class for Langchains embeddings. See [embeddings module](https://langchain.readthedocs.io/en/latest/reference/modules/embeddings.html) for all the embeddings supported. 
4. LlamaLogger: Logging.

In [14]:
# Custom LLMS
from llama_index import LLMPredictor, GPTVectorStoreIndex, PromptHelper, ServiceContext
from langchain import OpenAI, Cohere

# define LLMPredictor
llm_predictor = LLMPredictor(
    llm=Cohere(temperature=0, model="command")
)
# define PromptHelper
max_input_size = 2048  # set maximum input size
num_output = 256       # set number of output tokens
max_chunk_overlap = 20
prompt_helper = PromptHelper(
    max_input_size,
    num_output,
    max_chunk_overlap
)

# use those to create the ServiceContext
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, 
    prompt_helper=prompt_helper
)

cohere_index = GPTVectorStoreIndex.from_documents(
    documents, service_context=service_context
)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20729 tokens
> [build_index_from_nodes] Total embedding token usage: 20729 tokens


In [15]:
# OpenAI's LLM
qe = simple_index.as_query_engine()
resp = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{resp}</i>"))

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1920 tokens
> [get_response] Total LLM token usage: 1920 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>
Growing up, the author wrote short stories, programmed on an IBM 1401, and nagged his father to buy him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, but switched to AI after becoming bored with it. He then took art classes at Harvard and applied to art schools, eventually attending RISD.</i>

In [16]:
# Cohere's generate
qe = cohere_index.as_query_engine()
resp = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{resp}</i>"))

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1957 tokens
> [get_response] Total LLM token usage: 1957 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>
The author worked on writing and programming before college.</i>

## Customizing Embeddings

LlamaIndex allows you to define custom embedding modules. By default, we use `text-embedding-ada-002` from OpenAI. You can also choose to plug in embeddings from Langchain’s [embeddings module](https://langchain.readthedocs.io/en/latest/reference/modules/embeddings.html). There is a wrapper class, `LangchainEmbedding`, for integration into LlamaIndex. It is configured through the `ServiceContext`.

Let's try out HuggingFace's sentence_transformers embedding model. This works offline but make sure you have the `sentence_transformers` library installed.

In [19]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext, StorageContext

# load in HF embedding model from langchain
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
service_context = ServiceContext.from_defaults(embed_model=embed_model)

hf_embs_index = GPTVectorStoreIndex.from_documents(documents)

# query with embed_model specified
qe = hf_embs_index.as_query_engine(
    mode="embedding", 
    verbose=True, 
    service_context=service_context
)
response = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{resp}</i>"))

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
Use pytorch device: cpu
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20729 tokens
> [build_index_from_nodes] Total embedding token usage: 20729 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM toke

<i>
The author worked on writing and programming before college.</i>

## Using Vector Stores

Vector stores / vector databases can be used with llamaIndex in 2 ways

1. Datastore - like any other data source, llamaIndex can load the data from the vector db and build indices on top of it. [ref](https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html#loading-data-from-vector-stores-using-data-connector)
2. Index - use the vector db to build the indicies. This is what we will be looking at.

We will be using the [Chromadb](https://docs.trychroma.com/getting-started) project as our vector db. To get started all you have to do is install Chromadb with
```bash
pip install chromadb
```

In [28]:
import chromadb
from llama_index.vector_stores import ChromaVectorStore
from llama_index import StorageContext

# define chroma store
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection('quickstart')
vector_store = ChromaVectorStore(
    chroma_collection=chroma_collection
)

# 
sc = StorageContext.from_defaults(vector_store=vector_store)
index = GPTVectorStoreIndex.from_documents(
    documents=documents, 
    storage_context=sc
)
qe = index.as_query_engine()
resp = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{resp}</i>"))

INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:chromadb:Running Chroma using direct local API.
Running Chroma using direct local API.
Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:chromadb.db.duckdb:Exiting: Cleaning up .chroma directory
Exiting: Cleaning up .chroma directory
INFO:chromadb.db.duckdb:Exiting: Cleaning up .chroma directory
Exiting: Cleaning up .chroma directory
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
Use pytorch device: cpu
INFO:llama_index.token_counter.token_counter:> [build_index_from_no

<i>
The author grew up writing short stories, programming on an IBM 1401, and nagging his father to buy a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, but found it boring and switched to AI. He took art classes at Harvard and applied to art schools.</i>

## Querying

quering is done by the `QueryEngine`. At a high-level all you have to do is create the query engine with default configs from your index and start querying

In [24]:
query_engine = simple_index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<i>{response}</i>"))

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1920 tokens
> [get_response] Total LLM token usage: 1920 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>
Growing up, the author wrote short stories, programmed on an IBM 1401, and nagged his father to buy him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, but switched to AI after becoming bored with it. He then took art classes at Harvard and applied to art schools, eventually attending RISD.</i>

To get more controll you can modify the
1. `Retriever` - configures how to get the nodes. For eg list index support `ListIndexRetriever` that retrieves all the nodes while `List IndexEmbeddingRetriever` only retrieves the top-k nodes.

In [26]:
from llama_index import ResponseSynthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

index = GPTListIndex.from_documents(documents)

# mode="default"
retriever = index.as_retriever(retriever_mode='default')
qe = RetrieverQueryEngine(retriever)
response = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{response}</i>"))

# mode="embedding"
retriever = index.as_retriever(retriever_mode='embedding')
qe = RetrieverQueryEngine(retriever)
response = qe.query("What did the author do growing up?")
display(Markdown(f"<i>{response}</i>"))

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 28944 tokens
> [get_response] Total LLM token usage: 28944 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>

Growing up, the author wrote short stories, programmed on an IBM 1401, and wrote programs for a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, but switched to AI when he realized it was a more exciting field. He reverse-engineered SHRDLU for his undergraduate thesis, and wrote a book about Lisp hacking while in grad school. He also took art classes at Harvard and applied to art schools. He then attended the Accademia di Belli Arti in Florence, where he took an entrance exam and studied painting. He also worked at Interleaf, where he learned about the low end eating the high end, and then returned to RISD to study painting. He eventually moved to New York City, where he taught himself to paint, dropped out of RISD, and moved into a rent-controlled apartment in the Upper East Side. He then started a company called Viaweb, which wrote software to build online stores. He also wrote a WYSIWYG site builder, a shopping cart, and a manager to keep track of orders and statistics. He also learned a lot about retail, and wrote</i>

INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1004 tokens
> [get_response] Total LLM token usage: 1004 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens


<i>
The author grew up writing short stories and programming on an IBM 1401. He also nagged his father to buy him a TRS-80 microcomputer, on which he wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college.</i>

2. `ResponseSynthesizer` - how the relevent nodes are combined to generate the answer.

In [None]:
index = GPTListIndex.from_documents(documents)

# mode="default"
response = index.query("What did the author do growing up?", response_mode="default")
display(Markdown(f"<i>{response}</i>"))

# mode="compact"
response = index.query("What did the author do growing up?", response_mode="compact")
display(Markdown(f"<i>{response}</i>"))

# mode="tree_summarize"
response = index.query("What did the author do growing up?", response_mode="tree_summarize")
display(Markdown(f"<i>{response}</i>"))

### Saving the Index
In order to save an index you can call `.save_to_disk(<name>.json)` and `.load_from_disk(<name)` to load.

>Note: If you had initialized the index with a custom `ServiceContext` object, you will also need to pass in the same `ServiceContext` during `load_from_disk()`.

In [None]:
# save 
index.save_to_disk('index.json')
loaded_index = GPTSimpleVectorIndex.load_from_disk('index.json')

# custom service context
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor
)
index = GPTSimpleVectorIndex.from_documents(
    documents, service_context=service_context
)
index = GPTSimpleVectorIndex.load_from_disk(
    "index.json", service_context=service_context
)