# Indexing Data

### Introduction

So far we have moved through the data loading stage, which involved retrieving our data (whether PDFs or HTML) and then chunking that data before ultimately storing it.

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# doc_text is the raw text retrieved during the loading step
# Wrap the text in a Document, then split it into nodes (chunks)
documents = [Document(text=doc_text)]
parser = SentenceSplitter(chunk_size=1024)
nodes = parser.get_nodes_from_documents(documents)

# Embed the nodes and build the index
index = VectorStoreIndex(nodes)
```

Now as you can see, in that last step we stored our nodes in the database.  And as you may remember, we ultimately retrieve the nodes (i.e. chunks) most relevant to our query embedding, and do so by using a similarity score.

But going through each individual node and calculating its similarity to the query vector can be time consuming.  Depending on how we index our data, we can avoid calculating a similarity score between our question and every single chunk.

### Storage vs Indexing

One thing to note is that a vector store is different from a vector index.  **Storage** is the kind of database where the nodes are stored.  This can include the [Pinecone vector database](https://www.pinecone.io/), [Neo4j](https://neo4j.com/), or even Postgres (with its pgvector extension).  **Indexing** is the strategy for organizing these vectors so they can be searched.

The LlamaIndex library can be confusing here, because when we call the `VectorStoreIndex()` constructor it both creates an index and, if no storage is otherwise specified, creates an in-memory vector store.

```python
index = VectorStoreIndex(nodes)
# this creates both the index, and an in memory simple vector store
```
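To make the storage/index split concrete, here is a minimal toy sketch in plain Python. It is not LlamaIndex's implementation: `InMemoryVectorStore` and `ToyIndex` are hypothetical names invented for illustration. It only mirrors the behavior described above, where the index falls back to an in-memory store when none is supplied.

```python
class InMemoryVectorStore:
    """Storage: only keeps vectors and hands them back by node id."""

    def __init__(self):
        self._vectors = {}

    def add(self, node_id, vector):
        self._vectors[node_id] = vector

    def get(self, node_id):
        return self._vectors[node_id]

    def node_ids(self):
        return list(self._vectors)


class ToyIndex:
    """Index: decides how stored vectors are organized and searched."""

    def __init__(self, nodes, store=None):
        # Fall back to an in-memory store when the caller does not supply one.
        self.store = store if store is not None else InMemoryVectorStore()
        for node_id, vector in nodes.items():
            self.store.add(node_id, vector)


# Hypothetical node ids and vectors, for illustration only.
default_index = ToyIndex({"node-a": [0.1, 0.2]})
shared_store = InMemoryVectorStore()
explicit_index = ToyIndex({"node-b": [0.3, 0.4]}, store=shared_store)
print(type(default_index.store).__name__, shared_store.node_ids())
```

The point of the design is that the store could be swapped for an external database without changing how the index searches.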

### Indexing strategies

1. VectorStoreIndex 

With the vector store index, when we pass in or generate our nodes, the `VectorStoreIndex` will (by default) call OpenAI's embeddings API to generate an embedding for each node.  When a query arrives, the index computes the similarity between the query vector and every vector in the index.  It then sends the most relevant nodes to the LLM.

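The benefit of vector store indexing is that it's simple, easy to implement, and exact: because the query is compared against every stored vector, the highest-scoring nodes are always found. The downside is that it is slow, since each incoming query requires one similarity calculation per stored vector, so query time grows with the size of the index.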

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    "../../examples/data/paul_graham"
).load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
```

This strategy of simply storing each of the nodes and then calculating a similarity score for each node is also known as a flat indexing strategy.
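As a rough illustration of what a flat search does, the sketch below scores every stored vector against the query with cosine similarity and keeps the top matches. This is a toy version in plain Python, not the library's code; the function names, node ids, and 3-dimensional vectors are assumptions made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(query_vector, node_vectors, top_k=2):
    """Score every stored vector against the query; return the top_k node ids."""
    scored = [
        (cosine_similarity(query_vector, vector), node_id)
        for node_id, vector in node_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [node_id for _, node_id in scored[:top_k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
node_vectors = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.0, 1.0, 0.0],
    "node-c": [0.9, 0.1, 0.0],
}
top_nodes = flat_search([1.0, 0.0, 0.0], node_vectors, top_k=2)
print(top_nodes)
```

Note that the list comprehension touches every entry in `node_vectors`, which is exactly why the cost of a flat search grows with the number of stored nodes.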

### Keyword Table Index

<img src="./keyword-table-index.png" width="60%">

Here, you can see that each node is tagged with certain keywords. 

With this strategy, during query time, the index will extract relevant keywords from the query, and then match the keywords with the already extracted Node keywords.  Then it will return the corresponding nodes to be synthesized by the LLM.

```python
from llama_index.core import KeywordTableIndex

# Build the keyword table (keywords are extracted from each node with an LLM)
index = KeywordTableIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is net operating income?")
```

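A benefit of this strategy is that it's faster to retrieve the relevant nodes, since only nodes with matching keywords are considered.  However, a downside is that tagging nodes with keywords requires LLM calls when the index is built; if using OpenAI, this means the completion API, which is more expensive than the embeddings API.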
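The keyword lookup described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the library's implementation: the real index uses an LLM to extract keywords, while this sketch assumes the keywords are already known and uses a crude substring match on the query. All function names, node ids, and keywords are hypothetical.

```python
from collections import defaultdict


def build_keyword_table(node_keywords):
    """Invert {node_id: [keywords]} into {keyword: {node_ids}}."""
    table = defaultdict(set)
    for node_id, keywords in node_keywords.items():
        for keyword in keywords:
            table[keyword.lower()].add(node_id)
    return dict(table)


def retrieve_by_keyword(query, table):
    """Return ids of nodes whose keywords appear in the query text."""
    query_lower = query.lower()
    matched = set()
    for keyword, node_ids in table.items():
        if keyword in query_lower:  # crude stand-in for LLM keyword extraction
            matched |= node_ids
    return matched


# Hypothetical node ids and keywords, for illustration only.
node_keywords = {
    "node-1": ["net operating income", "revenue"],
    "node-2": ["francisco"],
    "node-3": ["revenue", "expenses"],
}
table = build_keyword_table(node_keywords)
income_nodes = retrieve_by_keyword("What is net operating income?", table)
revenue_nodes = retrieve_by_keyword("How did revenue change?", table)
print(income_nodes, revenue_nodes)
```

No similarity scores are computed at query time here; retrieval is a set of dictionary lookups, which is where the speed advantage comes from.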

If you run the `index.py` file in the codebase, you can see the results of using our keyword table index.

```python
from llama_index.core import KeywordTableIndex
keyword_index = KeywordTableIndex.from_documents(documents)
```

Then we call `keyword_index.index_struct` to see the resulting shape of the index.

<img src="./keyword-struct.png">

Notice that `table` is a dictionary, where each key is a keyword and each value is the set of IDs of the nodes tagged with that keyword.

`{'joseph carrubba': {}, 'francisco': {}}`

And from there if we would like to find the node associated with each keyword, we can do so by accessing that node from the index's docstore.

```python
node_id = 'b113c64b-873c-4936-83f2-207c002f8136'
keyword_index.docstore.docs[node_id].text
```

> One thing you may notice is that we are indexing some information from our PDF that may be pretty irrelevant, like the list of sources at the end.  This may be a case for feeding only part of the PDF to our database, with some up-front data engineering.

### Tree index

So the keyword index generates a list of keywords, and then associates nodes with these keywords.  The tree index builds on this idea: it creates nodes of text summaries, and these summaries point to the original document nodes.

```python
from llama_index.core import TreeIndex

# Build the tree: parent summary nodes are generated from the document nodes
tree_index = TreeIndex.from_documents(documents)
```

For example, in this case if we look at the index_struct, we will see that one node points to child nodes.

```python
{'14988bff-bc2a-448f-95a6-5c60b18a3f94': ['b986e069-40e1-44bb-bd07-80e7528ea178', '722c6520-3da2-4c3d-aed7-e87b8e1d3005', 'fcbf63d5-1975-4381-969c-68ae2eaa1400']}
```

And this parent node is really a summary derived from those child nodes.

<img src="./summary-node.png">

So when a query comes in, the index first looks at these parent nodes, then traverses from there to the original nodes from the document, and sends those to the LLM.

<img src="./tree-index.png">
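The traversal can be sketched as follows. This is a toy with a single level of parents, and it assumes a simple word-overlap score to pick the parent, whereas the actual index relies on an LLM to generate summaries and to choose among them. The function names, node ids, and texts are hypothetical.

```python
def word_overlap(query, text):
    """Count distinct words shared by the query and a text (a crude relevance score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def tree_retrieve(query, tree, leaves):
    """Pick the parent whose summary best matches the query, then return its children."""
    best_parent = max(tree, key=lambda pid: word_overlap(query, tree[pid]["summary"]))
    return [leaves[child_id] for child_id in tree[best_parent]["children"]]


# Hypothetical summaries and leaf texts, for illustration only.
tree = {
    "parent-1": {
        "summary": "rental income and operating expenses for the property",
        "children": ["leaf-1", "leaf-2"],
    },
    "parent-2": {
        "summary": "ownership history and contact details",
        "children": ["leaf-3"],
    },
}
leaves = {
    "leaf-1": "Rental income rose during the year.",
    "leaf-2": "Operating expenses include maintenance and insurance.",
    "leaf-3": "The building changed owners twice.",
}
retrieved = tree_retrieve("what are the operating expenses", tree, leaves)
print(retrieved)
```

Only the two parent summaries are scored against the query; the leaf under `parent-2` is never examined.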

### Summary

In this lesson, we learned a few different mechanisms for indexing our data.  As we saw, the vector store is where our data is actually stored, whereas the index determines *how* that data is organized and searched.

* VectorStoreIndex - Here each node is embedded and stored.  When a query comes in, the similarity to every node is calculated.  This is accurate but takes time.

* Keyword Index - Each node is tagged with keywords.  When queried, only the nodes with a keyword matching the query are retrieved.

* Tree Index - Generates parent summary nodes.  When queried, it first finds the relevant summary nodes, then retrieves the child nodes beneath them.

### Resources

[Leeway Hertz](https://www.leewayhertz.com/llamaindex)

[Benefits of Different Indexes](https://blog.gopenai.com/different-types-of-indexes-in-llamaindex-to-improve-your-rag-system-0fb13132cab6)

[Medium LLamaindex](https://betterprogramming.pub/llamaindex-how-to-use-index-correctly-6f928b8944c6)

[DataStax - VectorIndex](https://www.datastax.com/guides/what-is-a-vector-index)

[Llamaindex](https://docs.llamaindex.ai/en/stable/module_guides/indexing/index_guide.html)

[LLamaindex - other resources](https://docs.llamaindex.ai/en/stable/module_guides/indexing/indexing.html#other-index-resources)

[Building a Simple Vector Store](https://docs.llamaindex.ai/en/stable/examples/low_level/vector_store.html)

See also the FAISS library