# Indexing Data

### Introduction

So far we have moved through the data loading stage: retrieving our data (whether PDFs or HTML), chunking it into nodes, and ultimately storing those nodes.

```python
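from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Placeholder text; in practice this comes from the loaded PDF or HTML.
doc_text = "Text extracted from a PDF or HTML page."

# Wrap the raw text in a Document and split it into nodes (chunks).
documents = [Document(text=doc_text)]
parser = SentenceSplitter(chunk_size=1024)
nodes = parser.get_nodes_from_documents(documents)

# Embed the nodes (with the configured embedding model, OpenAI by default) and index them.
index = VectorStoreIndex(nodes)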
```

As you can see, in that last step we stored our nodes. And as you may remember, at query time we retrieve the nodes (i.e. chunks) most relevant to our query embedding by using a similarity score.

But going through every individual node and calculating its similarity to the query vector can be time consuming. Depending on how we index our data, we can avoid calculating a similarity score for every chunk.

### Storage vs Indexing

One thing to note is that a vector store is different from a vector index. **Storage** is the database where the nodes are stored. This can be the [Pinecone vector database](https://www.pinecone.io/), [Neo4j](https://neo4j.com/), or even Postgres (with its pgvector extension). **Indexing** is the strategy for organizing those vectors so that they can be searched.

The LlamaIndex library can be confusing here, because calling the `VectorStoreIndex()` constructor both creates an index and, if no storage is specified, creates an in-memory vector store.

```python
index = VectorStoreIndex(nodes)
# This creates both the index and an in-memory simple vector store.
```
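The split between storage and indexing can be sketched with a toy example. The `InMemoryStore` and `ToyIndex` classes below are hypothetical and are not LlamaIndex's implementation; they only mirror the behavior described above, where the index falls back to a default in-memory store unless one is passed in.

```python
from typing import Dict, List, Optional


class InMemoryStore:
    """Hypothetical storage layer: it only keeps vectors, keyed by node id."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}

    def add(self, node_id: str, vector: List[float]) -> None:
        self.vectors[node_id] = vector


class ToyIndex:
    """Hypothetical index: decides how the stored vectors are organized and searched."""

    def __init__(
        self,
        nodes: Dict[str, List[float]],
        store: Optional[InMemoryStore] = None,
    ) -> None:
        # Fall back to an in-memory store when the caller does not supply one.
        self.store = store if store is not None else InMemoryStore()
        for node_id, vector in nodes.items():
            self.store.add(node_id, vector)


# Made-up embeddings, used only for illustration.
nodes = {"node-1": [0.1, 0.9], "node-2": [0.8, 0.2]}

# Index plus an implicit in-memory store.
default_index = ToyIndex(nodes)

# Index built on top of a store that we create and control ourselves.
shared_store = InMemoryStore()
explicit_index = ToyIndex(nodes, shared_store)
```

In LlamaIndex itself, the counterpart of `shared_store` would be a vector store passed in through a `StorageContext` rather than directly to the constructor.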

### Indexing strategies

1. Flat indexing 

With flat indexing, we store each vector as is, and when a query arrives we calculate the distance from the query vector to every stored vector. The benefit of flat indexing is that it is simple, easy to implement, and exact: it always returns the true nearest neighbors. The downside is that it is slow, because every query requires one calculation per stored vector, so query time grows linearly with the number of vectors.
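To make the cost of a flat index concrete, here is a minimal sketch of a flat search in plain Python. It assumes cosine similarity as the score, and the helper names (`cosine_similarity`, `flat_search`) and the three-dimensional vectors are made up for illustration. The loop touches every stored vector once per query.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(
    query: List[float],
    vectors: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [
        (node_id, cosine_similarity(query, vector))
        for node_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up embeddings; real ones would have hundreds of dimensions.
stored = {
    "node-1": [1.0, 0.0, 0.0],
    "node-2": [0.0, 1.0, 0.0],
    "node-3": [0.9, 0.1, 0.0],
}

results = flat_search([1.0, 0.0, 0.0], stored, top_k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

With three vectors this is instant, but the same loop over millions of nodes is what makes flat search slow.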

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    "../../examples/data/paul_graham"
).load_data()
index = VectorStoreIndex.from_documents(documents)

# Query the index; retrieval compares the query embedding to every stored node.
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
```

With its default in-memory vector store, the `VectorStoreIndex` above uses a flat indexing strategy.

### Keyword Table Index

<img src="./keyword-table-index.png" width="60%">

Here, you can see that each node is tagged with certain keywords. With this strategy, at query time the index extracts relevant keywords from the query and matches them against the keywords already extracted from each node. It then returns the corresponding nodes to be synthesized by the LLM.

```python
from llama_index.core import KeywordTableIndex

# Reuses the `documents` loaded in the previous example.
index = KeywordTableIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is net operating income?")
```
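The keyword matching described above can be sketched without any library. This is a toy illustration and not how LlamaIndex implements it: the `extract_keywords` helper, its stopword list, and the two sample nodes are all assumptions, and a real index would use a much stronger keyword extractor.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Tiny, hand-picked stopword list (an assumption made for this sketch).
STOPWORDS = {"a", "an", "and", "are", "by", "from", "in", "is", "of", "the", "to", "what"}


def extract_keywords(text: str) -> Set[str]:
    """Naive keyword extraction: lowercase, strip punctuation, drop stopwords."""
    words = (word.strip(".,?!").lower() for word in text.split())
    return {word for word in words if word and word not in STOPWORDS}


def build_keyword_table(nodes: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the ids of the nodes that contain it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def lookup(query: str, table: Dict[str, Set[str]]) -> List[str]:
    """Return ids of nodes sharing at least one keyword with the query."""
    matches: Set[str] = set()
    for keyword in extract_keywords(query):
        matches |= table.get(keyword, set())
    return sorted(matches)


# Made-up nodes, used only for illustration.
nodes = {
    "node-1": "Net operating income is revenue minus operating expenses.",
    "node-2": "The author wrote short stories growing up.",
}

table = build_keyword_table(nodes)
matched = lookup("What is net operating income?", table)
print(matched)
```

Note that only the nodes listed under the query's keywords are touched, so no similarity score is computed for the rest.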

For more information, please read through the following article:

[Medium Article](https://betterprogramming.pub/llamaindex-how-to-use-index-correctly-6f928b8944c6)

### Resources

[Medium - LlamaIndex](https://betterprogramming.pub/llamaindex-how-to-use-index-correctly-6f928b8944c6)

[DataStax - VectorIndex](https://www.datastax.com/guides/what-is-a-vector-index)

[LlamaIndex - index guide](https://docs.llamaindex.ai/en/stable/module_guides/indexing/index_guide.html)

[LlamaIndex - other resources](https://docs.llamaindex.ai/en/stable/module_guides/indexing/indexing.html#other-index-resources)

* See also the FAISS library

[Building a Simple Vector Store](https://docs.llamaindex.ai/en/stable/examples/low_level/vector_store.html)