In [None]:
%%html
<style>
    body {
        --vscode-font-family: "Segoe UI"
    }
</style>

There are 6 concepts here -
  1. Key-Value Store
  2. Document Store
  3. Index Store
  4. Vector Store
  5. Graph Store
  6. Different indexes (not to be confused with index stores) -
      * Summary Index
      * Vector Store Index
      * Tree Index
      * Keyword Table Index
      * Knowledge Graph Index

![storage_context](./storage_context.png)

### Key-Value Store
This is the underlying physical storage layer. All data is stored as key-value pairs. This abstraction is used by the document store and index store.
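
To make the key-value idea concrete, here is a minimal in-memory sketch. `ToyKVStore`, its method names, and the collection names are all hypothetical, chosen for illustration; this is not the library's actual class, just a picture of what a store that other stores sit on top of might look like.

```python
from typing import Dict, Optional


class ToyKVStore:
    """Hypothetical in-memory key-value store, partitioned into named collections."""

    def __init__(self) -> None:
        # collection name -> (key -> value)
        self._data: Dict[str, Dict[str, dict]] = {}

    def put(self, key: str, val: dict, collection: str = "data") -> None:
        # Store a copy so later mutation by the caller does not change the store.
        self._data.setdefault(collection, {})[key] = dict(val)

    def get(self, key: str, collection: str = "data") -> Optional[dict]:
        val = self._data.get(collection, {}).get(key)
        return None if val is None else dict(val)

    def get_all(self, collection: str = "data") -> Dict[str, dict]:
        return {k: dict(v) for k, v in self._data.get(collection, {}).items()}

    def delete(self, key: str, collection: str = "data") -> bool:
        # Returns True only if the key existed in that collection.
        return self._data.get(collection, {}).pop(key, None) is not None


kv = ToyKVStore()
# Two higher-level stores can share one KV store by using separate collections.
kv.put("node-1", {"text": "conda create --name py35"}, collection="docstore/data")
kv.put("index-1", {"type": "vector_store"}, collection="index_store/data")
print(kv.get("node-1", collection="docstore/data"))
```

Keeping each higher-level store in its own collection is one way a document store and an index store could share the same physical layer without their keys colliding.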

### Document Store
Real-world content is ingested as documents, parsed and chunked into text nodes, and stored in the document store. Document stores use the underlying key-value store to hold the documents, possibly using the doc ID as the key and the actual document as the value.
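
The ingest, chunk, and store flow can be sketched as follows. Everything here is an assumption made for illustration: `ToyNode`, `ToyDocStore`, and the fixed-width `chunk_text` are hypothetical, and a plain dict stands in for the key-value layer. Real node parsers split on sentences or tokens rather than on character counts.

```python
import uuid
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class ToyNode:
    node_id: str
    text: str
    ref_doc_id: str  # ID of the source document this chunk came from


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Naive fixed-width chunking, used only to keep the sketch short."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class ToyDocStore:
    """Hypothetical document store: node ID is the key, serialized node is the value."""

    def __init__(self) -> None:
        self._kv: Dict[str, dict] = {}  # stand-in for the key-value layer

    def add_document(self, doc_id: str, text: str, chunk_size: int = 20) -> List[str]:
        node_ids = []
        for chunk in chunk_text(text, chunk_size):
            node = ToyNode(node_id=str(uuid.uuid4()), text=chunk, ref_doc_id=doc_id)
            self._kv[node.node_id] = asdict(node)
            node_ids.append(node.node_id)
        return node_ids

    def get_node(self, node_id: str) -> ToyNode:
        return ToyNode(**self._kv[node_id])

    @property
    def docs(self) -> Dict[str, ToyNode]:
        return {k: ToyNode(**v) for k, v in self._kv.items()}


source_text = "Use conda create --name ENV to make an environment."
docstore = ToyDocStore()
ids = docstore.add_document("conda-guide", source_text)
print(len(docstore.docs), "nodes stored for document 'conda-guide'")
```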

### Index Store
Directly from the [documentation](http://127.0.0.1:8000/module_guides/storing/index_stores.html) -
> Index stores contain lightweight index metadata (i.e., additional state information created when building an index).

TODO: Probe an index store to figure out what kind of metadata is stored here. Is it the `.metadata` of each node, or is it something else?

### Vector Store
This stores the embedding of each node. Any top level index that uses embeddings will use this store.
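
As a rough picture of what such a store does, here is a minimal sketch that keeps one embedding per node ID and answers a top-k query by cosine similarity. `ToyVectorStore` and the 3-dimensional vectors are hypothetical. The attribute name `embedding_dict` simply mirrors the key that shows up when the simple vector store is dumped later in this notebook.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical vector store: node ID -> embedding, with brute-force search."""

    def __init__(self) -> None:
        self.embedding_dict: Dict[str, List[float]] = {}

    def add(self, node_id: str, embedding: List[float]) -> None:
        self.embedding_dict[node_id] = list(embedding)

    def query(self, query_embedding: List[float], top_k: int = 2) -> List[Tuple[str, float]]:
        scored = [
            (node_id, cosine(query_embedding, emb))
            for node_id, emb in self.embedding_dict.items()
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add("node-a", [1.0, 0.0, 0.0])
store.add("node-b", [0.9, 0.1, 0.0])
store.add("node-c", [0.0, 1.0, 0.0])

hits = store.query([1.0, 0.0, 0.0], top_k=2)
print([node_id for node_id, _ in hits])
```

Note that the store returns node IDs, not text. The text still has to be looked up in the document store, which is why an embedding-based index needs both.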

### Graph Store
This stores the so-called knowledge graph extracted from documents. This is similar to the vector store, in that only some indexes that deal with knowledge graphs will use this store.
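
A knowledge graph is commonly represented as (subject, relation, object) triplets. The sketch below assumes that representation; `ToyGraphStore` and its method names are hypothetical and the triplets are made up for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class ToyGraphStore:
    """Hypothetical graph store: subject -> list of (relation, object) pairs."""

    def __init__(self) -> None:
        self._rel_map: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:
        pair = (rel, obj)
        # Skip duplicates so repeated extraction of the same fact is harmless.
        if pair not in self._rel_map[subj]:
            self._rel_map[subj].append(pair)

    def get(self, subj: str) -> List[Tuple[str, str]]:
        # .get() avoids creating an empty entry for unknown subjects.
        return list(self._rel_map.get(subj, []))


graph = ToyGraphStore()
graph.upsert_triplet("conda", "creates", "environment")
graph.upsert_triplet("conda", "installs", "packages")
graph.upsert_triplet("conda", "creates", "environment")  # duplicate, ignored

print(graph.get("conda"))
```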

### Indexes
Built on top of all these stores are different indexes. An index is just a way of arranging the nodes in the underlying doc store so that querying them for a specific purpose is easy. I can then use any of the indexes as a query engine to query the underlying doc store. Some indexes, like VectorStoreIndex and SummaryIndex, use embeddings (optional in the case of SummaryIndex) to aid querying, and some, like SummaryIndex and KeywordTableIndex, use keywords. Indexes that use embeddings use the vector store in addition to the doc store and index store.
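
To illustrate "different arrangements of the same nodes", here is a minimal sketch with two toy indexes over one toy doc store: a list of node IDs in order (roughly the summary index idea) and a keyword-to-node-ID table (roughly the keyword table idea). All names and the regex tokenizer are hypothetical. The real indexes extract keywords differently and synthesize an answer with an LLM after retrieval.

```python
import re
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set

# Toy doc store: node ID -> node text.
docstore: Dict[str, str] = {
    "n1": "Create a conda environment with conda create.",
    "n2": "Activate the environment before installing packages.",
    "n3": "Pip installs packages from PyPI.",
}


def tokenize(text: str) -> Set[str]:
    """Lowercase alphabetic tokens; a stand-in for real keyword extraction."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Arrangement 1: a flat, ordered list of node IDs.
summary_index: List[str] = list(docstore)

# Arrangement 2: keyword -> set of node IDs containing that keyword.
keyword_table: DefaultDict[str, Set[str]] = defaultdict(set)
for node_id, text in docstore.items():
    for word in tokenize(text):
        keyword_table[word].add(node_id)


def retrieve_by_keywords(query: str) -> List[str]:
    """Return IDs of nodes that share at least one keyword with the query."""
    hits: Set[str] = set()
    for word in tokenize(query):
        hits |= keyword_table.get(word, set())
    return sorted(hits)


print(summary_index)
print(retrieve_by_keywords("conda environment"))
```

Both structures hold only node IDs. The node text stays in the doc store, which fits the description of the index store holding lightweight metadata rather than content.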

A very good visual explanation is given [here](http://127.0.0.1:8000/module_guides/indexing/index_guide.html).

On the side there is also a Chat Store where we can store chat histories of different users. This does not seem to be related to any specific index.

Just to make life interesting, there are a bunch of objects that show up as "vector stores" but are actually document stores, index stores, and vector store indexes all rolled into one.

This is not a very good layering of abstractions because details leak between layers. It seems that the vector store was built expressly for use by the vector store index, and so on. Ideally, any component in the top layer should be able to use any component in the layer immediately below it. But here it's not like the summary index can use the graph store or anything like that.

From the documentation -
> Many vector stores (except FAISS) will store both the data as well as the index (embeddings). This means that you will not need to use a separate document store or index store. This also means that you will not need to explicitly persist this data - this happens automatically.

I tried this with Pinecone (see `persist_pinecone.py`) but it did not work: the documents do not get uploaded to Pinecone.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from llama_index.storage import StorageContext
from llama_index.vector_stores import PineconeVectorStore
from llama_index import load_index_from_storage
import os
from pathlib import Path

See `persist_simple.py` for an example of how to persist all the data (the various datastores, indexes, etc.) to the local filesystem. The following code loads this data.

In [3]:
WORKINGDIR = Path.home() / "Desktop" / "temp" / "llama-index"

In [4]:
storage_context = StorageContext.from_defaults(persist_dir=str(WORKINGDIR/"storage"))

In [5]:
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
resp = query_engine.query("How to create a new conda environment?")
print(resp)

To create a new conda environment, you can use the command `conda create --name ENVIRONMENT_NAME`. For example, to create a new environment named 'py35' with Python 3.5, you would use the command `conda create --name py35 python=3.5`.


In [6]:
storage_context == index.storage_context

True

In [7]:
print(storage_context.vector_store)
print(storage_context.docstore)
print(storage_context.index_store)

<llama_index.vector_stores.simple.SimpleVectorStore object at 0x176eaa8c0>
<llama_index.storage.docstore.simple_docstore.SimpleDocumentStore object at 0x176eaaa10>
<llama_index.storage.index_store.simple_index_store.SimpleIndexStore object at 0x176eaa980>


In [8]:
docs = list(storage_context.docstore.docs.values())
len(docs)

40

In [9]:
len(storage_context.index_store.index_structs())

1

In [11]:
len(storage_context.index_store.index_structs()[0].nodes_dict)

40

In [13]:
storage_context.vector_store.to_dict().keys()

dict_keys(['embedding_dict', 'text_id_to_ref_doc_id', 'metadata_dict'])

In [14]:
storage_context.vector_store.to_dict()["embedding_dict"].keys()

dict_keys(['c20c64bd-2bd0-4113-8089-75c5f0ddf85e', '404380a9-84f3-4467-ac4a-f35406a73cee', '409dfddb-5dfd-4036-b660-7a468c59f16c', 'b7310392-fe92-465e-a507-a8a9895bf805', '60785f8e-c15f-4603-9d21-c62935b99d50', '2a88d256-873a-4f9b-8c50-5314aae60902', '389d6a7a-2c00-4737-af9e-f1ca9eb334b7', 'cecc4bb3-b402-44a2-9d1a-02a3ec77d5b3', '633aa7bd-e5c4-4ed5-a074-e5d7beea28ce', '1664aab1-e1e5-428b-b32f-9d5f9551277c', '8055a17e-8b3a-4379-bdd6-a6edb3090a6c', 'ce362ae2-d77c-4173-b5c2-d06e480d8b54', 'd79a0228-9be5-42dd-88ef-35f8a81b5f10', '643d1499-3781-4cc1-85e3-c0cc2fa1bee9', '137fa1e4-a239-4cdf-b824-14309e4c7ad8', 'e0c942f5-d4bf-44bf-9c75-0aa381c3b7e5', '32c8e556-e59a-4a21-aa1a-9a89a3c079ec', 'ea82c976-85f6-4b34-b584-a0b5cbe250ca', 'd9a0842c-1b28-4206-ad54-249e9d32a39a', '1c0e2754-1755-4e87-80c7-e60d05eaf463', '2cc505ba-96ca-437d-a671-4836e6615206', 'e4c2dbb0-0cfd-49d8-b4b1-a77e0a1ff3f2', 'fa14a282-f6f4-4bba-b03d-f36d74a20c50', '6ec692d3-4981-435a-987b-a30b4d0d62bf', 'fe3ae110-4f78-415c-bbd6-5624

In [16]:
emb_dim = len(storage_context.vector_store.to_dict()["embedding_dict"]["c20c64bd-2bd0-4113-8089-75c5f0ddf85e"])
emb_dim

1536

By default llama-index uses the local filesystem for persistence, but I can use any filesystem that implements the `fsspec.AbstractFileSystem` protocol. See this [example](http://127.0.0.1:8000/module_guides/storing/save_load.html#using-a-remote-backend) for how to use S3 as the storage layer.