# Indexing
Indexing is the process of keeping your vector store in sync with the underlying data source.

# How to use the LangChain indexing API

# Why Use the Indexing API?
The indexing API saves time and resources by:

* Avoiding Duplication: It prevents the same document from being re-added.
* Tracking Changes: It only updates changed documents, reducing unnecessary re-computation of embeddings.
* Keeping Everything in Sync: When documents change or get deleted, the indexing API can ensure your vector store matches the latest data.

# How It Works
The indexing API uses a RecordManager to track which documents have been written to the vector store. For each document, it records:

* Document Hash: A hash of the page content and metadata, used to detect duplicate or changed content.
* Write Timestamp: When the document was last written or refreshed.
* Source ID: Metadata that identifies the document's original source, such as a file name.
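
To make this bookkeeping concrete, here is a minimal sketch in plain Python. It is not LangChain's RecordManager: `ToyRecordManager`, `Record`, and `content_hash` are hypothetical names, and hashing the source together with the text is an assumption made for illustration. The sketch only shows how a hash, a write timestamp, and a source ID can be tracked so that duplicate content is skipped.

```python
import hashlib
import time
from dataclasses import dataclass


def content_hash(text: str, source_id: str) -> str:
    """Derive a stable key from the document's text and its source."""
    return hashlib.sha256(f"{source_id}\n{text}".encode("utf-8")).hexdigest()


@dataclass
class Record:
    source_id: str      # where the document came from, e.g. a file name
    updated_at: float   # write timestamp (seconds since the epoch)


class ToyRecordManager:
    """Hypothetical in-memory stand-in for a record manager."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def exists(self, key: str) -> bool:
        return key in self._records

    def upsert(self, key: str, source_id: str) -> None:
        # Refresh the timestamp whether the key is new or already known.
        self._records[key] = Record(source_id, time.time())


manager = ToyRecordManager()
docs = [
    ("kitten.txt", "Kittens are small cats."),
    ("kitten.txt", "Kittens are small cats."),   # exact duplicate
    ("doggy.txt", "Dogs are loyal companions."),
]

added = []
for source_id, text in docs:
    key = content_hash(text, source_id)
    if not manager.exists(key):
        added.append(source_id)   # only new content would be embedded
    manager.upsert(key, source_id)

print(added)  # ['kitten.txt', 'doggy.txt']
```

The duplicate entry is recognized by its hash and never reaches the (imaginary) embedding step, which is the saving the indexing API aims for.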

# Deletion Modes
When documents are added or updated, you can choose how aggressively the API cleans up old versions or deleted content:

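* None: No automatic cleanup. The API only de-duplicates, so old versions stay in the vector store until you remove them yourself.
* Incremental: Continuously deletes old versions of documents whose source content has changed. Documents whose source no longer appears in the input are left in place.
* Full: At the end of each indexing session, deletes every document that was not part of that session, including documents whose source was removed.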

# Use Case Examples

* None: Good for first-time indexing or when you handle cleanup manually.
* Incremental: Updates continuously and minimizes the time during which old and new versions of a document coexist.
* Full: Best for bulk updates when you need a clean reset for removed documents.
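
The difference between the three modes can be sketched with a toy in-memory store. This is a conceptual illustration, not LangChain's implementation: `toy_index` and `fresh_store` are hypothetical helpers, and for brevity the incremental branch deletes once at the end of the run rather than continuously.

```python
import hashlib


def toy_index(docs, store, cleanup=None):
    """Sync `store` (key -> (source, text)) with `docs`, a list of (source, text) pairs."""
    seen = set()
    for source, text in docs:
        key = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
        seen.add(key)
        store.setdefault(key, (source, text))  # de-duplicate: keep known content as-is

    if cleanup == "incremental":
        # Drop stale entries only for sources that appear in this run.
        touched = {source for source, _ in docs}
        stale = [k for k, (src, _) in store.items() if src in touched and k not in seen]
    elif cleanup == "full":
        # Drop everything that was not part of this run.
        stale = [k for k in store if k not in seen]
    else:
        stale = []  # cleanup=None: never delete

    for key in stale:
        del store[key]
    return len(stale)


def fresh_store():
    """Start every experiment from the same two indexed documents."""
    store = {}
    toy_index([("a.txt", "v1"), ("b.txt", "v1")], store)
    return store


update = [("a.txt", "v2")]  # a.txt changed; b.txt is absent from this run

results = {}
for mode in (None, "incremental", "full"):
    store = fresh_store()
    toy_index(update, store, cleanup=mode)
    results[mode] = sorted(store.values())
    print(mode, results[mode])
```

With no cleanup, both versions of `a.txt` remain. Incremental cleanup removes the old `a.txt` but keeps `b.txt`. Full cleanup keeps only what was seen in the latest run.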

# Requirements
* The API works only with vector stores that support adding documents with IDs and deleting them by ID, such as Chroma, Pinecone, and Elasticsearch. Avoid using the API with a store that was populated independently of the indexing API, because the record manager has no record of those earlier entries.

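The following example builds a FAISS vector store from three documents and runs a similarity search against it: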
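# Recent LangChain releases ship these classes in split packages; older releases
# exposed them under langchain.vectorstores, langchain.embeddings, and langchain.schema.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document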

# Step 1: Initialize the embedding model (OpenAI in this example)
embedding_model = OpenAIEmbeddings()

# Sample documents to index
documents = [
    Document(page_content="LangChain makes it easy to work with LLMs and chains of calls."),
    Document(page_content="OpenAI provides robust APIs for embedding and language generation."),
    Document(page_content="FAISS is an efficient library for similarity search and clustering."),
]

# Step 2: Create FAISS index from documents
# We’ll embed and index the documents using the embedding model.
vectorstore = FAISS.from_documents(documents, embedding_model)

# Step 3: Perform a similarity search
query = "How to use OpenAI embeddings?"
query_embedding = embedding_model.embed_query(query)

# Search for top 2 most similar documents
results = vectorstore.similarity_search_by_vector(query_embedding, k=2)

# Step 4: Display results
for i, result in enumerate(results, 1):
    print(f"Result {i}: {result.page_content}")


Result 1: OpenAI provides robust APIs for embedding and language generation.
Result 2: FAISS is an efficient library for similarity search and clustering.


# Using a Loader
You can also write a custom loader that loads and splits documents. Make sure it sets the source metadata on every document, so that all chunks from the same source can be traced back to it.
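
A loader along these lines might look like the sketch below. It uses only the standard library: `Doc` is a hypothetical stand-in for LangChain's `Document`, and `load_and_split` with its fixed-size chunking is an assumption chosen for brevity, not a LangChain API.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Hypothetical stand-in for a document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_and_split(directory: str, chunk_size: int = 40) -> list[Doc]:
    """Read every .txt file in `directory` and split it into fixed-size chunks."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for start in range(0, len(text), chunk_size):
            docs.append(
                Doc(
                    page_content=text[start:start + chunk_size],
                    # Every chunk keeps the name of the file it came from.
                    metadata={"source": path.name},
                )
            )
    return docs


# Demo on two throwaway files: 50 characters and 30 characters long.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "kitten.txt").write_text("k" * 50, encoding="utf-8")
    Path(tmp, "doggy.txt").write_text("d" * 30, encoding="utf-8")
    chunks = load_and_split(tmp)

sources = [doc.metadata["source"] for doc in chunks]
print(sources)  # ['doggy.txt', 'kitten.txt', 'kitten.txt']
```

Because both chunks of `kitten.txt` carry the same source value, a later change to that file can be matched to every chunk that came from it.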

# In Summary
LangChain’s Indexing API is useful for:

* Managing document updates and deletions
* Avoiding redundant computations
* Keeping vector store data synchronized with document sources