- [Vector Store](#vector-store)
  - [Basic Implementation via `InMemoryVectorStore`](#basic-implementation-via-inmemoryvectorstore)
    - [Add Documents](#add-documents)
    - [Delete Documents](#delete)
    - [Search Over Documents](#search)
      - [Similarity Metrics](#similarity-metrics)
      - [Similarity Search](#similarity-search)
    - [Metadata filtering](#metadata-filtering)
  - [Advanced Search and Retrieval Techniques](#advanced-search-and-retrieval-techniques)
  - [Pinecone](#pinecone)

# Vector Store

In [2]:
from os import path
from langchain_openai import OpenAIEmbeddings

pdf_path = "../sample_files/progit.pdf"

if not path.exists(pdf_path):
    raise FileNotFoundError(f"Invalid path, file does not exist: {pdf_path}")

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

## Basic Implementation via `InMemoryVectorStore`

In [3]:
from langchain_core.vectorstores import InMemoryVectorStore

# Initialize the store with the embedding model defined in the previous cell
vector_store = InMemoryVectorStore(embedding=embedding_model)

- Extracting documents from the PDF

In [4]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter

raw_docs = []

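# Load pages lazily and keep only the numbered pages 9 through 491.
# Top-level `async for` works in a notebook cell; in a script, wrap it in an async function.
pdf_loader = PyPDFLoader(pdf_path)
async for doc in pdf_loader.alazy_load():
    current_page = doc.metadata["page_label"]

    # Skip pages whose label is not a plain number
    if not current_page.isdigit():
        continue

    current_page_num = int(current_page)
    if current_page_num >= 492:
        break  # stop once page 492 is reached
    if current_page_num > 8:
        raw_docs.append(doc)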

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = text_splitter.split_documents(raw_docs)

### Add Documents

- The **`add_documents`** method takes a list of **`Document`** objects. Each one has a `page_content` and a `metadata` attribute, which makes `Document` a universal way to store unstructured text together with its associated metadata.

In [None]:
vector_store.add_documents(documents=documents)

- We can provide **IDs** for the documents, one per document, so that adding a document again under the same ID updates the existing entry instead of storing a duplicate.

In [None]:
# The `ids` list must have the same length as the `documents` list
vector_store.add_documents(documents=documents[:2], ids=["doc1", "doc2"])
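
The effect of reusing an ID can be pictured with a toy model. This is only a sketch, assuming the store keeps its entries in a mapping keyed by ID; `ToyStore` is a hypothetical name for illustration, not a LangChain class, and plain strings stand in for `Document` objects.

```python
# Toy model of ID-based upserts; a plain dict stands in for the vector store.
# `ToyStore` is hypothetical and only mirrors the idea, not LangChain's code.
class ToyStore:
    def __init__(self):
        self.entries = {}  # maps document ID -> document

    def add_documents(self, documents, ids):
        # One ID is expected per document
        if len(documents) != len(ids):
            raise ValueError("ids must contain one entry per document")
        for doc_id, document in zip(ids, documents):
            # Writing to an existing key overwrites it, so no duplicate appears
            self.entries[doc_id] = document
        return list(ids)


toy_store = ToyStore()
toy_store.add_documents(["first draft", "another page"], ids=["doc1", "doc2"])
toy_store.add_documents(["revised draft"], ids=["doc1"])  # same ID: update

print(len(toy_store.entries))     # 2
print(toy_store.entries["doc1"])  # revised draft
```

Without explicit IDs, a second call would have no stable key to overwrite, so the same content would be stored again as a new entry.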

### Delete 

- To delete documents, use the **`delete`** method, which takes a list of document IDs.

In [15]:
vector_store.delete(ids=["doc1"])

### Search

- A vector store embeds and stores the documents that are added to it. When we pass a query, the query is converted into a vector and a similarity search is run over the embedded documents.
- Searching rests on two important concepts:
  1. A way to measure the similarity between the query and any embedded document.
  2. An algorithm to perform the similarity search efficiently across all embedded documents.

#### Similarity metrics

- An important property of embeddings is that they can be compared using simple **mathematical operations:**
  1. **Cosine Similarity:** Measures how similar two vectors are based on the angle between them, regardless of their magnitude.
  2. **Euclidean Distance:** Measures the **straight-line** distance between two points in a **multi-dimensional space**.
  3. **Dot Product:** A linear algebra operation that takes two vectors and returns a single scalar value. It reflects how much two vectors point in the same direction, scaled by their magnitudes.

- The similarity metric can sometimes be selected when the vector store is initialized.
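
To make the three metrics concrete, here is a minimal sketch in plain Python. The two-dimensional vectors are made-up values for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def dot_product(a, b):
    # Sum of element-wise products
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same angle as the query, larger magnitude
orthogonal = [0.0, 2.0]      # at a right angle to the query

print(cosine_similarity(query, same_direction))   # 1.0
print(dot_product(query, same_direction))         # 3.0
print(euclidean_distance(query, same_direction))  # 2.0
print(cosine_similarity(query, orthogonal))       # 0.0
```

Cosine similarity treats `same_direction` as a perfect match because it ignores magnitude, while the dot product and Euclidean distance are both affected by the longer vector.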

#### Similarity Search

- Given a similarity metric that measures the distance between the embedded query and any embedded document, we still need an algorithm to search efficiently over all embedded documents.
- There are various such algorithms. Many vector stores implement **`HNSW (Hierarchical Navigable Small World)`**, a graph-based index structure that allows for efficient similarity search.
- Regardless of the algorithm used under the hood, the LangChain vector store interface exposes the same **`similarity_search`** method for every integration.
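
As a baseline for comparison, the simplest search algorithm scores every stored vector against the query and keeps the best `k`. The sketch below is that brute-force approach, not HNSW; the function name `top_k` and the sample vectors are assumptions made for illustration. An index such as HNSW exists to avoid this full scan on large collections, typically returning approximate nearest neighbours.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    # Score every document (a linear scan), then keep the k highest scores
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up 2-D "embeddings" keyed by document ID
doc_vecs = {
    "git-branch": [0.9, 0.1],
    "git-merge": [0.8, 0.3],
    "cooking": [0.0, 1.0],
}
query_vec = [1.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # ['git-branch', 'git-merge']
```

The cost of this scan grows linearly with the number of documents, which is why dedicated vector stores build an index instead.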

In [None]:
query = "my query"
docs = vector_store.similarity_search(query)

- Many vector stores accept search parameters in the **`similarity_search`** method.
- [**Pinecone**](#pinecone) supports several parameters. Many vector stores support `k`, which controls the number of documents to return, and `filter`, which filters documents by metadata.

- `query` (str) – Text to look up documents similar to.
- `k` (int) – Number of documents to return. Defaults to 4.
- `filter` (dict | None) – Dictionary of argument(s) to filter on metadata.

### Metadata filtering

- **Vector stores** implement a search algorithm that efficiently finds the most relevant embedded documents. Many vector stores also support **metadata filtering**.
- These two concepts work well together:
    1. **Semantic Search:** Query the unstructured data directly, often via embedding or keyword similarity.
    2. **Metadata Search:** Apply a structured query to the metadata to filter for specific documents.

    ```py
    # Example using Pinecone

    vector_store.similarity_search(
        "LangChain provides abstractions to make working with LLMs easy",
        k=2,  # Number of documents to return
        filter={"source": "tweet"},
    )
    ```
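
How the two kinds of search combine can be sketched without any vector store. The code below is an illustration under stated assumptions: `filtered_search` is a hypothetical helper, the filter is an exact-match dictionary, and filtering happens before ranking. Real integrations differ in filter syntax and in whether they filter before or after the similarity search.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(query_vec, records, k, metadata_filter):
    # Metadata search: keep records whose metadata matches every filter entry
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    # Semantic search: rank the remaining records by similarity to the query
    ranked = sorted(
        candidates,
        key=lambda record: cosine_similarity(query_vec, record["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Made-up records with 2-D vectors standing in for real embeddings
records = [
    {"text": "LangChain makes LLM apps easier", "metadata": {"source": "tweet"}, "vector": [0.9, 0.1]},
    {"text": "LangChain release notes", "metadata": {"source": "news"}, "vector": [1.0, 0.0]},
    {"text": "Weekend hiking photos", "metadata": {"source": "tweet"}, "vector": [0.1, 0.9]},
]

hits = filtered_search([1.0, 0.0], records, k=2, metadata_filter={"source": "tweet"})
print([hit["text"] for hit in hits])
# ['LangChain makes LLM apps easier', 'Weekend hiking photos']
```

The `news` record is the closest vector to the query, yet it never appears in the results because the metadata filter removes it before ranking.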

### [Advanced search and retrieval techniques](https://python.langchain.com/docs/concepts/vectorstores/#advanced-search-and-retrieval-techniques)

## Pinecone