- [Vector Store](#vector-store)
  - [Basic Implementation via `InMemoryVectorStore`](#basic-implementation-via-inmemoryvectorstore)
    - [Add Documents](#add-documents)
    - [Delete Documents](#delete)
    - [Search Over Documents](#search)
      - [Similarity Metrics](#similarity-metrics)
      - [Similarity Search](#similarity-search)
    - [Metadata filtering](#metadata-filtering)
  - [Advanced Search and Retrieval Techniques](#advanced-search-and-retrieval-techniques)
  - [Pinecone](#pinecone)

# Vector Store

In [8]:
from os import path
from langchain_openai import OpenAIEmbeddings

pdf_path = "../sample_files/progit.pdf"

if not path.exists(pdf_path):
    raise FileNotFoundError(f"Invalid path, file does not exist: {pdf_path}")

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

## Basic Implementation via `InMemoryVectorStore`

In [3]:
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

# Initialize the vector store with an embedding model
vector_store = InMemoryVectorStore(embedding=embedding_model)

- Extracting documents from the PDF

In [4]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter

raw_docs = []

# Initializing PDF loader
pdf_loader = PyPDFLoader(pdf_path)
async for doc in pdf_loader.alazy_load():
    current_page = doc.metadata["page_label"]

    # Skip pages whose label is not a plain number
    if not current_page.isdigit():
        continue

    current_page_num = int(current_page)

    # Stop once page 492 is reached
    if current_page_num >= 492:
        break

    # Keep only pages after page 8
    if current_page_num > 8:
        raw_docs.append(doc)

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = text_splitter.split_documents(raw_docs)

### Add Documents

- The **`add_documents`** method takes a list of **`Document`** objects. Each one has a `page_content` and a `metadata` attribute, which makes `Document` a universal way to store unstructured text together with its associated metadata.

In [None]:
vector_store.add_documents(documents=documents)

- We can also provide **IDs** for the documents. Adding a document under an ID that already exists updates the stored document instead of adding a duplicate. The `ids` list must have the same length as the `documents` list.

In [None]:
# One ID per document: two IDs, so pass exactly two documents
vector_store.add_documents(documents=documents[:2], ids=["doc1", "doc2"])
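
The update-instead-of-duplicate behavior can be pictured with a plain dictionary keyed by ID. This is only a toy sketch: `store` and `add_texts` are hypothetical names for illustration, not part of LangChain.

```python
# Toy stand-in for a vector store, keyed by document ID (hypothetical, not LangChain code).
store: dict[str, str] = {}


def add_texts(texts: list[str], ids: list[str]) -> list[str]:
    """Insert each text under its ID; an existing ID is overwritten."""
    if len(texts) != len(ids):
        raise ValueError("texts and ids must have the same length")
    for doc_id, text in zip(ids, texts):
        store[doc_id] = text
    return ids


add_texts(["first draft", "another note"], ids=["doc1", "doc2"])
add_texts(["revised draft"], ids=["doc1"])  # same ID: updates, does not duplicate

print(len(store))     # 2
print(store["doc1"])  # revised draft
```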

### Delete 

- To delete documents, use the `delete` method, which takes a list of the document IDs to remove.

In [15]:
vector_store.delete(ids=["doc1"])

### Search

- A vector store embeds and stores the documents that are added to it. When we pass a query, the store converts it into a vector and performs a similarity search over the embedded documents.
- Searching relies on two important concepts:
  1. A way to measure the similarity between the query and any embedded document.
  2. An algorithm to perform the similarity search efficiently across all embedded documents.

#### Similarity metrics

- An important property of embeddings is that they can be compared using simple **mathematical operations:**
  1. **Cosine Similarity:** Measures how similar two vectors are based on the angle between them, regardless of their magnitude.
  2. **Euclidean Distance:** Measures the **straight-line** distance between two points in a **multi-dimensional space**.
  3. **Dot Product:** An operation in linear algebra that takes two vectors and returns a single scalar value. It reflects how much two vectors point in the same direction, scaled by their magnitudes.

- The similarity metric can sometimes be selected when the vector store is initialized.
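
The three metrics can be written out in a few lines of plain Python. This is a minimal sketch for intuition, using short hand-made vectors rather than real embeddings; the function names are chosen here for illustration.

```python
import math


def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(dot(a, b))                 # 28.0
print(cosine_similarity(a, b))   # approximately 1.0 (same direction)
print(euclidean_distance(a, b))  # sqrt(14), approximately 3.742
```

Because `b` points in the same direction as `a`, the cosine similarity is at its maximum even though the Euclidean distance between them is not zero.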

#### Similarity Search

- A similarity metric measures the distance between the embedded query and each embedded document.
- There are various algorithms for searching efficiently over all embedded documents. Many vector stores implement **`HNSW (Hierarchical Navigable Small World)`**, a graph-based index structure that allows approximate nearest-neighbor search without comparing the query against every document.
- LangChain exposes the search through a common interface; the algorithm actually used depends on the underlying vector store.

In [None]:
query = "my query"
docs = vector_store.similarity_search(query)
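
For a small collection, the search can simply compare the query against every stored vector. The sketch below shows that exhaustive baseline, which index structures such as HNSW aim to approximate at lower cost. The vectors, IDs, and the `brute_force_search` helper are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, doc_vecs, k=4):
    """Score every document against the query and return the k best (id, score) pairs."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical 3-dimensional "embeddings"
doc_vecs = {
    "git-branching": [0.9, 0.1, 0.0],
    "git-merging": [0.8, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}

top = brute_force_search([1.0, 0.0, 0.0], doc_vecs, k=2)
print([doc_id for doc_id, _ in top])  # ['git-branching', 'git-merging']
```

The cost of this scan grows linearly with the number of documents, which is why larger stores rely on an index.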

- Many vector stores accept search parameters in the **`similarity_search`** method.
- [**Pinecone**](#pinecone), for example, supports several parameters. Most vector stores support `k`, which controls the number of documents to return, and `filter`, which restricts results by metadata.

- `query` (str) – Text to look up documents similar to.
- `k` (int) – Number of documents to return. Defaults to 4.
- `filter` (dict | None) – Dictionary of argument(s) to filter on metadata.

### Metadata filtering

- **Vector stores** implement a search algorithm to efficiently search over all embedded documents and find the most relevant ones. Many vector stores also support **metadata filtering**.
- These two concepts work well together:
    1. **Semantic Search:** Query the unstructured data directly, often via embedding or keyword similarity.
    2. **Metadata Search:** Apply a structured query to the metadata to filter out specific documents.

    ```py
    # Example using Pinecone
    
    vector_store.similarity_search(
        "LangChain provides abstractions to make working with LLMs easy",
        k=2,  # Number of documents to return
        filter={"source": "tweet"},
    )
    ```
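
How the two concepts combine can be sketched without any vector store: first keep only the records whose metadata matches the filter, then rank the survivors by similarity. The records, vectors, and the `filtered_search` helper below are hypothetical, and a real store may apply the filter at a different stage of its search.

```python
def filtered_search(query_vec, records, k=4, metadata_filter=None):
    """Keep records whose metadata matches every filter key, then rank by dot product."""
    metadata_filter = metadata_filter or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(
        key=lambda r: sum(q * v for q, v in zip(query_vec, r["vector"])),
        reverse=True,
    )
    return candidates[:k]


# Hypothetical records with 2-dimensional "embeddings"
records = [
    {"text": "LangChain project", "vector": [0.9, 0.1], "metadata": {"source": "tweet"}},
    {"text": "Weather forecast", "vector": [0.1, 0.9], "metadata": {"source": "news"}},
    {"text": "LangChain release notes", "vector": [1.0, 0.0], "metadata": {"source": "news"}},
    {"text": "Movie review", "vector": [0.2, 0.8], "metadata": {"source": "tweet"}},
]

hits = filtered_search([1.0, 0.0], records, k=2, metadata_filter={"source": "tweet"})
print([h["text"] for h in hits])  # ['LangChain project', 'Movie review']
```

Note that the closest record overall ("LangChain release notes") is excluded because its `source` is not `tweet`.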

### [Advanced search and retrieval techniques](https://python.langchain.com/docs/concepts/vectorstores/#advanced-search-and-retrieval-techniques)

## Pinecone

### Setup

In [None]:
from os import environ, getenv
from getpass import getpass

from pinecone import Pinecone

if not getenv("PINECONE_API_KEY"):
    environ["PINECONE_API_KEY"] = getpass("Enter your Pinecone API key: ")


pine_cone_api_key = environ["PINECONE_API_KEY"]

pc = Pinecone(api_key=pine_cone_api_key)

### Initialization

- Before initializing the vector store, let's connect to a Pinecone **index**. If an **index** with that name does not exist, we create it first.

In [33]:
from pinecone import ServerlessSpec

index_name = "embeddings-test-index"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

### Embeddings

In [34]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

In [35]:
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(index=index, embedding=embedding_model)


### Manage vector store

In [36]:
# Let's add some documents

from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

### Add items to vector store

In [37]:
vector_store.add_documents(documents=documents, ids=uuids)

['f1488b54-6e54-420c-9290-ad4221dcce28',
 '7c402871-55e1-4747-9001-ee0a48093439',
 'bb963abb-4564-479c-a2eb-7ceaeb7a1dcc',
 '41c0b0ac-339f-425f-acaf-6acbae15efd9',
 'f6f60f8e-063e-4713-9676-e915c6a28f13',
 'a8484eac-9709-41c3-8976-a81641c9e7c7',
 '75321538-01b9-4389-9c77-ea102198817a',
 '0260f310-4950-4ec6-a23d-0e103d981799',
 '399c3cf1-8529-484f-b2ca-b67f7a2dadbd',
 'e51887e3-abe8-43fc-863e-d898dfaeff02']

### Delete items from vector store

In [38]:
# Delete the last document added, using the IDs generated above
vector_store.delete(ids=[uuids[-1]])

### Query Directly

- Let's run a simple similarity search over the vector store.

In [41]:
result = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=3,  # Number of documents to return (default: 4)
    filter={"source": "tweet"},  # Restrict results by metadata
)

for res in result:
    print(f"--> {res.page_content} [{res.metadata}]")

--> Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
--> LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
--> Wow! That was an amazing movie. I can't wait to see it again. [{'source': 'tweet'}]


- We can also return the similarity score along with each result:

In [44]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": "news"}
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

* [SIM=0.569310] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]


### Query by turning into retriever

- You can also transform the vector store into a retriever for easier usage in your chains.



In [45]:
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.4},
)
retriever.invoke("Stealing from the bank is a crime",
                 filter={"source": "news"})

[Document(id='41c0b0ac-339f-425f-acaf-6acbae15efd9', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]
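
Conceptually, the `similarity_score_threshold` search type takes the top `k` scored results and drops any whose score falls below `score_threshold`. The sketch below illustrates that idea in plain Python; the helper name and the scores are made up, and it is not the retriever's actual implementation.

```python
def retrieve_with_threshold(scored_docs, k, score_threshold):
    """Return up to k documents whose score is at least score_threshold, best first."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked[:k] if score >= score_threshold]


# Hypothetical (document, relevance score) pairs
scored_docs = [
    ("bank robbery story", 0.62),
    ("stock market story", 0.41),
    ("weather story", 0.18),
]

print(retrieve_with_threshold(scored_docs, k=2, score_threshold=0.4))
# ['bank robbery story', 'stock market story']
print(retrieve_with_threshold(scored_docs, k=2, score_threshold=0.9))
# []
```

A threshold set too high can therefore return an empty list even when `k` is greater than zero.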